Corey Ashford | 1 Nov 2009 02:15

[PATCH 1/1] fix Power6 event codes (affects PAPI/perf_events mainly)

Hello,

Apparently when I created the power6_events.h file for libpfm, I had a
symbolic-event-to-event-code translation table that was "polluted" with
a large number of ppc970 event codes.  This was no doubt due to some
sort of slip-up on my end while manipulating the event code files.

This caused about half of the event codes to come out wrong for Power6.
Consequently, many of the event counts will be wrong on Power6 when
using PAPI/perf_events.

The attached patch addresses this issue.  Sorry for the error!

Thanks for your consideration,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
cjashfor <at> us.ibm.com
Index: power6_events.h
===================================================================
RCS file: /cvsroot/perfmon2/libpfm/lib/power6_events.h,v
retrieving revision 1.3
diff -u -p -r1.3 power6_events.h
--- power6_events.h	7 Aug 2009 13:01:18 -0000	1.3

Carole Wu | 1 Nov 2009 04:37

Re: PEBS support update

Hello,



pfmon --smpl-module=pebs-ll -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD \
         --ld-lat-threshold=4 --long-smpl-periods=2000 --smpl-compact \
         --with-header ...

This should generate a trace of samples, where each sample represents the 2000th long-latency (>4 cycles) memory operation. So 2000 x (the number of samples in the trace) should equal the number of LAST_LEVEL_CACHE_MISSES collected with perfmon2's "counting" capability, e.g. pfmon -e LAST_LEVEL_CACHE_MISSES. Is this right?

The counts I collect with PEBS do not match those from the counting feature in perfmon2 on my Nehalem machine.

Your help is appreciated. Thanks,
Carole

On Wed, Oct 28, 2009 at 4:36 AM, stephane eranian <eranian <at> googlemail.com> wrote:
Hi,

I am happy to report that I have now uploaded all the code necessary to use
PEBS on Intel Core, Atom, and Nehalem. That includes PEBS-LL on Nehalem
which is used to sample where cache misses occur.

What you need:
   - latest libpfm sources from CVS

   - latest pfmon sources from CVS

   - perfmon2 2.6.30 from GIT

     git clone git://git.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git
     Make sure you enable 'Unified PEBS'


This kernel includes a unified PEBS sampling format which supports Netburst,
Core, Atom, and Nehalem. You must insert the module perfmon_pebs_smpl
(or compile in the code).

Next, to use PEBS, you can simply do:

  pfmon --smpl-module=pebs --smpl-compact --with-header -einst_retired:any_p \
            --long-smpl-period=2400000 ...

  Not all events support PEBS. In --smpl-compact mode, each line contains
  a PEBS sample.

To collect cache misses on Nehalem, you can do:

pfmon --smpl-module=pebs-ll -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD \
         --ld-lat-threshold=4 --long-smpl-periods=2000 --smpl-compact \
         --with-header ...

You must use the MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD event to activate
this HW feature.

Each line contains a PEBS record, including the cache miss information. The
ld-lat parameter is the minimum threshold for the miss latency: only misses
with latency >= threshold are captured. It must be at least 4, which is the
L1D hit latency in cycles. For each captured miss, you get an instruction
addr, a data addr, the miss latency, and the source of the data (where it
came from; refer to the Intel documentation).

It is important to understand that the instruction addr does NOT point to the
load instruction but ALWAYS to the next dynamic instruction, i.e., the whole
state is recorded at retirement of the load.
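
As a minimal sketch of the rules above (this is not pfmon code), the following Python models one PEBS-LL record and the threshold check. The `LoadSample` type, its field names, and both helper functions are hypothetical; only the minimum threshold of 4 cycles and the `>=` comparison come from the description above.

```python
from dataclasses import dataclass

# From the text: the threshold must be at least 4 cycles (the L1D hit latency).
MIN_LD_LAT_THRESHOLD = 4


@dataclass(frozen=True)
class LoadSample:
    """Hypothetical in-memory form of one PEBS-LL record."""
    next_ip: int      # address of the instruction AFTER the load, not the load itself
    data_addr: int    # address the load accessed
    latency: int      # measured latency in cycles
    data_source: int  # raw data-source encoding (see the Intel documentation)


def validate_threshold(threshold: int) -> int:
    """Reject thresholds below the documented minimum."""
    if threshold < MIN_LD_LAT_THRESHOLD:
        raise ValueError(
            f"ld-lat threshold must be >= {MIN_LD_LAT_THRESHOLD}, got {threshold}"
        )
    return threshold


def is_captured(latency: int, threshold: int) -> bool:
    """Only loads whose latency is >= the threshold can produce a record."""
    return latency >= validate_threshold(threshold)
```

Naming the address field `next_ip` rather than `ip` is a deliberate choice here, so that later analysis code does not mistake it for the address of the load.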


On Mon, Oct 5, 2009 at 1:55 AM, Carole Wu <cwu272 <at> gmail.com> wrote:
> Hello,
>
> I'd like to collect information about my workload, running on Nehalem, using
> PEBS, so I use the following command.
>
>>> pfmon -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD --ld-lat-threshold=1
>>> --long-smpl-periods=2000 --short-smpl-periods=200 ./mcf_base inp.in
>>> load latency threshold not yet supported
> However, the response seems to suggest that my machine does not currently
> support PEBS? Is it true, or am I not setting parameters correctly?
>
> Any help is greatly appreciated.
>
> Carole
> ------------------------------------------------------------------------------
> Come build with us! The BlackBerry(R) Developer Conference in SF, CA
> is the only developer event you need to attend this year. Jumpstart your
> developing skills, take BlackBerry mobile applications to market and stay
> ahead of the curve. Join us from November 9-12, 2009. Register now!
> http://p.sf.net/sfu/devconf
> _______________________________________________
> perfmon2-devel mailing list
> perfmon2-devel <at> lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
>
>

stephane eranian | 1 Nov 2009 12:06

Re: PEBS support update

On Sun, Nov 1, 2009 at 4:37 AM, Carole Wu <cwu272 <at> gmail.com> wrote:
> Hello,
>
> pfmon --smpl-module=pebs-ll -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD \
>          --ld-lat-threshold=4 --long-smpl-periods=2000 --smpl-compact \
>          --with-header ...
>
> This should generate a trace of samples, where each sample represents the
> 2000th long-latency (>4 cycles) memory operation. So 2000 x (the number of
> samples in the trace) should equal the number of LAST_LEVEL_CACHE_MISSES
> collected with perfmon2's "counting" capability, e.g. pfmon -e
> LAST_LEVEL_CACHE_MISSES. Is this right?

No.

First of all, this is not tracing but statistical sampling, so you are going
to lose some misses.

There are shadowing effects. In order to complete a sample, you need the
latency. Until it is collected, no other load can be sampled even if it
exceeds the threshold, i.e., PEBS can only track one load at a time. Also,
when the PEBS buffer fills up, monitoring stops but execution continues, so
you miss some more loads there.

Note that this discrepancy is not specific to PEBS; you get the same behavior
with AMD IBS or Itanium D-EAR. But the idea is that if you run for long
enough, you will eventually get enough representative samples to approximate
a trace.
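
To make the shadowing effect concrete, here is a toy model in Python. It is only a sketch under strong assumptions that are not taken from the hardware documentation: one load issues per cycle, a single tracker follows a load until its latency is known, and buffer-full losses are ignored. The function name and its parameters are hypothetical.

```python
def simulate(latencies, threshold, period):
    """Toy model of one-load-at-a-time tracking.

    Returns (true_count, samples, estimate), where estimate = samples * period.
    """
    true_count = 0     # loads at or above the threshold (what a full trace would see)
    tracked_count = 0  # qualifying loads the single tracker actually observed
    busy_until = 0     # cycle at which the tracker becomes free again
    for cycle, latency in enumerate(latencies):
        qualifies = latency >= threshold
        if qualifies:
            true_count += 1
        if cycle >= busy_until:
            # Tracker is free: follow this load until its latency is known.
            busy_until = cycle + latency
            if qualifies:
                tracked_count += 1
        # Otherwise the load is shadowed by the one in flight and is never seen.
    samples = tracked_count // period
    return true_count, samples, samples * period
```

For example, `simulate([10] * 6, threshold=4, period=1)` returns `(6, 1, 1)`: six loads qualify, but the five that issue while the first is still in flight are never observed, so samples x period underestimates the true count.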

Carole Wu | 1 Nov 2009 14:30

Re: PEBS support update

Got it. Thanks very much for the explanation. 


--Carole

stephane eranian | 2 Nov 2009 12:10

Re: [PATCH 1/1] fix Power6 event codes (affects PAPI/perf_events mainly)

Applied.
thanks.

Drongowski, Paul | 2 Nov 2009 17:24

Re: same counter values for several processes on Barcelona

Hello Sergey --

Just to add a few comments to Stephane's.

Northbridge resources are shared across all cores
in the "node" (the cores that are local to the
Northbridge and communicate directly with the NB).
Northbridge events measure memory controller events,
crossbar events, HyperTransport(tm) events and
L3 cache events. The events generally do not break
out by core, i.e., these X events belong to core 0,
these Y events belong to core 1, etc. They measure
events due to requests, etc. across all local cores.

If all cores are measuring the same NB event, the
events will be counted multiple times -- not good.
So, perfmon2 follows the recommended practice, which is
to measure NB events on a single core
belonging to a given NB.
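
The double-counting problem can be illustrated with a small Python sketch. It assumes, purely for illustration, that every core on a node would read the same value for a shared NB event; the function name and both dictionaries are hypothetical and not part of perfmon2.

```python
def total_nb_events(readings, node_of_core):
    """Sum a shared Northbridge event across nodes, using one core per node.

    readings:     {core_id: counter value read on that core}
    node_of_core: {core_id: node (Northbridge) the core belongs to}
    """
    per_node = {}
    for core in sorted(readings):
        node = node_of_core[core]
        # Keep only the first core seen for each node; the other cores on that
        # node would report the same shared events again.
        per_node.setdefault(node, readings[core])
    return sum(per_node.values())
```

With four cores on node 0 each reading 1000 and two cores on node 1 each reading 500, a naive sum over all cores gives 5000, while `total_nb_events` gives 1500.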

There's one potential gotcha. The core measuring
the NB events must not go into halt mode (idle), as the
clock for the performance counter logic may be stopped,
in which case events are not counted. It may be necessary
to keep the core awake by running a low-priority
background task.
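
One way such a low-priority background task might look is sketched below in Python for Linux. This is an assumption-laden illustration, not a tested recipe for Family 10h systems: the function name is hypothetical, the loop is bounded so it can be tried safely, and whether it is needed at all depends on the platform's idle behavior.

```python
import os
import time


def keep_core_awake(core: int, seconds: float) -> int:
    """Spin on one core at low priority for a bounded time.

    Returns the number of loop iterations, only so callers can see it ran.
    """
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {core})  # pin the calling process to the chosen core
    try:
        os.nice(19)  # add 19 to the niceness (the kernel caps it at 19)
        iterations = 0
        deadline = time.monotonic() + seconds
        while time.monotonic() < deadline:
            iterations += 1  # busy loop: the core always has runnable work
        return iterations
    finally:
        os.sched_setaffinity(0, previous)  # restore the original CPU mask
```

The niceness change cannot be undone without privileges, so this is meant to run in its own process alongside the measurement, pinned to the core that counts the NB events.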

More information concerning Northbridge events can be
found in Section 3.12 of the BIOS and Kernel Developer's
Guide (BKDG) for AMD Family 10h processors.

-- pj

-----Original Message-----
From: stephane eranian [mailto:eranian <at> googlemail.com] 
Sent: Friday, October 30, 2009 6:59 PM
To: Sergey Blagodurov
Cc: perfmon2-devel <at> lists.sourceforge.net
Subject: Re: [perfmon2] same counter values for several processes on Barcelona

Sergey,

On Fri, Oct 30, 2009 at 11:23 PM, Sergey Blagodurov
<blagodurov <at> gmail.com> wrote:
> Hello,
>
> I have the following problem while trying to gather hardware counters
> on Barcelona: whenever I start several programs with the same counter
> profiling on AMD Barcelona, all of the program invocations except the
> first fail with a "conflicting monitoring session" error. The problem
> can be reproduced with something like this:
>
> $ pfmon --events=MEMORY_CONTROLLER_REQUESTS:READ_REQUESTS sleep 20 &
> [1]
> $ pfmon --events=MEMORY_CONTROLLER_REQUESTS:READ_REQUESTS sleep 20 &
> error conflicting monitoring session exists
>

This is due to a constraint on all Northbridge events.
The way the current perfmon code implements the restriction is to limit the
number of perfmon sessions which have access to NB events to 1 per socket in
system-wide mode and 1 global in per-thread mode.

> There is no such behaviour on Intel Xeon processors. I wonder what may
> cause this strange problem and are there any workarounds to it (i.e.
> is it still possible to gather same counter info for several
> programs/threads at the same time)?
>
> Thank you!
>
> --
> Sincerely,
>
> Sergey Blagodurov
>

Rob Fowler | 2 Nov 2009 18:02

Re: same counter values for several processes on Barcelona

One detail: As I've mentioned in other messages on the subject, you can
set up a "system-wide" session using pfmon to count the NB events.  This
will continue to count even when the thread that started the session
sleeps.

Sergey Blagodurov | 2 Nov 2009 19:49

Re: same counter values for several processes on Barcelona

Hi Paul,

> Northbridge resources are shared across all cores
> in the "node" (the cores that are local to the
> Northbridge and communicate directly with the NB).
> Northbridge events measure memory controller events,
> crossbar events, HyperTransport(tm) events and
> L3 cache events. The events generally do not break
> out by core, i.e., these X events belong to core 0,
> these Y events belong to core 1, etc. They measure
> events due to requests, etc. across all local cores.

There are some per-core events like L3_CACHE_MISSES:CORE_X_SELECT, but
I found I can still gather them for several cores from only one core in
a system-wide, per-core session. And when I need to gather more than 4
events, I use event sets for this session.

Thank you all for your helpful comments!

--
Sincerely,

Sergey Blagodurov

Carole Wu | 3 Nov 2009 04:10

Re: PEBS support update

Hello,


In my PEBS-LL trace, collected with 

pfmon --smpl-module=pebs-ll -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD --ld-lat-threshold=4 --long-smpl-periods=2000 --smpl-compact --with-header

I am seeing samples with latency 0x4 whose data source field says it is an L3 miss.

However, on the Nehalem machine the latency is about 4 cycles for the L1 I&D caches, about 11 cycles for the L2 cache, about 40 cycles for the L3 cache, and >100 cycles when going off-chip.

Is this because the latency in the trace is "scaled" somehow?
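
A small Python helper can flag samples like this one automatically. It is only a sketch: the cut-offs are the approximate figures quoted in this message, used as rough bucket boundaries, and the level names and function names are hypothetical rather than the PEBS data-source encoding.

```python
# Approximate Nehalem load latencies quoted above, in cycles.
# Each entry is (inclusive lower bound, level); order matters (largest first).
LEVELS = [(100, "off-chip"), (40, "L3"), (11, "L2"), (4, "L1")]


def expected_level(latency):
    """Map a latency to the level it would roughly correspond to."""
    for lower_bound, level in LEVELS:
        if latency >= lower_bound:
            return level
    return "below L1 hit latency"


def looks_inconsistent(latency, reported_level):
    """True when the reported source disagrees with the latency bucket."""
    return expected_level(latency) != reported_level
```

A sample with latency 0x4 that is reported as served from beyond the L3 falls in the "L1" bucket, so `looks_inconsistent(0x4, "off-chip")` is true and the sample would be flagged for a closer look.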

Thanks in advance for your help.
Carole



On Wed, Oct 28, 2009 at 3:36 AM, stephane eranian <eranian <at> googlemail.com> wrote:
Hi,

I am happy to report that I have now uploaded all the code necessary to use
PEBS on Intel Core, Atom, and Nehalem. That includes PEBS-LL on Nehalem
which is used to sample where cache misses occur.

What you need:
   - latest libpfm sources from CVS

   - latest pfmon sources from CVS

   - perfmon2 2.6.30 from GIT

     git clone
git://git.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git
     Make sure you enabled 'Unified PEBS'


This kernel includes a unified PEBS sampling format which supports Netburst,
Core, Atom, and Nehalem. You must insert the module perfmon_pebs_smpl
(or compile in the code).

Next, to use PEBS, you can simply do:

  pfmon --smpl-module=pebs --smpl-compact --with-header -einst_retired:any_p \
            --long-smpl-period=2400000 ...

  Not all events support PEBS. In --smpl-compact mode, each line
contains a PEBS
  sample.

To collect cache misses on Nehalem, you can do:

pfmon --smpl-module=pebs-ll -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD \
         --ld-lat-threshold=4 --long-smpl-periods=2000 --smpl-compact
--with-header ...

 You must use the MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD to activate this
 HW feature.

Each line contains a PEBS record, including the cache miss information.
The --ld-lat-threshold parameter is the minimum threshold for the miss
latency: only misses with latency >= threshold are captured. The threshold
must be at least 4, because 4 cycles is the L1D hit latency. For each
captured miss, you get an instruction address, a data address, the miss
latency, and the source of the data (where it came from; refer to the
Intel documentation).
It is important to understand that the instruction address does NOT point
to the load instruction but ALWAYS to the next dynamic instruction, i.e.,
the whole state is recorded at retirement of the load.
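
The record fields and the threshold rule described above can be sketched in a few lines of Python. This is only an illustration and not part of pfmon or libpfm: the class name PebsLLSample, its field names, and the helper functions are all hypothetical, and the sketch makes no assumption about the text format pfmon prints.

```python
from dataclasses import dataclass

# L1D hit latency in cycles; the text above gives this as the lowest
# threshold the load-latency facility accepts.
MIN_LD_LAT_THRESHOLD = 4


@dataclass
class PebsLLSample:
    """Hypothetical in-memory form of one PEBS-LL record."""
    next_ip: int      # address of the instruction AFTER the load, not the load
    data_addr: int    # address the load accessed
    latency: int      # measured latency in cycles
    data_source: int  # raw encoding; its meaning is in the Intel documentation


def check_threshold(threshold: int) -> int:
    """Reject thresholds below the L1D hit latency."""
    if threshold < MIN_LD_LAT_THRESHOLD:
        raise ValueError(
            f"threshold must be >= {MIN_LD_LAT_THRESHOLD}, got {threshold}")
    return threshold


def is_captured(latency: int, threshold: int) -> bool:
    """Only loads whose latency is >= threshold are eligible for sampling."""
    return latency >= check_threshold(threshold)


def estimated_loads(num_samples: int, period: int) -> int:
    """Rough estimate: each sample stands for `period` qualifying loads."""
    return num_samples * period


if __name__ == "__main__":
    sample = PebsLLSample(next_ip=0x400ABC, data_addr=0x7F0010,
                          latency=180, data_source=0)
    print(is_captured(sample.latency, threshold=4))
    print(estimated_loads(num_samples=10, period=2000))
```

Naming the instruction field next_ip is a deliberate choice in this sketch: it keeps the "points to the next dynamic instruction" caveat visible wherever the value is used.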


On Mon, Oct 5, 2009 at 1:55 AM, Carole Wu <cwu272 <at> gmail.com> wrote:
> Hello,
>
> I'd like to collect information about my workload, running on Nehalem, using
> PEBS, so I use the following command.
>
>>> pfmon -e MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD --ld-lat-threshold=1
>>> --long-smpl-periods=2000 --short-smpl-periods=200 ./mcf_base inp.in
>>> load latency threshold not yet supported
> However, the response seems to suggest that my machine does not currently
> support PEBS? Is it true, or am I not setting parameters correctly?
>
> Any help is greatly appreciated.
>
> Carole

------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay 
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
perfmon2-devel mailing list
perfmon2-devel <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/perfmon2-devel
stephane eranian | 3 Nov 2009 09:51

Re: PEBS support update

On Tue, Nov 3, 2009 at 4:10 AM, Carole Wu <cwu272 <at> gmail.com> wrote:
> Hello,
> In my PEBS-LL trace, collected with
> pfmon --smpl-module=pebs-ll -e
> MEM_INST_RETIRED:LATENCY_ABOVE_THRESHOLD --ld-lat-threshold=4
> --long-smpl-periods=2000 --smpl-compact --with-header
> I am seeing samples with latency 0x4 and the data source information field
> says it is a L3 miss.
> However, the latency for L1 I&D cache is about 4 cycles, for L2 cache is
> about 11 cycles, for L3 cache is about 40 cycles, and >100 cycles going
> offchip for the Nehalem machine.
> Is this because the latency in the trace is "scaled" somehow?

I have seen similar samples myself, but they should represent a tiny
fraction of all samples. It is not clear where they come from: either a
bug or some corner-case situation. I would ignore them if possible.
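
One possible way to ignore such samples in post-processing is sketched below in Python. It is an illustration under stated assumptions, not something pfmon provides: the Sample class, the source labels, and the slack factor are hypothetical, and the latency floors are simply the approximate Nehalem figures quoted in the question above.

```python
from dataclasses import dataclass

# Approximate latencies in cycles, taken from the figures quoted in this
# thread. The labels are hypothetical names, NOT the hardware encoding of
# the data source field.
APPROX_MIN_LATENCY = {
    "L1_HIT": 4,
    "L2_HIT": 11,
    "L3_HIT": 40,
    "L3_MISS": 40,  # assumption: an L3 miss costs at least an L3 lookup
}


@dataclass
class Sample:
    latency: int  # load latency in cycles
    source: str   # one of the hypothetical labels above


def is_plausible(sample: Sample, slack: float = 0.5) -> bool:
    """True if the latency is at least slack * the expected floor."""
    floor = APPROX_MIN_LATENCY.get(sample.source)
    if floor is None:
        return True  # unknown source: keep the sample rather than guess
    return sample.latency >= slack * floor


def split_samples(samples):
    """Separate plausible samples from suspicious ones."""
    kept = [s for s in samples if is_plausible(s)]
    dropped = [s for s in samples if not is_plausible(s)]
    return kept, dropped


if __name__ == "__main__":
    trace = [Sample(0x4, "L3_MISS"), Sample(180, "L3_MISS"),
             Sample(4, "L1_HIT")]
    kept, dropped = split_samples(trace)
    share = 100.0 * len(dropped) / len(trace)
    print(f"kept {len(kept)}, dropped {len(dropped)} ({share:.1f}%)")
```

Reporting the dropped share is useful as a sanity check: if it is more than a tiny fraction of the trace, the floors or the source mapping are probably wrong and the samples should not be discarded silently.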

