大平怜 | 25 Jun 02:06 2016

Fwd: Potential deadlock in operf when using --pid

I forgot to include oprofile-list.


Regards,
Rei Odaira

---------- Forwarded message ----------
From: 大平怜 <rei.odaira <at> gmail.com>
Date: 2016-06-24 19:04 GMT-05:00
Subject: Re: Potential deadlock in operf when using --pid
To: William Cohen <wcohen <at> redhat.com>


Hi,

I am now coming back to this problem.

I am attaching two patches to solve it.  They apply to the latest master in the git repository.  Please apply oprofile_avoid_deadlock_1.patch first and then oprofile_avoid_deadlock_2.patch.  The first patch only refactors the code to make the second patch possible; the second patch contains the actual fix.

The problem is that the operf-record process is blocked on a write to the sample output pipe when that pipe is full, while the operf-read process, the consumer of that pipe, cannot read from it because it is itself blocked on a read from the comm pipe, to which the operf-record process is supposed to write.  Please refer to the previous emails for a detailed description of the problem.

            |<--sample output pipe--|
operf-read  |-------comm pipe------>|  operf-record
            |<------comm pipe-------|
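The full-pipe blocking behavior at the core of this deadlock can be demonstrated with a small self-contained sketch (the fd names and chunk size are illustrative, not operf's actual code):

```c
/* Demonstrates that a pipe has finite capacity: once it is full, a
 * blocking write() hangs, which is the state the operf-record process
 * is stuck in.  O_NONBLOCK is used here only so the demo itself
 * cannot hang; fill_pipe() reports how many bytes fit before EAGAIN. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static long fill_pipe(int wfd)
{
	static char chunk[4096]; /* <= PIPE_BUF, so writes are all-or-nothing */
	long total = 0;

	fcntl(wfd, F_SETFL, O_NONBLOCK);
	for (;;) {
		ssize_t n = write(wfd, chunk, sizeof chunk);
		if (n > 0) {
			total += n;
			continue;
		}
		if (errno == EAGAIN)
			return total; /* pipe full: a blocking write would now hang */
		if (errno == EINTR)
			continue;     /* a blind retry loops here even across signals */
		return -1;
	}
}
```

With a blocking fd, the write after the last successful one would never return, and a loop that retries on EINTR never notices the SIGINT/SIGUSR1 meant to stop it.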

My solution is to make the sample output pipe non-blocking on the operf-record side.  When the sample output pipe is full, instead of blocking on it, the operf-record process polls both the sample output pipe and the read end of the comm pipe.  If there is a message in the comm pipe from the operf-read process, the operf-record process handles it and writes a response to the comm pipe, so that the operf-read process can return from its blocked read on the comm pipe and eventually consume the sample output pipe.
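A minimal sketch of this idea (the fd names and the handle_comm_request callback are hypothetical placeholders; the real code in the attached patches differs):

```c
/* Sketch of the non-blocking-write-plus-poll approach.  sample_fd is
 * the write end of the sample output pipe, assumed set to O_NONBLOCK
 * once at startup; comm_fd is the read end of the comm pipe.
 * handle_comm_request() stands in for whatever services the
 * operf-read process's request and writes the response. */
#include <errno.h>
#include <poll.h>
#include <unistd.h>

int write_sample(int sample_fd, int comm_fd, const char *buf, size_t size,
                 void (*handle_comm_request)(int))
{
	while (size > 0) {
		ssize_t n = write(sample_fd, buf, size);
		if (n > 0) {
			buf += n;
			size -= n;
			continue;
		}
		if (n < 0 && errno != EAGAIN && errno != EINTR)
			return -1;
		/* Pipe full: wait for it to drain OR for a comm request,
		 * instead of blocking and deadlocking. */
		struct pollfd fds[2] = {
			{ .fd = sample_fd, .events = POLLOUT },
			{ .fd = comm_fd,   .events = POLLIN  },
		};
		if (poll(fds, 2, -1) < 0 && errno != EINTR)
			return -1;
		if (fds[1].revents & POLLIN)
			handle_comm_request(comm_fd); /* unblocks operf-read */
	}
	return 0;
}
```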

There are other ways to solve this problem, so any comments are welcome.  With my patches, you can finish profiling oprofile_multithread_test with a single Ctrl-C.  They passed all the tests in oprofile-tests.


Regards,
Rei Odaira


2015-10-02 14:19 GMT-05:00 William Cohen <wcohen <at> redhat.com>:
On 10/02/2015 11:10 AM, 大平怜 wrote:
> Sorry, I sent wrong source code.  I am attaching the right one.
>
>
> Regards,
> Rei Odaira
>
> 2015-10-01 18:32 GMT-05:00 大平怜 <rei.odaira <at> gmail.com <mailto:rei.odaira <at> gmail.com>>:
>
>     Hi Will,
>
>     How about the attached test program?  This almost always causes the problem in my environment.
>
>     > gcc -o oprofile_multithread_test oprofile_multithread_test.c -lpthread
>     > ./oprofile_multithread_test
>     Usage: oprofile_multithread_test <number of spawns> <number of threads> <number of operations per thread>
>     > ./oprofile_multithread_test -1 16 100000
>
>     In this example, oprofile_multithread_test spawns threads indefinitely but runs at most 16 threads simultaneously.  Each thread performs an addition 100000 times and then completes.  Please use ^C to end the program if you specify -1 for the number of spawns.
>
>     If you profile this program with operf --pid, I expect you will not be able to finish operf by a single ^C.
>
>
>     Regards,
>     Rei Odaira

Hi,

I was able to reproduce the failure mentioned above with the example code on a RHEL 7 machine.

$ /usr/local/bin/operf  --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi`
operf: Press Ctl-c or 'kill -SIGINT 8976' to stop profiling
operf: Profiler started
Unable to collect samples for forked process 9569. Process may have ended before recording could be started.
Unable to collect samples for forked process 9570. Process may have ended before recording could be started.
Unable to collect samples for forked process 9571. Process may have ended before recording could be started.
Unable to collect samples for forked process 9572. Process may have ended before recording could be started.
Unable to collect samples for forked process 9573. Process may have ended before recording could be started.
Unable to collect samples for forked process 9574. Process may have ended before recording could be started.
Unable to collect samples for forked process 9576. Process may have ended before recording could be started.
Unable to collect samples for forked process 9577. Process may have ended before recording could be started.
Unable to collect samples for forked process 9578. Process may have ended before recording could be started.
Unable to collect samples for forked process 9579. Process may have ended before recording could be started.
Unable to collect samples for forked process 9580. Process may have ended before recording could be started.
Unable to collect samples for forked process 9581. Process may have ended before recording could be started.
Unable to collect samples for forked process 9582. Process may have ended before recording could be started.
Unable to collect samples for forked process 9583. Process may have ended before recording could be started.
Unable to collect samples for forked process 9584. Process may have ended before recording could be started.
Unable to collect samples for forked process 9585. Process may have ended before recording could be started.
Unable to collect samples for forked process 9586. Process may have ended before recording could be started.
Unable to collect samples for forked process 9587. Process may have ended before recording could be started.
Unable to collect samples for forked process 9588. Process may have ended before recording could be started.
Unable to collect samples for forked process 9589. Process may have ended before recording could be started.
Unable to collect samples for forked process 9590. Process may have ended before recording could be started.
Unable to collect samples for forked process 9591. Process may have ended before recording could be started.
Unable to collect samples for forked process 9592. Process may have ended before recording could be started.
Unable to collect samples for forked process 9593. Process may have ended before recording could be started.
Unable to collect samples for forked process 9594. Process may have ended before recording could be started.
Unable to collect samples for forked process 9595. Process may have ended before recording could be started.
Unable to collect samples for forked process 9596. Process may have ended before recording could be started.
Unable to collect samples for forked process 9597. Process may have ended before recording could be started.
Unable to collect samples for forked process 9598. Process may have ended before recording could be started.
Unable to collect samples for forked process 9599. Process may have ended before recording could be started.
Unable to collect samples for forked process 9600. Process may have ended before recording could be started.
Unable to collect samples for forked process 9601. Process may have ended before recording could be started.
^C^Cwaitpid for operf-record process failed: Interrupted system call
^Cwaitpid for operf-read process failed: Interrupted system call
Error running profiler

Threads are being created and destroyed very often in the example code. It took multiple attempts to get operf to connect to all the threads. Many times I got messages like the following:

$ /usr/local/bin/operf  --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep oprofile_multi`
!!!! No samples collected !!!
The target program/command ended before profiling was started.
operf record init failed
usage: operf [ options ] [ --system-wide | --pid <pid> | [ command [ args ] ] ]
Error running profiler

However, if operf gets started, it seems to reliably fail with the ctl-c.

-Will

>
>     2015-10-01 15:42 GMT-05:00 William Cohen <wcohen <at> redhat.com <mailto:wcohen <at> redhat.com>>:
>
>         On 09/30/2015 06:07 PM, 大平怜 wrote:
>         > Hello again,
>         >
>         > When using --pid, I have occasionally seen that operf does not end after hitting ^C once. After hitting ^C multiple times, operf ends with these error messages:
>
>         Hi,
>
>         I tried to replicate this on my local machine, but haven't seen it occur yet.  How often does it happen?  Also, does it make a difference when the event sampling rate is changed?  Is there just one monitored process, and is it not spawning new processes?
>
>         >
>         > -----
>         > operf --events=CPU_CLK_UNHALTED:100000000:0:1:1 --pid `pgrep -f CassandraDaemon`
>         > Kernel profiling is not possible with current system config.
>         > Set /proc/sys/kernel/kptr_restrict to 0 to collect kernel samples.
>         > operf: Press Ctl-c or 'kill -SIGINT 18042' to stop profiling
>         > operf: Profiler started
>         > ^C^Cwaitpid for operf-record process failed: Interrupted system call
>         > ^Cwaitpid for operf-read process failed: Interrupted system call
>         > Error running profiler
>         > -----
>         >
>         > I am using the master branch in the git repository.
>         >
>         > Here is what I found:
>         > The operf-read process was waiting for a read of a sample ID from the operf-record process to return:
>         > (gdb) bt
>         > #0  0x00007fd90e0fa480 in __read_nocancel ()
>         >     at ../sysdeps/unix/syscall-template.S:81
>         > #1  0x0000000000412999 in read (__nbytes=8, __buf=0x7ffddc82b620,
>         >     __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
>         > #2  __handle_fork_event (event=0x98e860) at operf_utils.cpp:125
>         > #3  OP_perf_utils::op_write_event (event=event <at> entry=0x98e860,
>         >     sample_type=<optimized out>) at operf_utils.cpp:834
>         > #4  0x0000000000417250 in operf_read::convertPerfData (
>         >     this=this <at> entry=0x648000 <operfRead>) at operf_counter.cpp:1147
>         > #5  0x000000000040a4cb in convert_sample_data () at operf.cpp:947
>         > #6  0x0000000000407482 in _run () at operf.cpp:625
>         > #7  main (argc=4, argv=0x7ffddc82be48) at operf.cpp:1539
>         >
>         > The operf-record process was waiting for a write of sample data to the operf-read process to complete.  Why did the write of the sample data block?  My guess is that the sample_data_pipe was full:
>         > (gdb) bt
>         > #0  0x00007fbe0dc9f4e0 in __write_nocancel ()
>         >     at ../sysdeps/unix/syscall-template.S:81
>         > #1  0x000000000040cd0e in OP_perf_utils::op_write_output (output=6,
>         >     buf=0x7fbe0e4e5140, size=32) at operf_utils.cpp:989
>         > #2  0x000000000040d605 in OP_perf_utils::op_get_kernel_event_data (
>         >     md=0xd3c7f0, pr=pr <at> entry=0xd07900) at operf_utils.cpp:1443
>         > #3  0x000000000041bc12 in operf_record::recordPerfData (this=0xd07900)
>         >     at operf_counter.cpp:846
>         > #4  0x00000000004098b8 in start_profiling () at operf.cpp:402
>         > #5  0x0000000000406305 in _run () at operf.cpp:596
>         > #6  main (argc=4, argv=0x7ffdc6fcde58) at operf.cpp:1539
>         >
>         > As a result, when I hit ^C, the operf main process sent SIGUSR1 to the operf-record process, in which the write returned with EINTR and simply got retried. Since the operf-record process did not end, the operf main process waited at waitpid(2) forever.
>         >
>         > Do you think my guess makes sense?  What would be a fundamental solution?  Simply extending the pipe size would not be appropriate....
>
>         I am wondering if there are any other nuggets of information that can be gathered by using "--verbose debug,misc" and other "--verbose" variations on the operf command line.  It would be worthwhile to take a close look at the code in operf.cpp and see how ctl-c is being handled.  There could be a problem with the order in which things are shut down, causing the problem.  I noticed that around line 406 of operf.cpp there is the following code:
>
>         catch (const runtime_error & re) {
>                                 /* If the user does ctl-c, the operf-record process may get interrupted
>                                  * in a system call, causing problems with writes to the sample data pipe.
>                                  * So we'll ignore such errors unless the user requests debug info.
>                                  */
>                                 if (!ctl_c || (cverb << vmisc)) {
>                                         cerr << "Caught runtime_error: " << re.what() << endl;
>                                         exit_code = EXIT_FAILURE;
>                                 }
>                                 goto fail_out;
>                         }
>
>         -Will
>
>
>
>
>
>
>
> ------------------------------------------------------------------------------
>
>
>
> _______________________________________________
> oprofile-list mailing list
> oprofile-list <at> lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oprofile-list
>



Attachment (oprofile_avoid_deadlock_1.patch): application/octet-stream, 6885 bytes
Attachment (oprofile_avoid_deadlock_2.patch): application/octet-stream, 13 KiB
Dimitris Ganosis | 23 Jun 23:22 2016

Measure memory bandwidth

I am running an Intel(R) Xeon(R) CPU E5520 <at> 2.27GHz and, apart from perf list, I follow this list for the available counters. I read this article about memory bandwidth, but unfortunately my CPU doesn't support those counters.
I found, here, that
Memory Bandwidth =
1.0e-9 * (UNC_IMC_NORMAL_READS.ANY + UNC_IMC_WRITES.FULL.ANY) *64 / (wall clock time in seconds)
but my CPU doesn't support these two counters either. The closest available events on my CPU are uncore/qmc_normal_reads_any/ and uncore/qmc_writes_full_any/

Are these events equivalent? Is it valid to measure memory bandwidth with this formula?
Memory Bandwidth = 1.0e-9 * (uncore/qmc_normal_reads_any/+uncore/qmc_writes_full_any/)*64 / (wall clock time in seconds)
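For reference, here is the formula above expressed as code (the counter values are placeholders read from the profiler's output; whether the qmc_* events can stand in for the imc_* events is exactly the open question):

```c
/* Memory bandwidth in GB/s from uncore read/write transaction counts,
 * per the formula above: each counted transaction is assumed to
 * transfer one 64-byte cache line.  The counter values passed in are
 * placeholders; this does not settle whether qmc_* and imc_* events
 * are equivalent. */
static double memory_bandwidth_gb(unsigned long long reads,
                                  unsigned long long writes,
                                  double wall_seconds)
{
	return 1.0e-9 * (double)(reads + writes) * 64.0 / wall_seconds;
}
```

For example, 10^9 reads and 5*10^8 writes over 10 seconds of wall time give 1.0e-9 * 1.5e9 * 64 / 10 = 9.6 GB/s.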

Thank you very much!
Dimitris Ganosis | 18 Jun 14:29 2016

HPCs for DRAM

I am running an Intel(R) Xeon(R) CPU E5520 <at> 2.27GHz and, apart from perf list, I follow this list for the available counters.
Based on Linux perf events Event Sources I expected that counters like

MEM_STORE_RETIRED and MEM_LOAD_RETIRED

would capture only DRAM activity. What I mean is that if I am not using DRAM, it seems bizarre that these counters follow the same trend as CPI and other CPU counters. I ran a CPU-intensive benchmark that does not use DRAM, as far as I could tell from htop, yet in the graph I created afterwards these two counters followed the same trend as almost all the other counters. So my question is: are there any counters that are "triggered" only when DRAM is used?
Like the IO_TRANSACTIONS which is triggered only when I have I/Os.
Thank you!
Francisco de MeLo Jr | 13 Jun 07:03 2016

Visualization tool integration

Hi guys.


I'm Francisco, and I'm currently developing a visualization tool for profiling, specifically for call graphs and calling context trees.

I'd like my tool to integrate oprofile outputs, so I was thinking about writing a parser to read the profiling report from oprofile and display it.

Is there any place where I can find some guidelines for understanding the output?


tks,

-fr 





Andrea Gelmini | 21 May 14:06 2016

[PATCH 0231/1529] Fix typo

Signed-off-by: Andrea Gelmini <andrea.gelmini <at> gelma.net>
---
 arch/powerpc/oprofile/cell/spu_task_sync.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/oprofile/cell/spu_task_sync.c b/arch/powerpc/oprofile/cell/spu_task_sync.c
index ed7b097..ef2142f 100644
--- a/arch/powerpc/oprofile/cell/spu_task_sync.c
+++ b/arch/powerpc/oprofile/cell/spu_task_sync.c
 <at>  <at>  -51,7 +51,7  <at>  <at>  static void spu_buff_add(unsigned long int value, int spu)
 	 * That way we can tell the difference between the
 	 * buffer being full versus empty.
 	 *
-	 *  ASSUPTION: the buffer_lock is held when this function
+	 *  ASSUMPTION: the buffer_lock is held when this function
 	 *             is called to lock the buffer, head and tail.
 	 */
 	int full = 1;
--

-- 
2.8.2.534.g1f66975

Andrea Gelmini | 21 May 14:01 2016

[PATCH 0198/1529] Fix typo

Signed-off-by: Andrea Gelmini <andrea.gelmini <at> gelma.net>
---
 arch/mips/oprofile/op_impl.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/oprofile/op_impl.h b/arch/mips/oprofile/op_impl.h
index 7c2da27..a4e758a 100644
--- a/arch/mips/oprofile/op_impl.h
+++ b/arch/mips/oprofile/op_impl.h
 <at>  <at>  -24,7 +24,7  <at>  <at>  struct op_counter_config {
 	unsigned long unit_mask;
 };

-/* Per-architecture configury and hooks.  */
+/* Per-architecture configure and hooks.  */
 struct op_mips_model {
 	void (*reg_setup) (struct op_counter_config *);
 	void (*cpu_setup) (void *dummy);
--

-- 
2.8.2.534.g1f66975

Andreas Arnez | 11 May 16:15 2016

[PATCH] s390: Add support for z13

So far oprofile supported z Systems (s390) machines up to zEC12.  This
adds support for z13 as well.

Signed-off-by: Andreas Arnez <arnez <at> linux.vnet.ibm.com>
---
 events/Makefile.am  | 3 ++-
 libop/op_cpu_type.c | 3 +++
 libop/op_cpu_type.h | 1 +
 libop/op_events.c   | 1 +
 utils/ophelp.c      | 1 +
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/events/Makefile.am b/events/Makefile.am
index 677b05f..29d4b5f 100644
--- a/events/Makefile.am
+++ b/events/Makefile.am
 <at>  <at>  -81,7 +81,8  <at>  <at>  event_files = \
 	tile/tilegx/events tile/tilegx/unit_masks \
 	s390/z10/events s390/z10/unit_masks \
 	s390/z196/events s390/z196/unit_masks \
-	s390/zEC12/events s390/zEC12/unit_masks
+	s390/zEC12/events s390/zEC12/unit_masks \
+	s390/z13/events s390/z13/unit_masks

 install-data-local:
 	for i in ${event_files} ; do \
diff --git a/libop/op_cpu_type.c b/libop/op_cpu_type.c
index 7bdde53..e70e4f6 100644
--- a/libop/op_cpu_type.c
+++ b/libop/op_cpu_type.c
 <at>  <at>  -123,6 +123,7  <at>  <at>  static struct cpu_descr const cpu_descrs[MAX_CPU_TYPE] = {
 	{ "ARM Cortex-A53", "arm/armv8-ca53", CPU_ARM_V8_CA53, 6},
 	{ "Intel Skylake microarchitecture", "i386/skylake", CPU_SKYLAKE, 4 },
 	{ "Intel Goldmont microarchitecture", "i386/goldmont", CPU_GOLDMONT, 4 },
+	{ "IBM z13", "s390/z13", CPU_S390_Z13, 1 },
 };

 static size_t const nr_cpu_descrs = sizeof(cpu_descrs) / sizeof(struct cpu_descr);
 <at>  <at>  -680,6 +681,8  <at>  <at>  static op_cpu _get_s390_cpu_type(void)
 	case 2827:
 	case 2828:
 		return CPU_S390_ZEC12;
+	case 2964:
+		return CPU_S390_Z13;
 	}
 	return CPU_NO_GOOD;
 }
diff --git a/libop/op_cpu_type.h b/libop/op_cpu_type.h
index 98289c5..4f896a0 100644
--- a/libop/op_cpu_type.h
+++ b/libop/op_cpu_type.h
 <at>  <at>  -103,6 +103,7  <at>  <at>  typedef enum {
 	CPU_ARM_V8_CA53, /* ARM Cortex-A53 */
 	CPU_SKYLAKE, /** < Intel Skylake microarchitecture */
 	CPU_GOLDMONT, /** < Intel Goldmont microarchitecture */
+	CPU_S390_Z13, /** < IBM z13 */
 	MAX_CPU_TYPE
 } op_cpu;

diff --git a/libop/op_events.c b/libop/op_events.c
index cdd0409..ea6ced3 100644
--- a/libop/op_events.c
+++ b/libop/op_events.c
 <at>  <at>  -1307,6 +1307,7  <at>  <at>  void op_default_event(op_cpu cpu_type, struct op_default_event_descr * descr)
 		case CPU_S390_Z10:
 		case CPU_S390_Z196:
 		case CPU_S390_ZEC12:
+		case CPU_S390_Z13:
 			descr->name = "CPU_CYCLES";
 			descr->count = 4127518;
 			break;
diff --git a/utils/ophelp.c b/utils/ophelp.c
index 5821593..3cb1c08 100644
--- a/utils/ophelp.c
+++ b/utils/ophelp.c
 <at>  <at>  -779,6 +779,7  <at>  <at>  int main(int argc, char const * argv[])
 	case CPU_S390_Z10:
 	case CPU_S390_Z196:
 	case CPU_S390_ZEC12:
+	case CPU_S390_Z13:
 		event_doc = "IBM System z CPU Measurement Facility\n"
 				"http://www-01.ibm.com/support/docview.wss"
 				"?uid=isg26fcd1cc32246f4c8852574ce0044734a\n";
--

-- 
2.3.0

William Cohen | 6 May 21:55 2016

[PATCH v2] Additional X-Gene 1 performance events

The initial OProfile X-Gene 1 support only had the ARMv8 generic
performance events.  There are many additional microarchitecture
performance events listed for X-Gene 1 at:

https://github.com/AppliedMicro/ENGLinuxLatest/blob/apm_linux_v3.17-rc4/Documentation/arm64/xgene_pmu.txt

This patch adds those X-Gene 1 specific events.

v2: Updated to exclude ARMv8 architected events not supported by X-Gene
Signed-off-by: William Cohen <wcohen <at> redhat.com>
---
 events/arm/armv8-xgene/events | 119 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 117 insertions(+), 2 deletions(-)

diff --git a/events/arm/armv8-xgene/events b/events/arm/armv8-xgene/events
index 3e28463..1df573a 100644
--- a/events/arm/armv8-xgene/events
+++ b/events/arm/armv8-xgene/events
 <at>  <at>  -2,6 +2,121  <at>  <at> 
 # Copyright (c) Red Hat, 2014.
 # Contributed by William Cohen <wcohen <at> redhat.com>
 #
-# Basic ARM V8 events
+# Applied Micro X-Gene events
 #
-include:arm/armv8-pmuv3-common
+# The X-Gene processor excludes a few of the basic ARMv8 architected events.
+# Thus, need to explicitly list them rather than include
+# arm/armv8-pmuv3-common
+
+# The basic ARMv8 architect events supported by X-Gene
+event:0x00 um:zero minimum:500 name:SW_INCR : Instruction architecturally executed, condition code
check pass, software increment
+event:0x01 um:zero minimum:5000 name:L1I_CACHE_REFILL : Level 1 instruction cache refill
+event:0x02 um:zero minimum:5000 name:L1I_TLB_REFILL : Level 1 instruction TLB refill
+event:0x03 um:zero minimum:5000 name:L1D_CACHE_REFILL : Level 1 data cache refill
+event:0x04 um:zero minimum:5000 name:L1D_CACHE : Level 1 data cache access
+event:0x05 um:zero minimum:5000 name:L1D_TLB_REFILL : Level 1 data TLB refill
+# event:0x06 um:zero minimum:100000 name:LD_RETIRED : Instruction architecturally executed,
condition code check pass, load
+# event:0x07 um:zero minimum:100000 name:ST_RETIRED : Instruction architecturally executed,
condition code check pass, store
+event:0x08 um:zero minimum:100000 name:INST_RETIRED : Instruction architecturally executed
+event:0x09 um:zero minimum:500 name:EXC_TAKEN : Exception taken
+event:0x0A um:zero minimum:500 name:EXC_RETURN : Instruction architecturally executed, condition
code check pass, exception return
+event:0x0B um:zero minimum:500 name:CID_WRITE_RETIRED : Instruction architecturally executed,
condition code check pass, write to CONTEXTIDR
+# event:0x0C um:zero minimum:5000 name:PC_WRITE_RETIRED : Instruction architecturally executed,
condition code check pass, software change of the PC
+# event:0x0D um:zero minimum:5000 name:BR_IMMED_RETIRED : Instruction architecturally executed,
immediate branch
+# event:0x0E um:zero minimum:5000 name:BR_RETURN_RETIRED : Instruction architecturally executed,
condition code check pass, procedure return
+# event:0x0F um:zero minimum:500 name:UNALIGNED_LDST_RETIRED : Instruction architecturally
executed, condition code check pass, unaligned load or store
+event:0x10 um:zero minimum:5000 name:BR_MIS_PRED : Mispredicted or not predicted branch
speculatively executed
+event:0x11 um:zero minimum:100000 name:CPU_CYCLES : Cycle
+event:0x12 um:zero minimum:5000 name:BR_PRED : Predictable branch speculatively executed
+event:0x13 um:zero minimum:100000 name:MEM_ACCESS : Data memory access
+event:0x14 um:zero minimum:5000 name:L1I_CACHE : Level 1 instruction cache access
+# event:0x15 um:zero minimum:5000 name:L1D_CACHE_WB : Level 1 data cache write-back
+event:0x16 um:zero minimum:5000 name:L2D_CACHE : Level 2 data cache access
+event:0x17 um:zero minimum:5000 name:L2D_CACHE_REFILL : Level 2 data cache refill
+event:0x18 um:zero minimum:5000 name:L2D_CACHE_WB : Level 2 data cache write-back
+event:0x19 um:zero minimum:5000 name:BUS_ACCESS : Bus access
+event:0x1A um:zero minimum:500 name:MEMORY_ERROR : Local memory error
+event:0x1B um:zero minimum:100000 name:INST_SPEC : Operation speculatively executed
+event:0x1C um:zero minimum:5000 name:TTBR_WRITE_RETIRED : Instruction architecturally executed,
condition code check pass, write to TTBR
+# event:0x1D um:zero minimum:5000 name:BUS_CYCLES : Bus cycle
+# event:0x1F um:zero minimum:5000 name:L1D_CACHE_ALLOCATE : Level 1 data cache allocation without refill
+# event:0x20 um:zero minimum:5000 name:L2D_CACHE_ALLOCATE : Level 2 data cache allocation without refill
+# X-Gene specific events
+event:0x040 um:zero minimum:10007 name:L1D_CACHE_LD : L1 data cache access - Read
+event:0x041 um:zero minimum:10007 name:L1D_CACHE_ST : L1 data cache access - Write
+event:0x042 um:zero minimum:10007 name:L1D_CACHE_REFILL_LD : L1 data cache refill - Read
+event:0x048 um:zero minimum:10007 name:L1D_CACHE_INVAL : L1 data cache invalidate
+event:0x04C um:zero minimum:10007 name:L1D_TLB_REFILL_LD : L1 data TLB refill - Read
+event:0x04D um:zero minimum:10007 name:L1D_TLB_REFILL_ST : L1 data TLB refill - Write
+event:0x050 um:zero minimum:10007 name:L2D_CACHE_LD : L2 data cache access - Read
+event:0x051 um:zero minimum:10007 name:L2D_CACHE_ST : L2 data cache access - Write
+event:0x052 um:zero minimum:10007 name:L2D_CACHE_REFILL_LD : L2 data cache refill - Read
+event:0x053 um:zero minimum:10007 name:L2D_CACHE_REFILL_ST : L2 data cache refill - Write
+event:0x056 um:zero minimum:10007 name:L2D_CACHE_WB_VICTIM : L2 data cache write-back - victim
+event:0x057 um:zero minimum:10007 name:L2D_CACHE_WB_CLEAN : L2 data cache write-back - Cleaning and coherency
+event:0x058 um:zero minimum:10007 name:L2D_CACHE_INVAL : L2 data cache invalidate
+event:0x060 um:zero minimum:10007 name:BUS_ACCESS_LD : Bus access - Read
+event:0x061 um:zero minimum:10007 name:BUS_ACCESS_ST : Bus access - Write
+event:0x062 um:zero minimum:10007 name:BUS_ACCESS_SHARED : Bus access - Normal, cacheable, sharable
+event:0x063 um:zero minimum:10007 name:BUS_ACCESS_NOT_SHARED : Bus access - Not normal, cacheable, sharable
+event:0x064 um:zero minimum:10007 name:BUS_ACCESS_NORMAL : Bus access - Normal
+event:0x065 um:zero minimum:10007 name:BUS_ACCESS_PERIPH : Bus access - Peripheral
+event:0x066 um:zero minimum:10007 name:MEM_ACCESS_LD : Data memory access - Read
+event:0x067 um:zero minimum:10007 name:MEM_ACCESS_ST : Data memory access - write
+event:0x068 um:zero minimum:10007 name:UNALIGNED_LD_SPEC : Unaligned access - Read
+event:0x069 um:zero minimum:10007 name:UNALIGNED_ST_SPEC : Unaligned access - Write
+event:0x06A um:zero minimum:10007 name:UNALIGNED_LDST_SPEC : Unaligned access
+event:0x06C um:zero minimum:10007 name:LDREX_SPEC : Exclusive operation speculatively executed -
Load exclusive
+event:0x06D um:zero minimum:10007 name:STREX_PASS_SPEC : Exclusive operation speculative executed
- Store exclusive pass
+event:0x06E um:zero minimum:10007 name:STREX_FAIL_SPEC : Exclusive operation speculative executed
- Store exclusive fail
+event:0x06F um:zero minimum:10007 name:STREX_SPEC : Exclusive operation speculatively executed -
Store exclusive
+event:0x070 um:zero minimum:10007 name:LD_SPEC : Operation speculatively executed - Load
+event:0x071 um:zero minimum:10007 name:ST_SPEC : Operation speculatively executed - Store
+event:0x072 um:zero minimum:10007 name:LDST_SPEC : Operation speculatively executed - Load or store
+event:0x073 um:zero minimum:10007 name:DP_SPEC : Operation speculatively executed - Integer data processing
+event:0x074 um:zero minimum:10007 name:ASE_SPEC : Operation speculatively executed - Advanced SIMD
+event:0x075 um:zero minimum:10007 name:VFP_SPEC : Operation speculatively executed - FP
+event:0x076 um:zero minimum:10007 name:PC_WRITE_SPEC : Operation speculatively executed - Software
change of PC
+event:0x078 um:zero minimum:10007 name:BR_IMMED_SPEC : Branch speculative executed - Immediate branch
+event:0x079 um:zero minimum:10007 name:BR_RETURN_SPEC : Branch speculative executed - Procedure return
+event:0x07A um:zero minimum:10007 name:BR_INDIRECT_SPEC : Branch speculative executed - Indirect branch
+event:0x07C um:zero minimum:10007 name:ISB_SPEC : Barrier speculatively executed - ISB
+event:0x07D um:zero minimum:10007 name:DSB_SPEC : Barrier speculatively executed - DSB
+event:0x07E um:zero minimum:10007 name:DMB_SPEC : Barrier speculatively executed - DMB
+event:0x081 um:zero minimum:10007 name:EXC_UNDEF : Exception taken, other synchronous
+event:0x082 um:zero minimum:10007 name:EXC_SVC : Exception taken, Supervisor Call
+event:0x083 um:zero minimum:10007 name:EXC_PABORT : Exception taken, Instruction Abort
+event:0x084 um:zero minimum:10007 name:EXC_DABORT : Exception taken, Data Abort or SError
+event:0x086 um:zero minimum:10007 name:EXC_IRQ : Exception taken, IRQ
+event:0x087 um:zero minimum:10007 name:EXC_FIQ : Exception taken, FIQ
+event:0x08A um:zero minimum:10007 name:EXC_HVC : Exception taken, Hypervisor Call
+event:0x08B um:zero minimum:10007 name:EXC_TRAP_PABORT : Exception taken, Instruction Abort not taken locally
+event:0x08C um:zero minimum:10007 name:EXC_TRAP_DABORT : Exception taken, Data Abort or SError not taken locally
+event:0x08D um:zero minimum:10007 name:EXC_TRAP_OTHER : Exception taken, other traps not taken locally
+event:0x08E um:zero minimum:10007 name:EXC_TRAP_IRQ : Exception taken, IRQ not taken locally
+event:0x08F um:zero minimum:10007 name:EXC_TRAP_FIQ : Exception taken, FIQ not taken locally
+event:0x090 um:zero minimum:10007 name:RC_LD_SPEC : Release consistency instruction speculatively executed - Load Acquire
+event:0x091 um:zero minimum:10007 name:RC_ST_SPEC : Release consistency instruction speculatively executed - Store Release
+event:0x100 um:zero minimum:10007 name:NOP_SPEC : Operation speculatively executed - NOP
+event:0x101 um:zero minimum:10007 name:FSU_CLOCK_OFF_CYCLES : FSU clocking gated off cycle
+event:0x102 um:zero minimum:10007 name:BTB_MIS_PRED : BTB misprediction
+event:0x103 um:zero minimum:10007 name:ITB_MISS : ITB miss
+event:0x104 um:zero minimum:10007 name:DTB_MISS : DTB miss
+event:0x105 um:zero minimum:10007 name:L1D_CACHE_LATE_MISS : L1 data cache late miss
+event:0x106 um:zero minimum:10007 name:L1D_CACHE_PREFETCH : L1 data cache prefetch request
+event:0x107 um:zero minimum:10007 name:L2D_CACHE_PREFETCH : L2 data prefetch request
+event:0x108 um:zero minimum:10007 name:DECODE_STALL : Decode starved for instruction cycle
+event:0x109 um:zero minimum:10007 name:DISPATCH_STALL : Op dispatch stalled cycle
+event:0x10A um:zero minimum:10007 name:IXA_STALL : IXA Op non-issue
+event:0x10B um:zero minimum:10007 name:IXB_STALL : IXB Op non-issue
+event:0x10C um:zero minimum:10007 name:BX_STALL : BX Op non-issue
+event:0x10D um:zero minimum:10007 name:LX_STALL : LX Op non-issue
+event:0x10E um:zero minimum:10007 name:SX_STALL : SX Op non-issue
+event:0x10F um:zero minimum:10007 name:FX_STALL : FX Op non-issue
+event:0x110 um:zero minimum:10007 name:WAIT_CYCLES : Wait state cycle
+event:0x111 um:zero minimum:10007 name:L1_STAGE2_TLB_REFILL : L1 stage-2 TLB refill
+event:0x112 um:zero minimum:10007 name:PAGE_WALK_L0_STAGE1_HIT : Page Walk Cache level-0 stage-1 hit
+event:0x113 um:zero minimum:10007 name:PAGE_WALK_L1_STAGE1_HIT : Page Walk Cache level-1 stage-1 hit
+event:0x114 um:zero minimum:10007 name:PAGE_WALK_L2_STAGE1_HIT : Page Walk Cache level-2 stage-1 hit
+event:0x115 um:zero minimum:10007 name:PAGE_WALK_L1_STAGE2_HIT : Page Walk Cache level-1 stage-2 hit
+event:0x116 um:zero minimum:10007 name:PAGE_WALK_L2_STAGE2_HIT : Page Walk Cache level-2 stage-2 hit

-- 
2.5.5
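Each entry added by the patch above follows the same colon-delimited events-file layout: an `event:` code, a `um:` unit-mask name, a `minimum:` default sample count, a `name:`, and a free-text description after a bare `:`. As a quick illustration of that layout (this parser is only a sketch for reading the lines in this patch; it is not part of oprofile):

```python
import re

def parse_event_line(line: str) -> dict:
    """Split one oprofile events-file line into its fields.

    Expected layout:
      event:<hex> um:<unit mask> minimum:<count> name:<NAME> : <description>
    """
    m = re.match(
        r"event:(?P<code>0x[0-9A-Fa-f]+)\s+"
        r"um:(?P<um>\S+)\s+"
        r"minimum:(?P<minimum>\d+)\s+"
        r"name:(?P<name>\S+)\s*:\s*(?P<desc>.*)",
        line.strip().lstrip("+"),  # tolerate the diff's leading '+'
    )
    if m is None:
        raise ValueError(f"not an event line: {line!r}")
    fields = m.groupdict()
    fields["minimum"] = int(fields["minimum"])
    return fields

example = "+event:0x070 um:zero minimum:10007 name:LD_SPEC : Operation speculatively executed - Load"
print(parse_event_line(example))
```

The `minimum:` value (10007 throughout this file) is the default sample count the tools use when no explicit count is given for the event.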

Andi Kleen | 6 May 21:11 2016

[PATCH] oprofile: Update Goldmont events

From: Andi Kleen <ak@linux.intel.com>

This patch adds some updates to the Goldmont events. Mainly these are editorial updates
to the event descriptions. In addition it removes the events not listed
in the SDM (which were not intended to be included).

v2: Minor edits
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 events/i386/goldmont/unit_masks | 96 ++++++++++++++++++++---------------------
 1 file changed, 47 insertions(+), 49 deletions(-)

diff --git a/events/i386/goldmont/unit_masks b/events/i386/goldmont/unit_masks
index 2f265b3da3f4..d1c08d4169f6 100644
--- a/events/i386/goldmont/unit_masks
+++ b/events/i386/goldmont/unit_masks
@@ -10,17 +10,17 @@ name:decode_restriction type:mandatory default:0x1
 name:dl1 type:mandatory default:0x1
 	0x1 extra: dirty_eviction Counts when a modified (dirty) cache line is evicted from the data L1 cache and
needs to be written back to memory.  No count will occur if the evicted line is clean, and hence does not
require a writeback.
 name:fetch_stall type:mandatory default:0x2
-	0x2 extra: icache_fill_pending_cycles Counts the number of cycles fetch stalls because of an icache
miss. This is a cummulative count of cycles stalled for all icache misses.
+	0x2 extra: icache_fill_pending_cycles Counts cycles that an ICache miss is outstanding, and
instruction fetch is stalled.  That is, the decoder queue is able to accept bytes, but the fetch unit is
unable to provide bytes, while an Icache miss outstanding.  Note this event is not the same as cycles to
retrieve an instruction due to an Icache miss.  Rather, it is the part of the Instruction Cache (ICache)
miss time where no bytes are available for the decoder.
 name:itlb type:mandatory default:0x4
 	0x4 extra: miss Counts the number of times the machine was unable to find a translation in the Instruction
Translation Lookaside Buffer (ITLB) for a linear address of an instruction fetch.  It counts when new
translation are filled into the ITLB.  The event is speculative in nature, but will not count translations
(page walks) that are begun and not finished, or translations that are finished but not filled into the ITLB.
 name:l2_reject_xq type:mandatory default:0x0
 	0x0 extra: all Counts the number of demand and prefetch transactions that the L2 XQ rejects due to a full or
near full condition which likely indicates back pressure from the intra-die interconnect (IDI) fabric.
The XQ may reject transactions from the L2Q (non-cacheable requests), L2 misses and L2 write-back victims.
 name:ms_decoded type:mandatory default:0x1
-	0x1 extra: ms_entry Counts the number of times the Microcde Sequencer (MS) starts a flow of uops from the
MSROM. It does not count every time a uop is read from the MSROM.  The most common case that this counts is when
a micro-coded instruction is encountered by the front end of the machine.  Other cases include when an
instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops.  The
event will count MS startups for uops that are speculative, and subsequently cleared by branch
mispredict or a machine clear.
+	0x1 extra: ms_entry Counts the number of times the Microcode Sequencer (MS) starts a flow of uops from the
MSROM. It does not count every time a uop is read from the MSROM.  The most common case that this counts is when
a micro-coded instruction is encountered by the front end of the machine.  Other cases include when an
instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops.  The
event will count MS startups for uops that are speculative, and subsequently cleared by branch
mispredict or a machine clear.
 name:uops_issued type:mandatory default:0x0
 	0x0 extra: any Counts uops issued by the front end and allocated into the back end of the machine.  This event
counts uops that retire as well as uops that were speculatively executed but didn't retire. The sort of
speculative uops that might be counted includes, but is not limited to those uops issued in the shadow of a
miss-predicted branch, those uops that are inserted during an assist (such as for a denormal floating
point result), and (previously allocated) uops that might be canceled during a machine clear.
 name:uops_not_delivered type:mandatory default:0x0
-	0x0 extra: any This event used to measure front-end inefficiencies. I.e. when front-end of the machine
is not delivering uops to the back-end and the back-end has is not stalled. This event can be used to
identify if the machine is truly front-end bound.  When this event occurs, it is an indication that the
front-end of the machine is operating at less than its theoretical peak performance.
+	0x0 extra: any This event used to measure front-end inefficiencies. I.e. when front-end of the machine
is not delivering uops to the back-end and the back-end has is not stalled. This event can be used to
identify if the machine is truly front-end bound.  When this event occurs, it is an indication that the
front-end of the machine is operating at less than its theoretical peak performance. Background: We can
think of the processor pipeline as being divided into 2 broader parts: Front-end and Back-end. Front-end
is responsible for fetching the instruction, decoding into uops in machine understandable format and
putting them into a uop queue to be consumed by back end. The back-end then takes these uops, allocates the
required resources.  When all resources are ready, uops are executed. If
the back-end is not ready to accept uops from the front-end, then we do not want to count these as front-end
bottlenecks.  However, whenever we have bottlenecks in the back-end, we will have allocation
unit stalls and eventually forcing the front-end to wait until the back-end is ready to receive more
uops. This event counts only when back-end is requesting more uops and front-end is not able to provide
them. When 3 uops are requested and no uops are delivered, the event counts 3. When 3 are requested, and only
1 is delivered, the event counts 2. When only 2 are delivered, the event counts 1. Alternatively stated,
the event will not count if 3 uops are delivered, or if the back end is stalled and not requesting any uops at
all.  Counts indicate missed opportunities for the front-end to deliver a uop to the back end. Some
examples of conditions that cause front-end efficiencies are: ICache misses, ITLB misses, and decoder
restrictions that limit the front-end bandwidth. Known Issues:
Some uops require multiple allocation slots.  These uops will not be charged as a front end 'not
delivered' opportunity, and will be regarded as a back end problem. For example, the INC instruction
has one uop that requires 2 issue slots.  A stream of INC instructions will not count as
UOPS_NOT_DELIVERED, even though only one instruction can be issued per clock.  The low uop issue rate for a
stream of INC instructions is considered to be a back end issue.
 name:cpu_clk_unhalted type:exclusive default:core
 	0x2 extra: core Counts the number of core cycles while the core is not in a halt state.  The core enters the
halt state when it is running the HLT instruction. In mobile systems the core frequency may change from
time to time. For this reason this event may have a changing ratio with regards to time.  This event uses
fixed counter 1.  You cannot collect a PEBs record for this event.
 	0x1 extra: ref_tsc Counts the number of reference cycles that the core is not in a halt state. The core
enters the halt state when it is running the HLT instruction.  In mobile systems the core frequency may
change from time.  This event is not affected by core frequency changes but counts as if the core is running
at the maximum frequency all the time.  This event uses fixed counter 2.  You cannot collect a PEBs record for
this event
@@ -31,12 +31,14 @@ name:ld_blocks type:exclusive default:all_block
 	0x10 extra:pebs all_block_pebs Counts anytime a load that retires is blocked for any reason.
 	0x8 extra: utlb_miss Counts loads blocked because they are unable to find their physical address in the
micro TLB (UTLB).
 	0x8 extra:pebs utlb_miss_pebs Counts loads blocked because they are unable to find their physical
address in the micro TLB (UTLB).
+	0x2 extra: store_forward Counts a load blocked from using a store forward because of an address/size
mismatch, only one of the loads blocked from each store will be counted.
+	0x2 extra:pebs store_forward_pebs Counts a load blocked from using a store forward because of an
address/size mismatch, only one of the loads blocked from each store will be counted.
 	0x1 extra: data_unknown Counts a load blocked from using a store forward, but did not occur because the
store data was not available at the right time.  The forward might occur subsequently when the data is available.
 	0x1 extra:pebs data_unknown_pebs Counts a load blocked from using a store forward, but did not occur
because the store data was not available at the right time.  The forward might occur subsequently when the
data is available.
 	0x4 extra: u4k_alias Counts loads that block because their address modulo 4K matches a pending store.
 	0x4 extra:pebs u4k_alias_pebs Counts loads that block because their address modulo 4K matches a pending store.
 name:page_walks type:exclusive default:0x1
-	0x1 extra: d_side_cycles Counts every core cycle when a Data-side walks (due to data operation) page
walk is in progress.
+	0x1 extra: d_side_cycles Counts every core cycle when a Data-side (walks due to a data operation) page
walk is in progress.
 	0x2 extra: i_side_cycles Counts every core cycle when a Instruction-side (walks due to an instruction
fetch) page walk is in progress.
 	0x3 extra: cycles Counts every core cycle a page-walk is in progress due to either a data memory operation
or an instruction fetch.
 name:misalign_mem_ref type:exclusive default:load_page_split
@@ -48,35 +50,31 @@ name:longest_lat_cache type:exclusive default:0x4f
 	0x4f extra: reference Counts memory requests originating from the core that reference a cache line in the
L2 cache.
 	0x41 extra: miss Counts memory requests originating from the core that miss in the L2 cache.
 name:icache type:exclusive default:0x1
-	0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
-	0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
-	0x3 extra: accesses Counts each cache line access to the Icache
+	0x1 extra: hit Counts requests to the Instruction Cache (ICache) for one or more bytes in an ICache Line
and that cache line is in the ICache (hit).  The event strives to count on a cache line basis, so that multiple
accesses which hit in a single cache line count as one ICACHE.HIT.  Specifically, the event counts when
straight line code crosses the cache line boundary, or when a branch target is to a new line, and that cache
line is in the ICache. This event counts differently than Intel processors based on Silvermont microarchitecture.
+	0x2 extra: misses Counts requests to the Instruction Cache (ICache)  for one or more bytes in an ICache
Line and that cache line is not in the ICache (miss).  The event strives to count on a cache line basis, so that
multiple accesses which miss in a single cache line count as one ICACHE.MISS.  Specifically, the event
counts when straight line code crosses the cache line boundary, or when a branch target is to a new line, and
that cache line is not in the ICache. This event counts differently than Intel processors based on
Silvermont microarchitecture.
+	0x3 extra: accesses Counts requests to the Instruction Cache (ICache) for one or more bytes in an ICache
Line.  The event strives to count on a cache line basis, so that multiple fetches to a single cache line count
as one ICACHE.ACCESS.  Specifically, the event counts when accesses from straight line code crosses the
cache line boundary, or when a branch target is to a new line.  This event counts differently than Intel
processors based on Silvermont microarchitecture.
 name:inst_retired type:exclusive default:any
-	0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist
of multiple uops, this event counts the retirement of the last uop of the instruction. The counter
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This event uses
fixed counter 0.  You cannot collect a PEBs record for this event
-	0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that
consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance counter.
-	0x0 extra:pebs any_pebs Counts the number of instructions that retire execution. For instructions
that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The
event continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance
counter. *This event is Precise Event capable:  The EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.  Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
+	0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist
of multiple uops, this event counts the retirement of the last uop of the instruction. The counter
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This event uses
fixed counter 0.  You cannot collect a PEBs record for this event.
+	0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that
consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance
counter. *This event is Precise Event capable:  The EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.  Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
+	0x0 extra:pebs any_p_pebs Counts the number of instructions that retire execution. For instructions
that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The
event continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance
counter. *This event is Precise Event capable:  The EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.  Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
 name:uops_retired type:exclusive default:any
 	0x0 extra: any Counts uops which retired
 	0x0 extra:pebs any_pebs Counts uops which retired
 	0x1 extra: ms Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). 
Counts both the uops from a micro-coded instruction, and the uops that might be generated from a
micro-coded assist.
 	0x1 extra:pebs ms_pebs Counts uops retired that are from the complex flows issued by the micro-sequencer
(MS).  Counts both the uops from a micro-coded instruction, and the uops that might be generated from a
micro-coded assist.
-	0x8 extra: fpdiv Counts the number of floating point divide uops retired.
-	0x8 extra:pebs fpdiv_pebs Counts the number of floating point divide uops retired.
-	0x10 extra: idiv Counts the number of integer divide uops retired.
-	0x10 extra:pebs idiv_pebs Counts the number of integer divide uops retired.
 name:machine_clears type:exclusive default:0x0
 	0x0 extra: all Counts machine clears for any reason
 	0x1 extra: smc Counts the number of times that the processor detects that a program is writing to a code
section and has to perform a machine clear because of that modification.  Self-modifying code (SMC)
causes a severe penalty in all Intel architecture processors.
-	0x2 extra: memory_ordering Counts machine clears due to memory ordering issues.  This occurs when a
snoop request happens and the machine is uncertain if memory ordering will be preserved, as another core
is in the process of modifying the data.
+	0x2 extra: memory_ordering Counts machine clears due to memory ordering issues.  This occurs when a
snoop request happens and the machine is uncertain if memory ordering will be preserved - as another core
is in the process of modifying the data.
 	0x4 extra: fp_assist Counts machine clears due to floating point (FP) operations needing assists.  For
instance, if the result was a floating point denormal, the hardware clears the pipeline and reissues uops
to produce the correct IEEE compliant denormal result.
 	0x8 extra: disambiguation Counts machine clears due to memory disambiguation.  Memory disambiguation
happens when a load which has been issued conflicts with a previous unretired store in the pipeline whose
address was not known at issue time, but is later resolved to be the same as the load address.
 name:br_inst_retired type:exclusive default:all_branches
 	0x0 extra: all_branches Counts branch instructions retired for all branch types.  This is an
architectural performance event.
 	0x0 extra:pebs all_branches_pebs Counts branch instructions retired for all branch types.  This is an
architectural performance event.
-	0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
-	0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
-	0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
-	0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+	0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
+	0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
+	0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+	0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
 	0xf9 extra: call Counts near CALL branch instructions retired.
 	0xf9 extra:pebs call_pebs Counts near CALL branch instructions retired.
 	0xfd extra: rel_call Counts near relative CALL branch instructions retired.
@@ -87,24 +85,24 @@ name:br_inst_retired type:exclusive default:all_branches
 	0xf7 extra:pebs return_pebs Counts near return branch instructions retired.
 	0xeb extra: non_return_ind Counts near indirect call or near indirect jmp branch instructions retired.
 	0xeb extra:pebs non_return_ind_pebs Counts near indirect call or near indirect jmp branch
instructions retired.
-	0xbf extra: far_branch Counts far branch instructions retired.  This includes far jump, far call and
return, and Interrupt call and return.  Intel Architecture uses far branches to transition to a different
privilege level (ex: kernel/user).
-	0xbf extra:pebs far_branch_pebs Counts far branch instructions retired.  This includes far jump, far
call and return, and Interrupt call and return.  Intel Architecture uses far branches to transition to a
different privilege level (ex: kernel/user).
+	0xbf extra: far_branch Counts far branch instructions retired.  This includes far jump, far call and
return, and Interrupt call and return.
+	0xbf extra:pebs far_branch_pebs Counts far branch instructions retired.  This includes far jump, far
call and return, and Interrupt call and return.
 name:br_misp_retired type:exclusive default:all_branches
 	0x0 extra: all_branches Counts mispredicted branch instructions retired including all branch types.
 	0x0 extra:pebs all_branches_pebs Counts mispredicted branch instructions retired including all
branch types.
-	0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met)
branch instructions retired, including both when the branch was supposed to be taken and when it was not
supposed to be taken (but the processor predicted the opposite condition).
-	0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon
is Met) branch instructions retired, including both when the branch was supposed to be taken and when it
was not supposed to be taken (but the processor predicted the opposite condition).
+	0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met)
branch instructions retired, including both when the branch was supposed to be taken and when it was not
supposed to be taken (but the processor predicted the opposite condition).
+	0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition
is Met) branch instructions retired, including both when the branch was supposed to be taken and when it
was not supposed to be taken (but the processor predicted the opposite condition).
 	0xfe extra: taken_jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is
Met) branch instructions retired that were supposed to be taken but the processor predicted that it would
not be taken.
 	0xfe extra:pebs taken_jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if
Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted
that it would not be taken.
-	0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the
target address taken was not what the processor  predicted.
-	0xfb extra:pebs ind_call_pebs Counts mispredicted near indirect CALL branch instructions retired,
where the target address taken was not what the processor  predicted.
-	0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address
taken was not what the processor  predicted.
-	0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the
return address taken was not what the processor  predicted.
+	0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the
target address taken was not what the processor predicted.
+	0xfb extra:pebs ind_call_pebs counts mispredicted near indirect CALL branch instructions retired,
where the target address taken was not what the processor predicted.
+	0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address
taken was not what the processor predicted.
+	0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the
return address taken was not what the processor predicted.
 	0xeb extra: non_return_ind Counts mispredicted branch instructions retired that were near indirect
call or near indirect jmp, where the target address taken was not what the processor predicted.
 	0xeb extra:pebs non_return_ind_pebs Counts mispredicted branch instructions retired that were near
indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
 name:issue_slots_not_consumed type:exclusive default:0x0
 	0x0 extra: any Counts the number of issue slots per core cycle that were not consumed by the backend due to
either a full resource  in the backend (RESOURCE_FULL) or due to the processor recovering from some event (RECOVERY)
-	0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because
of a full resource in the backend.  Including but not limited the Re-order Buffer (ROB), reservation
stations (RS), load/store buffers, physical registers, or any other needed machine resource that is
currently unavailable.   Note that uops must be available for consumption in order for this event to fire. 
If a uop is not available (Instruction Queue is empty), this event will not count.
+	0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because
of a full resource in the backend.  Including but not limited to resources such as the Re-order Buffer
(ROB), reservation stations (RS), load/store buffers, physical registers, or any other needed machine
resource that is currently unavailable.   Note that uops must be available for consumption in order for
this event to fire.  If a uop is not available (Instruction Queue is empty), this event will not count.
 	0x2 extra: recovery Counts the number of issue slots per core cycle that were not consumed by the backend
because allocation is stalled waiting for a mispredicted jump to retire or other branch-like conditions
(e.g. the event is relevant during certain microcode flows).   Counts all issue slots blocked while within
this window including slots where uops were not available in the Instruction Queue.
 name:hw_interrupts type:exclusive default:0x1
 	0x1 extra: received Counts hardware interrupts received by the processor.
@@ -117,8 +115,8 @@ name:mem_uops_retired type:exclusive default:all
 	0x83 extra: all Counts the number of memory uops retired that is either a loads or a store or both.
 	0x81 extra: all_loads Counts the number of load uops retired
 	0x81 extra:pebs all_loads_pebs Counts the number of load uops retired
-	0x82 extra: all_stores Counts the number of store uops retired
-	0x82 extra:pebs all_stores_pebs Counts the number of store uops retired
+	0x82 extra: all_stores Counts the number of store uops retired.
+	0x82 extra:pebs all_stores_pebs Counts the number of store uops retired.
 	0x83 extra:pebs all_pebs Counts the number of memory uops retired that is either a loads or a store or both.
 	0x11 extra: dtlb_miss_loads Counts load uops retired that caused a DTLB miss.
 	0x11 extra:pebs dtlb_miss_loads_pebs Counts load uops retired that caused a DTLB miss.
@@ -128,28 +126,28 @@ name:mem_uops_retired type:exclusive default:all
 	0x13 extra:pebs dtlb_miss_pebs Counts uops retired that had a DTLB miss on load, store or either.  Note
that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as
a DTLB miss.
 	0x21 extra: lock_loads Counts locked memory uops retired.  This includes "regular" locks and bus locks.
(To specifically count bus locks only, see the Offcore response event.)  A locked access is one with a lock
prefix, or an exchange to memory.  See the SDM for a complete description of which memory load accesses are locks.
 	0x21 extra:pebs lock_loads_pebs Counts locked memory uops retired.  This includes "regular" locks and
bus locks. (To specifically count bus locks only, see the Offcore response event.)  A locked access is one
with a lock prefix, or an exchange to memory.  See the SDM for a complete description of which memory load
accesses are locks.
-	0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundry.
-	0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte
cache line boundry.
-	0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundry.
-	0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte
cache line boundry.
-	0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
-	0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache
line boundry.
+	0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundary.
+	0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte
cache line boundary.
+	0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundary.
+	0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte
cache line boundary.
+	0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundary.
+	0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache
line boundary.
 name:mem_load_uops_retired type:exclusive default:l1_hit
-	0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache
-	0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache
-	0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache
-	0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache
-	0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache
-	0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache
-	0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache
-	0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache
+	0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache.
+	0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache.
+	0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache.
+	0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache.
+	0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache.
+	0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache.
+	0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache.
+	0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache.
 	0x20 extra: hitm Counts load uops retired where the cache line containing the data was in the modified
state of another core or modules cache (HITM).  More specifically, this means that when the load address
was checked by other caching agents (typically another processor) in the system, one of those caching
agents indicated that they had a dirty copy of the data.  Loads that obtain a HITM response incur greater
latency than most is typical for a load.  In addition, since HITM indicates that some other processor had
this data in its cache, it implies that the data was shared between processors, or potentially was a lock or
semaphore value.  This event is useful for locating sharing, false sharing, and contended locks.
 	0x20 extra:pebs hitm_pebs Counts load uops retired where the cache line containing the data was in the
modified state of another core or modules cache (HITM).  More specifically, this means that when the load
address was checked by other caching agents (typically another processor) in the system, one of those
caching agents indicated that they had a dirty copy of the data.  Loads that obtain a HITM response incur
greater latency than most is typical for a load.  In addition, since HITM indicates that some other
processor had this data in its cache, it implies that the data was shared between processors, or
potentially was a lock or semaphore value.  This event is useful for locating sharing, false sharing, and
contended locks.
-	0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill
buffer), indicating that the load found its data while that data was in the process of being brought into
the L1 cache.  Typically a load will receive this indication when some other load or prefetch missed the L1
cache and was in the process of retrieving the cache line containing the data , but that process had not yet
finished (and written the data back to the cache).  For example, consider load X and Y, both referencing the
same cache line that is not in the L1 cache.  If load X misses cache first, it obtains and WCB (or fill buffer)
and begins the process of requesting the data.  When load Y requests the data, it will either hit the WCB, or
the L1 cache, depending on exactly what time the request to Y occurs.
-	0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB
(or fill buffer), indicating that the load found its data while that data was in the process of being
brought into the L1 cache.  Typically a load will receive this indication when some other load or prefetch
missed the L1 cache and was in the process of retrieving the cache line containing the data , but that
process had not yet finished (and written the data back to the cache).  For example, consider load X and Y,
both referencing the same cache line that is not in the L1 cache.  If load X misses cache first, it obtains and
WCB (or fill buffer) and begins the process of requesting the data.  When load Y requests the data, it will
either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
-	0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM.  Event is
counted at retirment, so the speculative loads are ignored.  A memory load can hit (or miss) the L1 cache,
hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
-	0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. 
Event is counted at retirment, so the speculative loads are ignored.  A memory load can hit (or miss) the L1
cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+	0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill
buffer), indicating that the load found its data while that data was in the process of being brought into
the L1 cache.  Typically a load will receive this indication when some other load or prefetch missed the L1
cache and was in the process of retrieving the cache line containing the data, but that process had not yet
finished (and written the data back to the cache). For example, consider load X and Y, both referencing the
same cache line that is not in the L1 cache.  If load X misses cache first, it obtains a WCB (or fill buffer)
and begins the process of requesting the data.  When load Y requests the data, it will either hit the WCB, or
the L1 cache, depending on exactly what time the request to Y occurs.
+	0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB
(or fill buffer), indicating that the load found its data while that data was in the process of being
brought into the L1 cache.  Typically a load will receive this indication when some other load or prefetch
missed the L1 cache and was in the process of retrieving the cache line containing the data, but that
process had not yet finished (and written the data back to the cache). For example, consider load X and Y,
both referencing the same cache line that is not in the L1 cache.  If load X misses cache first, it obtains a
WCB (or fill buffer) and begins the process of requesting the data.  When load Y requests the data, it will
either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
+	0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM.  Event is
counted at retirement, so the speculative loads are ignored.  A memory load can hit (or miss) the L1 cache,
hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+	0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. 
Event is counted at retirement, so the speculative loads are ignored.  A memory load can hit (or miss) the L1
cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
 name:baclears type:exclusive default:0x1
 	0x1 extra: all Counts the number of times a BACLEAR is signaled for any reason, including, but not limited
to indirect branch/call,  Jcc (Jump on Conditional Code/Jump if Condition is Met) branch, unconditional
branch/call, and returns.
 	0x8 extra: return Counts BACLEARS on return instructions.
-	0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Conditon is Met) branches.
+	0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Condition is Met) branches.
--

-- 
2.5.5

Andi Kleen | 6 May 01:47 2016

[PATCH] oprofile: Update Goldmont events

From: Andi Kleen <ak <at> linux.intel.com>

This patch adds some updates to the Goldmont events. Mainly it is editorial updates
to the event descriptions. In addition it also removes the events not listed
in the SDM (which were not intended to be included)

Signed-off-by: Andi Kleen <ak <at> linux.intel.com>
---
 events/i386/goldmont/unit_masks | 98 ++++++++++++++++++++---------------------
 1 file changed, 48 insertions(+), 50 deletions(-)

diff --git a/events/i386/goldmont/unit_masks b/events/i386/goldmont/unit_masks
index 2f265b3da3f4..873e1b4d9fd7 100644
--- a/events/i386/goldmont/unit_masks
+++ b/events/i386/goldmont/unit_masks
 <at>  <at>  -10,17 +10,17  <at>  <at>  name:decode_restriction type:mandatory default:0x1
 name:dl1 type:mandatory default:0x1
 	0x1 extra: dirty_eviction Counts when a modified (dirty) cache line is evicted from the data L1 cache and
needs to be written back to memory.  No count will occur if the evicted line is clean, and hence does not
require a writeback.
 name:fetch_stall type:mandatory default:0x2
-	0x2 extra: icache_fill_pending_cycles Counts the number of cycles fetch stalls because of an icache
miss. This is a cummulative count of cycles stalled for all icache misses.
+	0x2 extra: icache_fill_pending_cycles Counts cycles that an ICache miss is outstanding, and
instruction fetch is stalled.  That is, the decoder queue is able to accept bytes, but the fetch unit is
unable to provide bytes, while an Icache miss outstanding.  Note this event is not the same as cycles to
retrieve an instruction due to an Icache miss.  Rather, it is the part of the Instruction Cache (ICache)
miss time where no bytes are available for the decoder.
 name:itlb type:mandatory default:0x4
 	0x4 extra: miss Counts the number of times the machine was unable to find a translation in the Instruction
Translation Lookaside Buffer (ITLB) for a linear address of an instruction fetch.  It counts when new
translation are filled into the ITLB.  The event is speculative in nature, but will not count translations
(page walks) that are begun and not finished, or translations that are finished but not filled into the ITLB.
 name:l2_reject_xq type:mandatory default:0x0
 	0x0 extra: all Counts the number of demand and prefetch transactions that the L2 XQ rejects due to a full or
near full condition which likely indicates back pressure from the intra-die interconnect (IDI) fabric.
The XQ may reject transactions from the L2Q (non-cacheable requests), L2 misses and L2 write-back victims.
 name:ms_decoded type:mandatory default:0x1
-	0x1 extra: ms_entry Counts the number of times the Microcde Sequencer (MS) starts a flow of uops from the
MSROM. It does not count every time a uop is read from the MSROM.  The most common case that this counts is when
a micro-coded instruction is encountered by the front end of the machine.  Other cases include when an
instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops.  The
event will count MS startups for uops that are speculative, and subsequently cleared by branch
mispredict or a machine clear.
+	0x1 extra: ms_entry Counts the number of times the Microcode Sequencer (MS) starts a flow of uops from the
MSROM. It does not count every time a uop is read from the MSROM.  The most common case that this counts is when
a micro-coded instruction is encountered by the front end of the machine.  Other cases include when an
instruction encounters a fault, trap, or microcode assist of any sort that initiates a flow of uops.  The
event will count MS startups for uops that are speculative, and subsequently cleared by branch
mispredict or a machine clear.
 name:uops_issued type:mandatory default:0x0
 	0x0 extra: any Counts uops issued by the front end and allocated into the back end of the machine.  This event
counts uops that retire as well as uops that were speculatively executed but didn't retire. The sort of
speculative uops that might be counted includes, but is not limited to those uops issued in the shadow of a
miss-predicted branch, those uops that are inserted during an assist (such as for a denormal floating
point result), and (previously allocated) uops that might be canceled during a machine clear.
 name:uops_not_delivered type:mandatory default:0x0
-	0x0 extra: any This event used to measure front-end inefficiencies. I.e. when front-end of the machine
is not delivering uops to the back-end and the back-end has is not stalled. This event can be used to
identify if the machine is truly front-end bound.  When this event occurs, it is an indication that the
front-end of the machine is operating at less than its theoretical peak performance.
+	0x0 extra: any This event is used to measure front-end inefficiencies, i.e. when the front-end of the machine is not delivering uops to the back-end and the back-end is not stalled. This event can be used to
identify if the machine is truly front-end bound.  When this event occurs, it is an indication that the
front-end of the machine is operating at less than its theoretical peak performance. Background: We can
think of the processor pipeline as being divided into 2 broader parts: Front-end and Back-end. Front-end
is responsible for fetching the instruction, decoding into uops in machine understandable format and
putting them into a uop queue to be consumed by back end. The back-end then takes these uops, allocates the
required resources.  When all resources are ready, uops are executed. If the back-end is not ready to accept uops from the front-end, then we do not want to count these as front-end bottlenecks.  However, whenever we have bottlenecks in the back-end, we will have allocation unit stalls and eventually forcing the front-end to wait until the back-end is ready to receive more
uops. This event counts only when back-end is requesting more uops and front-end is not able to provide
them. When 3 uops are requested and no uops are delivered, the event counts 3. When 3 are requested, and only
1 is delivered, the event counts 2. When only 2 are delivered, the event counts 1. Alternatively stated,
the event will not count if 3 uops are delivered, or if the back end is stalled and not requesting any uops at
all.  Counts indicate missed opportunities for the front-end to deliver a uop to the back end. Some
examples of conditions that cause front-end efficiencies are: ICache misses, ITLB misses, and decoder
restrictions that limit the front-end bandwidth. Known Issues: Some uops require multiple allocation slots.  These uops will not be charged as a front end 'not delivered' opportunity, and will be regarded as a back end problem. For example, the INC instruction has one uop that requires 2 issue slots.  A stream of INC instructions will not count as
UOPS_NOT_DELIVERED, even though only one instruction can be issued per clock.  The low uop issue rate for a
stream of INC instructions is considered to be a back end issue.
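
[Editorial aside, not part of the patch: the slot-counting rule in the uops_not_delivered description above can be sketched as a tiny model. The function name and the fixed 3-slot issue width are assumptions taken from the description's own examples.]

```python
def uops_not_delivered(requested, delivered, backend_stalled):
    """Illustrative model of the UOPS_NOT_DELIVERED counting rule:
    missed issue slots count only while the back end is requesting uops."""
    if backend_stalled:
        return 0  # back end not requesting uops: no front-end blame
    return max(requested - delivered, 0)

# Examples from the description: 3 requested / 0 delivered -> 3;
# 3 requested / 1 delivered -> 2; all 3 delivered -> 0.
assert uops_not_delivered(3, 0, False) == 3
assert uops_not_delivered(3, 1, False) == 2
assert uops_not_delivered(3, 3, False) == 0
assert uops_not_delivered(3, 0, True) == 0  # stalled back end: no count
```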
 name:cpu_clk_unhalted type:exclusive default:core
 	0x2 extra: core Counts the number of core cycles while the core is not in a halt state.  The core enters the
halt state when it is running the HLT instruction. In mobile systems the core frequency may change from
time to time. For this reason this event may have a changing ratio with regards to time.  This event uses
fixed counter 1.  You cannot collect a PEBs record for this event.
 	0x1 extra: ref_tsc Counts the number of reference cycles that the core is not in a halt state. The core
enters the halt state when it is running the HLT instruction.  In mobile systems the core frequency may
change from time.  This event is not affected by core frequency changes but counts as if the core is running
at the maximum frequency all the time.  This event uses fixed counter 2.  You cannot collect a PEBs record for
this event
 <at>  <at>  -31,12 +31,14  <at>  <at>  name:ld_blocks type:exclusive default:all_block
 	0x10 extra:pebs all_block_pebs Counts anytime a load that retires is blocked for any reason.
 	0x8 extra: utlb_miss Counts loads blocked because they are unable to find their physical address in the
micro TLB (UTLB).
 	0x8 extra:pebs utlb_miss_pebs Counts loads blocked because they are unable to find their physical
address in the micro TLB (UTLB).
+	0x2 extra: store_forward Counts a load blocked from using a store forward because of an address/size
mismatch, only one of the loads blocked from each store will be counted.
+	0x2 extra:pebs store_forward_pebs Counts a load blocked from using a store forward because of an
address/size mismatch, only one of the loads blocked from each store will be counted.
 	0x1 extra: data_unknown Counts a load blocked from using a store forward, but did not occur because the
store data was not available at the right time.  The forward might occur subsequently when the data is available.
 	0x1 extra:pebs data_unknown_pebs Counts a load blocked from using a store forward, but did not occur
because the store data was not available at the right time.  The forward might occur subsequently when the
data is available.
 	0x4 extra: u4k_alias Counts loads that block because their address modulo 4K matches a pending store.
 	0x4 extra:pebs u4k_alias_pebs Counts loads that block because their address modulo 4K matches a pending store.
 name:page_walks type:exclusive default:0x1
-	0x1 extra: d_side_cycles Counts every core cycle when a Data-side walks (due to data operation) page
walk is in progress.
+	0x1 extra: d_side_cycles Counts every core cycle when a Data-side (walks due to a data operation) page
walk is in progress.
 	0x2 extra: i_side_cycles Counts every core cycle when an Instruction-side (walks due to an instruction
fetch) page walk is in progress.
 	0x3 extra: cycles Counts every core cycle a page-walk is in progress due to either a data memory operation
or an instruction fetch.
 name:misalign_mem_ref type:exclusive default:load_page_split
 <at>  <at>  -48,35 +50,31  <at>  <at>  name:longest_lat_cache type:exclusive default:0x4f
 	0x4f extra: reference Counts memory requests originating from the core that reference a cache line in the
L2 cache.
 	0x41 extra: miss Counts memory requests originating from the core that miss in the L2 cache.
 name:icache type:exclusive default:0x1
-	0x1 extra: hit Counts each cache line access to the Icache that are fulfilled (hit) by the Icache
-	0x2 extra: misses Counts each cache line access to the Icache that are not fullfilled (miss) by the Icache
-	0x3 extra: accesses Counts each cache line access to the Icache
+	0x1 extra: hit Counts requests to the Instruction Cache (ICache) for one or more bytes in an ICache Line
and that cache line is in the ICache (hit).  The event strives to count on a cache line basis, so that multiple
accesses which hit in a single cache line count as one ICACHE.HIT.  Specifically, the event counts when
straight line code crosses the cache line boundary, or when a branch target is to a new line, and that cache
line is in the ICache. This event counts differently than Intel processors based on Silvermont microarchitecture.
+	0x2 extra: misses Counts requests to the Instruction Cache (ICache)  for one or more bytes in an ICache
Line and that cache line is not in the ICache (miss).  The event strives to count on a cache line basis, so that
multiple accesses which miss in a single cache line count as one ICACHE.MISS.  Specifically, the event
counts when straight line code crosses the cache line boundary, or when a branch target is to a new line, and
that cache line is not in the ICache. This event counts differently than Intel processors based on
Silvermont microarchitecture.
+	0x3 extra: accesses Counts requests to the Instruction Cache (ICache) for one or more bytes in an ICache
Line.  The event strives to count on a cache line basis, so that multiple fetches to a single cache line count
as one ICACHE.ACCESS.  Specifically, the event counts when accesses from straight line code crosses the
cache line boundary, or when a branch target is to a new line.  This event counts differently than Intel
processors based on Silvermont microarchitecture.
 name:inst_retired type:exclusive default:any
-	0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist
of multiple uops, this event counts the retirement of the last uop of the instruction. The counter
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This event uses
fixed counter 0.  You cannot collect a PEBs record for this event
-	0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that
consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance counter.
-	0x0 extra:pebs any_pebs Counts the number of instructions that retire execution. For instructions
that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The
event continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance
counter. *This event is Precise Event capable:  The EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.  Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
+	0x0 extra: any Counts the number of instructions that retire execution. For instructions that consist
of multiple uops, this event counts the retirement of the last uop of the instruction. The counter
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This event uses
fixed counter 0.  You cannot collect a PEBs record for this event.
+	0x0 extra: any_p Counts the number of instructions that retire execution. For instructions that
consist of multiple uops, this event counts the retirement of the last uop of the instruction. The event
continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance
counter. *This event is Precise Event capable:  The EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.  Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
+	0x0 extra:pebs any_p_pebs Counts the number of instructions that retire execution. For instructions
that consist of multiple uops, this event counts the retirement of the last uop of the instruction. The
event continues counting during hardware interrupts, traps, and inside interrupt handlers.  This is an
architectural performance event.  This event uses a (_P)rogrammable general purpose performance
counter. *This event is Precise Event capable:  The EventingRIP field in the PEBS record is precise to the
address of the instruction which caused the event.  Note: Because PEBS records can be collected only on
IA32_PMC0, only one event can use the PEBS facility at a time.
 name:uops_retired type:exclusive default:any
 	0x0 extra: any Counts uops which retired
 	0x0 extra:pebs any_pebs Counts uops which retired
 	0x1 extra: ms Counts uops retired that are from the complex flows issued by the micro-sequencer (MS). 
Counts both the uops from a micro-coded instruction, and the uops that might be generated from a
micro-coded assist.
 	0x1 extra:pebs ms_pebs Counts uops retired that are from the complex flows issued by the micro-sequencer
(MS).  Counts both the uops from a micro-coded instruction, and the uops that might be generated from a
micro-coded assist.
-	0x8 extra: fpdiv Counts the number of floating point divide uops retired.
-	0x8 extra:pebs fpdiv_pebs Counts the number of floating point divide uops retired.
-	0x10 extra: idiv Counts the number of integer divide uops retired.
-	0x10 extra:pebs idiv_pebs Counts the number of integer divide uops retired.
 name:machine_clears type:exclusive default:0x0
 	0x0 extra: all Counts machine clears for any reason
-	0x1 extra: smc Counts the number of times that the processor detects that a program is writing to a code
section and has to perform a machine clear because of that modification.  Self-modifying code (SMC)
causes a severe penalty in all Intel architecture processors.
-	0x2 extra: memory_ordering Counts machine clears due to memory ordering issues.  This occurs when a
snoop request happens and the machine is uncertain if memory ordering will be preserved, as another core
is in the process of modifying the data.
+	0x1 extra: smc Counts the number of times that the processor detects that a program is writing to a code
section and has to perform a machine clear because of that modification.  Self-modifying code (SMC)
causes a severe penalty in all Intel architecture processors.
+	0x2 extra: memory_ordering Counts machine clears due to memory ordering issues.  This occurs when a
snoop request happens and the machine is uncertain if memory ordering will be preserved - as another core
is in the process of modifying the data.
 	0x4 extra: fp_assist Counts machine clears due to floating point (FP) operations needing assists.  For
instance, if the result was a floating point denormal, the hardware clears the pipeline and reissues uops
to produce the correct IEEE compliant denormal result.
 	0x8 extra: disambiguation Counts machine clears due to memory disambiguation.  Memory disambiguation
happens when a load which has been issued conflicts with a previous unretired store in the pipeline whose
address was not known at issue time, but is later resolved to be the same as the load address.
 name:br_inst_retired type:exclusive default:all_branches
 	0x0 extra: all_branches Counts branch instructions retired for all branch types.  This is an
architectural performance event.
 	0x0 extra:pebs all_branches_pebs Counts branch instructions retired for all branch types.  This is an
architectural performance event.
-	0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
-	0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
-	0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
-	0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Conditon is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction were not taken.
+	0x7e extra: jcc Counts retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
+	0x7e extra:pebs jcc_pebs Counts retired Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired, including both when the branch was taken and when it was not taken.
+	0xfe extra: taken_jcc Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction was not taken.
+	0xfe extra:pebs taken_jcc_pebs Counts Jcc (Jump on Conditional Code/Jump if Condition is Met) branch
instructions retired that were taken and does not count when the Jcc branch instruction was not taken.
 	0xf9 extra: call Counts near CALL branch instructions retired.
 	0xf9 extra:pebs call_pebs Counts near CALL branch instructions retired.
 	0xfd extra: rel_call Counts near relative CALL branch instructions retired.
 <at>  <at>  -87,24 +85,24  <at>  <at>  name:br_inst_retired type:exclusive default:all_branches
 	0xf7 extra:pebs return_pebs Counts near return branch instructions retired.
 	0xeb extra: non_return_ind Counts near indirect call or near indirect jmp branch instructions retired.
 	0xeb extra:pebs non_return_ind_pebs Counts near indirect call or near indirect jmp branch
instructions retired.
-	0xbf extra: far_branch Counts far branch instructions retired.  This includes far jump, far call and
return, and Interrupt call and return.  Intel Architecture uses far branches to transition to a different
privilege level (ex: kernel/user).
-	0xbf extra:pebs far_branch_pebs Counts far branch instructions retired.  This includes far jump, far
call and return, and Interrupt call and return.  Intel Architecture uses far branches to transition to a
different privilege level (ex: kernel/user).
+	0xbf extra: far_branch Counts far branch instructions retired.  This includes far jump, far call and
return, and Interrupt call and return.
+	0xbf extra:pebs far_branch_pebs Counts far branch instructions retired.  This includes far jump, far
call and return, and Interrupt call and return.
 name:br_misp_retired type:exclusive default:all_branches
 	0x0 extra: all_branches Counts mispredicted branch instructions retired including all branch types.
 	0x0 extra:pebs all_branches_pebs Counts mispredicted branch instructions retired including all
branch types.
-	0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon is Met)
branch instructions retired, including both when the branch was supposed to be taken and when it was not
supposed to be taken (but the processor predicted the opposite condition).
-	0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Conditon
is Met) branch instructions retired, including both when the branch was supposed to be taken and when it
was not supposed to be taken (but the processor predicted the opposite condition).
+	0x7e extra: jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is Met)
branch instructions retired, including both when the branch was supposed to be taken and when it was not
supposed to be taken (but the processor predicted the opposite condition).
+	0x7e extra:pebs jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition
is Met) branch instructions retired, including both when the branch was supposed to be taken and when it
was not supposed to be taken (but the processor predicted the opposite condition).
 	0xfe extra: taken_jcc Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if Condition is
Met) branch instructions retired that were supposed to be taken but the processor predicted that it would
not be taken.
 	0xfe extra:pebs taken_jcc_pebs Counts mispredicted retired Jcc (Jump on Conditional Code/Jump if
Condition is Met) branch instructions retired that were supposed to be taken but the processor predicted
that it would not be taken.
-	0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the
target address taken was not what the processor  predicted.
-	0xfb extra:pebs ind_call_pebs Counts mispredicted near indirect CALL branch instructions retired,
where the target address taken was not what the processor  predicted.
-	0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address
taken was not what the processor  predicted.
-	0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the
return address taken was not what the processor  predicted.
+	0xfb extra: ind_call Counts mispredicted near indirect CALL branch instructions retired, where the
target address taken was not what the processor predicted.
+	0xfb extra:pebs ind_call_pebs Counts mispredicted near indirect CALL branch instructions retired,
where the target address taken was not what the processor predicted.
+	0xf7 extra: return Counts mispredicted near RET branch instructions retired, where the return address
taken was not what the processor predicted.
+	0xf7 extra:pebs return_pebs Counts mispredicted near RET branch instructions retired, where the
return address taken was not what the processor predicted.
 	0xeb extra: non_return_ind Counts mispredicted branch instructions retired that were near indirect
call or near indirect jmp, where the target address taken was not what the processor predicted.
 	0xeb extra:pebs non_return_ind_pebs Counts mispredicted branch instructions retired that were near
indirect call or near indirect jmp, where the target address taken was not what the processor predicted.
 name:issue_slots_not_consumed type:exclusive default:0x0
 	0x0 extra: any Counts the number of issue slots per core cycle that were not consumed by the backend due to
either a full resource  in the backend (RESOURCE_FULL) or due to the processor recovering from some event (RECOVERY)
-	0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because
of a full resource in the backend.  Including but not limited the Re-order Buffer (ROB), reservation
stations (RS), load/store buffers, physical registers, or any other needed machine resource that is
currently unavailable.   Note that uops must be available for consumption in order for this event to fire. 
If a uop is not available (Instruction Queue is empty), this event will not count.
+	0x1 extra: resource_full Counts the number of issue slots per core cycle that were not consumed because
of a full resource in the backend.  Including but not limited to resources such as the Re-order Buffer
(ROB), reservation stations (RS), load/store buffers, physical registers, or any other needed machine
resource that is currently unavailable.   Note that uops must be available for consumption in order for
this event to fire.  If a uop is not available (Instruction Queue is empty), this event will not count.
 	0x2 extra: recovery Counts the number of issue slots per core cycle that were not consumed by the backend
because allocation is stalled waiting for a mispredicted jump to retire or other branch-like conditions
(e.g. the event is relevant during certain microcode flows).   Counts all issue slots blocked while within
this window including slots where uops were not available in the Instruction Queue.
 name:hw_interrupts type:exclusive default:0x1
 	0x1 extra: received Counts hardware interrupts received by the processor.
 @@ -117,8 +115,8 @@ name:mem_uops_retired type:exclusive default:all
 	0x83 extra: all Counts the number of memory uops retired that is either a loads or a store or both.
 	0x81 extra: all_loads Counts the number of load uops retired
 	0x81 extra:pebs all_loads_pebs Counts the number of load uops retired
-	0x82 extra: all_stores Counts the number of store uops retired
-	0x82 extra:pebs all_stores_pebs Counts the number of store uops retired
+	0x82 extra: all_stores Counts the number of store uops retired.
+	0x82 extra:pebs all_stores_pebs Counts the number of store uops retired.
 	0x83 extra:pebs all_pebs Counts the number of memory uops retired that is either a loads or a store or both.
 	0x11 extra: dtlb_miss_loads Counts load uops retired that caused a DTLB miss.
 	0x11 extra:pebs dtlb_miss_loads_pebs Counts load uops retired that caused a DTLB miss.
 @@ -128,28 +126,28 @@ name:mem_uops_retired type:exclusive default:all
 	0x13 extra:pebs dtlb_miss_pebs Counts uops retired that had a DTLB miss on load, store or either.  Note
that when two distinct memory operations to the same page miss the DTLB, only one of them will be recorded as
a DTLB miss.
 	0x21 extra: lock_loads Counts locked memory uops retired.  This includes "regular" locks and bus locks.
(To specifically count bus locks only, see the Offcore response event.)  A locked access is one with a lock
prefix, or an exchange to memory.  See the SDM for a complete description of which memory load accesses are locks.
 	0x21 extra:pebs lock_loads_pebs Counts locked memory uops retired.  This includes "regular" locks and
bus locks. (To specifically count bus locks only, see the Offcore response event.)  A locked access is one
with a lock prefix, or an exchange to memory.  See the SDM for a complete description of which memory load
accesses are locks.
-	0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundry.
-	0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte
cache line boundry.
-	0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundry.
-	0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte
cache line boundry.
-	0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundry.
-	0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache
line boundry.
+	0x41 extra: split_loads Counts load uops retired where the data requested spans a 64 byte cache line boundary.
+	0x41 extra:pebs split_loads_pebs Counts load uops retired where the data requested spans a 64 byte
cache line boundary.
+	0x42 extra: split_stores Counts store uops retired where the data requested spans a 64 byte cache line boundary.
+	0x42 extra:pebs split_stores_pebs Counts store uops retired where the data requested spans a 64 byte
cache line boundary.
+	0x43 extra: split Counts memory uops retired where the data requested spans a 64 byte cache line boundary.
+	0x43 extra:pebs split_pebs Counts memory uops retired where the data requested spans a 64 byte cache
line boundary.
 name:mem_load_uops_retired type:exclusive default:l1_hit
-	0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache
-	0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache
-	0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache
-	0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache
-	0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache
-	0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache
-	0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache
-	0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache
+	0x1 extra: l1_hit Counts load uops retired that hit the L1 data cache.
+	0x1 extra:pebs l1_hit_pebs Counts load uops retired that hit the L1 data cache.
+	0x8 extra: l1_miss Counts load uops retired that miss the L1 data cache.
+	0x8 extra:pebs l1_miss_pebs Counts load uops retired that miss the L1 data cache.
+	0x2 extra: l2_hit Counts load uops retired that hit in the L2 cache.
+	0x2 extra:pebs l2_hit_pebs Counts load uops retired that hit in the L2 cache.
+	0x10 extra: l2_miss Counts load uops retired that miss in the L2 cache.
+	0x10 extra:pebs l2_miss_pebs Counts load uops retired that miss in the L2 cache.
 	0x20 extra: hitm Counts load uops retired where the cache line containing the data was in the modified
state of another core or modules cache (HITM).  More specifically, this means that when the load address
was checked by other caching agents (typically another processor) in the system, one of those caching
agents indicated that they had a dirty copy of the data.  Loads that obtain a HITM response incur greater
latency than most is typical for a load.  In addition, since HITM indicates that some other processor had
this data in its cache, it implies that the data was shared between processors, or potentially was a lock or
semaphore value.  This event is useful for locating sharing, false sharing, and contended locks.
 	0x20 extra:pebs hitm_pebs Counts load uops retired where the cache line containing the data was in the
modified state of another core or modules cache (HITM).  More specifically, this means that when the load
address was checked by other caching agents (typically another processor) in the system, one of those
caching agents indicated that they had a dirty copy of the data.  Loads that obtain a HITM response incur
greater latency than most is typical for a load.  In addition, since HITM indicates that some other
processor had this data in its cache, it implies that the data was shared between processors, or
potentially was a lock or semaphore value.  This event is useful for locating sharing, false sharing, and
contended locks.
-	0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill
buffer), indicating that the load found its data while that data was in the process of being brought into
the L1 cache.  Typically a load will receive this indication when some other load or prefetch missed the L1
cache and was in the process of retrieving the cache line containing the data , but that process had not yet
finished (and written the data back to the cache).  For example, consider load X and Y, both referencing the
same cache line that is not in the L1 cache.  If load X misses cache first, it obtains and WCB (or fill buffer)
and begins the process of requesting the data.  When load Y requests the data, it will either hit the WCB, or
the L1 cache, depending on exactly what time the request to Y occurs.
-	0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB
(or fill buffer), indicating that the load found its data while that data was in the process of being
brought into the L1 cache.  Typically a load will receive this indication when some other load or prefetch
missed the L1 cache and was in the process of retrieving the cache line containing the data , but that
process had not yet finished (and written the data back to the cache).  For example, consider load X and Y,
both referencing the same cache line that is not in the L1 cache.  If load X misses cache first, it obtains and
WCB (or fill buffer) and begins the process of requesting the data.  When load Y requests the data, it will
either hit the WCB, or the L1 cache, depending on exactly what time the request to Y occurs.
-	0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM.  Event is
counted at retirment, so the speculative loads are ignored.  A memory load can hit (or miss) the L1 cache,
hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
-	0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. 
Event is counted at retirment, so the speculative loads are ignored.  A memory load can hit (or miss) the L1
cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+	0x40 extra: wcb_hit Counts memory load uops retired where the data is retrieved from the WCB (or fill
buffer), indicating that the load found its data while that data was in the process of being brought into
the L1 cache.  Typically a load will receive this indication when some other load or prefetch missed the L1
cache and was in the process of retrieving the cache line containing the data, but that process had not yet
finished (and written the data back to the cache).  For example, consider loads X and Y, both referencing the
same cache line that is not in the L1 cache.  If load X misses the cache first, it obtains a WCB (or fill buffer)
and begins the process of requesting the data.  When load Y requests the data, it will either hit the WCB or
the L1 cache, depending on exactly what time the request to Y occurs.
+	0x40 extra:pebs wcb_hit_pebs Counts memory load uops retired where the data is retrieved from the WCB
(or fill buffer), indicating that the load found its data while that data was in the process of being
brought into the L1 cache.  Typically a load will receive this indication when some other load or prefetch
missed the L1 cache and was in the process of retrieving the cache line containing the data, but that
process had not yet finished (and written the data back to the cache).  For example, consider loads X and Y,
both referencing the same cache line that is not in the L1 cache.  If load X misses the cache first, it obtains
a WCB (or fill buffer) and begins the process of requesting the data.  When load Y requests the data, it will
either hit the WCB or the L1 cache, depending on exactly what time the request to Y occurs.
+	0x80 extra: dram_hit Counts memory load uops retired where the data is retrieved from DRAM.  Event is
counted at retirement, so the speculative loads are ignored.  A memory load can hit (or miss) the L1 cache,
hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
+	0x80 extra:pebs dram_hit_pebs Counts memory load uops retired where the data is retrieved from DRAM. 
Event is counted at retirement, so the speculative loads are ignored.  A memory load can hit (or miss) the L1
cache, hit (or miss) the L2 cache, hit DRAM, hit in the WCB or receive a HITM response.
 name:baclears type:exclusive default:0x1
 	0x1 extra: all Counts the number of times a BACLEAR is signaled for any reason, including, but not limited
to indirect branch/call,  Jcc (Jump on Conditional Code/Jump if Condition is Met) branch, unconditional
branch/call, and returns.
 	0x8 extra: return Counts BACLEARS on return instructions.
-	0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Conditon is Met) branches.
+	0x10 extra: cond Counts BACLEARS on Jcc (Jump on Conditional Code/Jump if Condition is Met) branches.
--

-- 
2.5.5

Tim Chou | 4 May 09:26 2016

Why does operf seem to stop working after several samplings?

Hi All, 

I am using oprofile to try to profile the performance of a database server (MySQL).

I start with this command before I start my test:
sudo nohup operf --pid `pgrep mysqld` &
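One way to see exactly when sampling stalls is to poll opreport while operf is running. This is only a sketch under assumptions: `pgrep -o` picking the main mysqld process, a writable `--session-dir`, and a stock oprofile 1.x `operf`/`opreport` install; the 30 s interval and the awk column layout (samples in column 1 after three header lines) are illustrative, not guaranteed.

```shell
# Hypothetical monitoring wrapper: start operf against mysqld and check
# every 30 s whether the total sample count reported by opreport still grows.
sudo operf --session-dir=./oprofile_data --pid "$(pgrep -o mysqld)" &
OPERF_PID=$!

for i in 1 2 3; do
    sleep 30
    # opreport prints a few header lines, then one line per symbol with the
    # sample count in the first column; summing column 1 gives the total.
    opreport --session-dir=./oprofile_data 2>/dev/null \
        | awk 'NR > 3 { total += $1 } END { print "samples so far:", total+0 }'
done

kill -INT "$OPERF_PID"   # SIGINT tells operf to stop and convert its data
wait "$OPERF_PID"
```

If the "samples so far" number freezes between iterations while CPU load stays high, that would narrow the problem down to sample collection rather than to opreport.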

All the tests will be run for 150s.

When the CPU utilization is low, the sample number seems reasonable.
However, when the CPU utilization is very high, more than 100%, the sample number decreases dramatically.

The total sample numbers are 42417612 and 1226, for 70% CPU utilization and 100% CPU utilization respectively.

I find that when I use opreport to get the counts during the test, the number of counts never changes. I think oprofile stops working after several samplings.

I first tried different reset counts. I find that when this value is small, I can get more samples.
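For reference, the reset count can be set explicitly on the operf command line via the `event:count` spec. This is a sketch only: `CPU_CLK_UNHALTED` and the count 100000 are illustrative values, and the exact event name for a given CPU should be taken from `ophelp` output.

```shell
# Sketch: a smaller count makes the counter overflow (and a sample be
# recorded) more often. Event name and count here are placeholders;
# check `ophelp` for the events this CPU actually supports.
sudo operf --pid "$(pgrep -o mysqld)" \
    --events CPU_CLK_UNHALTED:100000
```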

I also tried perf. It works well: the sample number keeps growing as time goes on.

Any advice?

Thanks,
Tim
-----------------------------------------------------------------------
My machine information:
CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Memory: 64GB
_______________________________________________
oprofile-list mailing list
oprofile-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oprofile-list
