Andi Kleen | 7 Jul 01:48 2015

[PATCH] oprofile: Fixes for Skylake event lists

From: Andi Kleen <ak@linux.intel.com>

This addresses the review feedback on the Skylake event list.

- Fix event codes for INST_RETIRED and CPU_CLK_UNHALTED.
- Fix OFFCORE_REQUESTS_OUTSTANDING events.
- Add br_inst_retired.all_branches_pebs.
- Fill in the correct default event.
---
 events/i386/skylake/events     |  4 ++--
 events/i386/skylake/unit_masks | 25 +++++++++++++------------
 libop/op_events.c              |  5 ++++-
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/events/i386/skylake/events b/events/i386/skylake/events
index 28d6654..9a04a86 100644
--- a/events/i386/skylake/events
+++ b/events/i386/skylake/events
@@ -6,8 +6,6 @@
 # Note the minimum counts are not discovered experimentally and could be likely
 # lowered in many cases without ill effect.
 #
-event:0x00 counters:1 um:inst_retired minimum:2000003 name:inst_retired :
-event:0x00 counters:cpuid um:cpu_clk_unhalted minimum:2000003 name:cpu_clk_unhalted :
 event:0x03 counters:cpuid um:ld_blocks minimum:100003 name:ld_blocks :
 event:0x07 counters:cpuid um:ld_blocks_partial minimum:100003 name:ld_blocks_partial_address_alias :
 event:0x08 counters:cpuid um:dtlb_load_misses minimum:2000003 name:dtlb_load_misses :
@@ -16,6 +14,7 @@ event:0x0e counters:cpuid um:uops_issued minimum:2000003 name:uops_issued :
 event:0x14 counters:cpuid um:arith minimum:2000003 name:arith_divider_active :

Andi Kleen | 1 Jul 23:36 2015

[PATCH] Add support for Intel Skylake events

From: Andi Kleen <ak@linux.intel.com>

Add support for the Intel Skylake microarchitecture to oprofile.

OFFCORE_* and FRONTEND_* events are not supported for now because
oprofile does not support setting up config1.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 events/Makefile.am             |   1 +
 events/i386/skylake/events     |  62 ++++++++
 events/i386/skylake/unit_masks | 314 +++++++++++++++++++++++++++++++++++++++++
 libop/op_cpu_type.c            |   2 +
 libop/op_cpu_type.h            |   1 +
 libop/op_events.c              |   1 +
 libop/op_hw_specific.h         |   3 +
 utils/ophelp.c                 |   1 +
 8 files changed, 385 insertions(+)
 create mode 100644 events/i386/skylake/events
 create mode 100644 events/i386/skylake/unit_masks

diff --git a/events/Makefile.am b/events/Makefile.am
index d68f0e8..56f9020 100644
--- a/events/Makefile.am
+++ b/events/Makefile.am
@@ -18,6 +18,7 @@ event_files = \
 	i386/ivybridge/events i386/ivybridge/unit_masks \
 	i386/haswell/events i386/haswell/unit_masks \
 	i386/broadwell/events i386/broadwell/unit_masks \
+	i386/skylake/events i386/skylake/unit_masks \

William Cohen | 24 Jun 23:19 2015

Re: [oprof-cvs] How build Oprofile on MIPS architecture

On 06/24/2015 08:36 AM, Яков Мамонтов wrote:
> Hi,
> 
> Please tell me, how can I build and install Oprofile on a MIPS processor with Linux 2.6.30? I tried to build
> version 0.9.7, but the build failed because there are no files in the "module" directory for MIPS. There are
> files only for x86 and ia64.
> 
> Best regards,
> Yakov Mamontov

Hi Yakov,

There is support in oprofile for MIPS processors and people have used it
(http://www.linux-mips.org/wiki/Oprofile).  However, I haven't compiled oprofile on a MIPS
processor, so I am not aware of the MIPS specifics.  How are you configuring oprofile for the build? What
is the output of the build, particularly the tail end?  Having that information would help
diagnose what is going wrong.

-Will

------------------------------------------------------------------------------
_______________________________________________
oprofile-list mailing list
oprofile-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oprofile-list

Andi Kleen | 11 Jun 22:54 2015

[PATCH] oprofile: Add Intel Airmont and Intel Xeon D model numbers

From: Andi Kleen <ak@linux.intel.com>

Add a model number for Airmont/Braswell CPUs.
Add a model number for Broadwell based Xeon D CPUs.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 libop/op_hw_specific.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/libop/op_hw_specific.h b/libop/op_hw_specific.h
index 1d39692..8a7ed1c 100644
--- a/libop/op_hw_specific.h
+++ b/libop/op_hw_specific.h
@@ -152,9 +152,11 @@ static inline op_cpu op_cpu_specific_type(op_cpu cpu_type)
 		case 0x3d:
 		case 0x47:
 		case 0x4f:
+		case 0x56:
 			return CPU_BROADWELL;
 		case 0x37:
 		case 0x4d:
+		case 0x4c:
 			return CPU_SILVERMONT;
 		}
 	}
--

-- 
2.4.2
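
For readers wondering where the case values in this hunk come from: on family 6 parts the model number is conventionally assembled from two fields of CPUID leaf 1 EAX, the 4-bit model and the 4-bit extended model. The following is only an illustrative Python sketch of that decoding, not oprofile code. The names `display_model`, `classify` and `MODEL_TO_CPU` and the hand-built EAX value are assumptions made for the example, and the table lists only the cases visible in the hunk above.

```python
# Model numbers visible in the hunk above, mapped to the op_cpu names they return.
MODEL_TO_CPU = {
    0x3D: "CPU_BROADWELL",
    0x47: "CPU_BROADWELL",
    0x4F: "CPU_BROADWELL",
    0x56: "CPU_BROADWELL",   # added by this patch (Xeon D)
    0x37: "CPU_SILVERMONT",
    0x4D: "CPU_SILVERMONT",
    0x4C: "CPU_SILVERMONT",  # added by this patch (Airmont/Braswell)
}


def display_model(eax):
    """Combine the model and extended-model fields of CPUID leaf 1 EAX."""
    family = (eax >> 8) & 0xF
    model = (eax >> 4) & 0xF
    ext_model = (eax >> 16) & 0xF
    # The extended model only contributes for family 0x6 and 0xF.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    return model


def classify(eax):
    """Return the CPU type name for a leaf 1 EAX value, or None if unlisted."""
    return MODEL_TO_CPU.get(display_model(eax))


# Hypothetical EAX built by hand: extended model 5, family 6, model 6, stepping 2.
sample_eax = (0x5 << 16) | (0x6 << 8) | (0x6 << 4) | 0x2
print(hex(display_model(sample_eax)), classify(sample_eax))
```

Whether oprofile derives the model in exactly this way is not shown in the patch; the sketch only illustrates why a two-digit hex value such as 0x56 can identify a CPU.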

------------------------------------------------------------------------------

William Cohen | 11 Jun 18:09 2015

broadwell cycle_activity unit mask

Hi Andi,

When reviewing Michael Petlan's unit mask patch I noticed that one of the Broadwell unit masks didn't
look quite right.  The cycle_activity unit mask seems to have several entries that share the same value
but have different names. Is this intentional?

name:cycle_activity type:exclusive default:0x1
	0x1 extra:cmask=1 cycles_l2_pending Counts number of cycles the CPU has at least one pending  demand*
load request missing the L2 cache.
	0x8 extra:cmask=8 cycles_l1d_pending Counts number of cycles the CPU has at least one pending  demand
load request missing the L1 data cache.
	0x2 extra:cmask=2 cycles_ldm_pending Counts number of cycles the CPU has at least one pending  demand
load request (that is cycles with non-completed load waiting for its data from memory subsystem)
	0x4 extra:cmask=4 cycles_no_execute Counts number of cycles nothing is executed on any execution port.
	0x5 extra:cmask=5 stalls_l2_pending Counts number of cycles nothing is executed on any execution port,
while there was at least one pending demand* load request missing the L2 cache.    (as a footprint) * includes
also L1 HW prefetch requests that may or may not be required by demands
	0x6 extra:cmask=6 stalls_ldm_pending Counts number of cycles nothing is executed on any execution
port, while there was at least one pending demand load request.
	0xc extra:cmask=c stalls_l1d_pending Counts number of cycles nothing is executed on any execution
port, while there was at least one pending demand load request missing the L1 data cache.
	0x8 extra:cmask=8 cycles_l1d_miss Cycles while L1 cache miss demand load is outstanding.
	0x1 extra:cmask=1 cycles_l2_miss Cycles while L2 cache miss demand load is outstanding.
	0x2 extra:cmask=2 cycles_mem_any Cycles while memory subsystem has an outstanding load.
	0x4 extra:cmask=4 stalls_total Total execution stalls.
	0xc extra:cmask=c stalls_l1d_miss Execution stalls while L1 cache miss demand load is outstanding.
	0x5 extra:cmask=5 stalls_l2_miss Execution stalls while L2 cache miss demand load is outstanding.
	0x6 extra:cmask=6 stalls_mem_any Execution stalls while memory subsystem has an outstanding load.

Shouldn't that also be using a named default rather than a numerical value?
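
The duplication described in this message can be checked mechanically. Below is a minimal Python sketch, assuming the unit_masks layout seen in the listing: a `name:` header line followed by tab-indented entries of the form `value extra:... name description`. The function name `find_duplicate_umasks` is hypothetical and not part of oprofile, and the sample keeps only a few entries from the listing with their descriptions trimmed.

```python
from collections import defaultdict

# A few entries from the cycle_activity listing, descriptions trimmed.
SAMPLE = (
    "name:cycle_activity type:exclusive default:0x1\n"
    "\t0x1 extra:cmask=1 cycles_l2_pending\n"
    "\t0x8 extra:cmask=8 cycles_l1d_pending\n"
    "\t0x4 extra:cmask=4 stalls_total Total execution stalls.\n"
    "\t0x8 extra:cmask=8 cycles_l1d_miss\n"
    "\t0x1 extra:cmask=1 cycles_l2_miss\n"
)


def find_duplicate_umasks(text):
    """Return {(value, extra): [names]} for entries sharing value and extra."""
    groups = defaultdict(list)
    for line in text.splitlines():
        # Entries are tab-indented; headers and wrapped descriptions are not.
        if not line.startswith("\t"):
            continue
        fields = line.split()
        if len(fields) < 3 or not fields[1].startswith("extra:"):
            continue
        value = int(fields[0], 16)
        groups[(value, fields[1])].append(fields[2])
    return {key: names for key, names in groups.items() if len(names) > 1}


for (value, extra), names in sorted(find_duplicate_umasks(SAMPLE).items()):
    print(f"{value:#x} {extra}: {', '.join(names)}")
```

Grouping on both the value and the `extra:` field is a design choice for this sketch: two entries with the same value but a different cmask are distinct counters, whereas entries that agree on both are aliases that a numeric default cannot tell apart.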

Janusz Kowal | 10 Jun 10:50 2015

Operf and Xen?

Hello,

Since profiling with Opcontrol does not work on Xen for me (maybe because Opcontrol is deprecated, maybe for some other reason that nobody seems to know...), I would like to try Operf.

Can Operf work with Xen at all? 

My goal is to gather the performance counter data for each virtual machine (domain) passively, i.e. running operf only in the control domain (Dom0) instead of running the profiler inside the guest VMs. Is this possible? If so, how?

I appreciate any help or hints for further reading.

Marek
------------------------------------------------------------------------------
Michael Petlan | 8 Jun 17:35 2015

[PATCH] fixed default unit masks for Broadwells

Since some of the default unit masks for Broadwell events cannot be
uniquely specified by numbers, the defaults have had to be replaced
by the named ones. When the affected events are used on Broadwell
without specifying unit masks after applying this patch, the default
masks are chosen correctly.
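
To illustrate why a numeric default can be ambiguous, here is a small Python sketch. It is not how libop resolves defaults; `resolve_default` is a hypothetical helper, and the two entries are the rs_events unit masks that appear in this patch.

```python
def resolve_default(entries, default):
    """Return the entries selected by a default.

    A default starting with "0x" is treated as a numeric value,
    anything else as a unit mask name.
    """
    if default.startswith("0x"):
        wanted = int(default, 16)
        return [entry for entry in entries if entry[0] == wanted]
    return [entry for entry in entries if entry[2] == default]


# (value, extra, name) for the rs_events block in this patch.
RS_EVENTS = [
    (0x1, "", "empty_cycles"),
    (0x1, "cmask=1,inv,edge", "empty_end"),
]

by_number = resolve_default(RS_EVENTS, "0x1")
by_name = resolve_default(RS_EVENTS, "empty_cycles")
print(len(by_number), len(by_name))
```

The numeric lookup selects both entries because they share the value 0x1, while the named lookup selects exactly one.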

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
--
diff --git a/events/i386/broadwell/unit_masks b/events/i386/broadwell/unit_masks
index 0d6ccd5..4e69363 100644
--- a/events/i386/broadwell/unit_masks
+++ b/events/i386/broadwell/unit_masks
@@ -31,7 +31,7 @@ name:dtlb_load_misses type:exclusive default:0x1
 	0x20 extra: stlb_hit_4k Load misses that miss the  DTLB and hit the STLB (4K)
 	0xe extra: walk_completed Demand load Miss in all translation lookaside buffer (TLB) levels causes a
page walk that completes of any page size.
 	0x60 extra: stlb_hit Load operations that miss the first DTLB level but hit the second and do not cause page walks
-name:uops_issued type:exclusive default:0x1
+name:uops_issued type:exclusive default:any
 	0x1 extra: any This event counts the number of Uops issued by the Resource Allocation Table (RAT) to the
reservation station (RS).
 	0x10 extra: flags_merge Number of flags-merge uops being allocated. Such uops considered perf
sensitive; added by GSR u-arch.
 	0x20 extra: slow_lea Number of slow LEA uops being allocated. A uop is generally considered SlowLea if it
has 3 sources (e.g. 2 sources + immediate) regardless if as a result of LEA instruction or not.
@@ -54,7 +54,7 @@ name:l2_rqsts type:exclusive default:0x21
 	0xe7 extra: all_demand_references Demand requests to L2 cache
 	0x3f extra: miss All requests that miss L2 cache
 	0xff extra: references All L2 requests
-name:l1d_pend_miss type:exclusive default:0x1
+name:l1d_pend_miss type:exclusive default:pending
 	0x1 extra: pending This event counts duration of L1D miss outstanding, that is each cycle number of Fill
Buffers (FB) outstanding required by Demand Reads. FB either is held by demand loads, or it is held by
non-demand loads and gets hit at least once by demand. The valid outstanding interval is defined until the
FB deallocation by one of the following ways: from FB allocation, if FB is allocated by demand; from the
demand Hit FB, if it is allocated by hardware or software prefetch. Note: In the L1D, a Demand Read contains
cacheable or noncacheable demand loads, including ones causing cache-line splits and reads due to page
walks resulted from any request type.
 	0x1 extra:cmask=1 pending_cycles This event counts duration of L1D miss outstanding in cycles.
 name:dtlb_store_misses type:exclusive default:0x1
@@ -77,7 +77,7 @@ name:move_elimination type:exclusive default:0x1
 	0x2 extra: simd_eliminated Number of SIMD Move Elimination candidate uops that were eliminated.
 	0x4 extra: int_not_eliminated Number of integer Move Elimination candidate uops that were not eliminated.
 	0x8 extra: simd_not_eliminated Number of SIMD Move Elimination candidate uops that were not eliminated.
-name:cpl_cycles type:exclusive default:0x1
+name:cpl_cycles type:exclusive default:ring0
 	0x1 extra: ring0 This event counts the unhalted core cycles during which the thread is in the ring 0
privileged mode.
 	0x2 extra: ring123 This event counts unhalted core cycles during which the thread is in rings 1, 2, or 3.
 	0x1 extra:cmask=1,edge ring0_trans This event counts when there is a transition from ring 1,2 or 3 to ring0.
@@ -87,10 +87,10 @@ name:tx_exec type:exclusive default:0x1
 	0x4 extra: misc3 Unfriendly TSX abort triggered by a nest count that is too deep
 	0x8 extra: misc4 RTM region detected inside HLE
 	0x10 extra: misc5 # HLE inside HLE+
-name:rs_events type:exclusive default:0x1
+name:rs_events type:exclusive default:empty_cycles
 	0x1 extra: empty_cycles This event counts cycles during which the reservation station (RS) is empty for
the thread. Note: In ST-mode, not active thread should drive 0. This is usually caused by severely costly
branch mispredictions, or allocator/FE issues.
 	0x1 extra:cmask=1,inv,edge empty_end Counts end of periods where the Reservation Station (RS) was
empty. Could be useful to precisely locate Frontend Latency Bound issues.
-name:offcore_requests_outstanding type:exclusive default:0x1
+name:offcore_requests_outstanding type:exclusive default:demand_data_rd
 	0x1 extra: demand_data_rd This event counts the number of offcore outstanding Demand Data Read
transactions in the super queue (SQ) every cycle. A transaction is considered to be in the Offcore
outstanding state between L2 miss and transaction completion sent to requestor. See the corresponding
Umask under OFFCORE_REQUESTS. Note: A prefetch promoted to Demand is counted from the promotion point.
 	0x2 extra: demand_code_rd This event counts the number of offcore outstanding Code Reads transactions
in the super queue every cycle. The "Offcore outstanding" state of the transaction lasts from the L2 miss
until the sending transaction completion to requestor (SQ deallocation). See the corresponding Umask
under OFFCORE_REQUESTS.
 	0x4 extra: demand_rfo This event counts the number of offcore outstanding RFO (store) transactions in
the super queue (SQ) every cycle. A transaction is considered to be in the Offcore outstanding state
between L2 miss and transaction completion sent to requestor (SQ de-allocation). See corresponding
Umask under OFFCORE_REQUESTS.
@@ -147,14 +147,14 @@ name:br_misp_exec type:exclusive default:0xff
 	0xc1 extra: all_conditional This event counts both taken and not taken speculative and retired
mispredicted macro conditional branch instructions.
 	0xc4 extra: all_indirect_jump_non_call_ret This event counts both taken and not taken mispredicted
indirect branches excluding calls and returns.
 	0xa0 extra: taken_indirect_near_call Taken speculative and retired mispredicted indirect calls
-name:idq_uops_not_delivered type:exclusive default:0x1
+name:idq_uops_not_delivered type:exclusive default:core
 	0x1 extra: core This event counts the number of uops not delivered to Resource Allocation Table (RAT) per
thread adding ?4 ? x? when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue
(IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does
not cover cases when:  a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread;  b. Resource
Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions);   c.
Instruction Decode Queue (IDQ) delivers four uops.
 	0x1 extra:cmask=4 cycles_0_uops_deliv_core This event counts, on the per-thread basis, cycles when no
uops are delivered to Resource Allocation Table (RAT). IDQ_Uops_Not_Delivered.core =4.
 	0x1 extra:cmask=3 cycles_le_1_uop_deliv_core This event counts, on the per-thread basis, cycles when
less than 1 uop is  delivered to Resource Allocation Table (RAT). IDQ_Uops_Not_Delivered.core >=3.
 	0x1 extra:cmask=2 cycles_le_2_uop_deliv_core Cycles with less than 2 uops delivered by the front end
 	0x1 extra:cmask=1 cycles_le_3_uop_deliv_core Cycles with less than 3 uops delivered by the front end
 	0x1 extra:cmask=1,inv cycles_fe_was_ok Counts cycles FE delivered 4 uops or Resource Allocation Table
(RAT) was stalling FE.
-name:uops_executed_port type:exclusive default:0x1
+name:uops_executed_port type:exclusive default:port_0
 	0x1 extra:any port_0_core Cycles per core when uops are exectuted in port 0
 	0x2 extra:any port_1_core Cycles per core when uops are exectuted in port 1
 	0x4 extra:any port_2_core Cycles per core when uops are dispatched to port 2
@@ -200,7 +200,7 @@ name:cycle_activity type:exclusive default:0x1
 	0xc extra:cmask=c stalls_l1d_miss Execution stalls while L1 cache miss demand load is outstanding.
 	0x5 extra:cmask=5 stalls_l2_miss Execution stalls while L2 cache miss demand load is outstanding.
 	0x6 extra:cmask=6 stalls_mem_any Execution stalls while memory subsystem has an outstanding load.
-name:lsd type:exclusive default:0x1
+name:lsd type:exclusive default:uops
 	0x1 extra: uops Number of Uops delivered by the LSD. Read more on LSD under LSD_REPLAY.REPLAY
 	0x1 extra:cmask=4 cycles_4_uops Cycles 4 Uops delivered by the LSD, but didn't come from the decoder
 	0x1 extra:cmask=1 cycles_active Cycles Uops delivered by the LSD, but didn't come from the decoder
@@ -209,7 +209,7 @@ name:offcore_requests type:exclusive default:0x1
 	0x2 extra: demand_code_rd This event counts both cacheable and noncachaeble code read requests.
 	0x4 extra: demand_rfo This event counts the demand RFO (read for ownership) requests including regular
RFOs, locks, ItoM.
 	0x8 extra: all_data_rd This event counts the demand and prefetch data reads. All Core Data Reads include
cacheable "Demands" and L2 prefetchers (not L3 prefetchers). Counting also covers reads due to page
walks resulted from any request type.
-name:uops_executed type:exclusive default:0x1
+name:uops_executed type:exclusive default:thread
 	0x1 extra: thread Number of uops to be executed per-thread each cycle.
 	0x2 extra: core Number of uops executed from any thread
 	0x1 extra:cmask=1,inv stall_cycles This event counts cycles during which no uops were dispatched from
the Reservation Station (RS) per thread.
@@ -232,20 +232,20 @@ name:other_assists type:exclusive default:0x8
 	0x8 extra: avx_to_sse This is a non-precise version (that is, does not use PEBS) of the event that counts
the number of transitions from AVX-256 to legacy SSE when penalty is applicable.
 	0x10 extra: sse_to_avx This is a non-precise version (that is, does not use PEBS) of the event that counts
the number of transitions from legacy SSE to AVX-256 when penalty is applicable.
 	0x40 extra: any_wb_assist Number of times any microcode assist is invoked by HW upon uop writeback.
-name:uops_retired type:exclusive default:0x1
+name:uops_retired type:exclusive default:all
 	0x1 extra: all This is a non-precise version (that is, does not use PEBS) of the event that counts all
actually retired uops. Counting increments by two for micro-fused uops, and by one for macro-fused and
other uops. Maximal increment value for one cycle is eight.
 	0x1 extra: all_pebs Counts all actually retired uops. Counting increments by two for micro-fused uops,
and by one for macro-fused and other uops. Maximal increment value for one cycle is eight.
 	0x2 extra: retire_slots This is a non-precise version (that is, does not use PEBS) of the event that counts
the number of retirement slots used.
 	0x2 extra: retire_slots_pebs Counts the number of retirement slots used.
 	0x1 extra:cmask=1,inv stall_cycles This is a non-precise version (that is, does not use PEBS) of the
event that counts cycles without actually retired uops.
 	0x1 extra:cmask=a,inv total_cycles Number of cycles using always true condition (uops_ret < 16)
applied to non PEBS uops retired event.
-name:machine_clears type:exclusive default:0x1
+name:machine_clears type:exclusive default:cycles
 	0x1 extra: cycles This event counts both thread-specific (TS) and all-thread (AT) nukes.
 	0x2 extra: memory_ordering This event counts the number of memory ordering Machine Clears detected.
Memory Ordering Machine Clears can result from one of the following: 1. memory disambiguation, 2.
external snoop, or 3. cross SMT-HW-thread snoop (stores) hitting load buffer.
 	0x4 extra: smc This event counts self-modifying code (SMC) detected, which causes a machine clear.
 	0x20 extra: maskmov Maskmov false fault - counts number of time ucode passes through Maskmov flow due to
instruction's mask being 0 while the flow was completed without raising a fault.
 	0x1 extra:cmask=1,edge count Number of machine clears (nukes) of any type.
-name:br_inst_retired type:exclusive default:0x1
+name:br_inst_retired type:exclusive default:conditional
 	0x1 extra: conditional This is a non-precise version (that is, does not use PEBS) of the event that counts
conditional branch instructions retired.
 	0x1 extra: conditional_pebs Counts conditional branch instructions retired.
 	0x2 extra: near_call This is a non-precise version (that is, does not use PEBS) of the event that counts
both direct and indirect near call instructions retired.
@@ -257,7 +257,7 @@ name:br_inst_retired type:exclusive default:0x1
 	0x20 extra: near_taken_pebs Counts taken branch instructions retired.
 	0x40 extra: far_branch This is a non-precise version (that is, does not use PEBS) of the event that counts
far branch instructions retired.
 	0x4 extra:pebs all_branches_pebs This is a precise version of BR_INST_RETIRED.ALL_BRANCHES that
counts all (macro) branch instructions retired.
-name:br_misp_retired type:exclusive default:0x1
+name:br_misp_retired type:exclusive default:conditional
 	0x1 extra: conditional This is a non-precise version (that is, does not use PEBS) of the event that counts
mispredicted conditional branch instructions retired.
 	0x1 extra: conditional_pebs Counts mispredicted conditional branch instructions retired.
 	0x4 extra:pebs all_branches_pebs This is a precise version of BR_MISP_RETIRED.ALL_BRANCHES that
counts all mispredicted macro branch instructions retired.
@@ -289,7 +289,7 @@ name:fp_assist type:exclusive default:0x1e
 	0x4 extra: x87_input This is a non-precise version (that is, does not use PEBS) of the event that counts x87
floating point (FP) micro-code assist (invalid operation, denormal operand, SNaN operand) when the
input value (one of the source operands to an FP instruction) is invalid.
 	0x8 extra: simd_output This is a non-precise version (that is, does not use PEBS) of the event that counts
the number of SSE* floating point (FP) micro-code assist (numeric overflow/underflow) when the output
value (destination register) is invalid. Counting covers only cases involving penalties that require
micro-code assist intervention.
 	0x10 extra: simd_input This is a non-precise version (that is, does not use PEBS) of the event that counts
any input SSE* FP assist - invalid operation, denormal operand, dividing by zero, SNaN operand. Counting
includes only cases involving penalties that required micro-code assist intervention.
-name:mem_uops_retired type:exclusive default:0x11
+name:mem_uops_retired type:exclusive default:stlb_miss_loads
 	0x11 extra: stlb_miss_loads This is a non-precise version (that is, does not use PEBS) of the event that
counts load uops with true STLB miss retired to the architected path. True STLB miss is an uop triggering
page walk that gets completed without blocks, and later gets retired. This page walk can end up with or
without a fault.
 	0x11 extra: stlb_miss_loads_pebs Counts load uops with true STLB miss retired to the architected path.
True STLB miss is an uop triggering page walk that gets completed without blocks, and later gets retired.
This page walk can end up with or without a fault.
 	0x12 extra: stlb_miss_stores This is a non-precise version (that is, does not use PEBS) of the event that
counts store uops with true STLB miss retired to the architected path. True STLB miss is an uop triggering
page walk that gets completed without blocks, and later gets retired. This page walk can end up with or
without a fault.
@@ -304,7 +304,7 @@ name:mem_uops_retired type:exclusive default:0x11
 	0x81 extra: all_loads_pebs Counts load uops retired to the architected path with a filter on bits 0 and 1
applied. Note: This event ?ounts AVX-256bit load/store double-pump memory uops as a single uop at
retirement. This event also counts SW prefetches.
 	0x82 extra: all_stores This is a non-precise version (that is, does not use PEBS) of the event that counts
store uops retired to the architected path with a filter on bits 0 and 1 applied. Note: This event ?ounts
AVX-256bit load/store double-pump memory uops as a single uop at retirement.
 	0x82 extra: all_stores_pebs Counts store uops retired to the architected path with a filter on bits 0 and 1
applied. Note: This event ?ounts AVX-256bit load/store double-pump memory uops as a single uop at retirement.
-name:mem_load_uops_retired type:exclusive default:0x1
+name:mem_load_uops_retired type:exclusive default:l1_hit
 	0x1 extra: l1_hit This is a non-precise version (that is, does not use PEBS) of the event that counts
retired load uops which data sources were hits in the nearest-level (L1) cache. Note: Only two
data-sources of L1/FB are applicable for AVX-256bit  even though the corresponding AVX load could be
serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load. This
event also counts SW prefetches independent of the actual data source
 	0x1 extra: l1_hit_pebs Counts retired load uops which data sources were hits in the nearest-level (L1)
cache. Note: Only two data-sources of L1/FB are applicable for AVX-256bit  even though the corresponding
AVX load could be serviced by a deeper level in the memory hierarchy. Data source is reported for the
Low-half load. This event also counts SW prefetches independent of the actual data source
 	0x2 extra: l2_hit This is a non-precise version (that is, does not use PEBS) of the event that counts
retired load uops which data sources were hits in the mid-level (L2) cache.
@@ -319,7 +319,7 @@ name:mem_load_uops_retired type:exclusive default:0x1
 	0x20 extra: l3_miss_pebs Miss in last-level (L3) cache. Excludes Unknown data-source.
 	0x40 extra: hit_lfb This is a non-precise version (that is, does not use PEBS) of the event that counts
retired load uops which data sources were load uops missed L1 but hit a fill buffer due to a preceding miss to
the same cache line with the data not ready. Note: Only two data-sources of L1/FB are applicable for
AVX-256bit  even though the corresponding AVX load could be serviced by a deeper level in the memory
hierarchy. Data source is reported for the Low-half load.
 	0x40 extra: hit_lfb_pebs Counts retired load uops which data sources were load uops missed L1 but hit a
fill buffer due to a preceding miss to the same cache line with the data not ready. Note: Only two
data-sources of L1/FB are applicable for AVX-256bit  even though the corresponding AVX load could be
serviced by a deeper level in the memory hierarchy. Data source is reported for the Low-half load.
-name:mem_load_uops_l3_hit_retired type:exclusive default:0x1
+name:mem_load_uops_l3_hit_retired type:exclusive default:xsnp_miss
 	0x1 extra: xsnp_miss This is a non-precise version (that is, does not use PEBS) of the event that counts
retired load uops which data sources were L3 Hit and a cross-core snoop missed in the on-pkg core cache.
 	0x1 extra: xsnp_miss_pebs Counts retired load uops which data sources were L3 Hit and a cross-core snoop
missed in the on-pkg core cache.
 	0x2 extra: xsnp_hit This is a non-precise version (that is, does not use PEBS) of the event that counts
retired load uops which data sources were L3 hit and a cross-core snoop hit in the on-pkg core cache.
@@ -328,7 +328,7 @@ name:mem_load_uops_l3_hit_retired type:exclusive default:0x1
 	0x4 extra: xsnp_hitm_pebs Counts retired load uops which data sources were HitM responses from a core on
same socket (shared L3).
 	0x8 extra: xsnp_none This is a non-precise version (that is, does not use PEBS) of the event that counts
retired load uops which data sources were hits in the last-level (L3) cache without snoops required.
 	0x8 extra: xsnp_none_pebs Counts retired load uops which data sources were hits in the last-level (L3)
cache without snoops required.
-name:mem_load_uops_l3_miss_retired type:exclusive default:0x1
+name:mem_load_uops_l3_miss_retired type:exclusive default:local_dram
 	0x1 extra: local_dram Retired load uop whose Data Source was: local DRAM either Snoop not needed or Snoop
Miss (RspI)
 	0x1 extra: local_dram_pebs Retired load uop whose Data Source was: local DRAM either Snoop not needed or
Snoop Miss (RspI)
 name:l2_trans type:exclusive default:0x80

------------------------------------------------------------------------------
Mahmood N | 5 Jun 20:11 2015

Modifying MSR has no effect on the stats

Hi,
According to the BKDG for AMD family 15h, the following MSR controls the hardware prefetcher at L1.

MSRC001_1022 Data Cache Configuration (DC_CFG)
Bits     Description
63:16    Reserved.
15       DisPfHwForSw. Read-write. Reset: 0. 1=Disable hardware prefetches for software prefetches.
14       Reserved.
13       DisHwPf. Read-write. Reset: 0. 1=Disable the DC hardware prefetcher. BIOS: See 2.3.3 [Using L2 Cache as General Storage During Boot].
12:10    Reserved.
9:5      Reserved.
4        DisSpecTlbRld. Read-write. Reset: 0. 1=Disable speculative TLB reloads. BIOS: See 2.3.3 [Using L2 Cache as General Storage During Boot].
3:0      Reserved.


The problem is that when I disable it (setting bit 13 to 1), the statistics for the prefetcher remain non-zero!
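
As a sanity check on the value involved, the bit arithmetic can be sketched in Python. The bit positions are copied from the DC_CFG description quoted in this message; the helper names `set_bit` and `is_set` are made up for the illustration. This only confirms which value corresponds to bit 13. It says nothing about why the prefetcher counters stay non-zero.

```python
# Bit positions copied from the DC_CFG description quoted in this message.
DIS_PF_HW_FOR_SW = 15
DIS_HW_PF = 13
DIS_SPEC_TLB_RLD = 4


def set_bit(value, bit):
    """Return value with the given bit set."""
    return value | (1 << bit)


def is_set(value, bit):
    """Report whether the given bit is set in value."""
    return bool((value >> bit) & 1)


before = 0x0000000000000000          # register contents before the write
after = set_bit(before, DIS_HW_PF)   # value that sets only DisHwPf

print(f"{after:016x}")
```

Formatting with `016x` mirrors the 16-digit hexadecimal layout that `rdmsr -x -0` prints, which makes the result easy to compare against a read-back.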





[root@tiger exe]# rdmsr -a -x -0 0xc0011022
0000000000000000    <32 LINES FOR 32 CORES>
[root@tiger exe]# wrmsr -a 0xc0011022 0x2000
[root@tiger exe]# rdmsr -a -x -0 0xc0011022
0000000000002000   <32 LINES FOR 32 CORES>
[root@tiger exe]# ocount -e CPU_CLK_UNHALTED,RETIRED_INSTRUCTIONS,DATA_CACHE_ACCESSES,DATA_CACHE_MISSES,DATA_PREFETCHER,PREFETCH_INSTRUCTIONS_DISPATCHED,REQUESTS_TO_L2,L2_CACHE_MISS,L2_PREFETCHER_TRIGGER ./bzip2_base.amd64-m64-gcc44-nn
Events were actively counted for 36.0 seconds.
Event counts (scaled) for /home/mahmood/spec-cpu2006-x86_64/benchspec/CPU2006/401.bzip2/exe/bzip2_base.amd64-m64-gcc44-nn:
        Event                                    Count                    % time counted
        CPU_CLK_UNHALTED                         107,340,837,082          55.55
        DATA_CACHE_ACCESSES                      55,973,655,378           66.66
        DATA_CACHE_MISSES                        1,720,001,433            66.65
        DATA_PREFETCHER                          1,081,055,468            66.66
        L2_CACHE_MISS                            265,881,095              66.67
        L2_PREFETCHER_TRIGGER                    261,646,431              55.57
        PREFETCH_INSTRUCTIONS_DISPATCHED         397,803                  66.68
        REQUESTS_TO_L2                           2,796,342,677            44.45
        RETIRED_INSTRUCTIONS                     108,589,570,814          55.57





 
Isn't that strange?


Regards,
Mahmood
------------------------------------------------------------------------------
Mahmood N | 5 Jun 08:11 2015

Accuracy of profiling

Hi,
I have a concern about the accuracy of profiling. I profiled two executions back to back. The second execution took less time than the first one. That is expected in the real world because of caching. However, I did not expect to see such an effect in the profiling counts.

Events were actively counted for 37.9 seconds.
Event counts (actual) for /home/mahmood/spec-cpu2006-x86_64/benchspec/CPU2006/401.bzip2/exe/bzip2_base.amd64-m64-gcc44-nn:
        Event                        Count                    % time counted
        CPU_CLK_UNHALTED             109,637,705,266          100.00
        RETIRED_INSTRUCTIONS         108,789,176,110          100.00
Events were actively counted for 35.9 seconds.
Event counts (actual) for /home/mahmood/spec-cpu2006-x86_64/benchspec/CPU2006/401.bzip2/exe/bzip2_base.amd64-m64-gcc44-nn:
        Event                        Count                    % time counted
        CPU_CLK_UNHALTED             107,924,858,901          100.00
        RETIRED_INSTRUCTIONS         108,708,808,686          100.00



Another thing is that I expected the same number of RETIRED_INSTRUCTIONS in both runs, since bzip2 executes a fixed instruction stream for a given input (the input file is the same in both runs), no matter how many times it is run.
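
To put a number on how far apart the two runs are, the counts can be compared directly. This is a small Python sketch using only the figures from the two reports in this message; `relative_difference` and `ipc` are helper names invented for the example.

```python
# Counts copied from the two ocount reports in this message.
RUNS = [
    {"CPU_CLK_UNHALTED": 109_637_705_266, "RETIRED_INSTRUCTIONS": 108_789_176_110},
    {"CPU_CLK_UNHALTED": 107_924_858_901, "RETIRED_INSTRUCTIONS": 108_708_808_686},
]


def relative_difference(first, second):
    """Difference between two counts as a fraction of the first."""
    return abs(first - second) / first


def ipc(run):
    """Instructions retired per unhalted cycle for one run."""
    return run["RETIRED_INSTRUCTIONS"] / run["CPU_CLK_UNHALTED"]


for event in ("CPU_CLK_UNHALTED", "RETIRED_INSTRUCTIONS"):
    diff = relative_difference(RUNS[0][event], RUNS[1][event])
    print(f"{event}: {diff:.4%}")

for number, run in enumerate(RUNS, start=1):
    print(f"run {number} IPC: {ipc(run):.4f}")
```

On these figures the instruction counts differ by less than 0.1 percent, while the cycle counts differ by roughly 1.6 percent.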
 
Regards,
Mahmood
------------------------------------------------------------------------------
Mahmood N | 4 Jun 15:01 2015

restart opcontrol

Hi,
I am looking for a restart command that basically performs a stop followed by a start.
Is it sufficient to use --reset alone, or is it better to run --stop and --start sequentially?
 
Regards,
Mahmood
------------------------------------------------------------------------------
Marek Walasek | 4 Jun 12:40 2015

Counters inaccessible using Xen

Hello,

When I attempt to profile Xen domains using Oprofile the following issue arises:

1. When I initially call ophelp, it returns the list of performance counters, as expected.
2. As soon as I call opcontrol --init, a subsequent call to ophelp returns "Using timer interrupt", which I cannot work further with.
As a matter of fact, any call to opcontrol using any other option (well, except the --help option...) causes this issue. Only passing --deinit recovers the previous state, i.e. ophelp returns the correct counter list.

Note that I am working on an Intel i7 processor, using Ubuntu 14.04 and Xen 4.4. I have installed Oprofile using apt-get. I was forced to compile Xen from source code, as I could not obtain the uncompressed Xen image otherwise. I already tried passing over the link to the uncompressed image with --xen=/boot/xen-syms-4.4.3-pre, with no success. Please also note that Opcontrol does not complain about the image being invalid.

I do hope this is a standard beginner's issue and would greatly appreciate it if someone could help me resolve it. I will provide any additional information about my system if necessary.

Thank you!

Marek
------------------------------------------------------------------------------
