
Intel Architecture MMX(TM) Technology


Chapter 6
MMX(TM) PERFORMANCE MONITORING EXTENSIONS

The most effective way to improve the performance of your code is to find its performance bottlenecks. Intel Architecture processors include on-chip counters that let you gather information about the performance of your application. These counters keep track of events that occur while your code is executing. You can read them during execution to determine whether your code has stalls, either by using Intel's VTune profiling tool or by using instructions within your own code.

This chapter describes the performance monitoring features for MMX code on Pentium and P6-family processors with MMX technology.

The RDPMC instruction is described in Section 6.3.

6.1 Superscalar (Pentium® Family) Performance Monitoring Events

All Pentium processors feature performance counters, and several new events have been added to support MMX technology. Each new event is assigned to one of the two event counters (CTR0, CTR1). The exceptions are "twin events" (such as "D1 starvation" and "FIFO is empty"), which are assigned to different counters so that they can be measured concurrently. An event must be measured on the counter to which it is assigned. Table 6-1 lists the performance monitoring events. New events are listed in bold.

Table 6-1. Performance Monitoring Events

Serial | Encoding | Counter 0 | Counter 1 | Performance Monitoring Event | Occurrence or Duration
0 | 000000 | Yes | Yes | Data Read | OCCURRENCE
1 | 000001 | Yes | Yes | Data Write | OCCURRENCE
2 | 000010 | Yes | Yes | Data TLB Miss | OCCURRENCE
3 | 000011 | Yes | Yes | Data Read Miss | OCCURRENCE
4 | 000100 | Yes | Yes | Data Write Miss | OCCURRENCE
5 | 000101 | Yes | Yes | Write (hit) to M or E state lines | OCCURRENCE
6 | 000110 | Yes | Yes | Data Cache Lines Written Back | OCCURRENCE
7 | 000111 | Yes | Yes | External Data Cache Snoops | OCCURRENCE
8 | 001000 | Yes | Yes | External Data Cache Snoop Hits | OCCURRENCE
9 | 001001 | Yes | Yes | Memory Accesses in Both Pipes | OCCURRENCE
10 | 001010 | Yes | Yes | Bank Conflicts | OCCURRENCE
11 | 001011 | Yes | Yes | Misaligned Data Memory or I/O References | OCCURRENCE
12 | 001100 | Yes | Yes | Code Read | OCCURRENCE
13 | 001101 | Yes | Yes | Code TLB Miss | OCCURRENCE
14 | 001110 | Yes | Yes | Code Cache Miss | OCCURRENCE
15 | 001111 | Yes | Yes | Any Segment Register Loaded | OCCURRENCE
16 | 010000 | Yes | Yes | Reserved |
17 | 010001 | Yes | Yes | Reserved |
18 | 010010 | Yes | Yes | Branches | OCCURRENCE
19 | 010011 | Yes | Yes | BTB Predictions | OCCURRENCE
20 | 010100 | Yes | Yes | Taken Branch or BTB hit | OCCURRENCE
21 | 010101 | Yes | Yes | Pipeline Flushes | OCCURRENCE
22 | 010110 | Yes | Yes | Instructions Executed | OCCURRENCE
23 | 010111 | Yes | Yes | Instructions Executed in the v-pipe (e.g., parallelism/pairing) | OCCURRENCE
24 | 011000 | Yes | Yes | Clocks while a bus cycle is in progress (bus utilization) | DURATION
25 | 011001 | Yes | Yes | Number of clocks stalled due to full write buffers | DURATION
26 | 011010 | Yes | Yes | Pipeline stalled waiting for data memory read | DURATION
27 | 011011 | Yes | Yes | Stall on write to an E or M state line | DURATION
29 | 011101 | Yes | Yes | I/O Read or Write Cycle | OCCURRENCE
30 | 011110 | Yes | Yes | Non-cacheable memory reads | OCCURRENCE
31 | 011111 | Yes | Yes | Pipeline stalled because of an address generation interlock | DURATION
32 | 100000 | Yes | Yes | Reserved |
33 | 100001 | Yes | Yes | Reserved |
34 | 100010 | Yes | Yes | FLOPs | OCCURRENCE
35 | 100011 | Yes | Yes | Breakpoint match on DR0 Register | OCCURRENCE
36 | 100100 | Yes | Yes | Breakpoint match on DR1 Register | OCCURRENCE
37 | 100101 | Yes | Yes | Breakpoint match on DR2 Register | OCCURRENCE
38 | 100110 | Yes | Yes | Breakpoint match on DR3 Register | OCCURRENCE
39 | 100111 | Yes | Yes | Hardware Interrupts | OCCURRENCE
40 | 101000 | Yes | Yes | Data Read or Data Write | OCCURRENCE
41 | 101001 | Yes | Yes | Data Read Miss or Data Write Miss | OCCURRENCE
43 | 101011 | Yes | No | MMX(TM) instructions executed in u-pipe | OCCURRENCE
43 | 101011 | No | Yes | MMX instructions executed in v-pipe | OCCURRENCE
45 | 101101 | Yes | No | EMMS instructions executed | OCCURRENCE
45 | 101101 | No | Yes | Transition between MMX instructions and FP instructions | OCCURRENCE
46 | 101110 | No | Yes | Writes to Non-Cacheable Memory | OCCURRENCE
47 | 101111 | Yes | No | Saturating MMX instructions executed | OCCURRENCE
47 | 101111 | No | Yes | Saturations performed | OCCURRENCE
48 | 110000 | Yes | No | Number of Cycles Not in HLT State | DURATION
49 | 110001 | Yes | No | MMX instruction data reads | OCCURRENCE
50 | 110010 | Yes | No | Floating Point Stalls | DURATION
50 | 110010 | No | Yes | Taken Branches | OCCURRENCE
51 | 110011 | No | Yes | D1 Starvation and one instruction in FIFO | OCCURRENCE
52 | 110100 | Yes | No | MMX instruction data writes | OCCURRENCE
52 | 110100 | No | Yes | MMX instruction data write misses | OCCURRENCE
53 | 110101 | Yes | No | Pipeline flushes due to wrong branch prediction | OCCURRENCE
53 | 110101 | No | Yes | Pipeline flushes due to wrong branch predictions resolved in WB-stage | OCCURRENCE
54 | 110110 | Yes | No | Misaligned data memory reference on MMX instruction | OCCURRENCE
54 | 110110 | No | Yes | Pipeline stalled waiting for MMX instruction data memory read | DURATION
55 | 110111 | Yes | No | Returns Predicted Incorrectly | OCCURRENCE
55 | 110111 | No | Yes | Returns Predicted (Correctly and Incorrectly) | OCCURRENCE
56 | 111000 | Yes | No | MMX instruction multiply unit interlock | DURATION
56 | 111000 | No | Yes | MOVD/MOVQ store stall due to previous operation | DURATION
57 | 111001 | Yes | No | Returns | OCCURRENCE
57 | 111001 | No | Yes | RSB Overflows | OCCURRENCE
58 | 111010 | Yes | No | BTB false entries | OCCURRENCE
58 | 111010 | No | Yes | BTB miss prediction on a Not-Taken Branch | OCCURRENCE
59 | 111011 | Yes | No | Number of clocks stalled due to full write buffers while executing MMX instructions | DURATION
59 | 111011 | No | Yes | Stall on MMX instruction write to E or M line | DURATION
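
As an illustration of how the Serial, Encoding, and Counter columns of Table 6-1 relate, the following C sketch derives the 6-bit encoding from a serial number and checks whether an event may be measured on a given counter. The type and function names (pmc_event, event_allowed_on, event_encoding) are hypothetical helpers introduced for this example only and are not part of any Intel interface. The sample rows are copied from the table.

```c
#include <assert.h>
#include <stdbool.h>

/* Which counters may host an event, taken from the Counter 0 / Counter 1
 * columns of Table 6-1. */
enum { ON_CTR0 = 1u << 0, ON_CTR1 = 1u << 1 };

/* Hypothetical record for one table row (not an Intel-defined layout). */
struct pmc_event {
    unsigned serial;    /* Serial column, 0..63 */
    unsigned counters;  /* ON_CTR0, ON_CTR1, or both */
    const char *name;   /* Performance Monitoring Event column */
};

/* A few rows copied from Table 6-1, including one pair of twin events. */
static const struct pmc_event sample_events[] = {
    { 22, ON_CTR0 | ON_CTR1, "Instructions Executed" },
    { 43, ON_CTR0,           "MMX instructions executed in u-pipe" },
    { 43, ON_CTR1,           "MMX instructions executed in v-pipe" },
    { 45, ON_CTR0,           "EMMS instructions executed" },
};

/* True if the row may be measured on the given counter (0 or 1). */
bool event_allowed_on(const struct pmc_event *ev, int counter)
{
    assert(counter == 0 || counter == 1);
    return (ev->counters & (1u << counter)) != 0;
}

/* Write the 6-bit Encoding column for a serial number into out[7],
 * most significant bit first, as a NUL-terminated string. */
void event_encoding(unsigned serial, char out[7])
{
    assert(serial < 64);
    for (int bit = 5; bit >= 0; --bit)
        out[5 - bit] = ((serial >> bit) & 1u) ? '1' : '0';
    out[6] = '\0';
}
```

Calling event_encoding(43, buf) fills buf with the same bit pattern the table lists for serial 43, and event_allowed_on(&sample_events[1], 1) returns false because the u-pipe event is restricted to counter 0.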

6.1.1 DESCRIPTION OF MMX(TM) INSTRUCTION EVENTS

The event code and counter for each event are provided in parentheses.

6.2 Dynamic Execution (P6-Family) Performance Monitoring Events

This section describes the counters on P6-family processors. Table 6-2 lists the events that can be counted with the performance-monitoring counters and read with the RDPMC instruction.

In the table, the:

These performance monitoring events are intended to be used as guides for performance tuning. The counter values reported are not guaranteed to be absolutely accurate and should be used as a relative guide for tuning. Known discrepancies are documented where applicable. All performance events are model-specific to P6-family processors and are not architecturally guaranteed in future versions of the processor. All performance event encodings not listed in the table are reserved and their use will result in undefined counter results.
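
Because the counts are meant as a relative guide, one way to use them is to compare ratios of two events between runs rather than to rely on absolute values. The C sketch below is only an illustration of that idea. The function names are hypothetical, and the choice of events (for example BR_MISS_PRED_RETIRED divided by BR_INST_RETIRED from Table 6-2) is an assumption of this example, not a method prescribed by this document.

```c
#include <stdint.h>

/* Ratio of two event counts, e.g. mispredicted branches retired divided
 * by branch instructions retired. Returns 0.0 if the denominator is zero. */
double event_ratio(uint64_t numerator, uint64_t denominator)
{
    if (denominator == 0)
        return 0.0;
    return (double)numerator / (double)denominator;
}

/* Relative change of a ratio between a baseline run and a candidate run;
 * -0.5 means the ratio was halved. Returns 0.0 if the baseline is zero. */
double relative_change(double baseline, double candidate)
{
    if (baseline == 0.0)
        return 0.0;
    return (candidate - baseline) / baseline;
}
```

A negative result from relative_change for a stall or miss ratio suggests that the tuned code improved on that metric, even if the absolute counts are slightly off.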

Further details will be made available in a later version of this document.

See the end of the table for notes related to certain entries in the table.

Table 6-2. Performance Monitoring Counters

Unit | Event Num. | Mnemonic Event Name | Unit Mask | Description | Comments
Data Cache Unit (DCU) | 43H | DATA_MEM_REFS | 00H | All memory references, both cacheable and non-cacheable. |
 | 45H | DCU_LINES_IN | 00H | Total lines allocated in the DCU. |
 | 46H | DCU_M_LINES_IN | 00H | Number of M state lines allocated in the DCU. |
 | 47H | DCU_M_LINES_OUT | 00H | Number of M state lines evicted from the DCU. This includes evictions via snoop HITM, intervention or replacement. |
 | 48H | DCU_MISS_OUTSTANDING | 00H | Weighted number of cycles while a DCU miss is outstanding. | An access that also misses the L2 is short-changed by 2 cycles (i.e., if the count is N cycles, it should be N+2 cycles). Subsequent loads to the same cache line will not result in any additional counts. Count value not precise, but still useful.
Instruction Fetch Unit (IFU) | 80H | IFU_IFETCH | 00H | Number of instruction fetches, both cacheable and non-cacheable. |
 | 81H | IFU_IFETCH_MISS | 00H | Number of instruction fetch misses. |
 | 85H | ITLB_MISS | 00H | Number of ITLB misses. |
 | 86H | IFU_MEM_STALL | 00H | Number of cycles that the instruction fetch pipe stage is stalled, including cache misses, ITLB misses, ITLB faults, and victim cache evictions. |
 | 87H | ILD_STALL | 00H | Number of cycles that the instruction length decoder is stalled. |
L2 Cache (Note 1) | 29H | L2_LD | MESI 0FH | Number of L2 data loads. |
 | 2AH | L2_ST | MESI 0FH | Number of L2 data stores. |
 | 24H | L2_LINES_IN | 00H | Number of lines allocated in the L2. |
 | 26H | L2_LINES_OUT | 00H | Number of lines removed from the L2 for any reason. |
 | 25H | L2_M_LINES_INM | 00H | Number of modified lines allocated in the L2. |
 | 27H | L2_M_LINES_OUTM | 00H | Number of modified lines removed from the L2 for any reason. |
 | 2EH | L2_RQSTS | MESI 0FH | Number of L2 requests. |
 | 21H | L2_ADS | 00H | Number of L2 address strobes. |
 | 22H | L2_DBUS_BUSY | 00H | Number of cycles during which the data bus was busy. |
 | 23H | L2_DBUS_BUSY_RD | 00H | Number of cycles during which the data bus was busy transferring data from L2 to the processor. |
External Bus Logic (EBL) (Note 2) | 62H | BUS_DRDY_CLOCKS | 00H (Self), 20H (Any) | Number of clocks during which DRDY is asserted. | Unit Mask = 00H counts bus clocks when the processor is driving DRDY. Unit Mask = 20H counts in processor clocks when any agent is driving DRDY.
 | 63H | BUS_LOCK_CLOCKS | 00H (Self), 20H (Any) | Number of clocks during which LOCK is asserted. | Always counts in processor clocks.
 | 66H | BUS_TRAN_RFO | 00H (Self), 20H (Any) | Number of read for ownership transactions. |
 | 68H | BUS_TRAN_IFETCH | 00H (Self), 20H (Any) | Number of instruction fetch transactions. |
 | 69H | BUS_TRAN_INVAL | 00H (Self), 20H (Any) | Number of invalidate transactions. |
 | 6AH | BUS_TRAN_PWR | 00H (Self), 20H (Any) | Number of partial write transactions. |
 | 6BH | BUS_TRANS_P | 00H (Self), 20H (Any) | Number of partial transactions. |
 | 6CH | BUS_TRANS_IO | 00H (Self), 20H (Any) | Number of I/O transactions. |
 | 6DH | BUS_TRAN_DEF | 00H (Self), 20H (Any) | Number of deferred transactions. |
 | 6EH | BUS_TRAN_BURST | 00H (Self), 20H (Any) | Number of burst transactions. |
 | 70H | BUS_TRAN_ANY | 00H (Self), 20H (Any) | Number of all transactions. |
 | 6FH | BUS_TRAN_MEM | 00H (Self), 20H (Any) | Number of memory transactions. |
 | 61H | BUS_BNR_DRV | 00H (Self) | Number of bus clock cycles during which this processor is driving the BNR pin. |
 | 7AH | BUS_HIT_DRV | 00H (Self) | Number of bus clock cycles during which this processor is driving the HIT pin. Includes cycles due to snoop stalls. |
 | 7EH | BUS_SNOOP_STALL | 00H (Self) | Number of clock cycles during which the bus is snoop stalled. |
Floating Point Unit | C1H | FLOPS | 00H | Number of computational floating-point operations retired. | Counter 0 only.
 | 10H | FP_COMP_OPS_EXE | 00H | Number of computational floating-point operations executed. | Counter 0 only.
 | 11H | FP_ASSIST | 00H | Number of floating-point exception cases handled by microcode. | Counter 1 only.
 | 12H | MUL | 00H | Number of multiplies. | Counter 1 only.
 | 13H | DIV | 00H | Number of divides. | Counter 1 only.
 | 14H | CYCLES_DIV_BUSY | 00H | Number of cycles during which the divider is busy. | Counter 0 only.
Memory Ordering | 03H | LD_BLOCKS | 00H | Number of store buffer blocks. |
 | 04H | SB_DRAINS | 00H | Number of store buffer drain cycles. |
 | 05H | MISALIGN_MEM_REF | 00H | Number of misaligned data memory references. |
Instruction Decoding and Retirement | C0H | INST_RETIRED | 00H | Number of instructions retired. |
 | C2H | UOPS_RETIRED | 00H | Number of micro-ops retired. |
 | D0H | INST_DECODER | 00H | Number of instructions decoded. |
Interrupts | C8H | HW_INT_RX | 00H | Number of hardware interrupts received. |
 | C6H | CYCLES_INT_MASKED | 00H | Number of processor cycles for which interrupts are disabled. |
Branches | C4H | BR_INST_RETIRED | 00H | Number of branch instructions retired. |
 | C5H | BR_MISS_PRED_RETIRED | 00H | Number of mispredicted branches retired. |
 | C9H | BR_TAKEN_RETIRED | 00H | Number of taken branches retired. |
 | CAH | BR_MISS_PRED_TAKEN_RET | 00H | Number of taken mispredicted branches retired. |
 | E0H | BR_INST_DECODED | 00H | Number of branch instructions decoded. |
 | E4H | BR_BOGUS | 00H | Number of bogus branches. |
 | E6H | BACLEARS | 00H | Number of times BACLEAR is asserted. |
Stalls | A2H | RESOURCE_STALLS | 00H | Number of cycles during which there are resource related stalls. |
 | D2H | PARTIAL_RAT_STALLS | 00H | Number of cycles or events for partial stalls. |
Segment Register Loads | 06H | SEGMENT_REG_LOADS | 00H | Number of segment register loads. |
Clocks | 79H | CPU_CLK_UNHALTED | 00H | Number of cycles during which the processor is not halted. |

Notes:

  1. Several L2 cache events, where noted, can be further qualified using the Unit Mask (UMSK) field in the PerfEvtSel0 and PerfEvtSel1 registers. The lower four bits of the Unit Mask field are used in conjunction with L2 events to indicate the cache state or cache states involved. The P6-family processor identifies cache states using the "MESI" protocol, and consequently, each bit in the Unit Mask field represents one of the four states: UMSK[3] = M (8H) state, UMSK[2] = E (4H) state, UMSK[1] = S (2H) state, and UMSK[0] = I (1H) state. UMSK[3:0] = MESI (FH) should be used to collect data for all states; UMSK = 0H, for the applicable events, will result in nothing being counted.
  2. All of the external bus logic (EBL) events, except where noted, can be further qualified using the Unit Mask (UMSK) field in the PerfEvtSel0 and PerfEvtSel1 registers. Bit 5 of the UMSK field is used in conjunction with the EBL events to indicate whether the processor should count transactions that are self generated (UMSK[5] = 0) or transactions that result from any processor on the bus (UMSK[5] = 1).
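
The two notes above can be illustrated with a short C sketch that builds Unit Mask values from the bit assignments they describe. The constant and function names are hypothetical and chosen for this example only. The expansions of the M, E, S, and I state names in the comments follow the usual MESI terminology. How the mask is placed into PerfEvtSel0 or PerfEvtSel1 is not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unit Mask bits for L2 events (Note 1): one bit per MESI state. */
enum {
    UMSK_I    = 0x01,  /* UMSK[0]: I (Invalid) state   */
    UMSK_S    = 0x02,  /* UMSK[1]: S (Shared) state    */
    UMSK_E    = 0x04,  /* UMSK[2]: E (Exclusive) state */
    UMSK_M    = 0x08,  /* UMSK[3]: M (Modified) state  */
    UMSK_MESI = 0x0F   /* all four states */
};

/* Unit Mask values for EBL events (Note 2): bit 5 selects the source. */
enum {
    UMSK_SELF = 0x00,  /* UMSK[5] = 0: self-generated transactions */
    UMSK_ANY  = 0x20   /* UMSK[5] = 1: any processor on the bus    */
};

/* Build an L2 unit mask from the cache states to be counted.
 * A result of 0 means nothing will be counted. */
uint8_t l2_unit_mask(bool m, bool e, bool s, bool i)
{
    return (uint8_t)((m ? UMSK_M : 0) | (e ? UMSK_E : 0) |
                     (s ? UMSK_S : 0) | (i ? UMSK_I : 0));
}

/* Build an EBL unit mask: count this processor only, or any agent. */
uint8_t ebl_unit_mask(bool any_processor)
{
    return (uint8_t)(any_processor ? UMSK_ANY : UMSK_SELF);
}
```

For example, l2_unit_mask(true, true, true, true) yields the 0FH value listed for L2_LD, L2_ST, and L2_RQSTS in Table 6-2, and ebl_unit_mask(true) yields the 20H (Any) value listed for the bus events.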

6.3 RDPMC Instruction

RDPMC enables the user to read the performance monitoring counters at CPL=3, provided that bit 8 of CR4 (CR4.PCE) is set. This is similar to the RDTSC (Read Time Stamp Counter) instruction, which is enabled at CPL=3 if the Time Stamp Disable bit in CR4 (CR4.TSD) is clear. Note that the performance monitoring Control and Event Select Register (CESR) cannot be accessed at CPL=3.
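
The two privilege rules in this paragraph can be written as simple predicates on a CR4 value and the current privilege level. The C sketch below is a model for illustration only, since it does not read the real CR4 register. The function names are hypothetical, and the bit position used for CR4.TSD (bit 2) is not stated in this chapter and is taken here as an assumption from general Intel Architecture documentation.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_TSD (UINT32_C(1) << 2)  /* Time Stamp Disable (bit position assumed) */
#define CR4_PCE (UINT32_C(1) << 8)  /* Performance Counter Enable (bit 8)        */

/* RDPMC is always allowed at CPL 0; at other levels CR4.PCE must be set. */
bool rdpmc_allowed(uint32_t cr4, unsigned cpl)
{
    return cpl == 0 || (cr4 & CR4_PCE) != 0;
}

/* RDTSC is always allowed at CPL 0; at other levels CR4.TSD must be clear. */
bool rdtsc_allowed(uint32_t cr4, unsigned cpl)
{
    return cpl == 0 || (cr4 & CR4_TSD) == 0;
}
```

The two predicates are intentionally asymmetric: PCE is an enable bit, while TSD is a disable bit.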

6.3.1 INSTRUCTION SPECIFICATION

Opcode: 0F 33

Description: Read event monitor counters indicated by ECX into EDX:EAX

Operation: EDX:EAX := Event Counter [ECX]

The value in ECX (either 0 or 1) specifies one of the two 40-bit event counters of the processor. EDX is loaded with the high-order bits of the counter (bits 39:32), and EAX with the low-order 32 bits.
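
A minimal C sketch of how software might reassemble the EDX:EAX pair into a single value, and compute the number of events between two reads of a 40-bit counter, is shown below. The names pmc_value and pmc_delta are hypothetical. The delta calculation assumes that the counter wrapped at most once between the two reads.

```c
#include <stdint.h>

#define PMC_WIDTH_BITS 40
#define PMC_MASK ((UINT64_C(1) << PMC_WIDTH_BITS) - 1)

/* Combine the EDX (high) and EAX (low) halves returned by RDPMC,
 * keeping only the 40 bits that belong to the counter. */
uint64_t pmc_value(uint32_t edx, uint32_t eax)
{
    return (((uint64_t)edx << 32) | (uint64_t)eax) & PMC_MASK;
}

/* Events counted between two reads. The subtraction is done modulo 2^40,
 * so a single wrap of the counter between the reads is handled. */
uint64_t pmc_delta(uint64_t start, uint64_t end)
{
    return (end - start) & PMC_MASK;
}
```

Reading the counter before and after a region of code and passing both values to pmc_delta gives the event count for that region.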

IF CR4.PCE = 0 AND CPL <> 0 THEN #GP(0)
ELSE IF ECX = 0 THEN EDX:EAX := PerfCntr0
ELSE IF ECX = 1 THEN EDX:EAX := PerfCntr1
ELSE #GP(0)
END IF

Protected & Real Address Mode Exceptions:
#GP(0) if ECX does not specify a valid counter (either 0 or 1).
#GP(0) if RDPMC is used in CPL <> 0 and CR4.PCE = 0.
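
The operation and exception rules above can be summarized as a small software model. The C sketch below does not execute RDPMC. It only mirrors the decision logic on a hypothetical cpu_model structure, and all names in it are introduced for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the processor state that RDPMC consults. */
struct cpu_model {
    bool cr4_pce;         /* CR4.PCE (bit 8 of CR4) */
    unsigned cpl;         /* current privilege level, 0..3 */
    uint64_t perfctr[2];  /* PerfCntr0 and PerfCntr1 */
};

enum rdpmc_result { RDPMC_OK, RDPMC_GP0 };

/* Model of RDPMC: on success, stores the selected counter in *edx:*eax. */
enum rdpmc_result rdpmc_model(const struct cpu_model *cpu, uint32_t ecx,
                              uint32_t *edx, uint32_t *eax)
{
    /* #GP(0) if executed at CPL <> 0 while CR4.PCE is clear. */
    if (!cpu->cr4_pce && cpu->cpl != 0)
        return RDPMC_GP0;

    /* #GP(0) if ECX does not specify counter 0 or counter 1. */
    if (ecx > 1)
        return RDPMC_GP0;

    *edx = (uint32_t)(cpu->perfctr[ecx] >> 32);  /* high-order bits */
    *eax = (uint32_t)cpu->perfctr[ecx];          /* low-order 32 bits */
    return RDPMC_OK;
}
```

The model checks the privilege condition before the counter index, matching the order of the pseudocode. Both failures raise the same #GP(0) exception, so the order is not observable by software.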

Remarks:
16-bit code: RDPMC will execute in 16-bit code and in VM mode, but will give a 32-bit result. It will use the full ECX index.
