In the previous post I installed Debian 8 (jessie) on a ThinkPad X260, but I changed my mind and installed Debian 9 (stretch) instead, because it supports the wifi chip in the ThinkPad X260. A good thing is that Debian 9 is already frozen, so I can expect only a few critical bugs to remain (well, there are actually one to two hundred of them as of today, but that is a relatively small number given that it has over 40K packages).
One big difference between Debian 9 and 8 is the kernel version they use (4.9 vs 3.16). In particular, the support for Intel PEBS (Precise Event Based Sampling) is much better (or I should say much more correct) in kernel 4.9. This post briefly explains what PEBS is and how its support improves with kernel 4.9.
Precise Event Based Sampling (PEBS)
PEBS is an extension of the performance counters, a mechanism for measuring various hardware events such as the number of cache misses, the number of branch mispredictions, and many others. If you're not familiar with performance counters, please refer to an introductory resource first.
PEBS can be used from the Linux perf tool by appending the :pp suffix to the counter name (for example, r20D1:pp).
An advantage of PEBS over the normal performance counters is that, as the name suggests, PEBS is more precise because the sampling is done entirely in hardware.
For example, take a result of measuring r20D1 without :pp, broken down per instruction. r20D1 measures the number of “Retired load instructions missed L3”, so it can never occur on instructions other than those that access memory. However, this result shows that 2.32% of the events occurred in a sub between two registers, 9.21% in a mov between two registers, and so on.
(An excuse for this is that, for function-level performance analysis, this accuracy might be enough. Even if the reported locations of events are off by a few instructions, the outcome can be the same when you look at them at function-level granularity.)
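The function-level argument can be illustrated with a small Python sketch. Everything in it is hypothetical: the symbol table, the addresses, and the 8-byte skid are made up for illustration and are not taken from any real measurement.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [(0x1000, "parse"), (0x1400, "lookup"), (0x1900, "emit")]
STARTS = [start for start, _ in SYMBOLS]


def function_of(ip: int) -> str:
    """Return the function whose start address is the closest one at or below ip."""
    idx = bisect_right(STARTS, ip) - 1
    if idx < 0:
        raise ValueError(f"address {ip:#x} is below the first symbol")
    return SYMBOLS[idx][1]


def per_function(ips):
    """Aggregate sampled instruction pointers into per-function counts."""
    return Counter(function_of(ip) for ip in ips)


# Hypothetical samples: where the events really happened ...
precise = [0x1404, 0x1404, 0x1410, 0x1904]
# ... and where an imprecise counter might report them (a few bytes late).
skidded = [ip + 8 for ip in precise]

print(per_function(precise) == per_function(skidded))
```

In this made-up case the per-function totals are identical even though every sample points at the wrong instruction. The assumption breaks when an event happens close to the end of a function, because the skid can then push the sample into the next function.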
For an explanation of each counter, you can refer to section 19 of volume 3 of the super thick manual from Intel. Note that the event number and the umask have to be passed to perf in the reverse order of how they appear in the manual. For example, to measure a counter whose event number is AA and whose umask is BB, you have to run perf record -e rBBAA (not rAABB).
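To make the ordering concrete, here is a minimal Python sketch that builds the raw counter name from an event number and a umask. The helper names raw_event and perf_record_argv are hypothetical and exist only for this illustration; the umask-then-event ordering, the :pp suffix, and perf record -e are the parts that come from the text.

```python
from typing import List


def raw_event(event: int, umask: int, precise: bool = False) -> str:
    """Build a perf raw event name: 'r' + umask + event number, both in hex."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in one byte")
    name = f"r{umask:02X}{event:02X}"  # umask first, then the event number
    return name + ":pp" if precise else name


def perf_record_argv(event_name: str, workload: List[str]) -> List[str]:
    """Assemble an argument list for 'perf record' (it is only built, not run)."""
    return ["perf", "record", "-e", event_name, "--", *workload]


# Event number 0xD1 with umask 0x20 is the counter used throughout this post.
print(raw_event(0xD1, 0x20))
print(perf_record_argv(raw_event(0xD1, 0x20, precise=True), ["./workload"]))
```

The ./workload path is a placeholder for whatever program you want to profile.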
Measuring the same workload with PEBS by specifying :pp gives a different result: no r20D1 event occurs on any instruction that does not access memory.
Another huge advantage of PEBS is that it can capture the register values, the instruction pointer, the memory address accessed, and the source of the data at the time the instruction triggering the event executes. However, explaining these requires a whole new long post, so I will leave it to another manual from Intel.
How PEBS is handled in the kernel
The Linux kernel holds a list of counters that support PEBS, because not all counters do and the kernel has to know which ones are PEBS-capable. For Skylake and Kaby Lake, PEBS is supported for the counters that have “PS” or “PSDLA” in the comment column of the manual. For Broadwell and older CPUs, the manual says “Supports PEBS” in the comment column for PEBS-capable counters.
This list is defined in arch/x86/kernel/cpu/perf_event_intel_ds.c in kernel 3.16 and in arch/x86/events/intel/ds.c in kernel 4.9. The problem is that the list in kernel 3.16, as of the Debian 8 release, was incomplete.
As a concrete example, r20D1 (event number=0xD1, umask=0x20), used in the example above, is PEBS-capable, but it is not listed in the linux-source-3.16 of Debian 8. (Note that it is listed in the newest version of kernel 3.16 on kernel.org, which means it was fixed at some point after Debian 8 was released.)
In the linux-source-3.16 package of Debian 8, the list is defined with a series of INTEL_* macros. I won't explain what each macro means, but the point is that the kernel treats a counter rXXYY as PEBS-capable if the list contains a line for it. The list has entries such as r40D1, but no r20D1, even though the Intel manual describes r20D1 as PEBS-capable on Haswell.
Note that Haswell was the latest Core generation when kernel 3.16 was released, and for newer CPUs such as Skylake that kernel simply treats them as Haswell.
Therefore, trying to measure r20D1:pp on Debian 8 yields an error.
This issue has already been fixed in kernel 4.9. Debian 9, which uses kernel 4.9, therefore handles r20D1 properly as PEBS-capable and allows perf to measure r20D1:pp.
Linux kernel 4.9 defines the list of PEBS-supported counters in arch/x86/events/intel/ds.c. The relevant part is a macro specifying that any counter ending with D1 is PEBS-capable.
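The difference between the two approaches can be modeled in a few lines of Python. This is only a sketch of the matching idea, not kernel code: the set names and functions are hypothetical, and the exact-match set contains only r40D1 because that is the one entry the text above names.

```python
# Hypothetical model of the two lookup styles; this is not kernel code.
EXACT_ENTRIES = {0x40D1}      # full umask+event codes (only r40D1 is taken from the text)
EVENT_ONLY_ENTRIES = {0xD1}   # event numbers; any umask is accepted


def parse_raw(name: str) -> int:
    """Turn a raw counter name such as 'r20D1' into the integer 0x20D1."""
    if not name.lower().startswith("r"):
        raise ValueError(f"not a raw counter name: {name!r}")
    return int(name[1:], 16)


def pebs_capable_exact(name: str) -> bool:
    """3.16-style check as described above: the whole umask+event code must be listed."""
    return parse_raw(name) in EXACT_ENTRIES


def pebs_capable_by_event(name: str) -> bool:
    """4.9-style check as described above: only the low byte (event number) is compared."""
    return (parse_raw(name) & 0xFF) in EVENT_ONLY_ENTRIES


for counter in ("r40D1", "r20D1"):
    print(counter, pebs_capable_exact(counter), pebs_capable_by_event(counter))
```

With exact matching, every new umask needs its own line, so a missing line means a missing feature. Matching on the event number alone covers all umasks of that event at once.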
If you use special hardware features such as PEBS, I recommend upgrading your distro and kernel. PEBS has existed since the Pentium 4, but the set of supported counters keeps growing and changing (in fact, r20D1 counted micro-operations until Broadwell, but was changed to count instructions starting with Skylake). So you should use a near-latest kernel whenever you can to get proper support, and using the latest distro release might be an easy way to do that.
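If you want a script to warn you before you start measuring, a rough check of the running kernel version could look like the following Python sketch. The function names are hypothetical, and the 4.9 threshold is simply the version discussed in this post, not an authoritative cutoff for PEBS support.

```python
import platform
import re
from typing import Tuple


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '4.9.0-3-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def at_least(release: str, minimum: Tuple[int, int] = (4, 9)) -> bool:
    """Return True if the release is the given minimum version or newer."""
    return kernel_major_minor(release) >= minimum


def running_kernel_ok() -> bool:
    """Check the kernel this script is running on (meaningful on Linux)."""
    return at_least(platform.release())
```

On Linux, platform.release() returns the same string as uname -r, so running_kernel_ok() reflects the kernel that is actually booted rather than the one most recently installed.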