Architecture-support for pmcstat

August 12, 2009

Pmcstat works!  On my machine, at least.  I still have to check with Arun to make sure it works on his machines, but so far it looks promising.

Usage: <executable> <row index> <ctrspec>

  • executable – this is just the name and path of the executable (pmcstat_p4, pmcstat_iaf, pmcstat_k8, etc.)
  • row index – this is the counter index you want to set.  For K8: 0 to 3.  For IAP: 0 or 1. For IAF: 0 to 2. For P4: 0 to 18 (but be wary of the restrictions).
  • ctrspec – this is a specification string with no spaces, whose arguments are separated by commas

Example Usage:

 ./pmcstat_p4 12 p4-branch-retired,mask=mmtm+mmnm

You see that the ctrspec string can have key=value pairs, or just tags.  Generally just an event name will suffice (e.g. ‘p4-branch-retired’), but in this case we additionally supplied a mask to further refine the events counted.
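To make the ctrspec format concrete, here is a minimal sketch of splitting such a string into bare tags and key=value pairs. It is not the parser libpmc uses; the names ctrspec_parse and ctrspec_cb are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: value is NULL when the field is a bare tag. */
typedef void (*ctrspec_cb)(const char *key, const char *value, void *ctx);

/*
 * Splits a spec such as "p4-branch-retired,mask=mmtm+mmnm" in place.
 * Commas and the first '=' of each field are overwritten with '\0'.
 * Returns the number of fields seen.
 */
static int ctrspec_parse(char *spec, ctrspec_cb cb, void *ctx)
{
    int nfields = 0;
    char *field = spec;

    while (field != NULL && *field != '\0') {
        char *next = strchr(field, ',');
        char *eq;

        if (next != NULL)
            *next++ = '\0';          /* terminate this field */
        eq = strchr(field, '=');
        if (eq != NULL)
            *eq++ = '\0';            /* split key from value */
        if (cb != NULL)
            cb(field, eq, ctx);
        nfields++;
        field = next;
    }
    return nfields;
}
```

For the example above this reports two fields: the tag "p4-branch-retired" and the pair mask / "mmtm+mmnm".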

There are some general events that are shared by (most) architectures, and are listed here: pmc

For the majority of events, you’ll use architecture-specific events, found here:

In addition, for now setting the event causes a side effect of initializing the counter to 0 before the event is enabled.

To read the counters, simply run `rdpmc <rowindex>` where rowindex is the row index of the counter you set the event to.  The exception to this is the IAF counters, for which you have to add 0x40000000 to the row index to read them.

For example, on a Core2 machine you would read the programmable counters with

  • ./rdpmc 0
  • ./rdpmc 1

And you could read the fixed-function counters with:

  • ./rdpmc 0x40000000
  • ./rdpmc 0x40000001
  • ./rdpmc 0x40000002
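The index arithmetic is small enough to show in C. This is only a sketch of how the value passed to RDPMC could be formed; the helper name rdpmc_index is hypothetical and not part of libpmc.

```c
#include <stdint.h>

/* Bit 30 of the RDPMC index selects the fixed-function counter bank. */
#define RDPMC_FIXED_FLAG 0x40000000u

/* Hypothetical helper: builds the index that RDPMC expects in ECX. */
static uint32_t rdpmc_index(int fixed, uint32_t row_index)
{
    return fixed ? (RDPMC_FIXED_FLAG | row_index) : row_index;
}
```

So the third fixed-function counter is read with index 0x40000002, while programmable counter 1 is simply index 1.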

(Also, it goes without saying that you need to set the PCE bit of CR4 in order to read the counters as non-root).

These use the most recent libpmc functions, including a repurposed general allocate function which reads the counter string and extracts the event.


Pmcstat builds … now to add architecture support

August 4, 2009

To those that know of xkcd, this may be familiar: http://xkcd.com/349/ .  Ironically, I am actually working with (Free)BSD software, so I get an additional kick out of that comic.

I just pushed a pmcstat that builds.  Let me tell you right now that even though it builds, it can’t do anything useful yet (not without the architecture-specific support).  Here are my upcoming plans for it (and libpmc, and some other things):

Arch’s

Add architecture-specific support for pmcstat.  Because the hwpmc/libpmc have been previously simplified, pmcstat needs to be told what architecture it’s on.  This is going to take the shape of some preprocessor macros at the top.  They will contain this (and possibly more):

  • Allocation Functions (macros replace a generic allocator with the proper libpmc allocation function)
  • Number of PMCs
  • Bit-length of PMCs (used to calculate printouts)
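A sketch of what such a macro block might look like follows. Every macro name is made up for illustration. The counter counts match the row-index ranges in the August 12 usage notes; the bit widths are assumptions from memory and should be checked against the vendor manuals. The iap_allocate_pmc name is assumed by analogy with the other allocators.

```c
#include <stdint.h>

/* Pick a default so the sketch compiles stand-alone (hypothetical choice). */
#if !defined(PMC_ARCH_P4) && !defined(PMC_ARCH_IAP) && \
    !defined(PMC_ARCH_IAF) && !defined(PMC_ARCH_K8)
#  define PMC_ARCH_K8
#endif

#if defined(PMC_ARCH_P4)
#  define pmc_arch_allocate p4_allocate_pmc
#  define PMC_NPMCS 18
#  define PMC_WIDTH 40   /* assumed width */
#elif defined(PMC_ARCH_IAF)
#  define pmc_arch_allocate iaf_allocate_pmc
#  define PMC_NPMCS 3
#  define PMC_WIDTH 40   /* assumed width */
#elif defined(PMC_ARCH_IAP)
#  define pmc_arch_allocate iap_allocate_pmc   /* name assumed by analogy */
#  define PMC_NPMCS 2
#  define PMC_WIDTH 40   /* assumed width */
#else /* PMC_ARCH_K8 */
#  define pmc_arch_allocate amd_allocate_pmc
#  define PMC_NPMCS 4
#  define PMC_WIDTH 48   /* assumed width */
#endif

/* Largest value a counter can hold; useful when formatting printouts. */
static uint64_t pmc_counter_max(void)
{
    return (UINT64_C(1) << PMC_WIDTH) - 1;
}
```

Building with -DPMC_ARCH_P4 (or one of the others) would then select the matching allocator and limits at compile time.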

libpmc

In addition to these, two smaller functions need to be added for each architecture.  Some of the functionality of these is actually already covered by the allocation functions and can be separated.

  • pmc_read() functions (while RDPMC is universal across all the x86 archs, the addressing scheme is not)
  • pmc_set() functions (instead of setting the event registers, this sets the counter value, used to initialize it to 0 or whatever you want)

debug wrap

The last thing for right now is to take the debugging stuff I put in libpmc (mostly printf()s) and just put some conditional compilation around it.  This way the allocators will just set the PMC events themselves (as opposed to the current setup, where libpmc outputs a value which an event-setter then uses).

The debugging compilation of the libpmc routines would allow the event-setters to still be used, but the regular routines (like those used by pmcstat) would just do the whole job of constructing event/control register values and WRMSR-ing them.

sidenote

I now have pages of notes on PMCTools on my tablet and around my workspace.  One of the conclusions I’ve come to in trying to decipher all of this is that even though I (myself, as a programmer) don’t like Object Orientation, this might have been well-suited for it.  I’ll be the first to admit that I’m much more comfortable with procedural programming, and it works for my (greatly simplified) PMCTools implementation, but for the FreeBSD PMCTools, I think it would make lots of things cleaner and easier to read.  Maybe it’s that I’m still an undergrad (and don’t get my coding powers until grad school), but it has often taken me many, MANY read-throughs, cross-referencing both the FreeBSD kernel and the Minix kernel, to make heads or tails of it.  (Slightly less related to the OO problem, but in general readability has been a barrier for me.) </rant>

Some sleep. Back to coding.


pmcstat update

July 30, 2009

Well, it’s 2 AM here and unfortunately I’m having to give in for the night.  I had previously given myself seven days to finish the pmcstat port, and it looks like I was a bit off.  I will post now on the design, and put up the usage information in a later post.

First off, because there are no interrupts, only counting modes are supported.  In addition, only system-wide counting works for right now.  With a little hook in the scheduler/switcher, process-virtual counting should be possible for one process at a time.  More on that later (in planning the rest of the summer).  Cutting that functionality out for now works, because as I remove it I see how strongly it’s tied to the FreeBSD kernel.  Adding that functionality back would mostly be original code.

pmcstat is a convenient and clean interface to libpmc functionality, which is in turn a clean interface to hwpmc functionality, which is in turn a clean interface to the hardware… wait a minute.  For now pmcstat will work as follows:

  1. pmcstat is called from the command line w/ systemwide counting and some other options
  2. either the options are bad (and it returns w/ an error) or it sets up the appropriate counters and starts counting events
  3. pmcstat issues RDPMCs at regular intervals to update its count (this is because, depending on the frequency of the event, counters can overflow regularly)
  4. this information is output to the user (or saved to a file)
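The periodic read in step 3 only works if each sample is folded into a wider running total before the hardware counter can wrap more than once. A minimal sketch of that bookkeeping, assuming a counter of `width` bits and at most one wrap between reads (pmc_delta is a hypothetical helper, not a libpmc function):

```c
#include <stdint.h>

/*
 * Events counted between two reads of a width-bit hardware counter.
 * Unsigned subtraction followed by masking handles a single wrap-around.
 */
static uint64_t pmc_delta(uint64_t prev, uint64_t cur, unsigned width)
{
    uint64_t mask = (width >= 64) ? UINT64_MAX
                                  : ((UINT64_C(1) << width) - 1);
    return (cur - prev) & mask;
}
```

Each interval, the caller would add pmc_delta(previous_read, current_read, width) to a 64-bit software total, so the reported count keeps growing even when the hardware counter wraps.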

Process-virtual counting is a bit more complicated, but still doable.  It is mostly the same as above, with these differences:

  1. pmcstat fork()s off a child process which waits on a signal
  2. pmcstat then sets up the counters/events for the child
  3. pmcstat signals the child to start
  4. the child process exec()s to the desired process
  5. pmcstat monitors the counters at regular intervals (using RDPMC)
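The handshake in these steps can be sketched with plain POSIX calls. This is an illustration rather than the pmcstat code: run_counted and its setup_counters callback are hypothetical, the counter polling of step 5 is reduced to a comment, and it assumes fork(), sigwait() and execvp() are available on the target.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the child's exit status, or -1 on error. */
static int run_counted(char *const argv[], void (*setup_counters)(pid_t))
{
    sigset_t set, old;
    int sig, status;
    pid_t child;

    /* Block SIGUSR1 before forking so the start signal cannot be lost. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    child = fork();                        /* step 1 */
    if (child < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (child == 0) {
        sigwait(&set, &sig);               /* wait for the start signal */
        sigprocmask(SIG_SETMASK, &old, NULL);
        execvp(argv[0], argv);             /* step 4 */
        _exit(127);                        /* exec failed */
    }

    if (setup_counters != NULL)
        setup_counters(child);             /* step 2 */
    kill(child, SIGUSR1);                  /* step 3 */

    /* Step 5 would poll the counters here until the child exits. */
    if (waitpid(child, &status, 0) < 0)
        status = -1;
    sigprocmask(SIG_SETMASK, &old, NULL);
    if (status == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Blocking the signal before fork() means the child inherits the blocked mask, so the parent's kill() stays pending until the child reaches sigwait(), whichever process runs first.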

More to come soon.

(As an informal note, I should say I gave this figure because pmcstat really is the simplest part of PMCTools so far, thankfully)


Intel Core General Function support

July 21, 2009

Hey all,

I just committed Intel Core(2) general-function counter support to my branch.  This concludes the architectures for now, and I start grinding at the statistics tools next.

In all reality I’ll probably still be working on the architecture stuff for a while.  I (think) my mentor hasn’t gotten it running on his machines yet, so I still have to work through that.

Anyways, quick rundown of the IAP (as PMCTools calls it) architecture:

  • Syntax very similar to AMD’s PMCs
  • One Counter -> One Event Register (I love this clean simplicity)
  • Note: Global control register can disable counters even if you set their enable bit.  It’s a good idea to just enable all of these from the start.

I won’t go into it any more in depth because this was the architecture I detailed in full at the beginning of this blog.  Check earlier posts to see more.


Libpmc Update

July 14, 2009

An update on my libpmc progress:

iaf_allocate_pmc()

Added half of the Intel Core(2)/Architectural Performance Monitoring capabilities (specifically the fixed-function half; the general-purpose half I have left to do) in iaf_allocate_pmc().  This is called in a virtually identical fashion to p4_allocate_pmc() (see previous post on said function).  It is important to note that these counters (like the p4 counters) are restricted in what functions they can take on, specifically:

  • Row Index 0 can only count the Instructions-Retired-Any event
  • Row Index 1 can only count the Core CPU Clock cycles (unhalted)
  • Row Index 2 can only count the Core CPU Reference cycles (unhalted)

amd_allocate_pmc()

Second, added the same function for AMD (Athlon) processors.  This is not completely debugged (and I guess neither is the iaf_ one above), because I do not have one of these personally.  However, my mentor does and I’m working with him to get it working on his computer.  The AMD counters were described in a previous post, so I’ll not go into them too deeply here.  I will say that there are 4 general-purpose counters, each programmed via its own event register.

Build script and command line utilities

I’ve also added a simple build script that I can use to rebuild all of the programs I’m working on:

  • A setpce utility that enables userland RDPMC (and tests the SETPCE kernel call)
  • setevent command-line utilities for controlling events via the command line (one for each processor type)
  • test_libpmc programs for all of the libpmc functions I’ve ported (returns values suitable for the above setevent utilities)
  • rdpmc command-line utility for reading counters directly to stdout. VERY useful in debugging.

Other than that, there are minor changes to the System task calls and the libpmc functions: they are all more verbose and should give some indication of why they failed.

General Note:

This is mentioned specifically in the AMD texts, but this should be general enough to apply to all PMCs (Intel Included):

There can be no counting if the processor is halted!

At first glance this may seem like a fairly obvious statement, and why would you even need to count anything if the processor is halted?  Doesn’t that mean no events are taking place?  The exception is when the CPU is halted, but you’re counting events occurring on the Northbridge/Southbridge (like memory caching events).  If you want to count those, just use a little dummy process to make sure the core doesn’t halt while you are counting.


AMD Athlon PMCs

July 14, 2009

I’ve been working up until recently almost exclusively with Intel chips.  This was because (1) that’s what I read up on while working on my proposal and (2) that’s all I happen to have access to personally.

My mentor has an AMD Athlon 64 processor, so I’ve gotten my hands on some manuals and am in the process of working back and forth to debug it.

So, onto the important information: A description of PMC and their functionality on AMD Athlon chips.

Counters

If there is one architecture that the Athlon PMCs are closest to, it’s Intel’s Architectural Performance Monitoring.  The Athlon, however, has no fixed-function counters, and instead has 4 programmable counters (compare to Intel’s 2 general-purpose counters in APM).  These are fully programmable, which is very nice and easy compared to my Pentium 4, which has 18 counters but each can only count a handful of events, and the sets only partially overlap (which makes dynamically allocating counters a nontrivial problem).  So in my opinion (in this regard) AMD actually beats Intel.

Events

Each of the 4 counters has exactly one event register, which determines what event that counter will count.  Additionally, the fields in the event register are lined up on byte boundaries, which makes checking the values by hand much easier.  The layout of the lower 32 bits of the event register is:

  • Byte 0 – Bits [7:0] – Event Select – This picks the event to count
  • Byte 1 – Bits [15:8] – Unit Mask – This chooses any options your event may have
  • Byte 2 – Bits [23:16] – Flags – See below
  • Byte 3 – Bits [31:24] – Counter Mask – Counting threshold for events that occur multiple times per clock

Flags:

  • Bit 16 – USR Mode – Count in rings 1 through 3
  • Bit 17 – OS Mode – Count in ring-0
  • Bit 18 – EDGE Detect – Only count false-to-true event transitions (instead of every clock cycle the event is true)
  • Bit 19 – PC – Use Pin Control, specified by Bit 23 and the Counter Mask
  • Bit 20 – INT – Interrupt Enable, causes interrupts on counter overflows
  • Bit 21 – Reserved (don’t touch)
  • Bit 22 – EN – Enable, setting this to false disables the counter
  • Bit 23 – INV – Invert mask – Count when the number of events is < the Counter Mask (as opposed to >= the Counter Mask)
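Because the fields sit on byte boundaries, packing an event-select value is just a few shifts and ORs. A minimal sketch using the bit positions listed above (the macro and function names are made up for illustration, not taken from hwpmc):

```c
#include <stdint.h>

/* Flag bits in byte 2 of the event register, as listed above. */
#define EVTSEL_USR  (1u << 16)
#define EVTSEL_OS   (1u << 17)
#define EVTSEL_EDGE (1u << 18)
#define EVTSEL_PC   (1u << 19)
#define EVTSEL_INT  (1u << 20)
#define EVTSEL_EN   (1u << 22)
#define EVTSEL_INV  (1u << 23)

/* Packs the lower 32 bits of an event register. */
static uint32_t evtsel_pack(uint8_t event, uint8_t unit_mask,
                            uint32_t flags, uint8_t counter_mask)
{
    return (uint32_t)event
         | ((uint32_t)unit_mask << 8)
         | (flags & 0x00DF0000u)   /* bits 16-23, minus reserved bit 21 */
         | ((uint32_t)counter_mask << 24);
}
```

For example, an arbitrary event code 0xC0 with no unit mask, counted in both OS and USR mode and enabled, packs to 0x004300C0.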

Addresses

Finally here are the addresses of the MSRs.

Event Registers:

  • PerfEvtSel0 – 0xC0010000
  • PerfEvtSel1 – 0xC0010001
  • PerfEvtSel2 – 0xC0010002
  • PerfEvtSel3 – 0xC0010003

Counters:

  • PerfCtr0 – 0xC0010004
  • PerfCtr1 – 0xC0010005
  • PerfCtr2 – 0xC0010006
  • PerfCtr3 – 0xC0010007
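Since both banks are contiguous, the MSR address for a given row index is just a base plus an offset. A small sketch with hypothetical helper names:

```c
#include <stdint.h>

#define K8_EVTSEL_BASE  0xC0010000u  /* PerfEvtSel0 */
#define K8_PERFCTR_BASE 0xC0010004u  /* PerfCtr0 */
#define K8_NPMCS        4u

/* Each helper returns 0 for an out-of-range row index. */
static uint32_t k8_evtsel_msr(uint32_t ri)
{
    return (ri < K8_NPMCS) ? K8_EVTSEL_BASE + ri : 0;
}

static uint32_t k8_perfctr_msr(uint32_t ri)
{
    return (ri < K8_NPMCS) ? K8_PERFCTR_BASE + ri : 0;
}
```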

References

If you want to know more about this, or where I got my information, check out:

http://www.amd.com/us-en/Processors/DevelopWithAMD/0,,30_2252_739_7203,00.html


proc_allocate_pmc() functionality

July 7, 2009

proc_allocate_pmc() now works for ‘p4’ and ‘iaf’ (Pentium 4 and Core(2) Fixed Function counters, respectively), with a slight modification.

The normal allocation functions take 3 parameters, and these take 4.

Prologue:

Before calling these functions you have to #include the entry points to the libpmc/hwpmc include tree, in this order:

  1. #include "pmctools/libpmc.minix/pmc.h"
  2. #include "pmctools/libpmc.minix/libpmc.c"

The first one grabs all of the header files, and the second grabs all of the code/functions.  Those are relative paths, and I’m assuming your working directory is ‘/usr/src/servers/pmc’ (which is where my working directory is).

p4_allocate_pmc(int ri,  enum pmc_event pe, char *ctrspec, struct pmc_op_pmcallocate pmc_config)

The normal libpmc form of this function does not require the row index (int ri), but that’s because it dynamically chooses it on context switches as necessary.  For our purposes it’s much simpler to provide the row index.

Int ri: This stands for Row Index, and it’s basically the counter number (for example, the Pentium 4 has 18 counters numbered 0x0 through 0x11).  On Pentium 4s you have to be really careful about this, since not every counter can count every event.

Enum pmc_event pe: This is the name of the event.  It will be of the form PMC_EV_PROC_NAME, where PROC is for now either P4 or IAF (for Pentium 4 and Intel Core(2) fixed-function, respectively) and NAME is the name of the event.  The P4 list is very long, so I’ll just post the fixed-function events:

  • PMC_EV_IAF_INSTR_RETIRED_ANY
  • PMC_EV_IAF_CPU_CLK_UNHALTED_CORE
  • PMC_EV_IAF_CPU_CLK_UNHALTED_REF

Which should be set with a row index of 0, 1, or 2 respectively.
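The fixed pairing between event and row index is easy to check up front. A sketch of such a check follows. The enum below is a stand-in that only reuses the three names above; in libpmc these are members of the much larger pmc_event enum with different values. The helper names are hypothetical.

```c
/* Stand-in enum: real values come from libpmc's pmc_event. */
enum iaf_event {
    PMC_EV_IAF_INSTR_RETIRED_ANY,
    PMC_EV_IAF_CPU_CLK_UNHALTED_CORE,
    PMC_EV_IAF_CPU_CLK_UNHALTED_REF
};

/* Returns the only row index that can count the event, or -1. */
static int iaf_row_for_event(enum iaf_event ev)
{
    switch (ev) {
    case PMC_EV_IAF_INSTR_RETIRED_ANY:     return 0;
    case PMC_EV_IAF_CPU_CLK_UNHALTED_CORE: return 1;
    case PMC_EV_IAF_CPU_CLK_UNHALTED_REF:  return 2;
    }
    return -1;
}

/* Nonzero when the (row index, event) pair is a legal combination. */
static int iaf_pairing_ok(int ri, enum iaf_event ev)
{
    return ri >= 0 && ri == iaf_row_for_event(ev);
}
```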

Char *ctrspec: This is a comma-separated string with the counter specification.  Many events have a level of customization (e.g. which types of branch mispredicts to count), and this is where you can set that, among other things.  The full list of possibilities is given in the FreeBSD Man Page for the respective processor’s libpmc capabilities:

The two options you’ll probably want to have every time are “os,usr”, which instruct libpmc to count events that occur during ring-0 (“os”) and rings-1, -2, and -3 (“usr”).  There isn’t a good reason to separate the two: since most of the Minix kernel runs at ring-1 and the vital userspace servers run at ring-2, you won’t get an accurate separation into ‘OS’ events and ‘USR’ events.

Struct pmc_op_pmcallocate pmc_config: This is a struct that (in theory) holds all of the parameters you need for a PMCALLOCATE call to the hwpmc kernel module.  For us, all we need to set is two things:

  • pmc_config.pm_mode = PMC_MODE_SC – this sets a Systemwide Counting mode for that PMC, this is important, as this is the only mode libpmc currently supports
  • pmc_config.pm_caps = 0 – these are the PMC capabilities, as a bunch of bit-flags.  The libpmc routines do not initialize this to 0, and I’m in the process of tracking down why.  If it turns out that it’s not necessary, then libpmc() will do this for you.

Output:

Currently, these functions only result in hex values for the MSRs to be printed to the screen (to expedite debugging).  This is used with other functions (setevent, setiaf) that take command-line input and write them to MSRs.


Libpmc() Progress

July 2, 2009

Woot!  Just finished setting up the build infrastructure for the libpmc functions.  It now #includes everything it needs through the entire libpmc and hwpmc source.  It took a bit longer than I wanted it to, but this way I can work on (develop towards) all the processor types in parallel, instead of having to try to implement each one on its own.

One consequence of kind of ‘decentralizing’ the hwpmc driver (to userspace functions) is that I’ve lost the ‘pointer hub’ that hwpmc holds.  It has pointers to all of the generic libpmc functions (like pmc_allocate), and when initialized will assign those to the proper ‘version’ (processor) of that function (e.g. on my Pentium 4 it gets assigned to p4_allocate_pmc).

If I wanted to keep this methodology I’d have to move all of the hwpmc/libpmc code into one of the servers (probably the PM, which already has the RDMSR/WRMSR/SETPCE functionality).  Instead, I’m going to make identifying the processor type a userspace responsibility.  This can be done a number of ways:

  • Calculating the processor type each time the function is run.  This is not efficient, and your processor is not likely to change often. (but the advantage is your user-program is cross-processor)
  • Hand the pointer-glob (for all libpmc functions) back as the return value of pmc_init().  It works out the processor and its capabilities and assigns the correct functions, so the user can just call them off of the glob (which is what hwpmc does).  This has the advantage of using the portable forms of functions (pmc_allocate() instead of p4_allocate_pmc() for example), and doesn’t have to calculate the processor type every call.  My goal is to have this kind of functionality.
  • Have users call the processor-specific functions themselves (e.g. p4_allocate_pmc).  This is the quickest way, but it results in non-portable code (and having #ifdef’s for each processor type would be a pain).  This will always be possible, and this is the method I’m starting with.  It also happens to be a good way of testing the functions (hooking into them directly).
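The ‘pointer-glob’ option amounts to a table of function pointers chosen once at init time. Here is a minimal sketch of the idea. The struct, the stub allocators, and pmc_init_sketch are all made up for illustration; a real pmc_init() would identify the processor itself (e.g. via CPUID) rather than take a name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical glob: one slot per portable libpmc entry point. */
struct pmc_ops {
    const char *cpu;
    int (*allocate)(int ri, const char *ctrspec);
};

/* Stubs standing in for p4_allocate_pmc() and amd_allocate_pmc(). */
static int p4_allocate_stub(int ri, const char *ctrspec)
{
    (void)ctrspec;
    return (ri >= 0 && ri < 18) ? 0 : -1;
}

static int k8_allocate_stub(int ri, const char *ctrspec)
{
    (void)ctrspec;
    return (ri >= 0 && ri < 4) ? 0 : -1;
}

static const struct pmc_ops ops_table[] = {
    { "p4", p4_allocate_stub },
    { "k8", k8_allocate_stub },
};

/* Returns the glob for the named processor, or NULL if unsupported. */
static const struct pmc_ops *pmc_init_sketch(const char *cpu)
{
    size_t i;

    for (i = 0; i < sizeof(ops_table) / sizeof(ops_table[0]); i++)
        if (strcmp(ops_table[i].cpu, cpu) == 0)
            return &ops_table[i];
    return NULL;
}
```

A user program would call the init function once and then use ops->allocate(...) everywhere, which keeps its source free of per-processor #ifdefs.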

On a side note, there’s a funny inconsistency in FreeBSD’s libpmc (in the current head): the function prototypes defined in lib/libpmc/libpmc.c don’t match the actual functions in sys/dev/hwpmc/hwpmc_*.c.  It almost looks like they’re halfway through upgrading to newer forms (enum pmc instead of int ri for row index, and other such niceties).  I think I’m going to default to the actual code on this one, but the prototypes would be a lot cleaner if they get implemented.  I’ll post an update when I find out (I put a question out to the developers regarding it).

Edit: It turns out both are valid functions, and one is meant to be called by userspace functions (libpmc) and the other is meant to be called in the kernel by the hwpmc module.  It looks like my solution is going to be a mashup of both.


Libpmc progress

June 29, 2009

This is kind of a part-way progress update.  I’m nearly done with the infrastructure I need to start solidly building the libpmc API that user programs can use to get hardware counting.

I had a sort-of infrastructure before; but it was heavily Pentium4-specific, so I’m rebuilding it in a more robust way that should incorporate the gamut of processors.  This also makes it easier to work on multiple processor types in parallel.  Right now I’m only targeting Pentium-4 (my machine) and Core2 Duo and AMD Athlons (my mentor’s machines).  I hope to add other processors at a later date, but this is time for my final sprint towards getting libpmc put together in time for Mid-Term eval.

On that note: back to work.  I hope I pass the midterm *knocks on wood for luck*, so onto the sprint.


Design Document

June 25, 2009

I’ll try to cover here the overall design of what I’m working on (PMC tools for minix) as well as the rationale behind it.  I think I’ve kept to myself a lot of the reasoning why I’m doing things, so this is me putting my thoughts and ideas out there.

Kernel/OS Modifications:

New System Calls and Assembly Functions:

  • SETPCE – Sets the PCE bit of Control Register 4, enabling user (ring-3,2,1) programs to read from PMCs.  Note that this is a break from FreeBSD’s PMCTools, which does its reads at ring-0.
  • WRMSR- This is the essential ‘unit’ of work we use to deal with PMCs.  This can set configurations, events, and counter values.  The second most important thing the hwpmc driver does is it translates API requests into values to WRMSR to machine registers. (The first being keeping track of who’s using what counters). It’s probably the most complicated task too.
  • RDMSR- This is the partner instruction to WRMSR, it reads MSRs and returns the current value.
  • level0()- This is a function the kernel uses to run functions at ring-0.  Now modified to take a single argument (void *) which processes can dereference to get whatever arguments they need.  WRMSR/RDMSR are excellent usage examples of this.  Both are assembly functions that need multiple arguments and need to run at ring-0.  The actual function pointer handed to the level0() call is a wrapper function entirely in C that takes the (void*) pointer and dereferences it to get the three arguments needed, then calls the assembly function.
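The level0() argument-passing trick looks roughly like the sketch below. Everything here is a stand-in so the example can run in user space: the assembly routine is replaced by a stub that records its arguments, level0_stub() simply calls the function instead of switching to ring-0, and the exact signature of the modified level0() is an assumption.

```c
#include <stdint.h>

/* The three WRMSR arguments, bundled so one void* can carry them. */
struct msr_args {
    uint32_t msr;
    uint32_t lo;
    uint32_t hi;
};

/* Stand-in for the assembly routine: records what would be written. */
static struct msr_args last_write;

static void wrmsr_asm_stub(uint32_t msr, uint32_t lo, uint32_t hi)
{
    last_write.msr = msr;
    last_write.lo = lo;
    last_write.hi = hi;
}

/* C wrapper: unpacks the void* and calls the multi-argument routine. */
static void wrmsr_wrapper(void *p)
{
    struct msr_args *a = p;

    wrmsr_asm_stub(a->msr, a->lo, a->hi);
}

/* Stand-in for level0(): the real one runs fn(arg) at ring-0. */
static void level0_stub(void (*fn)(void *), void *arg)
{
    fn(arg);
}
```

A caller fills in a struct msr_args and passes its address along with wrmsr_wrapper, so level0() itself never needs to know how many arguments the final routine takes.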

Messages/Communication functionality:

  • Added new message types (full of u32_t’s) that are used by WRMSR/RDMSR and company; which have handy easy-to-use macros in <com.h> to make it clean to use (which seems SOP for Minix).
  • SYSTEM task/Kernel catches SYS_SETPCE, SYS_WRMSR, and SYS_RDMSR calls from the PM server, and does the appropriate action at ring-0 (see above).  Security: WRMSR requests are limited to PMC-related registers
  • PM Server receives user-process _taskcall()s for SETPCE, WRMSR, and RDMSR.  This is going to be the entry point for the libpmc API, which will interpret userland calls and translate them to _taskcall()s to the PM server
  • PMC Server in boot image.  I’m still not sure on this one.  It would make more sense if I were porting every single facet of hwpmc functionality, but much of it is not necessary (and extra overhead).  I thought it would be healthier in a microkernel to have PMC functionality in its own separate process, but with the new trimmed hwpmc model it makes less sense.
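The WRMSR restriction mentioned in the list above can be pictured as a range check in front of the ring-0 write. This is a sketch only, not the check in my SYSTEM task: it whitelists just the Athlon PerfEvtSel/PerfCtr range 0xC0010000 to 0xC0010007, and the names are hypothetical. Other processors would need their own ranges added to the table.

```c
#include <stddef.h>
#include <stdint.h>

struct msr_range {
    uint32_t first;
    uint32_t last;
};

/* Registers a WRMSR request is allowed to touch (Athlon PMCs only here). */
static const struct msr_range pmc_msrs[] = {
    { 0xC0010000u, 0xC0010007u },   /* PerfEvtSel0-3 and PerfCtr0-3 */
};

/* Nonzero when msr falls inside one of the whitelisted ranges. */
static int msr_is_pmc(uint32_t msr)
{
    size_t i;

    for (i = 0; i < sizeof(pmc_msrs) / sizeof(pmc_msrs[0]); i++)
        if (msr >= pmc_msrs[i].first && msr <= pmc_msrs[i].last)
            return 1;
    return 0;
}
```

A request for anything outside the table, such as a system-configuration MSR, would be rejected before it ever reaches ring-0.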

HWPMC

PMCTools’

The hwpmc driver in FreeBSD is incredibly robust.  It offers a huge amount of functionality and is extremely portable.  ‘Extremely’ because it’s taken to such an extreme: it is processor independent, covering even the non-x86 processors (ARM, SPARC, etc), at the cost of becoming extremely complicated.  It also maximizes counting capabilities, getting the most use out of each counter.  This means a counter in virtual mode can be used by many different processes/threads at ‘once’, where the event is changed at every context switch.  It’s also massively reentrant, so it can preempt itself three or four times over (not to mention SMP problems).  I originally thought I’d be safe removing the locks (since Minix is single-threaded), but even on a single-core system the worst case is hwpmc preempting itself twice:

[0] User program executes a libpmc API function, translated to a system call that’s handed to the hwpmc kernel module.  While processing this request it gets context switched [1], but the context switch needs to access the hwpmc data structures (since it changes events on context switches), and needs to get access to them (even if they are in the process of being edited).  Before the context switch can finish using these structures to set up the PMCs for the next process/thread, it gets [2] interrupted by a PMC overflowing (or otherwise set up to generate interrupts), which also requires access to modify hwpmc’s internal data structures.

PMCTools’ by itself is incredibly powerful, but at the cost of being exceedingly complex.  I’ve had trouble making heads-or-tails of much of the code, most of it making sense at around the 4th read through.  (To this extent fxr.watson.org has been a great tool)

hwpmc port

This was supposed to be the first step, porting the hwpmc driver to Minix.  First, no locks: the first thing I did was to comment out the locks because Minix is w/o SMP.  Second, it uses variadic macros (PMCDBG) to aid in debugging, which ACK didn’t support.  (Flexible array members too, but that’s an easy fix.)  This seemed like it was going to be a simple port, but unfortunately, going back over the locking structures, I realized that even w/o SMP the locks are still necessary to control access to the data structures.

New hwpmc

I’ve implemented my own minimalist hwpmc, which just takes hwpmc-style arguments and formulates the correct values to WRMSR.  It compiles only with GCC however, because it does include hwpmc structures and macros (which require C99).  Right now System-wide Counting mode is possible across all PMCs on my Pentium 4 machine; hopefully I can find a way to port the other 386 arch’s and test them as well.

New Design

I had originally proposed to port all of hwpmc; I think I’ve reached a point of diminishing returns on the port of the complete kernel module, since continuing would require me to implement some of the locks used to control access to the internal data structures.  This is doable, but it’s not my first priority; so for now I’m implementing my own (smaller, simpler) hwpmc.  It’s starting out with just System-Wide counting, but it can be extended to other modes.  The major design difference is static allocation of PMCs: if a PMC is allocated in virtual mode it is not traded out every context switch, but merely left to run (recording the PMC value at switch time).  This saves having to set events (or enable bits) at ring-0 every context switch (RDPMC can be executed at user privilege).  The PM server keeps track of which PMCs are in use, and that is where the libpmc interface comes in.  The greatest advantage of this is that it does not require locks and I don’t need to worry about it pre-empting itself.  We’re losing the absolute security (user programs can write over each other’s events), but I think that is a place where we can safely trust user programs.

libpmc

The API used to interact with hwpmc is libpmc, so this API is where I can abstract between FreeBSD’s hwpmc, and my hwpmc.  The new pmc driver shines here, because while many of the libpmc functions have to be recreated, the new functions are trivially simple.  pmc_allocate() just forms a pair of _taskcall()s to the PM server with the calculated values, as long as the PMC is currently unallocated. pmc_read() is just a wrapper around a RDPMC assembly function.
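The ‘as long as the PMC is currently unallocated’ bookkeeping can be as simple as a bitmap in the PM server. A minimal sketch, assuming 18 counters as on my Pentium 4; pmc_claim and pmc_release are hypothetical names, and the _taskcall()s that would follow a successful claim are left out.

```c
#include <stdint.h>

#define NPMCS 18u               /* Pentium 4 counter count */

static uint32_t pmc_in_use;     /* bit ri set => row index ri is allocated */

/* Marks a counter as allocated; -1 if out of range or already taken. */
static int pmc_claim(unsigned ri)
{
    if (ri >= NPMCS || (pmc_in_use & (1u << ri)) != 0)
        return -1;
    pmc_in_use |= 1u << ri;
    return 0;
}

/* Frees a counter; -1 if out of range or not currently allocated. */
static int pmc_release(unsigned ri)
{
    if (ri >= NPMCS || (pmc_in_use & (1u << ri)) == 0)
        return -1;
    pmc_in_use &= ~(1u << ri);
    return 0;
}
```

Because only the single-threaded PM server touches the bitmap, no locking is needed around these updates.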

ToDo

The biggest thing on my todo list (after finishing porting the libpmc API functions) is getting logging working.  The logs are one of the more complexly locked structures in hwpmc, since every method of entry to hwpmc code needs to be able to record its actions to the log.  In addition, this is how some of the PMCTools additional tools work (by analyzing the logs).

Future

I’d really like to have a complete port of hwpmc for Minix.  The benefits are huge: locking mechanisms, virtual mode multiplexing (dynamically reallocating virtual-mode PMCs), process PMC security, cross-processor functionality, and a close relation to an actively developed project.  I’m putting this off by making a simple replacement, but I think the advantage outweighs the cost: we get performance-measurement statistics faster, which means more time to make changes/optimize/etc.  As Arun suggested, I’m putting the end-to-end model first.