I’ll try to cover here the overall design of what I’m working on (PMC tools for Minix) as well as the rationale behind it. I think I’ve kept a lot of the reasoning behind what I’m doing to myself, so this is me putting my thoughts and ideas out there.
Kernel/OS Modifications:
New System Calls and Assembly Functions:
- SETPCE – Sets the PCE bit of Control Register 4, enabling programs running outside ring-0 (rings 1 through 3) to read from PMCs. Note that this is a break from FreeBSD’s PMCTools, which does its reads at ring-0.
- WRMSR – This is the essential ‘unit’ of work we use to deal with PMCs. It can set configurations, events, and counter values. The second most important thing the hwpmc driver does is translate API requests into values to write to machine registers with WRMSR. (The most important is keeping track of who’s using which counters.) It’s probably the most complicated task too.
- RDMSR – This is the partner instruction to WRMSR: it reads an MSR and returns its current value.
- level0() – This is a function the kernel uses to run functions at ring-0. It is now modified to take a single argument (void *), which the called function can dereference to get whatever arguments it needs. WRMSR/RDMSR are excellent usage examples of this. Both are assembly functions that need multiple arguments and need to run at ring-0. The actual function pointer handed to the level0() call is a wrapper function written entirely in C that takes the (void *) pointer, dereferences it to get the three arguments needed, and then calls the assembly function.
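The level0() wrapper pattern described above can be sketched roughly as follows. This is a user-space simulation under stated assumptions, not the actual kernel code: the struct layout, the names msr_args, ia32_wrmsr, wrmsr_wrapper and do_wrmsr, and the fake register array are all hypothetical, and the stand-in level0() simply calls the function instead of switching to ring-0.

```c
#include <stdint.h>

/* Hypothetical argument block: the three values the assembly routine needs. */
struct msr_args {
    uint32_t msr;  /* MSR address (the real WRMSR takes this in ECX) */
    uint32_t lo;   /* low 32 bits of the value (EAX) */
    uint32_t hi;   /* high 32 bits of the value (EDX) */
};

/* Fake register file so the sketch can run outside the kernel. */
#define FAKE_MSR_COUNT 16
uint64_t fake_msr[FAKE_MSR_COUNT];

/* Stand-in for the assembly routine. The real one would execute WRMSR;
 * this one only records the value in the fake register file. */
void ia32_wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
{
    fake_msr[msr % FAKE_MSR_COUNT] = ((uint64_t)hi << 32) | lo;
}

/* C wrapper handed to level0(): unpacks the void pointer and calls
 * the assembly routine with its three arguments. */
void wrmsr_wrapper(void *arg)
{
    struct msr_args *a = arg;
    ia32_wrmsr(a->msr, a->lo, a->hi);
}

/* Stand-in for the modified level0(). The real one would arrange for
 * func to run at ring-0; here it is a plain call. */
void level0(void (*func)(void *), void *arg)
{
    func(arg);
}

/* Hypothetical caller: pack the arguments, then go through level0(). */
void do_wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
{
    struct msr_args a;
    a.msr = msr;
    a.lo = lo;
    a.hi = hi;
    level0(wrmsr_wrapper, &a);
}
```

The point of the single (void *) argument is that level0() itself never needs to know how many arguments a given ring-0 routine takes; only the wrapper and its caller agree on the struct.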
Messages/Communication functionality:
- Added new message types (full of u32_t’s) that are used by WRMSR/RDMSR and company. They have handy easy-to-use macros in <com.h> that keep them clean to use (which seems to be SOP for Minix).
- SYSTEM task/Kernel catches SYS_SETPCE, SYS_WRMSR, and SYS_RDMSR calls from the PM server, and performs the appropriate action at ring-0 (see above). Security: WRMSR requests are limited to PMC-related registers.
- PM Server receives user-process _taskcall()s for SETPCE, WRMSR, and RDMSR. This is going to be the entry point for the libpmc API: it will interpret userland calls and translate them to _taskcall()s to the PM server.
- PMC Server in boot image. I’m still not sure on this one. It would make more sense if I were porting every single facet of hwpmc functionality, but much of it is not necessary (and extra overhead). I thought it would be healthier in a microkernel to have PMC functionality in its own separate process, but with the new trimmed hwpmc model it makes less sense.
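The security restriction mentioned in the SYSTEM task bullet above (WRMSR requests limited to PMC-related registers) could be implemented as a simple range check before the write. The sketch below is only an illustration: the function and macro names are hypothetical, and the address ranges are what I believe the Pentium 4 counter, CCCR and ESCR blocks to be, so they should be verified against the Intel manuals before being relied on.

```c
#include <stdint.h>
#include <stddef.h>

struct msr_range {
    uint32_t first;
    uint32_t last;
};

/* Assumed Pentium 4 PMC-related MSR ranges (verify against the Intel manuals). */
static const struct msr_range pmc_msr_ranges[] = {
    { 0x300, 0x311 },  /* performance counters */
    { 0x360, 0x371 },  /* counter configuration control registers (CCCRs) */
    { 0x3A0, 0x3E1 },  /* event selection control registers (ESCRs), with gaps */
};

/* Returns 1 if msr falls inside one of the allowed ranges, 0 otherwise. */
int msr_is_pmc_related(uint32_t msr)
{
    size_t i;
    size_t n = sizeof(pmc_msr_ranges) / sizeof(pmc_msr_ranges[0]);

    for (i = 0; i < n; i++) {
        if (msr >= pmc_msr_ranges[i].first && msr <= pmc_msr_ranges[i].last)
            return 1;
    }
    return 0;
}

#define PMC_OK     0
#define PMC_EPERM  (-1)

/* Hypothetical request check: refuse anything outside the PMC ranges
 * before the request ever reaches the ring-0 write. */
int check_wrmsr_request(uint32_t msr)
{
    return msr_is_pmc_related(msr) ? PMC_OK : PMC_EPERM;
}
```

A table of ranges keeps the check easy to extend when other processor families, with their own PMC register addresses, are added.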
HWPMC
PMCTools’
The hwpmc driver in FreeBSD is incredibly robust. It offers a huge amount of functionality and is extremely portable. I say ‘extremely’ because it’s taken to such an extreme: it has become very complicated in order to stay processor-independent, covering all of the non-x86 processors (ARM, SPARC, etc.) as well. It also maximizes counting capabilities, getting the most use out of each counter. This means a counter in virtual mode can be used by many different processes/threads at ‘once’, with the event being changed at every context switch. It’s also massively reentrant, so it can preempt itself three or four times over (not to mention SMP problems). I originally thought I’d be safe removing the locks (since Minix is single-threaded), but even on a single-core system the worst case is hwpmc preempting itself twice:
[0] A user program executes a libpmc API function, which is translated into a system call that’s handed to the hwpmc kernel module. While processing this request it gets context switched [1], but the context switch needs to access the hwpmc data structures (since it changes events on context switches), and needs to get access to them even if they are in the process of being edited. Before the context switch can finish using these structures to set up the PMCs for the next process/thread, it gets [2] interrupted by a PMC overflowing (or one otherwise set up to generate interrupts), which also requires access to modify hwpmc’s internal data structures.
PMCTools’ by itself is incredibly powerful, but at the cost of being exceedingly complex. I’ve had trouble making heads or tails of much of the code, with most of it making sense at around the 4th read-through. (To this end, fxr.watson.org has been a great tool.)
hwpmc port
This was supposed to be the first step: porting the hwpmc driver to Minix. First, no locks: the first thing I did was comment out the locks, because Minix is without SMP. Second, it uses variadic macros (PMCDBG) to aid in debugging, which ACK didn’t support. (Flexible array members too, but that’s an easy fix.) This seemed like it was going to be a simple port, but unfortunately, going back over the locking structures, I realized that even without SMP the locks are still necessary to control access to the data structures.
New hwpmc
I’ve implemented my own minimalist hwpmc, which just takes hwpmc-style arguments and formulates the correct values to WRMSR. It compiles only with GCC, however, because it includes hwpmc structures and macros that require C99. Right now System-wide Counting mode is possible across all PMCs on my Pentium 4 machine; hopefully I can find a way to port the other 386 architectures and test them as well.
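To give a feel for what "formulating the correct values to WRMSR" means, here is a small sketch of packing an event request into a register value. Note the assumption: for brevity it uses the simpler P6-style event-select register layout (event code, unit mask, USR, OS and EN bits), not the Pentium 4's ESCR/CCCR pair, which is more involved. The function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Bit positions in a P6-style event-select register. */
#define EVTSEL_USR  (1u << 16)  /* count while running at rings 1-3 */
#define EVTSEL_OS   (1u << 17)  /* count while running at ring-0 */
#define EVTSEL_EN   (1u << 22)  /* enable the counter */

/* Pack an event code and unit mask into the low 32 bits that would be
 * written with WRMSR. System-wide counting wants both USR and OS set. */
uint32_t evtsel_encode(uint8_t event, uint8_t umask,
                       int count_user, int count_kernel)
{
    uint32_t v = (uint32_t)event | ((uint32_t)umask << 8);

    if (count_user)
        v |= EVTSEL_USR;
    if (count_kernel)
        v |= EVTSEL_OS;
    return v | EVTSEL_EN;
}
```

For example, evtsel_encode(0x3C, 0x00, 1, 1) yields 0x0043003C: event 0x3C, no unit mask, counting in both user and kernel mode, counter enabled.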
New Design
I had originally proposed to port all of hwpmc, but I think I’ve reached a point of diminishing returns on the port of the complete kernel module: continuing would require me to implement some of the locks used to control access to the internal data structures. This is doable, but it’s not my first priority, so for now I’m implementing my own (smaller, simpler) hwpmc. It’s starting out with just System-Wide counting, but it can be extended to other modes. The major design difference is static allocation of PMCs: if a PMC is allocated in virtual mode it is not traded out every context switch, but merely left to run (recording the PMC value at switch time). This saves having to set events (or enable bits) at ring-0 every context switch (RDPMC can be executed at user privilege). The PM server keeps track of which PMCs are in use, and that is where the libpmc interface comes in. The greatest advantage of this is that it does not require locks and I don’t need to worry about it preempting itself. We’re losing absolute security (user programs can write over each other’s events), but I think that is something we can safely trust.
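A rough sketch of the static-allocation bookkeeping described above might look like this. Everything here is an assumption made for illustration: the table layout, the function names, the count of 18 counters and the 40-bit counter width (both chosen to resemble a Pentium 4), and the idea that some hook passes in the raw counter value at each switch.

```c
#include <stdint.h>

#define NPMC      18                          /* assumption: number of counters */
#define PMC_FREE  (-1)
#define PMC_MASK  ((UINT64_C(1) << 40) - 1)   /* assumption: 40-bit counters */

int      pmc_owner[NPMC];      /* pid that holds each PMC, or PMC_FREE */
uint64_t pmc_switch_in[NPMC];  /* raw counter value when the owner was switched in */
uint64_t pmc_accum[NPMC];      /* events counted while the owner was running */

void pmc_table_init(void)
{
    int i;

    for (i = 0; i < NPMC; i++) {
        pmc_owner[i] = PMC_FREE;
        pmc_switch_in[i] = 0;
        pmc_accum[i] = 0;
    }
}

/* Static allocation: a PMC belongs to one process until it is released. */
int pmc_table_alloc(int pmc, int pid)
{
    if (pid < 0 || pmc < 0 || pmc >= NPMC || pmc_owner[pmc] != PMC_FREE)
        return -1;
    pmc_owner[pmc] = pid;
    pmc_accum[pmc] = 0;
    return 0;
}

int pmc_table_release(int pmc, int pid)
{
    if (pid < 0 || pmc < 0 || pmc >= NPMC || pmc_owner[pmc] != pid)
        return -1;
    pmc_owner[pmc] = PMC_FREE;
    return 0;
}

/* The counter keeps running across switches; only its value is recorded. */
void pmc_note_switch_in(int pmc, uint64_t raw)
{
    pmc_switch_in[pmc] = raw & PMC_MASK;
}

void pmc_note_switch_out(int pmc, uint64_t raw)
{
    /* Masking the difference handles a counter that wrapped in between. */
    pmc_accum[pmc] += (raw - pmc_switch_in[pmc]) & PMC_MASK;
}
```

Because the table is only ever touched from one place and the counters are never reprogrammed at switch time, nothing here needs a lock, which is the trade the simpler design is making.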
libpmc
The API used to interact with hwpmc is libpmc, so this API is where I can abstract between FreeBSD’s hwpmc and my hwpmc. The new pmc driver shines here: while many of the libpmc functions have to be recreated, the new functions are trivially simple. pmc_allocate() just forms a pair of _taskcall()s to the PM server with the calculated values, as long as the PMC is currently unallocated. pmc_read() is just a wrapper around a RDPMC assembly function.
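As an illustration of how thin a RDPMC wrapper can be, here is a minimal sketch using GCC inline assembly. The names pmc_combine and rdpmc_raw are hypothetical, and this is not the libpmc pmc_read() itself. RDPMC returns the counter in EDX:EAX, and it will fault outside ring-0 unless the PCE bit in CR4 has been set (which is what SETPCE is for).

```c
#include <stdint.h>

/* Join the two 32-bit halves that RDPMC returns in EDX:EAX. */
uint64_t pmc_combine(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read performance counter number `counter` with RDPMC.
 * Faults at user privilege unless CR4.PCE is set. */
uint64_t rdpmc_raw(uint32_t counter)
{
    uint32_t lo, hi;

    __asm__ volatile ("rdpmc"
                      : "=a" (lo), "=d" (hi)
                      : "c" (counter));
    return pmc_combine(lo, hi);
}
#endif
```

Since no system call is involved, a read costs little more than the instruction itself, which is the main payoff of enabling PCE.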
ToDo
The biggest thing on my todo list (after finishing porting the libpmc API functions) is getting logging working. The logs are one of the more complexly locked structures in hwpmc, since every method of entry to hwpmc code needs to be able to record its actions to the log. In addition, this is how some of PMCTools’ additional tools work (by analyzing the logs).
Future
I’d really like to have a complete port of hwpmc for Minix. The benefits are huge: locking mechanisms, virtual mode multiplexing (dynamically reallocating virtual-mode PMCs), process PMC security, cross-processor functionality, and a close relation to an actively developed project. I’m putting this off by making a simple replacement, but I think the advantage outweighs the cost: we get performance-measurement statistics faster, which means more time to make changes/optimize/etc. As Arun suggested, I’m putting the end-to-end model first.