On the value of static tracepoints [LWN.net] (original) (raw)

Please consider subscribing to LWN

Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visitthis page to join up and keep LWN on the net.

As has been well publicized by now, the Linux kernel lacks the sort of tracing features which can be found in certain other Unix-like kernels. That gap is not the result of a want of trying. In the past, developers trying to put tracing infrastructure into the kernel have often run into a number of challenges, including opposition from their colleagues who do not see the value of that infrastructure and resent its perceived overhead. More recently, it would seem that the increased interest in tracing has helped developers to overcome some of those objections; an ongoing discussion shows, though, that concerns about tracing are still alive and have the potential to derail the addition of tracing facilities to the kernel.

Sun's DTrace is famously a dynamic tracing facility, meaning that it can be used to insert tracepoints at (almost) any location in the kernel. But the Solaris kernel also comes with an extensive and well-documented set of static tracepoints which can be activated by name. These tracepoints have been placed at carefully-considered locations which facilitate investigations into what the kernel is actually doing. Many real-world DTrace scripts need only the static tracepoints and do no dynamic tracepoint insertion at all.

There is clear value in these static tracepoints. They represent the wisdom of the developers who (presumably) are the most familiar with each kernel subsystem. System administrators can use them to extract a great deal of useful information without having to know the code in question. Properly-placed static tracepoints bring a significant amount of transparency to the kernel. As tracing capabilities in Linux improve, developers naturally want to provide a similar set of static tracepoints. The fact that static tracing is reasonably well supported (via FTrace) in mainline kernels - with more extensive support available via SystemTap and LTTng - also encourages the creation of static tracepoints. As a result, there have been recent patches adding tracepoints to workqueues and some core memory management functions, among others.

Digression: static tracepoints

As an aside, it's worth looking at the form these tracepoints take; the design of Linux tracepoints gives a perspective on the problems they were intended to solve. As an example, consider the following tracepoints for the memory management code which reports on page allocations. The declaration of the tracepoint looks like this:

#include <linux/tracepoint.h>

TRACE_EVENT(mm_page_allocation,

TP_PROTO(unsigned long pfn, unsigned long free),

TP_ARGS(pfn, free),

TP_STRUCT__entry(
    __field(unsigned long, pfn)
    __field(unsigned long, free)
),

TP_fast_assign(
    __entry->pfn = pfn;
    __entry->free = free;
),

TP_printk("pfn=%lx zone_free=%ld", __entry->pfn, __entry->free)
);

That seems like a lot of boilerplate for what is, in a sense, a switchableprintk() call. But, naturally, there is a reason for each piece. The TRACE_EVENT() macro declares a tracepoint - this one is calledmm_page_allocation - but does not yet instantiate it in the code. The tracepoint has arguments which are passed to at its actual instantiation (which we'll get to below); they are declared fully in the TP_PROTO() macro and named in theTP_ARGS() macro. Essentially, TP_PROTO() provides a function prototype for the tracepoint, while TP_ARGS() looks like a call to that tracepoint.

These values are enough to let the programmer place a tracepoint in the code with a line like:

trace_mm_page_allocation(page_to_pfn(page),
             zone_page_state(zone, NR_FREE_PAGES));

This tracepoint is really just a known point in the code which can have, at run time, one or more function pointers stored into it by in-kernel tracing utilities like SystemTap or Ftrace. When the tracepoint is enabled, any functions stored there will be called with the given arguments. In this case, enabling the tracepoint will result in calls whenever a page is allocated; those calls will receive the page frame number of the allocated page and the number of free pages remaining as parameters.

As can be seen in the declaration above, there's more to the tracepoint than those arguments; the rest of the information in the tracepoint declaration is used by the Ftrace subsystem. Ftrace has a couple of seemingly conflicting goals; it wants to be able to quickly enable human-readable output from a tracepoint with no external tools, but the Ftrace developers also want to be able to export trace data from the kernel quickly, without the overhead of encoding it first. And that's where the remaining arguments to TRACE_EVENT() come in.

When properly defined (the magic exists in a bunch of header files underkernel/trace), TP_STRUCT__entry() adds extra fields to the structure which represent the tracepoint; those fields should be capable of holding the binary parameters associated with the tracepoint. The TP_fast_assign() macro provides the code needed to copy the relevant data into that structure. That data can, with some changes merged for 2.6.30, be exported directly to user space in binary format. But, if the user just wants to see formatted information, the TP_printk() macro gives the format string and arguments needed to make that happen.

The end result is that defining a tracepoint takes a small amount of work, but using it thereafter is relatively easy. With Ftrace, it's a simple matter of accessing a couple of debugfs files. But other tools, including LTTng and SystemTap, are also able to make use of these tracepoints.

The disagreement

Given all the talk about tracing in recent years, there is clearly demand for this sort of facility in the kernel. So one might think that adding tracepoints would be uncontroversial. But, naturally enough, it's not that simple.

The first objection that usually arises has to do with the performance impact of tracepoints, which are often placed in the most performance-critical code paths in the kernel. That is, after all, where the real action happens. So adding an unconditional function call to implement a tracepoint is out of the question; even putting an iftest around it is problematic. After literally years of work, the developers came up with a scheme ~~involving run-time code patching~~ that reduces the performance cost of an inactive tracepoint to, for all practical purposes, zero. Even the most performance-conscious developers have stopped fretting about this particular issue. But, of course, there are others.

A tracepoint exists to make specific kernel information available to user space. So, in some real sense, it becomes part of the kernel ABI. As an ABI feature, a tracepoint becomes set in stone once it's shipped in a stable kernel. There is not a universal agreement on the immutability of kernel tracepoints, but the simple fact is that, once these tracepoints become established and prove their usefulness, changing them will cause user-space tracing tools to break. That means that, even if tracepoints are not seen as a stable ABI the way system calls are, there will still be considerable resistance to changing them.

Keeping tracepoints stable when the code around them changes will be a challenge. A substantial subset of the developer community will probably never use those tracepoints, so they will tend to be unaware of them and will not notice when they break. But even a developer who is trying to keep tracepoints stable is going to run into trouble when the code evolves to the point that the original tracepoint no longer makes sense. One can imagine all kinds of cruft being added so that a set of tracepoints gives the illusion of a very different set of decisions than is being made in a future kernel; one can also imagine the hostile reception any such code will find.

The maintenance burden associated with tracepoints is the reason behind Andrew Morton's opposition to their addition. With regard to the workqueue tracepoints, Andrew said:

If someone wants to get down and optimise our use of workqueues then good for them, but that exercise doesn't require the permanent addition of large amounts of code to the kernel. The same amount of additional code and additional churn could be added to probably tens of core kernel subsystems, but what _point_ is there to all this? Who is using it, what problems are they solving?

We keep on adding all these fancy debug gizmos to the core kernel which look like they will be used by one person, once. If that!

Needless to say, the tracing developers see the code as being more widely useful than that. Frederic Weisbecker gave a detailed description of the sort of debugging which can be done with the workqueue tracepoints. Ingo Molnar's response appears to be an attempt to hold up the addition of other types of kernel instrumentation until the tracepoint issue is resolved. Andrew remains unconvinced, though; it seems he would rather see much of this work done with dynamic tracing tools instead.

As of this writing, that's where things stand. If these tracepoints do not get into the mainline, it is hard to see developers going out and creating others in the future. So Linux could end up without a set of well-defined static tracepoints for a long time yet - though it would not be surprising to see the enterprise Linux vendors adding some to their own kernels. Perhaps that is the outcome that the development community as a whole wants, but it's not clear that this feeling is universal at this time. If, instead, Linux is going to end up with a reasonable set of tracepoints, the development community will need to come to some sort of consensus on which kinds of tracing instrumentation is acceptable.

Index entries for this article
Kernel	Development tools/Kernel tracing
Kernel	Ftrace
Kernel	Tracing