[RFC] Debug info coverage tool v2
Summary
We* propose to contribute a new extension to llvm-dwarfdump for measuring how well local variables are covered by debug info (e.g. DWARF). This improves on previous approaches to coverage measurement along several dimensions. The initial version proposed here would focus on DWARF, but support for other debug info formats can be added in future work.
* “We” here means J. Ryan Stinnett and Stephen Kell at King’s College London with support from the Sony / SN Systems team.
1 Summary of changes from v1 RFC
- We propose to provide the measurement as new options in llvm-dwarfdump, not a standalone tool.
- We propose to implement two measurements, not just one.
  - One measures variable coverage in terms of source lines*, as discussed in v1 (but re-explained in 2.1 below).
  - The other measures the coverage of the optimised line table itself. This is in response to requests to “tease apart” the issues of line table coverage and variable coverage. We propose (an idea due to Stephen Livermore-Tozer) a “per-variable projected” form of this that ensures a meaningful relationship with the variable coverage numbers (explained in 2.2 below).
  - (* We say “lines”, but this could be extended to column ranges or any other enumerable subdivision of the source coordinate space. We will use lines in our initial contribution. The baseline is explained below.)
- The variable coverage measurement is essentially unchanged from before, but we can now state precisely what its dependence on the line table is (see 2.2 below).
- Unlike before, the tool’s default behaviour will be to measure variable coverage against a baseline derived from the same binary’s line table, so e.g. an optimised build uses its own optimised line table to calculate the baseline. Unlike before, this will not require any additional input files to llvm-dwarfdump.
  - In this way, a simple invocation of llvm-dwarfdump will show something meaningful -- albeit tending to be over-generous to the compiler (see 3.5 below).
  - At the same time, we will also “pluggably” support supplying a more accurate baseline where this is available, e.g. from an -O0 build. This will work by passing an additional option to llvm-dwarfdump.
The discussion for the v1 RFC underlined that there is something conceptually tricky at the heart of this: our proposed measurements rely on a baseline, i.e. the set of source lines believed to be coverable. There are many serviceable ways to obtain this baseline, but no perfect way; all ways involve some kind of approximation. That is why a “pluggable” baseline is an important part of the design. All of this is elaborated below.
2 What is proposed
2.1 Variable coverage in terms of source lines
Our approach makes two main departures from previous attempts at measuring coverage of debugging information:
- Measure coverage in terms of source lines
- For each variable, calculate its defined ranges. We only expect a variable to be covered over those lines where it takes a defined value.
By measuring coverage in source lines instead of bytes, the measurement is comparable across compilations and better aligned with the typical debugging user experience. By including in the baseline only those lines where the variable being examined is defined, 100% coverage becomes attainable for all variables.
By combining these adjustments, our approach offers an accurate and achievable coverage metric. Variable storage (stack vs. register) also does not affect coverage attainability, whereas previous metrics accidentally favoured on-stack locals.
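To make the metric concrete, the following is a minimal sketch of the per-variable computation, using assumed data structures rather than anything from the eventual llvm-dwarfdump implementation: a variable’s coverage is the fraction of its coverable (i.e. defined) lines for which the debug info supplies a location.

    #include <algorithm>
    #include <iterator>
    #include <set>

    struct VariableCoverage {
      std::set<unsigned> CoverableLines; // baseline: lines where the variable is defined
      std::set<unsigned> CoveredLines;   // lines over which debug info supplies a location
    };

    // Coverage in [0, 1]; 1.0 means every coverable (i.e. defined) line is covered.
    double coverageFraction(const VariableCoverage &V) {
      if (V.CoverableLines.empty())
        return 1.0; // nothing is expected to be covered, so count as fully covered
      std::set<unsigned> Both;
      std::set_intersection(V.CoveredLines.begin(), V.CoveredLines.end(),
                            V.CoverableLines.begin(), V.CoverableLines.end(),
                            std::inserter(Both, Both.begin()));
      return double(Both.size()) / double(V.CoverableLines.size());
    }

Treating a variable with no coverable lines as fully covered is just one possible convention for this sketch; the real tool may report such cases differently.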
2.2 Measuring the line table’s coverage in isolation
In addition to the variable coverage metric above, we plan to measure per-variable line table coverage, which gives the highest coverage each variable could possibly have given the optimised line table. This is particularly useful for disambiguating cases where the above variable coverage metric shows less than 100% coverage. By also checking line table coverage for the same variable, it becomes clear whether the missing variable coverage is due to source lines missing from the line table or instructions missing from variable location ranges.
More detail on this per-variable line table metric (including formulas and a worked example) is available for those who are interested.
This new line-table-only measurement is in response to comments that our original proposal “entangled” variable coverage with line table coverage. The entanglement is only in one direction, and is more precisely stated as follows: the measured coverage for any variable is relative to the coverage of the line table being used as the baseline. If lines are wrongly omitted from that line table, they can never show as coverable by any variable.
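For readers who want a more formal statement, here is one plausible way to write the two metrics down (illustrative notation only; the linked write-up has the authoritative formulas). Write D_B(v) for the baseline set of lines on which variable v is defined, T for the set of lines appearing in the optimised line table, and L(v) for the set of lines on which the debug info supplies a location for v:

    \[
      \mathrm{varcov}(v) \;=\; \frac{\lvert L(v) \cap D_B(v) \rvert}{\lvert D_B(v) \rvert},
      \qquad
      \mathrm{ltcov}(v) \;=\; \frac{\lvert T \cap D_B(v) \rvert}{\lvert D_B(v) \rvert}
    \]

Assuming a variable’s covered lines are obtained via that same line table, L(v) ⊆ T and hence varcov(v) ≤ ltcov(v): any shortfall in the per-variable line table coverage bounds the coverage the variable metric can report, which is exactly the one-directional dependence described above. (On this formulation, ltcov(v) only becomes informative when the baseline is derived independently of the optimised line table, e.g. from the -O0 baseline discussed in 3.5.)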
3 Detailed rationale
3.1 Existing tools
Existing tools that compute some form of debug info coverage include llvm-dwarfdump and debuginfo-quality. These tools have been useful for measuring broad trends in variable coverage over time, but we believe our approach brings valuable further improvements.
Coverage should be measured in terms of source coordinates. Existing tools use instruction bytes. The selection of instructions will vary across compilers and compilation options, meaning coverage values are not easily comparable. Optimisations that significantly change the number of bytes (e.g. adding bytes by unrolling loops, or removing them by deleting code) distort the measured coverage, even though debugging users are generally not affected by such changes: they step by source lines, set breakpoints at particular lines, and so on. While a bytes-based coverage metric can still identify some regressions, a source-coordinates-based metric can more reliably detect issues that affect users.
The baseline should account for variables’ definedness. With existing tools, the bar for full coverage is set at the entire parent scope (block / function) for all variables. But imagine a variable which is first defined (i.e. first assigned / written to, independent of when it was declared) half-way down a function. Optimising compilers typically won’t emit debug coverage for that variable until after it is first defined, so such a variable will be counted as covered for less than the entire parent scope. This means 100% coverage under such a metric is unattainable for these variables, which makes the metric less actionable. It also accidentally biases the metric towards unoptimised compilations for incidental reasons: e.g. when the same variable is placed on the stack for its whole lifetime, it is wrongly counted as fully covered even if that stack slot only holds a meaningful value for the same 50% of the block.
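As a hypothetical illustration of the definedness point (the code and names here are ours, purely for exposition), consider a variable that is declared at the top of a function but only assigned part-way through:

    // `Total` is declared at the top of the function but only becomes defined at
    // its first assignment, part-way down. An optimising compiler typically emits
    // no location for `Total` before that point.
    int sumIfPositive(const int *Xs, int N) {
      int Total;               // declared, but not yet defined
      if (N <= 0)
        return 0;              // `Total` never becomes defined on this path
      Total = 0;               // first definition: coverage is only expected from here
      for (int I = 0; I < N; ++I)
        Total += Xs[I];
      return Total;
    }

Under a whole-scope baseline, Total can never reach 100% coverage in an optimised build; under a definedness-based baseline, full coverage is attainable because the early lines are simply not coverable for Total.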
3.2 Further detail available
For the variable coverage metric (2.1), we have previously shared our approach via a EuroLLVM 2024 talk and CC 2024 paper. The talk and paper contain a more detailed account, along with experimental evaluations from our research prototype, which used a static analysis approach specific to C. The present proposal differs from these by instead using only language-agnostic data sources (as one might expect for LLVM tools).
For the per-variable line table metric (2.2), more detail is available with formulas and a worked example.
3.3 Use cases
There are quite a few potential use cases for this debug info coverage data in detecting both coverage regressions and improvements, including:
- Tracking over time (as in LLVM nightly tester)
- Pre-merge comparison (similar to LLVM compile time tracker)
- Some kind of coverage view in Compiler Explorer
- Integration tests
We suspect there are other potential applications as well. (Let us know if you think of any!)
Sony / SN Systems intends to run a bot that would make use of these coverage approaches.
3.4 Data sources
Both our metrics work by enumerating covered lines and coverable lines. Coverage is the fraction of coverable lines that are covered. To compute these we use:
- the debug info whose coverage is to be measured
- a baseline set of coverable lines (computed using a line table; see below)
- per-variable definedness information (computed from IR, similar to a liveness analysis; see below). In effect this further narrows the baseline for each variable, so that we do not expect coverage in places where the variable has no defined value.
The latter two pieces of input can come from multiple places; this is the motivation for “pluggable baselines”. Our proposed default behaviour is for llvm-dwarfdump to use the line table from the optimised binary. But besides this default, we also allow “pluggable baselines” specified by a command-line option.
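To make the per-variable definedness input concrete, here is a minimal sketch of the kind of IR-level analysis we have in mind. The data structures are assumed, simplified stand-ins rather than LLVM’s real IR classes, and the sketch takes an optimistic “defined along some path” view, ignoring refinements the real analysis may need:

    #include <cstddef>
    #include <queue>
    #include <set>
    #include <vector>

    // Assumed, simplified shapes (not LLVM's real IR classes).
    struct Inst { unsigned Line; bool DefinesVar; };
    struct Block { std::vector<Inst> Insts; std::vector<std::size_t> Succs; };

    // Lines on which one particular variable is treated as defined: within a block
    // that defines it, the lines at and after the first definition; plus every line
    // of any block reachable from a defining block.
    std::set<unsigned> definedLines(const std::vector<Block> &CFG) {
      std::set<unsigned> Lines;
      std::set<std::size_t> DefinedOnEntry; // blocks the variable reaches already defined
      std::queue<std::size_t> Work;

      for (std::size_t B = 0; B != CFG.size(); ++B) {
        bool Defined = false;
        for (const Inst &I : CFG[B].Insts) {
          Defined = Defined || I.DefinesVar;
          if (Defined)
            Lines.insert(I.Line);
        }
        if (Defined)
          for (std::size_t S : CFG[B].Succs)
            Work.push(S); // defined on entry to every successor of a defining block
      }

      // Forward reachability: propagate "defined on entry" through the CFG.
      while (!Work.empty()) {
        std::size_t B = Work.front();
        Work.pop();
        if (!DefinedOnEntry.insert(B).second)
          continue; // already processed
        for (const Inst &I : CFG[B].Insts)
          Lines.insert(I.Line);
        for (std::size_t S : CFG[B].Succs)
          Work.push(S);
      }
      return Lines;
    }

The resulting per-variable line set is then intersected with the baseline line table to give that variable’s coverable lines.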
3.5 Pluggable baselines making use of unoptimised builds
In the approach outlined above, we are trusting the line table of an optimised build as our baseline. This prevents us from distinguishing two effects: (1) lines erroneously omitted from the line table, even though a variable is fully debug-covered over all relevant code, and (2) gaps in the coverage of a particular variable, despite a complete and correct line table.
Using just the default baseline (the optimised line table; we will call this the “-O2 baseline”), we will be overly generous to the compiler by allowing effect (2) to be offset by effect (1). Where lines are wrongly omitted from the optimised line table, they are also omitted from our baseline “coverable” set, so no variable can ever be counted as uncovered over those lines. An erroneously high coverage level would therefore be reported for variables affected in this way.
As @adrian.prantl has noted, it is not possible to observe a difference between these two effects at debug time if using only source-level debug operations like linewise stepping or source-level breakpoints. However, we can still distinguish them when using the debugger in other, common ways. Consider stepping instruction by instruction and printing a given variable at each step. In (1) we would see the complete life cycle of the variable progressing before us, in terms of the values it takes, but attributed to the wrong source lines. In (2) we would see gaps at some steps, missing out part of the variable’s life cycle, while appearing to visit the correct set of source lines.
To avoid over-reported coverage in such cases, we propose an optional usage mode that supplies a separate baseline derived from an unoptimised build. One obvious candidate is to use the -O0 DWARF and -O0 IR to get the set of expected lines; we call this the “-O0 baseline”. A more refined option might use -O0 plus some reachability analysis. This is likely to be the fairest option, since it avoids counting unreachable lines that might otherwise wrongly be included because no dead code elimination has run at plain -O0. For the moment we leave open the question of what the “best” baseline is, but we will design in the ability for llvm-dwarfdump to accept an optional second input file providing the baseline. We will implement support for at least the “-O2 baseline” and the “-O0 baseline”.
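In code terms, the two baselines differ only in which line set gets intersected with a variable’s defined lines; a minimal sketch follows, where all of the set names are illustrative assumptions about inputs computed elsewhere.

    #include <algorithm>
    #include <iterator>
    #include <set>

    // Illustrative line sets, assumed to be computed elsewhere:
    //   OptLineTable - lines present in the optimised binary's line table
    //   O0LineTable  - lines present in a matching -O0 build's line table
    //   Reachable    - lines not shown to be unreachable by a reachability analysis
    //   Defined      - lines on which the variable of interest is defined
    static std::set<unsigned> intersect(const std::set<unsigned> &A,
                                        const std::set<unsigned> &B) {
      std::set<unsigned> R;
      std::set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                            std::inserter(R, R.begin()));
      return R;
    }

    // "-O2 baseline": the coverable set is drawn from the optimised line table.
    std::set<unsigned> coverableO2(const std::set<unsigned> &OptLineTable,
                                   const std::set<unsigned> &Defined) {
      return intersect(OptLineTable, Defined);
    }

    // "-O0 baseline" with a reachability filter: the coverable set is drawn from
    // the unoptimised line table, minus lines that only survive because no dead
    // code elimination has run at -O0.
    std::set<unsigned> coverableO0(const std::set<unsigned> &O0LineTable,
                                   const std::set<unsigned> &Reachable,
                                   const std::set<unsigned> &Defined) {
      return intersect(intersect(O0LineTable, Reachable), Defined);
    }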
In earlier discussion, @pogo59 questioned the reasonableness of using unoptimised code as a baseline, noted that in the case of optimisations that delete code “there [may] be no instructions in the final object file for those source statements/expressions”, and questioned whether full coverage is realistically achievable given three example concerns: “CSE, dead-store removal, and unreachable code removal”. All are examples of “code-deleting optimizations”. We address this in two parts.
Firstly, we note that unreachable code deletion is different from the other two kinds of optimisation. Consider the failure to set a breakpoint on uncovered code. If the code is unreachable, this failure never degrades the debug experience. In the other two cases, it does: if the source line is reached, but the breakpoint does not fire because it happens to land on optimiser-folded code, the developer will be confused. This is why we propose a reachability filter over the -O0 baseline (above), avoiding spurious reports of coverage gaps due to eliminated unreachable code, whereas we intend that our coverage metric will flag up omissions of the other two kinds.
Secondly, we consider it fair to flag a coverage gap even when no instructions remain for the affected code, because in bleeding-edge DWARF, location views allow a source variable to be covered over a zero-length range of instructions. This feature can be seen as allowing the line table to introduce “fractional program counter positions” that re-create intermediate states not modelled at any instruction in the optimised code. LLVM does not currently use this feature, but its use in future would enable debug coverage gains that would be measurable under our approach.
4 Workflows for users of the tool
For a binary to be measurable by the tool, the tool will need access to its IR, in order to perform the def-use analysis that provides each variable’s defined ranges. There are many ways this could work, but clearly this places some burden on end users to generate binaries “the right way”. The primary use case of this coverage tool is imagined to be in occasionally-run automated jobs, where this kind of wrangling should not be too onerous. Our ultimate goal is to minimise the end user effort, but we will need to work on this in stages.
Our tentative proposal is to maximise overlap with current support for storing IR in object files for later use in LTO. For a single .o file, the fat LTO object mode may suffice (Clang only supports this for ELF at the moment), and the tool could output a message such as “rebuild with -ffat-lto-objects” if given input that lacks IR. For fully-linked objects, there is no equivalent yet. Also, when using the more accurate -O0 baseline derived from the unoptimised line table, we require not only the optimised IR but also an unoptimised build and the ability to correlate the two, e.g. by parsing the linker map. So we envisage a tool advising users something like “rebuild with -O0 -Wl,-Map=somefile.map and pass the map file to --coverage-baseline”, but additionally generating combined IR using the IR linker. For these reasons, initial versions will make use of build wrappers, most likely using wllvm. We’ll include examples of the necessary build wrangling both in the tool documentation and as part of testing the tool itself.
Future work could add variable liveness information to DWARF. This would be useful at debug time (more detail below) and would remove the need for the tool to take LLVM IR as input.
5 Future improvements
We can imagine lots of ways to improve this for the future, even though they are not part of our initial plan.
5.1 Add variable liveness to DWARF
It would be ideal for this coverage tool, as well as other analysis tools, if DWARF described the first defined and last used points for source variables. Future work could explore a DWARF extension to capture this during compilation, and then adjust the coverage tool to make use of it. This would simplify the coverage tool both internally and at time of use, as we’d no longer need to examine LLVM IR. Beyond the coverage use case here, debuggers could warn about use of uninitialised values, tracers could warn before printing bogus data, etc. This would obviously require its own RFC and communication with the DWARF committee if it were pursued.
5.2 Investigate finer-grained coverage
It would be nice to increase coverage precision by going beyond line granularity in some way. This would be particularly helpful for language features like loop headers, which are made up of several expressions that might all occupy a single source line but which execute at different times in the running program. It may also be helpful for other constructs like function calls with computations in their arguments and similar expressions which do not have a source line all to themselves.
It’s not immediately obvious how best to go beyond lines when using the DWARF line table as-is, since an instruction is mapped to a single source position, not a source region (with start and end) as you’d have in a source language AST. While you could perhaps extrapolate a region by joining adjacent line table rows (and stopping when you see the end_sequence flag), it is not clear if such data would be reliable, as line table gaps would imply unintentionally inflated regions. Future work could explore ways of improving precision here.
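For concreteness, here is a minimal sketch of the row-joining idea just described; the Row shape is an assumed simplification, not the real DWARF line-table data structure.

    #include <cstddef>
    #include <vector>

    // Assumed row shape for this sketch; real DWARF line-table rows carry more
    // fields (file, column, is_stmt, ...).
    struct Row { unsigned long long Addr; unsigned Line; bool EndSequence; };
    struct LineRegion { unsigned StartLine; unsigned EndLine; };

    // Pair each row's line with the next row's line to extrapolate a source
    // region, and stop at sequence boundaries. Gaps in the line table would make
    // some of these regions unintentionally large, which is the reliability
    // concern mentioned above.
    std::vector<LineRegion> extrapolateRegions(const std::vector<Row> &Rows) {
      std::vector<LineRegion> Regions;
      for (std::size_t I = 0; I + 1 < Rows.size(); ++I) {
        if (Rows[I].EndSequence || Rows[I + 1].EndSequence)
          continue; // do not extrapolate across an end_sequence boundary
        Regions.push_back({Rows[I].Line, Rows[I + 1].Line});
      }
      return Regions;
    }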
There’s also a separate dimension to consider: whether the whole variable is covered vs. only some fraction of the bits it contains. Our initial version assumes any coverage of a variable covers the whole variable, but future work may wish to be more precise here.
5.3 Support debug info formats beyond DWARF
As already mentioned, we plan to support only DWARF debug info for our initial work, but there’s nothing about the approach that is specific to DWARF. It would be great to see this approach applied to other debug info formats in a single tool.
Acknowledgements
Thanks to everyone who has provided feedback on this along the way. Adrian Prantl gave quite helpful advice in discussions at EuroLLVM. The Sony / SN Systems team is assisting us with this effort and reviewed an earlier draft of this RFC. We received valuable community feedback on earlier versions of this RFC.
This material is based upon work supported by the Engineering & Physical Sciences Research Council (EPSRC) under grant EP/W012308/1, an Impact Acceleration Award from King’s College London, and the Defense Advanced Research Projects Agency (DARPA) under contract HR001124C0488. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any sponsor or supporter.