[RFC] Debug info coverage tool

jryans September 9, 2024, 1:15pm 1

Summary

We (J. Ryan Stinnett and Stephen Kell at King’s College London with support from the Sony / SN Systems team) propose to contribute a new tool for measuring how well local variables are covered by debug info (e.g. DWARF) which improves on previous coverage approaches. The initial version proposed here would focus on DWARF, but support for other debug info formats can be added by future work.

Coverage approach

Existing tools

Existing tools that compute some form of debug info coverage include llvm-dwarfdump and debuginfo-quality. The approach used in these tools has a few problems.

Coverage is measured in terms of instruction bytes. Instruction bytes are problematic for debug info coverage in several ways. The emitted instructions vary across compilers and compiler options, meaning coverage values are not easily comparable. Optimisations that significantly change the number of bytes (adding bytes by e.g. unrolling loops, or removing them entirely) can distort coverage results. Additionally, debugging users are most often stepping by source lines, so a bytes-based coverage metric is not a good match for spotting issues that affect users.

Full coverage is defined as the entire parent scope (block / function) for all variables. To understand the issue of using the entire parent scope as the coverage target, imagine a variable which is first defined (not declared, but first assigned / written to) half way down a function. Optimising compilers won’t emit debug coverage for that variable until after it is first defined. Such a variable would never be covered for the entire parent scope even by an “ideal” optimising compiler, and thus 100% coverage under such a metric is unattainable for these variables. This makes it hard to discern whether less-than-perfect coverage can be improved. It also accidentally biases towards unoptimised compilations (where variables are placed on the stack for their whole lifetime).

Our approach

To remedy these issues, our approach makes several adjustments:

  1. By measuring coverage in source lines instead of bytes, the measurement is comparable across compilations and better aligned with the typical debugging user experience.
  2. By including in the baseline only those lines where the variable being examined is defined, 100% coverage becomes attainable for all variables.

By combining these adjustments, our approach offers an accurate and achievable coverage metric. Variable storage (stack vs. register) also does not affect coverage attainability, whereas previous metrics accidentally favoured on-stack locals, because these tend to have ranges covering the whole scope, unlike register locations, which usually cover only the ranges where the variable is defined.
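To make the calculation concrete, here is a minimal sketch of the per-variable metric (illustrative Python only; the function and its inputs are hypothetical and not part of the proposed tool’s interface):

```python
# Toy sketch of the per-variable metric: coverage is the fraction of the
# variable's *defined* lines (first definition to end of scope) that the
# optimised debug info actually describes, so 100% is attainable.
def variable_coverage(defined_lines, covered_lines):
    if not defined_lines:
        return 1.0  # nothing coverable for this variable
    return len(covered_lines & defined_lines) / len(defined_lines)

# Hypothetical example: a variable defined over lines 12-20, but only
# described by the optimised debug info on lines 12-17.
print(variable_coverage(set(range(12, 21)), set(range(12, 18))))  # ~0.67
```

In the real tool, the defined lines would come from the unoptimised baseline described under “Data sources” below, and the covered lines from mapping the optimised build’s variable location ranges through its line table.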

Further detail available

We have previously shared our debug info coverage approach via a EuroLLVM 2024 talk and a CC 2024 paper. The talk and paper contain a more detailed account along with experimental evaluations from our research prototype, which used a static analysis approach specific to C language programs. This proposal takes a different approach by using language-agnostic data sources (as one might expect for LLVM tools).

Use cases

There are quite a few potential use cases for this debug info coverage data, including:

We suspect there are other potential applications as well. (Let us know if you think of any!)

Data sources

To compute our metric, there are 3 major sources of data needed:

As a data source for the source lines to be covered, we intend to use the DWARF line table from an unoptimised compilation. This also ensures our baseline only counts lines with meaningful computation (e.g. it skips blank lines, comments, etc.), as we assume the unoptimised line table only includes the lines we actually care about. For our initial version, we will get first definition data from a liveness analysis of source variables in unoptimised LLVM IR. It would be ideal if DWARF also contained variable first definition point(s) (or liveness generally), but that is not the case today.
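As a rough illustration of the line table side of this, collecting the baseline set of source lines from an unoptimised object might look like the sketch below (using pyelftools purely for brevity; the actual tool would use LLVM’s own DWARF reader, so this is not the tool’s design):

```python
# Sketch: gather the set of (file index, line) pairs present in the line
# table of an unoptimised object, to serve as the coverage baseline.
from elftools.elf.elffile import ELFFile

def baseline_lines(object_path):
    lines = set()
    with open(object_path, 'rb') as f:
        elf = ELFFile(f)
        if not elf.has_dwarf_info():
            return lines
        dwarf = elf.get_dwarf_info()
        for cu in dwarf.iter_CUs():
            lineprog = dwarf.line_program_for_CU(cu)
            if lineprog is None:
                continue
            for entry in lineprog.get_entries():
                state = entry.state
                # Entries that only adjust the state machine carry no row;
                # end_sequence rows just terminate an address range.
                if state is None or state.end_sequence:
                    continue
                # state.file is an index into the CU's file table; resolving
                # it to a path is omitted in this sketch.
                lines.add((state.file, state.line))
    return lines

# Hypothetical usage:
# print(len(baseline_lines('foo.O0.o')))
```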

Alternative data sources

We also considered a source-language static analysis pass to find the baseline source lines and first definition points, and in fact our research prototype used this approach for C language programs. However, this would mean writing a static analysis for every potential source language, making the tool much harder to adopt for each new source language. We believe our source-language-agnostic design above is a better fit as an LLVM tool.

Tool home

We propose to add a new llvm-debuginfo-coverage tool to compute this. We know we’ll need to take in multiple build outputs (not just debug info from a single compilation), which makes our tool a bit different from existing LLVM tools like llvm-dwarfdump. It will also give us a bit more freedom to experiment with coverage output formats for people and tools without worrying about expectations users may have.

Alternative tool homes

Of existing LLVM tools, we also considered llvm-dwarfdump, as that’s the closest existing tool, especially with its --statistics mode. llvm-dwarfdump is mainly thought of as a DWARF pretty printer, taking only the file(s) to be analysed. We would need to take in a few additional inputs, which may be awkward to add to the llvm-dwarfdump CLI. Additionally, our coverage approach is not DWARF-specific. Although we only plan to support DWARF initially, the coverage tool could be expanded to support other formats in the future. For these reasons, we believe a new tool is a better fit.

Workflow

The initial version of the tool will consume DWARF and LLVM IR from an unoptimised build as the baseline, along with the DWARF from the optimised build being analysed. We acknowledge it’s a bit awkward to wrangle build systems to produce all of these (particularly the LLVM IR, which may necessitate build wrappers like wllvm), but we believe this is acceptable for an initial version of the tool. Future work (more detail below) could add variable liveness to DWARF, which would remove the need for the LLVM IR input and simplify usage of the tool. The primary use case of this coverage tool is imagined to be in occasionally-run automated jobs, so hopefully scripting together those inputs is not too onerous. We’ll include examples of this build wrangling in both the tool documentation and as part of testing the tool itself.
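For a single translation unit, that wrangling might look roughly like this (the clang invocations are ordinary; the final coverage-tool invocation and its flags are hypothetical, since the tool does not exist yet):

```python
# Sketch of the per-file build wrangling for one translation unit.
import subprocess

SRC = 'foo.c'  # hypothetical input file

# Unoptimised IR, used for the first-definition / liveness analysis.
subprocess.run(['clang', '-g', '-O0', '-S', '-emit-llvm', SRC, '-o', 'foo.O0.ll'], check=True)
# Unoptimised object, providing the baseline line table.
subprocess.run(['clang', '-g', '-O0', '-c', SRC, '-o', 'foo.O0.o'], check=True)
# Optimised object, whose debug info is actually being measured.
subprocess.run(['clang', '-g', '-O2', '-c', SRC, '-o', 'foo.O2.o'], check=True)

# Hypothetical coverage-tool invocation (illustrative only; flags TBD):
# subprocess.run(['llvm-debuginfo-coverage',
#                 '--baseline-ir=foo.O0.ll', '--baseline-object=foo.O0.o',
#                 'foo.O2.o'], check=True)
```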

Future improvements

We can imagine lots of ways to improve this for the future, even though they are not part of our initial plan.

Add variable liveness to DWARF

It would be ideal for this coverage tool as well as other analysis tools if DWARF described the first defined and last used points for source variables. Future work could explore a DWARF extension to capture this during compilation, and then adjust the coverage tool to make use of it. This would simplify the coverage tool both internally and at time of use, as we’d no longer need to examine LLVM IR. Beyond the coverage use case here, debuggers could warn about use of uninitialised values, tracers could avoid printing bogus data, etc. This would obviously require its own RFC and communication with the DWARF committee if it were pursued.

Investigate finer-grained coverage

It would be nice to increase coverage precision by going beyond line granularity in some way. This would be particularly helpful for language features like loop headers, which are made up of several expressions that might all occupy a single source line but which execute at different times in the running program. It may also be helpful for other constructs like function calls with computations in their arguments and similar expressions which do not have a source line all to themselves.

It’s not immediately obvious how best to go beyond lines when using the DWARF line table as-is, since an instruction is mapped to a single source position, not a source region (with start and end) as you’d have in a source language AST. While you could perhaps extrapolate a region by joining adjacent line table rows (and stopping when you see the end_sequence flag), it is not clear if such data would be reliable, as line table gaps would imply unintentionally inflated regions. Future work could explore ways of improving precision here.

There’s also a separate dimension to consider: whether the whole variable is covered vs. only some fraction of the bits it contains. Our initial version assumes any coverage of a variable covers the whole variable, but future work may wish to be more precise here.

Support debug info formats beyond DWARF

As already mentioned, we plan to support only DWARF debug info for our initial work, but there’s nothing about the approach that is specific to DWARF. It would be great to see this approach applied to other debug info formats in a single tool.

Acknowledgements

Thanks to everyone who has provided feedback on this along the way. Adrian Prantl gave quite helpful advice in discussions at EuroLLVM. The Sony / SN Systems team is assisting us with this effort and reviewed an earlier draft of this RFC.

pogo59 September 9, 2024, 5:25pm 2

Have you looked at all at llvm-debuginfo-analyzer? It is already set up to take multiple inputs and compare information from them, so perhaps a new “mode” of the analyzer would simplify your work. I believe it also already understands CodeView.

It sounds like your tool would need to take 3 inputs (-O0 IR, -O0 object file, and optimized object file) which would be a new trick, but as the -O0 IR is specifically needed to identify live ranges, it seems like that would be kind of its own thing that would modify the understanding of the -O0 object file. Then comparing the two object files would be very natural in the analyzer.

jryans September 10, 2024, 12:30pm 3

Ah hmm, I am aware of llvm-debuginfo-analyzer (and even did some code reviews as it was landing), but we hadn’t considered it for this project until you mentioned it just now. I think of llvm-debuginfo-analyzer as focused on lower-level debug info analysis tasks for questions like “how exactly is variable foo represented in these debug info files?”. Our coverage task feels more like a high-level summary view. I’m also concerned that if coverage were an extra mode of llvm-debuginfo-analyzer, it might get a bit lost, as llvm-debuginfo-analyzer is somewhat of a complex tool with many options. For our coverage work, a more focused tool seems like a better fit.

I think at the moment, a separate tool is a better fit, but as we’re building it, we’ll keep llvm-debuginfo-analyzer in mind, and potentially rethink the plan if it appears we’re duplicating many things it already offers.

Yes, llvm-debuginfo-analyzer already does. In terms of expanding coverage tooling to debug info formats beyond DWARF, I would expect most of the format-specific code we might need already exists in the form of LLVM’s readers for each format, so there’s not necessarily lots of extra code needed for other formats. We’re intending to focus on DWARF for an initial version mainly because we have the most experience with this format, and we’d like to keep the scope of the initial version contained where we can.

Yes, that’s right, the O0 IR is only there to “fill in” information that debug info currently lacks, so it would just “extend” our baseline data (until some future date where debug info might carry this extra info).

pogo59 September 10, 2024, 2:27pm 4

Okay, for exploratory/prototype work you could well be better off starting with a separate tool. The debuginfo-analyzer libraries might still be useful, just sayin’.

The high-level goal sounds great, and solving the can’t-ever-get-to-100%-coverage problem is especially awesome.

Thanks for working on this! I’m very optimistic about the potential for using this in a CI checking context, since we don’t currently have any way to measure variable coverage that gives a fair comparison across different optimization pipelines or compiler versions. The ability to clearly measure the effect of specific commits on variable coverage is a big missing metric for LLVM, and should make it much easier to identify where regressions (or improvements) are coming from!

jryans September 13, 2024, 11:17am 6

Thanks to everyone who’s left feedback so far. At the moment, it sounds like everyone is positive on the idea overall and okay with the design we’ve proposed.

I’d like to get reactions from a few more people in the debug info area if possible… Perhaps @adrian.prantl or @dblaikie may have thoughts?

If anyone does have concerns, it would be great to hear those now so we can adjust the design upfront if needed.

IIRC, we talked about this at EuroLLVM and I suggested just implementing a better metric in llvm-dwarfdump. I came up with the current metrics a week before a dev meeting to prove a point (https://llvm.org/devmtg/2018-10/slides/Prantl-Kumar-debug-info-bof-2018.pdf) and they were never meant to be an end point in the evolution of dwarfdump --statistics. Maybe I’m misremembering, but I think I suggested counting the unique line table entries within a variable’s lexical scope at EuroLLVM, because that’s something that’s available in dwarfdump.

Full coverage is defined as the entire parent scope (block / function) for all variables. To understand the issue of using the entire parent scope as the coverage target, imagine a variable which is first defined (not declared, but first assigned / written to) half way down a function. Optimising compilers won’t emit debug coverage for that variable until after it is first defined. Such a variable would never be covered for the entire parent scope even by an “ideal” optimising compiler, and thus 100% coverage under such a metric is unattainable for these variables. This makes it hard to discern whether less-than-perfect coverage can be improved. It also accidentally biases towards unoptimised compilations (where variables are placed on the stack for their whole lifetime).

I think you’re looking at this from a very Clang-specific perspective (other frontends start a new lexical scope for each variable declaration), but I agree that it makes it harder to judge how good coverage is, because you don’t know what the number for the best case would look like.

In my opinion, these two arguments make perfect sense, but they don’t justify creating yet another tool.

Use cases

There are quite a few potential use cases for this debug info coverage data, including:

Just want to point out to everyone in this thread that we have historical data going almost a decade back:

https://green.lab.llvm.org/job/llvm.org/view/LLDB/job/clang-3.4-debuginfo-statistics/

Unfortunately the link doesn’t work right now because the LLVM LNT instance has issues (LNT Server Status)

I think that would be great!

Data sources

To compute our metric, there are 3 major sources of data needed:

As a data source for the source lines to be covered, we intend to use the DWARF line table from an unoptimised compilation. This also ensures our baseline only counts lines with meaningful computation (e.g. it skips blank lines, comments, etc.), as we assume the unoptimised line table only includes the lines we actually care about.

Can you explain what this adds over just enumerating the distinct source locations in the line table inside a variable’s lexical scope? In the end a variable can only be inspected at a break point, so to me it sounds like the only interesting information is: “In how many possible break point locations inside this variable’s scope can I see the variable?”

Alternative data sources

We also considered a source-language static analysis pass to find the baseline source lines and first definition points, and in fact our research prototype used this approach for C language programs. However, this would mean writing a static analysis for every potential source language, making the tool much harder to adopt for each new source language. We believe our source-language-agnostic design above is a better fit as an LLVM tool.

Agreed, I think this would make it a lot less useful.

Tool home

We propose to add a new llvm-debuginfo-coverage tool to compute this. We know we’ll need to take in multiple build outputs (not just debug info from a single compilation), which makes our tool a bit different from existing LLVM tools like llvm-dwarfdump. It will also give us a bit more freedom to experiment with coverage output formats for people and tools without worrying about expectations users may have.

I really disagree with this. We made dwarfdump --statistics output versioned and extensible, precisely so we could add any metric people find useful in the future. Can you give a concrete example of what you would like to output that would not fit into the existing dwarfdump harness?

Alternative tool homes

Of existing LLVM tools, we also considered llvm-dwarfdump, as that’s the closest existing tool, especially with its --statistics mode. llvm-dwarfdump is mainly thought of as a DWARF pretty printer, taking only the file(s) to be analysed. We would need to take in a few additional inputs, which may be awkward to add to the llvm-dwarfdump CLI. Additionally, our coverage approach is not DWARF-specific. Although we only plan to support DWARF initially, the coverage tool could be expanded to support other formats in the future. For these reasons, we believe a new tool is a better fit.

I agree that it would be strange for dwarfdump to take a reference file as input, but more importantly I’m not yet convinced that this is actually needed in order to derive useful metrics. So far the only example use of the reference input you gave was to understand what lines are active, and I think we can side-step that problem by focussing on surviving break point locations in the binary instead. Then we can express coverage as the ratio of covered break points inside the variable’s scope, which should be very close to modeling what the actual debugging experience will be like. It sounds like you want to add the defining source locations as a DWARF extension for Clang’s benefit at a later point, which you could then use to filter out break points that shouldn’t be counted.

In summary, thank you for writing this up, I’m really excited that all this work is happening and am rooting for it! But I want us to take a step back and look at what problem we want to solve before designing a solution that adds a ton of complexity for very few additional benefits.

Hi Adrian… thanks very much for the supportive comments and attention to detail.

For packaging within dwarfdump --statistics, yes, I totally see the case for improving an existing tool rather than adding a new one. If that’s the consensus then that’s what we’ll do.

The bigger question is how we are to come up with the baseline, i.e. the number of coverable lines for a given variable. The specific issue is how much we trust the optimised line table. It would be nice if that could provide the baseline, since it would avoid the inconvenience of supplying an external reference to the tool.

In our paper (i.e. before the current proposal) we took a very untrusting approach: we went right back to the source code to enumerate a set of coverable lines. The rationale is that we want to minimise how much of the debug info we trust the compiler to get correct. However, it does leave open the question of how to filter out unreachable lines; ideally we’d like a way to match exactly the compiler’s approximation of unreachability. In our experiments we demo’d a way around this using data from profiling runs, but essentially we largely punted on this question. So now is a good time to think about the right solution in the context of LLVM tooling. One solution is simply to look at only the optimised line table.

The hazard I see is as follows. If a drop in variable coverage coincides with a drop in line table coverage, for some reason internal to the compiler, e.g. indiscriminate elimination of a bunch of metadata nodes but not their attached code, then by using the optimised line table as a reference, we would miss this.

On the other hand, if we’re confident in the line table and want the tool to focus only on the variable info, then using the optimised line table makes sense.

It’s really an LLVM call as to what extent the “indiscriminate elimination” thing is a real hazard. I know the !dbg metadata nodes for the line table are distinct from the dbg.value and dbg.declare intrinsics. If the treatment is sufficiently orthogonal in practice, then that would be an argument that it’s OK to trust the optimised line table.

I wonder if there’s a quick way to rustle up an experiment that sheds light on this. E.g. if across profiling runs of an O0 binary we rarely see a bigger set of reached lines than running the same test/input at O2, it would establish trust in the optimised line table.

I take the point that a non-Clang frontend may generate lexical scopes differently, but unless I’m missing something, I don’t think that changes the fundamentals? (It could maybe eliminate the liveness analysis if we assume a variable’s scope is precisely matched to its lifetime, but that’s not safe to assume in general.)

I’m not sure I understand what you said about “a variable can only be inspected at a break point”… surely that’s not true in general? E.g. we could always have is_stmt = 0, but still somehow step through our code inspecting variables as we go. I’m pretty sure my debugging workflows often involve inspecting variables at non-breakpoint instructions, e.g. maybe directly on return from a call.

So perhaps it’s worth thinking separately about two aspects of the line table (and the trustworthiness of each): the full collection of embodied source lines, and the subset that have is_stmt set at some PC. Both of these could be used to provide our baseline, of course, and the trustworthiness of each may be different.

jryans September 16, 2024, 2:35pm 9

Thanks @adrian.prantl for taking a look.

Stephen has just now replied to you on the primary topic of the metric itself, so please take a look at that. I also wanted to clarify a few additional bits.

If there’s consensus that it should go in dwarfdump --statistics, we could do that. My current opinion is that we’ll need additional reference inputs (at least for the metric as proposed at the top of the thread), and also our metric is not specific to DWARF, so to me it feels strange to force it into dwarfdump --statistics (but in terms of tool output, what we’re proposing should certainly have a machine-readable summary just like dwarfdump --statistics does, more on this below).

Hmm, it’s possible you did… I don’t have it in my notes from the conference, so I’m unsure on my side. See Stephen’s reply for further discussion on the metric itself.

While it’s true that at least Swift (are there other frontends doing the same…?) emits lexical blocks for each variable, from testing this a while back, that does not convey variable liveness info. In particular, a variable may be declared at the top of a block, but only first defined later on. The per-variable lexical block approach gives precise declaration bounds, but not definition or liveness bounds. When optimisation is enabled, the variable is likely to only be covered in debug info where it is live. The metric we’re proposing uses this first definition / liveness info to avoid counting these “declared but not defined” regions.

Ah sorry, I think the RFC text is causing confusion here. We did not mean to imply there’s something wrong with the output format of dwarfdump --statistics. I agree it’s extensible, and we could add additional data points there (assuming that’s the correct tool to use based on other factors e.g. potential extra input files, future analysis of non-DWARF debug info).

When the RFC mentioned “output formats”, that was intending to mean additional formats beyond a machine friendly summary like dwarfdump --statistics. For example, we could show a per-variable table of coverage in various formats (CLI-formatted table, TSV, JSON).

I agree that it would be strange for dwarfdump to take a reference file as input, but more importantly I’m not yet convinced that this is actually needed in order to derive useful metrics.

It sounds like you want to add the defining source locations as a DWARF extension for Clang’s benefit at a later point, which you could then use to filter out break points that shouldn’t be counted.

As mentioned above, the defined range for variables is a necessary input for getting useful variable coverage information for (at least) C/C++ programs, which represent a fairly substantial use case for this feature. At the moment there’s no DWARF representation for this information, and while the RFC notes that such an extension could be added to DWARF in future, my suspicion is that that will carry a much higher review burden (even if it is just an LLVM vendor extension to start with) than writing output to a separate file to use as input to the coverage tool. In prior discussions, I’ve suggested that the best choice was to not use llvm-dwarfdump as long as the tool needs input outside of DWARF. I wasn’t putting much weight on the cost of adding yet-another-tool, but if that’s a major concern then personally I would be fine with this being an extra option for llvm-dwarfdump --statistics, especially if there’s a way to easily include the feature in existing infrastructure.

pogo59 September 16, 2024, 5:19pm 11

So, one goal is to normalize the variable-coverage report across architectures, by converting PC-range data into source-line-range data. Another is to make sure it is theoretically possible to reach 100% coverage by constraining the denominator range to start at the first assignment rather than the lexical scope start.

The normalizing-to-source-lines part actually seems pretty straightforward; the PC range of a variable’s scope and the PC range of its DW_AT_location can both be mapped to source-line ranges by looking at the line table. This seems like a reasonable and valuable step to take for whatever statistics we might be reporting today, even if the other part of the proposal isn’t accepted.

I will say that this mapping obviously depends on the correctness and completeness of the line table, for optimized code, and I have my doubts as to that completeness and correctness. However, both the numerator and denominator will be using the same (potentially unreliable) mapping, so perhaps any problems with that mapping will cancel out? I haven’t thought about that in detail.

Finding the first assignment really does require looking at the pre-optimization IR, AFAICT. Thinking about the mechanics of this, I believe the cheapest/simplest way for it to work would be an Analysis pass that runs on the as-produced-by-Clang IR, and records the source location of each variable’s “first” definition in its debug info. (For some definition of “first.”) Then we emit that to the object file as a vendor-defined attribute on the variable, to be consumed by the statistics tool. This Analysis pass would ideally be cheap enough that we can just have it on by default (when emitting more than line-tables-only).

Yes, but it would also be impossible to observe the difference, since you cannot stop at the source location that the compiler dropped. Now it’s possible the dropped debug value could turn into a correctness problem at a later source location, but that would be a separate issue.

I.e., I wonder if we could separate out the concerns of determining metrics for source location (coverage, correctness) and variable location (coverage, correctness) and treat them as 4 different problems?

On the other hand, if we’re confident in the line table and want the tool to focus only on the variable info, then using the optimised line table makes sense.

I’m not confident in it at all, but I think

It’s really an LLVM call about to what extent the “indiscriminate elimination” thing is a real hazard. I know the !dbg metadata nodes for the line table are distinct from the dbg.value and dbg.declare. If the treatment is sufficiently orthogonal in practice, then that would be an argument that it’s OK to trust the optimised line table.

I wonder if there’s a quick way to rustle up an experiment that sheds light on this. E.g. if across profiling runs of an O0 binary we rarely see a bigger set of reached lines than running the same test/input at O2, it would establish trust in the optimised line table.

I guess that would be interesting to know, but IMO it’s an orthogonal problem that needs to be addressed separately. In the end we want both high-quality source locations and high-quality variables, because without source locations you can’t meaningfully inspect the variables. Without a source location you cannot set a breakpoint, so you also can’t inspect the variables at that missing location. And if you, e.g., instruction-step through an entire function with missing source locations, then you don’t know where you are, which renders the variable values meaningless. (At least in the general case; constants might still be interesting.)

I take the point that a non-Clang frontend may generate lexical scopes differently, but unless I’m missing something, I don’t think that changes the fundamentals? (It could maybe eliminate the liveness analysis if we assume a variable’s scope is precisely matched to its lifetime, but that’s not safe to assume in general.)

It doesn’t change the fundamentals, this was more a comment about the relative importance of this subproblem; it only affects some (admittedly important) languages.

I’m not sure I understand what you said about “a variable can only be inspected at a break point”… surely that’s not true in general? E.g. we could always have is_stmt = 0, but still somehow step through our code inspecting variables as we go. I’m pretty sure my debugging workflows often involve inspecting variables at non-breakpoint instructions, e.g. maybe directly on return from a call.

I used the word break point without defining it first. What I meant is a distinct source location (one you could set a breakpoint on or step to). Unless they are constants, variable values are only meaningful together with a source location. A call site typically has a source location.

So perhaps it’s worth thinking separately about two aspects of the line table (and the trustworthiness of each): the full collection of embodied source lines, and the subset that have is_stmt set at some PC. Both of these could be used to provide our baseline, of course, and the trustworthiness of each may be different.

As I said above I was not thinking necessarily about is_stmt (but I understand why you arrived at that conclusion). I’ll try to use the word “source locations” going forward, because I also find the word “line” problematic for modern programming languages that often have very complex control flow and even multiple closures/lambdas on a single line.

jryans September 17, 2024, 10:30am 13

(I think you already know this part, but just to emphasise for everyone…)

With the metric we’re proposing (which uses the unoptimised line table as the baseline), any coverable lines that are missing from the optimised line table will appear as coverage losses, so we’re not trusting the optimised line table. It appears everyone in the thread so far generally agrees the optimised line table is known to have losses (from either compiler bugs or intentional line drops), so it would seem to be important to have a separate reference source for the expected line table. We believe the unoptimised line table works well as that reference.

Yes, I agree some analysis of pre-optimisation IR is needed to properly find first definition point(s) (there could be multiple depending on the CFG).

I would be hesitant to do that analysis in the compiler itself (at least initially), as I wouldn’t want to affect compile time with this work. That’s why I’m imagining the coverage tool would do the analysis on unoptimised IR. If we later realise such info is of use to other tools, the analysis could then be moved into a regular compilation step that emits metadata as you say.
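As a very rough sketch of the simplest form such an analysis could take (assuming textual -O0 IR in the older llvm.dbg.declare intrinsic form; recent LLVM emits #dbg_declare records instead, and a real implementation would use the IR libraries and handle control flow rather than scanning text with regexes):

```python
# Rough sketch: for each local declared via llvm.dbg.declare in -O0 IR,
# find the source line of the first (textual) store to its alloca and
# treat that as the variable's first definition point. Simplifications:
# ignores control flow, only handles the intrinsic (not record) form,
# and only counts stores that carry a !dbg location.
import re

DECLARE_RE = re.compile(
    r'call void @llvm\.dbg\.declare\(metadata ptr (%[\w.]+), metadata (![0-9]+)')
STORE_RE = re.compile(r'store [^,]+, ptr (%[\w.]+).*!dbg (![0-9]+)')
DILOC_RE = re.compile(r'^(![0-9]+) = !DILocation\(line: ([0-9]+)')

def first_definition_lines(ll_text):
    lines = ll_text.splitlines()
    alloca_to_var = {}  # alloca SSA name -> DILocalVariable metadata id
    loc_to_line = {}    # !DILocation metadata id -> source line

    # Pass 1: record dbg.declare associations and !DILocation source lines.
    for line in lines:
        m = DILOC_RE.match(line.strip())
        if m:
            loc_to_line[m.group(1)] = int(m.group(2))
        m = DECLARE_RE.search(line)
        if m:
            alloca_to_var[m.group(1)] = m.group(2)

    # Pass 2: the first store seen to each declared alloca is its
    # "first definition" for the purposes of this sketch.
    first_def = {}  # DILocalVariable metadata id -> source line
    for line in lines:
        m = STORE_RE.search(line)
        if m and m.group(1) in alloca_to_var:
            var = alloca_to_var[m.group(1)]
            src = loc_to_line.get(m.group(2))
            if src is not None and var not in first_def:
                first_def[var] = src
    return first_def
```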

With the metric we’re proposing (which uses the unoptimised line table as the baseline), any coverable lines that are missing from the optimised line table will appear as coverage losses, so we’re not trusting the optimised line table.

I’ll add that from a nightly testing or pre-commit perspective, a good metric to put side-by-side with variable coverage relative to the unoptimized line table would be variable scope line coverage relative to the unoptimized line table; calculated exactly the same way as the coverage metric but treating every variable like it has a valid value at all lines present in the optimized line table. Being able to view both of these side-by-side would give a much clearer view, when comparing revisions, of the difference between cases where we have either purely gained or lost either variable or line coverage.

pogo59 September 17, 2024, 1:25pm 15

Maybe I didn’t understand. Looking first at the unoptimized object, we use that line table as the oracle of what source locations produced instructions. In the unoptimized object, we can map the variable’s scope’s PC range to a set of source locations, then reduce that set by removing the locations that precede the first definition of the variable. Variable coverage in the unoptimized object is necessarily 100% because we don’t emit PC ranges on variables at O0.

Looking at the optimized object, we could in principle have a lossless optimized line table (all lines from O0 are still represented). We could in principle have lossless variable coverage if the PC range on the variable spans the first-def to end-of-scope. If we are in that situation, coverage is still 100%.

However, even with a lossless optimized line table, we can discover variable-coverage losses simply because the variable’s liveness does not in fact span its entire scope, which is an artifact of optimization that we don’t really want to “fix.” The only way to achieve 100% coverage in this case is to artificially extend the variable’s lifetime.

And, even with a variable whose liveness does span its entire scope, we can also discover across-the-board losses due to source lines not represented in the optimized line table at all, because optimizations have correctly eliminated redundancies (for example). Again we do not want to “fix” this because the optimizations are doing exactly what we want.

So, there are two independent sources of lost coverage, that are not bugs, meaning that we are unlikely to achieve 100% coverage in the optimized build. And then there are losses in the line table due to bugs, and losses in variable coverage due to bugs. It appears that the proposed metric folds all these together?

That might be the best we can do, but I think it would be good to tease apart line-table losses from variable-coverage losses, if we can. (Maybe we can’t.)

jryans September 17, 2024, 2:12pm 16

Yes, this all sounds correct to me.

Yes, there may be cases where the variable’s last use happens before the end of scope, and optimisation may take advantage of this, and it’s not really seen as an optimisation bug. As a debugging user, you would likely still expect to see those variables until the end of their scope, so that’s why our coverage metric considers these “after last use to end of scope” gaps as a coverage loss. Artificial extension features can help recover some of this coverage when desired.

This bucket is a bit more complex, I’d say, as it’s likely to be a mix of optimisation bugs accidentally dropping lines, optimisations performing transformations we can’t express in current debug info, etc. In any case, yes, our metric treats these as coverage losses, since the debugging user would still want to inspect them, even if we’re unsure / don’t know how to cover them today.

Yes, the current metric does “entangle” line table losses and variable coverage losses. To achieve coverage under our metric, the variable’s coverable region must have its PC to source line mapping in the optimised line table (for the same lines as the unoptimised version) and the variable must have a location expression that covers the PCs that map to those lines.

One way to separate out line table losses would be to first look at just the line table comparison. If there are source lines in the unoptimised line table that are not in the optimised line table, we know it’s a loss at the line table level. That’s something our coverage tool could report in addition to the original variable-focused coverage, which builds on top of the line table info.
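In the simplest terms, that comparison is just a set difference over already-extracted line sets (a toy sketch with hypothetical inputs):

```python
def line_table_losses(unopt_lines, opt_lines):
    """Source lines present in the unoptimised line table but missing from
    the optimised one: losses at the line table level, independent of any
    particular variable."""
    return unopt_lines - opt_lines

# Hypothetical example: lines 3-10 appear at O0, but 7 and 8 are gone at O2.
print(sorted(line_table_losses(set(range(3, 11)), {3, 4, 5, 6, 9, 10})))  # [7, 8]
```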

Thanks @pogo59 and @adrian.prantl … this is really useful.

Adrian, I understand your points a lot better now, so particular thanks. You’re right that although loss of line info and loss of variable coverage are separate phenomena, either one of them is sufficient to cause the same loss in debug experience: a user can’t see the variable’s value at the affected line.

Our metric can still show a difference between them, however: if the O2-lost lines are in the O0 table and known to be reachable (however we determine that; see below) then they would be in the denominator for our coverage calculation, so the coverage score of the O2 build would be reduced – and accurately so. Conversely, if the lines are not in whatever reference line table we use, but are still in fact reachable, we’d see no reduction of measured coverage – this is less accurate, because those variable-lifetime-segments really have gone away from the debug info, despite being live during reachable code.

Given these two distinct phenomena, I think Paul’s framing of “teasing apart” is a good one. However, I’m going to be bold and say that that’s already what we’re proposing!

Rather than “entangling” the two concerns, a better way to see our proposal is that the coverage calculation assumes we can get an accurate line table from somewhere. (“Accurate” here means it includes all the source lines – or line/column ranges or whatever – that are actually embodied in the optimised code, and not others.)

For me that’s the opposite of entangling: although it is punting on the question of how to get that line table, the line-table concern is factored out cleanly.

That is why our proposed packaging of the tool is as one that takes a reference input. We might get that reference in many ways, from many places, and the core of the metric does not care.

Of course we could default that input to using the optimised code’s line table, so that in the common case there is no inconvenience of supplying an external reference. This comes at the potential cost of accuracy – if we drop line info but not the code those lines embody, coverage for variables live over those lines is over-reported. That is really the step that does the entangling.

And as Adrian has noted, it’s an entangling that also occurs during a real debugging scenario: the two losses are indistinguishable at debug time. If, however, the debugger also had access to an accurate reference for reachable source lines, that would no longer be the case! The debugger would know that some lines “existed” but for some unknown reason could not be mapped to PCs. This extra information would not enable many useful debug-time features that I can see, except maybe better warning messages e.g. when setting a breakpoint on such a line. But it does enable more accurate variable coverage measurement, because it does not mask the unavailability of those variables over the affected lines.

We could then think separately about ways to obtain a better reference line table. The O0 line table is one possible reference, and one we have proposed, although it has the converse problem of the O2 table’s problem: it may include unreachable lines, so report lower coverage than is reasonable.

So maybe the “teasing apart” thing we need to do (thanks Paul for this framing) is

  1. (as we’ve proposed) a variable coverage tool that can work with multiple references for the line table, perhaps defaulting to the optimised one
  2. (separately) an accurate line table coverage tool that can account for reachability

Number 2 is hard because the “best” reachability calculation is not a matter of absolute truth: to be really accurate, we should match whatever approximation of reachability the compiler happened to use. This depends on analyses/transformations done by potentially many passes. We’re not in a position to contribute a solution to number 2 just now, but I think there are lots of things one could try: an interprocedural reachability analysis on the O0 code might still be a good start, and one could also look at combining under- and over-approximate variations of this, albeit yielding an “interval” not a single coverage figure. A looser initial stab at such an interval could of course just use the O0 and O2 tables, which we’re already enabling.
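To be concrete about the interval idea, here’s a toy sketch over hypothetical, already-extracted line sets (the O0 baseline may include unreachable lines, giving a pessimistic figure; the O2 baseline may have dropped reachable lines, giving an optimistic one):

```python
# Toy sketch: bracket a variable's coverage between a pessimistic figure
# (baseline = O0 line table) and an optimistic one (baseline = O2 line
# table). `defined` is the variable's defined region in source lines;
# `covered` is where the optimised debug info locates it.
def coverage_interval(covered, defined, o0_lines, o2_lines):
    def ratio(baseline):
        coverable = baseline & defined
        if not coverable:
            return 1.0
        return len(covered & coverable) / len(coverable)
    # (lower bound, upper bound), under the usual assumption that the O2
    # line table's lines are a subset of the O0 one's.
    return ratio(o0_lines), ratio(o2_lines)
```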

pogo59 September 18, 2024, 2:37pm 18

Right, and if we emit a fully correct line table with optimization, we’re good here. Adrian and I both believe we don’t have that today, however, and I’m sure we’re not the only ones. The open question is, can we actually find that data elsewhere.

The metric as an abstraction does not care; the metric as an implementation detail very much cares. This is the part that sounds fishy to me. I don’t see how to get that reference from anywhere else.

I can see being able to get the set of source locations spanning a variable’s first-def to its end-of-scope from an unoptimized object (i.e., its debug info and line table). This isn’t the same as its liveness, but it’s the best we can do with what Clang emits at O0.

But given an optimized object file, there’s no way to get the corresponding set of source locations for a PC range (of any kind) without referring to that same object’s line table. You can’t get it from the unoptimized line table, because the addresses (and even relative offsets from the start of the function) are likely to be completely different; looking up optimized-object addresses/offsets in the unoptimized line table is simply invalid.

Thanks Paul. I think we are at cross purposes slightly. Nobody is talking about looking up optimised-object addresses in another table! That would clearly be silly.

I’m not sure, but I think what you might be missing is that for our baseline we don’t need to get the set of source locations for a PC range. We just need to get them for a variable. In our paper we did this entirely at source level. Although that has problems matching the compiler w.r.t. reachability, it’s still a usable approximation, although overapproximate. Another usable approximation is simply using the optimised IR’s line table; that has other problems as I outlined above, and is likely to be underapproximate. We could even combine the two (with the “interval” thing I mooted). Either way, there’s no need to cross-correlate the PC ranges. And to reiterate, we do already have decent (albeit approximate) ways to get the reference we need. Of course that doesn’t stop us thinking of better ones… maybe I am going too far that way already.

Does that help?

jryans September 19, 2024, 4:46pm 20

Perhaps @adrian.prantl and any others on the fence here may find it persuasive to think about coverage metrics from the perspective of someone investigating coverage losses. I’ll compare existing tools and the proposed approach in that light.


Existing tools

A curious engineer runs llvm-dwarfdump --statistics on their optimised build, and they see that quite a few variables are reported as having <100% of their parent scope covered, so they’d like to investigate further. One way might be to take a closer look at a few individual variables (as that might reveal the root cause of problems that affect others as well). llvm-dwarfdump --statistics does not currently offer a per-variable coverage view, so they switch to debuginfo-quality --variables (which uses the same coverage approach as llvm-dwarfdump --statistics) for this. They notice many variables with reduced coverage, including variable count from function example, which has 60% of its parent scope PC bytes covered.

Existing tools have led our investigator to think there may be a problem with coverage for variable count, but it’s hard to know what to make of “60% of parent scope PC bytes covered”. As a debugging user, you typically think in terms of the source language. “60% of parent scope PC bytes” could mean essentially anything in terms of experience, since the debugging experience will also depend on line table quality for this region of PCs. Our investigator checks the line table, variable location expressions, and stares at the function source for a while. They eventually notice count is only defined part-way down the function, so 100% coverage (of parent scope) is not even expected for this variable!

Aside: Coverage achievability

This issue of coverage achievability in existing metrics that our investigator encountered is not artificial at all: it affects many variables. In our paper, we looked at coverage achievability, using the Git codebase as an example.

[Figure: Coverage achievability (Git)]

As the figure above shows, more than half of the source variables have some portion of unachievable coverage when measured by existing metrics (because these variables are first defined somewhere after the start of their parent scope).

Proposed approach

Let’s see what our investigator could learn via the tools proposed here. They would like to understand the coverage of count, ideally in more relatable source language terms and using coverage metrics where 100% is achievable for all variables so they know they aren’t wasting their time.

Discussion with @pogo59 and others here highlighted the need for additional tooling to analyse the line table in isolation, so let’s imagine using that first. It’s hard to analyse the optimised line table on its own, as there’s no clear baseline. Our investigator assembles both unoptimised and optimised builds and runs the newly-described line table coverage analysis (which checks an optimised line table against an unoptimised line table). This tooling reports that 30% of the source locations in function example are missing in the optimised build. Great, this is something our investigator can understand and act on. There might be compiler bugs that have dropped some locations here. Or, there might be transformations that can’t currently be expressed in debug info. Either way, these are both good to investigate further for potential bugs or debug info design issues.

Let’s imagine the line table issues are fixed, and our investigator looks again at fresh builds. This function now has 100% line table coverage, hooray!

Now our investigator runs the per-variable coverage analysis, which layers variable location expressions on top of the line table and also uses knowledge of variable first definition points. This shows that 20% of variable count’s defined region is not covered. Since the line table for this function is now fully covered, our investigator knows that’s not to blame. The most likely suspect is an optimisation pass dropping some variable location coverage. This is once again an actionable report. With a bit more investigation, the responsible pass can be identified and hopefully fixed if possible.


I hope this case study shows why we believe our proposed approach has a lot of value to offer debug info analysis. The story here is quite similar to the real steps I’ve encountered investigating debug info issues personally. I find that our proposed metrics allow me to work out the root cause far more quickly, and I can be confident that false alarms (like unachievable coverage) have already been filtered away.