LLVM Code Coverage Mapping Format - LLVM 21.0.0git documentation

Introduction

LLVM’s code coverage mapping format is used to provide code coverage analysis using LLVM’s and Clang’s instrumentation based profiling (Clang’s -fprofile-instr-generate option).

This document is aimed at those who would like to know how LLVM’s code coverage mapping works under the hood. Prior knowledge of how Clang’s profile guided optimization works is useful, but not required. For those interested in using LLVM to provide code coverage analysis for their own programs, see the Clang documentation: https://clang.llvm.org/docs/SourceBasedCodeCoverage.html.

We start by briefly describing LLVM’s code coverage mapping format and the way that Clang and LLVM’s code coverage tool work with this format. After the basics are down, more advanced features of the coverage mapping format are discussed - such as the data structures, LLVM IR representation and the binary encoding.

High Level Overview

LLVM’s code coverage mapping format is designed to be a self contained data format that can be embedded into the LLVM IR and into object files. It’s described in this document as a mapping format because its goal is to store the data that is required for a code coverage tool to map between the specific source ranges in a file and the execution counts obtained after running the instrumented version of the program.

The mapping data is used in two places in the code coverage process:

  1. When clang compiles a source file with -fcoverage-mapping, it generates the mapping information that describes the mapping between the source ranges and the profiling instrumentation counters. This information gets embedded into the LLVM IR and conveniently ends up in the final executable file when the program is linked.
  2. It is also used by llvm-cov - the mapping information is extracted from an object file and is used to associate the execution counts (the values of the profile instrumentation counters) with the source ranges in a file. After that, the tool is able to generate various code coverage reports for the program.

The coverage mapping format aims to be a “universal format” that would be suitable for usage by any frontend, and not just by Clang. It also aims to provide the frontend the possibility of generating the minimal coverage mapping data in order to reduce the size of the IR and object files - for example, instead of emitting mapping information for each statement in a function, the frontend is allowed to group the statements with the same execution count into regions of code, and emit the mapping information only for those regions.

Advanced Concepts

The remainder of this guide is meant to give you insight into the way the coverage mapping format works.

The coverage mapping format operates on a per-function level as the profile instrumentation counters are associated with a specific function. For each function that requires code coverage, the frontend has to create coverage mapping data that can map between the source code ranges and the profile instrumentation counters for that function.

Mapping Region

The function’s coverage mapping data contains an array of mapping regions. A mapping region stores the source code range that is covered by this region, the file id, the coverage mapping counter and the region’s kind. There are several kinds of mapping regions:

}

#ifdef DEBUG // Skipped Region from 2:1 to 4:2
printf("Hello world");
#endif
return 0;
}

Source Range:

The source range record contains the starting and ending location of a certain mapping region. Both locations include the line and the column numbers.

File ID:

The file id is an integer value that tells us in which source file or macro expansion this region is located. It enables Clang to produce mapping information for the code defined inside macros, as this example demonstrates:

void func(const char *str) {        // Code Region from 1:28 to 6:2 with file id 0
  #define PUT printf("%s\n", str)   // 2 Code Regions from 2:15 to 2:34 with file ids 1 and 2
  if(*str)
    PUT;                            // Expansion Region from 4:5 to 4:8 with file id 0 that expands a macro with file id 1
  PUT;                              // Expansion Region from 5:3 to 5:6 with file id 0 that expands a macro with file id 2
}

Counter:

A coverage mapping counter can represent a reference to the profile instrumentation counter. The execution count for a region with such counter is determined by looking up the value of the corresponding profile instrumentation counter.

It can also represent a binary arithmetical expression that operates on coverage mapping counters or other expressions. The execution count for a region with an expression counter is determined by evaluating the expression’s arguments and then adding them together or subtracting them from one another. In the example below, a subtraction expression is used to compute the execution count for the compound statement that follows the else keyword:

int main(int argc, const char *argv[]) {   // Region's counter is a reference to the profile counter #0

  if (argc > 1) {                          // Region's counter is a reference to the profile counter #1
    printf("%s\n", argv[1]);
  } else {                                 // Region's counter is an expression (reference to the profile counter #0 - reference to the profile counter #1)
    printf("\n");
  }
  return 0;
}

Finally, a coverage mapping counter can also represent an execution count of zero. The zero counter is used to provide coverage mapping for unreachable statements and expressions, like in the example below:

int main() {
  return 0;
  printf("Hello world!\n");   // Unreachable region's counter is zero
}

The zero counters allow the code coverage tool to display proper line execution counts for the unreachable lines and highlight the unreachable code. Without them, the tool would think that those lines and regions were still executed, as it doesn’t possess the frontend’s knowledge.
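The three counter forms described above (profile counter reference, arithmetic expression, zero) can be sketched as a small evaluator. This is only an illustration of the idea, not llvm-cov's implementation: the type names, the enum values and the function eval_counter are hypothetical, and the profile counter values used with it would come from a real profile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of a coverage mapping counter. */
enum counter_kind { COUNTER_ZERO, COUNTER_REF, COUNTER_EXPR };
enum expr_kind { EXPR_SUB, EXPR_ADD };

struct counter {
    enum counter_kind kind;
    uint32_t id;               /* profile counter index or expression index */
};

struct expr {
    enum expr_kind kind;
    struct counter lhs, rhs;   /* operands may themselves be expressions */
};

/* Compute the execution count that a counter stands for. */
static uint64_t eval_counter(struct counter c, const uint64_t *profile,
                             const struct expr *exprs) {
    switch (c.kind) {
    case COUNTER_REF:
        assert(profile != NULL);
        return profile[c.id];  /* look up the instrumentation counter */
    case COUNTER_EXPR: {
        assert(exprs != NULL);
        const struct expr *e = &exprs[c.id];
        uint64_t l = eval_counter(e->lhs, profile, exprs);
        uint64_t r = eval_counter(e->rhs, profile, exprs);
        return e->kind == EXPR_ADD ? l + r : l - r;
    }
    default:
        return 0;              /* zero counter: region never executes */
    }
}
```

With made-up profile values of 10 for counter #0 and 7 for counter #1, the else region from the earlier example, modeled as the expression (#0 - #1), evaluates to 3.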

Note that branch regions are created to track branch conditions in the source code and refer to two coverage mapping counters, one to track the number of times the branch condition evaluated to “true”, and one to track the number of times the branch condition evaluated to “false”.

LLVM IR Representation

The coverage mapping data is stored in the LLVM IR using a global constant structure variable called __llvm_coverage_mapping with the _IPSK_covmap_section specifier (i.e. “.lcovmap$M” on Windows and “__llvm_covmap” elsewhere).

For example, let’s consider a C file and how it gets compiled to LLVM:

int foo() { return 42; }
int bar() { return 13; }

The coverage mapping variable generated by Clang has 2 fields:

The variable has 8-byte alignment because ld64 cannot always pack symbols from different object files tightly (the word-level alignment assumption is baked in too deeply).

@__llvm_coverage_mapping = internal constant { { i32, i32, i32, i32 }, [32 x i8] }
{
  { i32, i32, i32, i32 } ; Coverage map header
  {
    i32 0,  ; Always 0. In prior versions, the number of affixed function records
    i32 32, ; The length of the string that contains the encoded translation unit filenames
    i32 0,  ; Always 0. In prior versions, the length of the affixed string that contains the encoded coverage mapping data
    i32 3,  ; Coverage mapping format version
  },
  [32 x i8] c"..." ; Encoded data (dissected later)
}, section "__llvm_covmap", align 8

The current version of the format is version 6.

There is one difference between versions 6 and 5:

There is one difference between versions 5 and 4:

There are two differences between versions 4 and 3:

The only difference between versions 3 and 2 is that a special encoding for column end locations was introduced to indicate gap regions.

In version 1, the function record for foo was defined as follows:

{ i8*, i32, i32, i64 } {
    i8* getelementptr inbounds ([3 x i8]* @__profn_foo, i32 0, i32 0), ; Function's name
    i32 3, ; Function's name length
    i32 9, ; Function's encoded coverage mapping data string length
    i64 0  ; Function's structural hash
}

In version 2, the function record for foo was defined as follows:

{ i64, i32, i64 } {
    i64 0x5cf8c24cdb18bdac, ; Function's name MD5
    i32 9, ; Function's encoded coverage mapping data string length
    i64 0  ; Function's structural hash
}

Function record:

A function record is a structure of the following type:

{ i64, i32, i64, i64, [? x i8] }

It contains the function name’s MD5, the length of the encoded mapping data for that function, the function’s structural hash value, the hash of the filenames in the function’s translation unit, and the encoded mapping data.

Dissecting the sample:

Here’s an overview of the encoded data that was stored in the IR for the coverage mapping sample that was shown earlier:

Encoding

The per-function coverage mapping data is encoded as a stream of bytes with a simple structure. The structure is built from encoding types, such as variable-length unsigned integers, that are used to encode the File ID Mapping, Counter Expressions and Mapping Regions.

The format of the structure follows:

[file id mapping, counter expressions, mapping regions]

The translation unit filenames are encoded using the same encoding types as the per-function coverage mapping data, with the following structure:

[numFilenames : LEB128, filename0 : string, filename1 : string, ...]

Types

This section describes the basic types that are used by the encoding format and can appear after : in the [foo : type] description.

LEB128

LEB128 is an unsigned integer value that is encoded using DWARF’s LEB128 encoding, optimizing for the case where values are small (1 byte for values less than 128).
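As a rough illustration of the unsigned LEB128 scheme, the sketch below encodes and decodes a 64-bit value, seven bits per byte with the high bit marking continuation. The function names uleb128_encode and uleb128_decode are hypothetical helpers written for this example; they are not LLVM APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode value as unsigned LEB128 into out, which must hold at least
   10 bytes. Returns the number of bytes written. */
static size_t uleb128_encode(uint64_t value, uint8_t *out) {
    size_t n = 0;
    assert(out != NULL);
    do {
        uint8_t byte = value & 0x7f;   /* low 7 bits of the value */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;              /* more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Decode an unsigned LEB128 value from in[0..len). Returns the number of
   bytes consumed, or 0 if the input is truncated or too long. */
static size_t uleb128_decode(const uint8_t *in, size_t len, uint64_t *value) {
    uint64_t result = 0;
    unsigned shift = 0;
    assert(in != NULL && value != NULL);
    for (size_t i = 0; i < len && shift < 64; ++i, shift += 7) {
        result |= (uint64_t)(in[i] & 0x7f) << shift;
        if ((in[i] & 0x80) == 0) {     /* last byte of this value */
            *value = result;
            return i + 1;
        }
    }
    return 0;
}
```

Values below 128 fit in a single byte, which matches the small-value optimization mentioned above; 128 itself takes two bytes (0x80 0x01).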

Strings

[length : LEB128, characters...]

String values are encoded with a LEB value for the length of the string and a sequence of bytes for its characters.
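A minimal sketch of this string layout, assuming the length counts bytes and no terminating NUL is stored. The helper names encode_string and uleb128_encode are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unsigned LEB128 encoder; out must hold at least 10 bytes. */
static size_t uleb128_encode(uint64_t value, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Write [length : LEB128, characters...] for the C string s into out.
   Returns the number of bytes written, or 0 if out (capacity cap) is
   too small. */
static size_t encode_string(const char *s, uint8_t *out, size_t cap) {
    uint8_t len_buf[10];
    size_t len, n;
    assert(s != NULL && out != NULL);
    len = strlen(s);
    n = uleb128_encode(len, len_buf);
    if (n + len > cap)
        return 0;
    memcpy(out, len_buf, n);     /* length prefix */
    memcpy(out + n, s, len);     /* raw characters, no NUL terminator */
    return n + len;
}
```

Under these assumptions, the five-character name "foo.c" occupies six bytes: one for the length and five for the characters.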

File ID Mapping

[numIndices : LEB128, filenameIndex0 : LEB128, filenameIndex1 : LEB128, ...]

File id mapping in a function’s coverage mapping stream contains the indices into the translation unit’s filenames array.

Counter

[value : LEB128]

A coverage mapping counter is stored in a single LEB value. It is composed of two things: the tag, which is stored in the lowest 2 bits, and the counter data, which is stored in the remaining bits.
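The bit layout can be sketched as follows, operating on the already-decoded LEB value. The struct raw_counter and the two helper functions are hypothetical names for this example, and the sketch deliberately does not assign meanings to the individual tag values.

```c
#include <assert.h>
#include <stdint.h>

/* A decoded counter value split into its two parts. */
struct raw_counter {
    unsigned tag;     /* lowest 2 bits */
    uint64_t data;    /* remaining bits */
};

/* Split a decoded LEB value into tag and data. */
static struct raw_counter split_counter(uint64_t value) {
    struct raw_counter c;
    c.tag = (unsigned)(value & 0x3);
    c.data = value >> 2;
    return c;
}

/* Inverse of split_counter: pack tag and data into one value. */
static uint64_t join_counter(unsigned tag, uint64_t data) {
    assert(tag < 4);  /* the tag must fit in 2 bits */
    return (data << 2) | tag;
}
```

For example, the value 0x16 (binary 10110) splits into tag 2 and data 5.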

Tag:

The counter’s tag encodes the counter’s kind and, if the counter is an expression, the expression’s kind. The possible tag values are:

Data:

The counter’s data is interpreted in the following manner:

Counter Expressions

[numExpressions : LEB128, expr0LHS : LEB128, expr0RHS : LEB128, expr1LHS : LEB128, expr1RHS : LEB128, ...]

Counter expressions consist of two counters as they represent binary arithmetic operations. The expression’s kind is determined from the tag of the counter that references this expression.
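A reader for this array might look like the sketch below. It keeps each operand as a raw encoded counter value, leaving tag interpretation to the caller. The names raw_expr, decode_expressions and uleb128_decode are hypothetical, and the error-handling convention (returning 0) is a choice made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned LEB128 decoder; returns bytes consumed, or 0 on bad input. */
static size_t uleb128_decode(const uint8_t *in, size_t len, uint64_t *value) {
    uint64_t result = 0;
    unsigned shift = 0;
    for (size_t i = 0; i < len && shift < 64; ++i, shift += 7) {
        result |= (uint64_t)(in[i] & 0x7f) << shift;
        if ((in[i] & 0x80) == 0) {
            *value = result;
            return i + 1;
        }
    }
    return 0;
}

/* One expression: two operands, each a raw encoded counter value. */
struct raw_expr {
    uint64_t lhs, rhs;
};

/* Decode [numExpressions, expr0LHS, expr0RHS, ...] from in[0..len).
   Stores up to cap expressions in out and their number in *count.
   Returns bytes consumed, or 0 on malformed input or if cap is too small. */
static size_t decode_expressions(const uint8_t *in, size_t len,
                                 struct raw_expr *out, size_t cap,
                                 size_t *count) {
    uint64_t n = 0;
    size_t pos, used;
    assert(in != NULL && out != NULL && count != NULL);
    pos = uleb128_decode(in, len, &n);
    if (pos == 0 || n > cap)
        return 0;
    for (uint64_t i = 0; i < n; ++i) {
        used = uleb128_decode(in + pos, len - pos, &out[i].lhs);
        if (used == 0)
            return 0;
        pos += used;
        used = uleb128_decode(in + pos, len - pos, &out[i].rhs);
        if (used == 0)
            return 0;
        pos += used;
    }
    *count = (size_t)n;
    return pos;
}
```

Note that the expression record itself carries no operator: as stated above, whether an expression adds or subtracts is known only from the tag of the counter that refers to it.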

Mapping Regions

[numRegionArrays : LEB128, regionsForFile0, regionsForFile1, ...]

The mapping regions are stored in an array of sub-arrays where every region in a particular sub-array has the same file id.

The file id for a sub-array of regions is the index of that sub-array in the main array; for example, the first sub-array has file id 0.

Sub-Array of Regions

[numRegions : LEB128, region0, region1, ...]

The mapping regions for a specific file id are stored in an array that is sorted in an ascending order by the region’s starting location.

Mapping Region

[header, source range]

The mapping region record contains two sub-records — the header, which stores the counter and/or the region’s kind, and the source range that contains the starting and ending location of this region.

Source Range

[deltaLineStart : LEB128, columnStart : LEB128, numLines : LEB128, columnEnd : LEB128]

The source range record contains the following fields:

Testing Format

Warning

This section is for the LLVM developers who are working on llvm-cov only.

llvm-cov uses a special file format (called .covmapping below) for testing purposes. This format is private and should have no use for general users. As a developer, you can get such files by using the convert-for-testing subcommand of llvm-cov.

The structure of the .covmapping files follows:

[magicNumber : u64, version : u64, profileNames, coverageMapping, coverageRecords]

Magic Number and Version

The magic is 0x6d766f636d766c6c, which is the ASCII string llvmcovm in little-endian.
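A quick way to see why that constant spells llvmcovm is to read the first eight bytes of a file as a little-endian u64. The sketch below does this; the names COVMAPPING_MAGIC, read_u64_le and has_covmapping_magic are hypothetical. Reading the version field the same way assumes it is also little-endian, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COVMAPPING_MAGIC UINT64_C(0x6d766f636d766c6c)

/* Read an unsigned 64-bit little-endian integer from 8 bytes at p. */
static uint64_t read_u64_le(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];   /* p[7] ends up in the top byte */
    return v;
}

/* Return nonzero if data[0..len) starts with the .covmapping magic. */
static int has_covmapping_magic(const uint8_t *data, size_t len) {
    assert(data != NULL);
    return len >= 8 && read_u64_le(data) == COVMAPPING_MAGIC;
}
```

The first byte in the file is 'l' (0x6c), which is the least significant byte of the constant, and the eighth is 'm' (0x6d), its most significant byte.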

There are two versions for now:

The only difference between Version1 and Version2 is in the encoding of the coverageMapping fields, which is explained later.

Profile Names

profileNames, coverageMapping and coverageRecords are 3 sections extracted from the original binary file.

profileNames encodes the size, address and the raw data of the section:

[profileNamesSize : LEB128, profileNamesAddr : LEB128, profileNamesData : bytes]

Coverage Mapping

This field is padded with zero bytes to make it 8-byte aligned.

coverageMapping contains the records of the source files. In version 1, only one record is stored:

[padding : bytes, coverageMappingData : bytes]

Version 2 relaxes this restriction by encoding the size of coverageMappingData as a LEB128 number before the data:

[coverageMappingSize : LEB128, padding : bytes, coverageMappingData : bytes]

The current version is 2.
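The amount of zero padding needed for 8-byte alignment can be computed as in the sketch below. It assumes the alignment is measured from the start of the file, which is not spelled out above; padding_to_8 and align_up_8 are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of zero bytes needed after offset to reach an 8-byte boundary. */
static size_t padding_to_8(size_t offset) {
    return (8 - (offset % 8)) % 8;
}

/* First 8-byte-aligned offset that is >= offset. */
static size_t align_up_8(size_t offset) {
    assert(offset <= SIZE_MAX - 7);   /* avoid wrapping around */
    return offset + padding_to_8(offset);
}
```

For instance, if the preceding field ends at offset 17, seven padding bytes follow and the data starts at offset 24; an offset that is already a multiple of 8 needs no padding.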

Coverage Records

This field is padded with zero bytes to make it 8-byte aligned.

coverageRecords is encoded as:

[padding : bytes, coverageRecordsData : bytes]

The remaining data in the file is treated as the coverageRecordsData.