Faster C++ program startups
by improving runtime linking efficiency. (original) (raw)
Léon Bottou
AT&T Labs - Research, Middletown, NJ, USA
Current address: NEC Research Institute, Princeton, NJ, USA
http://leon.bottou.com
John Ryland
Trolltech AS, Oslo, Norway
mailto:jryland@trolltech.com
1- Introduction
This paper describes runtime linking speed issues arising with shared libraries written in C++. Two methods for reducing the runtime linking complexity, namely objprelink1 and combreloc, are compared and shown to be equivalent. A third method, namely objprelink2, yields additional speedups by resolving virtual table entries on demand. These methods are then compared with the full pre-linking approach. A promising hybrid is suggested.
2- Shared libraries
Shared libraries are a common feature of all modern operating systems. Executable program no longer incorporate the code of common library functions such as printf. Instead the executable program contains unresolved references to the library functions and an additional piece of information describing a shared library file(s) containing the missing functions. Shared libraries themselves can depend on additional shared libraries.
When the program starts up, the runtime linker loads the necessary shared libraries, resolves the undefined symbols, and eventually transfers control to the program itself.
The runtime linker performs the following operations:
- A suitable space in memory must be allocated for each necessary shared library.
- The shared library code and data must appear in this space. This is usually achieved by mapping the shared library file using virtual memory. Multiple processes then can use the same read-only memory image of the shared library. This approach yields considerable memory savings
- The shared library might contain absolute addresses that must be patched in order to reflect the actual loading address of the library.
- Undefined symbols in the program and all shared libraries must be resolved using symbols defined in other shared libraries. Code and data referring to each undefined symbol must be patched in order to reflect the address of the corresponding definitions.
The required patches are entirely described by _relocation records_located in the shared library file itself. Patches must be applied independently in each process using a given shared library. Virtual memory page affected by the patches no longer can be shared between processes using the same library. Resolving symbols also requires a potentially expensive search in the list of all symbols defined by all shared libraries.
The Linux system, and more generally all operating systems using the ELF object file format, use several strategies to reduce the cost of these operations [ELF-1995]:
- Shared libraries are usually compiled as position independent code (PIC). For instance, the compiler will generate branch instructions that specify the branch target relative to the branch instruction. Position independent code reduces the number of necessary patches.
- References to undefined symbols are grouped together in a data area known as the global offset table (GOT). Regrouping undefined references ensures that a minimal number of virtual memory pages are affected by the patches. A piece of code that would otherwise refer to an undefined symbol is changed into an instruction sequence that retrieves the desired address from the GOT. To make this easier, compilers compute the location of the GOT at the beginning of each function and keep it in a specific processor register named the got pointer register.
- References to undefined functions can be resolved on demand. Jumps to undefined symbols are replaced by jumps to stubs placed in the program location table (PLT). Each stub jumps to the address specified by a corresponding entry in the GOT. This address initially refers to a piece of code that locates the symbol definition, patches the GOT entry, and finally executes the function itself. The first call to the target function performs the link edition. Subsequent calls directly call the target function. This lazy symbol resolution ensures that no time is wasted resolving function symbols that are not useful for a particular program. This is very significant because programs are usually linked with fairly large libraries (such as the C library) but only use a small fraction of the facilities offered by the library. Lazy symbol resolution also reduces the apparent startup time of interactive program because it avoids performing a complete link edition upfront.
Although these strategies are fairly successful on usual C programs, they do not address significant cases. A C shared library, for instance, may define the following global variables:
char *a = "abcdef";
double (*b)(double) = &cos;
The semantics of the C language implies that both variables a andb must contain absolute pointers to either the string"abcdef" or the library function cos(). Both variables will require a patch when the shared library will be used.
**Remark:**It is interesting to notice that the example above would almost work if variable b contained the address of a PLT entry for functioncos() instead of the absolute address of function cos(). The shared library that initializes variable b can indeed call the cosine function via the function pointer. Calling it from another shared library does not work: the GOT register contains a pointer to the other shared library, while the PLT stub expects that it contains a pointer to the shared library that initializes b.
3- Runtime linking and C++.
It is unfortunate that the C++ language relies a lot more on constructs that are poorly handled by the runtime linking speedup strategies discussed above.
Waldo Bastian [Bastian-2001] has investigated the run-time linking performance of various components of the KDE desktop environment (a large collection of C++ shared libraries and executable). He pinpoints the performance issue to the resolution of the virtual table entries:
In his own words: " Is this a KDE specific problem? That depends on how you look at it. The problem of long startup times is caused by the linker needing to do many relocations. In the case of KDE, these relocations are for a great deal caused by vtable entries. KDE and Qt make liberal use of virtual functions and inheritance. In general every class that inherits from a base class with virtual functions will need to provide a vtable that covers at least the virtual functions of this base class. It seems that each vtable entry causes a relocation, and worse, a relocation that can not be done lazy. "
Waldo Bastian suggests two strategies to improve the runtime linking performance: (a) improving the efficiency of C++ virtual table linking, and (b) pre-relocation and pre-linking of shared libraries. This paper explores the first strategy in section 4 and 5. Jelinek's ELF pre-linker is briefly reviewed in section 6 in order to discuss the pros and cons of both approaches.
4- Reducing the complexity.
Each virtual table entry requires a separate relocation record. The runtime linker processes each relocation record by performing one symbol resolution and applying the corresponding patch.
In a large class hierarchy, the number of distinct symbols referenced by the virtual table entries is usually much lower than the number of virtual table entries themselves. All the derived class virtual tables to a large extent replicate the virtual tables of their base classes. For instance, almost every class in the KDE C++ code inherits about one hundred virtual functions from class QWidget. Every corresponding virtual table contains about one hundred identical entries. Each of these virtual table entries is separately resolved by the runtime linker.
4.1- objprelink-1
The objprelink-1 program [Bottou-2001] provides a first method for avoiding this waste of CPU cycles. Virtual table entries now point to a relay stub jumping to the target function. All virtual table entries addressing the same function use the same relay stub.
There is still a relocation record per virtual table entry, because each virtual table entry contains the absolute address of a relay stub. The runtime linker computes this address without performing a symbol resolution, merely by adding a displacement to the base address of the shared library. The relay stub, on the other hand, requires a relocation record involving a symbol resolution. But the number of relay stubs is much lower than the number of virtual table entries.
The program objprelink-1 (object file pre-linker) achieves this result by post-processing the object files produced by the compiler in a way that coerces the GNU link editor ld into producing the modified shared libraries. Measurements on various KDE2 applications [Bottou-2001] have shown a significant reduction of the runtime linking time. The perceptual responsiveness of the KDE2 desktop environment was significantly improved by using objprelink1.
4.2- combreloc
Recent versions of the Linux operating system provide a similar speedup using a scheme initially implemented in the Solaris 7 operating system. The method relies on modifications in both the link editor and the runtime linker [Jelinek-2001b].
- The GNU link editor ld option -z combreloc, enabled by default, reorders the shared library relocation records according to the name of the symbol to resolve. All relocation records referencing the same symbol are placed consecutively in the relocation record table.
- The runtime linker processes the relocation records in sequence and always caches the last symbol resolution. The presence of both modifications ensures that relocation records involving the same symbol resolution require only one execution of the actual symbol resolution code.
4.3- Performance
The performance of both methods has been evaluated using the small benchmark program proposed in [Bastian-2001]. A collection of shared libraries of increasing complexity are compiled from the following source code:
Source code of test library testclassN.so
#include <qwidget.h> template struct testclass : public QWidget { virtual void setSizeIncrement(int w, int h) { QWidget::setSizeIncrement(w+T, h+T); } }; template class testclass<1>; ... template class testclass;
Each shared library testclassN.so contains N class definition. Each class inherits about one hundred virtual member functions from the QWidget class defined by Trolltech Qt-3.0.3 library.
The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared libraries. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.
Four cases are considered according to the presence of option-z combreloc and to the use of the object pre-linkerobjprelink1. All tests use the same source code, the same compiler, and the same version of the Qt library. Execution time differences are therefore caused by differences in the runtime linking time of the test shared libraries.
Compared execution times (milliseconds).
testclass*.so | nocombreloc | combreloc | objprelink1+nocombreloc | objprelink1+combreloc |
---|---|---|---|---|
testclass1.so | 46 | 47 | 45 | 45 |
testclass2.so | 47 | 47 | 46 | 47 |
testclass5.so | 48 | 47 | 47 | 48 |
testclass10.so | 50 | 48 | 48 | 48 |
testclass20.so | 54 | 49 | 49 | 48 |
testclass50.so | 66 | 52 | 52 | 52 |
testclass500.so | 275 | 101 | 100 | 91 |
These results show that both methods provide very significant speedups Employing one method only yields speedups of similar magnitude. Simultaneously using both methods provides only a marginal improvement over each method employed alone.
4.4- Summary
- Both the objprelink1 and combreloc methods yield the same speedups because they reduce the complexity of the runtime linking process in similar ways.
- There is no point using objprelink1 on a system implementing the combreloc method. Follow this link to check whether your Linux system implements thecombreloc method.
5- On-demand resolution of virtual table entries.
Both the objprelink1 and combreloc methods improve the runtime linking time by reducing the number of actual symbol resolutions. Further improvements can be achieved by providing lazy symbol resolution for virtual table entries.
Lazy symbol resolution improves the responsiveness of interactive programs because the cost of symbol resolution is distributed over the duration of the program execution. It also improves the overall runtime linking time because few programs actually use all the functions provided by the needed shared libraries. On our test system (Red Hat Linux 7.2 with KDE3), thekonqueror web browser and its needed shared libraries reference 40021 distinct symbols in their relocation records. A typical session however only uses 17919 distinct symbols. More interestingly, the konquerorwindow appears after using only 13214 distinct symbols.
5.1- objprelink2
It seems very easy to obtain lazy symbol resolution for virtual table entries. We must first create a PLT entry for each distinct symbol in the virtual tables, and have each virtual table entry point to the corresponding PLT entry stub. The first call to a virtual function will invoke the runtime linker. Subsequent calls will directly jump to the target function.
Yet a difficult problem remains. As discussed by a remark in section 2, the PLT stub code expects that the got pointer register contains a pointer to the GOT of the shared library containing the PLT entry. This is true when the virtual function is called from code contained in this same shared library. This is no longer true when the virtual function is called from the executable program or from another shared library. Such a call usually causes a program crash.
Program objprelink2 solves the problem by inserting fake relocation records that cause the runtime linker to patch the relevant PLT entries and make them work regardless of the contents of the got pointer register.
The standard i386 stub code for a shared library PLT is laid out according to the following template. The got pointer register is named %ebx.
PLT0: pushl 4(%ebx) // Code for entering runtime linker
jmp *8(%ebx) // Code for entering runtime linker
....
// Beginning of PLT stub i.
PLTi: jmp *offsetofgotentry(%ebx)
// This GOT entry initially points
// to the following instruction.
pushl $relocationrecordforgotentry
jmp PLT0
// This jump enters the runtime linker code that
// updates the GOT entry and calls the target.
// Subsequent calls to this PLT entry will
// then jump directly to the target function.
Program objprelink2 redefines the relevant PLT entries according to the following model.
PLT0: pushl GOT+4 jmp *GOT+8 .... PLTi: jmp *GOT+offsetofgotentry pushl $relocationrecordforgotentry jmp PLT0
References to the got pointer register have been replaced by the absolute address of the relevant GOT entries. Such a construct is normally used in executable programs. Since shared libraries can be loaded at variable addresses, objprelink2 creates additional relocation records that will adjust these absolute addresses according to the actual loading address of the shared library.
The program objprelink2 performs these modifications by preprocessing the object files and post processing the shared library. Additional details can be found here. It would be considerably more elegant and more efficient to perform these operations inside the GNU link editor ld. Yet objprelink2 is sufficient for verifying the concept and measuring the resulting performance.
5.2- Relocation counts by type
The performance of the objprelink2 method can be measured using the test shared libraries testclassN.so presented in section 4.3. A first measurement counts the number of relocation records of each type in the test shared librariestestclassN.so. Four record types are relevant for Intel processors:
R_386_RELATIVE | These records describe patches than only depend on the loading address of the shared library and therefore require no symbol resolution. |
---|---|
R_386_32 | These records describe patches that require a symbol resolution when the program starts up. Virtual table entries are usually constructed using these relocation records. |
R_386_JUMP_SLOT | These records describe GOT patches that involve a lazy symbol resolution. These relocation records are not processed when the program starts up, but when the corresponding function is first invoked. |
R_386_GLOB_DATA | These records describe GOT patches that require a symbol resolution when the program starts up. These records are listed for reference and remained unchanged. |
The following tables present the relocation record counts for the test shared libraries compiled as usual, using objprelink1, or using objprelink2.
Relocation record counts.
Normal compilation.
| | R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | | | --------------- | ----------------- | ------------------ | ---------------- | - | | testclass1.so | 121 | 9 | 8 | 3 | | testclass2.so | 242 | 13 | 8 | 3 | | testclass5.so | 605 | 25 | 8 | 3 | | testclass10.so | 1210 | 45 | 8 | 3 | | testclass20.so | 2420 | 85 | 8 | 3 | | testclass50.so | 6050 | 205 | 8 | 3 | | testclass500.so | 60500 | 2005 | 8 | 3 |
Relocation record counts.
Compiled with objprelink1.
| | R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | | | --------------- | ----------------- | ------------------ | ---------------- | ----- | | testclass1.so | 120 | 9 | 8 | 123 | | testclass2.so | 125 | 13 | 8 | 243 | | testclass5.so | 140 | 25 | 8 | 603 | | testclass10.so | 165 | 45 | 8 | 1203 | | testclass20.so | 215 | 85 | 8 | 2403 | | testclass50.so | 365 | 205 | 8 | 6003 | | testclass500.so | 2615 | 2005 | 8 | 60003 |
Relocation record counts.
Compiled with objprelink2.
| | R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | | | --------------- | ----------------- | ------------------ | ---------------- | ----- | | testclass1.so | 1 | 9 | 127 | 244 | | testclass2.so | 2 | 13 | 131 | 368 | | testclass5.so | 5 | 25 | 143 | 740 | | testclass10.so | 10 | 45 | 163 | 1360 | | testclass20.so | 20 | 85 | 203 | 2600 | | testclass50.so | 50 | 205 | 323 | 6320 | | testclass500.so | 500 | 2005 | 2123 | 62120 |
These counts illustrate how both methods work:
- Program objprelink1 changes the virtual table relocation records from type R_386_32 to type R_386_RELATIVE. It also introduces one new R_386_32 relocation record for each distinct symbol referenced by the virtual table entries.
- Program objprelink2 also changes the virtual table relocation records from type R_386_32 to type R_386_RELATIVE. It introduces a new R_386_JUMP_SLOT lazy relocation record for each distinct symbol referenced by the virtual table entries. It also introduces additionalR_386_RELATIVE relocation record for fixing the PLT stubs.
5.3 - Execution times
The following table compares the execution times of C++ programs composed of an empty main function linked with the test shared librariestestclassN.so. Execution times are averaged on 100 runs on a 600 MHz Pentium III laptop computer running Red Hat Linux 7.2.
Four cases are considered, depending on whether the Qt library and/or the test shared library were compiled using the default settings or usingobjprelink2. The combreloc option and the symbol resolution cache (see section 4.2) were enabled in all experiments.
Compared execution times (milliseconds).
libqt-mt.sotestclass*.so | defaultdefault | defaultobjprelink2 | objprelink2default | objprelink2objprelink2 |
---|---|---|---|---|
testclass1.so | 93 | 92 | 47 | 46 |
testclass2.so | 93 | 94 | 47 | 46 |
testclass5.so | 94 | 92 | 47 | 46 |
testclass10.so | 93 | 92 | 48 | 46 |
testclass20.so | 95 | 93 | 49 | 47 |
testclass50.so | 100 | 97 | 52 | 49 |
testclass500.so | 144 | 125 | 101 | 75 |
Lazy linking virtual table entries brings another 50% speedup over previous improvements. Compiling the test shared libraries brings a speedup that is roughly proportional to the number of classes in the shared library. Switching from a normal version to a version of Qt compiled withobjprelink2 saves about 40 milliseconds.
5.4- Qt performance
The following table counts the number of relocation records of each type in the Qt-3.0.3 library. The number of upfront symbol resolutions has been reduced from 26823 to 3640.
Relocation record counts for libqt-mt.so.3.0.3.
| | R_386_32 | R_386_GLOB_DAT | R_386_JUMP_SLOT | R_386_RELATIVE | | | --------------- | ----------------- | ------------------ | ---------------- | ----- | | default | 24571 | 2252 | 6916 | 11356 | | objprelink2 | 1388 | 2252 | 12339 | 40750 |
The following table indicates the sizes of various sections of the Qt-3.0.3 shared library file.
Section sizes for libqt-mt.so.3.0.3 (stripped).
| | Total | .plt | .text | .{*}data | .got | .rel.{*} | | | --------------- | ----- | ----- | --------- | ----- | --------- | ---- | | default | 6660K | 110K | 4147K | 1010K | 37K | 361K | | objprelink2 | 6939K | 198K | 4209K | 1055K | 58K | 454K |
Several size increases can be observed:
- The size of both the PLT and the GOT increases in order to accommodate one additional PLT stubs for each distinct symbol referenced by virtual table entries.
- The size of the REL sections increases in order to accommodate the additional relocation records required to fix the PLT stub code.
- The size of the TEXT section increases because of technical reasons related to the objprelink2 implementation.
- The size of the RODATA section increases significantly because our version of the GNU link editor stubbornly refuses to merge identical read-only constants when performing the incremental link edition needed by objprelink2. An optimal implementation of the objprelink2 ideas would not change the size of either the TEXT or RODATA section. Details can be found here.
5.5- KDE performance
Our final test consists in compiling all the KDE3 desktop environment usingobjprelink2. Libraries compiled with objprelink2 seem to perform as flawlessly as the originals on our test machine, a 600 MHz Pentium III laptop running Red Hat Linux 7.2.
The following printout shows typical statistics returned by the runtime linker when running the web browser konqueror, before and after recompiling Qt and KDE3 with objprelink2.
Using combreloc only:
[leonb@fuji src]$ LD_DEBUG=statistics konqueror 32013: runtime linker statistics: 32013: total startup time in dynamic loader: 159462206 clock cycles 32013: time needed for relocation: 156734640 clock cycles (98.2%) 32013: number of relocations: 22312 32013: number of relocations from cache: 46825 32013: time needed to load objects: 2506857 clock cycles (1.5%)
After recompiling Qt and KDE3 with objprelink2:
[leonb@fuji leonb]$ LD_DEBUG=statistics konqueror 28630: runtime linker statistics: 28630: total startup time in dynamic loader: 73010700 clock cycles 28630: time needed for relocation: 70757732 clock cycles (96.9%) 28630: number of relocations: 8201 28630: number of relocations from cache: 4414 28630: time needed to load objects: 2014055 clock cycles (2.7%)
Multiple runs can return very different results for the time needed to load objects. The results for the time needed for relocation are very consistent. The numbers of relocations, of course, never change across multiple runs.
Recompiling both Qt and KDE3 with objprelink2 reduces the time needed for relocation by one half compared to combreloc alone. This represents about 150 milliseconds on our test machine. This result however does not account for the time needed to perform the lazy symbol resolutions.
A more interesting measurement evaluates the delay between the momentkonqueror is launched and the moment konqueror starts waiting for user input. This startup time can be evaluated by tracing well chosen system calls using the program strace. The following table reports the konqueror startup times with different setups for lazy symbol resolution:
Konqueror startup times (seconds)
No lazy resolution, (using LD_BIND_NOW=1) | 3.07 |
---|---|
Normal lazy resolution, (using combreloc). | 2.80 |
Normal lazy resolution, Qt compiled with objprelink2. | 2.78 |
Normal lazy resolution, Qt and KDE3 compiled with objprelink2. | 2.66 |
The konqueror startup time decreases by about 140 milliseconds when both Qt and KDE3 are recompiled using objprelink2. This is close to the results obtained from the runtime linker statistics. It also demonstrates that a large improvement in runtime linking speed no longer translates into obvious improvements in application startup times.
Only recompiling Qt does not bring significant speedups. This result probably means that a large proportion of Qt virtual table symbols need to be resolved during the application startup.
5.6- Summary
- Program objprelink2 still reduces the runtime linking time by about 50%. This reduction comes on top of the more significant speedups achieved by either combreloc or objprelink1.
- The runtime linking time now represents only a small part of the KDE3 application startup time. Large speedups are to be found elsewhere.
Follow this link to replicate.
6- Comparison with the ELF pre-linker.
6.1- Pre-linker overview
Another way to improve the runtime linking consists in _pre-relocating_and pre-linking shared libraries. This approach is very nicely illustrated by Jelinek's ELF pre-linker [Jelinek-2001].
- Pre-relocating a shared library means selecting a preferred memory address for loading the shared library, and pre-patching the shared library file for execution at this selected address.
- Pre-linking a shared library means identifying which other shared libraries are needed by this library, and pre-patching the shared library file by assuming that the needed shared libraries will be loaded at their selected pre-relocation addresses. Pre-linking involves an additional subtlety. The ELF document specifies that undefined symbols in shared libraries must be first searched in the main executable program, then searched in the needed shared libraries. It is therefore necessary to record potential conflicts in the executable file, that is to say, a list of symbols defined by the main executable that override symbols defined by shared libraries.
When a program starts, the runtime linker checks (a) that the executable program file contains an up-to-date conflict list, (b) that all the needed shared libraries have been pre-relocated and pre-linked, (c) that all the shared libraries can be loaded at their selected pre-relocation address, and (d) that no shared library has been modified since pre-linking object files that depend on them. When all conditions are verified, the runtime linker needs only apply the few patches described by the conflict list. This is obviously a very fast operation since it mostly consists in doing nothing. It also maximizes the number of virtual memory pages that can be shared between several instances of the same shared library running in different processes.
The price of this performance is a reduced flexibility. The runtime linker must revert to the usual processing as soon as one of the above condition is not met. Upgrading a minor shared library in the system is very likely to compromise the runtime linking speed of a large number of programs. The only way to restore the optimal speed then consists of running the pre-linker on all executables and all shared libraries in the system. This is a potentially expensive operation.
6.2- Suggestion
The results reported in this paper suggest a way to obtain nearly optimal runtime linking speed with a much smaller cost in flexibility.
- Shared libraries could be pre-relocated but not pre-linked. This would eliminate the pre-linking dependencies between required shared libraries and eliminate the need for maintaining up-to-date conflict lists. Yet it would deliver optimal virtual memory page sharing for all shared libraries than can be loaded at their selected pre-relocation address.
- Lazy symbol resolution can be arranged as suggested by theobjprelink2 program. Extensive lazy symbol resolution would considerably reduce the runtime linking cost of not performing the pre-linking step.
7- Conclusion.
The runtime linking properties of C++ programs has been described. Complexity reduction methods such as objprelink1 and combreloc have been compared and shown to be equivalent. Further speedups using lazy resolution of virtual table symbols have been demonstrated usingobjprelink2. Comparing these methods with the full ELF pre-linker has lead to suggesting a promising hybrid approach.
References
[ELF-1995]
TIS Committee: Tool Interface Standard (TIS) - Executable and Linking Format (ELF) Specification - version 1.2, May 1995, http://www.linuxbase.org/spec/refspecs/elf.pdf.
[Bastian-2001]
Waldo Bastian: Making C++ ready for the desktop, May 2001, http://www.suse.de/~bastian/Export/linking.txt.
[Jelinek-2001]
Jakub Jelinek: ELF prelinking, June 2001, http://sources.redhat.com/ml/libc-alpha/2001-06/msg00113.html. Additional info here.
[Bottou-2001]
Leon Bottou: Faster startups by fixing C++ object files, July 2001, http://objprelink.sourceforge.net/index-1.html. Additional info here.
[Jelinek-2001-b]
Jakub Jelinek: Support -z combreloc in binutils, August-2001, http://sources.redhat.com/ml/binutils/2001-08/msg00311.html.