[RFC] [C++] [Modules] Stop using abbrev and drop the maintenance
Audience
Module users and vendors (including PCH, clang modules, objc modules, standard C++ header units and named modules), and clang developers who need to create new AST types or change existing ones (roughly, the people who work with the C++ language committee).
Problem
If you have ever added or changed some attributes in the AST and then found you forgot to update the serializer, you know the experience: you simply add the corresponding bits in ASTReaderDecl and ASTWriterDecl (and also ASTReaderStmt, ASTWriterStmt), and then boom! Countless crashes, and you don’t know what happened. After fighting for a long time, you finally find you need to change some weird hardcoded information, mainly:
The experience is terrible.
What’s worse, the developers who implement a new language feature usually don’t notice the problem. It only shows up when end users try the new feature with modules and hit crashes, which is a bad user experience. As a module maintainer, I also suffer when debugging such problems. So I’d like to propose removing this hardcoding from serialization. It makes maintenance easier and makes it easier for clang developers to write new language features.
Background: what is abbrev?
Abbrev is a feature in LLVM’s serializer. Abstractly, LLVM’s serializer can use hardcoded information to compress the serialized data. For the example above:
Abv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Fixed,
7)); // Packed DeclBits: ModuleOwnershipKind,
// isUsed, isReferenced, AccessSpecifier,
//
// The following bits should be 0:
// isImplicit, HasStandaloneLexicalDC, HasAttrs,
// TopLevelDeclInObjCContainer,
// isInvalidDecl
This tells the serializer that 7 bits are enough to store the value; otherwise the serializer may use the full 64 bits. And similarly:
Abv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::VBR, 6)); // DeclContext
This tells the serializer to use VBR6 format to store the information. See https://llvm.org/docs/BitCodeFormat.html :
Variable-width integer (VBR) values encode values of arbitrary size, optimizing
for the case where the values are small. Given a 4-bit VBR field, any 3-bit
value (0 through 7) is encoded directly, with the high bit set to zero. Values
larger than N-1 bits emit their bits in a series of N-1 bit chunks, where all
but the last set the high bit.
For example, the value 30 (0x1E) is encoded as 62 (0b0011'1110) when emitted as
a vbr4 value. The first set of four bits starting from the least significant
indicates the value 6 (110) with a continuation piece (indicated by a high bit
of 1). The next set of four bits indicates a value of 24 (011 << 3) with no
continuation. The sum (6+24) yields the value 30.
The VBR format uses fewer bits when the number is small, but it can be a pessimization when the number is large.
In short, the feature uses hardcoded information as an optimization to produce smaller output.
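The VBR rule quoted above can be sketched in a few lines of C++. This is only an illustration of the encoding as described in the LLVM documentation, not the code LLVM's serializer uses; the names `encodeVBR`, `decodeVBR` and `vbrBitCount` are hypothetical helpers made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: split `value` into VBR chunks of `width` bits each.
// Each chunk carries (width - 1) payload bits; the high bit of a chunk is set
// when another chunk follows. Chunks are returned least significant first.
std::vector<uint32_t> encodeVBR(uint64_t value, unsigned width) {
  assert(width >= 2 && width <= 32 && "need a payload bit and a high bit");
  const uint64_t highBit = uint64_t{1} << (width - 1);
  const uint64_t payloadMask = highBit - 1;
  std::vector<uint32_t> chunks;
  while (value >= highBit) {
    // Emit the low payload bits with the continuation bit set.
    chunks.push_back(static_cast<uint32_t>((value & payloadMask) | highBit));
    value >>= (width - 1);
  }
  chunks.push_back(static_cast<uint32_t>(value)); // last chunk, high bit clear
  return chunks;
}

// Hypothetical inverse of encodeVBR: strip the continuation bits and
// reassemble the payload pieces.
uint64_t decodeVBR(const std::vector<uint32_t> &chunks, unsigned width) {
  const uint64_t payloadMask = (uint64_t{1} << (width - 1)) - 1;
  uint64_t value = 0;
  unsigned shift = 0;
  for (uint32_t chunk : chunks) {
    value |= (uint64_t{chunk} & payloadMask) << shift;
    shift += width - 1;
  }
  return value;
}

// Number of bits the VBR encoding of `value` occupies.
unsigned vbrBitCount(uint64_t value, unsigned width) {
  return static_cast<unsigned>(encodeVBR(value, width).size()) * width;
}
```

With these helpers, `encodeVBR(30, 4)` yields the chunks `0b1110` and `0b0011`, matching the documentation's example. For VBR6, a small value such as 5 takes 6 bits, while a value with all 64 bits set takes 13 chunks, i.e. 78 bits, which is more than a plain 64-bit field.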
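To make the maintenance cost of a hardcoded width concrete, here is a minimal sketch of the Fixed(7) DeclBits case. The struct, the field widths (3 + 1 + 1 + 2 = 7 bits) and the bit order are assumptions chosen for illustration only; they are not taken from Clang's actual ASTWriterDecl code, and `packDeclBits` and `fitsInFixed` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the packed DeclBits in the abbrev comment.
// The widths below are assumptions that happen to add up to 7 bits.
struct DeclBitsSketch {
  unsigned ownershipKind; // assumed 3 bits
  bool isUsed;            // 1 bit
  bool isReferenced;      // 1 bit
  unsigned access;        // assumed 2 bits
};

// Pack the fields into one integer, lowest field first.
uint64_t packDeclBits(const DeclBitsSketch &d) {
  assert(d.ownershipKind < 8 && d.access < 4 && "field exceeds assumed width");
  uint64_t bits = d.ownershipKind;
  bits |= static_cast<uint64_t>(d.isUsed) << 3;
  bits |= static_cast<uint64_t>(d.isReferenced) << 4;
  bits |= static_cast<uint64_t>(d.access) << 5;
  return bits;
}

// True if `value` can be stored in a fixed-width field of `width` bits.
bool fitsInFixed(uint64_t value, unsigned width) {
  return width >= 64 || (value >> width) == 0;
}
```

Under these assumptions every packed value fits in 7 bits. If someone later adds one more flag at bit 7 in the writer but leaves the hardcoded `7` in the abbrev alone, `fitsInFixed(packed | (1u << 7), 7)` becomes false: the width and the packed layout have silently drifted apart, which is the kind of mismatch described in the Problem section.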
Impact
So removing abbrev will increase the size of BMIs. But how much? In my local test, disabling abbrev increased the size of BMIs by 3%. I don’t think that saving is worth the effort; without abbrev, the code can be much simpler and more maintainable.
It would be good if other module vendors could test this on their code bases. I have a simple patch on the DisableAbbrev branch of ChuanqiXu9/llvm-project on GitHub, so we can then discuss what to do based on real numbers.
Some people may say this is a de-optimization and therefore not good. But I think this is a trade-off between maintainability and performance, especially since the performance gain is not significant. Given the limited maintenance resources we have right now, I feel it is worse for the community if maintainers and developers get lost in such not-so-helpful features instead of focusing on other aspects.
If other vendors want to keep the benefits, I would still like to disable this for C++20 modules.
Other related topics
Readers may notice that an important problem we didn’t explore is testing new language features with modules. I wanted to send an RFC asking developers of new language features to test those features with modules, to make modules more stable. I haven’t written it because I don’t know how to write detailed guidelines that tell people precisely how to test with modules. But given that we’re going to have reflection, and possibly a lot of new AST types, maybe it is better to discuss that too.