[LLVMdev] [RFC] Less memory and greater maintainability for debug info IR
Sean Silva chisophugis at gmail.com
Thu Oct 16 22:09:29 PDT 2014
On Tue, Oct 14, 2014 at 11:40 AM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:

> On Oct 13, 2014, at 6:59 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> Stupid question, but when I was working on LTO last Summer the primary
>> culprit for excessive memory use was due to us not being smart when
>> linking the IR together (Espindola would know more details). Do we still
>> have that problem? For starters, how does the memory usage of just
>> llvm-link compare to the memory usage of the actual LTO run? If the
>> issue I was seeing last Summer is still there, you should see that the
>> invocation of llvm-link is actually the most memory-intensive part of
>> the LTO step, by far.
>
> To be clear, I'm running the command-line:
>
>     $ llvm-lto -exported-symbol main llvm-lto.lto.bc
>
> Since this is a pre-linked bitcode file, we shouldn't be wasting much
> memory from the linking stage. Running ld64 directly gives a peak memory
> footprint of ~30GB for the full link, so there's something else going on
> there that I'll be digging into later.
Dig into this first!
In the OP you are talking about essentially a pure "optimization" (in the programmer-wisdom "beware of it" sense), to "save" 2GB of peak memory. But from your analysis it's not clear that this 2GB savings is actually reflected as a reduction in peak memory usage (since the ~30GB peak might be happening elsewhere in the LTO process). It is this ~30GB peak, and not the one you originally analyzed, that your customers presumably care about.
Be realistic: ~15GB (equivalent to all the memory you tracked combined) is flying under the radar of your analysis. It would be very surprising if this has not led to multiple erroneous conclusions about what direction to take things in your plan.
To recap, the analysis you performed seems to support neither of the following conclusions:
- Peak memory usage during LTO would be improved by this plan
- Build time for LTO would be improved by this plan (from what you have posted, you didn't measure time at all)
I hesitate to consider such an "aggressive plan" (as you describe it) with neither of the above two legs to stand on.
Of course, this is all tangential to the discussion of e.g. a more readable/writable .ll form for debug info, or debug info compatibility. However, it seems like you jumped into this from the point of view of it being an optimization, rather than a maintainability/compatibility thing.
-- Sean Silva
>> 2GB (out of 15.3GB i.e. ~13%) seems pretty pathetic savings when we
>> have a single pie slice near 40% of the # of Value's allocated and
>> another at 21%. Especially this being "step 4".
>
> 15.3GB is the peak memory of llvm-lto. This comes late in the process,
> after DIEs have been created. I haven't looked in detail past debug info
> metadata, but here's a sketch of what I imagine is in memory at this
> point.
>
>   - The IR, including uniquing side-tables.
>   - Optimization and backend passes.
>   - Parts of SelectionDAG that haven't been freed.
>   - MachineFunctions and everything inside them.
>   - Whatever state the AsmPrinter, etc., need.
>
> I expect to look at a couple of other debug-info-related memory usage
> areas once I've shrunk the metadata:
>
>   - What's the total footprint of DIEs? This run has 4M of them, whose
>     allocated footprint is ~1GB. I'm hoping that a deeper look will
>     reveal an even larger attack surface.
>   - How much do debug info intrinsics cost? They show up in at least
>     three forms -- IR-level, SDNodes, and MachineInstrs -- and there can
>     be a lot of them. How many? What's their footprint?
>
> For now, I'm focusing on the problem I've already identified.
>
>> You need more data. Right now you have essentially one data point,
>
> I looked at a number of internal C and C++ programs with -flto -g, and
> dug deeply into llvm-lto.lto.bc because it's small enough that it's easy
> to analyze (and its runtime profile was representative of the other C++
> programs I was looking at). I didn't look deeply at a broad spectrum,
> but memory usage and runtime for building clang with -flto -g is
> something we care a fair bit about.
>
>> and it's not even clear what you measured really. If your goal is
>> saving memory, I would expect at least a pie chart that breaks down
>> LLVM's memory usage (not just # of allocations of different sorts; an
>> approximation is fine, as long as you explain how you arrived at it and
>> in what sense it approximates the true number).
>
> I'm not sure there's value in diving deeply into everything at once.
> I've identified one of the bottlenecks, so I'd like to improve it before
> digging into the others.
>
> Here's some visibility into where my numbers come from.
>
> I got the 15.3GB from a profile of memory usage vs. time. Peak usage
> comes late in the process, around when DIEs are being dealt with.
> Metadata node counts stabilize much earlier in the process.
>
> The rest of the numbers are based on counting MDNodes and their
> respective MDNodeOperands, and multiplying by the cost of their
> operands. Here's a dump from around the peak metadata node count:
>
>     LineTables = 7500000 [30000000], InlinedLineTables = 6756182,
>     Directives = 7611669 [42389128], Arrays = 570609 [577447],
>     Others = 1176556 [5133065]
>     Tag = 256,   Count = 554992,  Ops = 2531428,  Name = DW_TAG_auto_variable
>     Tag = 16647, Count = 988,     Ops = 4940,     Name = DW_TAG_GNU_template_parameter_pack
>     Tag = 52,    Count = 9933,    Ops = 59598,    Name = DW_TAG_variable
>     Tag = 33,    Count = 190,     Ops = 190,      Name = DW_TAG_subrange_type
>     Tag = 59,    Count = 1,       Ops = 3,        Name = DW_TAG_unspecified_type
>     Tag = 40,    Count = 24731,   Ops = 24731,    Name = DW_TAG_enumerator
>     Tag = 21,    Count = 354166,  Ops = 2833328,  Name = DW_TAG_subroutine_type
>     Tag = 2,     Count = 77999,   Ops = 623992,   Name = DW_TAG_class_type
>     Tag = 47,    Count = 27122,   Ops = 108488,   Name = DW_TAG_template_type_parameter
>     Tag = 28,    Count = 8491,    Ops = 33964,    Name = DW_TAG_inheritance
>     Tag = 66,    Count = 10930,   Ops = 43720,    Name = DW_TAG_rvalue_reference_type
>     Tag = 16,    Count = 54680,   Ops = 218720,   Name = DW_TAG_reference_type
>     Tag = 23,    Count = 624,     Ops = 4992,     Name = DW_TAG_union_type
>     Tag = 4,     Count = 5344,    Ops = 42752,    Name = DW_TAG_enumeration_type
>     Tag = 11,    Count = 360390,  Ops = 1081170,  Name = DW_TAG_lexical_block
>     Tag = 258,   Count = 1,       Ops = 1,        Name = DW_TAG_expression
>     Tag = 13,    Count = 73880,   Ops = 299110,   Name = DW_TAG_member
>     Tag = 58,    Count = 1387,    Ops = 4161,     Name = DW_TAG_imported_module
>     Tag = 1,     Count = 2747,    Ops = 21976,    Name = DW_TAG_array_type
>     Tag = 46,    Count = 1341021, Ops = 12069189, Name = DW_TAG_subprogram
>     Tag = 257,   Count = 4373879, Ops = 20785065, Name = DW_TAG_arg_variable
>     Tag = 8,     Count = 2246,    Ops = 6738,     Name = DW_TAG_imported_declaration
>     Tag = 53,    Count = 57,      Ops = 228,      Name = DW_TAG_volatile_type
>     Tag = 15,    Count = 55163,   Ops = 220652,   Name = DW_TAG_pointer_type
>     Tag = 41,    Count = 3382,    Ops = 6764,     Name = DW_TAG_file_type
>     Tag = 22,    Count = 158479,  Ops = 633916,   Name = DW_TAG_typedef
>     Tag = 48,    Count = 486,     Ops = 2430,     Name = DW_TAG_template_value_parameter
>     Tag = 36,    Count = 15,      Ops = 45,      Name = DW_TAG_base_type
>     Tag = 17,    Count = 1164,    Ops = 8148,     Name = DW_TAG_compile_unit
>     Tag = 31,    Count = 19,      Ops = 95,       Name = DW_TAG_ptr_to_member_type
>     Tag = 57,    Count = 2034,    Ops = 6102,     Name = DW_TAG_namespace
>     Tag = 38,    Count = 32133,   Ops = 128532,   Name = DW_TAG_const_type
>     Tag = 19,    Count = 72995,   Ops = 583960,   Name = DW_TAG_structure_type
>
> (Note: the InlinedLineTables stat is included in the LineTables stat.)
>
> You can determine the rough memory footprint of each type of node by
> multiplying the "Count" by sizeof(MDNode) (x86-64: 56B) and the "Ops" by
> sizeof(MDNodeOperand) (x86-64: 32B). Overall, there are 7.5M line tables
> with 30M operands, so by this method their footprint is ~1.3GB. There
> are 7.6M descriptors with 42.4M operands, so their footprint is ~1.7GB.
>
> I dumped another stat periodically to tell me the peak size of the
> side-tables for line table entries, which are split into "Scopes" (for
> non-inlined) and "Inlined" (these counts are disjoint, unlike the
> previous stats):
>
>     Scopes = 203166 [203166], Inlined = 3500000 [3500000]
>
> I assumed that both DenseMap and std::vector over-allocate by 50% to
> estimate the current (and planned) costs for the side-tables.
>
> Another stat I dumped periodically was the breakdown between V(alues),
> U(sers), C(onstants), M(etadata nodes), and (metadata) S(trings).
> Here's a sample from nearby:
>
>     V = 23967800 (40200000 - 16232200)
>     U =  5850877 ( 7365503 -  1514626)
>     C =   205491 (  279134 -    73643)
>     M = 16837368 (31009291 - 14171923)
>     S =   693869 (  693869 -        0)
>
> Lastly, I dumped a breakdown of the types of MDNodeOperands. This is
> also a sample from nearby:
>
>     MDOps = 77644750 (100%)
>     Const = 14947077 ( 19%)
>     Node  = 41749475 ( 53%)
>     Str   =  9553581 ( 12%)
>     Null  = 10976693 ( 14%)
>     Other =   417924 (  0%)
>
> While I didn't use this breakdown for my memory estimates, it was
> interesting nevertheless. Note the following:
>
>   - The number of constants is just under 15M. This dump came less than
>     a second before the dump above, where we have 7.5M line table
>     entries. Line table entries have 2 operands of ConstantInt. This
>     lines up nicely. Note: this checked
>     isa<Constant>(Op) && !isa<GlobalValue>(Op).
>   - There are a lot of null operands. By making subclasses for the
>     various types of debug info IR, we can probably shed some of these
>     altogether.
>   - There are few "Other" operands. These are likely all GlobalValue
>     references, and are the only operands that need to be referenced
>     using value handles.