Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size by fxamacker · Pull Request #1944 · onflow/flow-go

Description

Optimize checkpoint creation (includes loading).

Most of the optimizations were proposed in comments on issue #1750. I'm moving all remaining optimizations (concurrency, compression, etc.) to separate PRs so this one is ready for review as-is.

Growing interim and leaf node counts are causing checkpoint creation to take hours. This PR trades some readability and simplicity for speed, memory efficiency, and storage efficiency.

I limited the scope of this PR to optimizations that don't require performance tradeoffs or overhead (such as adding processes or IPC).

Big thanks to @ramtinms for opening #1750 to point out that trie flattening can have big optimizations. 👍

Closes #1750
Closes #1884
Updates #1744, #1746, https://github.com/dapperlabs/flow-go/issues/6114

Impact on Execution Nodes

⚠️ Unoptimized checkpoint creation reaches 248+ GB RAM within the first 30 minutes and can run for about 15-17+ hours on EN3. Under heavy load, that was long enough on EN3 to accumulate enough WAL files to trigger another checkpoint immediately after the current one finished, so the 248 GB of RAM is held again, with another 590 GB alloc/op and 9.8 billion allocs/op.

EN Startup Time

This PR speeds up EN startup time in several ways:

See issue #1884 for more info about extra WAL segments causing EN startup delays.

EN Memory Use and System Requirements

This PR can reduce long-held data in RAM (15+ hours on EN3) by up to 116 GB. Additionally, eliminating 431 GB alloc/op and 7.6 billion allocs/op will reduce load on the Go garbage collector.

Benchmark Comparisons

Unoptimized Checkpoint Creation

After the first 30 minutes and for the next 11+ hours (15-17+ hours on EN3):
image

Optimized Checkpoint Creation

Finishes in 15+ minutes and peaks in the last minute at:
image

Preliminary Results (WIP) Without Adding Concurrency Yet

MTrie Checkpoint Load+Create v3 (old) vs v4 (WIP)

Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
name                        old time/op       new time/op       delta
NewCheckpoint-48            42052s ± 0%        886s ± 0%       -97.89%

name                        old alloc/op      new alloc/op      delta
NewCheckpoint-48             590GB ± 0%       159GB ± 0%       -73.04%

name                        old allocs/op     new allocs/op     delta
NewCheckpoint-48             9.80G ± 0%       2.19G ± 0%       -77.67%

DISCLAIMERS: not done yet, didn't add concurrency yet, n=1 due to duration, 
file system cache can affect results.
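
As a quick sanity check, the headline figures can be re-derived from the NewCheckpoint-48 rows above. The sketch below is only arithmetic on the reported numbers; the helper names `percentDelta` and `speedup` are made up for illustration and are not part of flow-go.

```go
package main

import "fmt"

// percentDelta returns the relative change from oldVal to newVal in percent.
// A negative result means the new measurement is smaller.
func percentDelta(oldVal, newVal float64) float64 {
	return (newVal - oldVal) / oldVal * 100
}

// speedup returns how many times shorter the new duration is than the old one.
func speedup(oldSec, newSec float64) float64 {
	return oldSec / newSec
}

func main() {
	// Figures copied from the NewCheckpoint-48 rows of the table above.
	const oldSec, newSec = 42052.0, 886.0

	fmt.Printf("time:   %.1f h -> %.1f min (%.1fx faster, delta %.2f%%)\n",
		oldSec/3600, newSec/60, speedup(oldSec, newSec), percentDelta(oldSec, newSec))
	fmt.Printf("alloc:  %.0f GB fewer per op (delta %.2f%%)\n",
		590.0-159.0, percentDelta(590, 159))
	fmt.Printf("allocs: %.2f billion fewer per op (delta %.2f%%)\n",
		9.80-2.19, percentDelta(9.80, 2.19))
}
```

The alloc and allocs deltas computed this way can differ from the table in the last digit, because the table values shown here are already rounded.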

UPDATE: on March 1, optimized checkpoint creation speed (v4 -> v4 with 41 WALs) varied by 63 seconds between the first 3 runs (all 3 used the same input files to create the same output):

Load Checkpoint File + replay 41 WALs v3 (old) vs v4 (WIP)

Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
name                        old time/op       new time/op       delta
LoadCheckpointAndWALs-48     989s ± 0%         676s ± 0%       -31.64%

name                        old alloc/op      new alloc/op      delta
LoadCheckpointAndWALs-48    297GB ± 0%        136GB ± 0%       -54.35%

name                        old allocs/op     new allocs/op     delta
LoadCheckpointAndWALs-48    5.98G ± 0%        2.17G ± 0%       -63.67%

DISCLAIMERS: not done yet, didn't add concurrency yet, n=1, 
file system cache affects speed so delta can be -28% to -32%.
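
The tables above have the time/op, alloc/op, and allocs/op columns that Go benchmarks report and that `benchstat` compares between an old and a new run. For readers who want to collect the same three metrics, here is a minimal sketch using `testing.Benchmark` from the standard library. The workload `encodeNodes` is a hypothetical stand-in, not the checkpoint code benchmarked in this PR, and `measure` is likewise an illustrative helper.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps each buffer reachable so the allocation is not optimized away.
var sink []byte

// encodeNodes is a hypothetical stand-in workload (NOT flow-go code).
// It allocates one 32-byte buffer per node so there is something to measure.
func encodeNodes(n int) int {
	total := 0
	for i := 0; i < n; i++ {
		buf := make([]byte, 32)
		buf[0] = byte(i)
		total += len(buf)
		sink = buf
	}
	return total
}

// measure runs the workload under the standard benchmark harness and
// returns the result, which carries timing and allocation statistics.
func measure(n int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			encodeNodes(n)
		}
	})
}

func main() {
	r := measure(1000)
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

With n=1, as in the disclaimers above, there is no variance to report, which is why every row shows ± 0%. Repeating a benchmark several times (for example with `go test -bench . -count N`) gives `benchstat` enough samples to estimate the spread.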

Changes include:

TODO

Additional TODOs that will probably be wrapped up in a separate PR