Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size by fxamacker · Pull Request #1944 · onflow/flow-go
Description
Optimize checkpoint creation (including loading):
- 47x speedup (11.7 hours to 15 mins), avoid 431 GB alloc/op, avoid 7.6 billion allocs/op
- 171x speedup (11.4 hours to 4 mins) in MTrie traversal+flattening+writing phase
- Reduce long-held data in RAM by 116+ GB (lowers hardware requirements)
- Reduce checkpoint file size by 6.9+ GB (another 4.4+ GB reduction planned in separate PR) without using compression
Most of these optimizations were proposed in comments on issue #1750. I'm moving all remaining optimizations, such as concurrency and compression, to separate PRs, so this is ready for review as-is.
Growing interim and leaf node counts are causing checkpoint creation to take hours. This PR trades some readability and simplicity for gains in speed, memory efficiency, and storage efficiency.
I limited the scope of this PR to optimizations that don't require performance tradeoffs or overhead (like adding processes or IPC).
Big thanks to @ramtinms for opening #1750 to point out that trie flattening can have big optimizations. 👍
Closes #1750
Closes #1884
Updates #1744, #1746, https://github.com/dapperlabs/flow-go/issues/6114
Impact on Execution Nodes
⚠️ Unoptimized checkpoint creation reaches 248+ GB RAM within the first 30 minutes and can run for about 15-17+ hours on EN3. During heavy load, that duration was long enough for EN3 to accumulate enough WAL files to trigger another checkpoint immediately after the current one finished, so the 248 GB of RAM is held again, with another 590 GB alloc/op and 9.8 billion allocs/op.
EN Startup Time
This PR speeds up EN startup time in several ways:
- Checkpoint loading and WAL replaying will be optimized for speed (see benchmarks).
- Checkpoint creation will be fast enough to run multiple times per day, which will reduce WAL segments that need to be replayed during startup.
- Checkpoint creation finishing in minutes rather than 15+ hours reduces the risk of being interrupted by shutdowns, etc., which can cause extra WAL segments to be replayed on the next EN startup.
See issue #1884 for more info about extra WAL segments causing EN startup delays.
EN Memory Use and System Requirements
This PR can reduce long-held data in RAM (15+ hours on EN3) by up to 116 GB. Additionally, eliminating 431 GB alloc/op and 7.6 billion allocs/op will reduce load on the Go garbage collector.
Benchmark Comparisons
Unoptimized Checkpoint Creation
After the first 30 minutes, and for the next 11+ hours (15-17+ hours on EN3):
Optimized Checkpoint Creation
Finishes in 15+ minutes and peaks in the last minute at:
Preliminary Results (WIP) Without Adding Concurrency Yet
MTrie Checkpoint Load+Create v3 (old) vs v4 (WIP)
Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
```
name              old time/op    new time/op    delta
NewCheckpoint-48  42052s ± 0%    886s ± 0%      -97.89%

name              old alloc/op   new alloc/op   delta
NewCheckpoint-48  590GB ± 0%     159GB ± 0%     -73.04%

name              old allocs/op  new allocs/op  delta
NewCheckpoint-48  9.80G ± 0%     2.19G ± 0%     -77.67%
```
DISCLAIMERS: not done yet, didn't add concurrency yet, n=1 due to duration,
file system cache can affect results.
UPDATE: on March 1, optimized checkpoint creation speed (v4 -> v4 with 41 WALs) varied by 63 seconds between the first 2 runs (all 3 runs used the same input files to create the same output):
- 926 secs (first run right after OS booted, maybe didn't wait long enough)
- 863 secs (second run without rebooting OS, maybe file system cache helped)
- 879 secs (third run after other activities without rebooting OS)
Load Checkpoint File + replay 41 WALs v3 (old) vs v4 (WIP)
Input: checkpoint.00003443 + 41 WAL files
Platform: Go 1.16, benchnet (the big one)
```
name                      old time/op    new time/op    delta
LoadCheckpointAndWALs-48  989s ± 0%      676s ± 0%      -31.64%

name                      old alloc/op   new alloc/op   delta
LoadCheckpointAndWALs-48  297GB ± 0%     136GB ± 0%     -54.35%

name                      old allocs/op  new allocs/op  delta
LoadCheckpointAndWALs-48  5.98G ± 0%     2.17G ± 0%     -63.67%
```
DISCLAIMERS: not done yet, didn't add concurrency yet, n=1,
file system cache affects speed so delta can be -28% to -32%.
Changes include:
- Create checkpoint file v4 to replace v3, while retaining the ability to load older versions (v4 is not yet finalized). First, reduce checkpoint file size by 5.8+ GB. Next, reduce checkpoint file size by another 1.1+ GB by removing encoded hash size and path size. A further reduction of 4.4+ GB is planned, for a 10.2 GB combined reduction compared to v3. These file size reductions don't use compression.
- Use stream encoding and writing for checkpoint file creation. This reduces RAM use by avoiding the creation of a ~400 million element slice containing all nodes, and the creation of 400 million objects. Savings will be about 43.2+ GB, plus more from other changes in this PR.
- Add NewUniqueNodeIterator() to skip shared nodes. It optimizes node iteration over a forest by skipping shared sub-tries that were already visited, so only unique nodes are iterated.
- Optimize reading the checkpoint file by reusing a buffer. A 4096-byte scratch buffer eliminates another 400+ million allocs during checkpoint reading. Since checkpoint creation requires reading a checkpoint, this optimization benefits both.
- Optimize creating the checkpoint by reusing a buffer. A 4096-byte scratch buffer eliminates another 400+ million allocs during checkpoint writing.
- Skip StorableNode/StorableTrie when creating checkpoints:
  - Merge FlattenForest() with StoreCheckpoint() to iterate and serialize nodes without creating intermediate StorableNode/StorableTrie objects.
  - Stream encode nodes to avoid creating a 400+ million element slice holding 400 million StorableNode objects.
  - Change the checkpoint file format (v4) to store node count and trie count in the footer (instead of the header), as required for stream encoding.
  - Support previous checkpoint formats (v1, v3).
- Skip StorableNode/StorableTrie when reading checkpoints:
  - Merge RebuildTries() with LoadCheckpoint() to deserialize data into nodes without creating intermediate StorableNode/StorableTrie objects.
  - Avoid creating a 400+ million element slice holding all StorableNodes read from the checkpoint file.
- DiskWal.Replay*() APIs are changed. checkpointFn receives []*trie.MTrie instead of FlattenedForest.
- Add flattening encoding tests, add checkpoint v3 decoding tests, add more validation, add comments, and refactor code for readability.
TODO
- update benchmark comparisons using latest results
Additional TODOs that will probably be wrapped up in a separate PR
- maybe add a zeroCopy flag and more tests for these functions: DecodeKey(), DecodeKeyPart(), and DecodePayload(). Not high priority because these functions appear to be unused.
- further reduce data written to checkpoint, part 2 (e.g. encoded payload value size can be uint32 instead of uint64, but changing this affects code outside checkpoint creation)
- micro optimizations (depends on speedup vs readability tradeoff)
- add concurrency
- maybe add file compression or payload compression (only if concurrency is added)
- maybe replace CRC32 with BLAKE3 or BLAKE2 since checkpoint file is >60GB
- maybe encode integers using variable length to reduce space (possibly not needed if/when we use file compression)
- maybe split the checkpoint file into 3 files (metadata, nodes, and payload file). I synced with Ramtin and his preference is to keep the checkpoint as one file for this PR.