feat(HA): offline cluster bootstrap from a pre-seeded database (snapshot-and-restore)
Problem
Bringing up a multi-node Raft HA cluster on top of a previously-imported 1+ GB database currently forces every new replica to receive the entire database over HTTP from the leader, in real time, while the leader is also serving writes. The operator-visible symptoms:
- Long catch-up windows (minutes to hours, per replica) during which the new follower's `STATUS` is `STALLED` and writes on the leader are competing with snapshot HTTP traffic.
- Failures we have seen in practice when `arcadedb.ha.snapshotDownloadTimeout` is hit on slow disks or constrained networks (SnapshotInstaller: Snapshot download attempt N/3 failed).
- No way to take advantage of out-of-band copy mechanisms the operator already has (S3, NFS, image bake-in, init containers, `kubectl cp`).
- No way to ship only the delta when a follower is just slightly behind the leader (e.g. brief disconnect, or pre-staged with a slightly older backup).
This is a real pain point reported by users running ArcadeDB on Kubernetes who want to scale a single-pod deployment up to 3 pods AFTER importing data.
What exists today
- Leader-served full snapshot install: `SnapshotHttpHandler` at `/api/v1/ha/snapshot/{database}`, triggered automatically when a follower lags past `arcadedb.ha.snapshotThreshold`. Always full database, regardless of how small the gap is.
- Replicated `restore database` on the leader: `PostServerCommandHandler.replicateRestoredDatabase` calls `RaftReplicatedDatabase.createInReplicas(true)`, which submits an install-database Raft entry with `forceSnapshot=true`. Each replica then pulls the full snapshot from the leader.
Both paths funnel the full database through the leader at runtime.
Proposed feature
A two-part HA improvement, designed and shipped together because they share wire format and code paths.
Part A - offline cluster bootstrap
Allow the operator to pre-seed each pod's filesystem with identical database files BEFORE the cluster forms, and have the Raft cluster recognise this as a valid baseline. After bootstrap, normal Raft replication continues on top of the pre-seeded state.
Part B - delta resync
When a peer is close to the leader's state (lastTxId gap below a configurable threshold and the leader has retained the relevant WAL), ship only the WAL delta instead of the full database. Falls back to the existing full-snapshot path otherwise. Used in two places:
- The catch-up path for the bootstrap mismatch case (Part A): a follower whose fingerprint doesn't match the chosen source attempts a delta first.
- The existing runtime catch-up path (`SnapshotInstaller`): a follower that briefly disconnected and reconnected with `lastTxId` only a few thousand entries behind no longer needs a full snapshot.
Operator workflow (unchanged from earlier draft)
- Import the dataset into a single ArcadeDB instance (no HA).
- Take a `tar` of the database directory (or use a regular full backup file).
- Distribute it out-of-band to every pod's filesystem (init container from S3, NFS read-only mount, baked image layer, `kubectl cp`...).
- Start every pod with HA enabled.
- Pods form the Raft group with everyone already at the same state. No bytes flow through the leader at startup.
Time-to-cluster goes from "minutes-to-hours of HTTP transfer × N replicas" to "seconds, because everyone already has the bytes".
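The "tar and distribute" steps of the workflow can be sketched as below. This is only an illustration under assumptions: the function names and directory layout are hypothetical, the source instance is assumed to be cleanly shut down before archiving, and the copy to each pod (S3, NFS, `kubectl cp`) happens outside this script.

```python
import tarfile
from pathlib import Path


def pack_database(db_dir: Path, archive: Path) -> None:
    """Archive one database directory so every pod can receive identical bytes."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores paths relative to the database directory name.
        tar.add(db_dir, arcname=db_dir.name)


def unpack_database(archive: Path, databases_root: Path) -> None:
    """Extract the archive under a pod's databases directory, before the server starts."""
    databases_root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(databases_root)
```

Running `unpack_database` from an init container on every pod leaves each one with the same files, which is the precondition for the "fingerprint matches" branch described in the Design section.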
Design
1. Config flags
- `arcadedb.ha.bootstrapFromLocalDatabase` (Boolean, default `true`). The bootstrap path engages only when every peer's Raft log is empty (first formation) and is gated by a fingerprint + recency check.
- `arcadedb.ha.bootstrapTimeoutMs` (Long, default `120_000`). How long the bootstrap leader waits for every peer to report its `(fingerprint, lastTxId, oldestRetainedTxId)`.
- `arcadedb.ha.bootstrapDeltaThreshold` (Long, default `100_000`). Maximum `lastTxId` gap for which a peer can be brought up to date by replaying a WAL delta. Above this, the leader-shipped full-snapshot path is used. Set to 0 to disable delta resync.
- `arcadedb.ha.walArchiveSizeMax` (String, default `"1GB"`). Maximum total bytes of retained WAL history that the source peer keeps available for delta resync. Older files are dropped FIFO past this cap.
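To make the defaults concrete, here is a small model of the four settings. It is a sketch, not the server's configuration code: the class and helper names are hypothetical, and reading `"1GB"` as a binary multiple (1024^3 bytes) is an assumption.

```python
import re
from dataclasses import dataclass

# Assumption: binary multiples. The issue does not say how "1GB" is parsed.
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text: str) -> int:
    """Parse a size such as '1GB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


@dataclass(frozen=True)
class BootstrapConfig:
    bootstrap_from_local_database: bool = True   # arcadedb.ha.bootstrapFromLocalDatabase
    bootstrap_timeout_ms: int = 120_000          # arcadedb.ha.bootstrapTimeoutMs
    bootstrap_delta_threshold: int = 100_000     # arcadedb.ha.bootstrapDeltaThreshold
    wal_archive_size_max: str = "1GB"            # arcadedb.ha.walArchiveSizeMax

    @property
    def delta_resync_enabled(self) -> bool:
        # A threshold of 0 disables delta resync.
        return self.bootstrap_delta_threshold > 0

    @property
    def wal_archive_bytes(self) -> int:
        return parse_size(self.wal_archive_size_max)
```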
2. Database state attestation: (fingerprint, lastTxId, oldestRetainedTxId)
Each peer reports a per-database tuple at bootstrap time:
- `fingerprint`: deterministic SHA-256 over `(file name, size, content checksum)` for every file in the database directory, excluding WAL/lock files. Computed by `BootstrapFingerprint` (already shipped in Phase 1, commit `bbc64d813`).
- `lastTxId`: the monotonic transaction LSN persisted in the database (`TransactionManager.getLastTransactionId()`, persisted via `last-tx-id.bin` so it survives clean shutdown; also shipped in Phase 1).
- `oldestRetainedTxId`: the lowest `txId` for which this peer can serve a delta via the new delta endpoint. Equal to `lastTxId + 1` means no WAL is retained → only full-snapshot can satisfy this peer.
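The fingerprint definition can be illustrated with a short sketch. It is not the shipped `BootstrapFingerprint` code: the excluded suffixes, the record layout and the flat (non-recursive) directory walk are all assumptions made for the example.

```python
import hashlib
from pathlib import Path

# Hypothetical exclusion list; the real WAL/lock file names may differ.
EXCLUDED_SUFFIXES = (".wal", ".lck")


def file_checksum(path: Path) -> str:
    """SHA-256 of one file's content, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def database_fingerprint(db_dir: Path) -> str:
    """SHA-256 over (file name, size, content checksum) of every included file."""
    digest = hashlib.sha256()
    files = sorted(
        (p for p in db_dir.iterdir()
         if p.is_file() and not p.name.endswith(EXCLUDED_SUFFIXES)),
        key=lambda p: p.name,  # sort by name so the result is deterministic
    )
    for path in files:
        record = f"{path.name}\0{path.stat().st_size}\0{file_checksum(path)}\n"
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()
```

Sorting by file name is what makes the digest independent of directory listing order, so two pods with byte-identical files report the same value.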
3. Bootstrap protocol (unchanged from prior version)
a. Wait for every peer in HA_SERVER_LIST to report (or until bootstrapTimeoutMs). RPC: POST /api/v1/cluster/bootstrap-state.
b. Pick the peer with the highest lastTxId as the source.
c. If the source isn't the current Raft leader, transfer leadership to it (RaftHAServer.transferLeadership).
d. The source commits BOOTSTRAP_FINGERPRINT_ENTRY=(dbName, fingerprint, lastTxId). The committed entry deliberately does NOT carry oldestRetainedTxId; each follower already learned the source's value from the pre-bootstrap RPC and decides its own catch-up path locally.
e. Late joiners with strictly-newer lastTxId than the committed entry refuse to start with an actionable SEVERE log (deployment error: cluster bootstrapped from older state, this peer's data would be lost).
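Steps b and e reduce to two small decisions, sketched here. The type and function names are hypothetical, and the tie-break on peer name is an assumption: the issue only says the highest `lastTxId` wins.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PeerState:
    fingerprint: str
    last_tx_id: int
    oldest_retained_tx_id: int


def pick_source(reports: Dict[str, PeerState]) -> str:
    """Step b: the peer with the highest lastTxId becomes the source."""
    if not reports:
        raise ValueError("no peer reported its state")
    # Assumption: ties are broken by peer name so the choice is deterministic.
    # max() keeps the first maximal element, i.e. the smallest name on a tie.
    return max(sorted(reports), key=lambda name: reports[name].last_tx_id)


def late_joiner_must_refuse(my_last_tx_id: int, committed_last_tx_id: int) -> bool:
    """Step e: a peer strictly newer than the committed entry must not start."""
    return my_last_tx_id > committed_last_tx_id
```

With reports A=X, B=X-1, C=X+1, `pick_source` returns C, so leadership would be transferred to C before the bootstrap entry is committed.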
4. Per-follower catch-up decision (the delta path)
After the bootstrap entry commits, each follower decides locally:
    if myFingerprint == sourceFingerprint:
        → bootstrap locally (zero bytes)
    elif myLastTxId == sourceLastTxId:
        → full snapshot (same LSN, different content = divergent histories, must replace)
    elif myLastTxId >= sourceOldestRetainedTxId - 1
         and sourceLastTxId - myLastTxId <= bootstrapDeltaThreshold:
        → delta resync via GET /api/v1/ha/delta/{db}?fromTxId=myLastTxId
    else:
        → existing leader-shipped snapshot
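The same decision written as a runnable function, for reference. The names are hypothetical, and the `0 < gap` guard is an addition that is not in the pseudocode; it only matters if a follower were somehow ahead of the source.

```python
from enum import Enum


class CatchUp(Enum):
    LOCAL = "local"
    DELTA = "delta"
    FULL_SNAPSHOT = "leader_snapshot"


def choose_catch_up(my_fp: str, my_last: int, src_fp: str, src_last: int,
                    src_oldest_retained: int, delta_threshold: int) -> CatchUp:
    if my_fp == src_fp:
        return CatchUp.LOCAL            # identical files: zero bytes transferred
    if my_last == src_last:
        return CatchUp.FULL_SNAPSHOT    # same LSN, different content: must replace
    gap = src_last - my_last
    # The source can serve transactions (my_last + 1 .. src_last) only if it
    # still retains my_last + 1.
    wal_available = my_last >= src_oldest_retained - 1
    if wal_available and 0 < gap <= delta_threshold:
        return CatchUp.DELTA
    return CatchUp.FULL_SNAPSHOT
```

With a source at `X+1` that retains WAL from `X-49999` and a threshold of 100k, followers at `X` and `X-1` both get `DELTA`. A source that retains no WAL (`oldestRetainedTxId = lastTxId + 1`) sends every mismatched follower to `FULL_SNAPSHOT`.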
5. WAL retention for delta serving
TransactionManager gains a "retained" mode (gated on arcadedb.ha.bootstrapFromLocalDatabase=true) where WAL files are not eagerly purged after their pages flush. Retention is bounded by:
- Number of transactions retained: `>= bootstrapDeltaThreshold`.
- Total bytes retained: `<= walArchiveSizeMax`.
- Older files are dropped FIFO past either cap.
This is the only ArcadeDB-internal behaviour change; the rest of the WAL machinery is unchanged.
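One possible reading of the two retention bounds is sketched below; it is not the `TransactionManager` implementation. It assumes the oldest WAL file is dropped either when the byte cap is exceeded, or when the remaining files would still cover at least the transaction threshold. `WalSegment` and `WalArchive` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WalSegment:
    first_tx_id: int
    last_tx_id: int
    size_bytes: int

    @property
    def tx_count(self) -> int:
        return self.last_tx_id - self.first_tx_id + 1


class WalArchive:
    def __init__(self, min_transactions: int, max_bytes: int) -> None:
        self.min_transactions = min_transactions  # bootstrapDeltaThreshold
        self.max_bytes = max_bytes                # walArchiveSizeMax, in bytes
        self.segments = deque()

    def append(self, segment: WalSegment) -> None:
        self.segments.append(segment)
        self._trim()

    def _trim(self) -> None:
        # Drop oldest-first (FIFO) while either rule says the oldest file can go.
        while self.segments:
            oldest = self.segments[0]
            total_bytes = sum(s.size_bytes for s in self.segments)
            total_tx = sum(s.tx_count for s in self.segments)
            over_bytes = total_bytes > self.max_bytes
            surplus_tx = total_tx - oldest.tx_count >= self.min_transactions
            if not (over_bytes or surplus_tx):
                break
            self.segments.popleft()

    def oldest_retained_tx_id(self, last_tx_id: int) -> int:
        # lastTxId + 1 signals that no WAL is retained.
        return self.segments[0].first_tx_id if self.segments else last_tx_id + 1
```

Under this reading the byte cap always wins, so a busy leader with large transactions may retain fewer than `bootstrapDeltaThreshold` transactions and serve correspondingly smaller deltas.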
6. Delta endpoint: GET /api/v1/ha/delta/{database}?fromTxId=N
Streams the WAL transactions for (N+1 .. currentLastTxId), framed and compressed. The follower's installer applies them via the existing TransactionManager.applyChanges path; no new application logic. Error semantics:
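- `412 Precondition Failed` if `N + 1 < oldestRetainedTxId` (gap too big; follower must use full snapshot). This is the same boundary as the `myLastTxId >= sourceOldestRetainedTxId - 1` check in section 4.
- `204 No Content` if `N >= currentLastTxId` (nothing to ship).
- `404` if HA or the target database is not present on this peer.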
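The error semantics can be captured in a few lines. This sketch reads the 412 boundary as `N + 1 < oldestRetainedTxId`, which matches the catch-up rule in section 4 (`myLastTxId >= sourceOldestRetainedTxId - 1`). The `200` success code and the function name are assumptions, since the issue does not state them.

```python
def delta_status(from_tx_id: int, current_last_tx_id: int,
                 oldest_retained_tx_id: int, database_present: bool = True) -> int:
    """Return the HTTP status the delta endpoint would answer with."""
    if not database_present:
        return 404  # HA or the target database is not present on this peer
    if from_tx_id >= current_last_tx_id:
        return 204  # nothing to ship
    if from_tx_id + 1 < oldest_retained_tx_id:
        return 412  # transaction N+1 is no longer retained: use a full snapshot
    return 200      # assumption: stream transactions N+1 .. currentLastTxId
```

Under this reading a peer that retains no WAL (`oldestRetainedTxId = currentLastTxId + 1`) answers 204 to a follower that is already current and 412 to everyone else.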
7. SnapshotInstaller becomes "try delta, fall back to full"
The existing runtime catch-up path is enhanced: if the gap is below bootstrapDeltaThreshold and the leader's oldestRetainedTxId <= followerLastTxId + 1, attempt the delta endpoint first. On 412 or any other failure, fall back to the existing full-snapshot install. Behaviour for catch-up beyond the threshold is unchanged.
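A minimal sketch of the "try delta, fall back to full" control flow. The two callables are hypothetical stand-ins for the HTTP delta request and the existing snapshot install; `fetch_delta` is assumed to return the HTTP status after applying whatever it received. The threshold comparison is inclusive, as in section 4.

```python
from typing import Callable


def catch_up(follower_last: int, leader_last: int, leader_oldest_retained: int,
             delta_threshold: int,
             fetch_delta: Callable[[int], int],
             install_full_snapshot: Callable[[], None]) -> str:
    gap = leader_last - follower_last
    eligible = (0 < gap <= delta_threshold
                and leader_oldest_retained <= follower_last + 1)
    if eligible:
        try:
            if fetch_delta(follower_last) in (200, 204):
                return "delta"
        except Exception:
            # "On 412 or any other failure": deliberately broad, fall through.
            pass
    install_full_snapshot()
    return "leader_snapshot"
```

A delta failure costs one extra request before the full install, and catch-up beyond the threshold goes straight to the full snapshot, as it does today.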
8. Gating: bootstrap path only at first cluster formation
The bootstrap path engages only when every peer's Raft log is empty (first formation). After even one entry is committed via Raft, bootstrapFromLocalDatabase=true is a no-op for subsequent restarts; late joiners go through the existing RECOVER / leader-shipped logic. This is what makes a default of true safe.
9. Why we do NOT re-check the fingerprint after bootstrap
After bootstrap, Raft owns logical consistency: every transaction goes through the log, only the leader proposes, every follower applies the same entries deterministically. Byte-level file content has legitimate non-determinism (page allocation, compaction timing, dictionary key ordering) so two replicas with identical logical state can have byte-different files. A fingerprint computed over raw file bytes would falsely report mismatches at runtime. The bootstrap path's gating on "empty Raft log" sidesteps this — we never re-check post-bootstrap.
10. Studio + status surfaces
Per-database fields in the cluster status JSON: bootstrapMode: "local" | "delta" | "leader_snapshot" | null, bootstrapLastTxId, oldestRetainedTxId. Status table column: BOOTSTRAP=local(X+1) / delta(X+1<-X) / leader-snapshot(X+1). Studio shows the same as a colored badge.
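The status-table notation can be illustrated with a small formatter. This is a sketch under assumptions: the function name is hypothetical, the value after `<-` is taken to be the follower's `lastTxId` before the delta was applied, and rendering a `null` mode as `-` is a guess.

```python
from typing import Optional


def bootstrap_column(mode: Optional[str], bootstrap_last_tx_id: int,
                     previous_last_tx_id: Optional[int] = None) -> str:
    """Render the BOOTSTRAP column from the per-database JSON fields."""
    if mode is None:
        return "-"  # assumption: the issue does not say how null is displayed
    if mode == "local":
        return f"local({bootstrap_last_tx_id})"
    if mode == "delta":
        return f"delta({bootstrap_last_tx_id}<-{previous_last_tx_id})"
    if mode == "leader_snapshot":
        return f"leader-snapshot({bootstrap_last_tx_id})"
    raise ValueError(f"unknown bootstrapMode: {mode!r}")
```

For example, a follower that replayed one transaction to reach 1001 would show `delta(1001<-1000)`.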
Worked example: A=X, B=X-1, C=X+1, threshold=100k
| Step | What happens |
|---|---|
| 1 | All three pods start, every Raft log empty. |
| 2 | First election picks A as Raft leader. |
| 3 | A sends bootstrap-state RPC. Replies: A=(fpA, X, X-50k), B=(fpB, X-1, X-50001), C=(fpC, X+1, X-49999). |
| 4 | A picks highest lastTxId = X+1 → source is C. |
| 5 | A calls transferLeadership(C). C becomes Raft leader. |
| 6 | C commits BOOTSTRAP_FINGERPRINT_ENTRY=(fpC, X+1). |
| 7 | C: own fingerprint matches → bootstrap locally. |
| 8 | A: gap = 1, C's oldestRetainedTxId = X-49999 ≤ X+1 → delta resync. Hits GET /api/v1/ha/delta/heimdall?fromTxId=X on C, applies one transaction, done. |
| 9 | B: gap = 2, same path → delta resync, two transactions. |
In the customer's actual case (1+GB pre-staged, all backups identical) every peer would hit the "fingerprint matches → bootstrap locally" branch and the delta path is never exercised at startup. Delta still helps later: a pod restarted into a running cluster with lastTxId 5k entries behind will pull only those 5k entries instead of a full snapshot.
Behavior matrix at first formation
With arcadedb.ha.bootstrapFromLocalDatabase=true (the default):
| Local DBs across pods | Result |
|---|---|
| All pods empty | Empty cluster, normal bootstrap. Identical to today. |
| All pods identical | Fast offline bootstrap; the win. |
| Different ages, gaps within threshold, source has WAL retained | Highest lastTxId peer wins, leadership transfers, others delta-replay. Minutes-not-hours. |
| Different ages, gaps over threshold | Highest lastTxId peer wins, others fall back to full snapshot (same as today). |
| Same lastTxId, divergent fingerprints | One peer wins, others fall back to full snapshot. WARNING about divergent staging. |
| Some pods unreachable at bootstrap | Wait up to bootstrapTimeoutMs, then proceed with majority + SEVERE log. |
| Cluster restart with stale local data on a node | Bootstrap path NOT engaged (Raft log non-empty). Late-joiner path = delta if gap small, full snapshot otherwise. |
| Cluster restart of healthy nodes | Bootstrap path NOT engaged. No fingerprint re-check. |
Implementation phases (status)
Phase 1 — foundations (config + BootstrapFingerprint + TransactionManager.getLastTransactionId() with disk persistence in last-tx-id.bin). Shipped in commit bbc64d813.
Phase 2 — BOOTSTRAP_FINGERPRINT_ENTRY Raft log entry type + codec. Shipped in commit bbc64d813.
Phase 3 — pre-bootstrap RPC POST /api/v1/cluster/bootstrap-state. Returns per-database (fingerprint, lastTxId, oldestRetainedTxId). (Next.)
Phase 4 — bootstrap election protocol in RaftHAServer: empty-log gating, peer-state collection, source pick, leadership transfer, commit.
Phase 5 — ArcadeStateMachine.applyTransaction handling for BOOTSTRAP_FINGERPRINT_ENTRY; SnapshotInstaller decides per-follower path; late-newer-joiner refusal.
Phase 6 — delta resync end-to-end:
- 6a: WAL retention with bounded archive (`TransactionManager`).
- 6b: `GET /api/v1/ha/delta/{database}` endpoint.
- 6c: `SnapshotInstaller` "try delta, fall back to full".
- 6d: Tests (delta-applied state matches source state; gap over threshold falls back; gap under threshold but no WAL falls back; runtime catch-up uses delta).
Phase 7 — Studio + cluster status surfaces (bootstrapMode, oldestRetainedTxId per database in JSON; ASCII table column; Studio badge).
Phase 8 — Integration tests:
- `RaftBootstrapFromLocalDatabaseIT`: 3 nodes, identical pre-seed, no HTTP transfer.
- `RaftBootstrapPicksHighestLastTxIdIT`: A=X-1, B=X, C=X+1; verify C becomes source.
- `RaftBootstrapLeadershipTransferIT`: leader transfer happens before commit.
- `RaftBootstrapFingerprintMismatchSameLsnIT`: same `lastTxId`, divergent fingerprints; full snapshot path with WARNING.
- `RaftBootstrapDeltaResyncIT`: gap = 5k transactions, threshold 100k; verify delta endpoint is hit, full snapshot is not.
- `RaftBootstrapDeltaFallbackToFullIT`: gap = 200k > threshold; verify fall back to full snapshot.
- `RaftBootstrapTimeoutFallbackIT`: one peer unreachable; verify proceed after timeout with SEVERE log.
- `RaftBootstrapDoesNotEngageOnRestartIT`: existing cluster restarted; bootstrap path skipped.
- `RaftBootstrapLateNewerJoinerIT`: peer with strictly-newer state joins after commit; refuses to start.
- `RuntimeDeltaCatchupIT`: a running follower disconnected briefly catches up via delta, not full.
Out of scope
- Auto-detecting that the operator has pre-seeded data without setting the flag — we default it on and rely on gating + fingerprint + recency.
- Pre-seeding the Raft log itself (only the database state).
- Cross-version bootstraps (file format must match).
- Reconciling divergent transactions across pods (split-brain recovery is a separate concern).
- Continuous post-bootstrap consistency checks (would need a logical fingerprint, separate feature).
Acceptance criteria
- 3-node cluster with 1 GB pre-seeded database reaches `STATUS=HEALTHY` on every follower in `< 10 s` (no HTTP snapshot transfer).
- Different-age peers within `bootstrapDeltaThreshold` converge via delta resync, not full snapshot.
- Different-age peers beyond the threshold fall back to the existing full-snapshot path.
- Late joiner with strictly-newer state refuses to start with an actionable error.
- Operator-visible error when a peer is permanently unreachable at bootstrap.
- No behavior change on cluster restart with stale data.
- No behavior change on cluster restart of healthy nodes.
- Runtime catch-up: a follower with a small gap pulls only the delta from the leader, not a full snapshot.
- Documented in `docs/` with a Kubernetes recipe.
- All existing HA tests pass; new ITs above pass.
Why now
Reported by a customer scaling a single-pod ArcadeDB deployment to 3 pods after a 1+ GB import. Their snapshot transfer competes with ongoing writes and triggers leader churn (#4083). Offline bootstrap removes the runtime burden on the leader entirely; delta resync also benefits the existing runtime catch-up path.
Related
- #4083 (Multiple warning over lost indexes on massive insert with gRPC) - HA leader-churn resilience and replica health visibility (just shipped).
- #4144 - HA: ClassCastException 'RaftReplicatedDatabase cannot be cast to LocalDatabase' on leader during import / read-only property write (just shipped).
- #3893 - HA: stale follower recovery when snapshot download fails on a quiet cluster.
- #3764 - Support for HTTP restore database command.
- #3730 - feat: Raft-based High Availability using Apache Ratis.