feat(HA): offline cluster bootstrap from a pre-seeded database (snapshot-and-restore)

Problem

Bringing up a multi-node Raft HA cluster on top of a previously-imported 1+ GB database currently forces every new replica to receive the entire database over HTTP from the leader, in real time, while the leader is also serving writes. The operator-visible symptoms:

This is a real pain point reported by users running ArcadeDB on Kubernetes who want to scale a single-pod deployment up to 3 pods AFTER importing data.

What exists today

  1. Leader-served full snapshot install: SnapshotHttpHandler at /api/v1/ha/snapshot/{database}, triggered automatically when a follower lags past arcadedb.ha.snapshotThreshold. Always full database, regardless of how small the gap is.
  2. Replicated restore database on the leader: PostServerCommandHandler.replicateRestoredDatabase calls RaftReplicatedDatabase.createInReplicas(true) and submits an install-database Raft entry with forceSnapshot=true. Each replica then pulls the full snapshot from the leader.

Both paths funnel the full database through the leader at runtime.

Proposed feature

A two-part HA improvement, designed and shipped together because they share wire format and code paths.

Part A - offline cluster bootstrap

Allow the operator to pre-seed each pod's filesystem with identical database files BEFORE the cluster forms, and have the Raft cluster recognise this as a valid baseline. After bootstrap, normal Raft replication continues on top of the pre-seeded state.

Part B - delta resync

When a peer is close to the leader's state (lastTxId gap below a configurable threshold and the leader has retained the relevant WAL), ship only the WAL delta instead of the full database. Otherwise, fall back to the existing full-snapshot path. Used in two places: the per-follower catch-up decision at first bootstrap (Design 4) and the existing runtime catch-up in SnapshotInstaller (Design 7).

Operator workflow (unchanged from earlier draft)

  1. Import the dataset into a single ArcadeDB instance (no HA).
  2. Take a tar of the database directory (or use a regular full backup file).
  3. Distribute it out-of-band to every pod's filesystem (init container from S3, NFS read-only mount, baked image layer, kubectl cp...).
  4. Start every pod with HA enabled.
  5. Pods form the Raft group with everyone already at the same state. No bytes flow through the leader at startup.

Time-to-cluster goes from "minutes-to-hours of HTTP transfer × N replicas" to "seconds, because everyone already has the bytes".

Design

1. Config flags

2. Database state attestation: (fingerprint, lastTxId, oldestRetainedTxId)

Each peer reports a per-database tuple at bootstrap time:

3. Bootstrap protocol (unchanged from prior version)

a. The initially elected Raft leader waits for every peer in HA_SERVER_LIST to report (or until bootstrapTimeoutMs). RPC: POST /api/v1/cluster/bootstrap-state.
b. Pick the peer with the highest lastTxId as the source.
c. If the source isn't the current Raft leader, transfer leadership to it (RaftHAServer.transferLeadership).
d. The source commits BOOTSTRAP_FINGERPRINT_ENTRY=(dbName, fingerprint, lastTxId). The committed entry deliberately does NOT carry oldestRetainedTxId; each follower already learned the source's value from the pre-bootstrap RPC and decides its own catch-up path locally.
e. Late joiners with strictly-newer lastTxId than the committed entry refuse to start with an actionable SEVERE log (deployment error: cluster bootstrapped from older state, this peer's data would be lost).
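
Steps b and e can be sketched as follows. This is a minimal illustration, not the RaftHAServer implementation: the PeerState class, its field names, and both function names are hypothetical, and the tie-break on peer name is an assumption, since the protocol above does not say how equal lastTxId values are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerState:
    """Per-database state one peer reports before bootstrap (hypothetical shape)."""
    peer: str
    fingerprint: str
    last_tx_id: int
    oldest_retained_tx_id: int


def pick_source(reports: list[PeerState]) -> PeerState:
    """Step b: the peer with the highest lastTxId becomes the source."""
    if not reports:
        raise ValueError("no peer reported its bootstrap state")
    # Assumption: ties are broken on peer name so the choice is deterministic.
    return max(reports, key=lambda r: (r.last_tx_id, r.peer))


def late_joiner_may_start(my_last_tx_id: int, committed_last_tx_id: int) -> bool:
    """Step e: a peer strictly newer than the committed entry must refuse to start."""
    return my_last_tx_id <= committed_last_tx_id
```

A peer that reports the same lastTxId as the committed entry may still start; only a strictly newer one is refused, because accepting it would silently discard its extra transactions.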

4. Per-follower catch-up decision (the delta path)

After the bootstrap entry commits, each follower decides locally:

if myFingerprint == sourceFingerprint:
    → bootstrap locally (zero bytes)
elif myLastTxId == sourceLastTxId:
    → full snapshot (same lastTxId, different content = divergent histories, must replace)
elif myLastTxId >= sourceOldestRetainedTxId - 1
     and sourceLastTxId - myLastTxId <= bootstrapDeltaThreshold:
    → delta resync via GET /api/v1/ha/delta/{db}?fromTxId=myLastTxId
else:
    → full snapshot (existing leader-shipped path)
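
The same decision as a runnable sketch, useful for checking edge cases by hand. The enum, function name, and parameter names are illustrative only; the explicit rejection of a follower that is ahead of the source is an assumption added here, on the basis that the late-joiner refusal in the bootstrap protocol covers that case.

```python
from enum import Enum


class CatchUp(Enum):
    LOCAL = "local"
    DELTA = "delta"
    FULL_SNAPSHOT = "leader_snapshot"


def choose_catch_up(my_fp: str, my_last: int,
                    src_fp: str, src_last: int,
                    src_oldest_retained: int,
                    delta_threshold: int) -> CatchUp:
    """Decide how a follower catches up after the bootstrap entry commits."""
    if my_fp == src_fp:
        # Identical files: nothing to transfer.
        return CatchUp.LOCAL
    if my_last > src_last:
        # Assumption: not a catch-up case; the newer-late-joiner refusal applies.
        raise ValueError("follower is ahead of the bootstrap source")
    if my_last == src_last:
        # Same lastTxId but different content: divergent histories, must replace.
        return CatchUp.FULL_SNAPSHOT
    gap = src_last - my_last
    # The source must still hold every transaction from my_last + 1 onwards.
    wal_available = my_last >= src_oldest_retained - 1
    if wal_available and gap <= delta_threshold:
        return CatchUp.DELTA
    return CatchUp.FULL_SNAPSHOT
```

For example, with a source at lastTxId 1,000,001 that retains WAL from 950,001 and a threshold of 100,000, a follower at 1,000,000 takes the delta path, while a follower at 800,000 falls back to the full snapshot.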

5. WAL retention for delta serving

TransactionManager gains a "retained" mode (gated on arcadedb.ha.bootstrapFromLocalDatabase=true) where WAL files are not eagerly purged after their pages flush. Retention is bounded by:

This is the only ArcadeDB-internal behaviour change; the rest of the WAL machinery is unchanged.

6. Delta endpoint: GET /api/v1/ha/delta/{database}?fromTxId=N

Streams the WAL transactions for (N+1 .. currentLastTxId), framed and compressed. The follower's installer applies them via the existing TransactionManager.applyChanges path; no new application logic. Error semantics:
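
To make the range semantics concrete, here is a small sketch of how a server could work out which transactions to stream for a given fromTxId. The function name is hypothetical, and returning None to mean "cannot serve the delta" (which the caller would map to an error such as 412) is an assumption; the authoritative error rules are the ones listed for this endpoint.

```python
from typing import Optional


def delta_range(from_tx_id: int,
                oldest_retained_tx_id: int,
                current_last_tx_id: int) -> Optional[range]:
    """Transaction ids to stream for ?fromTxId=N, or None if they cannot be served."""
    first = from_tx_id + 1
    if first < oldest_retained_tx_id:
        # The WAL for the start of the range has already been purged.
        return None
    if from_tx_id > current_last_tx_id:
        # The caller claims a state newer than this server's.
        return None
    # Inclusive N+1 .. currentLastTxId; empty when the caller is already up to date.
    return range(first, current_last_tx_id + 1)
```

A follower at transaction 1,000,000 asking a server at 1,000,001 receives exactly one transaction id, 1,000,001.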

7. SnapshotInstaller becomes "try delta, fall back to full"

The existing runtime catch-up path is enhanced: if the gap is at most bootstrapDeltaThreshold and the leader's oldestRetainedTxId <= followerLastTxId + 1, the installer attempts the delta endpoint first. On a 412 response or any other failure, it falls back to the existing full-snapshot install. Behaviour for catch-up beyond the threshold is unchanged.
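
A minimal sketch of the "try delta, fall back to full" control flow. The two transfer steps are passed in as callables so the example stays self-contained; install_catch_up and its parameter names are hypothetical and do not reflect the actual SnapshotInstaller API.

```python
from typing import Callable


def install_catch_up(follower_last_tx_id: int,
                     leader_last_tx_id: int,
                     leader_oldest_retained_tx_id: int,
                     delta_threshold: int,
                     fetch_delta: Callable[[int], None],
                     fetch_full_snapshot: Callable[[], None]) -> str:
    """Return the path that was actually used: 'delta' or 'leader_snapshot'."""
    gap = leader_last_tx_id - follower_last_tx_id
    eligible = (gap <= delta_threshold
                and leader_oldest_retained_tx_id <= follower_last_tx_id + 1)
    if eligible:
        try:
            # Ask only for the transactions after the follower's last one.
            fetch_delta(follower_last_tx_id)
            return "delta"
        except Exception:
            # A 412 or any other failure: fall through to the full snapshot.
            pass
    fetch_full_snapshot()
    return "leader_snapshot"
```

Treating every delta failure the same way keeps the full snapshot as the single, already-tested recovery path.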

8. Gating: bootstrap path only at first cluster formation

The bootstrap path engages only when every peer's Raft log is empty (first formation). Once even one entry has been committed via Raft, the bootstrap flag's default=true setting is a no-op on subsequent restarts; late joiners go through the existing RECOVER / leader-shipped logic. This is what makes default=true safe.
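
The gating rule reduces to a single predicate over the peers' Raft log sizes. The sketch below assumes the log sizes have already been collected into a mapping of peer name to entry count; the function name and that mapping are illustrative only.

```python
def bootstrap_path_engaged(raft_log_entry_counts: dict[str, int]) -> bool:
    """True only at first formation, when every known peer has an empty Raft log."""
    if not raft_log_entry_counts:
        # No peers reported: there is nothing to bootstrap from.
        return False
    return all(count == 0 for count in raft_log_entry_counts.values())
```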

9. Why we do NOT re-check the fingerprint after bootstrap

After bootstrap, Raft owns logical consistency: every transaction goes through the log, only the leader proposes, every follower applies the same entries deterministically. Byte-level file content has legitimate non-determinism (page allocation, compaction timing, dictionary key ordering) so two replicas with identical logical state can have byte-different files. A fingerprint computed over raw file bytes would falsely report mismatches at runtime. The bootstrap path's gating on "empty Raft log" sidesteps this — we never re-check post-bootstrap.
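
The effect is easy to reproduce in miniature. In the sketch below, JSON serialisation and SHA-256 are stand-ins chosen purely for illustration (the proposal does not specify how the fingerprint is computed): two replicas hold the same logical content, but key ordering differs on disk, so a byte-level fingerprint disagrees.

```python
import hashlib
import json


def byte_fingerprint(raw: bytes) -> str:
    """Hash of raw bytes, standing in for a fingerprint over database files."""
    return hashlib.sha256(raw).hexdigest()


# Same logical content, inserted in a different order on each replica.
replica_a = {"alice": 1, "bob": 2}
replica_b = {"bob": 2, "alice": 1}

# json.dumps keeps insertion order, so the serialised bytes differ.
bytes_a = json.dumps(replica_a).encode()
bytes_b = json.dumps(replica_b).encode()

same_logical_state = replica_a == replica_b
same_bytes = byte_fingerprint(bytes_a) == byte_fingerprint(bytes_b)

print(same_logical_state, same_bytes)
```

A runtime check built on the byte fingerprint would flag these two replicas as divergent even though they are logically identical.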

10. Studio + status surfaces

Per-database fields in the cluster status JSON: bootstrapMode: "local" | "delta" | "leader_snapshot" | null, bootstrapLastTxId, oldestRetainedTxId. Status table column: BOOTSTRAP=local(X+1) / delta(X+1<-X) / leader-snapshot(X+1). Studio shows the same as a colored badge.
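
One possible way to render the status-table cell from the JSON fields is sketched below. The function and its from_tx_id parameter are hypothetical: the follower's pre-delta lastTxId is not among the listed JSON fields, and rendering a null mode as an empty cell is an assumption.

```python
from typing import Optional


def bootstrap_column(mode: Optional[str],
                     bootstrap_last_tx_id: Optional[int],
                     from_tx_id: Optional[int] = None) -> str:
    """Format the BOOTSTRAP cell of the cluster status table."""
    if mode is None:
        # Assumption: databases that never went through bootstrap show nothing.
        return ""
    if mode == "local":
        return f"local({bootstrap_last_tx_id})"
    if mode == "delta":
        if from_tx_id is None:
            raise ValueError("delta mode needs the follower's starting lastTxId")
        return f"delta({bootstrap_last_tx_id}<-{from_tx_id})"
    if mode == "leader_snapshot":
        return f"leader-snapshot({bootstrap_last_tx_id})"
    raise ValueError(f"unknown bootstrapMode: {mode!r}")
```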

Worked example: A=X, B=X-1, C=X+1, threshold=100k

| Step | What happens |
|---|---|
| 1 | All three pods start, every Raft log empty. |
| 2 | First election picks A as Raft leader. |
| 3 | A sends the bootstrap-state RPC. Replies: A=(fpA, X, X-50k), B=(fpB, X-1, X-50001), C=(fpC, X+1, X-49999). |
| 4 | A picks highest lastTxId = X+1 → source is C. |
| 5 | A calls transferLeadership(C). C becomes Raft leader. |
| 6 | C commits BOOTSTRAP_FINGERPRINT_ENTRY=(heimdall, fpC, X+1). |
| 7 | C: own fingerprint matches → bootstrap locally. |
| 8 | A: gap = 1, C's oldestRetainedTxId = X-49999 ≤ X+1 → delta resync. Hits GET /api/v1/ha/delta/heimdall?fromTxId=X on C, applies one transaction, done. |
| 9 | B: gap = 2, C's oldestRetainedTxId = X-49999 ≤ X → delta resync via fromTxId=X-1, two transactions. |

In the customer's actual case (1+ GB pre-staged, all backups identical) every peer hits the "fingerprint matches → bootstrap locally" branch, so the delta path is never exercised at startup. Delta still helps later: a pod restarted into a running cluster with a lastTxId 5k entries behind pulls only those 5k entries instead of a full snapshot.

Behavior matrix at first formation

With default=true:

| Local DBs across pods | Result |
|---|---|
| All pods empty | Empty cluster, normal bootstrap. Identical to today. |
| All pods identical | Fast offline bootstrap; the win. |
| Different ages, gaps within threshold, source has WAL retained | Highest lastTxId peer wins, leadership transfers, others delta-replay. Minutes-not-hours. |
| Different ages, gaps over threshold | Highest lastTxId peer wins, others fall back to full snapshot (same as today). |
| Same lastTxId, divergent fingerprints | One peer wins, others fall back to full snapshot. WARNING about divergent staging. |
| Some pods unreachable at bootstrap | Wait up to bootstrapTimeoutMs, then proceed with majority + SEVERE log. |
| Cluster restart with stale local data on a node | Bootstrap path NOT engaged (Raft log non-empty). Late-joiner path = delta if gap small, full snapshot otherwise. |
| Cluster restart of healthy nodes | Bootstrap path NOT engaged. No fingerprint re-check. |

Implementation phases (status)

Phase 1 — foundations (config + BootstrapFingerprint + TransactionManager.getLastTransactionId() with disk persistence in last-tx-id.bin). Shipped in commit bbc64d813.

Phase 2 — BOOTSTRAP_FINGERPRINT_ENTRY Raft log entry type + codec. Shipped in commit bbc64d813.

Phase 3 — pre-bootstrap RPC POST /api/v1/cluster/bootstrap-state. Returns per-database (fingerprint, lastTxId, oldestRetainedTxId). (Next.)

Phase 4 — bootstrap election protocol in RaftHAServer: empty-log gating, peer-state collection, source pick, leadership transfer, commit.

Phase 5 — ArcadeStateMachine.applyTransaction handling for BOOTSTRAP_FINGERPRINT_ENTRY; SnapshotInstaller decides per-follower path; late-newer-joiner refusal.

Phase 6 — delta resync end-to-end:

Phase 7 — Studio + cluster status surfaces (bootstrapMode, oldestRetainedTxId per database in JSON; ASCII table column; Studio badge).

Phase 8 — Integration tests:

Out of scope

Acceptance criteria

Why now

Reported by a customer scaling a single-pod ArcadeDB deployment to 3 pods after a 1+ GB import. Their snapshot transfer competes with ongoing writes and triggers leader churn (#4083). Offline bootstrap removes the runtime burden on the leader entirely; delta resync also benefits the existing runtime catch-up path.