8x DGX Spark Cluster Build Report: CRS812 + 400DD→4x100G Breakouts, Nemotron 3 Ultra at TP=8

Sharing real-world results from running 8 GB10 nodes (4x ASUS Ascent GX10 + 4x Lenovo ThinkStation PGX) on a single MikroTik CRS812.

Hardware

The CRS812 only has two 400G QSFP-DD ports, but the breakout approach lets one switch absorb the entire 8-node cluster at 100G per node.
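As a quick sanity check on the port math, here is a minimal sketch. The 4-way split per port comes from the 400DD→4x100G breakouts in the title; the function name and the assumption that each node uses exactly one 100G link are mine, purely for illustration.

```python
def breakout_links(ports: int, lanes_per_port: int, port_gbps: int) -> tuple[int, float]:
    """Return (number of breakout links, Gbit/s per link)."""
    return ports * lanes_per_port, port_gbps / lanes_per_port


# Two 400G QSFP-DD ports, each split four ways (assumed: one link per node).
links, gbps_each = breakout_links(ports=2, lanes_per_port=4, port_gbps=400)
print(links, gbps_each)  # 8 100.0
```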

Network: 100G vs 200G (Measured)

Same TP=4 inference workload (Qwen3.5 397B-A17B int4-AutoRound, vLLM 0.22 + no-ray + 30 GiB KV) measured under both link configurations.

Benchmark: 5 iterations × 4 prompt sizes {8K, 16K, 64K, 128K} × n=4 concurrency, max_tokens=500, thinking off, direct endpoint (no proxy).

Per-stream decode tps (single stream)

| Size | All 200G (tps) | All 100G (tps) | Δ |
|---|---|---|---|
| 8K | 25.21 | 24.78 | -1.7% |
| 16K | 25.78 | 25.48 | -1.2% |
| 64K | 25.08 | 24.64 | -1.8% |
| 128K | 23.48 | 24.20 | +3.1% (noise) |

Per-stream decode is essentially independent of link bandwidth (±3%). Qwen3.5-397B INT4 TP=4 decode is LPDDR5X UMA memory-bandwidth bound; NCCL all-reduce link bandwidth is not the bottleneck.
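For anyone who wants to rerun the comparison on their own numbers, the Δ column is a plain relative difference against the 200G figure. A minimal sketch using the per-stream decode values reported above (the helper name `pct_delta` is mine, not part of any benchmark tool):

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (200G tps, 100G tps) per prompt size, copied from the per-stream decode table.
decode_tps = {
    "8K": (25.21, 24.78),
    "16K": (25.78, 25.48),
    "64K": (25.08, 24.64),
    "128K": (23.48, 24.20),
}

deltas = {size: round(pct_delta(t200, t100), 1) for size, (t200, t100) in decode_tps.items()}
for size, delta in deltas.items():
    print(f"{size:>4}: {delta:+.1f}%")
```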

Aggregate throughput (n=4 concurrent, warm)

| Size | All 200G | All 100G | Δ |
|---|---|---|---|
| 8K | 68.7 tps | 53.6 tps | -22.0% |
| 16K | 78.6 tps | 64.1 tps | -18.5% |
| 64K | 77.0 tps | 73.1 tps | -5.0% |
| 128K | 80.8 tps | 80.6 tps | -0.2% |

Aggregate throughput drops ~20% at short contexts (8K–16K), but the gap nearly disappears at 64K–128K.

TTFT (warm, prefix-cache hit)

| Size | All 200G | All 100G | Δ |
|---|---|---|---|
| 8K | 2.02s | 4.15s | +106% |
| 16K | 1.69s | 3.12s | +85% |
| 64K | 1.03s | 1.60s | +56% |
| 128K | 0.86s | 1.11s | +29% |

The TTFT multiplier is larger at short contexts, but warm TTFT stays in the low single-digit seconds either way (roughly 1 to 4 s), so the absolute difference is small.
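To make the relative-versus-absolute point concrete, this small sketch prints both views of the warm TTFT numbers above. The values are the rounded figures from the table, so the ratios can differ by a point or so from the Δ column, which was presumably computed from unrounded measurements; the helper name is mine.

```python
def ttft_gap(t200_s: float, t100_s: float) -> tuple[float, float]:
    """Return (absolute gap in seconds, 100G/200G ratio)."""
    return t100_s - t200_s, t100_s / t200_s


# (200G TTFT, 100G TTFT) in seconds, copied from the warm TTFT table.
warm_ttft_s = {
    "8K": (2.02, 4.15),
    "16K": (1.69, 3.12),
    "64K": (1.03, 1.60),
    "128K": (0.86, 1.11),
}

for size, (t200, t100) in warm_ttft_s.items():
    gap, ratio = ttft_gap(t200, t100)
    print(f"{size:>4}: +{gap:.2f} s absolute, {ratio:.2f}x relative")
```

The largest absolute gap is at 8K (about 2.1 s), and it shrinks to about 0.25 s at 128K.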

Conclusion

From a production inference standpoint: decode tps is link-independent (memory bound), warm TTFT differences are small in absolute terms, and only cold prefill on large contexts is significantly affected. Collapsing all 8 nodes onto one switch at 100G is an acceptable trade-off for production.

⚠ Caveat: early in the project I hit a ConnectX-7 PCIe Power Throttle stuck state — after a cable hot-swap, inter-node bandwidth got stuck at ~13 Gbit/s until a host reboot. Worth checking if your inter-node bandwidth looks wrong.
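One way to catch this kind of stuck state early is to compare a measured inter-node rate against the nominal link speed. The sketch below assumes the measurement comes from `iperf3 -J` (JSON output), which is my choice for illustration and not necessarily the tool used for the numbers in this post. The 50% threshold and both function names are also assumptions. A single TCP stream often will not saturate a 100G link, so treat the threshold as a coarse alarm, not a pass/fail test.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gbit/s from `iperf3 -J` output."""
    report = json.loads(iperf3_json)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def looks_throttled(measured_gbps: float, nominal_gbps: float = 100.0,
                    min_fraction: float = 0.5) -> bool:
    """Flag a link whose measured rate is far below nominal (threshold is an assumption)."""
    return measured_gbps < nominal_gbps * min_fraction


# Hypothetical, heavily trimmed iperf3 report resembling the ~13 Gbit/s stuck state.
sample = '{"end": {"sum_received": {"bits_per_second": 13.0e9}}}'
rate = received_gbps(sample)
print(f"{rate:.1f} Gbit/s, throttled: {looks_throttled(rate)}")
```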

Inference: Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8

Engine: scitrera/dgx-spark-sglang:0.5.12, key settings:

--tp-size 8 --pp-size 1
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--mem-fraction-static 0.85
--attention-backend flashinfer
--moe-runner-backend flashinfer_cutlass
--max-mamba-cache-size 96
--cuda-graph-max-bs 8
--disable-piecewise-cuda-graph

NCCL_IB_HCA=rocep1s0f1   # specifying both RoCE lanes fails at startup (ibv_modify_qp error)
SGLANG_ENABLE_DEEP_GEMM=0   # --disable-deep-gemm flag not implemented in 0.5.12

Startup

n=1 baseline (no MTP, prefix cache enabled)

| Size | TTFT p50 | TPOT p50 | Decode p50 |
|---|---|---|---|
| 8K | 5.5 s | 73.9 ms | 13.5 tps |
| 16K | 6.1 s | 74.4 ms | 13.5 tps |
| 32K | 6.5 s | 74.3 ms | 13.5 tps |
| 64K | 6.2 s | 75.0 ms | 13.3 tps |
| 128K | 4.8 s | 76.0 ms | 13.2 tps |

⚠ Note on the flat TTFT: this is not true cold-prefill performance. The benchmark slices prompts of different sizes from the same source corpus, so larger prompts contain smaller ones as prefixes — SGLang’s radix cache hit range grows with prompt size, flattening TTFT (sometimes even making larger sizes faster).

True cold prefill (measured with the radix cache disabled, see the MTP section below) scales near-linearly: TTFT goes from 5.8 s at 8K to 46.7 s at 64K at a stable ~1,380 tok/s prefill throughput. NemotronH’s hybrid architecture (Mamba-dominant, sparse attention) keeps this roughly O(n) rather than a pure Transformer’s O(n²), but prefill cost still grows with prompt size.

Production takeaway: workloads that reuse a common system prompt (chatbots, agent loops) get seconds-range TTFT from the prefix cache; workloads with unique long prompts (RAG, batch) should budget against the 1,380 tok/s cold-prefill figure.
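For budgeting, a back-of-the-envelope estimate is just prompt tokens divided by prefill throughput. This sketch assumes "8K" means 8,000 tokens (that reading reproduces the measured cold TTFTs in the MTP section reasonably well, but it is my assumption) and that throughput stays flat at 1,380 tok/s. Anything beyond 64K is extrapolation, since the 128K cold run did not complete.

```python
def cold_ttft_seconds(prompt_tokens: int, prefill_tok_per_s: float = 1380.0) -> float:
    """Estimate cold-prefill TTFT assuming constant prefill throughput."""
    return prompt_tokens / prefill_tok_per_s


# Assumed token counts per label (K = 1,000 tokens).
for label, tokens in [("8K", 8_000), ("16K", 16_000), ("32K", 32_000), ("64K", 64_000)]:
    print(f"{label:>4}: ~{cold_ttft_seconds(tokens):.1f} s cold TTFT")
```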

MTP NEXTN (speculative decoding)

--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-num-draft-tokens 4:

| Size | TTFT (true cold, radix off) | Decode p50 | vs baseline |
|---|---|---|---|
| 8K | 5.8 s | 29.6 tps | 2.19× |
| 16K | 11.7 s | 29.4 tps | 2.19× |
| 32K | 23.1 s | 29.0 tps | 2.16× |
| 64K | 46.7 s | 29.0 tps | 2.17× |
| 128K | n/a (a GX10 node crashed mid-run ⚠) | | |

Consistent 2.16–2.19× decode speedup across 8K–64K, with TPOT cut by more than half (74 ms → 34 ms).
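TPOT and per-stream decode rate are reciprocals of each other, which is how the 74 ms → 34 ms figure lines up with the tps columns. A minimal sketch (function names are mine):

```python
def tpot_ms_to_tps(tpot_ms: float) -> float:
    """Convert time per output token (ms) to tokens per second."""
    return 1000.0 / tpot_ms


def tps_to_tpot_ms(tps: float) -> float:
    """Convert tokens per second to time per output token (ms)."""
    return 1000.0 / tps


baseline_tps = tpot_ms_to_tps(73.9)  # 8K baseline TPOT p50 from the n=1 table
mtp_tpot_ms = tps_to_tpot_ms(29.6)   # 8K decode p50 with MTP
print(f"baseline: {baseline_tps:.1f} tps, MTP: {mtp_tpot_ms:.1f} ms per token")
```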

Note: in SGLang 0.5.12, NemotronH + MTP cannot coexist with the radix cache (--disable-radix-cache required), so enabling MTP means losing prefix caching. Trade-off: long-form output / RAG / agentic → MTP on; short-response multi-turn chat with a shared system prompt → MTP off + prefix cache.
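The trade-off can be put in rough numbers by comparing end-to-end latency (TTFT plus output tokens divided by decode rate) for the two configurations. The sketch below plugs in the 64K figures from this post: 46.7 s cold TTFT at 29.0 tps with MTP, versus 6.2 s TTFT at 13.3 tps with the prefix cache. Note that the 6.2 s value reflects whatever cache hit rate the benchmark happened to get, so treat the result as illustrative only. For prompts with no reusable prefix, both configurations pay the cold prefill and MTP simply wins on decode. Function names are mine.

```python
def end_to_end_s(ttft_s: float, output_tokens: float, decode_tps: float) -> float:
    """Total request latency: time to first token plus decode time."""
    return ttft_s + output_tokens / decode_tps


def break_even_output_tokens(ttft_fast_s: float, fast_tps: float,
                             ttft_slow_s: float, slow_tps: float) -> float:
    """Output length above which the faster-decoding config wins end to end.

    Assumes fast_tps > slow_tps and ttft_fast_s > ttft_slow_s.
    """
    extra_ttft = ttft_fast_s - ttft_slow_s
    per_token_saving = 1.0 / slow_tps - 1.0 / fast_tps
    return extra_ttft / per_token_saving


# 64K prompt: MTP on (cold prefill) vs MTP off (prefix cache), numbers from this post.
n = break_even_output_tokens(46.7, 29.0, 6.2, 13.3)
print(f"MTP pays off beyond ~{n:.0f} output tokens at 64K")
```

With these inputs the break-even lands a little under 1,000 output tokens, which matches the rule of thumb above: short replies on a cached prompt favor the prefix cache, long generations favor MTP.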

Getting a 550B model to ~30 tps per stream with a 256K context window on an 8-node cluster built on an individual's budget exceeded my expectations.

Bonus: GX10 stability tips wanted

Quick ask: my 4 ASUS Ascent GX10 units frequently go down under sustained inference load (e.g., the 128K cold-prefill runs above). The 4 Lenovo ThinkStation PGX units — same GB10 SoC — run the identical workload without issues, so I suspect something GX10-specific.

Symptoms:

If anyone has tips for keeping the GX10 stable under this kind of workload — BIOS settings, power management, thermal control, driver options, anything — I’d really appreciate it.

Happy to share full recipes, NCCL flags, SGLang launch scripts, and breakout wiring diagrams if anyone wants to replicate this setup.