8x DGX Spark Cluster Build Report: CRS812 + 400DD→4x100G Breakouts, Nemotron 3 Ultra at TP=8

Sharing real-world results from running 8 GB10 nodes (4x ASUS Ascent GX10 + 4x Lenovo ThinkStation PGX) on a single MikroTik CRS812.

✅ A single CRS812-8DS-2DQ-2DDQ with two 400DD→4x100G breakout cables can host an 8-node DGX Spark cluster
✅ 200G vs 100G: per-stream decode speed is essentially unchanged (TTFT increases slightly when warm, noticeably on cold prefill)
🚀 Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8 was faster than expected
Bonus question at the end: looking for GX10 stability tips

Hardware

Compute: 8x GB10 (SoC sm_121a, 128GB UMA each, ~1TB total)
- ASUS Ascent GX10 × 4
- Lenovo ThinkStation PGX × 4
Switch: MikroTik CRS812-8DS-2DQ-2DDQ (RouterOS 7.23)
Cabling: 2x 400DD→4x100G breakout cables (switch side: 1x 400G QSFP-DD → node side: 4x 100G QSFP28). Two cables × 4 nodes = all 8 nodes at 100G from just the two QSFP-DD ports
Driver: NVIDIA 580.159 on all 8 nodes (apt-mark hold)
OS: Ubuntu 24.04 LTS / DGX OS, kernel 6.17.0-1021-nvidia

The CRS812 only has two 400G QSFP-DD ports, but the breakout approach lets one switch absorb the entire 8-node cluster at 100G per node.

Network: 100G vs 200G (Measured)

Same TP=4 inference workload (Qwen3.5 397B-A17B int4-AutoRound, vLLM 0.22 + no-ray + 30 GiB KV) measured under both link configurations.

Benchmark: 5 iterations × 4 prompt sizes {8K, 16K, 64K, 128K} × n=4 concurrency, max_tokens=500, thinking off, direct endpoint (no proxy).

Per-stream decode tps (single stream)

Size	All 200G	All 100G	Δ
8K	25.21	24.78	-1.7%
16K	25.78	25.48	-1.2%
64K	25.08	24.64	-1.8%
128K	23.48	24.20	+3.1% (noise)

Per-stream decode is essentially independent of link bandwidth (±3%). Qwen3.5-397B INT4 TP=4 decode is LPDDR5X UMA memory-bandwidth bound; NCCL all-reduce link bandwidth is not the bottleneck.
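The Δ column above can be recomputed from the two measured columns. A minimal sketch, assuming Δ is the relative change of the 100G figure against the 200G figure (the function and variable names are illustrative, not from the benchmark harness):

```python
# Per-stream decode tps copied from the table: size -> (all 200G, all 100G)
decode_tps = {
    "8K": (25.21, 24.78),
    "16K": (25.78, 25.48),
    "64K": (25.08, 24.64),
    "128K": (23.48, 24.20),
}


def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Percent change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


deltas = {
    size: round(relative_delta_pct(g200, g100), 1)
    for size, (g200, g100) in decode_tps.items()
}

for size, delta in deltas.items():
    print(f"{size:>4}: {delta:+.1f}%")

# Largest swing in either direction, which is what the "±3%" summary refers to.
worst = max(abs(d) for d in deltas.values())
print(f"largest swing: {worst:.1f}%")
```

The same helper works for the aggregate-throughput and TTFT tables, though values recomputed from rounded table entries can differ from the posted Δ in the last digit.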

Aggregate throughput (n=4 concurrent, warm)

Size	All 200G	All 100G	Δ
8K	68.7 tps	53.6 tps	-22.0%
16K	78.6 tps	64.1 tps	-18.5%
64K	77.0 tps	73.1 tps	-5.0%
128K	80.8 tps	80.6 tps	-0.2%

Aggregate throughput drops ~20% at short contexts (8K–16K), but the gap nearly disappears at 64K–128K.

TTFT (warm, prefix-cache hit)

Size	All 200G	All 100G	Δ
8K	2.02s	4.15s	+106%
16K	1.69s	3.12s	+85%
64K	1.03s	1.60s	+56%
128K	0.86s	1.11s	+29%

The TTFT multiplier is larger at short contexts, but warm TTFT stays in the low single-digit seconds either way (roughly 1–4 s), and the absolute gap is about 2 s at worst.

Conclusion

From a production inference standpoint: decode tps is link-independent (memory bound), aggregate throughput at n=4 drops about 20% only at short contexts, warm TTFT differences are small in absolute terms, and only cold prefill on large contexts is significantly affected. Collapsing all 8 nodes onto one switch at 100G is an acceptable trade-off.

⚠ Caveat: early in the project I hit a ConnectX-7 PCIe Power Throttle stuck state — after a cable hot-swap, inter-node bandwidth got stuck at ~13 Gbit/s until a host reboot. Worth checking if your inter-node bandwidth looks wrong.
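One way to automate that check is to compare a measured inter-node throughput figure against the nominal link rate. The sketch below assumes the measurement comes from `iperf3 --json` (the post does not say which tool was used) and reads the receiver-side `end.sum_received.bits_per_second` field. The 50% threshold is an arbitrary illustrative choice, and a single TCP stream may not saturate a 100G link, so treat it as a coarse sanity check rather than a line-rate test.

```python
import json


def measured_gbps(iperf3_report: dict) -> float:
    """Receiver-side throughput in Gbit/s from a parsed `iperf3 --json` report."""
    return iperf3_report["end"]["sum_received"]["bits_per_second"] / 1e9


def link_looks_throttled(gbps: float, nominal_gbps: float = 100.0,
                         min_fraction: float = 0.5) -> bool:
    """True when throughput is below min_fraction of the nominal link rate."""
    return gbps < nominal_gbps * min_fraction


# Hand-written stand-in for a real report, using the ~13 Gbit/s stuck state.
sample = json.loads('{"end": {"sum_received": {"bits_per_second": 13.0e9}}}')

gbps = measured_gbps(sample)
status = "suspicious" if link_looks_throttled(gbps) else "ok"
print(f"{gbps:.1f} Gbit/s -> {status}")
```

Running this across node pairs after any cable change would flag the stuck state before it shows up as a slow NCCL all-reduce.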

Inference: Nemotron 3 Ultra 550B-A55B NVFP4 at TP=8

Engine: scitrera/dgx-spark-sglang:0.5.12, key settings:

--tp-size 8 --pp-size 1
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--mem-fraction-static 0.85
--attention-backend flashinfer
--moe-runner-backend flashinfer_cutlass
--max-mamba-cache-size 96
--cuda-graph-max-bs 8
--disable-piecewise-cuda-graph

NCCL_IB_HCA=rocep1s0f1   # specifying both RoCE lanes fails at startup (ibv_modify_qp error)
SGLANG_ENABLE_DEEP_GEMM=0   # --disable-deep-gemm flag not implemented in 0.5.12
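For readers who want to script the launch, the settings above can be assembled into one command per node. This is a sketch under assumptions: the `python3 -m sglang.launch_server` entrypoint and the multi-node flags (`--nnodes`, `--node-rank`, `--dist-init-addr`) are standard SGLang options but are not quoted from the original launch script, and the model path and rendezvous address are hypothetical placeholders.

```python
import shlex

# Flags and environment copied from the settings listed above.
SERVER_FLAGS = {
    "--tp-size": "8",
    "--pp-size": "1",
    "--quantization": "modelopt_fp4",
    "--kv-cache-dtype": "fp8_e4m3",
    "--mem-fraction-static": "0.85",
    "--attention-backend": "flashinfer",
    "--moe-runner-backend": "flashinfer_cutlass",
    "--max-mamba-cache-size": "96",
    "--cuda-graph-max-bs": "8",
}
BOOLEAN_FLAGS = ["--disable-piecewise-cuda-graph"]
ENV = {"NCCL_IB_HCA": "rocep1s0f1", "SGLANG_ENABLE_DEEP_GEMM": "0"}


def build_command(model_path: str, nnodes: int, node_rank: int,
                  dist_init_addr: str) -> list[str]:
    """Assemble the per-node launch argv (multi-node flags are assumptions)."""
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--dist-init-addr", dist_init_addr,
    ]
    for flag, value in SERVER_FLAGS.items():
        cmd += [flag, value]
    return cmd + BOOLEAN_FLAGS


# Hypothetical model path and rendezvous address, for illustration only.
cmd = build_command("/models/placeholder-nvfp4", 8, 0, "10.0.0.1:20000")
env_prefix = " ".join(f"{key}={value}" for key, value in ENV.items())
print(env_prefix, shlex.join(cmd))
```

Only `node_rank` changes between nodes, so the same function can generate all eight command lines.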

Startup

NCCL init (8-node ring+tree): ~5 s
Weight load (113 safetensors shards): ~9 min
KV cache: 17.6M tokens (50.4 GB cluster-wide), Mamba cache: 96 slots
Total time to READY: ~10 min
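The KV cache line implies a per-token footprint that is handy for capacity planning. A quick arithmetic sketch, assuming "GB" means decimal gigabytes and that the cache is split evenly across the 8 nodes (neither is stated explicitly in the log):

```python
KV_TOKENS = 17.6e6          # tokens of KV capacity reported at startup
KV_BYTES_CLUSTER = 50.4e9   # assumes decimal GB, cluster-wide
NODES = 8

bytes_per_token = KV_BYTES_CLUSTER / KV_TOKENS
per_node_gb = KV_BYTES_CLUSTER / NODES / 1e9

print(f"{bytes_per_token:.0f} bytes of KV per token (cluster-wide)")
print(f"{per_node_gb:.1f} GB of KV per node")
```

The result is under 3 KB per token across the whole cluster.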

n=1 baseline (no MTP, prefix cache enabled)

Size	TTFT p50	TPOT p50	Decode p50
8K	5.5 s	73.9 ms	13.5 tps
16K	6.1 s	74.4 ms	13.5 tps
32K	6.5 s	74.3 ms	13.5 tps
64K	6.2 s	75.0 ms	13.3 tps
128K	4.8 s	76.0 ms	13.2 tps

⚠ Note on the flat TTFT: this is not true cold-prefill performance. The benchmark slices prompts of different sizes from the same source corpus, so larger prompts contain smaller ones as prefixes — SGLang’s radix cache hit range grows with prompt size, flattening TTFT (sometimes even making larger sizes faster).
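The prefix-sharing effect can be illustrated with a toy model. This is not SGLang's radix tree, only a sketch of the idea: prompts are slices of one corpus, every earlier prompt is assumed to stay cached, and sizes are assumed to run in ascending order for two passes. The run order, the retention behavior, and K = 1,000 tokens are all assumptions.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def uncached_per_request(sizes: list[int], passes: int) -> list[list[int]]:
    """For each pass, tokens per prompt not covered by any earlier prompt."""
    corpus = list(range(max(sizes)))   # stand-in token IDs
    seen: list[list[int]] = []         # toy cache: every prompt stays resident
    result = []
    for _ in range(passes):
        row = []
        for size in sizes:
            prompt = corpus[:size]
            cached = max((common_prefix_len(prompt, p) for p in seen), default=0)
            row.append(size - cached)
            seen.append(prompt)
        result.append(row)
    return result


sizes = [8_000, 16_000, 32_000, 64_000, 128_000]
first_pass, second_pass = uncached_per_request(sizes, passes=2)
print("uncached tokens, pass 1:", first_pass)
print("uncached tokens, pass 2:", second_pass)
```

In this toy model each prompt in the first pass only pays for the half that extends the previous one, and every later pass is fully cached, which is consistent with a p50 TTFT that barely moves with prompt size.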

True cold prefill (measured with the radix cache disabled, see MTP section below) scales near-linearly: TTFT goes from 5.8 s at 8K to 46.7 s at 64K at a stable ~1,380 tok/s prefill throughput. NemotronH’s hybrid architecture (Mamba-dominant, sparse attention) keeps this roughly O(n) rather than a Transformer’s O(n²), but prefill cost still grows with prompt size.

Production takeaway: workloads that reuse a common system prompt (chatbots, agent loops) get seconds-range TTFT from the prefix cache; workloads with unique long prompts (RAG, batch) should budget against the 1,380 tok/s cold-prefill figure.
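Budgeting against the cold-prefill figure is a single division. A minimal sketch, assuming prefill throughput stays at ~1,380 tok/s across prompt sizes and that "8K" means 8,000 tokens (both are simplifications, and the function name is illustrative):

```python
COLD_PREFILL_TOK_PER_S = 1380.0  # measured cold-prefill throughput from the post


def cold_ttft_seconds(prompt_tokens: int,
                      tok_per_s: float = COLD_PREFILL_TOK_PER_S) -> float:
    """Estimated time to first token for a prompt with no cached prefix."""
    return prompt_tokens / tok_per_s


for label, tokens in [("8K", 8_000), ("16K", 16_000),
                      ("32K", 32_000), ("64K", 64_000)]:
    print(f"{label:>4}: ~{cold_ttft_seconds(tokens):.1f} s")
```

Under these assumptions the estimates land within about half a second of the measured cold TTFTs in the post, so the linear model is adequate for capacity planning in the 8K to 64K range.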

MTP NEXTN (speculative decoding)

With --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-num-draft-tokens 4 added:

Size	TTFT (true cold, radix off)	Decode p50	vs baseline
8K	5.8 s	29.6 tps	2.19×
16K	11.7 s	29.4 tps	2.19×
32K	23.1 s	29.0 tps	2.16×
64K	46.7 s	29.0 tps	2.17×
128K	n/a (a GX10 node crashed mid-run ⚠)	n/a	n/a

Consistent 2.16–2.19× decode speedup, with TPOT more than halved (74 ms → 34 ms) across 8K–64K.
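TPOT and decode tps are reciprocals, so the 74 ms → 34 ms figure follows directly from the table. A small sketch using the 8K row (helper names are illustrative):

```python
def tpot_ms(decode_tps: float) -> float:
    """Time per output token in milliseconds for a given decode rate."""
    return 1000.0 / decode_tps


def speedup(mtp_tps: float, baseline_tps: float) -> float:
    """Decode speedup of the MTP run over the no-MTP baseline."""
    return mtp_tps / baseline_tps


baseline_tps, mtp_tps = 13.5, 29.6   # 8K row: n=1 baseline vs MTP NEXTN

print(f"baseline TPOT: {tpot_ms(baseline_tps):.0f} ms")
print(f"MTP TPOT:      {tpot_ms(mtp_tps):.0f} ms")
print(f"speedup:       {speedup(mtp_tps, baseline_tps):.2f}x")
```

Recomputing the other rows from the rounded table values can shift the last digit of the speedup by about 0.01.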

Note: in SGLang 0.5.12, NemotronH + MTP cannot coexist with the radix cache (--disable-radix-cache required), so enabling MTP means losing prefix caching. Trade-off: long-form output / RAG / agentic → MTP on; short-response multi-turn chat with a shared system prompt → MTP off + prefix cache.
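The trade-off can be put in numbers by asking how many output tokens a request needs before MTP's faster decode repays its slower (uncached) TTFT. The sketch below uses only the measured p50 figures from the two tables. It assumes the warm TTFT from the n=1 baseline is what a prefix-cached request would actually see, which depends on the real cache hit rate.

```python
def break_even_output_tokens(ttft_mtp_s: float, tps_mtp: float,
                             ttft_cached_s: float, tps_cached: float) -> float:
    """Output length at which both modes finish a request at the same time.

    Total time is modeled as TTFT + output_tokens / decode_tps for each mode.
    """
    extra_ttft = ttft_mtp_s - ttft_cached_s              # MTP pays more up front
    saved_per_token = 1.0 / tps_cached - 1.0 / tps_mtp   # MTP saves this per token
    return extra_ttft / saved_per_token


# 64K row: MTP (radix off) 46.7 s / 29.0 tps vs prefix cache 6.2 s / 13.3 tps
tokens_64k = break_even_output_tokens(46.7, 29.0, 6.2, 13.3)
# 8K row: MTP 5.8 s / 29.6 tps vs prefix cache 5.5 s / 13.5 tps
tokens_8k = break_even_output_tokens(5.8, 29.6, 5.5, 13.5)

print(f"64K prompt: MTP wins beyond ~{tokens_64k:.0f} output tokens")
print(f" 8K prompt: MTP wins beyond ~{tokens_8k:.0f} output tokens")
```

Under this simple model, MTP pays off almost immediately at 8K. At 64K with a warm prefix cache it only wins for responses approaching a thousand tokens, which matches the guidance above.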

Getting a 550B model to ~30 tps per stream with a 256K context window on an 8-node setup within an individual's budget exceeded my expectations.

Bonus: GX10 stability tips wanted

Quick ask: my 4 ASUS Ascent GX10 units frequently go down under sustained inference load (e.g., the 128K cold-prefill runs above). The 4 Lenovo ThinkStation PGX units — same GB10 SoC — run the identical workload without issues, so I suspect something GX10-specific.

Symptoms:

Silent failure during inference: kernel ring buffer freezes → user space dies ~13 minutes later
Eventually progresses to a full power-off state (no ICMP/ARP; physical power-on required)
Neighboring nodes’ journalctl/dmesg show zero events related to the failed node

If anyone has tips for keeping the GX10 stable under this kind of workload — BIOS settings, power management, thermal control, driver options, anything — I’d really appreciate it.

Happy to share full recipes, NCCL flags, SGLang launch scripts, and breakout wiring diagrams if anyone wants to replicate this setup.