Nemotron-3-Ultra-550B-A55B (2-bit GGUF) across 2× DGX Spark via llama.cpp RPC: it works (~5 tok/s)

Got NVIDIA’s brand-new Nemotron-3-Ultra-550B-A55B running across two DGX Sparks (GB10) and wanted to share the working recipe, since I couldn’t find anyone who’d done it on Spark hardware yet.

**Setup**

• Model: unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF, UD-Q2_K_XL (~188 GiB, 6 shards)

• 2× DGX Spark (GB10, 128GB unified ea), linked over the 200GbE ConnectX-7 RoCE port

• llama.cpp upstream master, built `-DGGML_CUDA=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121` (auto → sm_121a)

• Layer-split via RPC: rpc-server on node 2, llama-server on node 1 with `--rpc :50052 --device CUDA0,RPC0 -sm layer --tensor-split 1,1 --no-mmap -fit off`
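
For anyone who wants something closer to copy-paste, the setup above can be sketched roughly as the script below. It is a dry run that only prints the commands, so nothing is built or launched. Several values are assumptions and not from the post: the worker address `192.168.100.2` (the host in front of `:50052` wasn't given), the model path and shard filename, the `./build/bin/` location, and the `-H 0.0.0.0` bind on the worker.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands instead of running them.
# NODE2_IP and MODEL are hypothetical placeholders, not values from the post.
NODE2_IP="192.168.100.2"   # assumption: node 2's address on the ConnectX-7 link
RPC_PORT=50052             # port from the post
MODEL="/models/nemotron-3-ultra/UD-Q2_K_XL-00001-of-00006.gguf"  # assumption: first of the 6 shards

# Build with the same flags on BOTH nodes (flags taken from the post).
build_cmds() {
  echo "cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121"
  echo "cmake --build build --config Release -j"
}

# Node 2 (worker): expose its GPU over RPC.
# -H 0.0.0.0 is an assumption so the head can reach it; prefer the link IP on shared networks.
worker_cmd() {
  echo "./build/bin/rpc-server -H 0.0.0.0 -p ${RPC_PORT}"
}

# Node 1 (head): local GPU as CUDA0, remote GPU as RPC0, layers split 1:1.
head_cmd() {
  echo "./build/bin/llama-server -m ${MODEL}" \
       "--rpc ${NODE2_IP}:${RPC_PORT} --device CUDA0,RPC0" \
       "-sm layer --tensor-split 1,1 --no-mmap -fit off"
}

build_cmds
worker_cmd
head_cmd
```

Start the worker first, then the head. Pointing `-m` at the first shard is enough for llama.cpp to pick up the remaining split files in the same directory.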

**Results**

• Decode ~5.2 tok/s · Prefill ~120 tok/s · ~95 GiB/node (balanced) · RoCEv2 RDMA auto-activated

• Hybrid Mamba-2 + MoE (`nemotron_h_moe`) loads + runs on current master (dedicated nemotron-h-moe.cpp)

**Sample — actual reasoning task (not just specs)**

Prompt: *“8 identical balls, one slightly heavier — find it with a balance scale used only twice.”* It’s a thinking model (visible reasoning), and it nailed it:

> Weigh 3 vs 3 → if balanced, heavy ball’s in the leftover 2 (weigh them). If unbalanced, it’s in the heavier 3 → weigh 1 vs 1 (if equal, the 3rd). Two weighings, guaranteed.

600 tokens in 117s — a half-trillion-param model thinking out loud, entirely on two Sparks.

**Gotchas (hope this saves you time)**

1. `--no-mmap` + `-fit off` are mandatory: the 188 GiB model exceeds a single node’s RAM, so mmap thrashes and the load never finishes (it sits at “fitting params to device memory”). With `--no-mmap` the file is read linearly (~960 MB/s) and loads in ~6 min.

2. Don’t run a redundant local rpc-server on the head. Use the head GPU directly via `--device CUDA0,RPC0`; a local rpc-server double-counts the GPU and stalls.

3. Both nodes must run the same llama.cpp build (RPC protocol version must match).
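
The first and third gotchas can be turned into a small pre-flight check, sketched below. The helper names are hypothetical, the 128 figure is taken at face value as an upper bound on per-node memory, and the arithmetic only covers weights (no KV cache, compute buffers, or OS overhead), so treat it as a rough sanity check and not a guarantee.

```shell
#!/usr/bin/env bash
# Pre-flight sketch for the gotchas above. Sizes come from the post;
# the helper names are hypothetical.

# Gotcha 1: would the weights fit in one node's memory?
# Integer GiB only; ignores KV cache and OS overhead, so "yes" is optimistic.
fits_one_node() {   # usage: fits_one_node MODEL_GIB RAM_GIB
  [ "$1" -lt "$2" ]
}

# Weights per node for an even layer split (integer GiB).
per_node_gib() {    # usage: per_node_gib MODEL_GIB NODES
  echo $(( $1 / $2 ))
}

# Gotcha 3: both nodes should report the same non-empty build string.
same_build() {      # usage: same_build "$head_build" "$worker_build"
  [ -n "$1" ] && [ "$1" = "$2" ]
}

MODEL_GIB=188   # UD-Q2_K_XL size from the post
RAM_GIB=128     # upper bound per node; usable memory is lower in practice

if fits_one_node "$MODEL_GIB" "$RAM_GIB"; then
  echo "fits on one node"
else
  echo "needs 2 nodes: ~$(per_node_gib "$MODEL_GIB" 2) GiB of weights each"
fi
```

With the numbers above this prints `needs 2 nodes: ~94 GiB of weights each`, which lines up with the ~95 GiB/node observed. For `same_build`, one option is to pass in the output of `git rev-parse --short HEAD` from the llama.cpp checkout on each node.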

Honest take: decode is RPC-round-trip-bound, so dual-node is slower per token than a single node would be; you do it purely because the model won’t fit on one box. ~5 tok/s isn’t fast, but a 550B model running fully local on two Sparks is wild.