nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

The numbers seem to be in the same league as other top models.

jeremyk June 4, 2026, 3:30pm 2

Not trying to sound stupid, but what are the odds of running this (NVFP4) on a 3-node Spark cluster in pipeline parallel?

Most likely tight on a 3-node cluster. Qwen3.5-397B-A17B-int4-AutoRound runs on 2 nodes with 226GB of source+weights, and Ultra NVFP4 clocks in at 352GB.
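For anyone who wants to redo the fit math, here is a rough sketch. The 352GB figure is from the post above; the 128GB per node and the 90% usable fraction are my assumptions, not measurements. At 128GB per node, three nodes give 384GB in total, so at most 32GB is left before any OS or runtime overhead.

```python
# Back-of-envelope fit check. All constants are assumptions, not measurements.
NODE_MEMORY_GB = 128      # assumed unified memory per DGX Spark
USABLE_FRACTION = 0.90    # hypothetical share left after OS and runtime overhead


def headroom_gb(weights_gb: float, nodes: int) -> float:
    """Memory left for KV cache and activations after loading the weights."""
    usable = nodes * NODE_MEMORY_GB * USABLE_FRACTION
    return usable - weights_gb


if __name__ == "__main__":
    for nodes in (3, 4, 5):
        print(f"{nodes} nodes: {headroom_gb(352, nodes):+.1f} GB headroom")
```

A negative number means the weights alone do not fit under the assumed usable fraction. Whatever is positive has to cover the KV cache, which is why max-model-len ends up being the knob that matters.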

giles8 June 4, 2026, 4:56pm 5

That is in line with my original guesstimate. I actually think it would take 5 nodes or more to be comfortable.

Dear Boss,
I need 2 more Sparks and a very expensive switch. Should probably make it 3 Sparks.

Love,
Me

mashie June 4, 2026, 5:03pm 7

Actually the MikroTik switch most people are using for 4/8 node clusters is a quarter of the price of a DGX Spark.

giles8 June 4, 2026, 5:06pm 8

Is it a loud thing? It looks like half rack width, with DC-type fans and PSUs. Those things can get very noisy!

mashie June 4, 2026, 5:17pm 10

According to some owners, it is not very noisy. Personally I’m working on a DAC solution to not require a switch in the first place. Hopefully it works out, as I would prefer to not add a switch myself.

s0ne June 4, 2026, 5:36pm 11

I am using a Mikrotik CRS804 switch, and it is not as noisy as I expected. I am currently downloading an NVFP4 model. I expect the model to load successfully across the four DGX nodes, but the max-model-len will likely be the key factor. It seems that 1M context might not be possible.

With 55B active parameters, it will be slow unless the new architecture makes a difference.
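To put a rough ceiling on "slow": single-stream decode is usually limited by how fast the active weights can be read from memory. The sketch below assumes 4-bit weights at 0.5 bytes per parameter (ignoring scale factors), an assumed 273 GB/s of memory bandwidth per node, and no interconnect or KV-cache cost, so real numbers will be lower.

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound MoE. Assumptions throughout.
ACTIVE_PARAMS = 55e9      # from the "A55B" in the model name
BYTES_PER_PARAM = 0.5     # 4-bit weights, ignoring scale factors (assumption)
BANDWIDTH_GB_S = 273      # assumed per-node memory bandwidth in GB/s


def decode_ceiling_tok_s(nodes: int, tensor_parallel: bool) -> float:
    """Upper bound on single-stream tokens/s, ignoring interconnect and KV-cache reads."""
    active_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9
    # Pipeline parallel: for one stream the stages run one after another,
    # so the full active set is read serially at one node's bandwidth.
    # Tensor parallel: every node reads its 1/nodes share at the same time.
    effective_bandwidth = BANDWIDTH_GB_S * (nodes if tensor_parallel else 1)
    return effective_bandwidth / active_gb


if __name__ == "__main__":
    print(f"pipeline parallel, 4 nodes: {decode_ceiling_tok_s(4, False):.1f} tok/s")
    print(f"tensor parallel, 4 nodes:   {decode_ceiling_tok_s(4, True):.1f} tok/s")
```

Under these assumptions pipeline parallel tops out at roughly 10 tok/s for a single stream no matter how many nodes you add. Tensor parallel scales the ceiling with node count but pays for it in interconnect traffic, which this sketch does not model.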

Balaxxe June 4, 2026, 7:20pm 13

Ah, the DGX Station model has been released lol.

Well, when the NVFP4 variant comes around at least.

I wonder if this could be run on 4x GB10 at a reasonable speed.

The num_speculative_tokens in the suggested inference config is 5, instead of the 3 used for the 120B A12B, so it might do better than 12/55=22% of the Super model's speed. 15 tok/s isn't ruled out yet!
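A quick way to play with that estimate: scale the Super model's speed by the active-parameter ratio, then by the gain from drafting 5 tokens instead of 3. The sketch uses the standard expected-tokens formula for speculative decoding, (1 - a^(k+1)) / (1 - a), which assumes every draft token is accepted independently with the same probability a. The acceptance rate and the Super baseline speed are placeholders you have to supply; none of this is measured, and draft-model cost is ignored.

```python
def expected_tokens_per_step(acceptance: float, draft_tokens: int) -> float:
    """Expected tokens emitted per target-model pass, assuming i.i.d. acceptance."""
    if acceptance >= 1.0:
        # Every draft token is accepted, plus the one token from the target model.
        return draft_tokens + 1.0
    return (1 - acceptance ** (draft_tokens + 1)) / (1 - acceptance)


def estimate_ultra_tok_s(super_tok_s: float, acceptance: float) -> float:
    """Scale a measured Super (A12B, 3 draft tokens) speed to Ultra (A55B, 5 draft tokens).

    Assumes decode speed is inversely proportional to active parameters and
    that both models see the same acceptance rate (both are guesses).
    """
    param_ratio = 12 / 55
    spec_gain = expected_tokens_per_step(acceptance, 5) / expected_tokens_per_step(acceptance, 3)
    return super_tok_s * param_ratio * spec_gain


if __name__ == "__main__":
    # 60 tok/s is a made-up baseline for illustration only.
    for a in (0.5, 0.7, 0.9):
        print(f"acceptance {a}: {estimate_ultra_tok_s(60.0, a):.1f} tok/s")
```

The extra two draft tokens only help much when the acceptance rate is high; at low acceptance the gain over 3 draft tokens is a few percent.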

truxnor June 4, 2026, 9:30pm 15

I’m currently trying to get the NVFP4 up and running on 4 Sparks, with no joy so far. It's not going OOM, so there is some hope I can figure out a way to do this eventually.

I hope this model can eventually be quantized to INT2 using Intel AutoRound and deployed on a 2-node DGX Spark cluster with vLLM.

adrenfu June 5, 2026, 2:44am 18

Any recommended recipe for the NVFP4 variant? I have a cluster of 4 via CRS804 and will probably try to run this, but at 55B active params I expect this to be slower than Qwen 3.6 27B dense.

0rand June 5, 2026, 9:37am 19

In case someone wants to try it (I am about to start benching it lol), Nvidia offers it for free on OpenRouter: Nemotron 3 Ultra (free) - API Pricing & Benchmarks | OpenRouter

giles8 June 5, 2026, 9:54am 20

Does INT2 even work without hallucinating like mad? I’m just trying to imagine how a 2-bit activation function could not cause a massive loss of quality.
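To get a feel for how coarse 2 bits is, here is a toy round-trip using plain group-wise round-to-nearest on random weights. This is not AutoRound (which, as I understand it, tunes the rounding rather than just rounding to nearest) and says nothing about end-to-end model quality. It only shows how reconstruction error grows as the number of levels per group drops from 256 to 16 to 4. The group size of 32 and the Gaussian weights are arbitrary choices.

```python
import random


def rtn_roundtrip(weights, bits, group_size=32):
    """Quantize each group to 2**bits levels with min-max scaling, then dequantize."""
    out = []
    levels = 2 ** bits - 1
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # Fall back to 1.0 when the group is constant to avoid dividing by zero.
        scale = (hi - lo) / levels or 1.0
        out.extend(round((w - lo) / scale) * scale + lo for w in group)
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
    for bits in (2, 4, 8):
        err = mean_abs_error(weights, rtn_roundtrip(weights, bits))
        print(f"{bits}-bit mean abs error: {err:.4f}")
```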

s0ne June 5, 2026, 2:23pm 21

I have tried various methods to load the model but have not yet succeeded. Has anyone with a cluster of 4 or more nodes successfully loaded the NVFP4 model?

0rand June 5, 2026, 2:32pm 22

Before I had a cluster, while I was waiting for a cable, I tested Deepseek V4 flash on one node with darkstar4's custom inference engine for Deepseek and a q2 GGUF. It was insanely slow, 6-7 t/s, but surprisingly coherent. The bigger the model, the better it quantizes. I truly believe most big-firm inference for chat apps runs below q4, maybe at q3 or q2.