Deliberations on 4-sparks cluster advantages

June 16, 2026, 8:56am 1

I just can’t get this thought out of my mind.
I am a happy camper with my 2x Gigabyte Ai Top Atoms running Deepseek V4 Flash, especially with yesterday’s improvement. However, just looking at RAM prices jumping almost daily, the general panic here in the EU about AI availability, the painful divorce from the US thanks to the administration’s persistent efforts, and the security and economic implications, you can’t stop thinking that if you don’t do it now (although investing an extra 10k on top of what is already invested is no joke at all for me), you will be priced out, inflated out, restricted, pushed into the underclass.

The question though is on the practical side: what extra capabilities can we gain from models running on a 4x cluster that we can’t get on 2x? Near-frontier models (~1T+) are still out of the question. Models in the 500B-750B range might work in theory (very few models), but they are likely to be painfully slow even on a 4x cluster, as performance gains from 2->4 are diminishing from what I have read. Minimax M3 seems to me the only model that would be enabled by this jump. Nemotron ultra can kiss my bottom.
Am I missing something? Let’s chat, people!

Make sure it is not fear of missing out. Most local use cases are fine with just 1 Spark running qwen3.6-27b or gemma4-31b.

Pushing to 2/4 Sparks is only needed if you are ready to compromise stability and invest a lot more maintenance in it.

There are really good models for 2, 4 and 8 Sparks, but the quality walls between them are around 5-10% on a particular benchmark, and you may get better results by combining cloud inference for bigger models with local models for tasks that are heavy on tokens/context/cache. You may even split workloads across different providers: Gemini, Claude, DeepSeek, etc.

On 2 nodes I run Minimax m2.7; on 4 I run GLM-4.7 or large FP8 models to confirm quantization quality. All newer large models require sparse attention or something similar, which is not supported on gb10/sm12x.

Ria33 June 16, 2026, 9:36am 3

Hi, I am a newbie here and have been using a strix halo + dual 3090 eGPU (good for 27b and 31b on the 48GB eGPU, will try on Spark as well), and recently ordered 4 x DGX Spark (2 units have arrived so far). I have had the same question as you posted.
I justified my purchase, plus an extra switch (1140 euros), as future-proofing. Probably 2 x Sparks might be enough in most cases, but I have seen one of the members here post about GLM 5.1 nvfp4, which seems like it could fit into 4 units. (Bear with me, I have no clustering experience on Sparks; I am just guessing size-wise.)
GLM 5.2 is coming, and minimax m3 seems to need more than 2 units.
I also see that in the worst case, I can use 2 units for LLM family A and another 2 units for LLM family B, cross-reviewing each other for a better result.
My expectation is that the next Spark version might come later than 2027 due to the DDR6 delay, so the current one’s life span should be fine, and even after Spark 2 the first generation’s value might not drop severely, just as the 3090 has held its value.
So for me, why not 4? As long as there is squeezable budget.

0rand June 16, 2026, 9:52am 4

Thanks, that is exactly the point of this discussion. Models like Qwen 3.6 35b or 27b just don’t work for me - my goal is to execute tasks (especially coding) locally; my work is 100% backend, distributed systems and data science, and TUI/console is my front-end. I am okay with using rented compute from Lambda or RunPod for one-off tasks like SFT/LoRA, but I want everyday work to be done locally. My codebase is sensitive. I can still tolerate occasional spillage to the cloud, but as soon as the product becomes investable that won’t be possible.
DeepSeek v4 Flash is pretty much the first model that covers my day-to-day use, due to its wide knowledge and 1M token window. Qwen 122b is next, but its 256k window is very limiting (a large codebase of interleaved classes and a vast document corpus, even as it is in the Wiki), and YaRN slows down both prefill and decode by 75% past 256k. I already have enough cache on 2 Sparks: I run hermes, Forgejo, even a small vision model in llama.cpp (qwen 3.5 9b q4) for the occasional screenshot look, plus I run gemma4 12b for another task on the 5070ti GPU in my workstation.

So yes, it is a good question what 4 Sparks can offer in my case. M3 perhaps, with a 1M window, and Qwen 3.5 397b as a backup with a healthy window.

GLM models are low-context from what I can see, so they are not of interest to me, unless I am mistaken. As for FP8, my own tests on smaller models did not show a noticeable difference from well-quantized FP4/nvfp4 or even int4 models; if an occasional tool call gets borked, the agent can fix it. Overall robustness, world knowledge and reasoning quality are more important.

Let’s consider options - Lambda.ai

NVIDIA B200 SXM6 | 180 GB | 52 | 720 GiB | 5.5 TiB SSD | $6.89

To run, let’s say, M3 I will need 2 units clustered.
14 + 20% = 17 USD per hour.

Let’s say I want to use it to review the day’s work and accumulate tasks to escalate and solve - 1 hour per day, while the rest is done by the 2x Sparks and ds4f.

20 x 17 = 340 / month

Let’s say the usable life of a Spark is 3 years.

Upgrading to 4x (2 units + router + cables) is 10k.

So on depreciation alone, this use case will cross the upgrade cost in 29 months, and that is at only 1 hour per day, not the 24 hours per day that the Sparks upgrade would give.

that’s the math equation.
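The rental-versus-upgrade arithmetic above can be sketched as a small calculation. This is only a sketch using the rounded figures from this post (17 USD per hour for two clustered B200s, 1 hour on 20 days per month, 10k for the upgrade); the function names are made up for illustration.

```python
def monthly_rental_cost(hourly_rate_usd: float, hours_per_day: float, days_per_month: int) -> float:
    """Cost of renting the cloud cluster for a fixed daily slot."""
    return hourly_rate_usd * hours_per_day * days_per_month


def break_even_months(upgrade_cost_usd: float, monthly_cost_usd: float) -> float:
    """Months of rental after which the one-off upgrade would have been cheaper."""
    return upgrade_cost_usd / monthly_cost_usd


# Figures taken from the post: 17 USD/h, 1 h/day, 20 days/month, 10k upgrade.
monthly = monthly_rental_cost(17.0, 1.0, 20)
months = break_even_months(10_000, monthly)
print(f"rental: {monthly:.0f} USD/month, break-even after {months:.1f} months")
```

Changing `hours_per_day` shows how quickly the break-even point moves: at 2 hours per day the same upgrade would pay for itself in roughly half the time.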

On token cost - I just did the calculation.
If we compare ds4f to gpt5.4 mini (it’s a stretch, but actual results are surprisingly comparable), running my own Sparks cuts costs by 94%, including electricity and depreciation. That assumes max power use 24x7, of course, which is not the case.
However, even if we assume that we run at best at 10% utilization across 24x7, it’s still 50%+ cheaper, as electricity is the major cost, not depreciation; at 24x7, electricity is 90% of the cost.

0rand June 16, 2026, 10:01am 5

If a new Spark lands, the cost of RAM will make it unattainable for me; it will likely creep toward DGX Station cost. Sparks aren’t cheap and aren’t fast, but the returns/RAM/cost are imo unbeatable right now. But that is changing fast as the cost of units truly goes up. :(

truxnor June 16, 2026, 10:18am 6

The cost of all hardware related to RAM is kinda going crazy right now.

My journey started with one Spark, before I knew much about running LLMs locally. At first I was happy with one, then realised that you have that expensive network card that is not being used, so I bought another. I was happy with 2 Sparks, but always wanted to run larger local models, for somewhat similar reasons to yourself. The smaller qwen models are good at certain things, but they do lack the larger knowledge base for my needs.

My original goal was to run glm 5.1, so I had plans to go to 4 and maybe 8 Sparks, but the diminishing returns in speed as the LLMs get larger are certainly a trade-off.

Even though I had qwen 397b running on 2 Sparks, there was not much headroom for context, so I went for 4 Sparks running the same model and it gave me the headroom I needed.

I’m now running minimax 3 on 4 Sparks and have been trying it for the last several days against my normal work, and I can’t say whether it is better than qwen. It’s not worse, but it’s honestly hard to say if it is truly better. I don’t really use my Sparks for coding; they might make some simple scripts, but this is just for parsing and looking at DFIR logs.

For me, I am likely going to buy a couple more Sparks, just so I can run a few different LLMs.
I usually have several different investigations going when I have a platform that is looking at and parsing DFIR logs, so I need the bigger context and something that is able to do a bit more than tool calling, and finding the right LLM for this is not easy!

If you can afford it and you have a genuine “need” for them, go for it. I do not have buyer’s remorse about my current investment. I have also learnt a lot, and this will certainly be of benefit going forward.

0rand June 16, 2026, 10:21am 7

Thank you so much for your input. What is your experience with M3 prefill/decode on 4 Sparks when pushing to the 1M limit? Maybe you can run the tool eval tests if you have spare time?

truxnor June 16, 2026, 10:30am 8

I am currently using the sglang recipe from spark arena with context set to 500k, but I do intend to try vllm this weekend to see if I can get more tokens a sec. Currently it is about 19-20, which is usable, but of course I would like it to be faster. There is enough VRAM that I could push it to 1M context, but I have not actually tried…yet.

I did do one run of the tool-eval benchmark, and it scored in the mid 80s for a full run, but for me the most important parts are the malicious prompt injection / detection tests, and it at least passed TC-60, which most LLMs fail. The only other one I have seen pass it so far is glm 5.1 - I just can’t run glm 5.1 with enough context to be useful.

For me, minimax is just about acceptable at the current speed. If it were a few tokens slower, I would likely go back to qwen 397b, but I am sure I can get more speed out of it when I have a play around this weekend. I will do some benchmarks once I have a working recipe.

0rand June 16, 2026, 10:59am 9

People are running the new nvfp4 recipe at 27 t/s on a single Spark and 40 on 2, up to 5 seqs, where it seems to start choking. This is definitely a usable speed.
Check out how sparkarena/Minimax-M3-v0-NVFP4 achieved 24.34 tokens/sec on text generation on NVIDIA DGX Spark with SGLang!


I tested cloud M3 on tool eval hardmode - it was also 85. But I found that cloud versions score lower than local ones, so I assume they use a very aggressive quantization, prompt injection etc that ruins tests and usability. Local/Colocated is the way, not API. For me certainly.

Ria33 June 16, 2026, 11:24am 10

It already looks nice at this early stage of optimization. Well, 4 units might be a good enough reason for someone who likes to run minimax m3 locally. (me)

I’ve ordered 4 units before the fable restriction, but that only added more justification for my decision.

GLM-5.2 supports 1M tokens, yet I wouldn’t expect it to run easily on the gb10 platform.

Overall, deepseek v4 is a unicorn: it is the only model which needs 10 GB of VRAM for a 1M token context, yet recall still fails on it after 256k tokens. Other mainstream models require 100-ish GB of VRAM for 1M context, which is exactly the reason for the current mess with RAM prices, as cloud providers push the prefix cache from VRAM to RAM to free up GPU resources.

0rand June 16, 2026, 11:42am 12

I am not sure what you are talking about - I constantly run over 500k, the prefix cache hit rate is 97% and it rocks at 30-35 t/s. I add maybe 100-500 new tokens per iteration on average after the initial ramp-up, and ttft is instant. Prefix cache works totally fine. I am not knowledgeable about the inner workings of “mainstream” models, but the models I did test do not require 100 GB per 1M. The worst offender is qwen 3.6 27b - about 36 GB per 1M tokens in q8. Nemotron super and cascade are very small, 10 GB per 1M. qwen 3.5 122b is similar, about 14 GB per 1M. And so on.
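To put those per-1M figures next to an actual context size, here is a rough sketch. It uses only the numbers reported in this post and assumes the cache grows linearly with the number of cached tokens, which is a simplification; the dictionary and function names are made up for illustration.

```python
# GB of KV cache per 1M tokens, as reported in the post above (not measured here).
REPORTED_GB_PER_1M = {
    "qwen 3.6 27b (q8)": 36.0,
    "nemotron super / cascade": 10.0,
    "qwen 3.5 122b": 14.0,
}


def kv_cache_gb(gb_per_1m_tokens: float, context_tokens: int) -> float:
    """Estimate cache size, assuming it scales linearly with cached tokens."""
    return gb_per_1m_tokens * context_tokens / 1_000_000


for name, gb_per_1m in REPORTED_GB_PER_1M.items():
    print(f"{name}: {kv_cache_gb(gb_per_1m, 500_000):.1f} GB at 500k tokens")
```

Under that assumption, even the worst offender in the list stays well below the 100 GB per 1M figure mentioned earlier in the thread.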

 rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:31 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.64, Accepted throughput: 25.00 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 250 tokens, Drafted: 304 tokens, Per-position acceptance rate: 0.974, 0.671, Avg Draft acceptance rate: 82.2%
(APIServer pid=1) INFO 06-16 14:11:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.49, Accepted throughput: 23.60 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 236 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.886, 0.608, Avg Draft acceptance rate: 74.7%
(APIServer pid=1) INFO 06-16 14:11:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:51 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.46, Accepted throughput: 23.00 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 230 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.892, 0.563, Avg Draft acceptance rate: 72.8%
(APIServer pid=1) INFO:     192.168.1.2:64972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 06-16 14:12:01 [loggers.py:271] Engine 000: Avg prompt throughput: 249.1 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.41, Accepted throughput: 13.50 tokens/s, Drafted throughput: 19.20 tokens/s, Accepted: 135 tokens, Drafted: 192 tokens, Per-position acceptance rate: 0.854, 0.552, Avg Draft acceptance rate: 70.3%
(APIServer pid=1) INFO 06-16 14:12:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.1%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 30.80 tokens/s, Accepted: 254 tokens, Drafted: 308 tokens, Per-position acceptance rate: 0.974, 0.675, Avg Draft acceptance rate: 82.5%

After two days of heavy use, with the agent constantly churning through the data, KV cache usage got to 97% with no slowdown, then dropped and has stayed around 95%.

harix June 16, 2026, 12:54pm 14

I was in the same boat. I already had two DGX Sparks, and I recently pulled the trigger on two Gigabyte Atoms. I bought them without VAT (EU), so a month ago, the 4TB version was still around $4,300 USD. In total, the four machines, switch, and cables cost me a bit over $19k USD.

My logic was this: they come with a two-year warranty. Even in the absolute worst-case scenario where all four die the day after the warranty expires, it breaks down to about $800 a month. Compared to what I’d be paying for API tokens, that’s still a bargain.
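As a quick check of that worst-case figure, a minimal sketch using the roughly 19k USD total and the two-year warranty from this post; the function name is made up for illustration.

```python
def worst_case_monthly_cost(total_cost_usd: float, warranty_months: int) -> float:
    """Spread the full purchase price over the warranty period only."""
    return total_cost_usd / warranty_months


# About 19k USD for four machines, switch and cables; 24 months of warranty.
per_month = worst_case_monthly_cost(19_000, 24)
print(f"{per_month:.0f} USD per month")
```

Any resale value or working life beyond the warranty would only lower that number.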

I usually work on two or three codebases simultaneously, and for my use cases, I don’t see a huge difference between SOTA cloud models and local ones. Sure, on a good day, Opus beats local models hands down. But I’ve also seen Opus generate plenty of stupid bugs when it’s being throttled. I simply have more confidence in a local model that runs exactly the same way every time I set it up.

Having the second pair of machines also gives me a lot more flexibility. I can dedicate one pair to coding, while using the second pair for ComfyUI workflows or, eventually, LoRA training.

I’m hoping that in a year, we’ll see a local model under 512GB with the capabilities of today’s Opus. If that happens, I think I’m set until true AGI arrives.

What is your performance on GLM 5.1?

It feels like if you have the budget for a third or fourth Spark, it’s probably time to look into faster solutions.

A Spark gives you tons of memory, but it’s always slow. Whether you run 4, 8, or 16 Sparks, you’re still going to be bottlenecked at around 20–30 t/s on large models at best.

Take Qwen-397B as an example. On a dual-Spark setup, we get 25 t/s. Someone on the forum shared a hyper-optimized run on 4 Sparks with b12x, and that pushed it to… a whopping 35 t/s! 😅

It seems like at some point, it’s way better to switch to RTX PRO 6000, where people are getting up to 180 t/s with the exact same model.

Yeah, it’s more expensive. If your budget only covers 1 or 2 Sparks, then it’s a great bang-for-your-buck solution. But honestly, if you have the budget for 4, 8, or 16 of them, it makes way more sense to look into RTX instead.
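The scaling argument can be put into numbers with a short sketch. It uses only the throughput figures quoted in this post (25 t/s on 2 Sparks, 35 t/s on 4, up to 180 t/s on the RTX PRO 6000 setup); the function names are made up for illustration, and "scaling efficiency" here just means speedup divided by the hardware multiple.

```python
def speedup(new_tps: float, old_tps: float) -> float:
    """How many times faster the new setup decodes."""
    return new_tps / old_tps


def scaling_efficiency(new_tps: float, old_tps: float, new_nodes: int, old_nodes: int) -> float:
    """Speedup relative to the increase in node count (1.0 would be perfect scaling)."""
    return speedup(new_tps, old_tps) / (new_nodes / old_nodes)


# Qwen-397B figures quoted in the post.
dual_spark_tps, quad_spark_tps, rtx_tps = 25.0, 35.0, 180.0

gain = speedup(quad_spark_tps, dual_spark_tps)
efficiency = scaling_efficiency(quad_spark_tps, dual_spark_tps, 4, 2)
rtx_gain = speedup(rtx_tps, quad_spark_tps)
print(f"2 -> 4 Sparks: {gain:.2f}x faster, {efficiency:.0%} scaling efficiency")
print(f"RTX setup vs 4 Sparks: {rtx_gain:.1f}x faster")
```

So doubling the Sparks buys about 1.4x the decode speed for this model, while the quoted RTX figure is roughly 5x the 4-Spark result.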

If we are talking specifically about models, things can change completely over time.

As of today, DeepSeek-V4-Flash justifies the purchase of a second Spark. It is an intelligent and, by Spark standards, fast model. An excellent solution.
Before this model came out, there was only Qwen-3.5-397B-INT4. With a speed of around 25 t/s, it was quite a smart model, but did it justify the cost of a second Spark? It depends on the user.

As of right now, there isn’t a truly good (both fast and smart) model optimized for a four-Spark setup. But that doesn’t mean we won’t see one in a month or six months.

But with RTX setups, it’s much easier for us to get models running at acceptable speeds—like GLM-5 or MiniMax M3—instead of completely discarding them because they’re ‘smart but unacceptably slow’.

0rand June 17, 2026, 4:03pm 18

Are you replying to me? I have no desire to host a server at home. It’s my personal gear and must be portable. For business, once we go live I will just rent cloud metal and never think twice. We don’t need a massive model in production for what we do; even gemma 12b does very well, but the smallest possible ttft is important. But for development and data analysis I need my own gear, and it must be portable. There is no question of Sparks versus a massive heater box. It’s either a rented instance at Lambda or Sparks. Before I tested the waters with 1 Spark, my plan was to get a gh200 for a few hours a day and be done with it for most tasks, plus dual b200 for the few most complex or compute-intensive ones.

truxnor June 17, 2026, 11:21pm 19

That is 4 RTX Pro 6000s, which are way more expensive than 4 Sparks. In the UK these cards are now about 11k each. Sure, it’s faster than a Spark, but you need 4 of them and an appropriate PC, which will set you back about 50k. 4 Sparks and a switch are way cheaper than that, not even close.

Plus the heat and noise from 4 cards will be a lot more than from the 4 Sparks, so it’s not really comparing the same thing.

truxnor June 17, 2026, 11:27pm 20

It’s been a while, but if I recall I was getting about 10 or so tokens a sec, and after a short period of time it would go OOM. I could possibly run a benchmark or two, but not use it for anything else; it’s just too tight on the VRAM.

I assume that was with MTP?