Pushing GB10 to the Limit: Qwen3 235B MoE + Concurrent Best-of-4 + Persistent Agent Layer. Architecture check & Optimization tips?
Hey everyone,
I’m architecting a high-performance local AI node around a single ASUS Ascent GX10 (NVIDIA GB10 Grace Blackwell, SM121, 128GB unified LPDDR5x). The goal is to maximize reasoning depth and throughput on a single-node setup without compromising quantization quality or moving to a multi-node cluster.
Hardware & Stack
Layer 0 — Persistent Infrastructure (always running)
• Obsidian — knowledge graph, automated nightly restructuring via AI agent (MOC updates, link detection, tag normalization)
• n8n — workflow orchestration, webhooks, fan-out routing, scheduled agents
• Qdrant — vector DB for RAG pipelines
• Whisper — local STT (medical dictation, voice input)
• Tailscale — secure remote access
• Open WebUI — front-end interface
Layer 1 — Primary Inference
• Model: Qwen3 235B MoE (NVFP4 target, fallback Q4_K_M GGUF)
• Engine: vLLM (avarok kernel build for SM121) → TensorRT-LLM path planned
• Inference strategy: Concurrent Best-of-4 (batch size = 4) with lightweight local Judge for trajectory scoring
Layer 2 — Nightly Autonomous Loop
• Log review agent → generates corrected system prompt for next day
• RAG reindexing agent → Qdrant cleanup + fresh embeddings
• Micro LoRA fine-tune → BF16, pairs from daily error detection
• Obsidian restructuring agent → knowledge graph maintenance
Why this specific setup?
MoE memory arbitrage: A dense 235B in BF16 (~470GB) is impossible on 128GB. Qwen3 235B MoE fits the unified memory envelope in NVFP4 (~58-65GB weights), while activating only ~22B parameters per token. This keeps per-token latency low while preserving reasoning depth.
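A quick back-of-envelope check of these footprints, as a minimal sketch. It assumes every weight is quantized and that NVFP4 costs about 4.5 bits per weight (4-bit values plus one FP8 scale per 16-value block); the function name is illustrative only. Under these assumptions the 235B NVFP4 figure comes out well above the ~58-65GB quoted above, so the actual checkpoint size is worth confirming before budgeting anything else.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in decimal GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 4-bit values plus one 8-bit (FP8 E4M3) scale shared by each
# 16-value block. The per-tensor scale is negligible and ignored here.
NVFP4_BITS = 4 + 8 / 16  # 4.5 effective bits per weight

if __name__ == "__main__":
    for label, bits in [("BF16", 16), ("NVFP4", NVFP4_BITS)]:
        print(f"{label}: {weight_footprint_gb(235e9, bits):.1f} GB")
```

Layers kept in higher precision (embeddings, router gates, norms) would push the real number up further, so this is a lower bound for a fully NVFP4 checkpoint.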
Unified memory advantage: Unlike discrete GPU setups, the GB10’s unified LPDDR5x means all experts stay resident in one memory pool with no PCIe transfer overhead. Best-of-N benefits directly from this: each of the 4 parallel runs hits already-resident experts.
Best-of-4 rationale: MoE routing creates genuine divergence across parallel runs. Different expert combinations activate per token, producing structurally different reasoning paths, not just temperature-noised variants of the same path. My expectation is that this makes Best-of-N significantly more effective on MoE than on dense models.
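The concurrent Best-of-4 pattern with a judge can be sketched as below. This is a minimal sketch, not the actual pipeline: `generate` and `judge` are hypothetical async callables that would wrap calls to the inference engine and to whichever scoring model is chosen.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical interfaces: both would wrap real engine / judge calls.
Generate = Callable[[str, int], Awaitable[str]]  # (prompt, seed) -> completion
Judge = Callable[[str, str], Awaitable[float]]   # (prompt, completion) -> score


async def best_of_n(
    prompt: str, generate: Generate, judge: Judge, n: int = 4
) -> tuple[str, float]:
    # Launch all n generations at once so the engine can batch them.
    completions = await asyncio.gather(
        *(generate(prompt, seed) for seed in range(n))
    )
    # Score every finished completion concurrently as well.
    scores = await asyncio.gather(*(judge(prompt, c) for c in completions))
    best = max(range(n), key=lambda i: scores[i])
    return completions[best], scores[best]
```

Passing a distinct seed per run is one simple way to make the four samples differ; sampling temperature would be set inside `generate`.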
Nightly self-improvement loop: The system reviews its own daily logs, generates a corrected behavioral system prompt, reindexes RAG, and runs micro LoRA fine-tuning overnight. GB10 runs at full load 24/7 — no idle cycles.
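The nightly loop is essentially an ordered pipeline. A minimal sketch of a runner follows, assuming the steps run one after another and that a failure should stop the remaining steps (both are assumptions, and the step callables are hypothetical placeholders for the actual agents):

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_nightly(steps: list[Step]) -> list[tuple[str, str, float]]:
    """Run steps in order; return (name, status, seconds) for each step run."""
    report = []
    for name, step in steps:
        start = time.monotonic()
        try:
            step()
            status = "ok"
        except Exception as exc:  # record the failure instead of crashing
            status = f"failed: {exc}"
        report.append((name, status, time.monotonic() - start))
        if status != "ok":
            break  # assumption: later steps should not run after a failure
    return report
```

The per-step durations in the report make it easy to see whether the whole loop still fits in the overnight window.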
Questions for the Community & NVIDIA Engineers
- NVFP4 maturity on SM121 for MoE architectures
How stable is vLLM (avarok kernel) + NVFP4 on SM121 specifically for MoE routing? The Mistral Small 4 thread showed SM121 vs SM100 kernel compatibility issues. Are there known perplexity degradation risks specific to MoE expert routing when dropping to FP4, or is the microscaling (16-value blocks, FP8 E4M3 scale) sufficient to protect routing quality?
- KV cache budget at Batch Size = 4
With 235B total / ~22B active params in NVFP4 (~60GB weights), I’m estimating ~60-65GB remaining for KV cache + runtime. With PagedAttention fully optimized, what’s the realistic context window per stream at BS=4 before OOM? Has anyone benchmarked this specific config? Is TurboQuant (K=FP4, V=FP3) an option here to reclaim headroom?
- Reward Model placement for async Best-of-4 scoring
To score the 4 parallel completions asynchronously, I need a Judge/Reward model. Three options I’m considering:
• Same GB10, pipelined (risks memory pressure)
• CPU/system RAM offload (latency hit?)
• Qwen3 235B self-judging via separate system prompt (no extra model, uses MoE
Speculative draft scoring: using a tiny co-resident model (Qwen3 1.7B or 4B) not for full speculative decoding (known to underperform on MoE due to expert thrash + SSM sequential constraints) but specifically for early trajectory pruning, i.e. scoring partial completions at token 50-100 to kill low-quality paths before full generation. Is this pattern documented anywhere? Any latency measurements on partial-completion scoring?
What’s the community’s experience with Reward Model overhead in single-node Best-of-N setups? Is self-judging via the same model a known pattern with acceptable quality?
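To make the early trajectory pruning idea concrete, here is a minimal sketch. All three callables are hypothetical: `start` would generate the first `probe_tokens` tokens of a run, `score_partial` would call the small co-resident judge, and `resume` would continue a surviving partial to completion. The `keep=2` default is an arbitrary illustration, not a recommendation.

```python
from typing import Callable


def prune_and_finish(
    prompt: str,
    start: Callable[[str, int, int], str],       # (prompt, seed, max_tokens) -> partial
    resume: Callable[[str, str], str],           # (prompt, partial) -> full completion
    score_partial: Callable[[str, str], float],  # (prompt, partial) -> score
    n: int = 4,
    probe_tokens: int = 64,
    keep: int = 2,
) -> list[str]:
    # 1. Generate only the opening of each trajectory.
    partials = [start(prompt, seed, probe_tokens) for seed in range(n)]
    # 2. Rank the partial trajectories with the small judge, best first.
    ranked = sorted(partials, key=lambda p: score_partial(prompt, p), reverse=True)
    # 3. Spend full generation only on the survivors.
    return [resume(prompt, p) for p in ranked[:keep]]
```

Whether resuming is cheap depends on the engine reusing the KV cache of the partial prefix; without prefix reuse, the probe tokens are computed twice for every survivor.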
- Nightly LoRA on GB10 — BF16 vs NVFP4
For micro fine-tuning runs (small curated pairs, ~1-2h nightly), is BF16 still the safe default on SM121 or are there tested NVFP4 training recipes for single Spark nodes? The Day 4 Kubesimplify benchmarks showed BF16 outperforming NVFP4 for training throughput on single Spark. Is that still the case?
- Layer 0 persistent stack overhead
Running Obsidian sync agent, n8n, Qdrant, and Whisper as persistent background services alongside vLLM. Anyone running a similar always-on stack? What’s the realistic RAM and CPU overhead to reserve for Layer 0 so vLLM memory allocation doesn’t collide?
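One way to put a number on the Layer 0 reserve is to sum resident memory per service from `/proc`. This is a minimal, Linux-only sketch; the process names passed in (for example `qdrant` or `node` for n8n) are assumptions that depend on how each service is actually launched.

```python
import re
from pathlib import Path


def parse_vmrss_kib(status_text: str) -> int:
    """Extract VmRSS (resident set size, KiB) from /proc/<pid>/status text."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0  # kernel threads have no VmRSS line


def rss_by_name(names: set[str]) -> dict[str, int]:
    """Sum resident memory (KiB) per process name for the given names."""
    totals = {name: 0 for name in names}
    for status in Path("/proc").glob("[0-9]*/status"):
        try:
            text = status.read_text()
        except OSError:
            continue  # process exited or is not readable
        m = re.search(r"^Name:\s+(.*)$", text, re.MULTILINE)
        if m and m.group(1) in totals:
            totals[m.group(1)] += parse_vmrss_kib(text)
    return totals


if __name__ == "__main__":
    # Hypothetical process names; adjust to the real services.
    for name, kib in rss_by_name({"qdrant", "node"}).items():
        print(f"{name}: {kib / 1024**2:.2f} GiB")
```

The `Name:` field is truncated to 15 characters, and summing RSS counts shared pages once per process, so the result is an upper-leaning estimate rather than an exact figure.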
Current numbers for reference
• Mistral Small 4 119B NVFP4 on GB10: ~27 tok/s confirmed (this forum)
• Nemotron 3 Super NVFP4 via TensorRT-LLM/NIM: ~38 tok/s (NVIDIA published)
• Target for Qwen3 235B MoE NVFP4 BS=4: unknown — looking for community data
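For the KV cache question above, an upper bound on context per stream can be sketched from the model shape. The shape values below (94 layers, 4 KV heads, head dimension 128) are my assumption for Qwen3 235B and should be checked against the checkpoint's `config.json`; the 60GB budget is the estimate from this post, not a measurement.

```python
def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: float
) -> float:
    # One K vector and one V vector per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context_per_stream(
    budget_gb: float, per_token_bytes: float, streams: int
) -> int:
    # Upper bound: ignores activations, fragmentation and runtime buffers.
    return int(budget_gb * 1e9 / per_token_bytes / streams)


if __name__ == "__main__":
    # Assumed shape and FP16/BF16 KV cache (2 bytes per element).
    per_token = kv_bytes_per_token(94, 4, 128, 2)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(max_context_per_stream(60, per_token, streams=4), "tokens per stream")
```

Setting `bytes_per_elem` to 1 or 0.5 models an 8-bit or 4-bit KV cache. The whole estimate stands or falls with the budget figure, which in turn depends on the real weight footprint.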
Looking forward to hardware tweaks, scheduling strategies, and any real numbers on Qwen3 235B MoE on GB10.
As my use cases are French/EU-centric (energy markets, European regulatory context, French medical terminology), I’m also interested in running Mistral Small 4 119B MoE NVFP4 alongside Qwen3 235B, not simultaneously (OOM risk with both loaded: ~66GB + ~62GB ≈ 128GB with zero headroom) but as a hot-swappable profile for latency-sensitive tasks. Has anyone profiled the swap time between two NVFP4 models on GB10? Is 7-8 min weight loading still the bottleneck or are there faster checkpoint strategies?
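To frame the swap question, a load time can be turned into an implied end-to-end throughput and compared with what the SSD sustains in a plain sequential read test. A minimal sketch, using the ~62GB and 7-8 min figures from this post purely as example inputs:

```python
def implied_load_rate_gb_s(checkpoint_gb: float, load_seconds: float) -> float:
    """End-to-end load throughput implied by an observed load time."""
    return checkpoint_gb / load_seconds


def load_seconds_at(checkpoint_gb: float, rate_gb_s: float) -> float:
    """Load time if the whole path sustained rate_gb_s."""
    return checkpoint_gb / rate_gb_s


if __name__ == "__main__":
    # Example inputs taken from the post: ~62GB checkpoint, 7.5 min load.
    rate = implied_load_rate_gb_s(62, 7.5 * 60)
    print(f"implied rate: {rate:.3f} GB/s")
```

If the implied rate is far below the measured disk read speed, the time is going somewhere other than raw I/O (deserialization, weight repacking, kernel warm-up), and that is where a faster checkpoint strategy would have to help.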
Running DGX OS (Ubuntu 24.04 ARM64) on ASUS Ascent GX10.
#NvidiaBlackwell #GB10 #LocalLLM #MixtureOfExperts #vLLM #TensorRTLLM #InferenceEngineering #Qwen3 #BestOfN