Pushing GB10 to the Limit: Qwen3 235B MoE + Concurrent Best-of-4 + Persistent Agent Layer. Architecture check & Optimization tips?

Hey everyone,

I’m architecting a high-performance local AI node around a single ASUS Ascent GX10 (NVIDIA GB10 Grace Blackwell, SM121, 128GB unified LPDDR5x). The goal is to maximize reasoning depth and throughput on a single-node setup without compromising quantization quality or moving to a multi-node cluster.

Hardware & Stack

Layer 0 — Persistent Infrastructure (always running)

•	Obsidian — knowledge graph, automated nightly restructuring via AI agent (MOC updates, link detection, tag normalization)
•	n8n — workflow orchestration, webhooks, fan-out routing, scheduled agents
•	Qdrant — vector DB for RAG pipelines
•	Whisper — local STT (medical dictation, voice input)
•	Tailscale — secure remote access
•	Open WebUI — front-end interface

Layer 1 — Primary Inference

•	Model: Qwen3 235B MoE (NVFP4 target, fallback Q4_K_M GGUF)
•	Engine: vLLM (avarok kernel build for SM121) → TensorRT-LLM path planned
•	Inference strategy: Concurrent Best-of-4 (batch size = 4) with lightweight local Judge for trajectory scoring

Layer 2 — Nightly Autonomous Loop

•	Log review agent → generates corrected system prompt for next day
•	RAG reindexing agent → Qdrant cleanup + fresh embeddings
•	Micro LoRA fine-tune → BF16, pairs from daily error detection
•	Obsidian restructuring agent → knowledge graph maintenance

Why this specific setup?

MoE memory arbitrage: A dense 235B in BF16 (~470GB) is impossible on 128GB. Qwen3 235B MoE fits the unified memory envelope in NVFP4 (~58-65GB weights), while activating only ~22B parameters per token. This keeps per-token latency low while preserving reasoning depth.
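
For anyone who wants to check the weight-footprint arithmetic, here is a back-of-envelope sketch. It assumes every one of the 235B parameters is quantized and that NVFP4 costs 4 bits per value plus one 8-bit FP8 scale per 16-value block (4.5 bits per weight). It ignores per-tensor scales and any layers kept in higher precision. Under these assumptions the NVFP4 result lands well above the ~58-65GB estimate, so that figure should be verified against the size of an actual checkpoint.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB for a given bit width."""
    return params_billion * bits_per_weight / 8

# Assumption: 4-bit values plus one 8-bit scale shared by each 16-value block.
NVFP4_BITS = 4 + 8 / 16

TOTAL_PARAMS_B = 235   # total parameters, in billions
ACTIVE_PARAMS_B = 22   # parameters activated per token, in billions

bf16_gb = weight_footprint_gb(TOTAL_PARAMS_B, 16)
nvfp4_gb = weight_footprint_gb(TOTAL_PARAMS_B, NVFP4_BITS)
active_gb = weight_footprint_gb(ACTIVE_PARAMS_B, NVFP4_BITS)

print(f"BF16, all experts:      {bf16_gb:.1f} GB")
print(f"NVFP4, all experts:     {nvfp4_gb:.1f} GB")
print(f"NVFP4, read per token:  {active_gb:.1f} GB")
```

The last line is the part that matters for latency: only the active experts are read per token, even though all experts must stay resident.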

Unified memory advantage: Unlike discrete GPU setups, the GB10’s unified LPDDR5x keeps every expert resident in a single memory pool, with no PCIe transfers between host and device memory. Best-of-N benefits directly from this: each of the 4 parallel runs hits already-resident experts.

Best-of-4 rationale: MoE routing creates genuine divergence across parallel runs: different expert combinations activate per token, producing structurally different reasoning paths, not just temperature-perturbed variants of the same path. This makes Best-of-N significantly more effective on MoE than on dense models.
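
To make the Best-of-4 flow concrete, here is a minimal sketch of the control loop. `generate` and `judge` are hypothetical stand-ins that return dummy values. In the real setup they would call the inference server and the Judge model. Only the orchestration (fan out, score, pick the best) is the point here.

```python
import asyncio
import random

async def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled completion from the inference server."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    rng = random.Random(seed)
    return f"{prompt} -> candidate {seed} ({rng.randint(0, 999)})"

async def judge(prompt: str, completion: str) -> float:
    """Hypothetical stand-in for the lightweight Judge; returns a score in [0, 1)."""
    await asyncio.sleep(0)
    return random.Random(completion).random()

async def best_of_n(prompt: str, n: int = 4) -> tuple[str, float]:
    # Launch all n generations concurrently so the server can batch them.
    completions = await asyncio.gather(*(generate(prompt, s) for s in range(n)))
    # Score every candidate concurrently as well.
    scores = await asyncio.gather(*(judge(prompt, c) for c in completions))
    # Keep the highest-scoring trajectory.
    return max(zip(completions, scores), key=lambda pair: pair[1])

if __name__ == "__main__":
    best, score = asyncio.run(best_of_n("Explain day-ahead power pricing"))
    print(f"{score:.3f}  {best}")
```

Using `asyncio.gather` keeps the four requests in flight at the same time, which is what lets a batching server process them as one batch of 4 rather than four sequential runs.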

Nightly self-improvement loop: The system reviews its own daily logs, generates a corrected behavioral system prompt, reindexes RAG, and runs micro LoRA fine-tuning overnight. GB10 runs at full load 24/7 — no idle cycles.

Questions for the Community & NVIDIA Engineers

  1. NVFP4 maturity on SM121 for MoE architectures
    How stable is vLLM (avarok kernel) + NVFP4 on SM121 specifically for MoE routing? The Mistral Small 4 thread showed SM121 vs SM100 kernel compatibility issues. Are there known perplexity degradation risks specific to MoE expert routing when dropping to FP4, or is the microscaling (16-value blocks, FP8 E4M3 scale) sufficient to protect routing quality?
  2. KV cache budget at Batch Size = 4
    With 235B total / ~22B active params in NVFP4 (~60GB weights), I’m estimating ~60-65GB remaining for KV cache + runtime. With PagedAttention fully optimized, what’s the realistic context window per stream at BS=4 before OOM? Has anyone benchmarked this specific config? Is TurboQuant (K=FP4, V=FP3) an option here to reclaim headroom?
  3. Reward Model placement for async Best-of-4 scoring
    To score the 4 parallel completions asynchronously, I need a Judge/Reward model. Four options I’m considering:
    • Same GB10, pipelined (risks memory pressure)
    • CPU/system RAM offload (latency hit?)
    • Qwen3 235B self-judging via separate system prompt (no extra model, uses MoE
    • Speculative draft scoring: using a tiny co-resident model (Qwen3 1.7B or 4B), not for full speculative decoding (known to underperform on MoE due to expert thrash + SSM sequential constraints), but specifically for early trajectory pruning: scoring partial completions at token 50-100 to kill low-quality paths before full generation. Is this pattern documented anywhere? Any latency measurements on partial-completion scoring?

What’s the community’s experience with Reward Model overhead in single-node Best-of-N setups? Is self-judging via the same model a known pattern with acceptable quality?

  4. Nightly LoRA on GB10 — BF16 vs NVFP4
    For micro fine-tuning runs (small curated pairs, ~1-2h nightly), is BF16 still the safe default on SM121 or are there tested NVFP4 training recipes for single Spark nodes? The Day 4 Kubesimplify benchmarks showed BF16 outperforming NVFP4 for training throughput on single Spark — still the case?
  5. Layer 0 persistent stack overhead
    Running Obsidian sync agent, n8n, Qdrant, and Whisper as persistent background services alongside vLLM. Anyone running a similar always-on stack? What’s the realistic RAM and CPU overhead to reserve for Layer 0 so vLLM memory allocation doesn’t collide?
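
To frame the KV cache and Layer 0 questions with numbers, here is a rough budget sketch. Everything in it is an assumption to be corrected: the attention shape (94 layers, 4 KV heads, head dimension 128) is what I believe the Qwen3 235B config uses and should be checked against the model's `config.json`. The Layer 0 and runtime reserves are placeholders, and the 60GB weights figure is my own estimate from above. The KV cache is assumed to be stored in BF16 (2 bytes per element).

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """KV cache bytes per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_budget_gb(total_gb: float, weights_gb: float,
                 layer0_reserve_gb: float, runtime_gb: float) -> float:
    """Memory left for KV cache after weights, background services and runtime."""
    return total_gb - weights_gb - layer0_reserve_gb - runtime_gb

def max_context_per_stream(budget_gb: float, streams: int,
                           per_token_bytes: float) -> int:
    """Tokens each stream can hold if the KV budget is split evenly."""
    return int(budget_gb * 1e9 / streams // per_token_bytes)

# Assumed attention shape; verify against the model's config.json.
per_token = kv_bytes_per_token(layers=94, kv_heads=4, head_dim=128, bytes_per_elem=2)

# Placeholder reserves: 8 GB for Layer 0 services, 4 GB for runtime buffers.
budget = kv_budget_gb(total_gb=128, weights_gb=60, layer0_reserve_gb=8, runtime_gb=4)

print(f"KV bytes per token: {per_token:,.0f}")
print(f"KV budget: {budget:.0f} GB")
print(f"Context per stream at BS=4: {max_context_per_stream(budget, 4, per_token):,} tokens")
```

The same budget, expressed as a fraction of total memory, is the kind of value that would go into vLLM's `gpu_memory_utilization` setting so that the engine does not claim memory the Layer 0 services need. A quantized KV cache would shrink `bytes_per_elem` and raise the per-stream context proportionally.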

Current numbers for reference

•	Mistral Small 4 119B NVFP4 on GB10: ~27 tok/s confirmed (this forum)
•	Nemotron 3 Super NVFP4 via TensorRT-LLM/NIM: ~38 tok/s (NVIDIA published)
•	Target for Qwen3 235B MoE NVFP4 BS=4: unknown — looking for community data

Looking forward to hardware tweaks, scheduling strategies, and any real numbers on Qwen3 235B MoE on GB10.

As my use cases are French/EU-centric (energy markets, European regulatory context, French medical terminology), I’m also interested in running Mistral Small 4 119B MoE NVFP4 alongside Qwen3 235B: not simultaneously (OOM risk with both loaded: ~66GB + ~62GB ≈ 128GB with zero headroom) but as a hot-swappable profile for latency-sensitive tasks.

Has anyone profiled the swap time between two NVFP4 models on GB10? Is 7-8 min weight loading still the bottleneck, or are there faster checkpoint strategies?
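
For context on the swap question, here is a small sketch of what the load time implies. It assumes the 7-8 min figure applies to a checkpoint of roughly 66GB, and the 2 GB/s comparison rate is a hypothetical target, not a measured GB10 number.

```python
def effective_read_rate_gb_s(checkpoint_gb: float, load_minutes: float) -> float:
    """Average load throughput implied by a checkpoint size and a load time."""
    return checkpoint_gb / (load_minutes * 60)

def load_minutes_at(checkpoint_gb: float, rate_gb_s: float) -> float:
    """Load time in minutes if the checkpoint streamed at a given rate."""
    return checkpoint_gb / rate_gb_s / 60

CHECKPOINT_GB = 66  # assumed size of the checkpoint behind the 7-8 min figure

for minutes in (7, 8):
    rate = effective_read_rate_gb_s(CHECKPOINT_GB, minutes)
    print(f"{minutes} min load -> {rate * 1000:.0f} MB/s effective")

# Hypothetical target rate, for comparison only.
print(f"At 2 GB/s: {load_minutes_at(CHECKPOINT_GB, 2.0):.2f} min")
```

If the implied rate is far below what the storage can sustain, the bottleneck is more likely in deserialization or weight processing than in raw disk reads, which is where a faster checkpoint format would help.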

Running DGX OS (Ubuntu 24.04 ARM64) on ASUS Ascent GX10.

#NvidiaBlackwell #GB10 #LocalLLM #MixtureOfExperts #vLLM #TensorRTLLM #InferenceEngineering #Qwen3 #BestOfN