Asus GX10 Stable: Hermes Twin Qwen3.6-35A-A3B + Qwen3.6-27B + ComfyUI (original) (raw)

llama.cpp vs. vLLM

I haven’t tried EUGR, but I used AEON-Ultimate on vLLM and kept notes. Below are the best settings I was able to get with it.

I eventually returned to llama.cpp due to its improving support for CUDA, MTP, NVFP4, and Qwen models, while still having better launch speed and memory efficiency.

sudo docker run -d \
  --name aeon-ultimate \
  --gpus all \
  --ipc=host \
  --network=host \
  --ulimit memlock=-1:-1 \
  -e TORCH_CUDA_ARCH_LIST="12.0+PTX" \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -e TORCH_MATMUL_PRECISION=high \
  -e NVIDIA_FORWARD_COMPAT=1 \
  -e NVIDIA_DISABLE_REQUIRE=1 \
  -e ENABLE_NVFP4_SM100=0 \
  -e VLLM_TEST_FORCE_FP8_MARLIN=0 \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -v /opt/models/aeon-ultimate-nvfp4:/models/aeon-ultimate:ro \
  -v /opt/models/qwen36-dflash:/models/dflash:ro \
  -v /opt/cache/vllm:/root/.cache/vllm \
  ghcr.io/aeon-7/vllm-aeon-ultimate-dflash:qwen36-v3 \
    vllm serve /models/aeon-ultimate \
    --host 0.0.0.0 \
    --port 10996 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --quantization compressed-tensors \
    --kv-cache-dtype auto \
    --max-model-len 131072 \
    --max-num-seqs 8 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.55 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --load-format safetensors \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --attention-backend flash_attn \
    --limit-mm-per-prompt '{"image": 4, "video": 2}' \
    --mm-encoder-tp-mode data \
    --mm-processor-cache-type shm \
    --speculative-config '{"method":"dflash","model":"/models/dflash","num_speculative_tokens":15}'

MTP Acceptance Rate

I did not measure the exact acceptance rate. Instead, I measured tokens per second inside the Hermes agent.

A draft value of 2–3 was the most optimal based on speed tests I ran, while also considering the other llama.cpp instances. The 35B model could draft up to 6, but that did not have much practical benefit. It was already fast enough, so drafting more was wasteful at that point.

MTP was used on the 35B model. On the 27B model, MTP was combined with ngram-mod for sequential speculation: MTP first, then n-gram.

This was mainly because the 27B model was bandwidth-intensive. The GB10’s bandwidth limitations meant I needed to work around the issue by limiting drafting while still trying to speed up token generation.

Thermal Management

Thermal management is required, not optional. I am convinced this device was made for a server room with controlled temperature.

Originally, I was inspired by others on this forum to 3D-print a custom rack for the GX10 with external fans. After having both Claude and OpenAI come up with independent designs, I concluded that the issue was related to how air flows through the device: primarily in through the bottom, then out through the back.

The issue is that, in the best case, airflow would have to be directed at a 45-degree angle toward the bottom of the device if I kept it horizontal. Ideally, the design would also need another fan moving air away from the back.

I remembered someone else on this forum mentioning that they set their device vertically. I did not believe it at first, but after seeing the airflow designs, I understood why. The device needs air to flow in through the bottom, which is always obstructed by whatever is underneath it. Even raising the device, which I tried, is not as good as placing it vertically because the air still cannot get direct flow into the device without an angle.

I used a laptop cooler, an old Thermaltake board with four fans, via USB. It was originally placed under the GX10, but I then moved it vertically against the wall. I adjusted the fans so they feed air into the bottom of the GX10, which is now on the side, while the other fans blow air away from the back. Both are parallel to the wall. It is not a strong cooler, but it is enough for this case.

KV Cache

Each llama.cpp instance runs with Parallel 1, to maximize single use speed, and so there is no division of the KV cache.

I tried different settings. Ultimately, setting something like Parallel 2 with a 256K context, which gets divided in two, was not as good as using Parallel 1 with a 128K KV cache.

Each instance is set to 128K, which allows for greater parallelism because each instance manages its own memory pool and CUDA usage.

ComfyUI Concurrency

Yes, even without ComfyUI, the three llama.cpp instances contend for GPU availability. However, this is still better than a single FIFO setup where everything has to wait for the GPU to become available.

Image creation may be a bit slower if other inference is running. I optimized the pipeline for the highest quality at low memory bandwidth.

One change I made was to unload two of the three llama.cpp instances on the weekends, specifically the two 35B instances. I run only the 27B model to create ComfyUI workflows and perform setup operations. This frees up 81 GB, allowing for video FLUX 2 + LTX 2.3 + Upscaler, while still keeping an orchestrator available in Hermes.

235B MoE vs. Dual 35B-A3B

I have not tried running Qwen3 235B MoE NVFP4 on a single GX10. Memory usage might be a challenge.

However, I do not recommend the 35B-A3B model for deep thinking. It is good for light investigations or for following a workflow created by a smarter model, such as the 27B. The 35B-A3B runs all my Hermes cron jobs, most have over 2000 lines of instructions, and it rarely misses.