DGX Spark: FieldDiag PowerStress FAIL (MODS-020000600139), thermal sensor, requesting RMA
Hi — my NVIDIA-branded DGX Spark (purchased ~Jan
2026, ~6 months old) has a confirmed hardware fault
and I’d like to start an RMA.
Symptom: under sustained GPU/LLM inference load the
entire host silently hard-freezes — no SSH, no
console, requires a physical power-cycle. Zero
kernel trace (journalctl ends mid-line, no Xid, no
NVRM error, no OOM, no panic). Reported GPU temps
are benign (80–83 °C) at low power (40–50 W) right
up to the freeze, so it doesn’t look thermal from
telemetry. Time-to-freeze is highly variable (13
min to ~3 h).
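For anyone trying to capture the same telemetry on their own unit, here is a minimal sketch of a logger that polls `nvidia-smi` and forces every sample to disk, so the last reading before a hard freeze is not lost in a write buffer. It is an illustration, not the tooling used for the numbers above. The file name, the one-second interval, and the function names are assumptions, and some query fields may come back as `[N/A]` on this platform, which the parser maps to `None`.

```python
import csv
import os
import subprocess
import time

# Standard nvidia-smi query fields: sample time, GPU temperature, power draw.
QUERY = "timestamp,temperature.gpu,power.draw"


def parse_sample(line):
    """Split one nvidia-smi CSV line into (timestamp, temp_c, power_w).

    Fields that are not numeric (for example "[N/A]") become None.
    """
    parts = [p.strip() for p in line.split(",")]

    def to_float(text):
        try:
            return float(text)
        except ValueError:
            return None

    return parts[0], to_float(parts[1]), to_float(parts[2])


def log_forever(path="gpu_telemetry.csv", interval_s=1.0):
    """Append one row per GPU per interval and fsync after every poll."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={QUERY}",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
            for line in out.splitlines():
                writer.writerow(parse_sample(line))
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before sleeping
            time.sleep(interval_s)

# To use it, call log_forever() on the host and start the inference load
# in another shell; after the power-cycle, read the tail of the CSV.
```

The `os.fsync` call after each poll is the point of the sketch: with a silent freeze, anything still held in a buffer is gone, which is likely also why the journal ends mid-line.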
FieldDiag confirms a hardware fault. I ran
partnerdiag --field (FieldDiag r9.257.3) —
GpuStress, C2CStress, and both CpuStress tests
PASS, but:
MODS-020000600139 | PowerStress | Power | “Acceptable temperature limits exceeded or the thermal sensor is broken or miscalibrated”
Final Result: FAIL
Software/config was exhaustively ruled out first
(this isn’t a vLLM/driver config issue): the freeze
reproduced across two different models, two
different containers, with FP8 GEMM kernels
swapped, CUDA graphs disabled (enforce_eager),
KV-cache dtype changed, gpu_memory_utilization
lowered, context reduced, and host swap fully
disabled — and it persisted after updating to
driver 580.159.03 plus the latest EC (0x03000302)
and SoC firmware. The only stable configuration is
one that stays at low power draw, consistent with
the PowerStress failure. (One more possible
power-subsystem signal: the USB-C PD-controller
firmware update repeatedly fails to apply via
fwupd, even with the official adapter plugged
directly into the wall.)
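The one-change-at-a-time approach described above can be sketched as a small matrix of vLLM engine arguments. The argument names (`enforce_eager`, `kv_cache_dtype`, `gpu_memory_utilization`, `max_model_len`) are real vLLM options, but every value below is a placeholder chosen for illustration, not the setting actually used, and the run names are hypothetical. The model, container, FP8 GEMM kernel, and host swap changes are left out because they are not plain engine arguments.

```python
# Baseline engine arguments. All values are illustrative placeholders.
BASELINE = {
    "enforce_eager": False,          # CUDA graphs enabled
    "kv_cache_dtype": "auto",
    "gpu_memory_utilization": 0.90,
    "max_model_len": 32768,
}

# One override per experiment, so each run differs from the baseline
# in exactly one knob. Run names are hypothetical labels.
OVERRIDES = {
    "cuda_graphs_off": {"enforce_eager": True},
    "kv_cache_fp8": {"kv_cache_dtype": "fp8"},
    "lower_mem_util": {"gpu_memory_utilization": 0.70},
    "shorter_context": {"max_model_len": 8192},
}


def build_runs(baseline, overrides):
    """Return {run_name: full kwargs} with one override applied per run."""
    return {name: {**baseline, **change} for name, change in overrides.items()}


RUNS = build_runs(BASELINE, OVERRIDES)

for name, kwargs in RUNS.items():
    print(name, kwargs)
```

Each dictionary in `RUNS` could be passed as keyword arguments when constructing the engine. If the freeze shows up in every run, as it did here, none of these knobs is the cause.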
Given the FieldDiag PowerStress FAIL
(MODS-020000600139), I believe this qualifies for
RMA. I can provide the serial number, full
FieldDiag logs + summary.json, and proof of
purchase by DM.
System: NVIDIA DGX Spark (Product NVIDIA_DGX_Spark,
A.7), DGX OS Ubuntu 24.04.4, driver 580.159.03, EC
0x03000302.
@aniculescu @NVES — could you help me get the RMA
case opened? Thank you.