Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node)
Hardware
- 2x DGX Spark (GB10 GPU each, sm_121a / compute capability 12.1)
- Connected via 200GbE ConnectX-7/Ethernet
- Driver: 580.95.05, Host CUDA: 13.0
Goal
Run lukealonso/GLM-4.6-NVFP4 (a 357B-parameter MoE model with NVFP4 quantization) across both nodes using vLLM with the Ray distributed backend.
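For reference, the two-node launch procedure looks roughly like this (a sketch, not an exact reproduction; `<head-node-ip>` and the port are placeholders):

```bash
# Node 1 (head): start the Ray head process
ray start --head --port=6379

# Node 2 (worker): join the Ray cluster over the 200GbE link
ray start --address=<head-node-ip>:6379

# On the head node: serve the model across both GPUs via the Ray backend
vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2
```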
What I've Tried
1. nvcr.io/nvidia/vllm:25.11-py3 (NGC)
   - vLLM 0.11.0
   - Error: FlashInfer kernels unavailable for ModelOptNvFp4FusedMoE on current platform (NVFP4 requires vLLM 0.12.0+)
2. vllm/vllm-openai:nightly-aarch64 (vLLM 0.11.2.dev575)
   - With VLLM_USE_FLASHINFER_MOE_FP4=1
   - Error: ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name' (Triton's bundled ptxas 12.8 doesn't support GB10)
3. vllm/vllm-openai:v0.12.0-aarch64 (vLLM 0.12.0)
   - Fixed ptxas with a symlink: ln -sf /usr/local/cuda/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
   - Triton compilation passes ✅
   - Error: RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal
4. Tried both parallelism modes (example invocations in the sketch after this list):
   - --tensor-parallel-size 2 → same CUTLASS error
   - --pipeline-parallel-size 2 → same CUTLASS error
5. --enforce-eager flag
   - Not fully tested yet
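For concreteness, the serve invocations behind items 2-5 looked roughly like the following (a sketch rather than an exact reproduction; the Ray cluster setup from the Goal section is assumed, and the env var only matters for the FlashInfer path):

```bash
# Nightly image attempt: force the FlashInfer FP4 MoE path
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2

# Same model with pipeline parallelism instead of tensor parallelism
vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2

# Variant not fully tested yet: disable CUDA graphs via eager mode
vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2 \
  --enforce-eager
```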
Environment Details
| Component | Version |
|---|---|
| Host Driver | 580.95.05 |
| Host CUDA | 13.0 |
| Container CUDA | 12.9 |
| Container ptxas | 12.9.86 (supports sm_121a ✅) |
| Triton bundled ptxas | 12.8 (no sm_121a support ❌) |
| PyTorch | 2.9.0+cu129 |
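To double-check which ptxas each path resolves to inside the container (paths are from the v0.12.0 aarch64 image above and may differ in other images):

```bash
# ptxas from the container's CUDA toolkit (12.9.x, accepts sm_121a)
/usr/local/cuda/bin/ptxas --version

# ptxas that Triton actually invokes (bundled 12.8 before the symlink fix)
readlink -f /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas --version
```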
The Blocking Error
vLLM loads the weights correctly (41/41 shards), then fails during profile_run:
INFO [flashinfer_utils.py:289] Flashinfer TRTLLM MOE backend is only supported on SM100 and later, using CUTLASS backend instead
INFO [modelopt.py:1142] Using FlashInfer CUTLASS kernels for ModelOptNvFp4FusedMoE.
...
RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal
FlashInfer detects that GB10 is not SM100 (B200-class) and falls back to the CUTLASS backend, but the CUTLASS FP4 GEMM fails as well.
Key Question
Are CUTLASS FP4 GEMM kernels compiled for GB10 (sm_121a)?
Is there:
- A vLLM build with CUTLASS kernels for sm_121?
- A way to force Marlin FP4 fallback on GB10?
- A recommended Docker image for DGX Spark + NVFP4?
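On the first question above: one way to check whether sm_121 kernels are actually baked into a given wheel might be to dump the architectures embedded in vLLM's compiled extension with cuobjdump. The _C.abi3.so filename below is my assumption and may differ between builds.

```bash
# Locate the vLLM package directory inside the container
python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))"

# List the SM architectures of the cubins embedded in the compiled extension
# (_C.abi3.so is a guess at the extension filename; adjust to what's actually there)
cuobjdump --list-elf /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so \
  | grep -o 'sm_[0-9a-z]*' | sort -u
```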
I see NVFP4 models tested on:
- B200 (sm_100) ✅
- H100/A100 with Marlin FP4 fallback ✅
But GB10 is sm_121 (the Blackwell desktop/workstation variant). The error message says sm120, which seems wrong: GB10 should be sm_121a.
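For completeness, this is how the reported compute capability can be checked at runtime; I'd expect (12, 1) on GB10, so if the kernel dispatch rounds that down to sm_120 it might explain the message (the vllm.platforms call assumes the current_platform API that recent versions expose):

```bash
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
python3 -c "from vllm.platforms import current_platform; print(current_platform.get_device_capability())"
```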
References
Thanks!