
[Usage]: Help Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node) #30163

@letsrock85

Description


Your current environment

Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node)

Hardware

  • 2x DGX Spark (GB10 GPU each, sm_121a / compute capability 12.1)
  • Connected via 200GbE ConnectX-7/Ethernet
  • Driver: 580.95.05, Host CUDA: 13.0

Goal

Run lukealonso/GLM-4.6-NVFP4 (357B MoE model, NVFP4 quantization) across both nodes using vLLM with Ray distributed backend.
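For context, the intended launch flow is roughly the standard vLLM multi-node recipe (a sketch with a placeholder head-node address; not yet a working configuration on this hardware):

    # On the head node (HEAD_IP below is a placeholder for its address)
    ray start --head --port=6379

    # On the second DGX Spark, join the cluster
    ray start --address=HEAD_IP:6379

    # Then, from the head node, serve the model across both GPUs via the Ray backend
    vllm serve lukealonso/GLM-4.6-NVFP4 \
        --tensor-parallel-size 2 \
        --distributed-executor-backend ray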

What I've Tried

1. nvcr.io/nvidia/vllm:25.11-py3 (NGC)

  • vLLM 0.11.0 (version check sketch below)
  • Error: FlashInfer kernels unavailable for ModelOptNvFp4FusedMoE on current platform
  • NVFP4 requires vLLM 0.12.0+
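A quick way to confirm the vLLM version shipped in that image (a sketch):

    docker run --rm --gpus all nvcr.io/nvidia/vllm:25.11-py3 \
        python3 -c "import vllm; print(vllm.__version__)"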

2. vllm/vllm-openai:nightly-aarch64 (vLLM 0.11.2.dev575)

  • With VLLM_USE_FLASHINFER_MOE_FP4=1
  • Error: ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name'
  • Triton's bundled ptxas 12.8 doesn't support GB10 (version comparison sketch below)
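To pin down the mismatch, the ptxas Triton bundles can be compared with the container's CUDA toolkit ptxas (a sketch; the Triton install path is the one used in the attempt-3 symlink and may differ per image):

    # ptxas bundled with Triton (path as in the attempt-3 symlink)
    /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas --version

    # ptxas from the container's CUDA toolkit
    /usr/local/cuda/bin/ptxas --version

    # If the installed Triton honors it, TRITON_PTXAS_PATH is an alternative to symlinking
    export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas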

3. vllm/vllm-openai:v0.12.0-aarch64 (vLLM 0.12.0)

  • Fixed ptxas with symlink: ln -sf /usr/local/cuda/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
  • Triton compilation passes ✅
  • Error: RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal

4. Tried both parallelism modes:

  • --tensor-parallel-size 2 → same CUTLASS error
  • --pipeline-parallel-size 2 → same CUTLASS error

5. --enforce-eager flag

  • Not fully tested yet (sketch of the intended command below)
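For completeness, the eager-mode run I still need to try would look roughly like this (a sketch combining the flags from the attempts above; untested on this setup):

    VLLM_USE_FLASHINFER_MOE_FP4=1 \
    vllm serve lukealonso/GLM-4.6-NVFP4 \
        --tensor-parallel-size 2 \
        --distributed-executor-backend ray \
        --enforce-eager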

Environment Details

Component              Version
Host Driver            580.95.05
Host CUDA              13.0
Container CUDA         12.9
Container ptxas        12.9.86 (supports sm_121a ✅)
Triton bundled ptxas   12.8 (no sm_121a ❌)
PyTorch                2.9.0+cu129
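These versions can be re-checked inside the container with commands along these lines (a sketch; paths are the defaults and may differ):

    nvidia-smi --query-gpu=driver_version,compute_cap --format=csv,noheader
    /usr/local/cuda/bin/ptxas --version
    python3 -c "import torch, triton; print(torch.__version__, triton.__version__)"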

The Blocking Error

vLLM loads the weights correctly (41/41 shards), then fails during profile_run:

INFO [flashinfer_utils.py:289] Flashinfer TRTLLM MOE backend is only supported on SM100 and later, using CUTLASS backend instead
INFO [modelopt.py:1142] Using FlashInfer CUTLASS kernels for ModelOptNvFp4FusedMoE.
...
RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal

FlashInfer detects that GB10 is not SM100 (B200) and falls back to CUTLASS, but the CUTLASS FP4 path also fails.

Key Question

Are CUTLASS FP4 GEMM kernels compiled for GB10 (sm_121a)?

Is there:

  1. A vLLM build with CUTLASS kernels for sm_121?
  2. A way to force Marlin FP4 fallback on GB10? (see the probe sketch below)
  3. Recommended Docker image for DGX Spark + NVFP4?
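Related to question 2: to narrow down whether the failure is specific to the FlashInfer CUTLASS MoE path, I could presumably rerun with that path disabled and see which backend vLLM falls back to (a sketch; whether vLLM then picks a working fallback on GB10 is exactly what I'm unsure about):

    # Disable the FlashInfer NVFP4 MoE path and observe the fallback
    VLLM_USE_FLASHINFER_MOE_FP4=0 \
    vllm serve lukealonso/GLM-4.6-NVFP4 \
        --tensor-parallel-size 2 \
        --distributed-executor-backend ray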

I see NVFP4 models tested on:

  • B200 (sm_100) ✅
  • H100/A100 with Marlin FP4 fallback ✅

But GB10 is sm_121 (the Blackwell desktop/workstation variant). The error message says sm120, which seems wrong: GB10 should report sm_121a.
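One quick check to back this up is to ask PyTorch inside the container what capability it reports for the GB10 (a sketch; the reading of the sm120 string in the error is my assumption, not verified against the kernel source):

    python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
    # Expected on GB10: compute capability (12, 1), i.e. sm_121, not the sm120 named in the error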


Thanks!
