Help: Running NVFP4 model on 2x DGX Spark with vLLM + Ray (multi-node)
Hardware
- 2x DGX Spark (GB10 GPU each, sm_121a / compute capability 12.1)
- Connected via 200GbE ConnectX-7/Ethernet
- Driver: 580.95.05, Host CUDA: 13.0
Goal
Run lukealonso/GLM-4.6-NVFP4 (a 357B-parameter MoE model with NVFP4 quantization) across both nodes using vLLM with the Ray distributed backend.
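For reference, the two-node launch procedure looks roughly like this (a sketch, not an exact reproduction; `<head-node-ip>` and the port are placeholders):

```bash
# Node 1 (head): start the Ray head process
ray start --head --port=6379

# Node 2 (worker): join the Ray cluster over the 200GbE link
ray start --address=<head-node-ip>:6379

# On the head node: serve the model across both GPUs via the Ray backend
vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2
```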
What I've Tried
1. nvcr.io/nvidia/vllm:25.11-py3 (NGC)
   - vLLM 0.11.0
   - Error: FlashInfer kernels unavailable for ModelOptNvFp4FusedMoE on current platform (NVFP4 requires vLLM 0.12.0+)
2. vllm/vllm-openai:nightly-aarch64 (vLLM 0.11.2.dev575)
   - With VLLM_USE_FLASHINFER_MOE_FP4=1
   - Error: ptxas fatal: Value 'sm_121a' is not defined for option 'gpu-name' (Triton's bundled ptxas 12.8 doesn't support GB10)
3. vllm/vllm-openai:v0.12.0-aarch64 (vLLM 0.12.0)
   - Fixed ptxas with a symlink: ln -sf /usr/local/cuda/bin/ptxas /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
   - Triton compilation passes ✅
   - Error: RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal
4. Tried both parallelism modes (example invocations in the sketch after this list):
   - --tensor-parallel-size 2 → same CUTLASS error
   - --pipeline-parallel-size 2 → same CUTLASS error
5. --enforce-eager flag
   - Not fully tested yet
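For concreteness, the serve invocations behind items 2-5 looked roughly like the following (a sketch rather than an exact reproduction; the Ray cluster setup from the Goal section is assumed, and the env var only matters for the FlashInfer path):

```bash
# Nightly image attempt: force the FlashInfer FP4 MoE path
VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2

# Same model with pipeline parallelism instead of tensor parallelism
vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --pipeline-parallel-size 2

# Variant not fully tested yet: disable CUDA graphs via eager mode
vllm serve lukealonso/GLM-4.6-NVFP4 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 2 \
  --enforce-eager
```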
Environment Details
| Component | Version |
|---|---|
| Host Driver | 580.95.05 |
| Host CUDA | 13.0 |
| Container CUDA | 12.9 |
| Container ptxas | 12.9.86 (supports sm_121a ✅) |
| Triton bundled ptxas | 12.8 (no sm_121a support ❌) |
| PyTorch | 2.9.0+cu129 |
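To double-check which ptxas each path resolves to inside the container (paths are from the v0.12.0 aarch64 image above and may differ in other images):

```bash
# ptxas from the container's CUDA toolkit (12.9.x, accepts sm_121a)
/usr/local/cuda/bin/ptxas --version

# ptxas that Triton actually invokes (bundled 12.8 before the symlink fix)
readlink -f /usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas
/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/bin/ptxas --version
```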
The Blocking Error
vLLM loads the weights correctly (41/41 shards), then fails during profile_run:
INFO [flashinfer_utils.py:289] Flashinfer TRTLLM MOE backend is only supported on SM100 and later, using CUTLASS backend instead
INFO [modelopt.py:1142] Using FlashInfer CUTLASS kernels for ModelOptNvFp4FusedMoE.
...
RuntimeError: [FP4 gemm Runner] Failed to run cutlass FP4 gemm on sm120. Error: Error Internal
FlashInfer detects that GB10 is not SM100 (B200-class) and falls back to the CUTLASS backend, but the CUTLASS FP4 GEMM fails as well.
Key Question
Are CUTLASS FP4 GEMM kernels compiled for GB10 (sm_121a)?
Is there:
- A vLLM build with CUTLASS kernels for sm_121?
- A way to force Marlin FP4 fallback on GB10?
- A recommended Docker image for DGX Spark + NVFP4?
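On the first question above: one way to check whether sm_121 kernels are actually baked into a given wheel might be to dump the architectures embedded in vLLM's compiled extension with cuobjdump. The _C.abi3.so filename below is my assumption and may differ between builds.

```bash
# Locate the vLLM package directory inside the container
python3 -c "import vllm, os; print(os.path.dirname(vllm.__file__))"

# List the SM architectures of the cubins embedded in the compiled extension
# (_C.abi3.so is a guess at the extension filename; adjust to what's actually there)
cuobjdump --list-elf /usr/local/lib/python3.12/dist-packages/vllm/_C.abi3.so \
  | grep -o 'sm_[0-9a-z]*' | sort -u
```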
I see NVFP4 models tested on:
- B200 (sm_100) ✅
- H100/A100 with Marlin FP4 fallback ✅
But GB10 is sm_121 (the Blackwell desktop/workstation variant). The error message says sm120, which seems wrong: GB10 should be sm_121a.
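For completeness, this is how the reported compute capability can be checked at runtime; I'd expect (12, 1) on GB10, so if the kernel dispatch rounds that down to sm_120 it might explain the message (the vllm.platforms call assumes the current_platform API that recent versions expose):

```bash
python3 -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
python3 -c "from vllm.platforms import current_platform; print(current_platform.get_device_capability())"
```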
References
Thanks!