[MoE] Alpha MoE integration #30078

vyom-hai · 2025-12-04T18:49:46Z

Purpose

Add support for Alpha-MoE, a high-performance fused Mixture-of-Experts CUDA megakernel optimized for tensor-parallel serving of FP8-quantized MoE models.

This integration provides an alternative MoE kernel backend that fuses the up projection, SiLU activation, and down projection into a single CUDA kernel, reducing memory bandwidth overhead and improving inference latency for models like DeepSeek-V3.
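For reference, the three stages that the fused kernel combines can be sketched in plain Python. This is a minimal, unfused illustration following the description above, not the Alpha-MoE implementation: the helper names (`matvec`, `expert_mlp_unfused`) are hypothetical, routing and FP8 quantization are omitted, and a gated variant would multiply the activation by a second projection. Each stage materializes an intermediate list, which stands in for the memory traffic a fused kernel avoids.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def expert_mlp_unfused(x, w_up, w_down):
    """Run one expert as three separate stages (hypothetical helper)."""
    up = matvec(w_up, x)                 # stage 1: up projection
    activated = [silu(u) for u in up]    # stage 2: SiLU activation
    return matvec(w_down, activated)     # stage 3: down projection


# Usage: a 2-dimensional token, identity up projection, summing down projection.
out = expert_mlp_unfused([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```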

Changes:

New file: vllm/model_executor/layers/fused_moe/alpha_moe.py - Alpha-MoE kernel wrapper with availability check, weight interleaving, and kernel dispatch
Modified: vllm/model_executor/layers/fused_moe/fused_moe.py - Added Alpha-MoE to the kernel dispatch chain (before DeepGemm fallback)
Modified: vllm/model_executor/layers/fused_moe/layer.py - Added _maybe_interleave_for_alpha_moe() for automatic weight interleaving during model loading
Modified: vllm/envs.py - Added VLLM_USE_ALPHA_MOE and VLLM_ALPHA_MOE_CONFIG environment variables
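As an illustration of how the two environment variables could be consumed, here is a minimal sketch. The function names and the parsing rules (an integer flag, an optional path) are assumptions for illustration; the actual definitions live in vllm/envs.py and may differ.

```python
import os
from typing import Mapping, Optional


def use_alpha_moe(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True when VLLM_USE_ALPHA_MOE is set to a non-zero integer.

    Hypothetical helper; assumes the flag is parsed as an integer.
    """
    return bool(int(environ.get("VLLM_USE_ALPHA_MOE", "0")))


def alpha_moe_config_path(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the JIT config path, or None when the variable is unset or empty."""
    return environ.get("VLLM_ALPHA_MOE_CONFIG") or None


# Usage with an explicit mapping instead of the real process environment.
env = {"VLLM_USE_ALPHA_MOE": "1", "VLLM_ALPHA_MOE_CONFIG": "/configs/moe_jit_config.json"}
enabled = use_alpha_moe(env)
config_path = alpha_moe_config_path(env)
```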

Test Plan

Clone Alpha-MoE repository

git clone https://github.com/Aleph-Alpha/Alpha-MoE.git
cd Alpha-MoE

Install with CUDA extension compilation (requires CUDA toolkit)

pip install -e . --no-build-isolation

Set library path for PyTorch dependencies

export LD_LIBRARY_PATH="$(python -c 'import torch; print(torch.__path__[0])')/lib:$LD_LIBRARY_PATH"

Generate JIT kernel configuration (optional but recommended)

cd Alpha-MoE
python jit_moe.py --E 256 --N 2048 --K 7168 --top-k 8 --out-file configs/moe_jit_config.json

Run latency benchmark with Alpha-MoE

VLLM_USE_ALPHA_MOE=1 \
VLLM_ALPHA_MOE_CONFIG=/configs/moe_jit_config.json \
vllm bench latency \
  --model deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enforce-eager \
  --num-iters 8 \
  --batch-size 16 \
  --max-model-len 16384

Test Result

Model: DeepSeek-V3 | Hardware: 8x NVIDIA B200 GPUs | Quantization: FP8 | Batch Size: 16

Metric	Alpha-MoE	DeepGemm (Baseline)	Improvement
Avg Latency	10.916s	12.999s	16.0% faster
P10 Latency	10.839s	12.846s	15.6% faster
P25 Latency	10.852s	12.863s	15.6% faster
P50 Latency	10.875s	12.901s	15.7% faster
P75 Latency	10.885s	12.922s	15.8% faster
P90 Latency	11.006s	13.189s	16.5% faster
P99 Latency	11.251s	13.715s	18.0% faster
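The Improvement column is consistent with a relative latency reduction against the DeepGemm baseline. A small sketch of that arithmetic, assuming the formula (baseline - candidate) / baseline and using the rounded values from the table (the function name is hypothetical):

```python
def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Relative latency reduction of the candidate versus the baseline, in percent."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Average latency row: DeepGemm 12.999 s versus Alpha-MoE 10.916 s.
avg_improvement = round(percent_faster(12.999, 10.916), 1)
```

Because the table reports latencies rounded to milliseconds, recomputing a row from those rounded values can differ from the reported percentage in the last digit.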

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist

Code Review

This pull request introduces support for Alpha-MoE, a high-performance fused MoE kernel, which is a valuable addition for improving inference latency on FP8 quantized models. The implementation is well-structured, with clear separation of concerns in the new alpha_moe.py module and clean integration into the existing MoE layers. The use of environment variables for configuration is consistent with the project's patterns. My main concern, detailed in the review comment, is related to error handling for the Alpha-MoE configuration file, which could lead to a server crash if not handled gracefully.
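One way to address this kind of concern is to treat a missing or malformed configuration file as a soft failure and fall back to defaults. The sketch below is only an illustration of that idea: `load_alpha_moe_config` is a hypothetical name, and the assumption that the config is a JSON object is not confirmed by the PR.

```python
import json
import logging
import os
from typing import Optional

logger = logging.getLogger(__name__)


def load_alpha_moe_config(path: Optional[str]) -> Optional[dict]:
    """Load a JSON kernel config, returning None instead of raising on failure.

    Hypothetical helper; the caller is expected to fall back to default
    kernel settings when None is returned.
    """
    if not path:
        return None
    try:
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        logger.warning("Ignoring Alpha-MoE config %s: %s", path, exc)
        return None
    if not isinstance(config, dict):
        # Assumption: a valid config is a JSON object at the top level.
        logger.warning("Ignoring Alpha-MoE config %s: expected a JSON object", path)
        return None
    return config


# Usage: a missing file yields None rather than an exception.
missing = load_alpha_moe_config(os.path.join("does-not-exist", "moe_jit_config.json"))
```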

vllm/model_executor/layers/fused_moe/alpha_moe.py

github-actions · 2025-12-04T19:21:00Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify · 2025-12-05T13:06:57Z

Hi @vyom-hai, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

LucasWilkinson · 2025-12-05T20:14:14Z

cc @mgoin

mgoin · 2025-12-05T20:38:54Z

Thanks for working on this integration! I was planning to look into it myself, so appreciate it. Thanks for the benchmarks too, I was wondering if you could compare against the triton moe as well for e2e results. You should be able to do VLLM_MOE_USE_DEEP_GEMM=0
CC @SzymonOzog btw
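A three-way comparison along these lines could be scripted by rendering one benchmark command per backend. This is a sketch under assumptions: the benchmark arguments are copied from the test plan, the environment variables are the ones named in this thread, and DeepGemm is assumed to be selected when no override is set. The names `BACKEND_ENV` and `render_command` are hypothetical.

```python
import shlex

# Benchmark arguments taken from the test plan in the PR description.
BENCH_ARGS = [
    "vllm", "bench", "latency",
    "--model", "deepseek-ai/DeepSeek-V3",
    "--tensor-parallel-size", "8",
    "--quantization", "fp8",
    "--enforce-eager",
    "--num-iters", "8",
    "--batch-size", "16",
    "--max-model-len", "16384",
]

# Environment overrides per backend. Assumption: DeepGemm is used by default.
BACKEND_ENV = {
    "alpha_moe": {"VLLM_USE_ALPHA_MOE": "1"},
    "deep_gemm": {},
    "triton": {"VLLM_MOE_USE_DEEP_GEMM": "0"},
}


def render_command(backend: str) -> str:
    """Render a shell command line with the backend's environment overrides."""
    env = BACKEND_ENV[backend]
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(BENCH_ARGS)}".strip()


# Usage: print one command per backend to run and compare manually.
for name in BACKEND_ENV:
    print(render_command(name))
```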

SzymonOzog · 2025-12-06T00:30:42Z

Amazing work! I'll look at the code later when I have a bit more time. Curious how are you running it on B200? The code uses WGMMA and AFAIK it should not compile on Blackwell but I never tried it


vllm/model_executor/layers/fused_moe/alpha_moe.py

Signed-off-by: vyom-hai <vyom@hippocraticai.com>

vyom1611 and others added 2 commits December 4, 2025 18:24

feat(moe): Add Alpha MoE fused FP8 megakernel support

a08c614

Merge branch 'vllm-project:main' into feature/alpha-moe-integration

ee74699

vyom-hai requested review from mgoin and pavanimajety as code owners December 4, 2025 18:49

gemini-code-assist bot reviewed Dec 4, 2025

vllm/model_executor/layers/fused_moe/alpha_moe.py (outdated, resolved)

LucasWilkinson assigned mgoin Dec 5, 2025

LucasWilkinson changed the title from "Feature/alpha moe integration" to "[MoE] Alpha MoE integration" Dec 5, 2025

SzymonOzog suggested changes Dec 6, 2025

vllm/model_executor/layers/fused_moe/alpha_moe.py (outdated, resolved)

vllm/model_executor/layers/fused_moe/alpha_moe.py (outdated, resolved)

vyom-hai added 2 commits December 6, 2025 22:14

fix(alpha_moe): Add error handling and fix invalid default config

19f4215

Signed-off-by: vyom-hai <vyom@hippocraticai.com>

Merge branch 'main' into feature/alpha-moe-integration

55c1623

5 participants
