[MoE] Alpha MoE integration #30078
base: main
Conversation
Code Review
This pull request introduces support for Alpha-MoE, a high-performance fused MoE kernel, which is a valuable addition for improving inference latency on FP8 quantized models. The implementation is well-structured, with clear separation of concerns in the new alpha_moe.py module and clean integration into the existing MoE layers. The use of environment variables for configuration is consistent with the project's patterns. My main concern, detailed in the review comment, is related to error handling for the Alpha-MoE configuration file, which could lead to a server crash if not handled gracefully.
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Hi @vyom-hai, the pre-commit checks have failed. Please run:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
cc @mgoin
Thanks for working on this integration! I was planning to look into it myself, so I appreciate it. Thanks for the benchmarks too; I was wondering if you could compare against the Triton MoE as well for e2e results. You should be able to do `VLLM_MOE_USE_DEEP_GEMM=0`.

CC @SzymonOzog btw
Amazing work! I'll look at the code later when I have a bit more time.

Curious how you are running it on B200? The code uses WGMMA and AFAIK it should not compile on Blackwell, but I never tried it.
Signed-off-by: vyom-hai <vyom@hippocraticai.com>
Purpose
Add support for Alpha-MoE, a high-performance fused Mixture-of-Experts (MoE) CUDA megakernel optimized for tensor-parallel serving of FP8-quantized MoE models.
This integration provides an alternative MoE kernel backend that fuses the up projection, SiLU activation, and down projection into a single CUDA kernel, reducing memory bandwidth overhead and improving inference latency for models like DeepSeek-V3.
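For reference, this is roughly the unfused per-expert computation that the megakernel collapses into one launch (an illustrative PyTorch sketch using the simplest non-gated formulation; the function name and shapes are assumptions, not the PR's code):

```python
# Illustrative reference only: the three ops that Alpha-MoE fuses into a single
# CUDA kernel. Run separately, each step writes activations back to global memory.
import torch
import torch.nn.functional as F

def expert_ffn_unfused(x: torch.Tensor,      # [tokens_for_expert, hidden]
                       w_up: torch.Tensor,   # [hidden, intermediate]
                       w_down: torch.Tensor  # [intermediate, hidden]
                       ) -> torch.Tensor:
    h = x @ w_up          # up projection
    h = F.silu(h)         # SiLU activation
    return h @ w_down     # down projection
```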
Changes:
- `vllm/model_executor/layers/fused_moe/alpha_moe.py` - Alpha-MoE kernel wrapper with availability check, weight interleaving, and kernel dispatch
- `vllm/model_executor/layers/fused_moe/fused_moe.py` - Added Alpha-MoE to the kernel dispatch chain (before the DeepGemm fallback)
- `vllm/model_executor/layers/fused_moe/layer.py` - Added `_maybe_interleave_for_alpha_moe()` for automatic weight interleaving during model loading
- `vllm/envs.py` - Added `VLLM_USE_ALPHA_MOE` and `VLLM_ALPHA_MOE_CONFIG` environment variables (a hedged sketch of how these might gate the dispatch follows this list)
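A minimal sketch of how the two new environment variables might gate the kernel choice; `use_alpha_moe`, `alpha_moe_importable`, and `alpha_moe_config_path` are hypothetical names, and the real gating lives in `envs.py` and the dispatch chain in `fused_moe.py`:

```python
# Hypothetical sketch of the env-gated backend selection described above;
# function and argument names are illustrative, not the PR's API.
import os
from typing import Optional

def use_alpha_moe(is_fp8: bool, alpha_moe_importable: bool) -> bool:
    """Gate the Alpha-MoE path; when False, the existing dispatch chain
    (DeepGemm fallback, Triton fused MoE) runs unchanged."""
    return (
        os.environ.get("VLLM_USE_ALPHA_MOE", "0") == "1"
        and is_fp8
        and alpha_moe_importable
    )

def alpha_moe_config_path() -> Optional[str]:
    """Optional path to a JIT-tuned kernel config (see the Test Plan below)."""
    return os.environ.get("VLLM_ALPHA_MOE_CONFIG") or None
```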
Test Plan

```bash
# Clone the Alpha-MoE repository
git clone https://github.com/Aleph-Alpha/Alpha-MoE.git
cd Alpha-MoE

# Install with CUDA extension compilation (requires the CUDA toolkit)
pip install -e . --no-build-isolation

# Set the library path for PyTorch dependencies
export LD_LIBRARY_PATH="$(python -c 'import torch; print(torch.__path__[0])')/lib:$LD_LIBRARY_PATH"

# Generate a JIT kernel configuration (optional but recommended)
python jit_moe.py --E 256 --N 2048 --K 7168 --top-k 8 --out-file configs/moe_jit_config.json
```
```bash
# Run the latency benchmark with Alpha-MoE
VLLM_USE_ALPHA_MOE=1 \
VLLM_ALPHA_MOE_CONFIG=/configs/moe_jit_config.json \
vllm bench latency \
  --model deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enforce-eager \
  --num-iters 8 \
  --batch-size 16 \
  --max-model-len 16384
```
Test Result
Model: DeepSeek-V3 | Hardware: 8x NVIDIA B200 GPUs | Quantization: FP8 | Batch Size: 16
Essential Elements of an Effective PR Description Checklist
(Optional) Documentation update, such as `supported_models.md` and `examples` for a new model.