Adapt dispatch_ffn_combine for decoding #4762
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request refactors the fused MoE implementation, primarily to decouple operations and dynamically calculate maxOutputSize. While this is a good direction, the changes introduce several critical issues. There's a C++ syntax error that will prevent compilation, a mismatch between a Python operator call and its C++ definition, and the removal of safety checks that could lead to correctness issues. I've detailed these in the review comments.
Signed-off-by: mojave2 <chenchen145@huawei.com>
What this PR does / why we need it?
- Adapts the dispatch_ffn_combine host and kernel implementations to support the decoding phase, including updated tiling, epilogue, and quantized MoE routing handling.
- Adds a shared-memory mechanism (HcclShmem) for cross-rank synchronization and safe shared-memory access in multi-rank decode.
- Integrates the dispatch_ffn_combine decode path into the Ascend runtime (forward context, fused MoE communication, W8A8 quantization, MTP proposer, and model_runner_v1); see the sketch after this list.
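As a rough illustration of the integration point in the last bullet, the sketch below shows how a runtime could branch between a prefill path and a fused decode path based on the forward context. All names (ForwardPhase, ForwardContext, fused_dispatch_ffn_combine, moe_prefill_path) are placeholders invented for this sketch, not identifiers from the PR:

```python
# Hypothetical sketch: routing fused-MoE execution to a dedicated decode path
# based on the forward context. Placeholder names only, not the PR's code.
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class ForwardPhase(Enum):
    PREFILL = auto()
    DECODE = auto()


@dataclass
class ForwardContext:
    phase: ForwardPhase
    num_tokens: int


def fused_dispatch_ffn_combine(hidden_states: List[float]) -> List[float]:
    # Stand-in for a fused dispatch + expert FFN + combine decode kernel.
    return hidden_states


def moe_prefill_path(hidden_states: List[float]) -> List[float]:
    # Stand-in for the existing prefill-time MoE communication path.
    return hidden_states


def run_fused_moe(ctx: ForwardContext, hidden_states: List[float]) -> List[float]:
    # Decode batches are small and latency-bound, so they take the fused path;
    # prefill keeps the existing path.
    if ctx.phase is ForwardPhase.DECODE:
        return fused_dispatch_ffn_combine(hidden_states)
    return moe_prefill_path(hidden_states)


if __name__ == "__main__":
    ctx = ForwardContext(phase=ForwardPhase.DECODE, num_tokens=4)
    print(run_fused_moe(ctx, [0.1, 0.2, 0.3]))
```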
Does this PR introduce any user-facing change?
How was this patch tested?