Conversation

@linfeng-yuan (Collaborator) commented Dec 5, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces performance optimizations for MoE models in a distributed setting, primarily for kv_consumer nodes. It achieves this by skipping a costly all_reduce so that different ranks can process varying numbers of tokens, which should improve throughput; a global batch size is pre-calculated for the MoE operators to support this. While the changes are promising for performance, I've identified a couple of areas that need attention to ensure correctness and robustness: a behavioral change in how random experts are selected for load balancing that might be a bug, and a critical assumption about the MoE communication method that could lead to runtime failures if not enforced.
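
For readers skimming the thread, a minimal standalone sketch of the idea described above (dp_size and local_num_tokens are illustrative names, not the PR's actual API): on a kv_consumer decode rank, the padded per-rank token-count tensor can be built from the local count alone instead of agreeing on a maximum via an all_reduce over the data-parallel group.

import torch

# Illustrative sketch only (assumed names, not the PR's code).
dp_size = 4             # number of data-parallel ranks
local_num_tokens = 7    # this rank's decode batch size

# Old behaviour (sketch): ranks agree on a common padded count via an
# all_reduce over the DP group before building this tensor.
# New behaviour (sketch): skip the collective and replicate the local count.
num_tokens_after_padding = torch.tensor([local_num_tokens] * dp_size,
                                        device="cpu",
                                        dtype=torch.int32)
print(num_tokens_after_padding)  # tensor([7, 7, 7, 7], dtype=torch.int32)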

Comment on lines +917 to +984
if self.is_kv_consumer and not self.in_profile_run:
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill

critical

This optimization to skip all_reduce for num_tokens is a great performance improvement. However, it relies on the assumption that the MoE communication method will be MC2, as noted in the comment. If a different communication method like AllGather is used (e.g., if num_tokens exceeds mc2_tokens_capacity), it will lead to a runtime failure because AllGather requires tensors of the same size across ranks.

To make this more robust, I suggest adding an assertion to ensure that MC2 is indeed the selected communication method when this optimization is active. This will prevent silent failures in unexpected scenarios.

Suggested change

-   if self.is_kv_consumer and not self.in_profile_run:
-       num_tokens_after_padding = torch.tensor([num_tokens] *
-                                               self.dp_size,
-                                               device="cpu",
-                                               dtype=torch.int32)
-       return num_tokens, num_tokens_after_padding, with_prefill
+   if self.is_kv_consumer and not self.in_profile_run:
+       assert self._select_moe_comm_method(num_tokens) == MoECommType.MC2, \
+           "Skipping all_reduce for num_tokens is only supported with MC2 MoE communication."
+       num_tokens_after_padding = torch.tensor([num_tokens] *
+                                               self.dp_size,
+                                               device="cpu",
+                                               dtype=torch.int32)
+       return num_tokens, num_tokens_after_padding, with_prefill

Comment on lines 374 to 375
random_matrix = torch.rand(topk_ids.size(0), global_num_experts, device=topk_ids.device)
topk_ids = torch.argsort(random_matrix, dim=1)[:, :topk_ids.size(1)].to(topk_ids.dtype)

high

The logic for enable_force_load_balance has been changed to potentially include redundant experts. The previous implementation explicitly excluded them by using global_num_experts - global_redundant_expert_num as the upper bound for random integers. The new implementation uses global_num_experts for the torch.rand call, which means redundant experts can be selected.

If redundant experts should not receive tokens, this is a bug; in that case, please consider the following suggestion to correct the range of experts.

random_matrix = torch.rand(topk_ids.size(0), global_num_experts - global_redundant_expert_num, device=topk_ids.device)
topk_ids = torch.argsort(random_matrix, dim=1)[:, :topk_ids.size(1)].to(topk_ids.dtype)
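
As a side note, a standalone toy sketch (assumed names, not the PR's code) of how the two sampling strategies differ: argsort over a random matrix yields top-k distinct experts per token, while independently drawn random integers (as in the previous implementation) can repeat an expert within a token, which may be why the new formulation was chosen.

import torch

# Toy comparison of the two sampling strategies for forced load balancing.
num_tokens, num_experts, top_k = 4, 8, 2

with_repeats = torch.randint(0, num_experts, (num_tokens, top_k))
random_matrix = torch.rand(num_tokens, num_experts)
distinct = torch.argsort(random_matrix, dim=1)[:, :top_k]

print(with_repeats)  # rows may repeat an expert id
print(distinct)      # each row holds top_k distinct expert ids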

Comment on lines 219 to 220
random_matrix = torch.rand(topk_ids.size(0), global_num_experts, device=topk_ids.device)
topk_ids = torch.argsort(random_matrix, dim=1)[:, :topk_ids.size(1)].to(topk_ids.dtype)

high

The logic for enable_force_load_balance has been changed to potentially include redundant experts. The previous implementation explicitly excluded them by using global_num_experts - global_redundant_expert_num as the upper bound for random integers. The new implementation uses global_num_experts for the torch.rand call, which means redundant experts can be selected.

If redundant experts should not receive tokens, this is a bug; in that case, please consider the following suggestion to correct the range of experts.

random_matrix = torch.rand(topk_ids.size(0), global_num_experts - global_redundant_expert_num, device=topk_ids.device)
topk_ids = torch.argsort(random_matrix, dim=1)[:, :topk_ids.size(1)].to(topk_ids.dtype)

@github-actions bot commented Dec 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions bot commented Dec 6, 2025

This pull request has conflicts; please resolve them before we can evaluate it.

…mong all ranks

Signed-off-by: linfeng-yuan <1102311262@qq.com>