@ziliangpeng (Contributor)

Summary

This PR adds a new metric vllm:request_prefill_kv_computed_tokens that tracks the number of KV tokens computed during prefill phase, excluding cached tokens.

Motivation

Currently, vLLM tracks total prompt tokens (vllm:request_prompt_tokens) but doesn't have per-request visibility into how many KV tokens were actually computed vs served from cache (local prefix cache or remote KV cache like LMCache). This metric helps:

  • Understand cache effectiveness on a per-request basis
  • Better estimate actual compute costs vs total prompt size
  • Debug and optimize caching strategies
  • Monitor workload characteristics more accurately

Changes

  • Added num_cached_tokens field to FinishedRequestStats dataclass
  • Updated update_from_finished_request() to accept num_cached_tokens parameter
  • Added new histogram metric vllm:request_prefill_kv_computed_tokens in metrics loggers
  • Metric calculation: `num_prompt_tokens - max(num_cached_tokens, 0)`
  • Added comprehensive unit tests
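Using the names from the list above, the new field and the metric calculation can be sketched roughly as follows (a simplified illustration: the real FinishedRequestStats carries many more fields, and the standalone helper function here is hypothetical — in the PR the calculation happens inside the metrics logger):

```python
from dataclasses import dataclass


@dataclass
class FinishedRequestStats:
    """Per-request stats emitted when a request finishes (simplified sketch)."""

    num_prompt_tokens: int = 0
    # New field from this PR: tokens served from the local prefix cache
    # or a remote KV store instead of being recomputed.
    num_cached_tokens: int = 0


def prefill_kv_computed_tokens(stats: FinishedRequestStats) -> int:
    """KV tokens actually computed during prefill, excluding cache hits.

    max(..., 0) guards against negative num_cached_tokens values
    (e.g. an unset sentinel) inflating the metric.
    """
    return stats.num_prompt_tokens - max(stats.num_cached_tokens, 0)
```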

Testing

  • Added unit tests in tests/v1/metrics/test_stats.py:
    • Test with prefix cache hits
    • Test without cache
    • Test edge cases (negative values, all tokens cached)
  • Verified in production workloads showing expected cache effectiveness
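The edge cases listed above can be expressed as a self-contained sketch (the real tests in tests/v1/metrics/test_stats.py exercise vLLM's actual classes; the computed_tokens helper here is hypothetical and only mirrors the metric calculation):

```python
def computed_tokens(num_prompt_tokens: int, num_cached_tokens: int) -> int:
    # Mirrors the metric: prompt tokens minus cache hits,
    # clamping negative cache counts to zero.
    return num_prompt_tokens - max(num_cached_tokens, 0)


def test_with_prefix_cache_hit():
    assert computed_tokens(10_000, 1_200) == 8_800


def test_without_cache():
    assert computed_tokens(512, 0) == 512


def test_negative_cached_tokens_clamped():
    # A negative value (e.g. an "unset" sentinel) must not inflate the metric.
    assert computed_tokens(512, -1) == 512


def test_all_tokens_cached():
    assert computed_tokens(512, 512) == 0
```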

The metric correctly includes cache hits from both local prefix cache and remote KV stores (KV connector, LMCache).
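For illustration, recording such a histogram with prometheus_client could look roughly like this (a hedged sketch, not vLLM's actual logger code: the bucket boundaries and the observe_finished_request helper are assumptions; only the metric name comes from the PR):

```python
from prometheus_client import Histogram

# Metric name from the PR; buckets here are illustrative placeholders --
# vLLM builds its own bucket lists for token-count histograms.
PREFILL_KV_COMPUTED_TOKENS = Histogram(
    name="vllm:request_prefill_kv_computed_tokens",
    documentation=(
        "Number of KV tokens computed during prefill, excluding tokens "
        "served from the prefix cache or remote KV stores."
    ),
    buckets=[1, 64, 256, 1024, 4096, 16384, 65536],
)


def observe_finished_request(num_prompt_tokens: int, num_cached_tokens: int) -> None:
    # Observe the computed-token count for one finished request.
    PREFILL_KV_COMPUTED_TOKENS.observe(num_prompt_tokens - max(num_cached_tokens, 0))
```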

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@ziliangpeng ziliangpeng requested a review from markmc as a code owner December 6, 2025 19:36
@mergify mergify bot added the v1 label Dec 6, 2025

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a new metric, vllm:request_prefill_kv_computed_tokens, to track the number of KV tokens computed during the prefill phase, excluding any tokens served from the cache. The changes are well-implemented, adding the num_cached_tokens field to FinishedRequestStats and plumbing it through from the output processor. A new histogram is added to the Prometheus logger to record this metric, correctly calculating it as the difference between prompt tokens and cached tokens. The inclusion of comprehensive unit tests covering various scenarios, including edge cases, ensures the reliability of this new feature. The code is clear, follows existing patterns, and improves the observability of cache effectiveness. Overall, this is a solid contribution.

@ziliangpeng ziliangpeng force-pushed the feat-prefill-kv-metric branch from 17b00c9 to 9a5fc4d Compare December 6, 2025 19:40
@mergify bot commented Dec 6, 2025

Hi @ziliangpeng, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Add new Prometheus metric `vllm:request_prefill_kv_computed_tokens` to
track the number of new KV cache tokens computed during the prefill
phase, excluding tokens served from prefix cache.

This metric helps measure actual compute workload during prefill,
accounting for prefix cache hits. It correctly handles:
- Prefix caching (excludes cached tokens)
- Chunked prefill (counts total prompt tokens, not per-chunk)
- Edge cases (negative values, no cache)

Changes:
- Add `num_cached_tokens` field to `FinishedRequestStats`
- Pass `num_cached_tokens` from `RequestState` through stats pipeline
- Calculate prefill KV compute as `num_prompt_tokens - num_cached_tokens`
- Add Prometheus histogram metric with standard buckets
- Add comprehensive unit tests covering cache hits, no cache, and edge cases

Example:
  Request with 10,000 token prompt
  Prefix cache hit: 1,200 tokens
  Metric reports: 8,800 tokens (10,000 - 1,200)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Ziliang Peng <ziliang@character.ai>
@ziliangpeng ziliangpeng force-pushed the feat-prefill-kv-metric branch from 9a5fc4d to 34d07c5 Compare December 6, 2025 19:50