[Usage]: Profiling Prefill and Decode Phases Separately #4900

Open
Msiavashi opened this issue May 18, 2024 · 5 comments
Labels
usage How to use vllm

Comments

@Msiavashi

Your current environment

I'm trying to measure the performance (e.g., latency and throughput) of the prefill and decode phases independently. Is there a way to do this? I have seen a few benchmarks that measure end-to-end throughput and latency, but none that report separate metrics for each phase.

I would greatly appreciate any guidance on profiling these two phases separately.

How would you like to use vllm

No response

@Msiavashi Msiavashi added the usage How to use vllm label May 18, 2024
@leiwen83
Contributor

Streaming mode reports the latency of each token, so the prefill and decode phases can be measured separately.
The current benchmark uses synchronous mode, so another possible workaround is:

  1. Measure latency with input_len=1000, output_len=1. This gives the prefill latency for input_len=1000.
  2. Measure latency with input_len=1, output_len=1 to get an average latency A, then with input_len=1, output_len=1000 to get an average latency B. The per-token decode latency is (B-A)/999.
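
Both ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of vLLM: `split_stream_latency` and `decode_latency_from_runs` are hypothetical helper names, and `token_stream` stands for any iterable that yields one item per generated token (for example, a streaming client response). The time to first token only approximates prefill latency, since it also includes any queueing or scheduling overhead.

```python
import time
from typing import Callable, Iterable, Tuple


def split_stream_latency(
    token_stream: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (time_to_first_token, mean_inter_token_latency) in seconds.

    Hypothetical helper: token_stream is any iterable yielding one item
    per generated token. The first value approximates prefill latency,
    the second approximates per-token decode latency.
    """
    start = clock()
    # Record one timestamp each time a token arrives from the stream.
    stamps = [clock() for _ in token_stream]
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    n_decode = len(stamps) - 1  # tokens produced after the first one
    per_token = (stamps[-1] - stamps[0]) / n_decode if n_decode else 0.0
    return ttft, per_token


def decode_latency_from_runs(
    latency_a: float, latency_b: float, output_len_b: int
) -> float:
    """Subtraction workaround: (B - A) / (output_len_b - 1).

    latency_a: average latency with input_len=1, output_len=1
    latency_b: average latency with input_len=1, output_len=output_len_b
    """
    if output_len_b < 2:
        raise ValueError("output_len_b must be at least 2")
    return (latency_b - latency_a) / (output_len_b - 1)
```

With output_len=1000 the second helper divides by 999, matching the formula in step 2. Passing a custom `clock` makes the streaming helper easy to check with fixed timestamps instead of real time.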

@Msiavashi
Author

So there is still no built-in mechanism for these measurements or for profiling, right?

@KevinZeng08

Hi, are there now any other ways to profile the prefill and decode phases separately?

@kerthcet
Contributor

There's a related ongoing PR: #2809

@dshm

dshm commented Sep 19, 2024

Same question. Are there now any other ways to profile the prefill and decode phases separately?
