🐛 Describe the bug
Hi,
When doing text generation with Mistral 7B using Hugging Face transformers on an MI100 GPU, I can see in the collected torch trace that a lot of time is wasted due to a hipMemcpyWithStream call triggered by torch.multinomial. The hipMemcpyWithStream operation only returns well after the previously queued GPU kernels have finished executing.
For reference, it is responsible for a ~6ms bubble out of the ~40ms it takes to generate one token, so optimizing it away would have quite an impact.
Minimal example to collect a trace, which can be visualized with e.g. https://ui.perfetto.dev:
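(The snippet from the original report is not reproduced in this copy; the following is only a minimal sketch of such a repro, assuming a profiler-wrapped generation-like loop. The tensor shapes and iteration count are illustrative, not taken from the report.)

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda"  # ROCm builds expose AMD GPUs through the CUDA device type

# Stand-ins for the decode step: a large matmul keeps the GPU queue busy,
# and the logits roughly match Mistral's vocabulary size (32000 tokens).
a = torch.randn(4096, 4096, device=device)
probs = torch.softmax(torch.randn(1, 32000, device=device), dim=-1)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        _ = a @ a  # queued GPU work, like a model forward pass
        # The hipMemcpyWithStream shows up in the trace under this call:
        next_token = torch.multinomial(probs, num_samples=1)

prof.export_chrome_trace("trace.json")  # open this file in https://ui.perfetto.dev
```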
Versions
Environment:
Python packages (besides the torch packages, only transformers is relevant):
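(The environment output itself is missing from this copy; this section of a PyTorch issue is typically gathered with the standard script `python -m torch.utils.collect_env`.)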
Best,
Epliz