POC: Enable upstream pytorch profiler for microbenchmarks #2343
Conversation
Signed-off-by: Anatoly Myachev <[email protected]>
```python
# record time of `fn`
fn()
```
Most of this function is a copy of the `do_bench_ipex` function. However, we can't use the following code, because the kernel we need isn't among the subevents of this event (likely due to a bug):

```python
with record_function("__profile_kernel_of_func"):
    fn()
```
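As an aside, the event-name-matching workaround being discussed could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the `do_bench_profiler` name is made up, and CPU-only profiler activity is used so the sketch runs anywhere (on a device one would add the XPU activity and read device-side times instead).

```python
import torch
from torch.profiler import profile, ProfilerActivity

def do_bench_profiler(fn, kernel_name_fragment, n_repeat=5):
    """Hypothetical sketch: time `fn` by averaging the self time of profiler
    events whose name contains `kernel_name_fragment`, instead of relying on
    `record_function` subevents (which are affected by the bug above)."""
    fn()  # warmup so one-time costs don't pollute the profile
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(n_repeat):
            fn()
    # match events by name, as the PR description suggests
    matching = [e for e in prof.key_averages() if kernel_name_fragment in e.key]
    # average self time per call, in microseconds
    return sum(e.self_cpu_time_total for e in matching) / n_repeat
```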
Please ensure a ticket is created for this if it is not already.
```python
@@ -33,7 +33,7 @@ def _summarize_statistics(times, quantiles, return_mode):

def do_bench_ipex(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, fast_flush=True, return_mode="mean",
```
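For context, `_summarize_statistics` reduces the per-run timings according to the `quantiles` and `return_mode` arguments seen in the `do_bench_ipex` signature. A rough pure-Python approximation of that contract (hypothetical; the actual implementation lives in the benchmark suite and uses torch tensors) might look like:

```python
import statistics

def summarize(times, quantiles=None, return_mode="mean"):
    """Hypothetical approximation of _summarize_statistics: reduce a list of
    per-run timings either to the requested quantiles or to a single mode."""
    if quantiles is not None:
        # linear-interpolation quantiles over the sorted timings
        s = sorted(times)
        n = len(s)
        out = []
        for q in quantiles:
            idx = q * (n - 1)
            lo, hi = int(idx), min(int(idx) + 1, n - 1)
            out.append(s[lo] + (s[hi] - s[lo]) * (idx - lo))
        return out if len(out) > 1 else out[0]
    if return_mode == "mean":
        return statistics.mean(times)
    if return_mode == "median":
        return statistics.median(times)
    if return_mode == "min":
        return min(times)
    return max(times)
```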
I haven't changed the default implementation yet, so that I can switch between the two and see how the new implementation behaves relative to the old one.
In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?
@whitneywhtsang first of all, thank you for looking at this. The most important thing for me is that we can use a more accurate benchmarking method even now on new platforms, because without this change we are looking at the performance calculated through

The second thought is that the sooner we start adopting the new method, the more time we have to find potential problems or get used to the new baseline. Considering the lag between finding a problem and fixing it, this may take far longer than a month (in the worst-case scenario, of course).

The third thought is that if the results are confirmed, then we can get rid of IPEX within a week (much faster than a month, in the best-case scenario).

I could (and would like to) turn on this method by default for the coming weekend and get several measurements that could show an approximate trend.
Realistically speaking, the ETA for the PTI fix being delivered is about 6-8 weeks from now.
The proposed implementation yields timings that are not consistent with the timings given by IPEX. The PR description states:

Let's take Softmax: the difference between the timing IPEX gives vs. the implementation proposed in this PR is too large (11% geomean diff for XeTLA, 6% for Triton). Given that the main objective of this PR is to provide a way to get precise timing information for kernels (like IPEX does), I do not think the PR solves the problem (the diff is too large).
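As a side note on how such a number is computed: a geomean diff between two timing series is the geometric mean of the per-benchmark ratios, expressed relative to the baseline. A minimal sketch (the comparison script itself is not shown in this PR):

```python
import math

def geomean_diff(baseline, candidate):
    """Geometric mean of per-benchmark ratios, expressed as a percentage
    difference from the baseline. Inputs are parallel lists of timings
    for the same set of benchmark shapes."""
    ratios = [c / b for b, c in zip(baseline, candidate)]
    geomean = math.prod(ratios) ** (1.0 / len(ratios))
    return (geomean - 1.0) * 100.0
```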
Thanks for your explanations and clarification @anmyachev and @vlad-penkin! Now I have a better understanding that this change aims to improve the benchmarking method on client GPUs, with no intended differences on PVC. As mentioned by @etiotto, do we know why the timings from PTI and IPEX are not the same?
In this approach I implicitly mean teraflops (although it should have been written explicitly), and the geometric mean of teraflops is higher; on the other hand, the measured average time is lower, which may indicate that the accuracy of the measurements has increased. It is quite difficult to say why the performance has increased, since some internal optimizations may have influenced this.
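To illustrate the relationship being described: for a fixed operation count, reported TFLOPS is inversely proportional to the measured time, so a lower measured time directly translates into a higher TFLOPS geomean. A toy example (the workload size and timings below are made up):

```python
def tflops(flop_count, time_ms):
    """Convert an operation count and a measured time into TFLOPS."""
    return flop_count * 1e-12 / (time_ms * 1e-3)

# the same kernel, measured with two methods: a more precise method that
# excludes host-side overhead reports a smaller time, hence higher TFLOPS
flops = 2e9  # made-up workload size
legacy = tflops(flops, time_ms=1.0)
precise = tflops(flops, time_ms=0.9)
```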
Required for #2343, because it will add a new benchmarking method: `UPSTREAM_PYTORCH_PROFILER`. The option is added more with an eye on CI, so for now I left the `USE_IPEX` option so as not to change the interaction interface (the default behavior is also not changed).

Signed-off-by: Anatoly Myachev <[email protected]>
Co-authored-by: Pavel Chekin <[email protected]>
CI status (all passed):
Note for reviewers: it is quite safe to merge, as it does not affect the default measurement method. Adding the kernel name when needed is very easy.
I will re-measure the perf one more time before merging: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11112254377 (upstream) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11113062942 (legacy):
A few thoughts on the results:
I suggest we merge this as early as possible and get more data on the basis of which we can judge the remaining issues.
Given that this PR doesn't change the default behavior on PVC, I am fine with merging it.

It is important to find out why XeTLA is improved more than Triton, which is blocked by the `record_function("__profile_kernel_of_func")` bug. Let's ensure the bug gets fixed ASAP, and check whether it is still the case that XeTLA is improved more than Triton.
```python
# record time of `fn`
fn()
```
Please ensure a ticket is created for this if it is not already.
Closes #2344
It looks like we can still use the upstream PyTorch profiler, with the limitation of explicitly specifying the kernel name. Using

```python
print(prof.key_averages(group_by_stack_n=5).table(sort_by="xpu_time"))
```

I found events that matched the name, and their count matched the number of times the function was launched. It also looks strange that the `DeviceType.CUDA` type is reported there, while the time column is called `xpu_time`. The numbers I get are similar to those we get with IPEX (so far I've only looked at softmax).

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (7ade81f from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):
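The name-matching step described above can be sketched over a mocked event list (the `Event` dataclass here is a hypothetical stand-in for the profiler's real event objects, not an actual PyTorch type):

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    xpu_time_us: float  # stand-in for the profiler's device-time field

def kernel_times(events, kernel_name):
    """Keep only the device events whose name matches the benchmarked kernel."""
    return [e.xpu_time_us for e in events if kernel_name in e.name]

events = [
    Event("softmax_kernel", 12.0),
    Event("aten::copy_", 3.0),
    Event("softmax_kernel", 11.5),
]
times = kernel_times(events, "softmax_kernel")
# two matching launches found, one per call of the benchmarked function
```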