
POC: Enable upstream pytorch profiler for microbenchmarks #2343

Merged
merged 19 commits into main on Oct 1, 2024

Conversation

Contributor

@anmyachev anmyachev commented Sep 25, 2024

Closes #2344

It looks like we can still use the upstream PyTorch profiler, with the limitation that the kernel name must be specified explicitly. Using print(prof.key_averages(group_by_stack_n=5).table(sort_by="xpu_time")) I found events matching the kernel name, and their count matched the number of times the function was launched.

It also looks strange that the device type is reported as DeviceType.CUDA while the time field is called xpu_time. The numbers I get are similar to those we get with IPEX (so far I've only looked at softmax).

(Pdb) functions[-1]
<FunctionEvent id=4562 name=scan_kernel device_type=DeviceType.CUDA node_id=-1 cpu_time=0.000us start_us=35873.791 end_us=35876.031 cpu_children=[] xpu_time=2.240us name=scan_kernel thread=1 input_shapes=[] cpu_memory_usage=0 xpu_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>
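
For reference, a minimal sketch of the measurement approach described above (not the exact do_bench code from this PR; it assumes a PyTorch build with XPU profiling support and uses a placeholder workload and kernel name):

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="xpu")
fn = lambda: torch.softmax(x, dim=-1)  # placeholder workload
kernel_name = "softmax_kernel"         # assumed name; must be given explicitly

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    for _ in range(10):
        fn()
    torch.xpu.synchronize()

# The table mentioned above, sorted by device time:
print(prof.key_averages(group_by_stack_n=5).table(sort_by="xpu_time"))

# Keep only the events whose name matches the launched kernel; their count
# should match the number of launches.
matching = [evt for evt in prof.events() if evt.name == kernel_name]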

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (7ade81f from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):

  • Softmax: triton geomean diff: +6%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -5%
  • FA advanced: triton geomean diff: ~0%, xetla geomean diff: ~0%, ratio (triton/xetla) geomean diff: ~0%
  • GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
  • Prefix sum: triton geomean diff: +1%

@anmyachev anmyachev changed the title from "POC: use pytorch profiler" to "POC: Enable upstream pytorch profiler for microbenchmarks" on Sep 25, 2024
@anmyachev anmyachev marked this pull request as ready for review September 25, 2024 18:54
Comment on lines +191 to +192
# record time of `fn`
fn()
Contributor Author


Most of this function is a copy of the do_bench_ipex function.

However, we can't use the following code, because the kernel we need isn't among the sub-events of this event (due to a bug, I guess):

with record_function("__profile_kernel_of_func"):
    fn()
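
For context, a rough sketch of the pattern in question (assumed, not the repository's exact do_bench_ipex code; fn stands for the benchmarked callable): the call is wrapped in record_function, and the kernel is then expected to show up among that event's children:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    with record_function("__profile_kernel_of_func"):
        fn()  # benchmarked callable

wrappers = [e for e in prof.events() if e.name == "__profile_kernel_of_func"]
# Because of the bug mentioned above, the launched kernel is not listed in
# cpu_children of these wrapper events, so this lookup comes back empty.
children = [c for w in wrappers for c in w.cpu_children]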

Contributor


Please ensure a ticket is created for this if it is not already.

@@ -33,7 +33,7 @@ def _summarize_statistics(times, quantiles, return_mode):


def do_bench_ipex(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, fast_flush=True, return_mode="mean",
Contributor Author


I haven't changed the default implementation yet, so that I can switch between the two and see how the new implementation behaves relative to the old one.

@whitneywhtsang
Contributor

In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?

@anmyachev
Contributor Author

In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?

@whitneywhtsang first of all, thank you for looking at this.

The most important thing for me is that we can use a more accurate benchmarking method on new platforms even now, because without this change we are looking at performance computed via the elapsed_time function (which is not very accurate for small kernels).
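
For illustration, this is roughly what elapsed_time-based measurement looks like (a sketch assuming the torch.xpu event API mirrors the CUDA one; not the repository's exact do_bench code, and the workload is a placeholder):

import torch

x = torch.randn(4096, 4096, device="xpu")
fn = lambda: torch.softmax(x, dim=-1)  # placeholder workload

start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)
start.record()
fn()
end.record()
torch.xpu.synchronize()
# Millisecond-scale resolution; noisy for kernels that run only a few microseconds.
elapsed_ms = start.elapsed_time(end)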

The second thought is that the sooner we start adopting the new method, the more time we have to find potential problems or get used to the new baseline. Considering the lag between finding a problem and fixing it, this may take far longer than a month (in the worst-case scenario, of course).

The third thought is that if the results are confirmed, we can get rid of IPEX within a week (much faster than a month, in the best-case scenario). I could (and would like to) turn on this method by default for the coming weekend and get several measurements that could show an approximate trend.

@vlad-penkin
Contributor

In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?

Realistically speaking, the ETA for the PTI fix to be delivered is about 6-8 weeks from now.

@etiotto
Contributor

etiotto commented Sep 26, 2024

The proposed implementation yields timings that are not consistent with the timings obtained using IPEX. The PR description states:

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (https://github.com/intel/intel-xpu-backend-for-triton/commit/7ade81f9fa6a59526e8ae911904f8e5042224715 from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):

Softmax: triton geomean diff: +6%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -5%
FA advanced: triton geomean diff: ~0%, xetla geomean diff: ~0%, ratio (triton/xetla) geomean diff: ~0%
GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
Prefix sum: triton geomean diff: +1%

Let's take Softmax: the difference between the timing IPEX gives and the timing from the implementation proposed in this PR is too large (11% geomean diff for XeTLA, 6% for Triton).

Given that the main objective of this PR is to provide a way to get precise timing information for kernels (like IPEX does), I do not think the PR solves the problem (diff is too large).

@whitneywhtsang
Contributor

Thanks for your explanations and clarification @anmyachev and @vlad-penkin! Now I have a better understanding that this change aims to improve the benchmarking method on client GPUs, with no intended differences on PVC. As mentioned by @etiotto, do we know why the timings from PTI and IPEX are not the same?

@anmyachev
Contributor Author

The proposed implementation yields timings that are not consistent with the timings obtained using IPEX. The PR description states:

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (https://github.com/intel/intel-xpu-backend-for-triton/commit/7ade81f9fa6a59526e8ae911904f8e5042224715 from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):

Softmax: triton geomean diff: +6%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -5%
FA advanced: triton geomean diff: ~0%, xetla geomean diff: ~0%, ratio (triton/xetla) geomean diff: ~0%
GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
Prefix sum: triton geomean diff: +1%

Let's take Softmax: the difference between the timing IPEX gives and the timing from the implementation proposed in this PR is too large (11% geomean diff for XeTLA, 6% for Triton).

Given that the main objective of this PR is to provide a way to get precise timing information for kernels (like IPEX does), I do not think the PR solves the problem (diff is too large).

In this comparison I implicitly mean teraflops (although that should have been stated explicitly): the geometric mean of teraflops is higher, but the measured average time is lower, which may indicate that the measurement accuracy has improved. It is quite difficult to say why performance appears higher, since internal optimizations may also have influenced this.
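
To make the relationship explicit, a tiny illustration with made-up numbers (only the direction of the effect matters):

# For a fixed amount of work, a smaller measured time yields a higher TFLOPS figure.
flops = 2 * 4096 * 4096 * 4096            # hypothetical GEMM workload
t_old_us, t_new_us = 910.0, 860.0         # hypothetical measured kernel times (us)
tflops_old = flops / (t_old_us * 1e-6) / 1e12
tflops_new = flops / (t_new_us * 1e-6) / 1e12
assert tflops_new > tflops_old            # shorter time -> higher reported TFLOPS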

@anmyachev
Contributor Author

anmyachev commented Sep 27, 2024

Blocked on #2376

anmyachev added a commit that referenced this pull request Sep 27, 2024
Required for #2343 because it will add a new benchmarking method: `UPSTREAM_PYTORCH_PROFILER`.

The option is added mainly with an eye on CI, so for now I kept the `USE_IPEX` option in order not to change the interaction interface (the default behavior is also unchanged).

@anmyachev
Contributor Author

anmyachev commented Sep 27, 2024

CI status: all passed.

Note for reviewers: it is quite safe to merge as it does not affect the default measurement method. Adding the kernel name when needed is very easy.
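
As an example of what "adding the kernel name" might look like at a call site (hypothetical function and parameter names; the actual benchmark scripts may differ):

# Hypothetical usage: the caller tells the profiler-based do_bench which
# device kernel to look for among the recorded events.
times = do_bench_upstream_pytorch_profiler(fn, kernel_name="softmax_kernel")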

@anmyachev
Contributor Author

anmyachev commented Sep 30, 2024

I will re-measure the performance one more time before merging: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11112254377 (upstream) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11113062942 (legacy):

  • Softmax: triton geomean diff: +10%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -1%
  • FA advanced: triton geomean diff: +1%, xetla geomean diff: +2%, ratio (triton/xetla) geomean diff: -1%
  • GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
  • Prefix sum: triton geomean diff: -8%

A few thoughts on the results:

  • The results are likely to be heavily influenced by fluctuations. Once we have a few measurements on the dashboard, it will be easier to judge. As an example, the ratio of the last three results (for GEMM: triton adv/xetla) fluctuates within one percent.
  • For "Prefix sum" the result looks like an anomaly; I can't explain it by the profiling method. We need to look at the trend.
  • As a workaround, I don't use with record_function("__profile_kernel_of_func"):. The influence of this function on performance is unknown; I can only check after the bug is fixed. It is also necessary to make sure that an issue has been opened for this particular problem (or that one of the already-open issues covers it).

I suggest we merge this as early as possible and get more data on the basis of which we can judge the remaining issues.

cc @whitneywhtsang

Contributor

@whitneywhtsang whitneywhtsang left a comment


Given that this PR doesn't change the default behavior on PVC, I am fine with merging it.
It is important to find out the reason why XeTLA is improved more than Triton, which is blocked by the with record_function("__profile_kernel_of_func"): bug. Let's ensure the bug gets fixed ASAP, and then check whether it is still the case that XeTLA is improved more than Triton.

@anmyachev anmyachev merged commit 2f332f5 into main Oct 1, 2024
7 checks passed
@anmyachev anmyachev deleted the amyachev/pytorch-prof branch October 1, 2024 11:12