
POC: Enable upstream pytorch profiler for microbenchmarks #2343

Merged
merged 19 commits into main on Oct 1, 2024

Conversation

Contributor

@anmyachev anmyachev commented Sep 25, 2024

Closes #2344

It looks like we can still use the upstream PyTorch profiler, with the limitation that the kernel name must be specified explicitly. Using print(prof.key_averages(group_by_stack_n=5).table(sort_by="xpu_time")) I found events matching the kernel name, and their count matched the number of times the function was launched.

It also looks strange that the device type is reported as DeviceType.CUDA while the time field is called xpu_time. The numbers I get are similar to those we get with IPEX (so far I've only looked at softmax).

(Pdb) functions[-1]
<FunctionEvent id=4562 name=scan_kernel device_type=DeviceType.CUDA node_id=-1 cpu_time=0.000us start_us=35873.791 end_us=35876.031 cpu_children=[] xpu_time=2.240us name=scan_kernel thread=1 input_shapes=[] cpu_memory_usage=0 xpu_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>
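
For reference, a minimal sketch of the measurement approach described above (not the exact do_bench code from this PR; it assumes a PyTorch build with XPU profiling support and uses a placeholder workload and kernel name):

import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="xpu")
fn = lambda: torch.softmax(x, dim=-1)  # placeholder workload
kernel_name = "softmax_kernel"         # assumed name; must be given explicitly

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    for _ in range(10):
        fn()
    torch.xpu.synchronize()

# The table mentioned above, sorted by device time:
print(prof.key_averages(group_by_stack_n=5).table(sort_by="xpu_time"))

# Keep only the events whose name matches the launched kernel; their count
# should match the number of launches.
matching = [evt for evt in prof.events() if evt.name == kernel_name]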

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (7ade81f from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):

  • Softmax: triton geomean diff: +6%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -5%
  • FA advanced: triton geomean diff: ~0%, xetla geomean diff: ~0%, ratio (triton/xetla) geomean diff: ~0%
  • GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
  • Prefix sum: triton geomean diff: +1%

@anmyachev anmyachev changed the title from "POC: use pytorch profiler" to "POC: Enable upstream pytorch profiler for microbenchmarks" on Sep 25, 2024
@anmyachev anmyachev marked this pull request as ready for review September 25, 2024 18:54
Comment on lines +191 to +192
# record time of `fn`
fn()
Contributor Author


Most of this function is a copy of the do_bench_ipex function.

However, we can't use the following code, because the kernel we need isn't among the sub-events of this event (due to a bug, I guess):

with record_function("__profile_kernel_of_func"):
    fn()
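
For context, a rough sketch of the pattern in question (assumed, not the repository's exact do_bench_ipex code; fn stands for the benchmarked callable): the call is wrapped in record_function, and the kernel is then expected to show up among that event's children:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    with record_function("__profile_kernel_of_func"):
        fn()  # benchmarked callable

wrappers = [e for e in prof.events() if e.name == "__profile_kernel_of_func"]
# Because of the bug mentioned above, the launched kernel is not listed in
# cpu_children of these wrapper events, so this lookup comes back empty.
children = [c for w in wrappers for c in w.cpu_children]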

Contributor


Please ensure a ticket is created for this if it is not already.

@@ -33,7 +33,7 @@ def _summarize_statistics(times, quantiles, return_mode):


def do_bench_ipex(fn, warmup=25, rep=100, grad_to_none=None, quantiles=None, fast_flush=True, return_mode="mean",
Contributor Author


I haven't changed the default implementation yet, so that I can switch between the two and see how the new implementation behaves relative to the old one.

@whitneywhtsang
Contributor

In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?

@anmyachev
Contributor Author

In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?

@whitneywhtsang first of all, thank you for looking at this.

The most important thing for me is that we can use a more accurate benchmarking method on new platforms even now, because without this change we are looking at performance computed via the elapsed_time function (which is not very accurate for small kernels).
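
For illustration, this is roughly what elapsed_time-based measurement looks like (a sketch assuming the torch.xpu event API mirrors the CUDA one; not the repository's exact do_bench code, and the workload is a placeholder):

import torch

x = torch.randn(4096, 4096, device="xpu")
fn = lambda: torch.softmax(x, dim=-1)  # placeholder workload

start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)
start.record()
fn()
end.record()
torch.xpu.synchronize()
# Millisecond-scale resolution; noisy for kernels that run only a few microseconds.
elapsed_ms = start.elapsed_time(end)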

The second thought is that the sooner we start adopting the new method, the more time we have to find potential problems or get used to the new baseline. Considering the lag between finding a problem and fixing it, this may take far longer than a month (in the worst-case scenario, of course).

The third thought is that if the results are confirmed, we can get rid of IPEX within a week (much faster than a month, in the best-case scenario). I could (and would like to) turn on this method by default for the coming weekend and get several measurements that could show an approximate trend.

@vlad-penkin
Contributor

In general, I think it is great to know that we can already use PTI with a workaround. However, knowing that the PTI problem is supposed to be fixed in a month, I would not go with this short-term solution, as it adds maintenance, and would continue to use IPEX in the meantime. What's your motivation to switch to PTI right away?

Realistically speaking, the ETA for the PTI fix to be delivered is about 6-8 weeks from now.

@etiotto
Contributor

etiotto commented Sep 26, 2024

The proposed implementation yields timings that are not consistent with the timings obtained using IPEX. The PR description states:

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (https://github.com/intel/intel-xpu-backend-for-triton/commit/7ade81f9fa6a59526e8ae911904f8e5042224715 from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):

Softmax: triton geomean diff: +6%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -5%
FA advanced: triton geomean diff: ~0%, xetla geomean diff: ~0%, ratio (triton/xetla) geomean diff: ~0%
GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
Prefix sum: triton geomean diff: +1%

Let's take Softmax: the difference between the timing IPEX gives and the timing from the implementation proposed in this PR is too large (11% geomean diff for XeTLA, 6% for Triton).

Given that the main objective of this PR is to provide a way to get precise timing information for kernels (like IPEX does), I do not think the PR solves the problem (diff is too large).

@whitneywhtsang
Contributor

Thanks for your explanations and clarification @anmyachev and @vlad-penkin! Now I have a better understanding that this change aims to improve the benchmarking method on client GPUs, with no intended differences on PVC. As mentioned by @etiotto, do we know why the timings from PTI and IPEX are not the same?

@anmyachev
Contributor Author

The proposed implementation yields timings that are not consistent with the timings obtained using IPEX. The PR description states:

Let's compare: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11038468895 (https://github.com/intel/intel-xpu-backend-for-triton/commit/7ade81f9fa6a59526e8ae911904f8e5042224715 from this PR) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11035535245 (ipex):

Softmax: triton geomean diff: +6%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -5%
FA advanced: triton geomean diff: ~0%, xetla geomean diff: ~0%, ratio (triton/xetla) geomean diff: ~0%
GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
Prefix sum: triton geomean diff: +1%

Let's take Softmax: the difference between the timing IPEX gives and the timing from the implementation proposed in this PR is too large (11% geomean diff for XeTLA, 6% for Triton).

Given that the main objective of this PR is to provide a way to get precise timing information for kernels (like IPEX does), I do not think the PR solves the problem (diff is too large).

In this comparison I implicitly mean teraflops (although that should have been stated explicitly): the geometric mean of teraflops is higher, but the measured average time is lower, which may indicate that the measurement accuracy has improved. It is quite difficult to say why performance appears higher, since internal optimizations may also have influenced this.
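
To make the relationship explicit, a tiny illustration with made-up numbers (only the direction of the effect matters):

# For a fixed amount of work, a smaller measured time yields a higher TFLOPS figure.
flops = 2 * 4096 * 4096 * 4096            # hypothetical GEMM workload
t_old_us, t_new_us = 910.0, 860.0         # hypothetical measured kernel times (us)
tflops_old = flops / (t_old_us * 1e-6) / 1e12
tflops_new = flops / (t_new_us * 1e-6) / 1e12
assert tflops_new > tflops_old            # shorter time -> higher reported TFLOPS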

@anmyachev
Contributor Author

anmyachev commented Sep 27, 2024

Blocked on #2376

anmyachev added a commit that referenced this pull request Sep 27, 2024
Required for #2343 because it will add a new benchmarking method: `UPSTREAM_PYTORCH_PROFILER`.

The option is added mainly with an eye on CI, so for now I kept the `USE_IPEX` option in order not to change the interaction interface (the default behavior is also unchanged).

@anmyachev
Contributor Author

anmyachev commented Sep 27, 2024

CI status: all passed.

Note for reviewers: it is quite safe to merge as it does not affect the default measurement method. Adding the kernel name when needed is very easy.
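
As an example of what "adding the kernel name" might look like at a call site (hypothetical function and parameter names; the actual benchmark scripts may differ):

# Hypothetical usage: the caller tells the profiler-based do_bench which
# device kernel to look for among the recorded events.
times = do_bench_upstream_pytorch_profiler(fn, kernel_name="softmax_kernel")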

@anmyachev
Contributor Author

anmyachev commented Sep 30, 2024

I will re-measure the performance one more time before merging: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11112254377 (upstream) vs https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11113062942 (legacy):

  • Softmax: triton geomean diff: +10%, xetla geomean diff: +11%, ratio (triton/xetla) geomean diff: -1%
  • FA advanced: triton geomean diff: +1%, xetla geomean diff: +2%, ratio (triton/xetla) geomean diff: -1%
  • GEMM advanced: triton geomean diff: +2%, xetla geomean diff: +4%, ratio (triton/xetla) geomean diff: -2%
  • Prefix sum: triton geomean diff: -8%

A few thoughts on the results:

  • The results are likely to be heavily influenced by fluctuations. Once we have a few measurements on the dashboard, it will be easier to judge. As an example, the ratio of the last three results (for GEMM: triton adv/xetla) fluctuates within one percent.
  • For "Prefix sum" the result looks like an anomaly; I can't explain it by the profiling method. We need to look at the trend.
  • As a workaround, I don't use with record_function("__profile_kernel_of_func"):. The influence of this function on performance is unknown; I can only check after the bug is fixed. It is also necessary to make sure that an issue has been opened for this particular problem (or that one of the already-open issues covers it).

I suggest we merge this as early as possible and get more data on the basis of which we can judge the remaining issues.

cc @whitneywhtsang

Contributor

@whitneywhtsang whitneywhtsang left a comment


Given that this PR doesn't change the default behavior on PVC, I am fine with merging it.
It is important to find out the reason why XeTLA is improved more than Triton, which is blocked by the with record_function("__profile_kernel_of_func"): bug. Let's ensure the bug gets fixed ASAP, and then check whether it is still the case that XeTLA is improved more than Triton.

@anmyachev anmyachev merged commit 2f332f5 into main Oct 1, 2024
7 checks passed
@anmyachev anmyachev deleted the amyachev/pytorch-prof branch October 1, 2024 11:12