Profile-Guided Optimization (PGO) and LLVM BOLT results #827

Open
zamazan4ik opened this issue Sep 8, 2023 · 12 comments

@zamazan4ik

Hi!

I did a lot of Profile-Guided Optimization (PGO) benchmarks recently on different kinds of software; all currently available results are collected at https://github.com/zamazan4ik/awesome-pgo . According to those tests, PGO usually helps achieve better performance, so testing PGO for Typos seemed like a good idea. I ran some benchmarks on my local machine and want to share the results.

Test environment

  • Fedora 38
  • Linux kernel 6.4.13
  • AMD Ryzen 9 5800x
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Rust: 1.72
  • Latest Typos from the master branch (commit da2759161fbf9ac2840d6955f120bc3c6f24405f)

Test workload

As a test scenario, I used the LLVM sources from https://github.com/llvm/llvm-project at commit 11db162db07d6083b79f4724e649a8c2c69913e1. All runs were performed on the same hardware and operating system, with the same background workload. The command used to run typos is taskset -c 0 ./typos -q --threads 1 llvm_project. A single thread was used to reduce the influence of the multi-threading scheduler on the results. All PGO optimizations were done with cargo-pgo.
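For reference, here is a minimal sketch of the cargo-pgo workflow used to produce the PGO builds; the output path and the training workload shown here are illustrative assumptions, not an official recipe:

```sh
# One-time setup: cargo-pgo wraps the instrument/optimize cycle.
cargo install cargo-pgo

# 1. Build an instrumented binary and run it on a representative workload
#    (cargo-pgo prints the exact path of the produced binary).
cargo pgo build
./target/x86_64-unknown-linux-gnu/release/typos -q --threads 1 llvm_project

# 2. Rebuild with the collected profile applied.
cargo pgo optimize
```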

Results

Here are the results. I also posted the instrumentation numbers so you can estimate how much slower typos is in instrumentation mode. The results are in time utility format.

  • Release: 48,86s user 3,44s system 99% cpu 52,628 total
  • PGO optimized: 30,09s user 3,23s system 99% cpu 33,616 total
  • PGO instrumented: 128,16s user 3,55s system 99% cpu 2:12,23 total
  • PGO optimized + BOLT instrumented: 92,05s user 3,60s system 99% cpu 1:36,08 total
  • PGO optimized + BOLT optimized: 29,09s user 3,16s system 98% cpu 32,585 total

Some conclusions

  • PGO shows a significant improvement in typos performance
  • BOLT (at least in Lite mode) does not show a significant additional improvement here

Further steps

I can suggest the following:

  • Add a note to the Typos documentation about building with PGO, so that users and maintainers who build their own Typos binaries are aware of PGO as an additional way to optimize the project
  • Optimize the binaries provided by the Typos project with PGO on the CI (as is already done for other projects such as rustc), if any are provided
@epage
Collaborator

epage commented Sep 8, 2023

Thanks for running these numbers!

iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.

Optimize the binaries provided by the Typos project with PGO on the CI (as is already done for other projects such as rustc), if any are provided

If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.

epage added the enhancement label Sep 8, 2023
@zamazan4ik
Author

iirc BOLT doesn't need a representative run to guide its optimizations. I wonder what a BOLT-only run looks like.

Well, that's partially true. Yes, BOLT can perform some optimizations even without a runtime profile. However, most of BOLT's optimizations are only performed when a runtime profile is available. The runtime profile can be collected either with Linux perf (sampling mode) or via BOLT's instrumentation (as I did for typos).
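As a sketch only (assuming llvm-bolt and merge-fdata are installed and on PATH, and reusing the same illustrative LLVM checkout as the training workload), cargo-pgo can drive the combined PGO + BOLT cycle roughly like this:

```sh
# Build a PGO-optimized binary that is additionally BOLT-instrumented.
cargo pgo bolt build --with-pgo

# Training run: the BOLT-instrumented binary writes a profile on exit
# (the exact name/path of the produced binary is printed by cargo-pgo).
./target/x86_64-unknown-linux-gnu/release/typos-bolt-instrumented -q --threads 1 llvm_project

# Apply both the PGO and BOLT profiles to produce the final binary.
cargo pgo bolt optimize --with-pgo
```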

If you had an example to point to that isn't as large as rustc, I'd appreciate it. I'd be curious to see what maintenance burden and CI pipeline load this introduces.

Sure! I have multiple examples of PGO and/or BOLT integration in different projects:

The examples are not only for Rust-based projects; I hope they can help.

@epage
Collaborator

epage commented Sep 11, 2023

Thanks for pydantic-core, that is exactly what I was looking for!

The next question is what is a minimal reasonable use case to profile. We're already going to be blowing up our build times with this, and I'd like to not make it worse, particularly because our GitHub Action has a race condition: if you specify master, it will start using the new version even if the binary isn't built yet, which will fail.

@zamazan4ik
Author

The next question is what is a minimal reasonable use case to profile.

I have some (hopefully helpful) thoughts on that:

  • There is almost no need to produce a PGO-optimized build for each commit/PR. I suggest using PGO only for release builds; by "release" I mean the binaries delivered to users (or something similar). So building PGO-optimized binaries per release, or even just once, should be fine.
  • You can generate a profile once and reuse it continuously in CI, so there is no need to perform two-stage builds (the most time-consuming part of PGO builds). A possible issue here is profile skew: if the profile was collected on an old enough typos version, it will probably be less effective (code refactorings, missing profiles for new code, slightly changed hot/cold splitting in the workloads). So it should be good enough to recollect profiles with some frequency, not on every build.
  • How do you generate a profile? I think it's okay to start by simply running Typos on some representative workloads (like checking the Linux/LLVM sources in multiple modes), collecting the profiles, committing them into the repo, and adding a script for recollecting the profiles (so regenerating them is easier in the future, and users can more easily generate their own profiles for Typos). A sketch of such a script follows this list.
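Purely as an illustration of that last point, a recollection script might look something like the following; the workload path, profile directory, and the script itself are assumptions rather than anything that exists in the repo today:

```sh
#!/usr/bin/env bash
# Hypothetical helper to (re)generate PGO profiles for typos.
set -euo pipefail

WORKLOAD=${1:-llvm_project}   # representative corpus to spell-check (assumed path)
PROFILE_DIR=pgo-profiles      # directory that would be committed to the repo (assumed layout)

# Build an instrumented binary and run it on the workload.
# `|| true` because typos exits non-zero when it finds typos.
cargo pgo build
./target/x86_64-unknown-linux-gnu/release/typos -q --threads 1 "$WORKLOAD" || true

# Merge the raw profiles into a single .profdata file that can be committed.
# cargo-pgo stores raw profiles under target/pgo-profiles by default (path assumed here);
# llvm-profdata ships with the llvm-tools-preview rustup component.
mkdir -p "$PROFILE_DIR"
llvm-profdata merge -o "$PROFILE_DIR/typos.profdata" target/pgo-profiles/*.profraw
```

A committed profile could then be applied with something like RUSTFLAGS="-Cprofile-use=$PWD/pgo-profiles/typos.profdata" cargo build --release, again assuming that layout.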

@epage
Collaborator

epage commented Sep 18, 2023

There is almost no need to produce a PGO-optimized build for each commit/PR. I suggest using PGO only for release builds; by "release" I mean the binaries delivered to users (or something similar). So building PGO-optimized binaries per release, or even just once, should be fine.

That was my expectation. Even still, build times are an impact, because we have a gap between master being updated and the binary being available, during which any actions living at HEAD will be broken.

You can generate a profile once and reuse it continuously in CI, so there is no need to perform two-stage builds (the most time-consuming part of PGO builds). A possible issue here is profile skew: if the profile was collected on an old enough typos version, it will probably be less effective (code refactorings, missing profiles for new code, slightly changed hot/cold splitting in the workloads). So it should be good enough to recollect profiles with some frequency, not on every build.

So to verify: the code doesn't need to be 1:1, but PGO handles skew between the profile and the code? Where can I read more about this so I understand the technical limitations?

@zamazan4ik
Author

Even still, build times are an impact, because we have a gap between master being updated and the binary being available, during which any actions living at HEAD will be broken.

Yeah. I think for the actions that depend on HEAD you can use a release build without PGO, and simply not make PGO builds for the HEAD version.

So to verify: the code doesn't need to be 1:1, but PGO handles skew between the profile and the code? Where can I read more about this so I understand the technical limitations?

That's an excellent question! Unfortunately, I don't have any resources about PGO profile-skew handling in the rustc compiler. Maybe @Kobzol has something. You can read about this question in the PGO documentation for the Go compiler - https://go.dev/doc/pgo . I hope it helps somehow (but there is no guarantee that this information also applies to rustc's PGO implementation).

@Kobzol

Kobzol commented Sep 18, 2023

I don't think that rustc currently promises anything regarding skew; for ideal results, the code should be the same for both instrumentation and optimization. That being said, I think that as long as most functions still have the same symbol name (this is the important thing for PGO), it should be mostly fine, and probably still better than no PGO at all. So reprofiling only e.g. every 100 commits or every week or so should be OK. Of course, if any build flags or the compiler change, then new profiles have to be gathered.

However, even if you reprofile the binary on every release workflow, I don't think that the CI cost would have to be so large. I think that running on some input that takes ~30s in CI should be enough for this project. So you'd have to pay for one additional (re)build of the crate + 30s-1m of profile gathering. You could try to use cargo-pgo to make the PGO workflow simpler.

By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).
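For reference, a minimal sketch of that configuration applied as one-off Cargo environment overrides (equivalent to setting lto = "thin" and codegen-units = 1 under [profile.release] in Cargo.toml; which form the project prefers is left open):

```sh
# Build a release binary with ThinLTO and a single codegen unit.
CARGO_PROFILE_RELEASE_LTO=thin \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
cargo build --release
```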

@epage
Collaborator

epage commented Sep 18, 2023

By the way, if you want to make the released binaries faster, I think that using ThinLTO and/or CGU=1 could also have a large effect, without the complication of profile gathering (it will somewhat increase build times, of course).

Huh, I had thought those were on. I enabled CGU=1 because it offered a big gain but didn't enable ThinLTO because it slowed down compile times (iirc) for little gain. See 1250609

@epage
Collaborator

epage commented Sep 25, 2023

With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?

@zamazan4ik
Author

With the costs and trade-offs of PGO, is it still worth it with the CGU=1 change?

I think so, since CGU=1 and PGO apply different sets of optimizations. And enabling CGU=1 with LTO is a good thing to do before enabling PGO anyway.

@epage
Collaborator

epage commented Sep 25, 2023

There are trade-offs with this. What I'm trying to weigh is how much of a gain there is in going from CGU=1 to CGU=1 + PGO, compared to any analysis time we have to do as part of our release pipeline.

@zamazan4ik
Author

The only way to estimate the benefits is to test CGU=1 vs CGU=1 + PGO in benchmarks :)
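For example, a quick comparison could be done with something like the following; the binary names are placeholders for the two builds, and hyperfine is just one convenient tool for this:

```sh
# Compare a CGU=1 release build against a CGU=1 + PGO build on the same corpus.
# -i ignores non-zero exit codes, since typos exits non-zero when it finds typos.
hyperfine -i --warmup 2 \
  'taskset -c 0 ./typos-cgu1 -q --threads 1 llvm_project' \
  'taskset -c 0 ./typos-cgu1-pgo -q --threads 1 llvm_project'
```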
