
Build released compiler artifacts as optimized as possible #49180

Closed
michaelwoerister opened this issue Mar 19, 2018 · 29 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. I-compiletime Issue: Problems and improvements with respect to compile times. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance

Comments

@michaelwoerister
Member

michaelwoerister commented Mar 19, 2018

At the moment the compiler binaries that we release are not as fast and optimized as they could be. As of ff227c4, they are built with multiple codegen units and ThinLTO again, which makes the compiler around 10% slower than when built with a single CGU per crate. We really should be able to do better here, especially for stable releases:

  • At least, the compiler should be built with -Ccodegen-units=1 for stable releases.
  • In the medium term, the compiler might gain support for profile-guided optimization (see Add basic PGO support. #48346). Once it is available, we should use it for making the compiler itself faster. (see also symbol ordering: Use section/symbol ordering files for compiling rustc (e.g. BOLT) #50655)
  • We don't use full LTO for compiling the compiler, mainly because we don't support it for Rust dylibs. We should review if this restriction is still current, and, if we can lift it, enable full LTO.
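For reference, the single-CGU configuration from the first bullet can be expressed in bootstrap's config.toml. This is a sketch; the exact key name and section reflect the bootstrap configuration as I understand it and may differ between versions:

```toml
[rust]
# Build each compiler crate as a single codegen unit. Slower to build,
# but removes the cross-CGU optimization barriers described above.
codegen-units = 1
```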

@rust-lang/release @rust-lang/infra, how can we decouple builds of stable releases from the regular CI builds, which have been timing out so often lately? There should be a way of doing these builds without the severe time limits that we have in regular CI.

@michaelwoerister michaelwoerister added A-build I-compiletime Issue: Problems and improvements with respect to compile times. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. WG-compiler-performance Working group: Compiler Performance labels Mar 19, 2018
@alexcrichton
Member

alexcrichton commented Mar 19, 2018

To set expectations: the 10% number from perf does not mean 10% slower, it means "executes 10% more instructions". A change in instruction count is often an indicator that there could be a regression, but it does not translate into a 10% slowdown in literal wall time. For example, the wall-time measurements for that commit show the worst regression, percentage-wise, as 0.49s to 0.56s. Large benchmarks like servo-style-opt got at worst 3.8% slower in a clean build from scratch, going from 75 to 78 seconds.

I point this out because reducing the number of codegen units, PGO, and optimizations of that sort aren't really silver bullets. They're incredibly expensive optimizations that buy a few seconds here and there, as opposed to major optimizations across the board.

@SimonSapin
Contributor

@alexcrichton thanks for clarifying.

@michaelwoerister
Member Author

@alexcrichton Yes, I know that this won't make the compiler massively faster. On the other hand, it's not uncommon that we spend weeks of developer time on getting a 5% compile time improvement. If there's the opportunity of making the compiler 10% faster by letting a build machine chew on it for a few hours every six weeks, I think we should take it.

That being said, I don't underestimate the complexity of our CI. I just don't want us to disregard the opportunity from the beginning. Maybe there is a simpler solution that would get us 90% of the way.

@ishitatsuyuki
Contributor

ishitatsuyuki commented Mar 20, 2018

Moving to opt-level=3 can speed up the compiler by up to 2%, but it's blocked on a Windows codegen bug. See also: #48204.

@michaelwoerister
Member Author

@andjo403's comments on gitter have given me the idea that we could also try to build LLVM with PGO. I realize of course that this would require lots of new infrastructure support and isn't something that can be implemented quickly.

@michaelwoerister
Member Author

Some updates here:

  • In a heroic effort, @alexcrichton and @kennytm are working on switching the compiler's C++ code to be built with Clang 6.0 (Compile LLVM with Clang on release builders #50200), which promises to speed up the compiler by a few percent.
  • Using Clang will open up the possibility to use linker-based ThinLTO, which does not seem to have problems with Rust dylibs. This should give another few percent in compiler performance.
  • Another option for making the compiler faster is optimizing the order in which sections/symbols are emitted into object files (Chrome does this and Firefox might soon do it too).

@michaelwoerister
Member Author

I opened a separate issue for symbol ordering: #50655

@mati865
Contributor

mati865 commented Dec 2, 2018

windows-gnu remains the only Tier 1 platform still using GCC instead of Clang to build LLVM.
I decided to take a look at it and the results are:

  • Clang 7.0.0 with ld:
    Because of an alignment bug in ld (recently fixed on Binutils trunk), dbg! macros and a few other things cause runtime failures.
  • Clang 7.0.0 with lld (downloaded from https://llvm.org):
    lld 7.x isn't fully compatible with libraries built by the GNU toolchain and requires rebuilding the sysroot with the LLVM toolchain.
  • Clang trunk with lld:
    lld trunk is said to be compatible with GNU-based sysroots. I haven't tested it, but it won't be a problem for me to test if there is interest.

@jonas-schievink jonas-schievink added T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) and removed T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) A-build labels Apr 21, 2019
@michaelwoerister
Member Author

@alexcrichton & @nnethercote: Thanks to you we have pipelining now and our bootstrap time should be quite a bit shorter, right? (according to this: https://gistpreview.github.io/?74d799739504232991c49607d5ce748a)

Can we switch the compiler back to -Ccodegen-units=1? That might be a 10% performance win right there!

@Mark-Simulacrum
Member

Unfortunately, we're way too close to 4 hours today, and frequently going over, to be able to afford going back to codegen-units=1. Pipelining doesn't help us much on CI since we currently only have 2 cores, so we're not getting the advantage of -j28 like that graph shows :)

@bjorn3
Member

bjorn3 commented Oct 2, 2019

I am surprised that the simple rustc_codegen_utils takes 18s, while the way more complex rustc_codegen_ssa takes 24s in the timings of @michaelwoerister.

@michaelwoerister
Member Author

since we only have 2 cores

😱

@andjo403
Contributor

andjo403 commented Oct 2, 2019

But given that there are only 2 cores, are we sure that codegen-units=1 is not faster?

@Mark-Simulacrum
Member

My understanding is that LLVM is faster at optimizing smaller modules (not exactly obvious, I think, though certainly interesting). That means that splitting the same IR into more modules can produce faster builds, even with just one core.

@michaelwoerister
Member Author

That means that splitting the same IR into more modules can produce faster builds, even with just one core.

On the other hand we'd skip the entire ThinLTO step... let me give it a try locally.

@alexcrichton
Member

I would personally agree with @Mark-Simulacrum that we're extremely strapped for time budget on CI right now, and the longest builders are the Windows release builders. We should be extremely careful about making them slow (aka losing parallelism) and we're also hoping to get 4-core machines at some point which may change the calculus in terms of whether 2 cores + pipelining gives us sufficient parallelism or not.

@michaelwoerister
Member Author

My local test for ./x.py -j2 dist on Linux gave me ~40 minutes for 1 CGU and ~37 minutes for 16 CGUs, so the one CGU case is indeed a bit slower (although it's not as extreme as in the past).

@nnethercote
Contributor

@michaelwoerister said this at the start:

how can we decouple builds of stable releases from the regular CI builds that are timing out so much lately. There should be a way of doing these builds without the severe time limits that we have in regular CI.

From subsequent comments it seems like this point might be getting overlooked? We wouldn't do this for all CI builds, just those generating stable releases. How often are stable releases generated?

@Mark-Simulacrum
Member

We build stable artifacts approximately once every 6 weeks. While I believe the CI platform we're currently on, Pipelines, does not have strict timeouts, I would rather avoid waiting even longer than the existing 4+ hours for a full stable build. Plus, optimizations in this area are plausibly likely to introduce regressions, right? That might be rare, but I believe it is non-theoretical that changes to the codegen-unit settings used to build the compiler have caused bugs in the past; I could be wrong about this claim.

@ishitatsuyuki
Contributor

I grepped through past PRs and I have no idea what the current state of distribution builds is: the last documented change seems to be #45444, which would mean codegen-units=1 and lto=no? (That seems rather old, which is weird.)

What is the current state?

@alexcrichton
Member

@nnethercote to add to what @Mark-Simulacrum already mentioned I personally think we also derive a lot of value from stable/beta/nightly releases all being produced exactly the same way. That way we can exclude a class of bugs where stable releases are buggy due to how they're built but beta/nightly don't have the same bugs. (for example this would help prevent a showstopper bug on either beta or stable). There's also enough users of non-stable that producing quite-fast compilers on nightly and such is relatively important.

If we try to build a full release every night, however, that's where it gets pretty onerous to make release builds slower. That'd happen at least once a day (multiple times for stable/beta), and that runs the risk of being even slower than we currently are, which is already sort of unbearably slow :(

@ishitatsuyuki I believe the current state is that libstd is built with one CGU and all rustc crates are built with 16 CGUs and have ThinLTO enabled for each crate's set of CGUs.
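In bootstrap's config.toml terms, that state corresponds roughly to the following. This is a sketch for illustration; the exact key names (in particular codegen-units-std) depend on the bootstrap version in use:

```toml
[rust]
# std: a single codegen unit per crate
codegen-units-std = 1
# rustc crates: 16 codegen units each; ThinLTO then runs across
# each crate's own set of CGUs (not across crates)
codegen-units = 16
```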

@nnethercote
Contributor

I agree that we should release what we regularly test. Thanks for pointing that out.

@michaelwoerister
Member Author

Here's a possibly interesting thought: PGO speeds up Firefox quite a bit (5-10%). Maybe it would be possible to harness PGO for our LLVM builds? We rebuild LLVM only very infrequently and fall back on a cached version the rest of the time. We would just need a way to fill the cache with a PGO'ed version of LLVM (which is kind of complicated, I guess).

Anyway, a starting point would be to do a local test and see if there are actual performance improvements to be had.

@12101111
Contributor

-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld generates a broken rustc:
rustc[2418] trap invalid opcode ip:7efd8ca7cef8 sp:7efd87acfa40 error:0 in libstd-71e59b47b634435d.so[7efd8ca45000+83000]
Execution runs into a ud2 instruction.

@luser
Contributor

luser commented Nov 20, 2020

I don't know if this is the right venue in which to discuss @michaelwoerister's recent blog post, but I'd love to provide some feedback on my experiences enabling PGO for Firefox CI and the various lessons we learned along the way.

@michaelwoerister
Member Author

@luser I'd love to hear about your experiences with PGO for Firefox CI. I think that would be really valuable!

I plan to create a tracking issue for using PGO on rustc itself sometime this week. If you post your feedback here, I can already incorporate it there. Otherwise, I'll just ping you once the tracking issue is online.

@jyn514
Member

jyn514 commented Feb 3, 2023

My understanding is that there are two parts to this issue:

  1. Should we have a separate builder for stable/beta releases, which has a higher time limit? It sounds like @Mark-Simulacrum and Alex think that's a bad idea.
  2. Can we enable further optimizations for the compiler? We already enable PGO and BOLT today, as well as codegen-units-std=1, but I think the compiler itself is still built with multiple codegen units per crate (although given that we use ThinLTO, maybe that doesn't have much of an impact?).
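The BOLT step mentioned in point 2 is a post-link optimization driven by sampled profiles. As a rough sketch of the general mechanism (the workload and library path below are placeholders, not the real CI invocation, and the flag set varies across BOLT versions):

```shell
# 1. Sample a representative workload with branch records (Linux, LBR-capable CPU).
perf record -e cycles:u -j any,u -o perf.data -- ./rustc-benchmark-workload

# 2. Convert the perf samples into BOLT's profile format for the target binary.
perf2bolt -p perf.data -o profile.fdata librustc_driver.so

# 3. Rewrite the binary with profile-guided code layout.
llvm-bolt librustc_driver.so -o librustc_driver.bolted.so \
    -data=profile.fdata -reorder-blocks=ext-tsp
```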

@michaelwoerister is that an accurate summary? Do you still want to enable codegen-units=1? We have a lot more builder capacity than in the past, I think it would be feasible to turn it on unconditionally for all dist builders, not just stable and beta.

@lqd
Member

lqd commented Feb 3, 2023

We also have this newer tracking issue, with more details and all the recent work done for the build config: #103595

@jyn514
Member

jyn514 commented Feb 3, 2023

Perfect, thanks! I'm going to close this issue as outdated and use #103595 for tracking these improvements.

@jyn514 jyn514 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 3, 2023