
Build released compiler artifacts as optimized as possible #49180

Closed
michaelwoerister opened this issue Mar 19, 2018 · 29 comments
Labels
C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. I-compiletime Issue: Problems and improvements with respect to compile times. T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance

Comments

@michaelwoerister
Member

michaelwoerister commented Mar 19, 2018

At the moment the compiler binaries that we release are not as fast and optimized as they could be. As of ff227c4, they are built with multiple codegen units and ThinLTO again, which makes the compiler around 10% slower than when built with a single CGU per crate. We really should be able to do better here, especially for stable releases:

  • At least, the compiler should be built with -Ccodegen-units=1 for stable releases.
  • In the medium term, the compiler might gain support for profile-guided optimization (see Add basic PGO support. #48346). Once it is available, we should use it for making the compiler itself faster. (see also symbol ordering: Use section/symbol ordering files for compiling rustc (e.g. BOLT) #50655)
  • We don't use full LTO for compiling the compiler, mainly because we don't support it for Rust dylibs. We should review if this restriction is still current, and, if we can lift it, enable full LTO.
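For reference, the single-CGU configuration from the first bullet can be expressed in bootstrap's config.toml. This is a sketch; the exact key name and section reflect the bootstrap configuration as I understand it and may differ between versions:

```toml
[rust]
# Build each compiler crate as a single codegen unit. Slower to build,
# but removes the cross-CGU optimization barriers described above.
codegen-units = 1
```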

@rust-lang/release @rust-lang/infra, how can we decouple builds of stable releases from the regular CI builds, which have been timing out so often lately? There should be a way of doing these builds without the severe time limits that we have in regular CI.

@michaelwoerister michaelwoerister added A-build I-compiletime Issue: Problems and improvements with respect to compile times. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. WG-compiler-performance Working group: Compiler Performance labels Mar 19, 2018
@alexcrichton
Member

alexcrichton commented Mar 19, 2018

To set expectations: the 10% number from perf does not mean 10% slower, it means "executes 10% more instructions". A change in instruction count is often an indicator that there could be a regression, but it does not translate into a 10% slowdown in literal wall time. For example, the wall-time measurements for that commit show the worst regression, percentage-wise, as 0.49s to 0.56s. Large benchmarks like servo-style-opt got at worst 3.8% slower in a clean build from scratch, going from 75 to 78 seconds.

I point this out because reducing the number of codegen units, PGO, and optimizations of that sort aren't really silver bullets. They're incredibly expensive optimizations that buy a few seconds here and there, as opposed to major optimizations across the board.

@SimonSapin
Contributor

@alexcrichton thanks for clarifying.

@michaelwoerister
Member Author

@alexcrichton Yes, I know that this won't make the compiler massively faster. On the other hand, it's not uncommon that we spend weeks of developer time on getting a 5% compile time improvement. If there's the opportunity of making the compiler 10% faster by letting a build machine chew on it for a few hours every six weeks, I think we should take it.

That being said, I don't underestimate the complexity of our CI. I just don't want us to disregard the opportunity from the beginning. Maybe there is a simpler solution that would get us 90% of the way.

@ishitatsuyuki
Contributor

ishitatsuyuki commented Mar 20, 2018

Moving to opt-level=3 can speed up the compiler by up to 2%, but it's blocked on a Windows codegen bug. See also: #48204.

@michaelwoerister
Member Author

@andjo403's comments on gitter have given me the idea that we could also try to build LLVM with PGO. I realize of course that this would require lots of new infrastructure support and isn't something that can be implemented quickly.

@michaelwoerister
Member Author

Some updates here:

  • In a heroic effort, @alexcrichton and @kennytm are working on switching the compiler's C++ code to be built with Clang 6.0 (Compile LLVM with Clang on release builders #50200), which promises to speed up the compiler by a few percent.
  • Using Clang will open up the possibility to use linker-based ThinLTO, which does not seem to have problems with Rust dylibs. This should give another few percent in compiler performance.
  • Another option for making the compiler faster is optimizing the order in which sections/symbols are emitted into object files (Chrome does this and Firefox might soon do it too).

@michaelwoerister
Member Author

I opened a separate issue for symbol ordering: #50655

@mati865
Contributor

mati865 commented Dec 2, 2018

windows-gnu remains the only Tier 1 platform still using GCC instead of Clang to build LLVM.
I decided to take a look at it and the results are:

  • Clang 7.0.0 with ld:
    Because of an alignment bug in ld (recently fixed on Binutils trunk), dbg! macros and a few other things cause runtime failures.
  • Clang 7.0.0 with lld (downloaded from https://llvm.org):
    lld 7.x isn't fully compatible with libraries built by the GNU toolchain and requires rebuilding the sysroot with the LLVM toolchain.
  • Clang trunk with lld:
    lld trunk is said to be compatible with GNU-based sysroots. I haven't tested it, but it won't be a problem for me to test if there is interest.

@jonas-schievink jonas-schievink added T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) and removed T-bootstrap Relevant to the bootstrap subteam: Rust's build system (x.py and src/bootstrap) A-build labels Apr 21, 2019
@michaelwoerister
Member Author

@alexcrichton & @nnethercote: Thanks to you we have pipelining now and our bootstrap time should be quite a bit shorter, right? (according to this: https://gistpreview.github.io/?74d799739504232991c49607d5ce748a)

Can we switch the compiler back to -Ccodegen-units=1? That might be a 10% performance win right there!

@Mark-Simulacrum
Member

Unfortunately, we're way too close to 4 hours today, and frequently going over, to be able to afford going back to codegen-units=1. Pipelining doesn't help us much on CI since we currently only have 2 cores, so we're not getting the advantage of -j28 like that graph shows :)

@bjorn3
Member

bjorn3 commented Oct 2, 2019

I am surprised that the simple rustc_codegen_utils takes 18s, while the way more complex rustc_codegen_ssa takes 24s in the timings of @michaelwoerister.

@michaelwoerister
Member Author

since we only have 2 cores

😱

@andjo403
Contributor

andjo403 commented Oct 2, 2019

But given that there are only 2 cores, are we sure that codegen-units=1 is not faster?

@Mark-Simulacrum
Member

My understanding is that LLVM is faster at optimizing smaller modules (not exactly obvious, I think, though certainly interesting). That means that splitting the same IR into more modules can produce faster builds, even with just one core.

@michaelwoerister
Member Author

That means that splitting the same IR into more modules can produce faster builds, even with just one core.

On the other hand we'd skip the entire ThinLTO step... let me give it a try locally.

@alexcrichton
Member

I would personally agree with @Mark-Simulacrum that we're extremely strapped for time budget on CI right now, and the longest builders are the Windows release builders. We should be extremely careful about making them slow (aka losing parallelism) and we're also hoping to get 4-core machines at some point which may change the calculus in terms of whether 2 cores + pipelining gives us sufficient parallelism or not.

@michaelwoerister
Member Author

My local test for ./x.py -j2 dist on Linux gave me ~40 minutes for 1 CGU and ~37 minutes for 16 CGUs, so the one CGU case is indeed a bit slower (although it's not as extreme as in the past).

@nnethercote
Contributor

@michaelwoerister said this at the start:

how can we decouple builds of stable releases from the regular CI builds that are timing out so much lately. There should be a way of doing these builds without the severe time limits that we have in regular CI.

From subsequent comments it seems like this point might be getting overlooked? We wouldn't do this for all CI builds, just those generating stable releases. How often are stable releases generated?

@Mark-Simulacrum
Member

We build stable artifacts approximately once every 6 weeks. While I believe the CI platform we're currently on, Pipelines, does not have strict timeouts, I would rather avoid waiting even longer than the existing 4+ hours for a full stable build. Plus, optimizations in this area are plausibly likely to introduce regressions, right? That might be rare, but I believe it is non-theoretical that changes to the codegen-unit settings used to build the compiler have caused bugs in the past; I could be wrong about this claim.

@ishitatsuyuki
Contributor

I grepped through past PRs and I have no idea what the current state of distribution builds is: the last documented change seems to be #45444, which would mean codegen-units=1 and lto=no? (That seems rather old, which is weird.)

What is the current state?

@alexcrichton
Member

@nnethercote to add to what @Mark-Simulacrum already mentioned I personally think we also derive a lot of value from stable/beta/nightly releases all being produced exactly the same way. That way we can exclude a class of bugs where stable releases are buggy due to how they're built but beta/nightly don't have the same bugs. (for example this would help prevent a showstopper bug on either beta or stable). There's also enough users of non-stable that producing quite-fast compilers on nightly and such is relatively important.

If we try to build a full release every night, however, that's where it gets pretty onerous to make release builds slower. That'd happen at least once a day (multiple times for stable/beta), and that runs the risk of being even slower than we currently are, which is already sort of unbearably slow :(

@ishitatsuyuki I believe the current state is that libstd is built with one CGU and all rustc crates are built with 16 CGUs and have ThinLTO enabled for each crate's set of CGUs.
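In bootstrap's config.toml terms, that state corresponds roughly to the following. This is a sketch for illustration; the exact key names (in particular codegen-units-std) depend on the bootstrap version in use:

```toml
[rust]
# std: a single codegen unit per crate
codegen-units-std = 1
# rustc crates: 16 codegen units each; ThinLTO then runs across
# each crate's own set of CGUs (not across crates)
codegen-units = 16
```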

@nnethercote
Contributor

I agree that we should release what we regularly test. Thanks for pointing that out.

@michaelwoerister
Member Author

Here's a possibly interesting thought: PGO speeds up Firefox quite a bit (5-10%). Maybe it would be possible to harness PGO for our LLVM builds? We rebuild LLVM only very infrequently and fall back on a cached version the rest of the time. We would just need a way to fill the cache with a PGO'ed version of LLVM (which is kind of complicated, I guess).

Anyway, a starting point would be to do a local test and see if there are actual performance improvements to be had.

@12101111
Contributor

-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld generates a broken rustc:
rustc[2418] trap invalid opcode ip:7efd8ca7cef8 sp:7efd87acfa40 error:0 in libstd-71e59b47b634435d.so[7efd8ca45000+83000]
Execution runs into a ud2 instruction.

@luser
Contributor

luser commented Nov 20, 2020

I don't know if this is the right venue in which to discuss @michaelwoerister's recent blog post, but I'd love to provide some feedback on my experiences enabling PGO for Firefox CI and the various lessons we learned along the way.

@michaelwoerister
Member Author

@luser I'd love to hear about your experiences with PGO for Firefox CI. I think that would be really valuable!

I plan to create a tracking issue for using PGO on rustc itself sometime this week. If you post your feedback here, I can already incorporate it there. Otherwise, I'll just ping you once the tracking issue is online.

@jyn514
Member

jyn514 commented Feb 3, 2023

My understanding is that there are two parts to this issue:

  1. Should we have a separate builder for stable/beta releases, which has a higher time limit? It sounds like @Mark-Simulacrum and Alex think that's a bad idea.
  2. Can we enable further optimizations for the compiler? We already enable PGO and BOLT today, as well as codegen-units-std=1, but I think the compiler itself is still built with multiple codegen units per crate (although given that we use ThinLTO, maybe that doesn't have much of an impact?).
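The BOLT step mentioned in point 2 is a post-link optimization driven by sampled profiles. As a rough sketch of the general mechanism (the workload and library path below are placeholders, not the real CI invocation, and the flag set varies across BOLT versions):

```shell
# 1. Sample a representative workload with branch records (Linux, LBR-capable CPU).
perf record -e cycles:u -j any,u -o perf.data -- ./rustc-benchmark-workload

# 2. Convert the perf samples into BOLT's profile format for the target binary.
perf2bolt -p perf.data -o profile.fdata librustc_driver.so

# 3. Rewrite the binary with profile-guided code layout.
llvm-bolt librustc_driver.so -o librustc_driver.bolted.so \
    -data=profile.fdata -reorder-blocks=ext-tsp
```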

@michaelwoerister is that an accurate summary? Do you still want to enable codegen-units=1? We have a lot more builder capacity than in the past, I think it would be feasible to turn it on unconditionally for all dist builders, not just stable and beta.

@lqd
Member

lqd commented Feb 3, 2023

We also have this newer tracking issue, with more details and all the recent work done for the build config: #103595

@jyn514
Member

jyn514 commented Feb 3, 2023

Perfect, thanks! I'm going to close this issue as outdated and use #103595 for tracking these improvements.

@jyn514 jyn514 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 3, 2023