
Speed up bors cycle time #8456

Closed · alexcrichton opened this issue Aug 11, 2013 · 10 comments

@alexcrichton (Member) commented Aug 11, 2013
Right now the cycle time for a passing pull request is 3+ hours, up from about 2 hours a week ago, and up from roughly 1 hour a few months ago, if I remember correctly. This isn't really an artifact of Rust getting slower, but rather of @bors working even harder at testing. It does look like there's some easy low-hanging fruit which could speed up cycle time by at least an hour or two.

Currently the slowest builds are *-nopt (3+ hours), *-all (~2 hours), and *-vg (~1.5 hours). The baseline build time seems to be ~1 hour, and I think that we can get back to that level of speed with these suggestions:

  1. For *-nopt builds, build the compiler with optimizations, but don't run tests with optimizations. This could be enabled by #8450 (Allow disabling optimizations in tests only); I believe that the runtime now being unoptimized on *-nopt builds is where the huge slowdown is coming from. (A configuration sketch follows this list.)

  2. Parallelize the *-all builds. From what I understand, these builders essentially perform 4 separate builds on linux/mac: H64 -> T32, H64 -> T64, H32 -> T64, H32 -> T32 (where H == host word size, T == target word size). Two of these are already tested by other builders (H64 -> T64, H32 -> T32), and the remaining two don't really need to run on the same builder; they could be built in parallel.

    Basically, I think we could get the same test coverage by removing the *-all builders and introducing *-cross32 and *-cross64 builders instead (where the host differs from the target architecture).

  3. Either resolve #749 (Investigate running tests under Address Sanitizer), which should run a lot faster than valgrind, or run only a subset of the tests under valgrind. Doing either of these is kind of tricky, but I think that in the immediate future it's more reasonable to run a subset of the tests under valgrind. I'm not sure whether others would agree, but I'd think we could run only the run-pass, libXtest, and run-fail suites on the valgrind bots and get mostly the same signal as running valgrind over the entire build.

    Perhaps the snapshot builders could run valgrind over the entire suite, but inevitably we'd get a failure which would then be difficult to debug (though I think it would be worth it).
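Something like the following buildbot master snippet is what I have in mind for suggestions 1 and 2. This is only a rough sketch against buildbot's stock BuildFactory/ShellCommand/BuilderConfig API; the builder names, slave names, and configure flags (including --disable-optimize-tests, which is how I understand #8450 to work) are illustrative, not our actual bot configuration:

```python
# Rough sketch only: parallel cross builders plus an optimized-compiler /
# unoptimized-tests nopt builder. Names and flags are illustrative.
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import ShellCommand

def rust_factory(configure_args):
    f = BuildFactory()
    f.addStep(ShellCommand(command=["./configure"] + configure_args))
    f.addStep(ShellCommand(command=["make", "-j8"]))
    f.addStep(ShellCommand(command=["make", "check"]))
    return f

builders = [
    # The two cross configurations the *-all builders currently run in
    # series, split out so they run concurrently on separate slaves.
    BuilderConfig(name="linux-cross32", slavenames=["linux64-slave"],
                  factory=rust_factory(["--target=i686-unknown-linux-gnu"])),
    BuilderConfig(name="linux-cross64", slavenames=["linux32-slave"],
                  factory=rust_factory(["--target=x86_64-unknown-linux-gnu"])),
    # Suggestion 1: optimized compiler, unoptimized tests (#8450).
    BuilderConfig(name="linux-64-nopt-t", slavenames=["linux64-slave"],
                  factory=rust_factory(["--disable-optimize-tests"])),
]
```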

I think that with these three possibilities combined, we could get the cycle time back down to an hour and start processing a lot more builds in one day.

@graydon, @brson, would it be possible to configure the bots this way?

@graydon (Contributor) commented Aug 11, 2013

Maybe! But I'm also interested in figuring out the sources of the slowdown in the first place.

@brson (Contributor) commented Aug 11, 2013

The purpose of the -all builds is primarily to ensure that the build system continues to support cross-compiling, in both cross-target and multi-host configurations. In the multi-host configuration, where you are building a compiler for both 64- and 32-bit targets, there's no easy way to avoid building 4 complete sets of libraries. We could make the multi-host build more efficient in a few ways: primarily by not building all three stages on the secondary hosts (jumping straight to stage2 instead), and by bootstrapping the secondary hosts from an earlier stage of the primary. We could also run fewer tests, or none at all, in the cross builds; what matters most is that the build system continues to function.
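To put a number on that, here's a toy illustration (triples shortened; this isn't our actual build logic, just the counting argument):

```python
# Why a 2-host, 2-target build means 4 complete library builds: every
# compiler we ship (host) needs libraries for every platform it targets.
hosts = ["x86_64", "i686"]    # compilers we produce
targets = ["x86_64", "i686"]  # platforms those compilers can target

lib_sets = [(h, t) for h in hosts for t in targets]
for h, t in lib_sets:
    print("libraries for host=%s target=%s" % (h, t))
print("%d complete sets of libraries" % len(lib_sets))  # 4
```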

The valgrind builds could run single-threaded if they are not already.

#7425 is about optimizing the compiler on the nopt bots.

@alexcrichton (Member, Author)

Would it be possible to add *-crossXX builders which actually run tests, but for only one host -> target architecture, and then also keep the *-all builders without running any tests? That way we could make sure the cross-build system works, while also running tests against the builds, but in parallel rather than serialized.

Also, #8450 should close #7425 after the configuration is changed on the bots.

@graydon (Contributor) commented Aug 12, 2013

#4572, and generally reorganizing the metadata-read path for small files, is likely the next biggest win.

@alexcrichton (Member, Author)

After implementing suggestion 1, along with some runtime optimizations, the longest build time is down from 3h to 1h40m. Currently the longest builders are the *-all-opt builders. The next slowest is *-opt-vg, clocking in at 1h20m, followed by linux-32-nopt-t at around 1h10m.

For the next speedup win, it looks like the -all-opt builders are running tests in configurations which are already tested by the *-opt builders. Specifically:

  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-cfail
  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-std
  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-extra
  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-rpass
  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-rustpkg
  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-rusti
  • check-stage2-T-x86_64-unknown-linux-gnu-H-x86_64-unknown-linux-gnu-rfail

Basically these combinations are being run:

  • T-x86_64-H-x86_64
  • T-i686-H-x86_64

Oddly enough, these combinations are not being run:

  • T-x86_64-H-i686
  • T-i686-H-i686

@brson, would it be possible to stop running the T-x86_64-H-x86_64 tests, on the grounds that the functionality is already tested elsewhere? Also, is it an oversight or is it intended that the H-i686 tests aren't running?

@graydon (Contributor) commented Aug 14, 2013

I'll look into this today. I also think there's reason to believe the tests are not parallelizing due to the new scheduler, and possibly there are negative interactions with valgrind on the -vg bots.

@graydon (Contributor) commented Aug 20, 2013

There are other strategic moves we could make with the buildbot setup, though I'm hesitant to do any of them before simply fixing the speed bottlenecks in the compiler and testsuite. For reference's sake:

  • Sully's suggestion: run most tests using the check-fast target, then run the tests marked xfail-fast manually to complete coverage.
  • Niko's suggestion: run the valgrind tests periodically (say, once every 10 landings) and halt bors if they fail. This requires modifying bors somewhat, leaves us having to bisect to find culprits, and blocks everyone on the breakage rather than just the person responsible for the bad change, but it does mean we'd avoid sitting on valgrind breakage for too long.
  • My suggestion: split the testsuite into 'mod N' pieces and run N times as many builders, so that builder K runs a test if the test number mod N is K. This won't help with the straight-line build time, but it will absorb arbitrarily large testsuites (see the sketch below).
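A quick sketch of the mod-N split, with made-up test names (builder K takes test i when i mod N == K):

```python
# Deterministic 'mod N' sharding: number the tests in a fixed order, then
# give builder K exactly the tests whose index is congruent to K mod N.
def shard(tests, n_builders, k):
    # Sort first so every builder agrees on the numbering.
    return [t for i, t in enumerate(sorted(tests)) if i % n_builders == k]

suite = ["run-pass/hello.rs", "run-pass/vec.rs", "compile-fail/bad.rs",
         "run-fail/panic.rs", "std/io.rs", "extra/json.rs"]
for k in range(3):
    print("builder %d runs %s" % (k, shard(suite, 3, k)))
```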

@brson (Contributor) commented Aug 21, 2013

@alexcrichton the reason all four configurations are not tested is essentially a historical accident: I never paid much attention to what make check does in this configuration.

Under the cross configuration the tests could mostly be eliminated, but I would like to keep ensuring that the makefiles work (tests.mk is crazy complicated), so I'd like to keep some subset of the cross tests running.

Actually, though, given that the most important use case here is ensuring that the Android cross tests work, we should be doing the cross tests differently: testing the build arch compiling for each cross target arch (e.g. x86_64->i686, x86_64->x86_64), not all host archs compiling for the build arch (e.g. x86_64->x86_64, i686->x86_64).
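Concretely, with pairs written host -> target (arch names shortened; just an illustration of the two matrices, not real build code):

```python
# Current vs. proposed cross-test matrices from the comment above.
build_arch = "x86_64"

# Today: every host arch compiles for the build arch.
current = ["%s->%s" % (h, build_arch) for h in ["x86_64", "i686"]]
# Proposed: the build arch compiles for each target it cross-builds.
proposed = ["%s->%s" % (build_arch, t) for t in ["x86_64", "i686"]]

print("current: ", current)   # ['x86_64->x86_64', 'i686->x86_64']
print("proposed:", proposed)  # ['x86_64->x86_64', 'x86_64->i686']
```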

@alexcrichton (Member, Author)

Closing this for now. The cycle time is around 1hr 10min now, which is very tolerable. The queue does not have a huge backup nowadays, so I don't think that this is a pressing matter any more.

More specific issues may be opened in the future, but I don't think that this blanket issue is necessary any more.

@l0kod (Contributor) commented Aug 20, 2014

Distributed build systems for Rust: http://discuss.rust-lang.org/t/distributed-build-systems-for-rust/400
