Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cargo build hangs #88862

Closed
Emilgardis opened this issue Sep 11, 2021 · 35 comments · Fixed by #89285
Closed

Cargo build hangs #88862

Emilgardis opened this issue Sep 11, 2021 · 35 comments · Fixed by #89285
Labels
C-bug Category: This is a bug. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-compiletime Issue: Problems and improvements with respect to compile times. ICEBreaker-Cleanup-Crew Helping to "clean up" bugs with minimal examples and bisections P-medium Medium priority regression-from-stable-to-beta Performance or correctness regression from stable to beta.

Comments

@Emilgardis
Copy link
Contributor

originally posted in zulip

Hi! I've found a regression in compilation introduced in #85499,

searched nightlies: from nightly-2021-08-17 to nightly-2021-09-10
regressed nightly: nightly-2021-08-26
searched commits: from b03ccac to 0afc208
regressed commit: 0afc208

bisected with cargo-bisect-rustc v0.6.0

Host triple: x86_64-pc-windows-msvc
Reproduce with:

# bisect-script.bat
pwsh.exe -noexit -Command "stop-job -Name cargobuild_bisect; if ((start-job -Name cargobuild_bisect { cargo build -p server 2>&1 } | wait-job -timeout 600).State -ne 'Completed') {receive-job -Name cargobuild_bisect; receive-job -Name cargobuild_bisect | remove-job -Force; exit 10} else {receive-job -Name cargobuild_bisect; receive-job -Name cargobuild_bisect | remove-job -Force; exit 0}"
cargo bisect-rustc --start=2021-08-17 --script .\bisect-script.bat

The issue is that compilation succeeds in a acceptable time before this change, but now it's seemingly taking forever, I have not tried this on linux, only windows so far. I'm not sure how I should report this because I'm not able to reproduce it outside my large workspace where I bisected this. Does anyone have an idea to what this could be or have a good way to figure out what the cause is?

some notable dependencies: actix-web, sqlx, serde, chrono, ring, tokio, tracing, reqwest,anyhow,eyre,async-trait and many more

I've tried

$ export CARGO_LOG="info"
$ export RUSTC_LOG="info" # to stop massive console spam I use this instead = "info,rustc_trait_selection=warn,rustc_codegen_ssa=warn,rustc_metadata=warn,rustc_mir=warn,rustc_query_system=warn,rustc_const_eval=warn"
$ cargo build -p server

which outputs at the end

┐rustc_query_system::dep_graph::serialized::encode_node node=NodeInfo { node: variances_of(7b5e03ed597c3c33-573328205f4d7f81), fingerprint: Fingerprint(12749739837320991735, 10040035723298659737), edges: [2870] }
┘
┐rustc_query_system::dep_graph::serialized::encode_node node=NodeInfo { node: variances_of(7b5e03ed597c3c33-6f459134164ff719), fingerprint: Fingerprint(8361987967435134197, 5478714612052799554), edges: [2870] }
┘
[2021-09-11T16:54:37Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 0)
┐rustc_query_system::dep_graph::serialized::encode_node node=NodeInfo { node: variances_of(7b5e03ed597c3c33-82cb26faded76d4), fingerprint: Fingerprint(12749739837320991735, 10040035723298659737), edges: [2870] }
[2021-09-11T16:54:40Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 1)
┘
[2021-09-11T16:54:40Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 0)
┐rustc_query_system::dep_graph::serialized::encode_node node=NodeInfo { node: variances_of(7b5e03ed597c3c33-7c40a2b5f026df66), fingerprint: Fingerprint(12526569172406770563, 17126563881511896260), edges: [2870] }
[2021-09-11T16:54:40Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 1)
┘
[2021-09-11T16:54:40Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 0)
┐rustc_query_system::dep_graph::serialized::encode_node node=NodeInfo { node: variances_of(7b5e03ed597c3c33-c7d9e80976e93c5), fingerprint: Fingerprint(12526569172406770563, 17126563881511896260), edges: [2870] }
[2021-09-11T16:54:40Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 1)
┘
[2021-09-11T16:54:40Z INFO  cargo::core::compiler::job_queue] tokens in use: 0, rustc_tokens: [], waiting_rustcs: [] (events this tick: 0)
    Building [=======================> ] 445/446: server(bin)

and then just hangs.

How do I proceed?

@Emilgardis Emilgardis added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Sep 11, 2021
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Sep 11, 2021
@camelid camelid added I-hang Issue: The compiler never terminates, due to infinite loops, deadlock, livelock, etc. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-untriaged Untriaged performance or correctness regression. labels Sep 11, 2021
@camelid
Copy link
Member

camelid commented Sep 11, 2021

cc @jackh726

@camelid
Copy link
Member

camelid commented Sep 11, 2021

One way to minimize could be to see which crate in your workspace is triggering the hang. E.g., run cargo build -p my_crate for each of the crates in the workspace and find the one with the fewest dependencies (in-workspace dependencies, or external). Then you can try to minimize that crate further by deleting different parts of the code until the hang disappears.

@Emilgardis
Copy link
Contributor Author

Emilgardis commented Sep 12, 2021

The crate/step which is failing is the final one. I managed to delete parts of the code to make the hang disappear but I'm not certain I've actually pinpointed it exactly yet.

I let it run and seems it failed on STATUS_STACK_BUFFER_OVERRUN
https://gist.github.com/Emilgardis/954c2aeefc0cca270a421a90ecc5067a#file-output-cargo-issue88862-log-L988

@Emilgardis
Copy link
Contributor Author

Emilgardis commented Sep 12, 2021

ran with

[profile.dev]
lto = "off"
panic = "abort"
incremental = false

and the build completed after a while. The memory usage jumps up dramatically, from 138 MB to 7GB to 20GB, and then down again to repeat.

Same behaviour without those flags.

This is probably not a hang, just increased complexity causing longer run-time.

EDIT: On older nightly, the max memory usage is around 1.5GB

@Emilgardis
Copy link
Contributor Author

Managed to minimize it somewhat. Was not able to remove all dependencies but most of them.

❯ cargo -Vv                                           
cargo 1.56.0-nightly (e96bdb0c3 2021-08-17)
release: 1.56.0
commit-hash: e96bdb0c3d0a418e7fcd7fbd69be08abf830b4bc
commit-date: 2021-08-17

takes 40-45 seconds to compile from clean state.


❯ cargo -Vv                                           
cargo 1.56.0-nightly (18751dd3f 2021-09-01)
release: 1.56.0
commit-hash: 18751dd3f238d94d384a7fe967abfac06cbfe0b9
commit-date: 2021-09-01

takes 125-500 seconds to compile from clean state. (and uses 5-7GB of memory sometimes)

I believe the issue is to do with the client traits and impls and tracing instrumentation. Replacing upsert_token with a todo!() makes compilation sane again.

https://gist.github.com/Emilgardis/194812fa2e73ff6839d9942163329887

@camelid
Copy link
Member

camelid commented Sep 12, 2021

@rustbot ping cleanup

I think it would be useful to further minimize the above Gist.

@rustbot
Copy link
Collaborator

rustbot commented Sep 12, 2021

@rustbot rustbot added the ICEBreaker-Cleanup-Crew Helping to "clean up" bugs with minimal examples and bisections label Sep 12, 2021
@smmalis37
Copy link
Contributor

I took a profile of a debug build of rustc building that gist. This definitely doesn't seem to be a hang per se, there's always some work going on, it's just taking a really really long time. There's no one specific function we're spending all our time in either, its more like death by a thousand papercuts. Although we are spending about 25% of the time in hashmap-related code, but I imagine that's normal for rustc?

@camelid
Copy link
Member

camelid commented Sep 13, 2021

Although we are spending about 25% of the time in hashmap-related code, but I imagine that's normal for rustc?

Hmm, I haven't looked at rustc profiles before, but 25% in hashmap-related code (I assume you mean code in the hashmap library—FxHashMap) sounds unusual to me. You could try profiling other trait-using code that doesn't slow rustc down a ton and see how much of the time is in hashmap code.

@Mark-Simulacrum
Copy link
Member

No, 25% may even be low. rustc is very HashMap intensive in most profiles.

@Emilgardis
Copy link
Contributor Author

Emilgardis commented Sep 13, 2021

diff of cargo rustc --profile=dev -Z timings -- -Z self-profile profdata on rustc 1.56.0-nightly (b03ccace5 2021-08-24) as base and rustc 1.56.0-nightly (0afc20860 2021-08-25) as change. converted to csv.

Also some flamegraphs: before, after

can publish the raw profdata if wanted

@Emilgardis
Copy link
Contributor Author

I'm confused as to why there is no analysis in the flamegraph of after.

@jackh726
Copy link
Member

Hi all. Thanks a ton for the help on this. Especially @Emilgardis

It's definitely helpful to see that the majority of the time is spent during the codegen phase. Definitely I think getting a more minimized repro will help me get a fix for this.

@smmalis37
Copy link
Contributor

smmalis37 commented Sep 13, 2021

That's gonna be tricky. Even just removing a few of the calls to tracing, something that I'd expect to have no side effects, would drastically reduce the compilation time. Same for deleting code that gets linted as unreachable.

@camelid
Copy link
Member

camelid commented Sep 13, 2021

That's gonna be tricky. Even just removing a few of the calls to tracing, something that I'd expect to have no side effects, would drastically reduce the compilation time.

Perhaps if you expand the macros (using something like --pretty=expanded), you could remove parts of the calls to tracing and see what the effect is?

@jackh726
Copy link
Member

It's probably not so much total compile time, as much as time spent in normalize_generic_arg_after_erasing_regions

@smmalis37
Copy link
Contributor

One thing I just spotted in my profiling is lots of stacks of collect_items_rec that are over 50 recursive calls deep. Is that expected? Is it worthwhile/possible to change this method to be iterative instead of recursive?

@Emilgardis
Copy link
Contributor Author

Emilgardis commented Sep 14, 2021

Managed to get it quite small, but still directly depends on sqlx and actix-web. https://gist.github.com/Emilgardis/c9b152dcd518f41fbb9e73126b48373e

removing this L31-L33 and here L-48-50 makes the memory footprint smaller.

removing sqlx::Pool::acquire makes the problem dissappear, same if you just call route directly.
I tried to simulate sqlx::Pool::acquire with a few async fns that do things across await boundaries but couldn't replicate the complexity. I'm sure it's possible though.

Thanks @camelid for the hint about using expanded, I did do that before but didn't think to remove parts of tracing.

@Emilgardis
Copy link
Contributor Author

Emilgardis commented Sep 15, 2021

Managed to make it 0-dep 🎉

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=a4aa190792b33bbe5e687cdfcb39fc65

Compiles on the playground on stable, timeout/sigkill on current playground nightly (2021-09-14) and beta ofc

Note: Can still be minimized. Just noticed for example that check_conn can just be an Ok(()) body

edit: reduced more, noticeable better compiler performance though https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=90840273ed63d0a3c2b39596b77d396f

@Emilgardis
Copy link
Contributor Author

For future endeavors like this, I think it'd be very useful to develop a fuzz-like tool that in a working compiler:

  1. inlines all crates
  2. finds and removes unused types/paths
  3. (optionally) Sprinkle/replace todo!() in method bodies (and replacing method calls) in a fashion such that we work upwards from main, branching if needed.

That was basically the workflow I used to get to this.

@steffahn
Copy link
Member

steffahn commented Sep 15, 2021

It can't inline dependencies from other crates, but there's rust-reduce which I've used successfully multiple times. (In combination with creduce, as its docs recommend.)

@Emilgardis
Copy link
Contributor Author

It was not helpful when I tried it in this case, maybe it'll work better now that's it's all inlined.

@Emilgardis
Copy link
Contributor Author

@rustbot label +I-compiletime +I-compilemem -I-hang

@rustbot rustbot added I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-compiletime Issue: Problems and improvements with respect to compile times. and removed I-hang Issue: The compiler never terminates, due to infinite loops, deadlock, livelock, etc. labels Sep 16, 2021
@hkratz
Copy link
Contributor

hkratz commented Sep 16, 2021

I only get timeouts with beta/nightly if I use the default ''Run'', but not if I choose ''Build''.

With Build and nightly I get:

warning: `playground` (lib) generated 18 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 2.31s

for the first link and

warning: `playground` (lib) generated 24 warnings
    Finished dev [unoptimized + debuginfo] target(s) in 5.41s

for the second.

With stable the times are 0.82s and and 4.83s but the timings vary widely on playground that this does not help much.

Why does Run time out only for beta/nightly while Build works for all?

@Emilgardis
Copy link
Contributor Author

The regressed commit is currently not on stable, only beta and nightly, it will be promoted to stable soon. Hopefully if a fix can be found it can be beta backported

@trevor-crypto
Copy link

trevor-crypto commented Sep 16, 2021

@hkratz Not sure if this is helpful, but I was able to reproduce the latest minimal playground locally by making a new bin crate and compiling --release with the latest nightly toolchain with debug output. I can only see the hang when logs are output, otherwise it just take a few seconds.

Eventually the logs get to:

│ │ │ │ │ │ │ │ ├─2560ms DEBUG rustc_trait_selection::traits::query::normalize QueryNormalizer: obligations = [
│ │ │ │ │ │ │ │ │     Obligation(predicate=Binder(OutlivesPredicate('_#0r, RePlaceholder(Placeholder { universe: U1, name: BrAnon(0) })),
 []), depth=0),
│ │ │ │ │ │ │ │ │     Obligation(predicate=Binder(OutlivesPredicate('_#1r, RePlaceholder(Placeholder { universe: U1, name: BrAnon(1) })),
 []), depth=0),
│ │ │ │ │ │ │ │ │     Obligation(predicate=Binder(OutlivesPredicate('_#2r, RePlaceholder(Placeholder { universe: U2, name: BrAnon(0) })),
...

Where the last line repeats thousands and thousands of times

@Emilgardis
Copy link
Contributor Author

Last one I think, the more I minimize it, the quicker it compiles (but still slower than 2021-08-25).
https://gist.github.com/Emilgardis/a93b1d38a3b3c8b33fc526b1f1075945

@apiraino
Copy link
Contributor

Assigning priority as discussed in the Zulip thread of the Prioritization Working Group.

@rustbot label -I-prioritize +P-medium

@rustbot rustbot added P-medium Medium priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Sep 22, 2021
@jackh726
Copy link
Member

status update: I've been trying to look into this. Given the complexity of the MCVE, it's very difficult. When running the MCVE with RUSTC_LOG=rustc_trait_selection,rustc_infer and letting it run for minutes, here is just a random snippet of the last few lines of output: https://gist.github.com/jackh726/a6e52b8817c66e1a4503a9ebdbf5c8cc

A couple things to note:

  • We've created 22847 region variables
  • We've created 4359 universes for placeholder regions.
  • There are some extremely gnarly types here.
  • Given that I see a bunch of late-bound regions, it's not surprising that we might be doing a normalization somewhere that we weren't doing before.

jackh726 added a commit to jackh726/rust that referenced this issue Sep 26, 2021
Turns out, this has some really bad perf implications in large types (issue rust-lang#88862). While we technically can handle them fine, it doesn't change test output either way. For now, revert with an added benchmark. Future attempts to change this back will have to consider perf.
@bors bors closed this as completed in 2b6ed3b Sep 27, 2021
@Emilgardis
Copy link
Contributor Author

As a sanity check, I compiled with #89285 and it has fixed the issue in my workspace, thank you @jackh726 for looking into this!

@jackh726
Copy link
Member

It's a bit unfortunate I wasn't able to find a "real" fix, but as Niko said, a decent stopgap measure.

jackh726 added a commit to jackh726/rust that referenced this issue Sep 30, 2021
Turns out, this has some really bad perf implications in large types (issue rust-lang#88862). While we technically can handle them fine, it doesn't change test output either way. For now, revert with an added benchmark. Future attempts to change this back will have to consider perf.
ehuss pushed a commit to ehuss/rust that referenced this issue Oct 4, 2021
Don't normalize opaque types with escaping late-bound regions

Fixes rust-lang#88862

Turns out, this has some really bad perf implications in large types (issue rust-lang#88862). While we technically can handle them fine, it doesn't change test output either way. For now, revert with an added benchmark. Future attempts to change this back will have to consider perf.

Needs a perf run once rust-lang/rustc-perf#1033 is merged

r? `@nikomatsakis`
@RalfJung
Copy link
Member

Looks like the stopgap measure is causing problems... at least if the analysis in rust-lang/miri#2433 is correct, the 'fix' is causing ICEs in Miri when rustc is built with debug assertions.

bors added a commit to rust-lang-ci/rust that referenced this issue Sep 22, 2022
…ound-vars, r=lcnr

Normalize opaques w/ bound vars

First, we reenable normalization of opaque types with escaping late bound regions to fix rust-lang/miri#2433. This essentially reverts rust-lang#89285.

Second, we mitigate the perf regression found in rust-lang#88862 by simplifying the way that we relate (sub and eq) GeneratorWitness types.

This relies on the fact that we construct these GeneratorWitness types somewhat particularly (with all free regions found in the witness types replaced with late bound regions) -- but those bound regions really should be treated as existential regions, not universal ones. Those two facts leads me to believe that we do not need to use the full `higher_ranked_sub` machinery to relate two generator witnesses. I'm pretty confident that this is correct, but I'm glad to discuss this further.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-compiletime Issue: Problems and improvements with respect to compile times. ICEBreaker-Cleanup-Crew Helping to "clean up" bugs with minimal examples and bisections P-medium Medium priority regression-from-stable-to-beta Performance or correctness regression from stable to beta.
Projects
None yet
Development

Successfully merging a pull request may close this issue.