Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiling hello world on RPi spends 4 seconds in coherence checking #22068

Closed
kmcallister opened this issue Feb 7, 2015 · 8 comments · Fixed by #25323
Closed

Compiling hello world on RPi spends 4 seconds in coherence checking #22068

kmcallister opened this issue Feb 7, 2015 · 8 comments · Fixed by #25323
Labels
I-compiletime Issue: Problems and improvements with respect to compile times.

Comments

@kmcallister
Copy link
Contributor

My rustc is from @japaric's nightlies. It's pretty slow overall, but what stands out is that we spend nearly 4s on coherence checking.

pi@inky /tmp $ rustc --version
rustc 1.0.0-dev (d3732a12e 2015-02-06 23:30:17 +0000)
pi@inky /tmp $ cat hello-world.rs 
fn main() {
    println!("Hello, world!");
}
pi@inky /tmp $ time rustc -Z time-passes hello-world.rs
time: 0.004     parsing
time: 0.000     recursion limit
time: 0.008     gated macro checking
time: 0.000     configuration 1
time: 0.001     crate injection
time: 0.127     plugin loading
time: 0.000     plugin registration
time: 0.069     expansion
time: 0.000     complete gated feature checking
time: 0.000     configuration 2
time: 0.000     maybe building test harness
time: 0.000     prelude injection
time: 0.000     checking that all macro invocations are gone
time: 0.000     assigning node ids and indexing ast
time: 0.000     external crate/lib resolution
time: 0.002     language item collection
time: 0.176     resolution
time: 0.000     lifetime resolution
time: 0.000     looking for entry point
time: 0.000     looking for plugin registrar
time: 0.000     region resolution
time: 0.000     loop checking
time: 0.000     static item recursion checking
time: 0.003     type collecting
time: 0.000     variance inference
time: 3.853     coherence checking
time: 0.025     type checking
time: 0.069     check static items
time: 0.000     const marking
time: 0.000     const checking
time: 0.001     privacy checking
time: 0.000     stability index
time: 0.000     intrinsic checking
time: 0.000     effect checking
time: 0.001     match checking
time: 0.000     liveness checking
time: 0.105     borrow checking
time: 0.008     rvalue checking
time: 0.000     reachability checking
time: 0.000     death checking
time: 0.000     stability checking
time: 0.000     unused lib feature checking
time: 0.002     lint checking
time: 0.000     resolving dependency formats
time: 0.201     translation
  time: 0.003   llvm function passes
  time: 0.003   llvm module passes
  time: 0.122   codegen passes
  time: 0.007   codegen passes
time: 0.243     LLVM passes
  time: 6.600   running linker
time: 6.611     linking

real    0m12.263s
user    0m7.510s
sys     0m0.720s
@kmcallister kmcallister added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Feb 7, 2015
@jdm
Copy link
Contributor

jdm commented Feb 8, 2015

cc me

@kmcallister
Copy link
Contributor Author

Above test is on a fairly heavily loaded machine. With light load, it's 2.6 seconds in coherence, total real time 8.4s, user time 7.4s.

@japaric
Copy link
Member

japaric commented Feb 8, 2015

Yeah, single core ARM devices are pretty slow. On my beaglebone, I get:

> rustc -Z time-passes hello-world.rs | grep -v '0.000'
time: 0.002     parsing
time: 0.042     plugin loading
time: 0.023     expansion
time: 0.001     assigning node ids and indexing ast
time: 0.063     resolution
time: 0.001     type collecting
time: 1.166     coherence checking
time: 0.015     type checking
time: 0.020     check static items
time: 0.032     borrow checking
time: 0.005     rvalue checking
time: 0.001     lint checking
time: 0.083     translation
  time: 0.003   llvm function passes
  time: 0.002   llvm module passes
  time: 0.056   codegen passes
  time: 0.005   codegen passes
time: 0.112     LLVM passes
  time: 3.648   running linker
time: 3.656     linking
rustc -Z time-passes hello-world.rs  4.83s user 0.69s system 98% cpu 5.588 total

The RPi does spend more relative time in coherence checking in comparison to other devices, but all of them spend a significant amount of time there:

Raspberry Pi: 2.6s out of 8.4s (31%) (your numbers)
Beaglebone: 1.166s out of 5.588s (21%)
Odroid XU: 0.264s out of 1.261s (21%)
Core i7 laptop: 0.051s out of 0.349s (14.6%)

No idea why though. Given that the three ARM devices are executing the same binary/instructions, perhaps is just a difference in CPU models (ARMv6 vs ARMv7a)?

@dotdash
Copy link
Contributor

dotdash commented Feb 22, 2015

This might be due to the RBML/EBML decoding code that looks like it would cause lots of unaligned loads. Older ARMs probably handle those worse than newer ones.

@kmcallister
Copy link
Contributor Author

Oh, interesting! Is coherence checking a hot spot for metadata parsing?

@dotdash
Copy link
Contributor

dotdash commented Feb 22, 2015

I guess it's not in general, but the metadata is loaded on demand, and coherence seems to be the first thing that wants some.

@emberian emberian added I-compiletime Issue: Problems and improvements with respect to compile times. and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels Mar 25, 2015
@eddyb
Copy link
Member

eddyb commented May 9, 2015

I've hit this on my rid3 demo: loading all 120 .rs files (a crate each) in rust-by-example at once used up around 2GB of RAM.

Turns out coherence checking is using 16MB per crate which uses std, even if it has no impl blocks, and that also takes some time.

I've looked at it, and I don't know what could be using all the external impls: coherence::overlap checking already loads all the impls it needs (which would be none, in this case).

Here are "desktop" timings for fn main(){}:

time: 0.000     parsing
time: 0.000     recursion limit
time: 0.000     configuration 1
time: 0.000     gated macro checking
time: 0.000     crate injection
time: 0.019     macro loading
time: 0.000     plugin loading
time: 0.000     plugin registration
time: 0.003     expansion
time: 0.000     complete gated feature checking 1
time: 0.000     configuration 2
time: 0.000     maybe building test harness
time: 0.000     prelude injection
time: 0.000     checking that all macro invocations are gone
time: 0.000     complete gated feature checking 2
time: 0.000     assigning node ids and indexing ast
time: 0.000     external crate/lib resolution
time: 0.000     language item collection
time: 0.002     resolution
time: 0.000     lifetime resolution
time: 0.000     looking for entry point
time: 0.000     looking for plugin registrar
time: 0.000     region resolution
time: 0.000     loop checking
time: 0.000     static item recursion checking
time: 0.000     type collecting
time: 0.000     variance inference
time: 0.368     coherence checking
time: 0.000     type checking
time: 0.000     const checking
time: 0.000     privacy checking
time: 0.000     stability index
time: 0.000     intrinsic checking
time: 0.000     effect checking
time: 0.000     match checking
time: 0.000     liveness checking
time: 0.000     borrow checking
time: 0.000     rvalue checking
time: 0.000     reachability checking
time: 0.000     death checking
time: 0.000     stability checking
time: 0.000     unused lib feature checking
time: 0.000     lint checking
time: 0.000     resolving dependency formats
time: 0.002     translation
  time: 0.000   llvm function passes
  time: 0.000   llvm module passes
  time: 0.004   codegen passes
  time: 0.000   codegen passes
time: 0.014     LLVM passes
  time: 0.721   running linker
time: 0.722     linking

@eddyb
Copy link
Member

eddyb commented May 11, 2015

So it seems the "load all impls ever" logic in coherence is a workaround for the fact that csearch::each_implementation_for_trait, used by populate_implementations_for_trait_if_necessary, doesn't look in every crate, but only the crate where the trait was defined.

However, my initial fix doesn't actually seem to load any impls outside of the trait's definition crate, and I have no idea where else to look.

EDIT: turns out I was looking for a tag which wasn't getting encoded in the list of all the impls in a crate. Should have a patch soon, if nothing else comes up.

bors added a commit that referenced this issue May 12, 2015
The loop to load all the known impls from external crates seems to have been used because `ty::populate_implementations_for_trait_if_necessary` wasn't doing its job, and solely relying on it resulted in loading only impls in the same crate as the trait.

Coherence for `librustc` was reduced from 18.310s to 0.610s, from stage1 to stage2.
Interestingly, type checking also went from 46.232s to 42.003s, though that could be noise or unrelated improvements.

On a smaller scale, `fn main() {}` now spends 0.003s in coherence instead of 0.368s, which fixes #22068.
It also peaks at only 1.2MB, instead of 16MB of heap usage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-compiletime Issue: Problems and improvements with respect to compile times.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants