
Binary size blowup presumably because of allocator integration #42808

Closed
alexbool opened this issue Jun 21, 2017 · 44 comments · Fixed by #44049
Labels
C-bug Category: This is a bug. P-high High priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexbool
Contributor

alexbool commented Jun 21, 2017

Between rustc 1.19.0-nightly (10d7cb44c 2017-06-18) and 1.20.0-nightly (445077963 2017-06-20), one of my binaries grew from 3.4M to 3.7M, an increase of approximately 8.8%. At first I was scared that this was fallout from my own PR (#42716), but that particular binary hardly uses C strings at all.
I examined a diff of the symbols produced by nm and saw that the binary compiled with 1.20.0-nightly (445077963 2017-06-20) has a lot more symbols like these:

  • _<alloc::raw_vec::RawVec<T, A>>::dealloc_buffer (14 occurrences)
  • _alloc::allocator::Alloc::alloc_array (57 occurrences)
  • _alloc::allocator::Alloc::realloc_array (28 occurrences)
  • _core::ptr::drop_in_place (a mighty 138 new occurrences)

This leads us to the possibility that #42313 is the culprit.

Current Situation

There are 2 reproducers, and I need to make some measurements:

@leonardo-m

The blowup in binary size is the minor difference. The main difference is in compile times and the run times of the resulting binary, which are quite a bit worse.

@Mark-Simulacrum Mark-Simulacrum added P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jun 22, 2017
@Mark-Simulacrum
Member

cc @alexcrichton @pnkfelix

@alexcrichton
Member

@alexbool can you enable us to reproduce this? I can investigate in parallel but bug reports tend to be much more readily fixable if there's a way to reproduce what caused the bug report!

@alexcrichton
Member

Looks like this isn't a regression for "hello world", or is likely just "in the noise":

$ rustc +stable -V
rustc 1.18.0 (03fc9d622 2017-06-06)
$ rustc +stable foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 316K Jun 21 18:47 foo
$ rustc +beta -V
rustc 1.19.0-beta.2 (a175ee509 2017-06-15)
$ rustc +beta foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 368K Jun 21 18:47 foo
$ rustc +nightly -V
rustc 1.20.0-nightly (622e7e648 2017-06-21)
$ rustc +nightly foo.rs && strip foo && ls -alh foo
-rwxrwxr-x 1 alex alex 324K Jun 21 18:47 foo

@alexbool
Contributor Author

alexbool commented Jun 22, 2017

I'll try to make a minimal reproducible example today or later this week

@brson brson added the regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. label Jun 22, 2017
@alexcrichton
Member

@alexbool any updates on this?

@leonardo-m or @hinaria do y'all have examples of this causing a size increase?

@alexbool
Contributor Author

Unfortunately I haven't had much time to dig into this. I'll try as soon as I get the chance.

@pnkfelix
Member

Here is a test case that may serve for exhibiting the space increase that @alexbool observed around this time.

(Or at least, on my Mac, building without optimizations, I am observing a size increase of about 1.08x on the stripped binaries. With optimizations, the size increase is about 1.05x)

// vecel.rs

use std::env;

fn main() {
    let mut args = env::args();
    args.next();
    let arg1 = args.next().unwrap();
    let num: usize = arg1.parse().unwrap();
    let mut v = Vec::new();
    for i in 0..num {
        v.push(i);
    }
    assert_eq!(v.len(), num);
}

Build script:

#!/bin/sh
# vecel.sh

f() {
    OUT=vecel.$(rustc --version | sed -e 's/(//' -e 's/)//' | cut -d ' ' -f 4).bin
    rustc --version && rustc ~/Dev/Rust/vecel.rs -o $OUT && strip $OUT
}

for day in 19 20 21 22 23 24 25 26 27; do
    rustup default nightly-2017-06-$day;
    f;
done

CHAN=stable;  rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
CHAN=beta;    rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin
CHAN=nightly; rustc +$CHAN ~/Dev/Rust/vecel.rs -o vecel.$CHAN.bin && strip vecel.$CHAN.bin

ls -alh /tmp/vecel.*.bin

Results:

-rwxr-xr-x  1 fklock  wheel   253K Jun 27 13:35 /tmp/vecel.2017-06-18.bin
-rwxr-xr-x  1 fklock  wheel   253K Jun 27 13:35 /tmp/vecel.2017-06-19.bin
-rwxr-xr-x  1 fklock  wheel   266K Jun 27 13:35 /tmp/vecel.2017-06-20.bin
-rwxr-xr-x  1 fklock  wheel   266K Jun 27 13:35 /tmp/vecel.2017-06-21.bin
-rwxr-xr-x  1 fklock  wheel   274K Jun 27 13:35 /tmp/vecel.2017-06-22.bin
-rwxr-xr-x  1 fklock  wheel   274K Jun 27 13:35 /tmp/vecel.2017-06-23.bin
-rwxr-xr-x  1 fklock  wheel   274K Jun 27 13:35 /tmp/vecel.2017-06-24.bin
-rwxr-xr-x  1 fklock  wheel   274K Jun 27 13:35 /tmp/vecel.2017-06-25.bin
-rwxr-xr-x  1 fklock  wheel   274K Jun 27 13:35 /tmp/vecel.2017-06-26.bin
-rwxr-xr-x  1 fklock  wheel   269K Jun 27 13:35 /tmp/vecel.beta.bin
-rwxr-xr-x  1 fklock  wheel   274K Jun 27 13:35 /tmp/vecel.nightly.bin
-rwxr-xr-x  1 fklock  wheel   253K Jun 27 13:35 /tmp/vecel.stable.bin

@pnkfelix
Member

My previous comment notwithstanding, it would be really great if @alexbool could provide either the original test case or a somewhat reduced version of it, since it is entirely possible that my (strawman) object-size microbenchmark does not actually reflect the real issues underpinning the size increase @alexbool is observing.

@leonardo-m

@alexcrichton, a little increase in binary size is not important to me. But I am seeing significant increases in the run-time of the binary. I've filed a different issue but the cause could be the same:

#42935

@pnkfelix pnkfelix self-assigned this Jun 27, 2017
@alexcrichton
Member

@pnkfelix oh note that I added a bunch of #[inline] in #42727 which may help with the perf related to that

@alexbool
Contributor Author

This pathological case produces an ~8.5% size increase in release mode between 2017-06-19 and 2017-06-23:
main.rs:

#[macro_use]
extern crate serde;
extern crate serde_yaml;

use std::fs::OpenOptions;

#[derive(Debug, Deserialize)]
pub struct Something {
    field1: Vec<S1>,
    field2: Vec<S2>,
    field3: Vec<S3>,
    field4: Vec<S4>,
    field5: Vec<S5>,
    field6: Vec<S6>,
    field7: Vec<S7>,
    field8: Vec<S8>,
    field9: Vec<S9>,
}

#[derive(Debug, Deserialize)]
pub struct S1(String);

#[derive(Debug, Deserialize)]
pub struct S2(String);

#[derive(Debug, Deserialize)]
pub struct S3(String);

#[derive(Debug, Deserialize)]
pub struct S4(String);

#[derive(Debug, Deserialize)]
pub struct S5(String);

#[derive(Debug, Deserialize)]
pub struct S6(String);

#[derive(Debug, Deserialize)]
pub struct S7(String);

#[derive(Debug, Deserialize)]
pub struct S8(String);

#[derive(Debug, Deserialize)]
pub struct S9(String);

fn main() {
    println!(
        "{:?}",
        serde_yaml::from_reader::<_, Something>(OpenOptions::new().open("whatever").unwrap())
    );
}

Cargo.toml:

[package]
name = "issue-42808"
version = "0.1.0"
authors = ["Alexander Bulaev <[email protected]>"]

[dependencies]
serde = { version = "=1.0.8", features = ["derive"] }
serde_yaml = "=0.7.1"

@pnkfelix
Member

pnkfelix commented Jun 29, 2017

I have a datapoint to provide.

For ease of experimentation, I made a small variation on alexbool's benchmark that avoids pulling in serde. (Basically, my goal was to have a single .rs file that could be independently compiled without any crate dependencies.) I think my benchmark is likely to get at the heart of what is problematic here (namely that the code-duplication from monomorphization of generics can amplify the effect of otherwise minor code size regressions).

Manually inspecting the output assembly and comparing what we generated before and after the Allocator API landed, I saw cases where we could/should be inlining more. So I tried comparing against the branch that is the basis for PR #42727, which adds some #[inline] directives we will definitely need in order to address execution-time regressions. Unfortunately, that inlining happens to make the overall code size here worse.

Another thing I noticed was that we seem to be spending a number of instructions just shuffling words around. It wasn't immediately obvious what all the causes were, but one clear candidate was the AllocErr struct that now needs to be passed around when we return the Err case. So as an experiment, I tried making the AllocErr struct zero-sized (i.e. removing all of its content, which is probably not what we want to do long term). I also added inlines of Layout::repeat and Layout::array.

rust variant                                              resulting code size (bytes)
just before Allocator API                                 298,840
just after Allocator API                                  315,792
post #42727                                               323,896
post #42727 + AllocErr zero-sized + more Layout inlines   311,296

So this shows that some of the code size regression can perhaps be blamed on the change to alloc's return type; it used to be *mut u8, now it is Result<*mut u8, AllocErr> (and I'm curious whether it goes down further if you make that Result<NonZero<*mut u8>, AllocErr>, though we have already been through arguments against using NonZero here...)

Update: I attempted to recreate my above results and discovered that I posted my original comment without finishing a crucial sentence: the 311,296 figure was gathered from a build that both made AllocErr zero-sized and marked more operations of Layout as #[inline].
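
For readers following along, here is a minimal sketch of the signature change under discussion. This is a simplification I am adding for illustration, not the actual liballoc definitions of the time; only the shape of the return type matters here.

// Simplified, hypothetical sketch: contrast the allocation signatures before
// and after the Allocator API landed. Not the real liballoc code.
#![allow(dead_code)]

struct Layout {
    size: usize,
    align: usize,
}

// In the measured builds AllocErr carried data; the experiment above made it
// zero-sized to see how much of the regression came from moving it around.
struct AllocErr {
    request: Layout,
}

// Before: a raw pointer, with null signalling failure.
unsafe fn alloc_old(_size: usize, _align: usize) -> *mut u8 {
    std::ptr::null_mut()
}

// After: a Result whose Err payload must be constructed and shuffled through
// every failing call path.
unsafe fn alloc_new(layout: Layout) -> Result<*mut u8, AllocErr> {
    Err(AllocErr { request: layout })
}

fn main() {}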

@alexcrichton
Member

Thanks for the investigation and minimization @pnkfelix! I'm still pretty surprised by the code size increase there! I think it may be worth digging in a bit more to see where this "code bloat" is coming from. I would suspect that these are legitimately new locations for more Layout validation, but it would be good to validate that claim itself. If no one else gets around to this, I'll try to start investigating after #42727 lands.

@pnkfelix
Member

pnkfelix commented Jul 3, 2017

One thing that I keep seeing in the diffs, regardless of how I attempt to "correct" for whatever changes the Allocator trait injected: there continues to be a big increase in the number of core::ptr::drop_in_place definitions.

Building my benchmark just before the Allocator landed has 20 instantiated definitions of core::ptr::drop_in_place in its assembly output (many of them quite trivial and identical bits of code, sigh). Right after Allocator landed, there were 31 instantiated definitions of core::ptr::drop_in_place. But sometime after that, it blew up to 49 instantiated definitions. Not really sure when that happened.
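
As a minimal illustration (not taken from the benchmark) of why this count scales with the number of distinct types: each type that is dropped behind a pointer gets its own monomorphized drop_in_place instantiation.

// Illustration only: with two distinct Vec element types, the compiler emits
// (at least) two separate drop-glue symbols, roughly
// core::ptr::drop_in_place::<Vec<S1>> and core::ptr::drop_in_place::<Vec<S2>>.
// Adding more S_ variants multiplies these definitions in the binary.
struct S1(String);
struct S2(String);

fn main() {
    let a = vec![S1(String::from("a"))];
    let b = vec![S2(String::from("b"))];
    // Dropping each vector exercises its own drop_in_place instantiation.
    drop(a);
    drop(b);
}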

@alexcrichton
Member

Hm ok so now I'm getting interesting results again:

Using the foo.rs that @pnkfelix listed above, I get:

$ rustc +nightly -V
rustc 1.20.0-nightly (696412de7 2017-07-06)
$ rustc +nightly foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 483K Jul  7 08:29 foo
$ rustc +nightly-2017-06-01 foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 476K Jul  7 08:29 foo

However, after removing derive(Debug) I get:

$ rustc +nightly foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 459K Jul  7 08:30 foo
$ rustc +nightly-2017-06-01 foo.rs -O && strip -g ./foo && ls -alh foo
-rwxrwxr-x 1 alex alex 456K Jul  7 08:30 foo

Can this still be reproduced on the latest nightly?

@pnkfelix
Member

pnkfelix commented Jul 11, 2017

Just a quick note for anyone trying to figure out how the numbers have been jumping up and down (at least, I was initially flummoxed when trying to compare @alexcrichton's results with the numbers I provided 12 days ago).

The original bug report was comparing rustc 1.19.0-nightly (10d7cb44c 2017-06-18) against 1.20.0-nightly (445077963 2017-06-20). So it was a bit confusing that @alexcrichton used nightly-2017-06-01 as a reference point, because the object size changed quite a bit between May 31st and June 18th, as illustrated here (where foo.rs is the file @alexcrichton came up with that lacks derive(Debug)):

% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-01" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.19.0-nightly (e0cc22b4b 2017-05-31)
-rwxr-xr-x 1 pnkfelix pnkfelix 450K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-19" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.19.0-nightly (10d7cb44c 2017-06-18)
-rwxr-xr-x 1 pnkfelix pnkfelix 392K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-06-21" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (445077963 2017-06-20)
-rwxr-xr-x 1 pnkfelix pnkfelix 416K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-07-07" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (696412de7 2017-07-06)
-rwxr-xr-x 1 pnkfelix pnkfelix 453K Jul 11 18:01 /tmp/foo
% ( OUT=/tmp/foo; rm -f $OUT && RUSTC="rustc +nightly-2017-07-11" ; $RUSTC ~/Dev/Rust/foo.rs -O -o $OUT && strip -g $OUT && $RUSTC --version && ls -balh $OUT )
rustc 1.20.0-nightly (bf0a9e0b4 2017-07-10)
-rwxr-xr-x 1 pnkfelix pnkfelix 453K Jul 11 18:02 /tmp/foo
% 

In case it's hard to see, here's the important sequence of numbers:

date  5/31  6/18  6/20  7/07  7/11
size  450K  392K  416K  453K  453K

So, there was an observable regression (and it still seems to be present), but it is potentially masked by other improvements to code size that happened between May 31st and June 18th.

@alexcrichton
Member

Oh, I was actually just picking random dates; looks like I "got lucky"!

@pnkfelix
Member

pnkfelix commented Jul 13, 2017

Also, another factor that is easy to overlook: in the earlier comments from both @alexcrichton and myself, our first invocations of strip had no additional options. The later invocations used strip -g, which strips only debugging symbols. The difference appears to be quite significant, at least on the binaries we are producing today (on a current nightly compiling hello world, the former yields a 348K binary, the latter a 432K binary).

@pnkfelix
Member

In an effort to understand the size trajectories over time, I generalized the benchmark so that one can choose to compile it with 1, 2, 3, 11, or 21 distinct variations on the S_ type (and a corresponding number of vectors in the struct Something), and then I compiled each variant with every nightly version of the compiler from May 1st through July 13th.

My intention was to differentiate between a couple different things:

  1. How does the size of a trivial fn main() { } change over time?
  2. How much does that trivial program increase in size when you add the most basic instance of the struct Something that carries a single vector (with that same loop for filling its contents)?
  3. How much does that program increase in size when you have two distinctly-typed vectors in struct Something? How about three?
  4. As the number of variations of struct S_ (and Something fields and Vec<S_> instantiations) increases, what is the (amortized) cost per S_?

Here is a gist with the program, a shell script to drive it, and two sets of output transcripts (one using strip -g, the other using strip):

https://gist.github.com/pnkfelix/fe632a4bad0759c52e0001efde83e038

Here are some observations, based on looking at the development of the numbers over time (i.e., scanning down the columns):

  • The numbers sometimes report no binary size increase when one just goes from two S_ variations to three S_ variations. (I assume this can be explained by the presence of padding between function definition(s) that allows code to increase in size "for free" sometimes.)
    • This is probably a sign that I should switch to using objdump rather than ls for the fine-grained analysis.
  • Between (cfb5debbc 2017-06-12) and (03abb1bd7 2017-06-13), the baseline cost of a trivial fn main dropped quite a bit. This potentially masked later regressions, and it's been creeping back up since then.
  • Between (10d7cb44c 2017-06-18) and (3bfc18a96 2017-06-29) (i.e., when the trait Alloc API landed, plus some other things I have not yet identified), the amortized cost per S_ type variation jumped from 409 bytes to 1638 bytes.
    • That's when using strip; with strip -g, the jump was from 510 bytes to (about) 2000 bytes.
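
A hypothetical sketch of the kind of parameterized benchmark described above (the actual program is in the linked gist, and may look quite different): a macro stamps out N distinct S_ types plus a Something with one Vec per type, so the per-type cost can be read off as N grows.

// Hypothetical sketch, not the code from the gist: generate N distinct
// tuple-struct types and a `Something` holding one Vec per type, so binary
// size can be compared as the number of variants grows.
macro_rules! variants {
    ($something:ident; $($s:ident : $field:ident),*) => {
        $( #[derive(Debug)] pub struct $s(String); )*

        #[derive(Debug, Default)]
        pub struct $something {
            $( $field: Vec<$s>, )*
        }
    };
}

variants!(Something; S1: field1, S2: field2, S3: field3);

fn main() {
    let mut s = Something::default();
    s.field1.push(S1("a".to_string()));
    s.field2.push(S2("b".to_string()));
    s.field3.push(S3("c".to_string()));
    println!("{:?}", s);
}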

@alexcrichton
Member

Thanks for the scripts and investigation @pnkfelix! Lots of good data.

Between (cfb5deb 2017-06-12) and (03abb1b 2017-06-13), the baseline cost of a trivial fn main dropped quite a bit. This potentially masked later regressions, and it's been creeping back up since then.

This range of commits notably includes #42566, which probably just means that jemalloc 4.0.0 is a smaller binary than jemalloc 4.5.0.

I tweaked the script/source file to account for this and always link the system allocator, and got these results, which notably show the same regression point you saw, and that the addition of #[inline] didn't actually help and may have made things worse!
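
For reference, opting into the system allocator on nightlies of that era was typically done along these lines (a nightly-only mechanism at the time, and not taken from the modified script; the mechanism has since changed):

// Nightly-only at the time of this issue: link the system allocator instead
// of the bundled jemalloc, so allocator size differences drop out of the
// binary-size comparison.
#![feature(alloc_system)]
extern crate alloc_system;

fn main() {
    let v: Vec<u32> = (0..16).collect();
    assert_eq!(v.len(), 16);
}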

@alexcrichton alexcrichton added regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-nightly Performance or correctness regression from stable to nightly. labels Jul 23, 2017
@aturon
Member

aturon commented Jul 26, 2017

cc @rust-lang/compiler: with @pnkfelix away, we're going to need some help tackling this one.

@alexcrichton
Member

Thanks for the analysis @arielb1! It sounds to me, though, like there's not necessarily a huge amount to do here. The jemalloc bloat isn't something that should be a problem (small projects turn it off anyway), and the dealloc-related "unwind bloat" is more of a correctness fix right now than anything else. Also, if projects care about binary size, they're probably compiling with -C panic=abort.

That just leaves 8k of unexplained size along with 3k of OOM-related stuff, both of which seem quite reasonable and ripe for future optimization.

In that sense, with no clearly actionable bug remaining, is it time to close this?

@arielb1
Contributor

arielb1 commented Aug 22, 2017

In which cases can __rust_dealloc unwind? Doesn't it just call the jemalloc function, which can never unwind?

@arielb1
Contributor

arielb1 commented Aug 22, 2017

So I think the 8k bloat was caused by other changes from 1143eb2 to 6f4ab94, and I don't think there's anything interesting left for me to investigate.

I'm leaving this open for T-libs to decide what to do about it.

@arielb1 arielb1 removed the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Aug 22, 2017
@alexcrichton
Member

Seeing how no global allocator right now can unwind (system/jemalloc), I've proposed #44049 to fix the regression. I'll open a follow-up issue for further discussion of unwinding allocators.

@alexcrichton alexcrichton modified the milestone: 1.20 Aug 23, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue Aug 23, 2017
This commit flags all allocation-related functions in liballoc as "this can't
unwind" which should largely resolve the size-related issues found on rust-lang#42808.
The documentation on the trait was updated with such a restriction (they can't
panic) as well as some other words about the relative instability about
implementing a bullet-proof allocator.

Closes rust-lang#42808
@alexcrichton
Member

This is fixed on beta, removing from milestone.

@alexcrichton alexcrichton removed this from the 1.20 milestone Aug 24, 2017
@alexcrichton alexcrichton added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. regression-from-stable-to-beta Performance or correctness regression from stable to beta. and removed regression-from-stable-to-beta Performance or correctness regression from stable to beta. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. labels Aug 28, 2017
@alexcrichton alexcrichton added this to the 1.21 milestone Aug 28, 2017
alexcrichton added a commit to alexcrichton/rust that referenced this issue Aug 28, 2017
This commit flags all allocation-related functions in liballoc as "this can't
unwind" which should largely resolve the size-related issues found on rust-lang#42808.
The documentation on the trait was updated with such a restriction (they can't
panic) as well as some other words about the relative instability about
implementing a bullet-proof allocator.

Closes rust-lang#42808
bors added a commit that referenced this issue Aug 29, 2017
std: Mark allocation functions as nounwind

This commit flags all allocation-related functions in liballoc as "this can't
unwind" which should largely resolve the size-related issues found on #42808.
The documentation on the trait was updated with such a restriction (they can't
panic) as well as some other words about the relative instability about
implementing a bullet-proof allocator.

Closes #42808
alexcrichton added a commit to alexcrichton/rust that referenced this issue Sep 1, 2017
This commit flags all allocation-related functions in liballoc as "this can't
unwind" which should largely resolve the size-related issues found on rust-lang#42808.
The documentation on the trait was updated with such a restriction (they can't
panic) as well as some other words about the relative instability about
implementing a bullet-proof allocator.

Closes rust-lang#42808