Types and specifically "mentioned items" take up a lot of space in Rust metadata files #122936

Open
RalfJung opened this issue Mar 23, 2024 · 6 comments
Labels
A-const-eval Area: Constant evaluation (MIR interpretation) A-metadata Area: Crate metadata I-heavy Issue: Problems and improvements with respect to binary size of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@RalfJung
Member

#122568 was a significant size regression for Rust metadata. The extra information that is stored in MIR bodies now is a bunch of Ty<'tcx>, so in memory I think this is actually not that much. But it seems like in our metadata format this takes up quite a bit of space. This not only makes the library files bigger, it also accounts for a large fraction (I think even the majority) of the compile-time regression from that PR.

I don't know if there's something that can be done to improve this -- either by storing different information in mentioned_items that needs less space on disk, or by representing Ty<'tcx> more efficiently on disk. One drastic option would be to "intern" types in the on-disk format, i.e. have one global table of types that everything else just indexes into. That would certainly save space when the same type appears multiple times. I don't know if that is what happens here though.
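To make the "global table of types" idea concrete, here is a minimal sketch. It is not how rustc's metadata is laid out: `ToyTy` and `TypeTable` are hypothetical names invented for illustration, and only top-level types are interned. Each distinct type is stored once, and every use site stores just an index.

```rust
use std::collections::HashMap;

/// Toy stand-in for a type; rustc's real `Ty<'tcx>` is far richer.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
enum ToyTy {
    U32,
    Bool,
    Ref(Box<ToyTy>),
    Tuple(Vec<ToyTy>),
}

/// Hypothetical on-disk type table: each distinct type is stored once,
/// and everything else refers to it by a small index.
#[derive(Default)]
struct TypeTable {
    types: Vec<ToyTy>,
    index_of: HashMap<ToyTy, u32>,
}

impl TypeTable {
    /// Returns the index of `ty`, adding it to the table on first sight.
    fn intern(&mut self, ty: &ToyTy) -> u32 {
        if let Some(&idx) = self.index_of.get(ty) {
            return idx;
        }
        let idx = self.types.len() as u32;
        self.types.push(ty.clone());
        self.index_of.insert(ty.clone(), idx);
        idx
    }
}

fn main() {
    let mut table = TypeTable::default();
    let pair = ToyTy::Tuple(vec![ToyTy::U32, ToyTy::Ref(Box::new(ToyTy::Bool))]);
    // The same type mentioned several times maps to the same index.
    let mentioned = [pair.clone(), ToyTy::U32, pair.clone(), pair];
    let indices: Vec<u32> = mentioned.iter().map(|t| table.intern(t)).collect();
    println!("indices = {:?}, distinct types stored = {}", indices, table.types.len());
}
```

Note that even in this scheme every mention still costs one index on disk, so it only helps if repeated full encodings are the actual problem.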

@oli-obk
Contributor

oli-obk commented Mar 23, 2024

> One drastic option would be to "intern" types in the on-disk format, i.e. have one global table of types that everything else just indexes into. That would certainly save space when the same type appears multiple times. I don't know if that is what happens here though.

I thought that's what we already did. Anything that was already encoded will just get encoded as an offset to where the actual value was encoded before.
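As a rough illustration of that offset scheme, here is a toy encoder. The byte layout (a tag byte followed by fixed-width integers) and the names `ToyEncoder` and `encode_ty` are assumptions made up for this sketch, not rustc's actual format. The first time a type is seen it is written in full and its position is remembered; later occurrences write only a back-reference to that position.

```rust
use std::collections::HashMap;

/// Toy encoder: types are represented as strings purely for illustration.
struct ToyEncoder {
    buf: Vec<u8>,
    /// Maps an already-encoded type to the offset where it was first written.
    shorthands: HashMap<String, u32>,
}

impl ToyEncoder {
    fn new() -> Self {
        ToyEncoder { buf: Vec::new(), shorthands: HashMap::new() }
    }

    fn encode_ty(&mut self, ty: &str) {
        if let Some(&pos) = self.shorthands.get(ty) {
            // Seen before: tag 1, then the offset of the first encoding.
            self.buf.push(1);
            self.buf.extend_from_slice(&pos.to_le_bytes());
            return;
        }
        // First occurrence: tag 0, length, then the full payload.
        let pos = self.buf.len() as u32;
        self.buf.push(0);
        self.buf.extend_from_slice(&(ty.len() as u32).to_le_bytes());
        self.buf.extend_from_slice(ty.as_bytes());
        self.shorthands.insert(ty.to_string(), pos);
    }
}

fn main() {
    let mut e = ToyEncoder::new();
    e.encode_ty("Vec<Option<String>>");
    let after_first = e.buf.len();
    e.encode_ty("Vec<Option<String>>");
    let after_second = e.buf.len();
    println!(
        "first encoding: {} bytes, repeat: {} bytes",
        after_first,
        after_second - after_first
    );
}
```

The repeat is cheaper than the full encoding, but it is not free: every mention still writes a few bytes.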

@RalfJung
Member Author

Hm. Then either these lists are a lot bigger than I expected or it's not working somehow?

@RalfJung
Member Author

Yeah it certainly looks like there's a cache that avoids repeatedly encoding the same type:

    impl<'tcx, E: TyEncoder<I = TyCtxt<'tcx>>> Encodable<E> for Ty<'tcx> {
        fn encode(&self, e: &mut E) {
            encode_with_shorthand(e, self, TyEncoder::type_shorthands);
        }
    }

In that case, I have no idea why the size regressed here. Is there any way to figure out what is taking up that extra size?

@oli-obk
Contributor

oli-obk commented Apr 24, 2024

Lots of bodies with lots of mentioned items just adding up? Even if the items themselves are fairly small
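A back-of-the-envelope sketch of how that could add up. All inputs here are hypothetical, not measurements from rustc, and the model assumes each mentioned item costs roughly one LEB128-encoded back-reference (a variable-length integer using 7 payload bits per byte); the real per-item cost may differ.

```rust
/// Number of bytes needed to write `value` as unsigned LEB128
/// (7 payload bits per byte).
fn leb128_len(mut value: u64) -> usize {
    let mut len = 1;
    while value >= 0x80 {
        value >>= 7;
        len += 1;
    }
    len
}

/// Rough total: one back-reference per mentioned item, per body.
/// `typical_offset` is a hypothetical file offset being referenced.
fn estimated_bytes(bodies: u64, items_per_body: u64, typical_offset: u64) -> u64 {
    bodies * items_per_body * leb128_len(typical_offset) as u64
}

fn main() {
    // Hypothetical crate: 20,000 MIR bodies with 30 mentioned items each,
    // referring to offsets around 5,000,000 in the metadata file.
    let total = estimated_bytes(20_000, 30, 5_000_000);
    println!("about {} bytes for back-references alone", total);
}
```

Under these made-up numbers a few bytes per item already reach the megabyte range, which is consistent with small items adding up across many bodies.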

@oli-obk
Contributor

oli-obk commented Apr 24, 2024

I remember @saethlin doing some encoder debugging before. Got any ideas?

@saethlin
Member

saethlin commented Apr 25, 2024

Well I don't think that my encoder debugging rig is useful here; that's for finding what data is at some offset in the file.

But this doesn't look particularly complicated to understand by generating a file of inferno's folded stacks format: https://crates.io/crates/inferno. I've analyzed memory consumption of programs by turning strace -k -e mmap dumps into the folded stacks with this: https://github.com/saethlin/strace-flamegraph, so for this case I'd use backtrace to collect a backtrace from all the primitive write operations in FileEncoder and print a line of output that's the backtrace, semicolon-delimited, then a space and the number of bytes written. Pipe that into inferno-flamegraph and you should get a flamegraph of what is using up file size in metadata encoding. If you have two files, you can use inferno-diff-folded to get a diff-style flamegraph between the two.

In case it's not obvious, this sort of thing will be incredibly slow, and even just having the code compiled in might have prohibitive runtime overhead.
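A minimal sketch of the output side of this idea, to show the folded stacks format. To stay dependency-free it swaps real backtraces for manually pushed scope labels, which is a much cruder technique than the one described above. `SizeProfiler` and the scope names are hypothetical and do not correspond to anything in rustc's `FileEncoder`.

```rust
use std::collections::BTreeMap;

/// Accumulates bytes written per "stack", where the stack is a list of
/// manually entered scope labels (a stand-in for a captured backtrace).
#[derive(Default)]
struct SizeProfiler {
    scopes: Vec<&'static str>,
    bytes_by_stack: BTreeMap<String, u64>,
}

impl SizeProfiler {
    fn enter(&mut self, name: &'static str) {
        self.scopes.push(name);
    }

    fn exit(&mut self) {
        self.scopes.pop();
    }

    /// Call this from every primitive write operation.
    fn record_write(&mut self, bytes: &[u8]) {
        let key = self.scopes.join(";");
        *self.bytes_by_stack.entry(key).or_insert(0) += bytes.len() as u64;
    }

    /// One line per stack: frames joined by ';', a space, then the byte count.
    fn folded(&self) -> String {
        let mut out = String::new();
        for (stack, bytes) in &self.bytes_by_stack {
            out.push_str(&format!("{} {}\n", stack, bytes));
        }
        out
    }
}

fn main() {
    let mut p = SizeProfiler::default();
    p.enter("encode_crate");
    p.enter("encode_mir");
    p.enter("mentioned_items");
    p.record_write(&[0u8; 12]);
    p.exit();
    p.enter("basic_blocks");
    p.record_write(&[0u8; 40]);
    p.exit();
    p.exit();
    p.exit();
    print!("{}", p.folded());
}
```

The printed lines can be piped into inferno-flamegraph, and two such files can be compared with inferno-diff-folded. Aggregating per stack before printing, as done here, also keeps the output far smaller than emitting one line per write.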
