Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialize incr comp structures to file via fixed-size buffer #80463

Merged
merged 2 commits into from
Jan 12, 2021

Conversation

tgnottingham
Copy link
Contributor

@tgnottingham tgnottingham commented Dec 29, 2020

Reduce a large memory spike that happens during serialization by writing
the incr comp structures to file by way of a fixed-size buffer, rather
than an unbounded vector.

Effort was made to keep the instruction count close to that of the
previous implementation. However, buffered writing to a file inherently
has more overhead than writing to a vector, because each write may
result in a handleable error. To reduce this overhead, arrangements are
made so that each LEB128-encoded integer can be written to the buffer
with only one capacity and error check. Higher-level optimizations in
which entire composite structures can be written with one capacity and
error check are possible, but would require much more work.

The performance is mostly on par with the previous implementation, with
small to moderate instruction count regressions. The memory reduction is
significant, however, so it seems like a worth-while trade-off.

@rust-highfive
Copy link
Collaborator

r? @oli-obk

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Dec 29, 2020
@tgnottingham
Copy link
Contributor Author

@rustbot label T-compiler A-incr-comp I-compiletime I-compilemem

@rustbot rustbot added A-incr-comp Area: Incremental compilation I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 29, 2020
@tgnottingham
Copy link
Contributor Author

Btw, if #80115 merges and this is rebased on and adapted for it, it should also reduce the impact on instruction count.

@rust-log-analyzer
Copy link
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
configure: rust.channel         := nightly
configure: rust.debug-assertions := True
configure: llvm.assertions      := True
configure: dist.missing-tools   := True
configure: build.configure-args := ['--enable-sccache', '--disable-manage-submodu ...
configure: writing `config.toml` in current directory
configure: 
configure: run `python /checkout/x.py --help`
configure: 
---
   Compiling chalk-derive v0.36.0
    Checking chalk-ir v0.36.0
    Checking tracing v0.1.19
    Checking rustc_index v0.0.0 (/checkout/compiler/rustc_index)
error[E0425]: cannot find function `write_signed_leb128` in this scope
   --> compiler/rustc_serialize/tests/leb128.rs:36:9
    |
36  |         write_signed_leb128(&mut stream, x);
    |         ^^^^^^^^^^^^^^^^^^^ help: a function with a similar name exists: `read_signed_leb128`
   ::: /checkout/compiler/rustc_serialize/src/leb128.rs:123:1
    |
    |
123 | pub fn read_signed_leb128(data: &[u8], start_position: usize) -> (i128, usize) {
    | ------------------------------------------------------------------------------ similarly named function `read_signed_leb128` defined here
error[E0308]: mismatched types
  --> compiler/rustc_serialize/tests/leb128.rs:10:32
   |
   |
3  | / macro_rules! impl_test_unsigned_leb128 {
4  | |     ($test_name:ident, $write_fn_name:ident, $read_fn_name:ident, $int_ty:ident) => {
5  | |         #[test]
6  | |         fn $test_name() {
...  |
10 | |                 $write_fn_name(&mut stream, (3u64 << x) as $int_ty);
   | |                                ^^^^^^^^^^^ expected array of 3 elements, found struct `Vec`
22 | |     };
23 | | }
23 | | }
   | |_- in this expansion of `impl_test_unsigned_leb128!`
24 | 
25 |   impl_test_unsigned_leb128!(test_u16_leb128, write_u16_leb128, read_u16_leb128, u16);
   |   ------------------------------------------------------------------------------------ in this macro invocation
   |
   = note: expected mutable reference `&mut [MaybeUninit<u8>; 3]`
              found mutable reference `&mut Vec<_>`
error[E0308]: mismatched types
  --> compiler/rustc_serialize/tests/leb128.rs:10:32
   |
   |
3  | / macro_rules! impl_test_unsigned_leb128 {
4  | |     ($test_name:ident, $write_fn_name:ident, $read_fn_name:ident, $int_ty:ident) => {
5  | |         #[test]
6  | |         fn $test_name() {
...  |
10 | |                 $write_fn_name(&mut stream, (3u64 << x) as $int_ty);
   | |                                ^^^^^^^^^^^ expected array of 5 elements, found struct `Vec`
22 | |     };
23 | | }
23 | | }
   | |_- in this expansion of `impl_test_unsigned_leb128!`
...
26 |   impl_test_unsigned_leb128!(test_u32_leb128, write_u32_leb128, read_u32_leb128, u32);
   |   ------------------------------------------------------------------------------------ in this macro invocation
   |
   = note: expected mutable reference `&mut [MaybeUninit<u8>; 5]`
              found mutable reference `&mut Vec<_>`
error[E0308]: mismatched types
  --> compiler/rustc_serialize/tests/leb128.rs:10:32
   |
   |
3  | / macro_rules! impl_test_unsigned_leb128 {
4  | |     ($test_name:ident, $write_fn_name:ident, $read_fn_name:ident, $int_ty:ident) => {
5  | |         #[test]
6  | |         fn $test_name() {
...  |
10 | |                 $write_fn_name(&mut stream, (3u64 << x) as $int_ty);
   | |                                ^^^^^^^^^^^ expected array of 10 elements, found struct `Vec`
22 | |     };
23 | | }
23 | | }
   | |_- in this expansion of `impl_test_unsigned_leb128!`
...
27 |   impl_test_unsigned_leb128!(test_u64_leb128, write_u64_leb128, read_u64_leb128, u64);
   |   ------------------------------------------------------------------------------------ in this macro invocation
   |
   = note: expected mutable reference `&mut [MaybeUninit<u8>; 10]`
              found mutable reference `&mut Vec<_>`
error[E0308]: mismatched types
  --> compiler/rustc_serialize/tests/leb128.rs:10:32
   |
   |
3  | / macro_rules! impl_test_unsigned_leb128 {
4  | |     ($test_name:ident, $write_fn_name:ident, $read_fn_name:ident, $int_ty:ident) => {
5  | |         #[test]
6  | |         fn $test_name() {
...  |
10 | |                 $write_fn_name(&mut stream, (3u64 << x) as $int_ty);
   | |                                ^^^^^^^^^^^ expected array of 19 elements, found struct `Vec`
22 | |     };
23 | | }
23 | | }
   | |_- in this expansion of `impl_test_unsigned_leb128!`
...
28 |   impl_test_unsigned_leb128!(test_u128_leb128, write_u128_leb128, read_u128_leb128, u128);
   |   ---------------------------------------------------------------------------------------- in this macro invocation
   |
   = note: expected mutable reference `&mut [MaybeUninit<u8>; 19]`
              found mutable reference `&mut Vec<_>`
error[E0308]: mismatched types
  --> compiler/rustc_serialize/tests/leb128.rs:10:32
   |
   |
3  | / macro_rules! impl_test_unsigned_leb128 {
4  | |     ($test_name:ident, $write_fn_name:ident, $read_fn_name:ident, $int_ty:ident) => {
5  | |         #[test]
6  | |         fn $test_name() {
...  |
10 | |                 $write_fn_name(&mut stream, (3u64 << x) as $int_ty);
   | |                                ^^^^^^^^^^^ expected array of 5 elements, found struct `Vec`
22 | |     };
23 | | }
23 | | }
   | |_- in this expansion of `impl_test_unsigned_leb128!`
...
29 |   impl_test_unsigned_leb128!(test_usize_leb128, write_usize_leb128, read_usize_leb128, usize);
   |   -------------------------------------------------------------------------------------------- in this macro invocation
   |
   = note: expected mutable reference `&mut [MaybeUninit<u8>; 5]`
              found mutable reference `&mut Vec<_>`
error: aborting due to 6 previous errors

Some errors have detailed explanations: E0308, E0425.
For more information about an error, try `rustc --explain E0308`.
For more information about an error, try `rustc --explain E0308`.
error: could not compile `rustc_serialize`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed
command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "check" "--target" "i686-pc-windows-gnu" "-Zbinary-dep-depinfo" "-j" "16" "--release" "--locked" "--color" "always" "--features" " llvm" "--manifest-path" "/checkout/compiler/rustc/Cargo.toml" "--all-targets" "-p" "rustc-main" "-p" "rustc_driver" "-p" "rustc_parse" "-p" "rustc_lexer" "-p" "rustc_lint" "-p" "rustc_index" "-p" "rustc_macros" "-p" "rustc_parse_format" "-p" "rustc_attr" "-p" "rustc_trait_selection" "-p" "rustc_infer" "-p" "rustc_graphviz" "-p" "rustc_span" "-p" "rustc_arena" "-p" "rustc_serialize" "-p" "rustc_data_structures" "-p" "rustc_feature" "-p" "rustc_errors" "-p" "rustc_lint_defs" "-p" "rustc_metadata" "-p" "rustc_expand" "-p" "rustc_ast_passes" "-p" "rustc_middle" "-p" "rustc_query_system" "-p" "rustc_apfloat" "-p" "rustc_type_ir" "-p" "rustc_target" "-p" "rustc_session" "-p" "rustc_fs_util" "-p" "rustc_hir" "-p" "rustc_error_codes" "-p" "rustc_save_analysis" "-p" "rustc_interface" "-p" "rustc_ast_lowering" "-p" "rustc_traits" "-p" "rustc_resolve" "-p" "rustc_builtin_macros" "-p" "rustc_symbol_mangling" "-p" "rustc_mir_build" "-p" "rustc_codegen_llvm" "-p" "rustc_llvm" "-p" "rustc_incremental" "-p" "rustc_privacy" "-p" "rustc_typeck" "-p" "rustc_ty_utils" "-p" "rustc_passes" "-p" "rustc_ast" "-p" "rustc_hir_pretty" "-p" "rustc_mir" "-p" "coverage_test_macros" "-p" "rustc_plugin_impl" "-p" "rustc_ast_pretty" "-p" "rustc_codegen_ssa" "--message-format" "json-render-diagnostics"
failed to run: /checkout/obj/build/bootstrap/debug/bootstrap check --target=i686-pc-windows-gnu --host=i686-pc-windows-gnu --all-targets
Build completed unsuccessfully in 0:01:12

@jyn514
Copy link
Member

jyn514 commented Dec 29, 2020

I would be curious to see how much this helps with #79103.

@lqd
Copy link
Member

lqd commented Dec 29, 2020

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@bors
Copy link
Contributor

bors commented Dec 29, 2020

⌛ Trying commit d2eb5f51983c8d7e73d8a14a8e25bc3e540268b9 with merge 9afe98c1ed87fbe0ee26f93e23a265c09372baf6...

@bors
Copy link
Contributor

bors commented Dec 29, 2020

☀️ Try build successful - checks-actions
Build commit: 9afe98c1ed87fbe0ee26f93e23a265c09372baf6 (9afe98c1ed87fbe0ee26f93e23a265c09372baf6)

@rust-timer
Copy link
Collaborator

Queued 9afe98c1ed87fbe0ee26f93e23a265c09372baf6 with parent d75f48e, future comparison URL.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 29, 2020
@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (9afe98c1ed87fbe0ee26f93e23a265c09372baf6): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 29, 2020
@tgnottingham
Copy link
Contributor Author

tgnottingham commented Dec 29, 2020

The results are more extreme than what I see in local benchmarks. E.g. the 20% reduction I mentioned in the summary turned into a 30% reduction, but the instruction counts are worse. I'll investigate.

@tgnottingham
Copy link
Contributor Author

I was hoping to reproduce the above perf results locally, but it's too much of a chore to figure out what's causing the difference. It's not that important anyway, provided it's not actually a correctness issue. I got some assurance that it isn't UB that's somehow causing the perf difference, by running some tests locally under Miri.

I pushed couple of changes: a fix to an existing bug in the signed LEB128 decoding (it had no perf impact, and may not have been encountered in practice), an increase to the FileEncoder buffer size (in case it helps performance), and modifications to address @cjgillot's reviews (thank you!).

Regarding performance, I'd like to see how #80115 changes things before trying anything else. Though if anyone can see something that will make buffered file writing more competitive with the previous implemention, I'd gladly take a hint.

@jyn514
Copy link
Member

jyn514 commented Dec 30, 2020

Though if anyone can see something that will make buffered file writing more competitive with the previous implemention, I'd gladly take a hint.

Have you considered using mmap directly? That would need some unsafe code, but it shouldn't require an intermediate copy at all.

@tgnottingham
Copy link
Contributor Author

Have you considered using mmap directly? That would need some unsafe code, but it shouldn't require an intermediate copy at all.

I hadn't. I have some reservations, but it's worth experimenting with.

@tgnottingham
Copy link
Contributor Author

tgnottingham commented Dec 31, 2020

I implemented an mmap-based version and tested it locally. I didn't optimize, because I just wanted to get a ballpark idea of the kind of improvement we could see. It would need to be significant to justify using mmap, IMO.

The numbers weren't very encouraging -- up to 7% worse in instruction count. The increase is probably due to some complexity in handling data that ends up straddling one mmap'd portion of the file and the next. That likely resulted in less inlining and worse assembly.

I'm sure it could result in an improvement if optimized, but I just don't want to go down that path, personally.

@tgnottingham
Copy link
Contributor Author

@bors try @rust-timer queue

@rust-timer
Copy link
Collaborator

Awaiting bors try build completion.

@bors
Copy link
Contributor

bors commented Jan 10, 2021

⌛ Trying commit cce99b3bdd4c695026fda94fea711e5f0d7c7ee5 with merge 007744ee92eae311ea94065ead37aa138ea30e30...

@bors
Copy link
Contributor

bors commented Jan 10, 2021

☀️ Try build successful - checks-actions
Build commit: 007744ee92eae311ea94065ead37aa138ea30e30 (007744ee92eae311ea94065ead37aa138ea30e30)

@rust-timer
Copy link
Collaborator

Queued 007744ee92eae311ea94065ead37aa138ea30e30 with parent 7cf2056, future comparison URL.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 10, 2021
@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (007744ee92eae311ea94065ead37aa138ea30e30): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 10, 2021
@tgnottingham
Copy link
Contributor Author

Perf result confirms that larger buffer doesn't help.

@tgnottingham
Copy link
Contributor Author

I'm looking into the overhead of error handling. From the assembly, it seems like there's unnecessary overhead in Result creation, but I could definitely be missing something. My assembly knowledge isn't great.

On first glance, it looks like creation of the Ok result is polluted with operations only relevant to creation of the Err result from the failed flush() path. But on further inspection, it looks like that work isn't useful for the Err result path, either.

Here is the assembly for emit_u32:

             push   %rbp
             push   %rbx
             push   %rax
             mov    %esi,%ebp
             mov    0x8(%rdi),%rbx
             mov    0x10(%rbx),%rax
             lea    0x5(%rax),%rcx
             cmp    0x8(%rbx),%rcx
   /-------- ja
/--|-------> mov    (%rbx),%rcx
|  |         add    %rax,%rcx
|  |         mov    $0x1,%edx
|  |         cmp    $0x80,%ebp
|  |  /----- jb
|  |  |      mov    %ebp,%esi
|  |  |  /-> mov    %esi,%edi
|  |  |  |   or     $0x80,%bpl
|  |  |  |   mov    %bpl,(%rcx)
|  |  |  |   shr    $0x7,%esi
|  |  |  |   add    $0x1,%rcx
|  |  |  |   add    $0x1,%rdx
|  |  |  |   mov    %esi,%ebp
|  |  |  |   cmp    $0x3fff,%edi
|  |  |  \-- ja
|  |  |      mov    %esi,%ebp
|  |  \----> mov    %bpl,(%rcx)
|  |         add    %rax,%rdx
|  |         mov    %rdx,0x10(%rbx)
|  |         mov    $0x3,%al          // Start of Result creation in Ok path
|  |  /----> shld   $0x8,%rcx,%rdx    // Continuation of Result creation in Ok and Err paths
|  |  |      shl    $0x8,%rcx
|  |  |      movzbl %al,%eax
|  |  |      or     %rcx,%rax
|  |  |      add    $0x8,%rsp
|  |  |      pop    %rbx
|  |  |      pop    %rbp
|  |  |      retq   
|  \--|----> mov    %rbx,%rdi
|     |      callq  *0x168146d(%rip) # <<rustc_serialize::opaque::FileEncoder>::flush@@Base+0xe7b158>
|     |      cmp    $0x3,%al
|     |  /-- jne
|     |  |   xor    %eax,%eax
\-----|--|-- jmp
      |  \-> mov    %rax,%rcx         // Start of Result creation in Err path (when flush() fails)
      |      shrd   $0x8,%rdx,%rcx
      |      shr    $0x8,%rdx
      \----- jmp

The result is a Result<(), std::io::Error>, which is 16 bytes. I'm under the impression the result is returned through rax and rdx.

The formation of the Result in the successful path:

|  |         mov    $0x3,%al        // Set Ok variant, I assume.
|  |  /----> shld   $0x8,%rcx,%rdx  // Seems irrelevant to Ok Result.
|  |  |      shl    $0x8,%rcx       // Seems irrelevant to Ok Result.
|  |  |      movzbl %al,%eax        // eax = Ok variant, zero-extended.
|  |  |      or     %rcx,%rax       // rax = Ok variant plus seemingly irrelevant stuff in rcx.

I think the above could be replaced with something like mov $0x3,%rax, and all the irrelevant junk moved to where it's relevant...except I don't think it's relevant to anything anyway, which leads me to the next part.

The formation of the Result in the unsuccessful flush path:

// ...just called flush, and found that result in rax was Err...

      |  \-> mov    %rax,%rcx
      |      shrd   $0x8,%rdx,%rcx
      |      shr    $0x8,%rdx
      \----- jmp

// ...jumps to the shld from previously quoted block...

|  |  /----> shld   $0x8,%rcx,%rdx
|  |  |      shl    $0x8,%rcx
|  |  |      movzbl %al,%eax
|  |  |      or     %rcx,%rax

IIUC, this sequence doesn't change rax and rdx, and the change to rcx isn't useful. And that would make sense, since this is the case where the Err result gets propagated. It's the same type, and I think the value is already in rax and rdx, so no work is necessary. I think all of this but the jmp could be removed, and its target changed to the instruction following the or.

If all the above is correct, fixing this would remove 4 instructions from the hot path, and 7 from the cold. That alone would help, since the hot code is really hot, but it might also unlock other optimizations. Anyway, I'm hoping my reasoning is right, and I'll continue looking into this.

@cjgillot
Copy link
Contributor

Should we merge this PR for the memory improvement, and let the perf side for another PR?

@oli-obk
Copy link
Contributor

oli-obk commented Jan 11, 2021

Should we merge this PR for the memory improvement, and let the perf side for another PR?

I'm fine with that, but if @tgnottingham still has motivation to munch away at the perf part, I'm not going to stop them 😆

Reduce a large memory spike that happens during serialization by writing
the incr comp structures to file by way of a fixed-size buffer, rather
than an unbounded vector.

Effort was made to keep the instruction count close to that of the
previous implementation. However, buffered writing to a file inherently
has more overhead than writing to a vector, because each write may
result in a handleable error. To reduce this overhead, arrangements are
made so that each LEB128-encoded integer can be written to the buffer
with only one capacity and error check. Higher-level optimizations in
which entire composite structures can be written with one capacity and
error check are possible, but would require much more work.

The performance is mostly on par with the previous implementation, with
small to moderate instruction count regressions. The memory reduction is
significant, however, so it seems like a worth-while trade-off.
The signed LEB128 decoding function used a hardcoded constant of 64
instead of the number of bits in the type of integer being decoded,
which resulted in incorrect results for some inputs. Fix this, make the
decoding more consistent with the unsigned version, and increase the
LEB128 encoding and decoding test coverage.
@tgnottingham
Copy link
Contributor Author

Rebased and returned to using default buffer size.

Should we merge this PR for the memory improvement, and let the perf side for another PR?

I'd love that. :)

I'll create an issue for the Result problem (if I can prove there is one). I suspect it's an LLVM optimization issue, and it'll be a while before I can get up to speed on even the basics of LLVM.

@jyn514
Copy link
Member

jyn514 commented Jan 12, 2021

@bors r=oli-obk rollup=never

@bors
Copy link
Contributor

bors commented Jan 12, 2021

📌 Commit f15fae8 has been approved by oli-obk

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 12, 2021
@bors
Copy link
Contributor

bors commented Jan 12, 2021

⌛ Testing commit f15fae8 with merge 8234db5...

@bors
Copy link
Contributor

bors commented Jan 12, 2021

☀️ Test successful - checks-actions
Approved by: oli-obk
Pushing 8234db5 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jan 12, 2021
@bors bors merged commit 8234db5 into rust-lang:master Jan 12, 2021
@rustbot rustbot added this to the 1.51.0 milestone Jan 12, 2021
@tgnottingham
Copy link
Contributor Author

Created #81146 for the Result propagation performance issue.

@tgnottingham tgnottingham deleted the incr_comp_serial_mem_usage branch January 20, 2021 18:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-incr-comp Area: Incremental compilation I-compilemem Issue: Problems and improvements with respect to memory usage during compilation. I-compiletime Issue: Problems and improvements with respect to compile times. merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.