Don't implement mem::replace with mem::swap. #83022

m-ou-se · 2021-03-11T18:06:47Z

swap is a complicated operation, so this changes the implementation of replace to use read and write instead.

I wrote there:

Implementing the simpler operation (replace) with the much more complicated operation (swap) doesn't make a whole lot of sense. replace is just read+write, and the primitive for moving out of a &mut. swap is for doing that to two &mut at the same time, which is both more niche and more complicated (as shown by swap_nonoverlapping_bytes).

This could be especially interesting for Option<VeryLargeStruct>::take(), since swapping such a large structure with swap_nonoverlapping_bytes is going to be much less efficient than ptr::write()'ing a None.

But also for small values where swap just reads/writes using temporary variable, this makes a replace or take operation simpler:

rust-highfive · 2021-03-11T18:06:50Z

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

m-ou-se · 2021-03-11T18:07:11Z

@bors try @rust-timer queue

rust-timer · 2021-03-11T18:07:12Z

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

bors · 2021-03-11T18:07:20Z

⌛ Trying commit bf27819 with merge 9acd471bcf643c584381c0ff22e4d38a7617f313...

library/core/src/mem/mod.rs

eddyb · 2021-03-11T18:32:37Z

FWIW I just tested this PR against Rust-GPU and it appears to handle the for i in 0..n motivation, from #83019's description.

bors · 2021-03-11T18:56:27Z

☀️ Try build successful - checks-actions
Build commit: 9acd471bcf643c584381c0ff22e4d38a7617f313 (9acd471bcf643c584381c0ff22e4d38a7617f313)

rust-timer · 2021-03-11T18:56:29Z

Queued 9acd471bcf643c584381c0ff22e4d38a7617f313 with parent 5c6d3bf, future comparison URL.

nagisa · 2021-03-11T19:32:20Z

since swapping such a large structure with swap_nonoverlapping_bytes is going to be much less efficient than ptr::write()'ing a None.

I don't believe that's true – assigning a None is a disciminant write, but ptr::write will need to copy over all the size_of::<Option<VeryLargeStruct>>() bytes of the type regardless of whether the value is a None or not, won't it?

nagisa · 2021-03-11T19:39:57Z

x86_64, aarch64 assembly-level comparison.

m-ou-se · 2021-03-11T19:41:08Z

@nagisa Nope, that just writes the tag of the Option, and nothing else. (Just like write()'ing a MaybeUninit::uninit() writes nothing at all.)

But looks like llvm is also smart enough to do that with the more compicated swap code: https://godbolt.org/z/TM1bcv

khyperia · 2021-03-11T20:14:22Z

I don't believe that's true – assigning a None is a disciminant write, but ptr::write will need to copy over all the size_of::<Option<VeryLargeStruct>>() bytes of the type regardless of whether the value is a None or not, won't it?

x86_64, aarch64 assembly-level comparison.

Looking at the x86_64 assembly, changing it to be this shows off how the ptr::write version optimizes the None case, where the swap version doesn't.

pub fn replace_large(dest: &mut Option<[u8; 127]>) -> Option<[u8; 127]> {
    replace(dest, None)
}

But looks like llvm is also smart enough to do that with the more compicated swap code

Not smart enough in all cases, I guess? No idea how this case is different than yours, though.

edit: discovered this weird thing, I mistakenly thought it was relevant but I missed something ugh. big shrug~

godbolt link

pub fn speedy_fast(x: &mut Option<[u64; 10]>) -> Option<[u64; 10]> {
    std_replace(x, None)
}
// copied from std::mem::replace, just to be clear what's happening (same behavior with std::mem::replace)
fn std_replace(dest: &mut Option<[u64; 10]>, mut src: Option<[u64; 10]>) -> Option<[u64; 10]> {
    std::mem::swap(dest, &mut src);
    src
}

// manually inlined, should be the same? but isn't
pub fn big_slow_write(x: &mut Option<[u64; 10]>) -> Option<[u64; 10]> {
    let mut blah = None;
    std::mem::swap(x, &mut blah);
    blah
}

rust-timer · 2021-03-11T21:45:08Z

Finished benchmarking try commit (9acd471bcf643c584381c0ff22e4d38a7617f313): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

m-ou-se · 2021-03-11T21:50:11Z

🎉

Quite a significant improvement! Everything is green, except for one +174% that pushed one average into the red. I suppose that's just a random outlier:

m-ou-se · 2021-03-11T21:56:01Z

r? @eddyb

klensy · 2021-03-11T22:26:55Z

Can it rerun perf without try?

@rust-timer queue include=46449

rust-timer · 2021-03-11T22:26:56Z

Insufficient permissions to issue commands to rust-timer.

nagisa · 2021-03-12T12:51:31Z

@rust-timer queue include=46449

I sure have permissions, lets see what happens.

EDIT: Looks like no, you can't rerun perf withouta another try build :(

nagisa · 2021-03-12T13:10:36Z

If nothing else, optimizing a simpler chunk of code is easier on the backend. While I sympathize with @SimonSapin's point about centralization of the unsafe code, I think replace and similar such heavily used functions benefit from such targeted simplifications a… lot.

The instruction count/walltime regression on the incr-patched: empty 3072 test appears to be almost entirely within LLVM's LLVM_lto_optimize and similar areas, so I don't expect much difference at all from a rerun. My guess here is that LLVM simply ends up making different decisions (e.g. giving replace or something similar a higher inline score) when optimizing the newly patched code and thus hitting one of its non-linear areas of code with a significantly larger amount of code.

bors · 2021-03-12T13:47:37Z

☀️ Try build successful - checks-actions
Build commit: 6537b60dde68a3362e54e06e38e45445cd1e920d (6537b60dde68a3362e54e06e38e45445cd1e920d)

rust-timer · 2021-03-12T13:47:39Z

Queued 6537b60dde68a3362e54e06e38e45445cd1e920d with parent 0cc64a3, future comparison URL.

rust-timer · 2021-03-12T13:49:23Z

Finished benchmarking try commit (6537b60dde68a3362e54e06e38e45445cd1e920d): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf

nagisa · 2021-03-12T13:50:31Z

Well, we're good to go, I guess!

@bors r+

bors · 2021-03-12T13:50:32Z

📌 Commit bf27819 has been approved by nagisa

SimonSapin · 2021-03-12T15:01:58Z

For what it’s worth I wasn’t saying in #83019 that we shouldn’t change anything, only that the current situation was not entirely accidental.

These macro-level results look great. Since this effectively undoes #40454 for replace it might be interesting to also look at micro-level effects, especially for large values. Unfortunately I don’t find code to reproduce the benchmarks in #40454.

bors · 2021-03-12T23:27:26Z

⌛ Testing commit bf27819 with merge 46a934a...

bors · 2021-03-13T02:08:27Z

☀️ Test successful - checks-actions
Approved by: nagisa
Pushing 46a934a to master...

jordens · 2021-04-21T15:48:12Z

Thanks for this fix! Example where this (I think) was painful: taking and replacing packet size data in Options:
https://godbolt.org/z/5dPh8eoWa

m-ou-se · 2021-04-21T20:25:04Z

Thanks for this fix! Example where this (I think) was painful: taking and replacing packet size data in Options:
https://godbolt.org/z/5dPh8eoWa

Nice, that's a huge difference! Even with --opt-level set to z or s, it makes a big difference.

…iler-errors Use `load`+`store` instead of `memcpy` for small integer arrays I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway. After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments. LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s. In general, replacing `memcpy`s with typed instructions is hard in the middle-end -- even for `memcpy.inline` where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712). So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations. rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing during enabling mir inlining (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now. The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them. I don't want to try to bite off too much in this PR, though. (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.) Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv> Before: ```llvm define void `@swap_rgb48_old(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 { %a.i = alloca [3 x i16], align 2 call void `@llvm.lifetime.start.p0(i64` 6, ptr nonnull %a.i) call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false) tail call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false) call void `@llvm.memcpy.p0.p0.i64(ptr` noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false) call void `@llvm.lifetime.end.p0(i64` 6, ptr nonnull %a.i) ret void } ``` Note it going to stack: ```nasm swap_rgb48_old: # `@swap_rgb48_old` movzx eax, word ptr [rdi + 4] mov word ptr [rsp - 4], ax mov eax, dword ptr [rdi] mov dword ptr [rsp - 8], eax movzx eax, word ptr [rsi + 4] mov word ptr [rdi + 4], ax mov eax, dword ptr [rsi] mov dword ptr [rdi], eax movzx eax, word ptr [rsp - 4] mov word ptr [rsi + 4], ax mov eax, dword ptr [rsp - 8] mov dword ptr [rsi], eax ret ``` Now: ```llvm define void `@swap_rgb48(ptr` noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 { start: %0 = load <3 x i16>, ptr %x, align 2 %1 = load <3 x i16>, ptr %y, align 2 store <3 x i16> %1, ptr %x, align 2 store <3 x i16> %0, ptr %y, align 2 ret void } ``` still lowers to `dword`+`word` operations, but has no stack traffic: ```nasm swap_rgb48: # `@swap_rgb48` mov eax, dword ptr [rdi] movzx ecx, word ptr [rdi + 4] movzx edx, word ptr [rsi + 4] mov r8d, dword ptr [rsi] mov dword ptr [rdi], r8d mov word ptr [rdi + 4], dx mov word ptr [rsi + 4], cx mov dword ptr [rsi], eax ret ``` And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM changes in IR ```llvm define void `@replace_short_array(ptr` noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 { start: %1 = load <3 x i32>, ptr %r, align 4 store <3 x i32> %1, ptr %0, align 4 %2 = load <3 x i32>, ptr %v, align 4 store <3 x i32> %2, ptr %r, align 4 ret void } ``` but that lowers to reasonable `dword`+`qword` instructions still ```nasm replace_short_array: # `@replace_short_array` mov rax, rdi mov rcx, qword ptr [rsi] mov edi, dword ptr [rsi + 8] mov dword ptr [rax + 8], edi mov qword ptr [rax], rcx mov rcx, qword ptr [rdx] mov edx, dword ptr [rdx + 8] mov dword ptr [rsi + 8], edx mov qword ptr [rsi], rcx ret ```

Optimize `ptr::replace` rust-lang#83022 optimized `mem::replace` to reduce the number of `memcpy`s. `ptr::replace`, which is [documented to behave just like `mem::replace`](https://doc.rust-lang.org/nightly/std/ptr/fn.replace.html), was not optimized however, leading to [worse code](https://godbolt.org/z/T3hdEEdfe) and missed optimizations. This PR simply forwards `ptr::replace` to `mem::replace` to take advantage of the better implementation.

Rollup merge of rust-lang#122601 - joboet:ptr_replace, r=workingjubilee Optimize `ptr::replace` rust-lang#83022 optimized `mem::replace` to reduce the number of `memcpy`s. `ptr::replace`, which is [documented to behave just like `mem::replace`](https://doc.rust-lang.org/nightly/std/ptr/fn.replace.html), was not optimized however, leading to [worse code](https://godbolt.org/z/T3hdEEdfe) and missed optimizations. This PR simply forwards `ptr::replace` to `mem::replace` to take advantage of the better implementation.

Don't implement mem::replace with mem::swap.

bf27819

rust-highfive assigned kennytm Mar 11, 2021

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 11, 2021

This comment has been minimized.

Sign in to view

rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 11, 2021

This comment has been minimized.

Sign in to view

eddyb reviewed Mar 11, 2021

View reviewed changes

library/core/src/mem/mod.rs Show resolved Hide resolved

camelid added I-slow Issue: Problems and improvements with respect to performance of generated code. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Mar 11, 2021

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 11, 2021

rust-highfive assigned eddyb and unassigned kennytm Mar 11, 2021

rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Mar 12, 2021

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 12, 2021

nagisa mentioned this pull request Mar 12, 2021

core: disable ptr::swap_nonoverlapping_one's block optimization on SPIR-V. #83019

Merged

bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 13, 2021

bors merged commit 46a934a into rust-lang:master Mar 13, 2021

rustbot added this to the 1.52.0 milestone Mar 13, 2021

eddyb mentioned this pull request Mar 15, 2021

Unlock for i in 0..n loops! (by bumping rust-toolchain to nightly-2021-03-17) EmbarkStudios/rust-gpu#493

Merged

m-ou-se deleted the mem-replace-no-swap branch March 16, 2021 19:02

jordens mentioned this pull request Apr 22, 2021

Inefficient array initialization when moving data into an Option #84426

Open

scottmcm mentioned this pull request Jun 24, 2022

Speed up ptr::replace #98197

Closed

scottmcm mentioned this pull request Mar 4, 2023

mem::replace don't actually mutate the destination on Android #49282

Closed

scottmcm mentioned this pull request May 26, 2023

Use load+store instead of memcpy for small integer arrays #111999

Merged

eddyb mentioned this pull request Oct 1, 2023

What guarantees are provided about whether the reference/stdlib docs are valid on target architectures? rust-lang/unsafe-code-guidelines#461

Open

RalfJung mentioned this pull request Oct 26, 2023

Refactor unsafe implementation of std::mem::replace #117189

Closed

joboet mentioned this pull request Mar 16, 2024

Optimize ptr::replace #122601

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't implement mem::replace with mem::swap. #83022

Don't implement mem::replace with mem::swap. #83022

m-ou-se commented Mar 11, 2021 •

edited

Loading

rust-highfive commented Mar 11, 2021

m-ou-se commented Mar 11, 2021

rust-timer commented Mar 11, 2021

This comment has been minimized.

bors commented Mar 11, 2021

This comment has been minimized.

eddyb commented Mar 11, 2021

bors commented Mar 11, 2021

rust-timer commented Mar 11, 2021

nagisa commented Mar 11, 2021 •

edited

Loading

nagisa commented Mar 11, 2021 •

edited

Loading

m-ou-se commented Mar 11, 2021

khyperia commented Mar 11, 2021 •

edited

Loading

rust-timer commented Mar 11, 2021

m-ou-se commented Mar 11, 2021

m-ou-se commented Mar 11, 2021

klensy commented Mar 11, 2021

rust-timer commented Mar 11, 2021

nagisa commented Mar 12, 2021 •

edited

Loading

nagisa commented Mar 12, 2021 •

edited

Loading

bors commented Mar 12, 2021

rust-timer commented Mar 12, 2021

rust-timer commented Mar 12, 2021

nagisa commented Mar 12, 2021

bors commented Mar 12, 2021

SimonSapin commented Mar 12, 2021

bors commented Mar 12, 2021

bors commented Mar 13, 2021

jordens commented Apr 21, 2021 •

edited

Loading

m-ou-se commented Apr 21, 2021

Don't implement mem::replace with mem::swap. #83022

Don't implement mem::replace with mem::swap. #83022

Conversation

m-ou-se commented Mar 11, 2021 • edited Loading

rust-highfive commented Mar 11, 2021

m-ou-se commented Mar 11, 2021

rust-timer commented Mar 11, 2021

This comment has been minimized.

bors commented Mar 11, 2021

This comment has been minimized.

eddyb commented Mar 11, 2021

bors commented Mar 11, 2021

rust-timer commented Mar 11, 2021

nagisa commented Mar 11, 2021 • edited Loading

nagisa commented Mar 11, 2021 • edited Loading

m-ou-se commented Mar 11, 2021

khyperia commented Mar 11, 2021 • edited Loading

rust-timer commented Mar 11, 2021

m-ou-se commented Mar 11, 2021

🎉

m-ou-se commented Mar 11, 2021

klensy commented Mar 11, 2021

rust-timer commented Mar 11, 2021

nagisa commented Mar 12, 2021 • edited Loading

nagisa commented Mar 12, 2021 • edited Loading

bors commented Mar 12, 2021

rust-timer commented Mar 12, 2021

rust-timer commented Mar 12, 2021

nagisa commented Mar 12, 2021

bors commented Mar 12, 2021

SimonSapin commented Mar 12, 2021

bors commented Mar 12, 2021

bors commented Mar 13, 2021

jordens commented Apr 21, 2021 • edited Loading

m-ou-se commented Apr 21, 2021

m-ou-se commented Mar 11, 2021 •

edited

Loading

nagisa commented Mar 11, 2021 •

edited

Loading

nagisa commented Mar 11, 2021 •

edited

Loading

khyperia commented Mar 11, 2021 •

edited

Loading

nagisa commented Mar 12, 2021 •

edited

Loading

nagisa commented Mar 12, 2021 •

edited

Loading

jordens commented Apr 21, 2021 •

edited

Loading