rustc (>= 1.20.0) fails to optimize moves in trivial cases #63631

Open
resilar opened this issue Aug 16, 2019 · 6 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-enhancement Category: An issue proposing an enhancement or a PR with one.
C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow Issue: Problems and improvements with respect to performance of generated code.
T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@resilar

resilar commented Aug 16, 2019

The example code below generates extra stack copies of String (meta)data in the function got(), which is expected to produce optimized code identical to expect(). To quickly verify the issue, compare the sub rsp, $FRAME_SIZE instructions that set up the stack frames at the beginning of the got() and expect() functions compiled with -C opt-level=3 (or measure and compare the generated code sizes). Rust Playground link.

rustc versions before 1.20.0 produce the expected optimized assembly.

pub fn got() -> String {
    let mut res = String::new();
    let s0 = String::from("foobar");
    res.push_str(&s0); let s0 = s0;
    res.push_str(&s0); let s0 = s0;
    res.push_str(&s0); let s0 = s0;
    res.push_str(&s0);
    res
}

pub fn expect() -> String {
    let mut res = String::new();
    let s0 = String::from("foobar");
    res.push_str(&s0);
    res.push_str(&s0);
    res.push_str(&s0);
    res.push_str(&s0);
    res
}

/*
 * s0: String is required; s0: &str generates correctly optimized assembly.
 * The res.push_str(...) line repetitions can be increased for a greater effect.
 * `let s0 = s0` is used for illustration purposes (variable names do not matter).
 */
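
For comparison, a sketch of the &str variant mentioned in the comment above (got_str is a hypothetical name, not part of the original test case); rebinding a &str copies only the fat pointer, so there is no String to move:

pub fn got_str() -> String {
    let mut res = String::new();
    let s0: &str = "foobar";
    // Rebinding a &str copies only the (pointer, length) pair, so no
    // String buffer or metadata needs to be moved on the stack.
    res.push_str(s0); let s0 = s0;
    res.push_str(s0); let s0 = s0;
    res.push_str(s0); let s0 = s0;
    res.push_str(s0);
    res
}
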
@resilar
Author

resilar commented Aug 16, 2019

Note that the corresponding C++ code is NOT expected to optimize, due to the braindead aliasing model of C++. Rust's aliasing rules are more reasonable, allowing more optimization opportunities, including the case presented above. In Rust, .push_str() is assumed to behave sanely, i.e., not to overwrite arbitrary memory even though it mutates data through &mut self. Therefore, the moves in the example code are expected to be optimized away, especially since move semantics and zero-cost abstractions are major selling points of Rust.
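
To spell out the guarantee being relied on, a minimal sketch (append_twice is a hypothetical helper, not part of the test case):

fn append_twice(dst: &mut String, src: &str) {
    // `dst` and `src` cannot alias: a &mut borrow is guaranteed disjoint
    // from any shared borrow. The optimizer may therefore assume the bytes
    // behind `src` are unchanged across the first call, so no defensive
    // copy of `src` is needed before the second call.
    dst.push_str(src);
    dst.push_str(src);
}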

The ultimate cause of the misoptimization may still be aliasing-related, given the long history of poorly optimized (or entirely broken) code involving LLVM's noalias attribute. Rust is the first LLVM frontend to apply noalias extensively, as a default attribute for all references, which explains the current awful state of noalias (for example, see #16515 and #38941).

Edit: -Zmutable-noalias=yes does not yield better optimization, so the issue is most likely unrelated to the noalias attribute.

However, relying on LLVM's move optimizations is unacceptable, in my opinion, even if the root cause here were an LLVM bug. As a Rust user, I expect rustc to eliminate unnecessary moves in the emitted LLVM IR (at least in easy cases), given the importance of move semantics as a core feature of the Rust language. Another pragmatic argument is that rustc has extensive contextual knowledge, so it can optimize moves more efficiently than LLVM.


TL;DR: LLVM may resolve this and similar optimization issues in the future, but rustc should still offer basic move optimizations of its own, because move semantics are a core feature of Rust.

@jonas-schievink jonas-schievink added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Aug 16, 2019
@andjo403
Contributor

After bisecting this example, I found that the regression starts with PR #42313.

@resilar
Author

resilar commented Jan 23, 2020

pub fn got() -> Vec<u32> {
    let mut res = Vec::new();
    let s1 = vec![1, 2, 3, 4];
    res.extend_from_slice(&s1); let s1 = s1;
    res.extend_from_slice(&s1); let s1 = s1;
    res.extend_from_slice(&s1); let s1 = s1;
    res.extend_from_slice(&s1);
    res
}

pub fn expect() -> Vec<u32> {
    let mut res = Vec::new();
    let s1 = vec![1, 2, 3, 4];
    res.extend_from_slice(&s1);
    res.extend_from_slice(&s1);
    res.extend_from_slice(&s1);
    res.extend_from_slice(&s1);
    res
}

Vec<u32> is misoptimized in a similar way. However, let s1 = &vec![1, 2, 3, 4] produces correctly optimized code, i.e., identical assembly for got() and expect(); see the sketch below.
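
A sketch of that borrowing variant (got_ref is a hypothetical name; rebinding a &Vec<u32> copies only the reference, so there is no Vec to move):

pub fn got_ref() -> Vec<u32> {
    let mut res = Vec::new();
    // The temporary Vec lives for the whole function; s1 merely borrows it.
    let s1 = &vec![1, 2, 3, 4];
    res.extend_from_slice(s1); let s1 = s1;
    res.extend_from_slice(s1); let s1 = s1;
    res.extend_from_slice(s1); let s1 = s1;
    res.extend_from_slice(s1);
    res
}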

@resilar
Author

resilar commented Feb 2, 2020

I managed to create a simplified test case using only the primitive types u64 and &u64. In this particular case, however, the problem is poor copy optimization rather than poor move optimization; possibly these optimization issues are related somehow. See this Rust Playground for a somewhat more verbose test case illustrating the bad move optimization with a struct MovingU64 { value: u64 } type (omitted in this comment for the sake of brevity, in favor of the more compact and straightforward u64/&u64 test case; a rough sketch of it follows below).
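
A rough sketch of that struct-based variant, reconstructed from the description above (the exact code lives in the linked playground and may differ in details):

struct MovingU64 { value: u64 }

#[no_mangle]
pub fn got_struct() {
    let x = MovingU64 { value: 0x0123456789ABCDEF };
    show_struct(&x); let x = x; // each rebinding moves the struct
    show_struct(&x); let x = x;
    show_struct(&x);
}

#[inline(never)]
fn show_struct(x: &MovingU64) {
    println!("(0x{:x})0x{:x}", x as *const _ as usize, x.value);
}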

Now, the simplified test case based on primitive types and borrows:

#[no_mangle]
pub fn got() {
    let x: u64 = 0x0123456789ABCDEF;
    show(&x); let x = x;
    show(&x); let x = x;
    show(&x);
}

#[no_mangle]
pub fn expect() {
    let x: u64 = 0x0123456789ABCDEF;
    show(&x);
    show(&x);
    show(&x);
}

#[no_mangle]
#[inline(never)]
fn show(x: &u64) { println!("(0x{:x})0x{:x}", x as *const _ as usize, x); }

The optimized assembly produced by rustc -C opt-level=3 -Z mir-opt-level=3 (version 1.42.0-nightly 2020-02-01) is shown below. Stable 1.38.0 (2019-03-23) optimizes neither copies nor moves properly.

$ rustc -C opt-level=3 -Z mir-opt-level=3 --crate-type=dylib poc.rs
$ r2 -qc 's sym.got;af;afv-*;pdf;s sym.expect;af;afv-*;pdf' libpoc.so
69: sym.got ();
0x000476e0  4156                  push r14
0x000476e2  53                    push rbx
0x000476e3  4883ec18              sub rsp, 0x18
0x000476e7  49beefcdab8967452301  movabs r14, 0x123456789abcdef
0x000476f1  4c89742408            mov qword [rsp + 8], r14
0x000476f6  488b1d5b410b00        mov rbx, qword [reloc.show]
0x000476fd  488d7c2408            lea rdi, [rsp + 8]
0x00047702  ffd3                  call rbx
0x00047704  4c893424              mov qword [rsp], r14
0x00047708  4889e7                mov rdi, rsp
0x0004770b  ffd3                  call rbx
0x0004770d  488b0424              mov rax, qword [rsp]
0x00047711  4889442410            mov qword [rsp + 0x10], rax
0x00047716  488d7c2410            lea rdi, [rsp + 0x10]
0x0004771b  ffd3                  call rbx
0x0004771d  4883c418              add rsp, 0x18
0x00047721  5b                    pop rbx
0x00047722  415e                  pop r14
0x00047724  c3                    ret
54: sym.expect ();
0x00047730  4156                  push r14
0x00047732  53                    push rbx
0x00047733  50                    push rax
0x00047734  48b8efcdab8967452301  movabs rax, 0x123456789abcdef
0x0004773e  48890424              mov qword [rsp], rax
0x00047742  4c8b350f410b00        mov r14, qword [reloc.show]
0x00047749  4889e3                mov rbx, rsp
0x0004774c  4889df                mov rdi, rbx
0x0004774f  41ffd6                call r14
0x00047752  4889df                mov rdi, rbx
0x00047755  41ffd6                call r14
0x00047758  4889df                mov rdi, rbx
0x0004775b  41ffd6                call r14
0x0004775e  4883c408              add rsp, 8
0x00047762  5b                    pop rbx
0x00047763  415e                  pop r14
0x00047765  c3                    ret

Rust Playground link. Running main() outputs:

(0x7ffd02fdf7e0)0x123456789abcdef
(0x7ffd02fdf7e8)0x123456789abcdef
(0x7ffd02fdf7f0)0x123456789abcdef
---------------------------------
(0x7ffd02fdf7f0)0x123456789abcdef
(0x7ffd02fdf7f0)0x123456789abcdef
(0x7ffd02fdf7f0)0x123456789abcdef

got() prints the first three lines, and expect() prints the last three. Notice that the addresses change between the show() calls of got(). This, together with the generated assembly, demonstrates that got() creates two unnecessary copies of the variable x on the stack.
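
For completeness, a main() consistent with the output above (assumed; the actual playground code may differ):

fn main() {
    got();
    println!("---------------------------------");
    expect();
}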

@klensy
Contributor

klensy commented Jan 23, 2022

Still an issue: https://godbolt.org/z/sKWPx7bon, 1.60.0-nightly (17d29dc 2022-01-21) example for the u64 type:

got:
        push    rbx
        sub     rsp, 32
        movabs  rax, 81985529216486895
        mov     qword ptr [rsp + 8], rax
        mov     rbx, qword ptr [rip + show@GOTPCREL]
        lea     rdi, [rsp + 8]
        call    rbx
        mov     rax, qword ptr [rsp + 8]
        mov     qword ptr [rsp + 16], rax
        lea     rdi, [rsp + 16]
        call    rbx
        mov     rax, qword ptr [rsp + 16]
        mov     qword ptr [rsp + 24], rax
        lea     rdi, [rsp + 24]
        call    rbx
        add     rsp, 32
        pop     rbx
        ret
expect:
        push    r14
        push    rbx
        push    rax
        movabs  rax, 81985529216486895
        mov     qword ptr [rsp], rax
        mov     r14, qword ptr [rip + show@GOTPCREL]
        mov     rbx, rsp
        mov     rdi, rbx
        call    r14
        mov     rdi, rbx
        call    r14
        mov     rdi, rbx
        call    r14
        add     rsp, 8
        pop     rbx
        pop     r14
        ret

@nikic
Contributor

nikic commented Jan 23, 2022

On the LLVM side, one of the problems here is that this does not optimize: https://llvm.godbolt.org/z/d8KP7rqKe

The pointer is not captured before the call, and the pointer is readonly at the call, so this would be safe. But LLVM currently doesn't distinguish between a capture before the call and a capture at the call.

@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023