Future::and_then and other adapters are not always zero cost #37930

Open
japaric opened this issue Nov 22, 2016 · 12 comments
Labels
A-codegen (Area: Code generation)
A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@japaric (Member) commented Nov 22, 2016

I expected these two code snippets to produce the same (or very similar) code
when fully optimized (release + LTO), but they don't:

// first: two separate writes
let tx = busy_wait(tx.write(4));
busy_wait(tx.write(2));

// second: the writes chained with `and_then`
busy_wait(tx.write(4).and_then(|tx| tx.write(2)));

The first is panic!-free but the second is not: it still contains calls to
panic!("cannot poll a chained future twice") even though neither future is
ever polled again after it has yielded its value.
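
(For context, here is a rough sketch of where that panic comes from. This is
paraphrased, not the actual futures 0.1 source; the names are made up.)

// `and_then` builds a small state machine. Its "already finished" state has to
// do something if it is ever polled again, and that something is the panic.
enum ChainState<A, B> {
    First(A),   // still driving the first future
    Second(B),  // first future finished, now driving the second
    Done,       // both finished; polling again is a caller bug
}

impl<A, B> ChainState<A, B> {
    // Simplified stand-in for the real poll logic: the `Done` arm is the branch
    // that survives into the optimized binary, even though `busy_wait` never
    // polls a future again after it has completed.
    fn check(&self) {
        if let ChainState::Done = self {
            panic!("cannot poll a chained future twice");
        }
    }
}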

STR (steps to reproduce)

// src/main.rs
#![feature(lang_items)]
#![feature(never_type)]
#![no_main]
#![no_std]

extern crate futures;

use core::{fmt, ptr};

use futures::{Async, Future};

// entry point
#[no_mangle]
pub fn _start() -> ! {
    let tx = Tx { _0: () };

    busy_wait(tx.write(4).and_then(|tx| tx.write(2)));

    loop {}
}

struct Tx {
    _0: (),
}

impl Tx {
    fn write(self, byte: u8) -> Write {
        Write {
            tx: Some(self),
            byte: byte,
        }
    }
}

struct Write {
    byte: u8,
    tx: Option<Tx>,
}

impl Future for Write {
    type Error = !;
    type Item = Tx;

    fn poll(&mut self) -> Result<Async<Tx>, !> {
        unsafe {
            let tx = self.tx.take().expect("cannot poll `write` twice");

            // NOTE `0x0` and `0x4` emulate memory mapped IO registers
            // Can we send data yet?
            if ptr::read_volatile(0x0 as *const bool) {
                // Send one byte of data
                ptr::write_volatile(0x4 as *mut u8, self.byte);
                Ok(Async::Ready(tx))
            } else {
                self.tx = Some(tx);
                Ok(Async::NotReady)
            }
        }
    }
}

fn busy_wait<F>(mut f: F) -> F::Item
    where F: Future<Error = !>
{
    loop {
        if let Ok(Async::Ready(t)) = f.poll() {
            return t;
        }
    }
}

#[lang = "panic_fmt"]
extern "C" fn panic_fmt(_fmt: fmt::Arguments,
                        _file: &'static str,
                        _line: u32)
                        -> ! {
    // HACK to keep the `_file` string in the final binary
    unsafe { ptr::write_volatile(0x8 as *mut _, _file) }
    loop {}
}
$ head -n7 Cargo.toml
[dependencies.futures]
default-features = false
version = "0.1.3"

[profile.release]
lto = true
panic = "abort"

$ cargo rustc --release --verbose -- -C link-arg=-nostartfiles
$ objdump -Cd target/release/foo
00000000000002c0 <_start>:
 2c0:   b0 04                   mov    $0x4,%al
 2c2:   31 c9                   xor    %ecx,%ecx
 2c4:   66 66 66 2e 0f 1f 84    data16 data16 nopw %cs:0x0(%rax,%rax,1)
 2cb:   00 00 00 00 00
 2d0:   80 f9 01                cmp    $0x1,%cl
 2d3:   74 3b                   je     310 <_start+0x50>
 2d5:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
 2dc:   00 00 00 00
 2e0:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 2e7:   01
 2e8:   74 f6                   je     2e0 <_start+0x20>
 2ea:   88 04 25 04 00 00 00    mov    %al,0x4
 2f1:   84 c9                   test   %cl,%cl
 2f3:   75 3d                   jne    332 <_start+0x72>
 2f5:   b1 01                   mov    $0x1,%cl
 2f7:   b0 02                   mov    $0x2,%al
 2f9:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 300:   01
 301:   74 cd                   je     2d0 <_start+0x10>
 303:   c6 04 25 04 00 00 00    movb   $0x2,0x4
 30a:   02
 30b:   eb 23                   jmp    330 <_start+0x70>
 30d:   0f 1f 00                nopl   (%rax)
 310:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 317:   01
 318:   74 f6                   je     310 <_start+0x50>
 31a:   88 04 25 04 00 00 00    mov    %al,0x4
 321:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 328:   0f 1f 84 00 00 00 00
 32f:   00
 330:   eb fe                   jmp    330 <_start+0x70>
 332:   50                      push   %rax
 333:   e8 38 00 00 00          callq  370 <core::panicking::panic::h194ce5d68a8f28a1>
 338:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 33f:   00

0000000000000340 <core::panicking::panic_fmt::h561c5ee168a3d2cb>:
(..)
$ objcopy -O binary target/release/foo foo.bin

$ strings foo.bin
(..)
/home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.3/src/chain.rs

If the program is changed to the no-and_then variant, the disassembly looks
like this:

$ objdump -Cd target/release/foo
0000000000000280 <_start>:
 280:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 287:   01
 288:   74 f6                   je     280 <_start>
 28a:   c6 04 25 04 00 00 00    movb   $0x4,0x4
 291:   04
 292:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 299:   1f 84 00 00 00 00 00
 2a0:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 2a7:   01
 2a8:   74 f6                   je     2a0 <_start+0x20>
 2aa:   c6 04 25 04 00 00 00    movb   $0x2,0x4
 2b1:   02
 2b2:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 2b9:   1f 84 00 00 00 00 00
 2c0:   eb fe                   jmp    2c0 <_start+0x40>

There are no panic_fmt calls in it.

Meta

$ rustc -V
rustc 1.15.0-nightly (43006fcea 2016-11-15)

I don't think this is a problem with the implementation of AndThen/Chain in
the futures crate, because I have tried re-implementing Future in a few
different ways (e.g. without error handling, sprinkling #[inline] everywhere)
and none of that helps.

Code like this:

busy_wait(futures::done(Ok(42)).and_then(|x| futures::done(Ok(x))));

optimizes the same as the "split" version; in that case LLVM evaluates the
expression at compile time and replaces it with 42. So LLVM can actually lower
and_then to panic!-free code; it just doesn't optimize this particular case
well (perhaps because of the volatile memory operations?).

join is another adapter that has the same issue.
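
(Purely as an illustration, not a second verified reproducer: the analogous
`join` usage in the style of the STR above would be something like the
following.)

// Hypothetical: `Join` also tracks whether each half has already completed, so
// the optimized binary can likewise retain a "polled after completion" panic path.
let tx1 = Tx { _0: () };
let tx2 = Tx { _0: () };
let (_tx1, _tx2) = busy_wait(tx1.write(4).join(tx2.write(2)));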

cc @alexcrichton @eddyb

@alexcrichton (Member) commented:

Hilariously, this is not an issue at -Copt-level=2 (which is how I initially tried to reproduce it). It is an issue at -Copt-level=3.

@alexcrichton (Member) commented:

Opened #37933

@alexcrichton (Member) commented:

Whoa wow I totally thought this was in the futures-rs repo, not the rust-lang/rust repo! Ok, inlining #37933 into here:


Originally reported at #37930, minimized at https://gist.github.com/alexcrichton/86db63138ef979422d8fa10538dc6dcb, the code optimizes better at opt-level=2 than opt-level=3.

For example:

$ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=2 && grep 'call .* @_ZN4core9panicking5panic' main.ll
$ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=3 && grep 'call .* @_ZN4core9panicking5panic' main.ll
  tail call void @_ZN4core9panicking5panic17h194ce5d68a8f28a1E({ %str_slice, %str_slice, i32 }* noalias readonly dereferenceable(40) bitcast ({ %str_slice, %str_slice, i32, [4 x i8] }* @"_ZN47_$LT$main..Chain$LT$A$C$$u20$B$C$$u20$C$GT$$GT$4poll14_MSG_FILE_LINE17hf0dd0d6c8e2c1d26E" to { %str_slice, %str_slice, i32 }*))

That is, with opt-level=2 we can prove the code doesn't panic, but at opt-level=3 something goes awry and we don't prove that it doesn't panic.

Do we need to tweak our opt-level=3 pipeline? Has it diverged from clang? Is it inappropriate to run by default?

@alexcrichton added the A-codegen label on Nov 22, 2016
@japaric (Member, Author) commented Nov 22, 2016

I should also note that setting opt-level = 2 (in Cargo.toml's [profile.release]) has no effect if you are still using the futures crate as a dependency. In other words, it's necessary to inline the futures code into a single crate and set opt-level = 2.

@arielb1 (Contributor) commented Nov 22, 2016

I am investigating @alexcrichton's issue.

Some of the generated code ends up looking like:

  %f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ]
  %trunc.i.i.i = trunc i8 %f.sroa.0.0.ph.i to i2
  %switch.i = icmp eq i2 %trunc.i.i.i, 0
...
x-block:
  br i1 %switch.i, label %x2-block, label %x1-block
...
y-block:
  %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
  br i1 %cond.i.i.i, label %y1-block, label %y2-block

InstCombine manages to combine the two compares and transform the phi into a phi on booleans, which allows SCCP to later optimize out one of the conditionals.

At -C opt-level=3, LLVM goes through some convoluted path that involves maintaining a branch table of the form

  switch i2 %trunc.i.i, label %unreachable.i.i [
    i2 0, label %bb2.i.i
    i2 1, label %bb3.i.i
    i2 -2, label %bb1.i.i
  ]

across way too many passes, and ends up with a conditional of the form

  %f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ]
  %switch = icmp eq i8 %f.sroa.0.0.ph.i, 1
...
x-block:
  br i1 %switch, label %x1-block, label %x2-block
...
y-block:
  %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
  br i1 %cond.i.i.i, label %y1-block, label %y2-block

which it is then unable to optimize.

@arielb1 (Contributor) commented Nov 22, 2016

Minimal example (if both icmps have the same bitwise value, this optimizes to a switch on true/false; if they don't, it doesn't):

define i8 @_start(i8* %condition) {
entry-block:
  br label %bb1.preheader.i

bb1.preheader.i:
  %f.sroa.0.0.ph.i = phi i8 [ 1, %bb4.i.i.i.i ], [ 0, %entry-block ]
  %switch.i = icmp eq i8 %f.sroa.0.0.ph.i, 1
  br label %loop

loop:
  br i1 %switch.i, label %ret1, label %i101

i101:
  %0 = load volatile i8, i8* %condition
  %1 = icmp eq i8 %0, 0
  br i1 %1, label %bb4.i.i.i.i, label %loop

bb4.i.i.i.i:
  %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
  br i1 %cond.i.i.i, label %bb1.preheader.i, label %ret0

ret1:
  ret i8 1

ret0:
  ret i8 0
}

The doubly nested loop seems to be essential to getting the optimization to trigger. I wonder whether this is the case in the more complicated example too.

@arielb1 (Contributor) commented Nov 22, 2016

Sure, it looks like this issue is encountered in @japaric's posted no_std case too.

@arielb1 added the A-LLVM label on Nov 22, 2016
@alexcrichton (Member) commented:

Awesome, thanks for the minimization @arielb1!

@arielb1 added the I-slow label on Nov 22, 2016
@arielb1 (Contributor) commented Dec 9, 2016

I had the minimization correct, but I got confused when trying to find the pass to blame.

@Mark-Simulacrum added the C-enhancement label on Jul 26, 2017
@nox (Contributor) commented Apr 2, 2018

I'm confused. I can now grep _ZN4core9panicking5panic with both opt-level=2 and opt-level=3 when compiling the sample in @alexcrichton's comment. Did we regress further?

Cc @rust-lang/wg-codegen

@steveklabnik (Member) commented:

Triage: futures have changed a lot, and so the output is likely to have changed. I'm not great at futures internals yet, but someone wanting to reproduce this bug should port the code in @alexcrichton's comment and see how it optimizes.
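
(To make that concrete, here is a rough, unverified sketch of such a port to
std::future. The no-op waker and busy-wait executor below are scaffolding made
up for illustration, it uses std and a regular main instead of the original
no_std/no_main setup, and it has not been checked to reproduce the codegen
difference.)

use std::future::Future;
use std::pin::Pin;
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

struct Tx;

impl Tx {
    fn write(self, byte: u8) -> Write {
        Write { tx: Some(self), byte }
    }
}

struct Write {
    tx: Option<Tx>,
    byte: u8,
}

impl Future for Write {
    type Output = Tx;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Tx> {
        let tx = self.tx.take().expect("cannot poll `write` twice");
        unsafe {
            // `0x0` and `0x4` again emulate memory-mapped IO registers, as in
            // the original reproducer; not meant to run on a hosted target.
            if ptr::read_volatile(0x0 as *const bool) {
                ptr::write_volatile(0x4 as *mut u8, self.byte);
                Poll::Ready(tx)
            } else {
                self.tx = Some(tx);
                Poll::Pending
            }
        }
    }
}

// A no-op waker so the future can be driven by a plain busy loop.
unsafe fn vt_clone(_: *const ()) -> RawWaker {
    RawWaker::new(ptr::null(), &NOOP_VTABLE)
}
unsafe fn vt_noop(_: *const ()) {}
static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

fn busy_wait<F: Future>(f: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &NOOP_VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut f = std::pin::pin!(f);
    loop {
        if let Poll::Ready(t) = f.as_mut().poll(&mut cx) {
            return t;
        }
    }
}

fn main() {
    let tx = Tx;
    // The async block plays the role of `and_then`: a compiler-generated state
    // machine that chains the two writes.
    let tx = busy_wait(async { tx.write(4).await.write(2).await });
    let _ = tx;
}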

@Noratrieb added the T-compiler label on Apr 5, 2023
@workingjubilee added the C-optimization label on Oct 8, 2023