Future::and_then and other adapters are not always zero cost #37930

Open
japaric opened this issue Nov 22, 2016 · 12 comments
Labels
A-codegen (Area: Code generation)
A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@japaric (Member) commented Nov 22, 2016

I expected these two code snippets to produce the same (or very similar) code
when fully optimized (release + LTO), but they don't:

// first: two separate writes
let tx = busy_wait(tx.write(4));
busy_wait(tx.write(2));

// second: the writes chained with `and_then`
busy_wait(tx.write(4).and_then(|tx| tx.write(2)));

The first is panic!-free but the second is not: it still contains calls to
panic!("cannot poll a chained future twice") even though neither future is
ever polled again after it has yielded its value.
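
(For context, here is a rough sketch of where that panic comes from. This is
paraphrased, not the actual futures 0.1 source; the names are made up.)

// `and_then` builds a small state machine. Its "already finished" state has to
// do something if it is ever polled again, and that something is the panic.
enum ChainState<A, B> {
    First(A),   // still driving the first future
    Second(B),  // first future finished, now driving the second
    Done,       // both finished; polling again is a caller bug
}

impl<A, B> ChainState<A, B> {
    // Simplified stand-in for the real poll logic: the `Done` arm is the branch
    // that survives into the optimized binary, even though `busy_wait` never
    // polls a future again after it has completed.
    fn check(&self) {
        if let ChainState::Done = self {
            panic!("cannot poll a chained future twice");
        }
    }
}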

STR (steps to reproduce)

// src/main.rs
#![feature(lang_items)]
#![feature(never_type)]
#![no_main]
#![no_std]

extern crate futures;

use core::{fmt, ptr};

use futures::{Async, Future};

// entry point
#[no_mangle]
pub fn _start() -> ! {
    let tx = Tx { _0: () };

    busy_wait(tx.write(4).and_then(|tx| tx.write(2)));

    loop {}
}

struct Tx {
    _0: (),
}

impl Tx {
    fn write(self, byte: u8) -> Write {
        Write {
            tx: Some(self),
            byte: byte,
        }
    }
}

struct Write {
    byte: u8,
    tx: Option<Tx>,
}

impl Future for Write {
    type Error = !;
    type Item = Tx;

    fn poll(&mut self) -> Result<Async<Tx>, !> {
        unsafe {
            let tx = self.tx.take().expect("cannot poll `write` twice");

            // NOTE `0x0` and `0x4` emulate memory mapped IO registers
            // Can we send data yet?
            if ptr::read_volatile(0x0 as *const bool) {
                // Send one byte of data
                ptr::write_volatile(0x4 as *mut u8, self.byte);
                Ok(Async::Ready(tx))
            } else {
                self.tx = Some(tx);
                Ok(Async::NotReady)
            }
        }
    }
}

fn busy_wait<F>(mut f: F) -> F::Item
    where F: Future<Error = !>
{
    loop {
        if let Ok(Async::Ready(t)) = f.poll() {
            return t;
        }
    }
}

#[lang = "panic_fmt"]
extern "C" fn panic_fmt(_fmt: fmt::Arguments,
                        _file: &'static str,
                        _line: u32)
                        -> ! {
    // HACK to keep the `_file` string in the final binary
    unsafe { ptr::write_volatile(0x8 as *mut _, _file) }
    loop {}
}
$ head -n7 Cargo.toml
[dependencies.futures]
default-features = false
version = "0.1.3"

[profile.release]
lto = true
panic = "abort"

$ cargo rustc --release --verbose -- -C link-arg=-nostartfiles
$ objdump -Cd target/release/foo
00000000000002c0 <_start>:
 2c0:   b0 04                   mov    $0x4,%al
 2c2:   31 c9                   xor    %ecx,%ecx
 2c4:   66 66 66 2e 0f 1f 84    data16 data16 nopw %cs:0x0(%rax,%rax,1)
 2cb:   00 00 00 00 00
 2d0:   80 f9 01                cmp    $0x1,%cl
 2d3:   74 3b                   je     310 <_start+0x50>
 2d5:   66 66 2e 0f 1f 84 00    data16 nopw %cs:0x0(%rax,%rax,1)
 2dc:   00 00 00 00
 2e0:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 2e7:   01
 2e8:   74 f6                   je     2e0 <_start+0x20>
 2ea:   88 04 25 04 00 00 00    mov    %al,0x4
 2f1:   84 c9                   test   %cl,%cl
 2f3:   75 3d                   jne    332 <_start+0x72>
 2f5:   b1 01                   mov    $0x1,%cl
 2f7:   b0 02                   mov    $0x2,%al
 2f9:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 300:   01
 301:   74 cd                   je     2d0 <_start+0x10>
 303:   c6 04 25 04 00 00 00    movb   $0x2,0x4
 30a:   02
 30b:   eb 23                   jmp    330 <_start+0x70>
 30d:   0f 1f 00                nopl   (%rax)
 310:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 317:   01
 318:   74 f6                   je     310 <_start+0x50>
 31a:   88 04 25 04 00 00 00    mov    %al,0x4
 321:   66 66 66 66 66 66 2e    data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 328:   0f 1f 84 00 00 00 00
 32f:   00
 330:   eb fe                   jmp    330 <_start+0x70>
 332:   50                      push   %rax
 333:   e8 38 00 00 00          callq  370 <core::panicking::panic::h194ce5d68a8f28a1>
 338:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
 33f:   00

0000000000000340 <core::panicking::panic_fmt::h561c5ee168a3d2cb>:
(..)
$ objcopy -O binary target/release/foo foo.bin

$ strings foo.bin
(..)
/home/japaric/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-0.1.3/src/chain.rs

If the program is changed to the no-and_then variant, the disassembly looks
like this:

$ objdump -Cd target/release/foo
0000000000000280 <_start>:
 280:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 287:   01
 288:   74 f6                   je     280 <_start>
 28a:   c6 04 25 04 00 00 00    movb   $0x4,0x4
 291:   04
 292:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 299:   1f 84 00 00 00 00 00
 2a0:   f6 04 25 00 00 00 00    testb  $0x1,0x0
 2a7:   01
 2a8:   74 f6                   je     2a0 <_start+0x20>
 2aa:   c6 04 25 04 00 00 00    movb   $0x2,0x4
 2b1:   02
 2b2:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
 2b9:   1f 84 00 00 00 00 00
 2c0:   eb fe                   jmp    2c0 <_start+0x40>

There are no panic_fmt calls in it.

Meta

$ rustc -V
rustc 1.15.0-nightly (43006fcea 2016-11-15)

I don't think this is a problem with the implementation of AndThen/Chain in
the futures crate, because I have tried re-implementing Future in a few
different ways (e.g. without error handling, sprinkling #[inline] everywhere)
and none of that helps.

Code like this:

busy_wait(futures::done(Ok(42)).and_then(|x| futures::done(Ok(x))));

optimizes the same as the "split" version; in that case LLVM evaluates the
expression at compile time and replaces it with 42. So LLVM can actually lower
and_then to panic!-free code; it just doesn't optimize this particular case
well (perhaps because of the volatile memory operations?).

join is another adapter that has the same issue.
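
(Purely as an illustration, not a second verified reproducer: the analogous
`join` usage in the style of the STR above would be something like the
following.)

// Hypothetical: `Join` also tracks whether each half has already completed, so
// the optimized binary can likewise retain a "polled after completion" panic path.
let tx1 = Tx { _0: () };
let tx2 = Tx { _0: () };
let (_tx1, _tx2) = busy_wait(tx1.write(4).join(tx2.write(2)));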

cc @alexcrichton @eddyb

@alexcrichton (Member) commented:

Hilariously, this is not an issue at -Copt-level=2 (which is how I initially tried to reproduce it). It is an issue at -Copt-level=3.

@alexcrichton (Member) commented:

Opened #37933

@alexcrichton (Member) commented:

Whoa wow I totally thought this was in the futures-rs repo, not the rust-lang/rust repo! Ok, inlining #37933 into here:


Originally reported at #37930, minimized at https://gist.github.com/alexcrichton/86db63138ef979422d8fa10538dc6dcb, the code optimizes better at opt-level=2 than opt-level=3.

For example:

$ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=2 && grep 'call .* @_ZN4core9panicking5panic' main.ll
$ rustc +nightly src/main.rs --emit llvm-ir -C panic=abort -Copt-level=3 && grep 'call .* @_ZN4core9panicking5panic' main.ll
  tail call void @_ZN4core9panicking5panic17h194ce5d68a8f28a1E({ %str_slice, %str_slice, i32 }* noalias readonly dereferenceable(40) bitcast ({ %str_slice, %str_slice, i32, [4 x i8] }* @"_ZN47_$LT$main..Chain$LT$A$C$$u20$B$C$$u20$C$GT$$GT$4poll14_MSG_FILE_LINE17hf0dd0d6c8e2c1d26E" to { %str_slice, %str_slice, i32 }*))

That is, with opt-level=2 we can prove the code doesn't panic, but at opt-level=3 something goes awry and we don't prove that it doesn't panic.

Do we need to tweak our opt-level=3 pipeline? Has it diverged from clang? Is it inappropriate to run by default?

@alexcrichton added the A-codegen label on Nov 22, 2016
@japaric (Member, Author) commented Nov 22, 2016

I should also note that setting opt-level = 2 (in Cargo.toml's [profile.release]) has no effect if you are still using the futures crate as a dependency. In other words, it's necessary to inline the futures code into a single crate and set opt-level = 2.

@arielb1 (Contributor) commented Nov 22, 2016

I am investigating @alexcrichton's issue.

Some of the generated code ends up looking like:

  %f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ]
  %trunc.i.i.i = trunc i8 %f.sroa.0.0.ph.i to i2
  %switch.i = icmp eq i2 %trunc.i.i.i, 0
...
x-block:
  br i1 %switch.i, label %x2-block, label %x1-block
...
y-block:
  %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
  br i1 %cond.i.i.i, label %y1-block, label %y2-block

InstCombine manages to combine the two compares and transform the phi into a phi on booleans, which allows SCCP to later optimize out one of the conditionals.

At -C opt-level=3, LLVM goes through some convoluted path that involves maintaining a branch table of the form

  switch i2 %trunc.i.i, label %unreachable.i.i [
    i2 0, label %bb2.i.i
    i2 1, label %bb3.i.i
    i2 -2, label %bb1.i.i
  ]

across way too many passes, and ends up with a conditional of the form

  %f.sroa.0.0.ph.i = phi i8 [ 1, %foo-block ], [ 0, %entry-block ]
  %switch = icmp eq i8 %f.sroa.0.0.ph.i, 1
...
x-block:
  br i1 %switch, label %x1-block, label %x2-block
...
y-block:
  %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
  br i1 %cond.i.i.i, label %y1-block, label %y2-block

which it is then unable to optimize.

@arielb1 (Contributor) commented Nov 22, 2016

Minimal example (if both icmps have the same bitwise value, this optimizes to a switch on true/false; if they don't, it doesn't):

define i8 @_start(i8* %condition) {
entry-block:
  br label %bb1.preheader.i

bb1.preheader.i:
  %f.sroa.0.0.ph.i = phi i8 [ 1, %bb4.i.i.i.i ], [ 0, %entry-block ]
  %switch.i = icmp eq i8 %f.sroa.0.0.ph.i, 1
  br label %loop

loop:
  br i1 %switch.i, label %ret1, label %i101

i101:
  %0 = load volatile i8, i8* %condition
  %1 = icmp eq i8 %0, 0
  br i1 %1, label %bb4.i.i.i.i, label %loop

bb4.i.i.i.i:
  %cond.i.i.i = icmp eq i8 %f.sroa.0.0.ph.i, 0
  br i1 %cond.i.i.i, label %bb1.preheader.i, label %ret0

ret1:
  ret i8 1

ret0:
  ret i8 0
}

The doubly nested loop seems to be essential to getting the optimization to trigger. I wonder whether this is the case in the more complicated example too.

@arielb1 (Contributor) commented Nov 22, 2016

Sure, it looks like this issue is encountered in @japaric's posted no_std case too.

@arielb1 added the A-LLVM label on Nov 22, 2016
@alexcrichton (Member) commented:

Awesome, thanks for the minimization @arielb1!

@arielb1 added the I-slow label on Nov 22, 2016
@arielb1 (Contributor) commented Dec 9, 2016

I had the minimization correct, but I got confused when trying to find the pass to blame.

@Mark-Simulacrum added the C-enhancement label on Jul 26, 2017
@nox (Contributor) commented Apr 2, 2018

I'm confused. I can now grep _ZN4core9panicking5panic with both opt-level=2 and opt-level=3 when compiling the sample in @alexcrichton's comment. Did we regress further?

Cc @rust-lang/wg-codegen

@steveklabnik (Member) commented:

Triage: futures have changed a lot, and so the output is likely to have changed. I'm not great at futures internals yet, but someone wanting to reproduce this bug should port the code in @alexcrichton's comment and see how it optimizes.
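
(To make that concrete, here is a rough, unverified sketch of such a port to
std::future. The no-op waker and busy-wait executor below are scaffolding made
up for illustration, it uses std and a regular main instead of the original
no_std/no_main setup, and it has not been checked to reproduce the codegen
difference.)

use std::future::Future;
use std::pin::Pin;
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

struct Tx;

impl Tx {
    fn write(self, byte: u8) -> Write {
        Write { tx: Some(self), byte }
    }
}

struct Write {
    tx: Option<Tx>,
    byte: u8,
}

impl Future for Write {
    type Output = Tx;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Tx> {
        let tx = self.tx.take().expect("cannot poll `write` twice");
        unsafe {
            // `0x0` and `0x4` again emulate memory-mapped IO registers, as in
            // the original reproducer; not meant to run on a hosted target.
            if ptr::read_volatile(0x0 as *const bool) {
                ptr::write_volatile(0x4 as *mut u8, self.byte);
                Poll::Ready(tx)
            } else {
                self.tx = Some(tx);
                Poll::Pending
            }
        }
    }
}

// A no-op waker so the future can be driven by a plain busy loop.
unsafe fn vt_clone(_: *const ()) -> RawWaker {
    RawWaker::new(ptr::null(), &NOOP_VTABLE)
}
unsafe fn vt_noop(_: *const ()) {}
static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

fn busy_wait<F: Future>(f: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &NOOP_VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut f = std::pin::pin!(f);
    loop {
        if let Poll::Ready(t) = f.as_mut().poll(&mut cx) {
            return t;
        }
    }
}

fn main() {
    let tx = Tx;
    // The async block plays the role of `and_then`: a compiler-generated state
    // machine that chains the two writes.
    let tx = busy_wait(async { tx.write(4).await.write(2).await });
    let _ = tx;
}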

@Noratrieb added the T-compiler label on Apr 5, 2023
@workingjubilee added the C-optimization label on Oct 8, 2023