
Improve LLVM optimization across function calls #21465

Closed
mahkoh opened this issue Jan 21, 2015 · 15 comments
Labels
A-codegen: Area: Code generation
A-LLVM: Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
I-slow: Issue: Problems and improvements with respect to performance of generated code.

Comments

@mahkoh
Contributor

mahkoh commented Jan 21, 2015

Vec<T> has an extend function which is equivalent to push_all except that it accepts an iterator. push_all optimizes to a memcpy but extend does not. Consider the following unsafe iterator:

struct MyIntoIter<T> {
    ptr: *const T,
    end: *const T
}

impl<T> MyIntoIter<T> {
    fn from_vec(mut v: Vec<T>) -> MyIntoIter<T> {
        unsafe {
            let ptr = v.as_mut_ptr();
            let begin = ptr as *const T;
            let end = if mem::size_of::<T>() == 0 {
                (ptr as usize + v.len()) as *const T
            } else {
                ptr.offset(v.len() as isize) as *const T
            };
            mem::forget(v);
            MyIntoIter { ptr: begin, end: end }
        }
    }
}

trait RawIterator {
    type Item;

    fn next(&mut self) -> Self::Item;
    fn ptr(&self) -> *const Self::Item;
    fn len(&self) -> usize;
}

impl<T> RawIterator for MyIntoIter<T> {
    type Item = T;

    fn next(&mut self) -> T {
        unsafe {
            let old = self.ptr;
            self.ptr = self.ptr.offset(1);
            ptr::read(old)
        }
    }

    fn ptr(&self) -> *const T {
        self.ptr
    }

    fn len(&self) -> usize {
        (self.end as usize - self.ptr as usize) / mem::size_of::<T>()
    }
}

This has all safety features removed: the user is responsible for not calling next when there are no more elements. One can assume that a normal Iterator will never optimize better than this one.
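For illustration, a sketch of how a caller is expected to drive this (not part of the original benchmark; sum_bytes is just a made-up example):

// Hypothetical usage: the caller reads len() once and calls next() exactly
// that many times; nothing prevents it from reading past the end.
fn sum_bytes(v: Vec<u8>) -> u64 {
    let mut iter = MyIntoIter::from_vec(v);
    let mut total = 0u64;
    for _ in 0..iter.len() {
        total += iter.next() as u64;
    }
    // The Vec's buffer was handed to mem::forget in from_vec, so it is leaked
    // here, in keeping with "all safety features removed".
    total
}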

Consider the following variants of extend:

#![allow(unstable)]

use std::{ptr, mem};

#[inline(never)]
fn extend4<T, I: RawIterator<Item=T>>(vec: &mut Vec<T>, mut iter: I) {
    let len = iter.len();
    vec.reserve(len);
    unsafe {
        let mut ptr = vec.as_mut_ptr().offset(vec.len() as isize);
        let end = ptr.offset(len as isize);
        while ptr != end {
            ptr::write(ptr, iter.next());
            ptr = ptr.offset(1);
        }
    }
}

#[inline(never)]
fn extend5<T, I: RawIterator<Item=T>>(vec: &mut Vec<T>, mut iter: I) {
    let len = iter.len();
    vec.reserve(len);
    unsafe {
        let mut dst_ptr = vec.as_mut_ptr().offset(vec.len() as isize);
        let dst_end = dst_ptr.offset(len as isize);
        let mut src_ptr = iter.ptr();
        while dst_ptr != dst_end {
            ptr::write(dst_ptr, ptr::read(src_ptr));
            dst_ptr = dst_ptr.offset(1);
            src_ptr = src_ptr.offset(1);
        }
    }
}

fn main() {
    let src: Vec<_> = (0..6400000us).map(|x| x as u8).collect();
    let mut x: Vec<u8> = vec!();

    //extend4(&mut x, MyIntoIter::from_vec(src));
    extend5(&mut x, MyIntoIter::from_vec(src));
}

You can see that extend5 is nothing but extend4 with next inlined manually. However, extend5 optimizes to a memcpy while extend4 does not.
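One way to check a claim like this is to inspect the optimized LLVM IR, e.g. with something along these lines (extend.rs is just a placeholder name for a file containing the code above):

rustc -O --emit=llvm-ir extend.rs
grep llvm.memcpy extend.ll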

From this we can see that adding an unsafe ExactSizeIterator trait will not improve the extend performance on its own.

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

On the other hand, extend4 does not optimize better than extend2:

#[inline(never)]
fn extend2<T, I: Iterator<Item=T>+ExactSizeIterator>(vec: &mut Vec<T>, mut iter: I) {
    let len = iter.len();
    vec.reserve(len);
    unsafe {
        let mut ptr = vec.as_mut_ptr().offset(vec.len() as isize);
        let end = ptr.offset(len as isize);
        while ptr != end {
            let el = iter.next();
            assume(el.is_some());
            ptr::write(ptr, el.unwrap());
            ptr = ptr.offset(1);
        }
    }
}

So one might assume that if one can get extend4 to optimize to a memcpy, then extend2 will also optimize to a memcpy.
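Note: assume above presumably refers to the unstable std::intrinsics::assume intrinsic; on a 2015-era nightly the snippet would bring it into scope with something like:

use std::intrinsics::assume;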

@Gankra
Contributor

Gankra commented Jan 21, 2015

CC @Aatch

Gankra added the I-slow and A-codegen labels on Jan 21, 2015
@Aatch
Contributor

Aatch commented Jan 21, 2015

It is possible to always inline methods with #[inline(always)]. While I generally discourage use of the attribute, it certainly has its uses when LLVM seems reluctant to inline something itself.

I might investigate what's going on here though. I assume this is all in the same crate?

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

@Aatch: Adding this attribute to the next function up there changes nothing.

I assume this is all in the same crate?

It's all one file.

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

Maybe related: #18363

@Aatch
Contributor

Aatch commented Jan 21, 2015

Ok, figured out the problem. It's one I already know about, which is nice I guess. The long and short of it though is that LLVM doesn't know about ownership. To demonstrate, this change to extend4 causes it to generate the same IR as extend5:

#[inline(never)]
fn extend4<T, I: RawIterator<Item=T>>(vec: &mut Vec<T>, iter: I) {
    let mut iter = iter; // Add this line
    let len = iter.len();
    vec.reserve(len);
    unsafe {
        let mut ptr = vec.as_mut_ptr().offset(vec.len() as isize);
        let end = ptr.offset(len as isize);
        while ptr != end {
            ptr::write(ptr, iter.next());
            ptr = ptr.offset(1);
        }
    }
}

That is literally it. The problem is that LLVM doesn't realise that the changes will never be visible in the caller. The iterator is passed as a pointer, because it is too big to be an immediate, so we end up writing to memory that can be seen from outside the function.

Making a new variable copies the iterator into a local stack slot, and LLVM suddenly becomes much happier about optimising this into much faster code. The ideal fix would be to teach LLVM a bit about our ownership semantics.
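A minimal standalone sketch of the effect described here (not from the original thread; Big and sum are illustrative names, and the behaviour assumes the 2015-era codegen described above, where large by-value arguments are passed behind a pointer into the caller's frame):

struct Big {
    data: [u64; 8],
    idx: usize,
}

#[inline(never)]
fn sum(big: Big) -> u64 {
    // Rebinding copies the argument into a fresh local stack slot. Without it,
    // updates to `idx` go through the memory the caller passed in, which the
    // optimizer has to treat conservatively.
    let mut big = big;
    let mut total = 0u64;
    while big.idx < big.data.len() {
        total += big.data[big.idx];
        big.idx += 1;
    }
    total
}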

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

Ah, so the problem wasn't the function call but the place where the iterator is stored? Either way, with this change the extend2 function, which just uses the proposed ExactSizeIterator trait, does optimize to a memcpy:

#[inline(never)]
fn extend2<T, I: Iterator<Item=T>+ExactSizeIterator>(vec: &mut Vec<T>, iter: I) {
    let mut iter = iter;
    let len = iter.len();
    vec.reserve(len);
    unsafe {
        let mut ptr = vec.as_mut_ptr().offset(vec.len() as isize);
        let end = ptr.offset(len as isize);
        while ptr != end {
            let el = iter.next();
            assume(el.is_some());
            ptr::write(ptr, el.unwrap());
            ptr = ptr.offset(1);
        }
    }
}

@Gankra
Contributor

Gankra commented Jan 21, 2015

@mahkoh Is the while ptr != end style really necessary? Would a normal for loop not suffice? (Wouldn't be particularly surprised either way, just curious.)

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

Unfortunately

#[inline(never)]
fn extend1<T, I: Iterator<Item=T>+ExactSizeIterator>(vec: &mut Vec<T>, iter: I) {
    let mut iter = iter;
    let len = iter.len();
    vec.reserve(len);
    for _ in (0..len) {
        let el = iter.next();
        unsafe {
            assume(el.is_some());
            assume(vec.len() < vec.capacity());
            vec.push(el.unwrap())
        }
    }
}

doesn't work.

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

Maybe assume(vec.len() < vec.capacity()); is just too abstract. As a Vec method it might be possible to write this differently and then it might work.
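A hypothetical sketch of what such a method could look like (push_unchecked is an illustrative name, not an actual Vec API; written here as a free function over Vec's public raw-pointer interface):

// Illustrative only: the caller promises that vec.len() < vec.capacity(), so
// the growth check disappears entirely instead of being hinted at with assume().
unsafe fn push_unchecked<T>(vec: &mut Vec<T>, value: T) {
    let len = vec.len();
    ptr::write(vec.as_mut_ptr().offset(len as isize), value);
    vec.set_len(len + 1);
}

Whether LLVM would then turn the loop into a memcpy would still need to be checked against the generated IR.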

@Gankra
Contributor

Gankra commented Jan 21, 2015

This is how I would "ideally" write it (ignoring panic concerns, which I guess is why you went with push?):

fn extend1<T, I: Iterator<Item=T>+ExactSizeIterator>(vec: &mut Vec<T>, iter: I) {
    let mut iter = iter;
    let len = iter.len();
    vec.reserve(len);
    let mut ptr = vec.as_mut_ptr().offset(vec.len());
    for el in iter {
       ptr.write(el);
       ptr = ptr.offset(1);
    }
    vec.set_len(len);
}

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

That does indeed work:

#[inline(never)]
fn extend7<T, I: Iterator<Item=T>+ExactSizeIterator>(vec: &mut Vec<T>, mut iter: I) {
    let mut iter = iter;
    let len = iter.len();
    vec.reserve(len);
    unsafe {
        let mut ptr = vec.as_mut_ptr().offset(vec.len() as isize);
        for el in iter {
           ptr::write(ptr, el);
           ptr = ptr.offset(1);
        }
        // note: assumes the vec started empty; in general this should be the old length plus len
        vec.set_len(len);
    }
}

@Gankra
Contributor

Gankra commented Jan 21, 2015

Okay, that's great. We just need to add an O(1) unwind guard to set the len and we're golden... maybe?

Ah, zero-sized types complicate all of this, but it's easy enough to add another branch to cover that.

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

#[inline(never)]
fn extend7<T, I: Iterator<Item=T>+ExactSizeIterator>(vec: &mut Vec<T>, iter: I) {
    // Unwind guard: if iter.next() panics, the Drop impl sets the vec's length
    // to the number of elements actually written so far.
    struct PanicGuard<'a, T: 'a> {
        vec: &'a mut Vec<T>,
        ptr: *const *mut T,
    }

    #[unsafe_destructor]
    impl<'a, T> Drop for PanicGuard<'a, T> {
        fn drop(&mut self) {
            unsafe {
                let diff = *self.ptr as usize - self.vec.as_ptr() as usize;
                let size = mem::size_of::<T>();
                let len = diff / if size == 0 { 1 } else { size };
                self.vec.set_len(len);
            }
        }
    }

    let mut iter = iter;
    let len = iter.len();
    vec.reserve(len);

    unsafe {
        {
            let mut ptr = if mem::size_of::<T>() == 0 {
                (vec.as_mut_ptr() as usize + vec.len()) as *mut T
            } else {
                vec.as_mut_ptr().offset(vec.len() as isize)
            };
            // Reborrow so that `vec` is still usable after the guard is forgotten.
            let guard = PanicGuard { vec: &mut *vec, ptr: &ptr };
            for el in iter {
               ptr::write(ptr, el);
               ptr = if mem::size_of::<T>() == 0 {
                   (ptr as usize + 1) as *mut T
               } else {
                   ptr.offset(1)
               };
            }
            mem::forget(guard);
        }
        vec.set_len(len);
    }
}

This works but only for primitive types.

@mahkoh
Contributor Author

mahkoh commented Jan 21, 2015

LLVM will not optimize this for structs even if the struct contains a single primitive field.
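For instance, a single-field newtype used as the element type in place of u8 is already enough to lose the memcpy; a minimal example of such a struct (the name is just illustrative):

struct Wrapped(u8);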

sanxiyn added the A-LLVM label on Jan 25, 2015
mahkoh closed this as completed on Apr 11, 2015
rust-lang locked and limited conversation to collaborators on Apr 11, 2015