
perf: Use for_each in Vec::extend #68046

Closed
wants to merge 8 commits

Conversation

@Marwes (Contributor) commented Jan 9, 2020

for_each is specialized for iterators such as chain, allowing faster
iteration than a normal for/while loop.

Note that since this only checks size_hint once at the start, it may
end up calling reserve more often in cases where size_hint returns a
larger, more accurate lower bound later during iteration.

This could perhaps be alleviated with an implementation closer to the
current one, but the extra complexity would likely end up harming the
normal case of an accurate or 0 (think filter) lower bound.

```rust
while let Some(element) = iterator.next() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
    unsafe {
        let len = self.len();
        ptr::write(self.get_unchecked_mut(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }

    iterator.by_ref().take(self.capacity()).for_each(|element| {
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    });
}

// OR

let (lower, _) = iterator.size_hint();
self.reserve(lower);
loop {
    let result = iterator.by_ref().try_for_each(|element| {
        if self.len() == self.capacity() {
            return Err(element);
        }
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
        Ok(())
    });

    match result {
        Ok(()) => break,
        Err(element) => {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
            self.push(element);
        }
    }
}
```

Closes #63340

@rust-highfive (Collaborator)

r? @Mark-Simulacrum

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive added the S-waiting-on-review label Jan 9, 2020
@Marwes (Contributor, Author) commented Jan 9, 2020

It seems that benchmarks aren't run/checked by CI as ./x.py bench --stage=1 src/liballoc failed with a missing import for black_box.
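For reference, black_box lives in the unstable test crate; a hypothetical liballoc-style bench (a sketch only, not this PR's actual benchmark) would look roughly like:

```rust
// Sketch of a liballoc-style bench (hypothetical, not from this PR):
// `black_box` lives in the unstable `test` crate, hence the missing import.
#![feature(test)]
extern crate test;
use test::{black_box, Bencher};

#[bench]
fn bench_extend_chain(b: &mut Bencher) {
    let xs: Vec<u32> = (0..64).collect();
    let ys: Vec<u32> = (64..128).collect();
    b.iter(|| {
        let mut v = Vec::new();
        v.extend(xs.iter().chain(ys.iter()).copied());
        black_box(&v); // keep the result observable to the optimizer
    });
}
```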

@Centril (Contributor) commented Jan 9, 2020

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors (Contributor) commented Jan 9, 2020

⌛ Trying commit fd175a8 with merge 45ce712...

bors added a commit that referenced this pull request Jan 9, 2020
perf: Use `for_each` in `Vec::extend`
@bors (Contributor) commented Jan 9, 2020

☀️ Try build successful - checks-azure
Build commit: 45ce712 (45ce712754dc7ab7e68cbc506e6d43eb04e74128)

@rust-timer (Collaborator)

Queued 45ce712 with parent adc6572, future comparison URL.

@rust-timer (Collaborator)

Finished benchmarking try commit 45ce712, comparison URL.

@Marwes (Contributor, Author) commented Jan 9, 2020

The piston-image regression doesn't make any sense...

@Mark-Simulacrum (Member)

I don't have time myself to figure out why the benchmarks here regressed, but it's not obvious that the regression is truly spurious; if we're generating many more impls, or more complicated impls due to unrolling or the like, it's not impossible that we're genuinely regressing.

(For one thing, the new code contains a closure, whereas the previous code didn't and so plausibly instantiated less.)

Happy to rerun the compiler benchmarks if you want to remove the closure (I suspect it might be replaceable with Vec::push).

I would also like to see some ad-hoc benchmarks (e.g., extending with 100 elements or something) done locally before/after this.
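For concreteness, such an ad-hoc benchmark could be as small as the sketch below (element counts and names are illustrative, not taken from the PR); running it against a compiler built with and without the patch gives the before/after numbers:

```rust
// Ad-hoc timing sketch (illustrative only): extend a Vec from a ~100-element
// chained iterator many times and report the wall-clock time.
use std::time::Instant;

fn main() {
    let a: Vec<u32> = (0..50).collect();
    let b: Vec<u32> = (50..100).collect();

    let start = Instant::now();
    let mut sink: Vec<u32> = Vec::new();
    for _ in 0..1_000_000 {
        sink.clear();
        // `chain` is the interesting case: internal iteration lets the
        // "which half are we in?" branch move out of the per-element loop.
        sink.extend(a.iter().chain(b.iter()).copied());
    }
    println!("extend via chain: {:?}", start.elapsed());
    println!("len = {}", sink.len()); // keep the result observable
}
```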

@Mark-Simulacrum added the S-waiting-on-author label and removed the S-waiting-on-review label Jan 19, 2020
@rust-highfive (Collaborator)

Your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2020-01-21T18:05:13.5527553Z ========================== Starting Command Output ===========================
2020-01-21T18:05:13.5531691Z [command]/bin/bash --noprofile --norc /home/vsts/work/_temp/c22da964-3b9a-4ac0-ba81-1adfd7784d5e.sh
2020-01-21T18:05:13.5531798Z 
2020-01-21T18:05:13.5536050Z ##[section]Finishing: Disable git automatic line ending conversion
2020-01-21T18:05:13.5542651Z ##[section]Starting: Checkout rust-lang/rust@refs/pull/68046/merge to s
2020-01-21T18:05:13.5544420Z Task         : Get sources
2020-01-21T18:05:13.5544501Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.
2020-01-21T18:05:13.5544535Z Version      : 1.0.0
2020-01-21T18:05:13.5544566Z Author       : Microsoft
---
2020-01-21T18:05:14.5844515Z ##[command]git remote add origin https://github.com/rust-lang/rust
2020-01-21T18:05:14.5926728Z ##[command]git config gc.auto 0
2020-01-21T18:05:14.5933752Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2020-01-21T18:05:14.5938643Z ##[command]git config --get-all http.proxy
2020-01-21T18:05:14.5946458Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/68046/merge:refs/remotes/pull/68046/merge

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@the8472 (Member) commented Jan 22, 2020

Note that since this only checks size_hint once at the start, it may
end up calling reserve more often in cases where size_hint returns a
larger, more accurate lower bound later during iteration.

It's worse than that. The current implementation takes one element from the iterator and then queries for a size_hint; this PR takes the size_hint first. The standard library contains iterators that initially can't provide a size hint at all and only provide better ones as you iterate, perhaps even only later down the road.

The btree range iterators and flattening iterators would be such examples.

```rust
let mut flat = vec![vec!['b'; 1], vec!['c'; 1_000_000]].into_iter().flatten();
dbg!(flat.size_hint());
flat.next();
dbg!(flat.size_hint());
flat.next();
dbg!(flat.size_hint());
```

```
[src/main.rs:7] flat.size_hint() = (
    0,
    None,
)
[src/main.rs:9] flat.size_hint() = (
    0,
    None,
)
[src/main.rs:11] flat.size_hint() = (
    999999,
    Some(
        999999,
    ),
)
```

I don't think this is currently covered by benchmarks, so the drawbacks in one case are easily missed next to the advantages in another.

@the8472 (Member) left a comment


Have you run the whole battery of vec benchmarks and done a before/after comparison, e.g. via cargo benchcmp?

```rust
    self.reserve(lower);
    loop {
        let cap = self.capacity();
        let result = iterator.by_ref().try_fold((), |(), element| {
```
Member


you can use try_for_each which takes a reference (so no by_ref needed) and doesn't require an element to fold through.
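For illustration, a minimal standalone sketch of that suggestion (the function name and signature are hypothetical, not the PR's diff):

```rust
// `try_for_each` takes `&mut self`, so the caller's iterator can keep
// yielding after we bail out with `Err` once the spare capacity is used up.
fn fill_spare_capacity<T, I>(vec: &mut Vec<T>, iter: &mut I) -> Result<(), T>
where
    I: Iterator<Item = T>,
{
    iter.try_for_each(|element| {
        if vec.len() == vec.capacity() {
            // Hand the element back so the caller can reserve and push it.
            return Err(element);
        }
        vec.push(element);
        Ok(())
    })
}
```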

Member


Until #46477 is fixed you probably should replace the closure by passing a free function instead (such as #62429 did for iterators). I think that's what @Mark-Simulacrum meant.

@Marwes (Contributor, Author) Jan 22, 2020


you can use try_for_each which takes a reference (so no by_ref needed) and doesn't require an element to fold through.

try_for_each just forwards to try_fold for Chain (same for most other iterators). I wanted to avoid tripping up the optimizer as much as possible, as I have not seen any improvements so far.

try_fold also only takes &mut self, so by_ref should definitely be removed: I just saw that the Iterator impl for &mut T is unable to forward try_fold and friends due to object safety, which means I was just using the default implementation of try_fold, easily explaining why I am not seeing any improvement for chained iterators... Actually, by_ref() shouldn't affect anything in this case: the correct try_fold should still be selected, since it would require calling try_fold on a &mut &mut T to get the wrong one.

This might be fixable with some specialization at least #68472

Until #46477 is fixed you probably should replace the closure by passing a free function instead (such as #62429 did for iterators). I think that's what @Mark-Simulacrum meant.

Yeah, should do that. Was just trying to get an actual benchmark showing that this was an improvement first :/

```rust
    // NB can't overflow since we would have had to alloc the address space
    self.set_len(len + 1);
    let (lower, _) = iterator.size_hint();
    self.reserve(lower);
```
Member


This still has the issue of trying to reserve before advancing the iterator. Taking an element and then checking the hint and capacity is strictly better, because the hint will be more accurate, and if you get a None you can skip the hint calculation and the reserve attempt entirely.

@Marwes (Contributor, Author) Jan 22, 2020


The benefit is that it avoids calling next at all. While the hint should be more accurate after calling next, it would be more accurate still if we called next twice, so I don't quite buy that argument. Calling next once would at least give a fast path for the empty iterator, which I find a more convincing argument; however, it is only a benefit if size_hint(); reserve(0) is slow enough to warrant such a fast path.

Pushed another variant which does next first which should optimize better for the empty iterator. Haven't got time to benchmark it yet.
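For reference, a hedged sketch of what such a next-first variant could look like (not necessarily the exact code that was pushed):

```rust
// Sketch only: call `next` before touching `size_hint`/`reserve`, so an empty
// iterator returns immediately, then fill the reserved capacity with internal
// iteration and re-check the hint whenever the capacity runs out.
fn extend_next_first<T, I>(vec: &mut Vec<T>, mut iterator: I)
where
    I: Iterator<Item = T>,
{
    let mut pending = iterator.next();
    while let Some(element) = pending {
        let (lower, _) = iterator.size_hint();
        vec.reserve(lower.saturating_add(1));
        vec.push(element);

        match iterator.try_for_each(|element| {
            if vec.len() == vec.capacity() {
                return Err(element);
            }
            vec.push(element);
            Ok(())
        }) {
            Ok(()) => return,
            Err(element) => pending = Some(element),
        }
    }
}
```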

Member


I was more concerned about doing a suboptimal resize if we got an incorrect lower bound on the first attempt, but you're right: if it's just 0, it should hopefully be cheap enough, and the next reserve will do it properly.

Contributor Author


Have gone back and forth a bit, but I think the approach of reserving eagerly has the better tradeoff.

On size_hint == 0 it does call reserve unnecessarily but it is only one more branch in an already fast case. If size_hint > 0 then most of the time we will reserve either way and most of the time with the same value (size_hint_minus_removed_value + 1 vs size_hint).

The only case where I'd expect calling next first to be faster is if size_hint is expensive despite the iterator being empty, and next is cheaper to call than size_hint, which seems unlikely (happy to be corrected, however!).

@Swatinem (Contributor)

Just a drive-by comment, without knowing any of the details…

Is it possible this is an opposite case of #64572?

@the8472 (Member) commented Jan 24, 2020

In some ways. But that PR was superseded by #64600 and #64885 which achieved most of those gains without doing away with internal iteration. So ideally we'd improve chain iterators here without incurring significant losses elsewhere.

@Marwes (Contributor, Author) commented Jan 24, 2020

LLVM seems to be clever enough to optimize the current, next-based extend to the same optimized code that we expect from try_fold. I will try to create a better benchmark which shows an actual difference between the approaches.

Another perf run could be interesting though, to see if the regressions are at least fixed with the current implementation (seeing lots of variance in my benchmarks, but it does not appear slower at least).

@the8472 (Member) commented Jan 24, 2020

seeing lots of variance in my benchmarks

Monitor your CPU clocks when running microbenchmarks. Thermal throttling can confound results. Disabling boost clocks can help if that's the case.

Markus Westerlind and others added 6 commits February 6, 2020 21:14
perf: Use `for_each` in `Vec::extend`
Should put less stress on LLVM since there are fewer closures passed
around, and lets us refine how much we reserve with `size_hint` if the
first guess is too low.
@Marwes (Contributor, Author) commented Feb 25, 2020

This could use another perf run with the latest changes. I suspect it won't be enough to remedy the compile time regressions though.

@joelpalmer

Ping from Triage: Any updates? @Marwes?

@Marwes (Contributor, Author) commented Mar 9, 2020

This could use another perf run with the latest changes. I suspect it won't be enough to remedy the compile time regressions though.

#68046 (comment)

If the perf run still shows an unacceptable regression I don't see a way to land this right now. Perhaps with MIR optimizations acting on the generic functions it could be optimized enough before getting to LLVM to make it possible, but otherwise the overhead is fairly fundamental to the change.

@Dylan-DPC-zz

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors (Contributor) commented Mar 13, 2020

⌛ Trying commit e41f55e with merge f2ee309252c8b5a6db0a206d760db8047cfb69eb...

@bors (Contributor) commented Mar 13, 2020

💥 Test timed out

@bors added the S-waiting-on-review label and removed the S-waiting-on-author label Mar 13, 2020
@rust-highfive (Collaborator)

Your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.


I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @rust-lang/infra. (Feature Requests)

@Dylan-DPC-zz

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion

@bors (Contributor) commented Mar 21, 2020

⌛ Trying commit e41f55e with merge 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05...

@bors (Contributor) commented Mar 21, 2020

☀️ Try build successful - checks-azure
Build commit: 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05 (18cc66a1f5cf20ca7c7e28fbe43469d13c436b05)

@rust-timer (Collaborator)

Queued 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05 with parent 38114ff, future comparison URL.

@Mark-Simulacrum added the S-waiting-on-author label and removed the S-waiting-on-review label Mar 29, 2020
@JohnCSimon (Member)

Ping from Triage: Any updates @Marwes? Thank you.

@Marwes (Contributor, Author) commented Apr 6, 2020

I can't see a way to resolve the compiler performance regressions from this at this time. Perhaps with more optimizations before monomorphization the overhead could be reduced enough (so that LLVM wouldn't need to instantiate and inline this added complexity for each and every iterator).

Successfully merging this pull request may close these issues.

chain() make collect very slow
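For context, a minimal illustration of the linked issue's shape (names and sizes are made up): collecting from chain goes through the same extend-style loop this PR changes, so each element pays the chain branch unless internal iteration is used.

```rust
// Illustration only: the collect below exercises the chain + Vec path that
// #63340 reports as slow.
fn collect_chained(a: &[u32], b: &[u32]) -> Vec<u32> {
    a.iter().chain(b.iter()).copied().collect()
}

fn main() {
    let a: Vec<u32> = (0..1_000).collect();
    let b: Vec<u32> = (1_000..2_000).collect();
    let v = collect_chained(&a, &b);
    assert_eq!(v.len(), 2_000);
}
```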