perf: Use `for_each` in `Vec::extend` (#68046)
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)

It seems that benchmarks aren't run/checked by CI as

@bors try @rust-timer queue

Awaiting bors try build completion
`for_each` is specialized for iterators such as `chain`, allowing for faster iteration than a normal `for`/`while` loop. Note that since this only checks `size_hint` once at the start, it may end up needing to call `reserve` more in the case that `size_hint` returns a larger and more accurate lower bound during iteration. This could maybe be alleviated with an implementation closer to the current one, but the extra complexity will likely end up harming the normal case of an accurate or 0 (think `filter`) lower bound.

```rust
while let Some(element) = iterator.next() {
    let (lower, _) = iterator.size_hint();
    self.reserve(lower.saturating_add(1));
    unsafe {
        let len = self.len();
        ptr::write(self.get_unchecked_mut(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }
    iterator.by_ref().take(self.capacity()).for_each(|element| {
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    });
}

// OR

let (lower, _) = iterator.size_hint();
self.reserve(lower);
loop {
    let result = iterator.by_ref().try_for_each(|element| {
        if self.len() == self.capacity() {
            return Err(element);
        }
        unsafe {
            let len = self.len();
            ptr::write(self.get_unchecked_mut(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
        Ok(())
    });
    match result {
        Ok(()) => break,
        Err(element) => {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
            self.push(element);
        }
    }
}
```

Closes #63340
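The lower-bound behaviour mentioned in the description can be checked directly: adapters like `filter` report a lower bound of 0 even when they will yield many elements, so a single up-front `reserve(lower)` may under-allocate, while `chain` reports an exact hint. A standalone check (not part of the PR itself):

```rust
fn main() {
    // `filter` cannot know how many items pass the predicate,
    // so its lower bound stays at 0 while the upper bound is inherited.
    let filtered = (0..100).filter(|x| x % 2 == 0);
    assert_eq!(filtered.size_hint(), (0, Some(100)));

    // `chain` of two exact-size ranges reports an exact hint.
    let chained = (0..3).chain(3..6);
    assert_eq!(chained.size_hint(), (6, Some(6)));
}
```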
☀️ Try build successful - checks-azure

Queued 45ce712 with parent adc6572, future comparison URL.

Finished benchmarking try commit 45ce712, comparison URL.
The piston-image regression doesn't make any sense...
I don't have time myself to figure out why the benchmarks here regressed, but it's not obvious that it's truly spurious; if we're generating many more impls or more complicated impls due to unrolling or so, it's not impossible we're regressing. (For one thing, the new code contains a closure, whereas the previous didn't, and could plausibly instantiate less as such.) Happy to rerun the compiler benchmarks if you want to remove the closure (I suspect it might be replaceable with

I would also like to see some ad-hoc benchmarks (e.g., extending with 100 elements or something) done locally before/after this.
Your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem. Click to expand the log.
I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact
It's worse than that. The current implementation takes one element from the iterator and then queries for a `size_hint`; this PR takes the `size_hint` first. The standard library contains iterators that initially can't provide a size hint at all and only provide better ones as you iterate, perhaps even only later down the road. The btree range iterators and flattening iterators would be such examples.

```rust
let mut flat = vec![vec!['b'; 1], vec!['c'; 1_000_000]].into_iter().flatten();
dbg!(flat.size_hint());
flat.next();
dbg!(flat.size_hint());
flat.next();
dbg!(flat.size_hint());
```

I don't think this is currently covered by benchmarks, so the drawbacks of one case are easily missed over the advantages in another case.
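The point above can be made concrete with assertions instead of `dbg!`: the lower bound only becomes useful once the large inner vector has been reached (a standalone check against the current std `Flatten` behaviour):

```rust
fn main() {
    let mut flat = vec![vec!['b'; 1], vec!['c'; 1_000_000]]
        .into_iter()
        .flatten();

    // Before any iteration the flattening iterator cannot see into its
    // inner iterators, so the lower bound is 0.
    assert_eq!(flat.size_hint().0, 0);

    flat.next(); // yields 'b'; the first inner vec is now exhausted
    assert_eq!(flat.size_hint().0, 0);

    flat.next(); // yields 'c'; the million-element inner vec is now current
    assert_eq!(flat.size_hint().0, 999_999);
}
```

So an implementation that reads `size_hint` only once, before advancing, would see a lower bound of 0 here and gain nothing from the reservation.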
Have you run the whole battery of vec benchmarks and done a before/after comparison, e.g. via `cargo benchcmp`?
src/liballoc/vec.rs (Outdated)

```rust
self.reserve(lower);
loop {
    let cap = self.capacity();
    let result = iterator.by_ref().try_fold((), |(), element| {
```
You can use `try_for_each`, which takes a reference (so no `by_ref` needed) and doesn't require an element to fold through.
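For reference, `try_for_each` takes `&mut self` just like `try_fold`, and it threads no accumulator, so the closure only returns the `Try` value. A minimal standalone illustration of both properties:

```rust
fn main() {
    let mut iter = 0..10;

    // Short-circuits on the first odd number; no `()` accumulator to
    // thread through, unlike `try_fold((), |(), x| ...)`.
    let result = iter.try_for_each(|x| if x % 2 == 1 { Err(x) } else { Ok(()) });
    assert_eq!(result, Err(1));

    // `try_for_each` only borrowed the iterator, so it stays usable:
    // 0 and 1 were consumed, the next element is 2.
    assert_eq!(iter.next(), Some(2));
}
```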
Until #46477 is fixed you probably should replace the closure by passing a free function instead (such as #62429 did for iterators). I think that's what @Mark-Simulacrum meant.
> You can use `try_for_each` which takes a reference (so no `by_ref` needed) and doesn't require an element to fold through.
`try_for_each` just forwards to `try_fold` for `Chain` (same for most other iterators). I wanted to avoid tripping up the optimizer as much as possible, as I have not seen any improvements so far.

`try_fold` also only takes a `&mut self`, so `by_ref` should definitely be removed. I just saw that `Iterator for &mut T` is unable to forward `try_fold` and friends due to object safety, which means I was just using the default implementation for `try_fold`, easily explaining why I am not seeing any improvement for chained iterators...

`by_ref()` shouldn't affect anything in this case actually. The correct `try_fold` should still be selected, as it would require calling `try_fold` on a `&mut &mut T` to get the wrong one. This might be fixable with some specialization at least: #68472
> Until #46477 is fixed you probably should replace the closure by passing a free function instead (such as #62429 did for iterators). I think that's what @Mark-Simulacrum meant.
Yeah, should do that. Was just trying to get an actual benchmark showing that this was an improvement first :/
```rust
// NB can't overflow since we would have had to alloc the address space
self.set_len(len + 1);
let (lower, _) = iterator.size_hint();
self.reserve(lower);
```
This still has the issue of trying to reserve before advancing the iterator. Taking an element and then checking the hint and capacity is strictly better, because the hint will be more accurate, and if you take a `None` you can skip the hint calculation and reserve attempt.
The benefit is that it avoids calling `next` at all. While the hint should be more accurate after calling `next`, it would also be even more accurate if we called `next` twice, so I don't quite buy that argument. Calling `next` once would give a fast path for the empty iterator at least, which I find a more convincing argument; however, it is only a benefit if `size_hint(); reserve(0)` is slow enough to warrant this fast path.

Pushed another variant which does `next` first, which should optimize better for the empty iterator. Haven't got time to benchmark it yet.
I was more concerned about doing a suboptimal resize if we got an incorrect lower bound on the first attempt, but you're right: if it's just 0 it hopefully would be cheap enough, and the next reserve will do it properly.
Have gone back and forth a bit, but I think reserving eagerly has the better tradeoff.

On `size_hint == 0` it does call `reserve` unnecessarily, but that is only one more branch in an already fast case. If `size_hint > 0`, then most of the time we will reserve either way, and most of the time with the same value (`size_hint_minus_removed_value + 1` vs `size_hint`).

The only case where I'd expect it to be faster to call `next` first is if `size_hint` is expensive despite the iterator being empty, and `next` is faster to call than `size_hint`, which seems unlikely (happy to be corrected however!).
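The eager-reserve strategy being discussed (the second variant from the PR description) can be sketched in safe code, using `push` in place of the raw-pointer writes; `extend_via_try_for_each` is a name invented here, not something in liballoc:

```rust
// Sketch of the eager-reserve extend loop: reserve from the initial hint,
// fill up to capacity via `try_for_each`, and on running out of room
// re-check the (possibly improved) hint, grow, and continue.
fn extend_via_try_for_each<T, I: Iterator<Item = T>>(vec: &mut Vec<T>, mut iter: I) {
    let (lower, _) = iter.size_hint();
    vec.reserve(lower);
    loop {
        let result = iter.by_ref().try_for_each(|element| {
            if vec.len() == vec.capacity() {
                return Err(element); // out of reserved space; leave the inner loop
            }
            vec.push(element); // cannot reallocate: len < capacity
            Ok(())
        });
        match result {
            Ok(()) => break, // iterator exhausted
            Err(element) => {
                // The hint may be larger and more accurate now.
                let (lower, _) = iter.size_hint();
                vec.reserve(lower.saturating_add(1));
                vec.push(element);
            }
        }
    }
}

fn main() {
    let mut v = vec![1];
    // `filter` reports a lower bound of 0, exercising the re-reserve path.
    extend_via_try_for_each(&mut v, (2..=5).filter(|x| x % 2 == 0));
    assert_eq!(v, vec![1, 2, 4]);
}
```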
Just a driveby comment, without knowing any of the details… Is it possible this is an opposite case of #64572?
LLVM seems to be clever enough to optimize the current,

Another perf run could be interesting though, to see if the regressions are at least fixed with the current implementation (seeing lots of variance in my benchmarks, but it does not appear slower at least).
Monitor your CPU clocks when running microbenchmarks. Thermal throttling can confound results. Disabling boost clocks can help if that's the case.
Should put less stress on LLVM since there are fewer closures passed around, and lets us refine how much we reserve with `size_hint` if the first guess is too low.
This could use another perf run with the latest changes. I suspect it won't be enough to remedy the compile time regressions though.

Ping from Triage: Any updates? @Marwes?

If the perf run still shows an unacceptable regression I don't see a way to land this right now. Perhaps with MIR optimizations acting on the generic functions it could be optimized enough before getting to LLVM to make it possible, but otherwise the overhead is fairly fundamental to the change.

@bors try @rust-timer queue

Awaiting bors try build completion

⌛ Trying commit e41f55e with merge f2ee309252c8b5a6db0a206d760db8047cfb69eb...

💥 Test timed out
@bors try @rust-timer queue

Awaiting bors try build completion

⌛ Trying commit e41f55e with merge 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05...

☀️ Try build successful - checks-azure

Queued 18cc66a1f5cf20ca7c7e28fbe43469d13c436b05 with parent 38114ff, future comparison URL.

Ping from Triage: Any updates @Marwes? Thank you.
I can't see a way to resolve the compiler performance regressions from this at this time. Perhaps with more optimizations before monomorphization the overhead could be reduced enough, so that LLVM wouldn't need to instantiate and inline this added complexity for each and every iterator.