Tweak the default PartialOrd::{lt,le,gt,ge} #106065
Conversation
Failed to set assignee to
@bors try @rust-timer queue
⌛ Trying commit 6e1c3f0 with merge b6f32e9a3b254c2d1a3431d90ed5169aca532ea6...
use std::cmp::Ordering;

#[derive(PartialOrd, PartialEq)]
pub struct Foo(u16);
Hopefully this test will ensure that the problem you saw with `BytePos` won't happen again, and will make an accidental regression easier to catch.
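For reference, a minimal sketch of what such a codegen test could look like — the FileCheck patterns and function names here are illustrative assumptions, not the test as committed:

```rust
// compile-flags: -O
#![crate_type = "lib"]

#[derive(PartialOrd, PartialEq)]
pub struct Foo(u16);

// The derived `<` on a trivial newtype should lower to a single unsigned
// compare, with no leftover three-way-comparison machinery.
// CHECK-LABEL: @lt_foo
// CHECK: icmp ult i16
// CHECK-NOT: select
#[no_mangle]
pub fn lt_foo(a: &Foo, b: &Foo) -> bool {
    a < b
}
```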
library/core/src/cmp.rs (Outdated)

@@ -1161,7 +1175,11 @@ pub trait PartialOrd<Rhs: ?Sized = Self>: PartialEq<Rhs> {
     #[must_use]
     #[stable(feature = "rust1", since = "1.0.0")]
     fn gt(&self, other: &Rhs) -> bool {
-        matches!(self.partial_cmp(other), Some(Greater))
+        if let Some(ordering) = self.partial_cmp(other) {
+            ordering.is_gt()
This is now conceptually two checks, rather than just one, so it's possible it's not always better. `None` is currently 2 here, so the old code was hypothetically just `c == 1`, and now it's `c != 2 && c > 0`. (Of course `lt` ends up being `c != 2 && c < 0`, which obviously folds to `c < 0`, so that one's probably not impacted.)
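To make that arithmetic concrete, here is a small illustration of the two shapes being compared (the `1`/`2` values reflect the layout described above; the `None` niche being 2 is an implementation detail, not a guarantee):

```rust
use std::cmp::Ordering;

// Old default: check for exactly one variant, conceptually `c == 1`.
pub fn gt_old(c: Option<Ordering>) -> bool {
    matches!(c, Some(Ordering::Greater))
}

// New default: first rule out `None` (niche value 2), then test the sign,
// conceptually `c != 2 && c > 0`.
pub fn gt_new(c: Option<Ordering>) -> bool {
    if let Some(ordering) = c {
        ordering.is_gt()
    } else {
        false
    }
}
```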
My hope is that this is still better in practice:

- I would bet that most `partial_cmp`s are actually `cmp`s, and thus the optimizer will easily notice that the result is never `None` -- like happens in the codegen test.
- For things that can actually return `None`, hopefully jump-threading will usually notice that the `None` becomes `false` and will again bypass actually running this check at runtime (an example of such a type is sketched below).
I'll see if I can prove that out in a codegen test...
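As an example of the kind of type the second bullet is about — the `Price` newtype here is made up for illustration, not taken from the PR:

```rust
// A derived `PartialOrd` on a float newtype returns `None` when either
// field is NaN, so `<` goes through the default `lt` and the `None` arm
// has to fold to `false` for the comparison to optimize well.
#[derive(PartialEq, PartialOrd)]
pub struct Price(f32);

pub fn cheaper(a: &Price, b: &Price) -> bool {
    a < b
}
```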
Well, I didn't manage to make a great codegen test for this, but I did in passing find two other things:

- We should start putting more `noundef` on parameters: https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/We.20will.20want.20a.20lot.20of.20noundefs/near/317472833
- LLVM doesn't optimize everything as well as it should: "Implementing `<=` via 3-way comparison doesn't optimize down" (llvm/llvm-project#59666)
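The linked LLVM issue is about shapes like the following (my own reduction, assumed to be in the spirit of the report, not copied from it): both functions should compile to the same code, but the three-way-comparison form historically doesn't optimize down:

```rust
use std::cmp::Ordering;

// `<=` expressed through the three-way comparison...
pub fn le_via_cmp(a: i32, b: i32) -> bool {
    a.cmp(&b) != Ordering::Greater
}

// ...versus the direct comparison it should collapse to.
pub fn le_direct(a: i32, b: i32) -> bool {
    a <= b
}
```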
☀️ Try build successful - checks-actions
Finished benchmarking commit (b6f32e9a3b254c2d1a3431d90ed5169aca532ea6): comparison URL.

Overall result: no relevant changes - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.
Well that's a whole lot of nothing in perf 😅 I saw your thumb; I could also cut this back to just the codegen test, since it already passes, if that's useful but we don't want the libcore changes.

(Apparently highfive didn't like my previously-proposed reviewer.)
Yeah, I'm only t-miri, I can't approve anything in this repo. I like this work, but I'm extremely wary of checking in subtle changes like this that aren't backed up by any kind of test. I'm very curious to know what an LLVM expert thinks of that issue. If this is another "oh we're missing a fold for that" situation, that would be awesome. But I kind of doubt it.
@saethlin I went to try making an assembly test for this, and in doing so ended up filing the more minimized llvm/llvm-project#59668. I'm hoping that the answer really is that there's just some fold or range logic missing. Alive2 proves that it's allowed to do it, so it's a matter of how/where to recognize it.
Wow that's very minimized. You're getting my hopes up...
That new 59668 one is too minimized to help the original, though -- it's about the backend, and in IR (where the optimizations that #105840 cares about would happen) they're all just […]
I'm going to close this since the perf results didn't show any improvement. I've submitted the codegen test as #106100.
Sorry, didn't comment on this before it was closed, but I agree that given the lack of improvement these changes are not worth it.
@compiler-errors No worries! Thanks for commenting. Given that it's the end of the year, I have no expectations that people would be looking at things for a while.
…=compiler-errors

Codegen test for derived `<` on trivial newtype [TEST ONLY]

I originally wrote this for rust-lang#106065, but the libcore changes there aren't necessarily a win. So I pulled out this test to be its own PR since it's important (see rust-lang#105840 (comment)) and well-intentioned changes to core or the derive could accidentally break it without that being obvious (other than by massive unexplained perf changes).
Micro-optimize Ord::cmp for primitives

I originally started looking into this because in MIR, `Ord::cmp` is _huge_, and even for trivial types like `u32` which are theoretically a single statement to compare, the `Ord::cmp` impl doesn't inline. A significant contributor to the size of the implementation is that it has two comparisons. And this actually follows through to the final x86_64 codegen too, which is... strange. We don't need two `cmp` instructions in order to do a single Rust-level comparison. So I started tweaking the implementation, and came up with the same thing as rust-lang#64082 (which I didn't know about at the time). I ran `llvm-mca` on it per the issue which was linked in the code to establish that it looked better, and submitted it for a benchmark run.

The initial benchmark run regresses basically everything. By looking through the cachegrind diffs in the perf report, then the `perf annotate` for regressed functions, I was able to identify one source of the regression: `Ord::min` and `Ord::max` no longer optimize well. Tweaking them to bypass `Ord::cmp` removed some regressions, but not much.

Diving back into the cachegrind diffs and disassembly, I found one huge widespread issue was that the codegen for `Span`'s `hash_stable` regressed because `span_data_to_lines_and_cols` no longer inlined into it, because that function does a lot of `Range<BytePos>::contains`. The implementation of `Range::contains` uses `PartialOrd` multiple times, and we had massively regressed the codegen of `Range::contains`. The root problem here seems to be that `PartialOrd` is derived on `BytePos`, which is a simple wrapper around a `u32`. So for `BytePos`, `PartialOrd::{le, lt, ge, gt}` use the default impls, which go through `partial_cmp`, and LLVM fails to optimize these combinations of methods with the new `Ord::cmp` implementation. At a guess, the new implementation makes LLVM totally lose track of the fact that `<Ord for u32>::cmp` is an elaborate way to compare two integers.

So I have low hopes for this overall, because my strategy (which is working) to recover the regressions is to avoid the "faster" implementation that this PR is based around. If we have to settle for an implementation of `Ord::cmp` which is on its own sub-optimal but is optimized better in combination with functions that use its return value in specific ways, so be it. However, one of the runs had an improvement in `coercions`. I don't know if that is jitter or relevant. But I'm still finding threads to pull here, so I'm going to keep at it.

For the moment I am hacking up the implementations on `BytePos` (see the sketch below) instead of modifying the code that `derive(PartialOrd, Ord)` expands to, because that would be hard, and it would also mean that we would just expand to more code, perhaps regressing compile time for that reason, even if the generated assembly is more efficient.

---

Hacking up the remainder of the `PartialOrd`/`Ord` methods on `BytePos` took us down to 3 regressions and 6 improvements, which is interesting. All the improvements are in `coercions`, so I'm sure this improved _something_, but whether it matters... hard to say.

Based on the findings of `@joboet`, I'm going to cherry-pick rust-lang#106065 onto this branch, because that strategy seems to improve `PartialOrd::lt` and `PartialOrd::ge` back to the original codegen, even when they are using our new `Ord::cmp` impl. If the remaining perf regressions are due to de-optimizing a `PartialOrd::lt` not on `BytePos`, this might be a further improvement.

---

Okay, that cherry-pick brought us down to 2 regressions, but that might be noise. We still have the same 6 improvements, all on `coercions`.

I think the next thing to try here is modifying the implementation of `derive(PartialOrd)` to automatically emit the modifications that I made to `BytePos` (directly implementing all the methods for newtypes). But even if that works, I think the effect of this change is so mixed that it's probably not worth merging with current LLVM. What I'm afraid of is that this change currently pessimizes matching on `Ordering`, and that is the most natural thing to do with an enum.

So I'm not closing this yet, but I think without a change from LLVM, I have other priorities at the moment.

r? `@ghost`
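For illustration, roughly what "directly implementing all the methods for newtypes" means for a `BytePos`-like wrapper — a sketch of the approach described above, not the actual patch:

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
pub struct BytePos(pub u32);

impl Ord for BytePos {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

impl PartialOrd for BytePos {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }

    // Forward the operators straight to the primitive comparisons instead of
    // using the trait's defaults, which route through the three-way compare.
    fn lt(&self, other: &Self) -> bool { self.0 < other.0 }
    fn le(&self, other: &Self) -> bool { self.0 <= other.0 }
    fn gt(&self, other: &Self) -> bool { self.0 > other.0 }
    fn ge(&self, other: &Self) -> bool { self.0 >= other.0 }
}
```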
r? @saethlin who noticed that #105840 was having trouble because of these default implementations.

That got me inspired to give this a shot, to see whether tweaking those defaults might actually improve things -- and hopefully make that PR easier to land. (And maybe even test, since this adds a codegen test that it would not want to regress.)
Specifically, I noticed in https://rust.godbolt.org/z/3fbve7eW7 that the formulation written as a comparison against zero did optimize as desired, whereas the one matching specific variants didn't. So this PR bases all the `Ordering` methods around comparisons against `0`, rather than trying to match specific variants (a sketch of the idea is at the end of this description).

Let's see what perf says 🤞
EDIT: Also, credit to @joboet in #105840 (comment) who first pointed out that matching the variants directly isn't necessarily better.
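A sketch of the shape this gives the defaults — written here as free functions, with the cast to `i8` standing in for "comparison against 0"; the real trait methods may differ in detail:

```rust
use std::cmp::Ordering;

// `Ordering` is `#[repr(i8)]` with Less = -1, Equal = 0, Greater = 1,
// so "greater than" is just "the three-way result is positive".
pub fn gt_default<T: PartialOrd + ?Sized>(a: &T, b: &T) -> bool {
    match a.partial_cmp(b) {
        Some(ordering) => (ordering as i8) > 0,
        None => false,
    }
}

// "Less than or equal" is "the three-way result is non-positive".
pub fn le_default<T: PartialOrd + ?Sized>(a: &T, b: &T) -> bool {
    match a.partial_cmp(b) {
        Some(ordering) => (ordering as i8) <= 0,
        None => false,
    }
}
```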