perf: specialize range operations for better performances #177
Conversation
```rust
/// Compute the union of two terms.
/// If at least one term is negative, the union is also negative.
pub(crate) fn union(&self, other: &Self) -> Self {
    (self.negate().intersection(&other.negate())).negate()
}
```
Inlining this code may be a micro optimization, saving a clone or two, but definitely seems worth doing.
Force-pushed 8ddd053 to 544ed6a
The new version is slightly slower but passes the tests: it spends 45% in `union`, mostly
```rust
/// Whether all ranges of `self` are contained in `other`
fn subset_of(&self, other: &Self) -> bool {
    self == &self.intersection(other)
}
```
In #59, @mpizenberg was concerned about how much complexity this adds to the API, so I would appreciate getting his opinion now.
Personally, I'm not too worried about adding methods that have default implementations. It doesn't add extra effort for the implementer but allows extra control where performance matters.
Force-pushed a3376d9 to c0e306f
I had trouble thinking useful thoughts looking at the full diff all at once, so I pulled it into my editor and started a new branch, adding each new piece of functionality one at a time in separate commits, then adding the tests and cleaning up the nits I saw before moving on to the next piece of functionality. The result is https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review. Consider it a "yes and" to the wonderful work in this PR. But I'm not sure on a practical level how to take the next step. Do you want to pull those changes into this PR? Or should I start a new one? Anyway, I'm gonna go run some benchmarks.
Force-pushed d198313 to f06e511
Thanks! I force-pushed your commits to this branch (there's a backup branch, just in case) and added a clippy commit on top.
I rewrote the things I didn't like, and so I like what's left.
I'm going to give it a few days in case other people want to respond. Specifically for @mpizenberg about the public API.
If I don't hear back I will merge later this week.
Force-pushed 71c9a51 to 4fc3a50
Thanks for doing this performance work! As @Eh2406 suggested, I was previously skeptical of this kind of inlining degrading the expressiveness of the computations. Judging by Jacob's benchmarks, I'm still doubting it's the best course of action. On sub-second runs it does not seem to fundamentally change speed, and for very long runs it has a non-negligible impact, but I feel like this order of magnitude (multiple tens of seconds) is something that should be approached at the computational-complexity level (solving the big-O bottlenecks) instead.
However, I can also understand that 30% of something that takes a very long time is a non-negligible gain, so I'm in favor if you want to merge this.
I've made a few suggestions; feel free to address or ignore them and go through with the merge.
```rust
/// |----|
///        |-----|
/// ```
fn end_before_start_with_gap<V: PartialOrd>(end: &Bound<V>, start: &Bound<V>) -> bool {
```
What do you think of `non_contiguous_bounds`? Or `contiguous_bounds` with the bool flipped?
@mpizenberg does the "real world" improvement of 12s to 3s on the
Indeed, that's a substantial difference!
Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?
This work is entirely profiling- and benchmark-driven. I started off with two smaller problems,
Yesterday I went back and looked at my code from when I originally proposed this. You are right to reject that code. It was inscrutable. The code Konstin has now is much better.
There is definitely a gap in the benchmarking infrastructure I have available. I can't seem to get a benchmark that has the performance characteristics of your real world data. Help on this would definitely be appreciated. The closest I have is
I should be able to run all benchmarks with version wrapped in
I added two more small optimizations on top for your review: https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review
No difference in performance when cherry-picking the three commits on top:
This update pulls in pubgrub-rs/pubgrub#177, optimizing common range operations in pubgrub. Please refer to this PR for a more extensive description and discussion of the changes. The changes optimize that last remaining pathological case, `bio_embeddings[all]` on python 3.12, which has to try 100k versions, from 12s to 3s in the cached case. It should also enable smarter prefetching in batches (#170), even though a naive attempt didn't show better network usage. **before** 12s ![image](https://github.com/pubgrub-rs/pubgrub/assets/6826232/80ffdc49-1159-453d-a3ea-0dd431df6d3b) **after** 3s ![image](https://github.com/pubgrub-rs/pubgrub/assets/6826232/69508c29-73ab-4593-a588-d8c722242513) ``` $ taskset -c 0 hyperfine --warmup 1 "../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in" "../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in" Benchmark 1: ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in Time (mean ± σ): 12.321 s ± 0.064 s [User: 12.014 s, System: 0.300 s] Range (min … max): 12.224 s … 12.406 s 10 runs Benchmark 2: ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in Time (mean ± σ): 3.109 s ± 0.004 s [User: 2.782 s, System: 0.321 s] Range (min … max): 3.103 s … 3.116 s 10 runs Summary ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ran 3.96 ± 0.02 times faster than ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ``` It also adds `bio_embeddings[all]` as a requirements test case.
In uv, we have a very slow case with bio_embeddings, which includes boto3 and botocore. In the fully cached case, the great majority of time is spent in range operations with large ranges due to the many versions.
I replaced some range operations implemented by their mathematical definitions with optimized versions. This PR is based on profiling my test case, `bio_embeddings[all]` on Python 3.12. To reproduce, check out uv (`git clone https://github.com/astral-sh/uv`) and replace the pubgrub dependency with astral-sh#25, build with `cargo build --profile profiling`, and run with `samply record` (I ran with `--rate 5000`).

**before** 12s

**after** 3s
The atomic/drop operations are clone and drop for our version type, which is an `Arc`. This scenario is slow because we try a lot of versions:

(taskset decreases the variance for a minimal slowdown; we're bound on single-threaded pubgrub in this benchmark and not on multithreaded IO/parsing.)

I'd say this closes #135.
The combination of the lack of `Ord` on `Bound` and `Unbounded` meaning `-inf` or `inf` depending on the (untyped) context makes implementing these manually tedious; I wonder if we could add some abstraction? The first two commits of this PR can be split out into separate PRs. From main over each of the three commits: