
perf: specialize range operations for better performances #177

Merged (15 commits) on Mar 13, 2024

Conversation

@konstin konstin (Member) commented Jan 5, 2024

In uv, we have a very slow case with bio_embeddings, which includes boto3 and botocore. In the fully cached case, the great majority of time is spent in range operations on large ranges, due to the many versions.
I replaced some range operations that were implemented through their mathematical definitions with optimized versions. This PR is based on profiling my test case, bio_embeddings[all] on python 3.12.

To reproduce: check out uv (`git clone https://github.com/astral-sh/uv`), replace the pubgrub dependency with astral-sh#25, build with `cargo build --profile profiling`, and run with `samply record`; I ran with `--rate 5000`.

before: 12s

(flamegraph screenshot)

after: 3s

(flamegraph screenshot)

The atomic/drop operations are clone and drop for our version type, which is an Arc.

This scenario is slow because we try a lot of versions:

Tried 101329 versions: pytest 2628, botocore 2627, blis 1314, boto3 1314, certifi 1314, charset-normalizer 1314, click 1314, conllu 1314, contourpy 1314, cycler 1314, cymem 1314, editdistance 1314, flaky 1314, flask 1314, flask-cors 1314, fonttools 1314, fsspec 1314, ftfy 1314, gevent 1314, huggingface-hub 1314, idna 1314, jinja2 1314, joblib 1314, jsonnet 1314, jsonpickle 1314, kiwisolver 1314, murmurhash 1314, networkx 1314, nltk 1314, numba 1314, numpydoc 1314, nvidia-cublas-cu12 1314, nvidia-cuda-cupti-cu12 1314, nvidia-cuda-nvrtc-cu12 1314, nvidia-cuda-runtime-cu12 1314, nvidia-cudnn-cu12 1314, nvidia-cufft-cu12 1314, nvidia-curand-cu12 1314, nvidia-cusolver-cu12 1314, nvidia-cusparse-cu12 1314, nvidia-nccl-cu12 1314, nvidia-nvtx-cu12 1314, overrides 1314, packaging 1314, parsimonious 1314, plac 1314, preshed 1314, protobuf 1314, pynndescent 1314, pyparsing 1314, python-dateutil 1314, pytorch-pretrained-bert 1314, pytorch-transformers 1314, pyyaml 1314, regex 1314, responses 1314, ruamel-yaml-clib 1314, safetensors 1314, sentencepiece 1314, six 1314, spacy 1314, sqlparse 1314, srsly 1314, sympy 1314, tenacity 1314, tensorboardx 1314, thinc 1314, threadpoolctl 1314, tokenizers 1314, typing-extensions 1314, tzdata 1314, unidecode 1314, wasabi 1314, word2number 1314, wrapt 1314, allennlp 27, torch 20, bio-embeddings[all] 10, bio-embeddings 4, biopython 4, gensim 4, h5py 4, matplotlib 3, numpy 3, pandas 3, plotly 3, ruamel-yaml 3, scikit-learn 3, scipy 3, Python 2, appdirs 2, importlib-metadata 2, jax-unirep 2, jaxlib 2, lock 2, alabaster 1, babel 1, bio-embeddings-allennlp 1, bio-embeddings-bepler 1, bio-embeddings-cpcprot 1, bio-embeddings-esm 1, bio-embeddings-plus 1, bio-embeddings-tape-proteins 1, blinker 1, docutils 1, filelock 1, greenlet 1, imagesize 1, iniconfig 1, itsdangerous 1, jmespath 1, llvmlite 1, markupsafe 1, mpmath 1, nvidia-nvjitlink-cu12 1, pillow 1, pluggy 1, pygments 1, pytz 1, requests 1, root 1, s3transfer 1, setuptools 1, smart-open 1, 
snowballstemmer 1, sphinx 1, sphinxcontrib-applehelp 1, sphinxcontrib-devhelp 1, sphinxcontrib-htmlhelp 1, sphinxcontrib-jsmath 1, sphinxcontrib-qthelp 1, sphinxcontrib-serializinghtml 1, tabulate 1, tqdm 1, transformers 1, umap-learn 1, urllib3 1, wcwidth 1, werkzeug 1, zope-event 1, zope-interface 1

(taskset decreases the variance for a minimal slowdown; we're bound on single-threaded pubgrub in this benchmark, not on multithreaded io/parsing.)

$  taskset -c 0 hyperfine --warmup 1 "../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in"  "../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in" 
Benchmark 1: ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
  Time (mean ± σ):     12.321 s ±  0.064 s    [User: 12.014 s, System: 0.300 s]
  Range (min … max):   12.224 s … 12.406 s    10 runs
 
Benchmark 2: ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      3.109 s ±  0.004 s    [User: 2.782 s, System: 0.321 s]
  Range (min … max):    3.103 s …  3.116 s    10 runs
 
Summary
  ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ran
    3.96 ± 0.02 times faster than ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in

I'd say this closes #135.


The combination of `Bound` lacking `Ord` and `Unbounded` meaning either -inf or +inf depending on the (untyped) context makes implementing these manually tedious; I wonder if we could add some abstraction?
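To make the tedium concrete, here is a hedged sketch (a hypothetical helper, not pubgrub's API) of the kind of abstraction this asks for: a comparator for start bounds that fixes the meaning of `Unbounded` as negative infinity. End bounds would need a mirrored helper where `Unbounded` means positive infinity, which is exactly the context-dependence described above.

```rust
use std::cmp::Ordering;
use std::ops::Bound::{self, Excluded, Included, Unbounded};

/// Hypothetical helper (not pubgrub's API): compare two *start* bounds,
/// fixing the meaning of `Unbounded` as negative infinity. An analogous
/// helper for *end* bounds would treat `Unbounded` as positive infinity.
fn cmp_start_bounds<V: Ord>(a: &Bound<V>, b: &Bound<V>) -> Ordering {
    match (a, b) {
        (Unbounded, Unbounded) => Ordering::Equal,
        (Unbounded, _) => Ordering::Less,
        (_, Unbounded) => Ordering::Greater,
        (Included(x), Included(y)) | (Excluded(x), Excluded(y)) => x.cmp(y),
        // At the same version, an inclusive start begins before an exclusive one.
        (Included(x), Excluded(y)) => x.cmp(y).then(Ordering::Less),
        (Excluded(x), Included(y)) => x.cmp(y).then(Ordering::Greater),
    }
}

fn main() {
    assert_eq!(cmp_start_bounds(&Unbounded, &Included(1)), Ordering::Less);
    assert_eq!(cmp_start_bounds(&Included(3), &Excluded(3)), Ordering::Less);
    assert_eq!(cmp_start_bounds(&Excluded(2), &Included(3)), Ordering::Less);
}
```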


The first two commits of this PR can be split out into separate PRs. From main over each of the three commits:

(four flamegraph screenshots: main, then after each of the three commits)

/// Compute the union of two terms.
/// If at least one term is negative, the union is also negative.
pub(crate) fn union(&self, other: &Self) -> Self {
(self.negate().intersection(&other.negate())).negate()
Inlining this code may be a micro-optimization, saving a clone or two, but it definitely seems worth doing.
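For intuition on what such an inlining buys: the De Morgan form above pays for three complements and one intersection, each allocating intermediate ranges. Below is a minimal sketch (my illustration, not the PR's actual code) of a direct one-pass union, simplified to sorted lists of closed `u32` intervals rather than pubgrub's `Bound`-based segments:

```rust
/// Illustrative only: union of two sorted, non-overlapping interval lists,
/// simplified to closed `u32` intervals (pubgrub's real `Range` stores
/// `(Bound<V>, Bound<V>)` segments). One linear pass, no intermediate
/// complements or intersections.
fn union(a: &[(u32, u32)], b: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut merged: Vec<(u32, u32)> = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() || j < b.len() {
        // Take the interval with the smaller start from either list.
        let next = if j == b.len() || (i < a.len() && a[i].0 <= b[j].0) {
            let s = a[i]; i += 1; s
        } else {
            let s = b[j]; j += 1; s
        };
        match merged.last_mut() {
            // Overlapping or adjacent (integers): extend the previous interval.
            Some(last) if next.0 <= last.1 + 1 => last.1 = last.1.max(next.1),
            _ => merged.push(next),
        }
    }
    merged
}

fn main() {
    assert_eq!(union(&[(1, 3), (7, 9)], &[(2, 5)]), vec![(1, 5), (7, 9)]);
}
```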

@konstin konstin force-pushed the konsti/dev/faster-range-operations branch from 8ddd053 to 544ed6a Compare January 8, 2024 18:22
@konstin konstin (Member Author) commented Jan 8, 2024

The new version is slightly slower but passes the tests:

$ hyperfine --warmup 1 --runs 3 "target/profiling/main bio_embeddings.in"  "target/profiling/branch bio_embeddings.in"
Benchmark 1: target/profiling/main bio_embeddings.in
  Time (mean ± σ):     12.007 s ±  0.035 s    [User: 11.654 s, System: 0.301 s]
  Range (min … max):   11.977 s … 12.045 s    3 runs

Benchmark 2: target/profiling/branch bio_embeddings.in
  Time (mean ± σ):      6.961 s ±  0.017 s    [User: 6.565 s, System: 0.323 s]
  Range (min … max):    6.942 s …  6.974 s    3 runs

Summary
  target/profiling/branch bio_embeddings.in ran
    1.72 ± 0.01 times faster than target/profiling/main bio_embeddings.in

It spends 45% in union, mostly in Arc drops, adds, and Vec pushes.

/// Whether all range of `self` are contained in `other`
fn subset_of(&self, other: &Self) -> bool {
self == &self.intersection(other)
}
@Eh2406 Eh2406 (Member) Jan 8, 2024
In #59 @mpizenberg was concerned how much complexity this adds to the api. So I would appreciate getting his opinion now.

@konstin (Member Author)
Personally, I'm not too worried about adding methods which have default implementations. It doesn't add additional effort for the implementer but allows extra control where performance matters.
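The pattern under discussion can be sketched as follows; trait and type names here are illustrative, not pubgrub's actual API. The trait ships a correct default in terms of `intersection`, and an implementer can override it where performance matters:

```rust
/// Sketch of the pattern: a trait method with a default implementation in
/// terms of existing operations, overridable with a specialized fast path.
/// Names are illustrative, not pubgrub's actual API.
trait VersionSet: Sized + PartialEq {
    fn intersection(&self, other: &Self) -> Self;

    /// Default: `self` is a subset of `other` iff `self ∩ other == self`.
    /// Correct, but pays for a full intersection plus an equality check.
    fn subset_of(&self, other: &Self) -> bool {
        self == &self.intersection(other)
    }
}

#[derive(PartialEq, Debug)]
struct BitSet(u64);

impl VersionSet for BitSet {
    fn intersection(&self, other: &Self) -> Self {
        BitSet(self.0 & other.0)
    }
    // Specialized override: no allocation, a single mask-and-compare.
    fn subset_of(&self, other: &Self) -> bool {
        (self.0 & !other.0) == 0
    }
}

fn main() {
    assert!(BitSet(0b0110).subset_of(&BitSet(0b1110)));
    assert!(!BitSet(0b0110).subset_of(&BitSet(0b0100)));
}
```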

@konstin konstin changed the title Faster range operations Specialize range operations for better performance Mar 6, 2024
@konstin konstin force-pushed the konsti/dev/faster-range-operations branch 3 times, most recently from a3376d9 to c0e306f Compare March 6, 2024 14:34
@konstin konstin marked this pull request as ready for review March 6, 2024 14:49
src/range.rs Outdated Show resolved Hide resolved
src/range.rs Show resolved Hide resolved
@Eh2406 Eh2406 (Member) commented Mar 11, 2024

I had trouble thinking useful thoughts looking at the full diff all at once. So I pulled it into my editor and started a new branch, adding each new piece of functionality one at a time in separate commits, then adding the tests and cleaning up the nits I saw before moving on to the next piece of functionality. The result is https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review. Consider it a "yes and" to the wonderful work in this PR. But I'm not sure, on a practical level, how to take the next step. Do you want to pull those changes into this PR? Or should I start a new one?

Anyway, I'm gonna go run some benchmarks.

@konstin konstin force-pushed the konsti/dev/faster-range-operations branch from d198313 to f06e511 Compare March 11, 2024 20:40
@konstin konstin changed the title Specialize range operations for better performance perf: specialize range operations for better performances Mar 11, 2024
@konstin konstin (Member Author) commented Mar 11, 2024

Thanks! I force-pushed your commits to this branch (there's a backup branch too, just in case) and added a clippy commit on top.

@Eh2406 Eh2406 (Member) commented Mar 11, 2024

| benchmark | Dev | PR | improvement |
|-----------|--------|--------|-------------|
| sudoku    | 44s    | 30s    | 32%         |
| elm       | 274ms  | 277ms  | ---         |
| large     | 11ms   | 10ms   | 9%          |
| slow_135  | 19ms   | 6.3ms  | 66%         |
| zuse      | 311ms  | 297ms  | 4.5%        |
| hobo*     | 1045s  | 851s   | 18%         |

@Eh2406 Eh2406 left a comment
I rewrote the things I didn't like, and so I like what's left.

I'm going to give it a few days in case other people want to respond. Specifically for @mpizenberg about the public API.

If I don't hear back I will merge later this week.

@konstin konstin force-pushed the konsti/dev/faster-range-operations branch from 71c9a51 to 4fc3a50 Compare March 12, 2024 16:11
@mpizenberg mpizenberg left a comment
Thanks for doing this performance work! As @Eh2406 mentioned, I was previously skeptical of this kind of inlining degrading the expressiveness of the computations. Looking at Jacob's benchmarks, I'm still doubting it's the best course of action. On sub-second runs it does not seem to fundamentally change speed, and for very long runs it has a non-negligible impact, but I feel like this order of magnitude (multiple tens of seconds) is something that should be approached at the computational-complexity level (solving the big-O bottlenecks) instead.

However I can also understand that 30% of something that takes a very long time is a non-negligible gain so I'm in favor if you want to merge this.

I've made a few suggestions, feel free to address or ignore them and go through with the merge.

/// |----|
/// |-----|
/// ```
fn end_before_start_with_gap<V: PartialOrd>(end: &Bound<V>, start: &Bound<V>) -> bool {
What do you think of `non_contiguous_bounds`? Or `contiguous_bounds` with the bool flipped?
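For context on the naming question, here is a sketch consistent with the signature and doc comment above (the merged implementation may differ in detail): the helper answers whether `end` stops strictly before `start` begins, so the two segments are neither overlapping nor touching.

```rust
use std::ops::Bound::{self, Excluded, Included, Unbounded};

/// Sketch, not necessarily the merged code: does `end` stop strictly
/// before `start` begins, so the two segments neither overlap nor touch?
fn end_before_start_with_gap<V: PartialOrd>(end: &Bound<V>, start: &Bound<V>) -> bool {
    match (end, start) {
        // An unbounded end reaches +inf, an unbounded start reaches -inf:
        // no gap is possible.
        (Unbounded, _) | (_, Unbounded) => false,
        // `[.., e]` then `[s, ..]` (and the mixed cases): gap iff e < s.
        (Included(e), Included(s)) => e < s,
        (Included(e), Excluded(s)) => e < s,
        (Excluded(e), Included(s)) => e < s,
        // `(.., e)` then `(s, ..)`: even e == s leaves e itself uncovered.
        (Excluded(e), Excluded(s)) => e <= s,
    }
}

fn main() {
    assert!(end_before_start_with_gap(&Included(1), &Included(2)));
    assert!(!end_before_start_with_gap(&Included(2), &Excluded(2)));
    assert!(end_before_start_with_gap(&Excluded(2), &Excluded(2)));
}
```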

@zanieb zanieb (Member) commented Mar 12, 2024

@mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?

@mpizenberg mpizenberg (Member)

> @mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?

Indeed that's a substantial difference!

@mpizenberg mpizenberg (Member)

> @mpizenberg does the "real world" improvement of 12s to 3s on the bio_embeddings

Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?

@konstin konstin (Member Author) commented Mar 13, 2024

> Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?

This work is entirely profiling- and benchmark-driven. I started off with two smaller problems: Range::contains being slower than needed, and excessive clone/drop in prior_cause. After the first two commits, you can see in the flamegraphs in the PR description that it's almost entirely union and intersection calls, slow due to excessive clone/drop (cor... in the flamegraphs). We had previously seen the slow drop/clone and put our version in an Arc, so all you see now are clones/drops on Arcs.

satisfier was calling intersection a lot, but it wasn't interested in the intersection itself; it just wanted to know whether the terms are disjoint, so I special-cased this into is_disjoint. union was calling intersection once and complement three times, so I unrolled union into a version that performs as few clones as possible. I don't have numbers on hand for the individual changes, but each made a noticeable difference.
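The is_disjoint idea can be sketched like this, simplified to closed integer intervals (pubgrub's real segments use `Bound<V>` pairs, and this is my illustration rather than the PR's code): instead of materializing the intersection and checking it for emptiness, walk both sorted segment lists and bail out at the first overlap, cloning nothing.

```rust
/// Illustrative sketch: disjointness of two sorted, non-overlapping
/// interval lists without building their intersection.
fn is_disjoint(a: &[(u32, u32)], b: &[(u32, u32)]) -> bool {
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i].1 < b[j].0 {
            i += 1; // a's segment ends before b's begins: advance a
        } else if b[j].1 < a[i].0 {
            j += 1; // b's segment ends before a's begins: advance b
        } else {
            return false; // segments overlap: not disjoint
        }
    }
    true
}

fn main() {
    assert!(is_disjoint(&[(1, 2), (5, 6)], &[(3, 4)]));
    assert!(!is_disjoint(&[(1, 4)], &[(3, 5)]));
}
```

The early exit is the point: the mathematical definition does all the work of an intersection even when the first pair of segments already overlaps.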

@Eh2406 Eh2406 (Member) commented Mar 13, 2024

> I was previously skeptical of this kind of inlining degrading the expressiveness of the computations.

Yesterday I went back and looked at my code from when I originally proposed this. You are right to reject that code. It was inscrutable. The code Konstin has now is much better.

does the "real world" improvement of 12s to 3s on the bio_embeddings requirements in uv make this feel more worthwhile? Does that suggest that there is a missing niche in the benchmarks for the project (i.e. that Jacob ran)?

There is definitely a gap in the benchmarking infrastructure I have available. I can't seem to get a benchmark that has the performance characteristics of your real-world data. Help on this would definitely be appreciated. The closest I have is slow_135, which is not nearly as slow as your original report but has a similarly shaped profile. The 66% improvement for that benchmark is nicely similar to the 75% you're reporting.

> We had previously seen the slow drop/clone and put our version in an Arc, so all you see now are clone/drop on Arcs.

I should be able to run all benchmarks with version wrapped in Arcs. I wonder if that will give me graphs that looked more like yours.

@Eh2406 Eh2406 (Member) commented Mar 13, 2024

I added two more small optimizations on top for your review https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review

@konstin konstin (Member Author) commented Mar 13, 2024

No difference in performance when cherry-picking the three commits on top:

$ taskset -c 0 hyperfine --warmup 1 --runs 3 "./main-uv pip compile scripts/requirements/bio_embeddings.in"  "./branch-uv pip compile scripts/requirements/bio_embeddings.in" 
Benchmark 1: ./main-uv pip compile scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      2.821 s ±  0.007 s    [User: 2.510 s, System: 0.303 s]
  Range (min … max):    2.813 s …  2.827 s    3 runs
 
Benchmark 2: ./branch-uv pip compile scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      2.826 s ±  0.001 s    [User: 2.527 s, System: 0.291 s]
  Range (min … max):    2.825 s …  2.827 s    3 runs
 
Summary
  ./main-uv pip compile scripts/requirements/bio_embeddings.in ran
    1.00 ± 0.00 times faster than ./branch-uv pip compile scripts/requirements/bio_embeddings.in

@Eh2406 Eh2406 (Member) commented Mar 13, 2024

| benchmark    | Dev       | PR        | change |
|--------------|-----------|-----------|--------|
| sudoku       | 44s       | 22s       | -50%   |
| elm          | 274ms     | 249.07ms  | -9%    |
| large        | 11ms      | 9.8821ms  | -7%    |
| slow_135     | 19ms      | 6.6884ms  | -63%   |
| zuse         | 311ms     | 285.94ms  | -6%    |
| hobo*        | 985.2s    | 803.6s    | -18%   |
| arc_elm      | 381.19ms  | 318.25ms  | -16%   |
| arc_large    | 15.674ms  | 14.214ms  | -9%    |
| arc_slow_135 | 36.262ms  | 14.885ms  | -59%   |
| arc_zuse     | 387.93ms  | 340.78ms  | -12%   |

@Eh2406 Eh2406 added this pull request to the merge queue Mar 13, 2024
Merged via the queue into pubgrub-rs:dev with commit 0d80e9d Mar 13, 2024
5 checks passed
@konstin konstin deleted the konsti/dev/faster-range-operations branch March 13, 2024 18:35
konstin added a commit to astral-sh/uv that referenced this pull request Mar 13, 2024
This update pulls in pubgrub-rs/pubgrub#177, optimizing common range operations in pubgrub. Please refer to this PR for a more extensive description and discussion of the changes.

It optimizes that last remaining pathological case, `bio_embeddings[all]` on python 3.12, which has to try 100k versions.

**before** 12s

![image](https://github.com/pubgrub-rs/pubgrub/assets/6826232/80ffdc49-1159-453d-a3ea-0dd431df6d3b)

**after** 3s

![image](https://github.com/pubgrub-rs/pubgrub/assets/6826232/69508c29-73ab-4593-a588-d8c722242513)

```
$  taskset -c 0 hyperfine --warmup 1 "../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in"  "../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in"
Benchmark 1: ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
  Time (mean ± σ):     12.321 s ±  0.064 s    [User: 12.014 s, System: 0.300 s]
  Range (min … max):   12.224 s … 12.406 s    10 runs

Benchmark 2: ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
  Time (mean ± σ):      3.109 s ±  0.004 s    [User: 2.782 s, System: 0.321 s]
  Range (min … max):    3.103 s …  3.116 s    10 runs

Summary
  ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ran
    3.96 ± 0.02 times faster than ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in
```

It also adds `bio_embeddings[all]` as a requirements test case.
zanieb pushed a commit to astral-sh/uv that referenced this pull request Mar 13, 2024
This update pulls in pubgrub-rs/pubgrub#177, optimizing common range operations in pubgrub. Please refer to this PR for a more extensive description and discussion of the changes.

The changes optimize that last remaining pathological case, `bio_embeddings[all]` on python 3.12, which has to try 100k versions, from 12s to 3s in the cached case. It should also enable smarter prefetching in batches (#170), even though a naive attempt didn't show better network usage.

It also adds `bio_embeddings[all]` as a requirements test case.