perf: specialize range operations for better performances #177
Conversation
```rust
/// Compute the union of two terms.
/// If at least one term is negative, the union is also negative.
pub(crate) fn union(&self, other: &Self) -> Self {
    (self.negate().intersection(&other.negate())).negate()
}
```
Inlining this code may be a micro optimization, saving a clone or two, but definitely seems worth doing.
Force-pushed 8ddd053 to 544ed6a
The new version is slightly slower but passes the tests: it spends 45% in `union`, mostly
```rust
/// Whether all ranges of `self` are contained in `other`
fn subset_of(&self, other: &Self) -> bool {
    self == &self.intersection(other)
}
```
In #59, @mpizenberg was concerned about how much complexity this adds to the API, so I would appreciate getting his opinion now.
Personally, I'm not too worried about adding methods that have default implementations. It doesn't add extra effort for the implementer but allows extra control where performance matters.
Force-pushed a3376d9 to c0e306f
I had trouble thinking useful thoughts looking at the full diff all at once, so I pulled it into my editor and started a new branch, adding each new piece of functionality one at a time in separate commits, then adding the tests and cleaning up the nits I saw before moving on to the next piece of functionality. The result is https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review. Consider it a "yes and" to the wonderful work in this PR. But I'm not sure on a practical level how to take the next step. Do you want to pull those changes into this PR? Or should I start a new one? Anyway, I'm gonna go run some benchmarks.
Force-pushed d198313 to f06e511
Thanks! I force-pushed your commits to this branch (there's a backup branch, just in case) and added a clippy commit on top.
I rewrote the things I didn't like, and so I like what's left.
I'm going to give it a few days in case other people want to respond. Specifically for @mpizenberg about the public API.
If I don't hear back I will merge later this week.
Force-pushed 71c9a51 to 4fc3a50
Thanks for doing this performance work! As @Eh2406 suggested, I was previously skeptical of this kind of inlining degrading the expressiveness of the computations. Judging by Jacob's benchmarks, I'm still doubting it's the best course of action. On sub-second runs it does not seem to fundamentally change speed, and for very long runs it has a non-negligible impact, but I feel like this order of magnitude (multiple tens of seconds) is something that should be approached at the computational-complexity level (solving the big-O bottlenecks) instead.
However, I can also understand that 30% of something that takes a very long time is a non-negligible gain, so I'm in favor if you want to merge this.
I've made a few suggestions; feel free to address or ignore them and go through with the merge.
```rust
/// |----|
///        |-----|
/// ```
fn end_before_start_with_gap<V: PartialOrd>(end: &Bound<V>, start: &Bound<V>) -> bool {
```
What do you think of `non_contiguous_bounds`? Or `contiguous_bounds` with the bool flipped?
@mpizenberg does the "real world" improvement of 12s to 3s on the
Indeed, that's a substantial difference!
Do you know if some of the inlinings produce most of the improvements, or if they all make small improvements that total to the big difference?
This work is entirely profiling- and benchmark-driven. I started off with two smaller problems,
Yesterday I went back and looked at my code from when I originally proposed this. You are right to reject that code. It was inscrutable. The code Konstin has now is much better.
There is definitely a gap in the benchmarking infrastructure I have available. I can't seem to get a benchmark that has the performance characteristics of your real world data. Help on this would definitely be appreciated. The closest I have is
I should be able to run all benchmarks with version wrapped in
I added two more small optimizations on top for your review: https://github.com/pubgrub-rs/pubgrub/tree/jf/177-review
No difference in performance when cherry-picking the three commits on top:
This update pulls in pubgrub-rs/pubgrub#177, optimizing common range operations in pubgrub. Please refer to this PR for a more extensive description and discussion of the changes. The changes optimize that last remaining pathological case, `bio_embeddings[all]` on python 3.12, which has to try 100k versions, from 12s to 3s in the cached case. It should also enable smarter prefetching in batches (#170), even though a naive attempt didn't show better network usage. **before** 12s ![image](https://github.com/pubgrub-rs/pubgrub/assets/6826232/80ffdc49-1159-453d-a3ea-0dd431df6d3b) **after** 3s ![image](https://github.com/pubgrub-rs/pubgrub/assets/6826232/69508c29-73ab-4593-a588-d8c722242513) ``` $ taskset -c 0 hyperfine --warmup 1 "../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in" "../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in" Benchmark 1: ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in Time (mean ± σ): 12.321 s ± 0.064 s [User: 12.014 s, System: 0.300 s] Range (min … max): 12.224 s … 12.406 s 10 runs Benchmark 2: ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in Time (mean ± σ): 3.109 s ± 0.004 s [User: 2.782 s, System: 0.321 s] Range (min … max): 3.103 s … 3.116 s 10 runs Summary ../uv/target/profiling/branch-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ran 3.96 ± 0.02 times faster than ../uv/target/profiling/main-uv pip compile ../uv/scripts/requirements/bio_embeddings.in ``` It also adds `bio_embeddings[all]` as a requirements test case.
In uv, we have a very slow case with bio_embeddings, which includes boto3 and botocore. In the fully cached case, the great majority of time is spent in range operations with large ranges due to the many versions.
I replaced some range operations implemented by their mathematical definitions with optimized versions. This PR is based on profiling my test case, `bio_embeddings[all]` on Python 3.12. To reproduce, check out uv (`git clone https://github.com/astral-sh/uv`) and replace the pubgrub dependency with astral-sh#25, build with `cargo build --profile profiling`, and run with `samply record` (I ran with `--rate 5000`).

**before** 12s

**after** 3s
The atomic/drop operations are clone and drop for our version type, which is an `Arc`. This scenario is slow because we try a lot of versions:

(taskset decreases the variance for a minimal slowdown; we're bound on single-threaded pubgrub in this benchmark and not on multithreaded IO/parsing.)

I'd say this closes #135.
The combination of the lack of `Ord` on `Bound` and `Unbounded` meaning `-inf` or `inf` depending on the (untyped) context makes implementing these manually tedious; I wonder if we could add some abstraction? The first two commits of this PR can be split out into separate PRs. From main over each of the three commits: