
[performance] beating git in git verify-pack βœ… πŸš€ #1

Closed Β· Byron opened this issue Jun 29, 2020 Β· 31 comments

Comments

@Byron (Member) commented Jun 29, 2020

When trying to beat git verify-pack

Attempt 1

I remembered timings on a cold cache indicating something around 5:50min for git to run verify-pack on the linux kernel pack. However, it turns out that once the environment is a little more controlled, git is still considerably faster than us, despite our LRU cache and despite using multiple cores quite efficiently.

hard-to-beat-the-king

Observation

Git uses a streaming pack approach which is optimized to apply deltas in the inverse direction, from base to delta. It works by

  • decompressing all deltas
  • applying all deltas that depend on a base, recursively (and thus avoiding having to decompress deltas multiple times)

We work on a memory-mapped file, which is optimized for random access but won't be very fast for this kind of workload.

How to fix

  • Wait until we have implemented a streaming pack as well and try again, keeping the same algorithmic benefits, possibly paired with more efficient memory handling.
  • Git for some reason limits this operation to 3 threads; since we do benefit from having more threads, we could be faster for that reason alone.
  • The streaming (indexing) phase of reading a pack can be parallelised when the pack is already on disk, and it should be easy to implement if the index data structure itself is thread-safe (but it might not be worth the complexity or memory overhead, let's see).

@Byron Byron changed the title [random] performance comparisons [performance] beating git in git verify-pack Jun 29, 2020
@Byron (Member Author) commented Jul 20, 2020

Screenshot 2020-07-20 at 21 16 59

Measurements with the lookup-based algorithm, constrained to three threads for fairness. Lookup is clearly slower, but thanks to the LRU cache not nearly as bad as it could be, given the enormous amount of additional work it does.

@Byron (Member Author) commented Jul 21, 2020

Screenshot 2020-07-21 at 14 22 34

Note that the statistics in the image above show a delta chain length of at most 1, as we have a 'perfect' cache due to the tree-like delta traversal.

We are now faster than git for the very first time, a win that didn't come easily and certainly isn't perfect: git uses fewer system calls and doesn't seem to rely visibly on memory-mapping packs for random access. Git seems to use far less memory because of that.

However, our implementation scales perfectly and isn't artificially limited to three threads (unless this is somewhat covertly overridden by the pack.threads config variable). Git will not scale that well with more cores, so the artificial restriction to three doesn't seem too far off. Below, the first call to git uses the default of three threads; the second uses four, the number of logical cores in the machine.

Screenshot 2020-07-21 at 14 37 09

Since implementing new variants on how to do this kind of work quickly is fairly straightforward, this might not even be the end of it.

@Byron (Member Author) commented Jul 21, 2020

@joshtriplett If you feel like setting a new speed record, you could try running the latest version against the kernel pack (and publish your results here :) ). This time, and with three threads, the Rust version seems to be equal or a little faster, gaining more and more of an advantage the more cores you add. Git does indeed not scale well and gets slower beyond three cores.

Now there is a -t <num-threads> flag which makes experimentation easy.

(My current plan is to eventually post a link to this issue on reddit)

@Byron (Member Author) commented Jul 21, 2020

Screenshot 2020-07-21 at 16 09 43

A few more tests show that the less-time algorithm doesn't always scale better, at least on my machine and without any rigorous testing methodology. The less-memory algorithm, the one that was there initially, fares even worse though, which makes sense as its cache is far less efficient.

My expectation would nonetheless be that it will scale well with more cores.

@Byron (Member Author) commented Jul 21, 2020

Screenshot 2020-07-21 at 16 16 03

Interestingly, with more threads than are logically available in the machine, there is still some gain to be had - 6 threads on a 4-core machine seem to yield the best results, allowing my machine to crack the 1GB mark!

Git with three threads looks like this:
Screenshot 2020-07-21 at 16 19 15

@Byron Byron pinned this issue Jul 21, 2020
@Byron Byron changed the title [performance] beating git in git verify-pack [performance] beating git in git verify-pack βœ… πŸš€ Jul 21, 2020
@Byron (Member Author) commented Jul 21, 2020

According to this…

Screenshot 2020-07-22 at 07 24 42

…git-oxide is now 27.6/19.5 = 1.41x faster than git in this particular task with three threads, whereas it's 27.6/16.0 = 1.72x faster with 4 threads. And it could scale with more cores.

I wonder what the difference in algorithm is that causes git to use less memory (or not rely on memory maps) but fail to scale.

@joshtriplett (Contributor) commented

Just tested. I made sure the packfile and the index were in cache first; cached reads of them off disk take about 0.23s and 0.04s, respectively.

Is giop verify-pack directly comparable to git index-pack --verify? Are you doing the equivalent of git's Indexing objects phase? To what extent does git rely on the existing index, and to what extent does git-oxide?

git's Indexing objects phase always seems to be single-threaded. Its Resolving deltas phase depends on pack.threads, but it definitely has scalability limits; when I ran it with a full 96 threads it went slower, with htop showing relatively little CPU activity. git with fewer threads did a better job of saturating the CPU with each thread. Results with git:

~/linux/.git/objects/pack$ time git index-pack --verify -v --show-resolving-progress pack-cdbb7266619026cecacf615b2453c93af4c64a70.pack
Indexing objects: 100% (7492721/7492721), done.
Resolving deltas: 100% (6307033/6307033), done.

real	2m41.735s
user	6m4.195s
sys	0m13.454s
~/linux/.git/objects/pack$ time git -c pack.threads=96 index-pack --verify -v --show-resolving-progress pack-cdbb7266619026cecacf615b2453c93af4c64a70.pack
Indexing objects: 100% (7492721/7492721), done.
Resolving deltas: 100% (6307033/6307033), done.

real	3m40.570s
user	10m18.655s
sys	1m14.089s
~/linux/.git/objects/pack$ time git -c pack.threads=16 index-pack --verify -v --show-resolving-progress pack-cdbb7266619026cecacf615b2453c93af4c64a70.pack
Indexing objects: 100% (7492721/7492721), done.
Resolving deltas: 100% (6307033/6307033), done.

real	1m46.341s
user	6m35.325s
sys	0m27.599s
~/linux/.git/objects/pack$ time git -c pack.threads=8 index-pack --verify -v --show-resolving-progress pack-cdbb7266619026cecacf615b2453c93af4c64a70.pack
Indexing objects: 100% (7492721/7492721), done.
Resolving deltas: 100% (6307033/6307033), done.

real	1m51.208s
user	6m23.465s
sys	0m24.761s

Note on git-oxide: its help says it can take either a .pack file or a .idx file, but when run on a .pack it doesn't seem to do anything visible except read the pack. To get anything interesting I had to run it on the index.

I also found that with 96 threads, the less-memory algorithm was much faster; it uses more user time, but finishes in half the real time, and pulls off more than 10GB/s.

Results with git-oxide:

$ time ~/git-oxide/target/release/giop verify-pack --verbose pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:47:53 Sha1 of index finished in 0.35s at 592.5 MB/s
 23:47:55  Sha1 of pack finished in 2.34s at 567.1 MB/s
 23:47:56 collecting sorted index in 1.38s (5445749.5 entries/s)                
 23:47:58                indexing tree from 7492721 entries in 2.25s (3329364.5 entries /s)
 23:48:18                Checking finished                                      
 23:48:18                Checking Verified 7492721 objects in 19.87s (377108 objects/s, ~4.7 GB/s)

real	0m26.000s
user	5m22.839s
sys	2m27.790s
~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop verify-pack --verbose --algorithm less-memory pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:48:48 Sha1 of index finished in 0.35s at 592.7 MB/s
 23:48:50  Sha1 of pack finished in 2.34s at 568.1 MB/s
 23:48:52 collecting sorted index in 1.37s (5476950 entries/s)                  
 23:49:01                Checking finished                                      
 23:49:01                Checking Verified 7492721 objects in 8.86s (845661 objects/s, ~10.6 GB/s)

real	0m12.654s
user	7m30.161s
sys	0m23.244s

You've definitely got a scaling problem in the default less-time algorithm, because with 48 threads instead of 96, it finishes faster:

~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 48 verify-pack --verbose pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:51:50 Sha1 of index finished in 0.35s at 591.6 MB/s
 23:51:52  Sha1 of pack finished in 2.33s at 569.2 MB/s
 23:51:54 collecting sorted index in 1.36s (5495393 entries/s)                  
 23:51:56                indexing tree from 7492721 entries in 2.25s (3325254.8 entries /s)
 23:52:12                Checking finished                                      
 23:52:12                Checking Verified 7492721 objects in 15.63s (479405 objects/s, ~6.0 GB/s)

real	0m21.718s
user	5m3.554s
sys	1m4.436s

By contrast, the less-memory algorithm ran just barely slower with 48 threads; that suggests a scaling problem as well, but not quite as bad of one:

~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 48 verify-pack --verbose --algorithm less-memory pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:52:22 Sha1 of index finished in 0.35s at 591.9 MB/s
 23:52:24  Sha1 of pack finished in 2.33s at 569.7 MB/s
 23:52:25 collecting sorted index in 1.37s (5477936.5 entries/s)                
 23:52:36                Checking finished                                      
 23:52:36                Checking Verified 7492721 objects in 11.15s (671699 objects/s, ~8.4 GB/s)

real	0m14.904s
user	6m19.774s
sys	0m6.795s

Results at 24 threads:

~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 24 verify-pack --verbose pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:54:30 Sha1 of index finished in 0.35s at 592.9 MB/s
 23:54:32  Sha1 of pack finished in 2.32s at 573.0 MB/s
 23:54:34 collecting sorted index in 1.37s (5478507.5 entries/s)                
 23:54:36                indexing tree from 7492721 entries in 2.24s (3344277.8 entries /s)
 23:54:53                Checking finished                                      
 23:54:53                Checking Verified 7492721 objects in 16.98s (441148 objects/s, ~5.5 GB/s)

real	0m23.030s
user	4m11.169s
sys	0m26.875s
~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 24 verify-pack --verbose --algorithm less-memory pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:55:08 Sha1 of index finished in 0.35s at 593.0 MB/s
 23:55:10  Sha1 of pack finished in 2.31s at 574.4 MB/s
 23:55:11 collecting sorted index in 1.37s (5477653 entries/s)                  
 23:55:23                Checking finished                                      
 23:55:23                Checking Verified 7492721 objects in 12.01s (624009 objects/s, ~7.8 GB/s)

real	0m15.725s
user	4m26.295s
sys	0m3.535s

12 threads:

~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 12 verify-pack --verbose pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:56:21 Sha1 of index finished in 0.35s at 592.0 MB/s
 23:56:23  Sha1 of pack finished in 2.32s at 571.7 MB/s
 23:56:24 collecting sorted index in 1.37s (5449349 entries/s)                  
 23:56:26                indexing tree from 7492721 entries in 2.25s (3324838.5 entries /s)
 23:56:50                Checking finished                                      
 23:56:50                Checking Verified 7492721 objects in 23.71s (316049 objects/s, ~4.0 GB/s)

real	0m29.783s
user	4m3.463s
sys	0m20.722s
~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 12 verify-pack --verbose --algorithm less-memory pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:56:58 Sha1 of index finished in 0.35s at 591.7 MB/s
 23:57:00  Sha1 of pack finished in 2.32s at 571.7 MB/s
 23:57:01 collecting sorted index in 1.38s (5446238 entries/s)                  
 23:57:23                Checking finished                                      
 23:57:23                Checking Verified 7492721 objects in 21.85s (342929 objects/s, ~4.3 GB/s)

real	0m25.580s
user	4m23.070s
sys	0m2.476s

And 6 threads:

~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 6 verify-pack --verbose pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:57:49 Sha1 of index finished in 0.35s at 592.1 MB/s
 23:57:51  Sha1 of pack finished in 2.33s at 569.9 MB/s
 23:57:53 collecting sorted index in 1.38s (5434567.5 entries/s)                
 23:57:55                indexing tree from 7492721 entries in 2.25s (3328975 entries /s)
 23:58:36                Checking finished                                      
 23:58:36                Checking Verified 7492721 objects in 41.31s (181356 objects/s, ~2.3 GB/s)

real	0m47.386s
user	4m2.047s
sys	0m16.788s
~/linux/.git/objects/pack$ time ~/git-oxide/target/release/giop -t 6 verify-pack --verbose --algorithm less-memory pack-cdbb7266619026cecacf615b2453c93af4c64a70.idx
 23:58:42 Sha1 of index finished in 0.35s at 592.5 MB/s
 23:58:44  Sha1 of pack finished in 2.31s at 574.0 MB/s
 23:58:46 collecting sorted index in 1.37s (5449867.5 entries/s)                
 23:59:28                Checking finished                                      
 23:59:28                Checking Verified 7492721 objects in 42.72s (175372 objects/s, ~2.2 GB/s)

real	0m46.437s
user	4m18.107s
sys	0m2.161s

Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking?

Also, the --progress display gave some interesting information. With the less-memory algorithm, I noticed that the last couple of seconds were spent on a few threads finishing up their chunk of work, with all other threads done; if you did some kind of work-stealing, could other threads help the last ones finish up? With the less-time algorithm, I didn't understand exactly what the progress display was showing, but it still had several seconds of stragglers, and in addition, I saw something like this for several seconds at the beginning:

β”Œgitoxide──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 'q' or CTRL+c to quit   97 running +   0 blocked +   1 groups = 98 ─┐
β”œ verify-pack    β”œ verify-pack ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── β”‚
β”‚ β”” Checking     β”‚ β”” 243333 / 7492721 objects                                                                                                                                                                                                                                                                                 β”‚
β”‚   β”œ thread 1   β”‚   β”œ 0 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 0   β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 3   β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 4   β”‚   β”œ 2 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 5   β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 2   β”‚   β”œ 0 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 6   β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 7   β”‚   β”œ 3 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 9   β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 10  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 8   β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 11  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 12  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 13  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 14  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 15  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 16  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 17  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 18  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 19  β”‚   β”œ 0 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 20  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 21  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 22  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 23  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 24  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 25  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 26  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 27  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 28  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 29  β”‚   β”œ 0 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 30  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 31  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 32  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 33  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 34  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 35  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 36  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 37  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 38  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 39  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 40  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 41  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 42  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 43  β”‚   β”œ 0 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 44  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 45  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 46  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 47  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 48  β”‚   β”œ 2 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 49  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 50  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 51  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 52  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 53  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 54  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 55  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 56  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 57  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 58  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 59  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 60  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 61  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 62  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 63  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 64  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 65  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 66  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 67  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 68  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 69  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚   β”œ thread 70  β”‚   β”œ 1 entries                                                                                                                                                                                                                                                                                              β”‚
β”‚                β”‚ …0 skipped and 24 more                                                                                                                                                                                                                                                             β‡Š = d|↓ = j|β‡ˆ = u|↑ = k β”‚
β”‚Messages─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── β¨― = `| β–’ = ~ β”‚
β”‚0:04:19β”‚ infoβ”‚indexing               β†’ tree from 7492721 entries in 2.26s (3317468 entries /s)                                                                                                                                                                                                                               β”‚
│0:04:17│ info│collecting sorted index→ in 1.36s (5489628.5 entries/s)                                                                                                                                                                                                                                                        │
β”‚0:04:15β”‚ doneβ”‚Sha1 of pack           β†’ finished in 2.31s at 574.8 MB/s                                                                                                                                                                                                                                                       β”‚
β”‚0:04:13β”‚ doneβ”‚Sha1 of index          β†’ finished in 0.35s at 592.6 MB/s                                                                                                                                                                                                                                                       β”‚
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

A bunch of threads with 0-2 entries looks like a problematic bottleneck.

Hope that helps.

@Byron (Member Author) commented Jul 22, 2020

Thanks so much for taking the time! Very informative, and very valuable information…! Right now I wouldn't know if this can get any faster, or if at some point the memory subsystem of the machine is simply overloaded when too many threads compete. It accesses the memory-mapped pack, decompresses the data and computes SHA1 and CRC32 on all entries, and without debugging on such a machine I doubt any further optimizations can be done reliably.
However, personally I am already quite happy with the result, and verifying the linux kernel pack in 15s is… a record :D?

My answers

Are you doing the equivalent of git's Indexing objects phase?

At least that's what I think, more or less. The idea is to build a tree of deltas to allow traversing the chain from base to delta, and not from delta to base. This allows for a cache implementation that is perfect, i.e. a delta can be applied to its base from cache all the time.
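To make the base-to-delta direction concrete, here is a deliberately simplified sketch - not gitoxide's actual code, and every name in it (Entry, decompress_at, apply_delta) is made up for illustration. The point is that the decompressed base is always at hand when its deltas are processed, which is what makes the cache 'perfect':

use std::collections::HashMap;

// Hypothetical, simplified view of a pack entry as read from the index:
// a delta carries the pack offset of the base it applies to.
struct Entry {
    offset: u64,
    base_offset: Option<u64>, // None => base object, Some(_) => delta
}

// Stand-ins for real pack decoding and delta application.
fn decompress_at(_offset: u64) -> Vec<u8> {
    vec![]
}
fn apply_delta(base: &[u8], delta: &[u8]) -> Vec<u8> {
    let mut out = base.to_vec();
    out.extend_from_slice(delta);
    out
}

// Walk from a base down to every delta depending on it. `base_data` is
// always available in memory, so no delta ever has to re-decompress its base.
fn traverse(children: &HashMap<u64, Vec<u64>>, offset: u64, base_data: &[u8]) {
    // ...compute SHA1 and CRC32 of `base_data` here...
    if let Some(deltas) = children.get(&offset) {
        for &delta_offset in deltas {
            let delta = decompress_at(delta_offset);
            let object = apply_delta(base_data, &delta);
            traverse(children, delta_offset, &object);
        }
    }
}

fn main() {
    // Two bases at offsets 12 and 100, and one delta (at 42) on top of the first.
    let entries = [
        Entry { offset: 12, base_offset: None },
        Entry { offset: 42, base_offset: Some(12) },
        Entry { offset: 100, base_offset: None },
    ];

    // Build the delta 'tree': for every base offset, the deltas based on it.
    let mut children: HashMap<u64, Vec<u64>> = HashMap::new();
    let mut roots = Vec::new();
    for e in &entries {
        match e.base_offset {
            None => roots.push(e.offset),
            Some(base) => children.entry(base).or_default().push(e.offset),
        }
    }

    for root in roots {
        traverse(&children, root, &decompress_at(root));
    }
}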

To what extent does git rely on the existing index, and to what extent does git-oxide?

Without having studied git's code in detail, it looks like it decompresses objects while indexing them, storing some of them in memory. To me it seems it doesn't need an index for this to work.
git-oxide also doesn't need an index, but I use it to speed up the indexing operation itself. Furthermore, the index is used for obtaining the desired SHA1s and CRC32 to compare each pack object to.
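For reference, the version-2 pack index is a simple, documented layout, so pulling the per-object SHA1s, CRC32s and offsets out of it is cheap. A rough sketch of reading it (not the gitoxide implementation; it ignores the large-offset table used by packs over 2GB, the trailing checksums, and all error handling):

use std::convert::TryInto;

// Version-2 pack index (.idx): after the 8-byte header and the 256-entry
// fanout table come N sorted 20-byte object ids, then N big-endian CRC32s,
// then N big-endian 4-byte pack offsets.
fn read_index_v2(data: &[u8]) -> Vec<([u8; 20], u32, u32)> {
    assert_eq!(&data[..4], &b"\xfftOc"[..], "not a v2+ pack index");
    assert_eq!(u32::from_be_bytes(data[4..8].try_into().unwrap()), 2);

    let fanout_end = 8 + 256 * 4;
    // The last fanout entry is the total object count.
    let count =
        u32::from_be_bytes(data[fanout_end - 4..fanout_end].try_into().unwrap()) as usize;

    let oids = &data[fanout_end..][..count * 20];
    let crcs = &data[fanout_end + count * 20..][..count * 4];
    let offsets = &data[fanout_end + count * 24..][..count * 4];

    (0..count)
        .map(|i| {
            let oid: [u8; 20] = oids[i * 20..][..20].try_into().unwrap();
            let crc = u32::from_be_bytes(crcs[i * 4..][..4].try_into().unwrap());
            let ofs = u32::from_be_bytes(offsets[i * 4..][..4].try_into().unwrap());
            (oid, crc, ofs)
        })
        .collect()
}

fn main() {
    let path = std::env::args().nth(1).expect("path to a .idx file");
    let data = std::fs::read(path).expect("readable index file");
    println!("{} objects in index", read_index_v2(&data).len());
}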

Note on git-oxide: its help says it can take either a .pack file or a .idx file, but when run on a .pack it doesn't seem to do anything visible except read the pack. To get anything interesting I had to run it on the index.

Yes, related to the above. A valid pack always has an index and I think it's beneficial to make use of it. The reason git doesn't do it to that extent might be that git index-pack is primarily used (and maybe I am totally wrong :D) for receiving packs, which don't come with an index as far as I know, since one can be generated from a pack, which is protected by a trailing SHA1 over all its content.

With the less-memory algorithm, I noticed that the last couple of seconds were spent on a few threads finishing up their chunk of work, with all other threads done; if you did some kind of work-stealing, could other threads help the last ones finish up?

Indeed, the parallelisation is done using a super simple fan-in-fan-out map-reduce using nothing more than std threads and channels. In a way, this already is work stealing, but it can only steal entire chunks of work. Right now the chunk size is rather big, and the smaller it is, the more one pays for the channel synchronisation. Probably some tuning is possible there. Maybe it's worth trying chunks as small as 10 objects - for all I know, handing out one object at a time (without chunking) visibly slows it down due to sync overhead.
As a requirement, it must be trivial to compile without parallelism at all, and to decide at runtime whether to use threads or not. Right now, threads are only used for packs above a certain (magic, really) size.
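A minimal sketch of that fan-in-fan-out shape (std threads plus a mutex-guarded std channel standing in for the crossbeam-channel the crate actually uses; all names are illustrative, not the real API):

use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Fan out chunks of work to worker threads over a shared channel, fan the
// per-chunk results back in over a second channel, and reduce on the calling
// thread. Workers 'steal' whole chunks simply by being the next to receive.
fn in_parallel<I, O>(chunks: Vec<Vec<I>>, num_threads: usize, consume: fn(Vec<I>) -> O) -> Vec<O>
where
    I: Send + 'static,
    O: Send + 'static,
{
    let (chunk_tx, chunk_rx) = mpsc::channel::<Vec<I>>();
    let chunk_rx = Arc::new(Mutex::new(chunk_rx));
    let (result_tx, result_rx) = mpsc::channel::<O>();

    let mut workers = Vec::new();
    for _ in 0..num_threads {
        let rx = Arc::clone(&chunk_rx);
        let tx = result_tx.clone();
        workers.push(thread::spawn(move || {
            // Bigger chunks mean less channel synchronisation,
            // smaller chunks mean less starvation towards the end.
            loop {
                let chunk = match rx.lock().unwrap().recv() {
                    Ok(chunk) => chunk,
                    Err(_) => break, // input channel closed: no more work
                };
                tx.send(consume(chunk)).unwrap();
            }
        }));
    }
    drop(result_tx); // keep only the workers' senders alive

    for chunk in chunks {
        chunk_tx.send(chunk).unwrap();
    }
    drop(chunk_tx); // closing the input lets the workers terminate

    let results: Vec<O> = result_rx.iter().collect();
    for worker in workers {
        worker.join().unwrap();
    }
    results
}

fn main() {
    // Toy usage: sum 12 numbers in chunks of 3 on 4 threads.
    let chunks: Vec<Vec<u64>> = (0..12u64).collect::<Vec<_>>().chunks(3).map(|c| c.to_vec()).collect();
    let partial_sums = in_parallel(chunks, 4, |chunk: Vec<u64>| chunk.iter().sum::<u64>());
    println!("total = {}", partial_sums.iter().sum::<u64>());
}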

@Byron (Member Author) commented Jul 22, 2020

Regarding the less-memory algorithm actually being faster than less-time with many threads: that's an interesting finding, and probably has to do with chunk sizes. These are potentially small with the less-time version as they depend on the delta tree itself. With the less-memory version, the pack is chunked up quite roughly into 1000 chunks or so, giving each thread a lot of work without having to contend for it.

Maybe using a different channel implementation can speed this up - right now crossbeam-channel is used, and an alternative could be flume.

Byron added a commit that referenced this issue Jul 23, 2020
@Byron (Member Author) commented Jul 23, 2020

I have played around with the relevant variables a little and came up with a better (read: higher) chunk size for the 'less-time' algorithm, while opening the code up a little for smarter heuristics as time passes. Ideally, it finds chunk sizes that keep threads busy for no more than around a second - that way starvation won't really be noticeable, while keeping the synchronization cost low.
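As an illustration only (the numbers and names below are invented, not what the commit actually does), such a heuristic could look like:

// Pick a chunk size that keeps a thread busy for roughly one second, based
// on a rough guess of how many objects a single thread processes per second,
// while still producing enough chunks to feed all threads several times over.
fn chunk_size(num_objects: usize, num_threads: usize, objects_per_sec_per_thread: usize) -> usize {
    let desired = objects_per_sec_per_thread; // ~1 second of work per chunk
    // Never create fewer chunks than a small multiple of the thread count,
    // otherwise threads starve towards the end of the run.
    let max_by_fairness = (num_objects / (num_threads * 4)).max(1);
    desired.min(max_by_fairness).max(1)
}

fn main() {
    // Hypothetical numbers for the kernel pack: ~7.5M objects, 4 threads,
    // and a guessed ~500k objects/s per thread.
    println!("{}", chunk_size(7_500_000, 4, 500_000));
}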

@Byron (Member Author) commented Jul 23, 2020

Closing this issue as git-oxide clearly beats git in this discipline :).

@Byron Byron closed this as completed Jul 23, 2020
@Byron (Member Author) commented Jul 23, 2020

@joshtriplett

Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking?

It could be done, certainly if optimizing for speed is the goal. However, I thought it's not necessary to start checking if the SHA1 of the pack already turns out to be wrong.

Maybe there could be a flag to simply skip the Sha1 checks and proceed with the per-object checks.

Something like that could certainly be added though if it should get a little faster overall.
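For what it's worth, the threading side of that is simple; here is a sketch with std threads only, hypothetical file names, and a byte-sum standing in for the real SHA1 computation - with the trade-off that a bad pack checksum no longer aborts the per-object checks early:

use std::thread;

// Stand-in for the real SHA1 computation: the real code would stream the
// file through a SHA1 hasher; summing bytes keeps the sketch dependency-free.
fn hash_file(path: &str) -> u64 {
    std::fs::read(path)
        .map(|data| data.iter().map(|&b| b as u64).sum())
        .unwrap_or(0)
}

fn main() {
    // Hypothetical file names.
    let pack = "pack.pack";
    let index = "pack.idx";

    // Hash pack and index on their own threads while the per-object checking
    // runs on the main thread; the verdict is only combined at the very end.
    let pack_hash = thread::spawn(move || hash_file(pack));
    let index_hash = thread::spawn(move || hash_file(index));

    // ...traverse the pack and check every object here...
    let objects_ok = true;

    println!(
        "pack hash {}, index hash {}, objects ok: {}",
        pack_hash.join().unwrap(),
        index_hash.join().unwrap(),
        objects_ok
    );
}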

@joshtriplett (Contributor) commented Jul 25, 2020 via email

@Byron (Member Author) commented Aug 6, 2020

With further optimizations, I now get down to 90s for verifying the kernel pack, compared to three minutes for git. The same data structure is used for receiving/indexing packs too, which allows gitoxide to clone faster than git (see #5 ).

Screenshot 2020-08-06 at 10 21 45

The above time was achieved with four threads, as opposed to the three that git uses by default.
Running once more with 3 threads is just a little slower, but still nearly twice as fast as git.

Screenshot 2020-08-06 at 10 27 48

For comparison, here is the same task with git.

Screenshot 2020-08-06 at 10 22 02

I think one major boost gitoxide allows itself is the use of the index file - thanks to that, building the in-memory index is done in less than 10 seconds, whereas otherwise it would take about 80 seconds (in fact, git is faster at that particular task).

My reasoning for doing it like this is that pack-verify is certainly used as part of a complete repository check, so having an index can be assumed. If an index is not present, one can be built (even from partial packs) using gixp index-for-pack.

@joshtriplett (Contributor) commented

@Byron Nice! I would love to see a faster clone or fetch, as well as push.

@Byron (Member Author) commented Aug 12, 2020

@joshtriplett The current main (6725104) now does everything in parallel - even the initial file hashes run while the traversal is already happening. On my machine it's not actually faster, as both tasks compete for IO, so I am considering reverting it in favour of simpler code.

However, I wouldn't want to do that before getting some more data from a bigger machine - the new indexing algorithm should shatter the previous results as it is entirely lock-free, yielding a 2.7x speedup on my machine alone for verifying the kernel pack.

It should be particularly rewarding to try out creating an index for a pack as well (see issue #5), which is 1.7x faster than git on my machine. A whopping 60s is spent on indexing, and there gitoxide is just as fast as git with not much room for improvement - it's inherently 'streaming' and limited by how fast decompression works. However, once the resolving phase begins gitoxide can take over; this is where it saves huge amounts of time and where it can soon help making clones faster.

Everything related to push is a while ahead though; the next work block is 'clone', and possibly after that building packs and pushing can be tackled as well.

Bonus: here is how long this takes in single-threaded mode (which can be produced with make release-small). Verifying the linux kernel pack takes 10m55s there, which shows how nicely we scale already.

Screenshot 2020-08-12 at 10 39 21

@Byron (Member Author) commented Aug 12, 2020

And now that I ran the ultra-parallel version again on a hot cache and without doing anything else, my previous record was bested once again by what seems like 10s!

Screenshot 2020-08-12 at 11 48 03

@joshtriplett (Contributor) commented

You should consider making a static copy of that pack and index available somewhere, to allow reproducible tests rather than testing with the latest Linux pack each time.

@joshtriplett (Contributor) commented Aug 12, 2020

This is incredibly impressive scalability now.

$ time ~/gitoxide-inst/bin/gixp pack-verify pack-22ea4a9baa332d5aee13a63a3c86e7bd76eee97e.idx 

real	0m10.606s
user	6m17.265s
sys	0m16.748s

You may want to rate-limit the --verbose output, because it now adds a non-trivial fraction of the total time:

$ time ~/gitoxide-inst/bin/gixp --verbose pack-verify pack-22ea4a9baa332d5aee13a63a3c86e7bd76eee97e.idx 
 4:11:15 SHA1 of index done 212.8MB in 0.39s (549.5MB/s)
 4:11:16 collecting sorted index done 7.6M entries in 1.47s (5.2M entries/s)    
 4:11:17            SHA1 of pack done 1.4GB in 2.41s (564.2MB/s)                
 4:11:18                indexing done 7.6M objects in 2.29s (3.3M objects/s)    
 4:11:25               Resolving done 7.6M objects in 7.13s (1.1M objects/s)    
 4:11:25                Decoding done 95.6GB in 7.13s (13.4GB/s)                

real	0m11.212s
user	6m19.052s
sys	0m14.940s

Watching htop, it looks like across those 10s, the middle 4-5 seconds have all 96 CPUs completely saturated (nicely done), while the first and last few seconds have just a few threads running. The --progress display shows that for the last couple of seconds, just a few resolver threads are finishing up their batches.

With the less-memory algorithm, it breaks the 10s barrier now, albeit by using much more CPU time in the process:

$ time ~/gitoxide-inst/bin/gixp pack-verify --algorithm less-memory  pack-22ea4a9baa332d5aee13a63a3c86e7bd76eee97e.idx 

real	0m9.244s
user	10m13.571s
sys	0m12.382s
$ time ~/gitoxide-inst/bin/gixp --verbose pack-verify --algorithm less-memory  pack-22ea4a9baa332d5aee13a63a3c86e7bd76eee97e.idx 
 4:18:25 SHA1 of index done 212.8MB in 0.39s (543.7MB/s)
 4:18:26 collecting sorted index done 7.6M entries in 1.48s (5.1M entries/s)    
 4:18:28            SHA1 of pack done 1.4GB in 3.29s (414.2MB/s)                
 4:18:34              Traversing of 7600359 objects done in 7.89s (962996 objects/s, ~12.1 GB/s)

real	0m9.465s
user	10m13.364s
sys	0m12.523s

Pack indexing:

$ time ~/gitoxide-inst/bin/gixp index-from-pack -p pack-22ea4a9baa332d5aee13a63a3c86e7bd76eee97e.pack 
index: 8b434ea70e1bebaa529500824274c0154be903ac
pack: 22ea4a9baa332d5aee13a63a3c86e7bd76eee97e

real	0m46.196s
user	7m0.051s
sys	0m6.478s
$ time ~/gitoxide-inst/bin/gixp --verbose index-from-pack -p pack-22ea4a9baa332d5aee13a63a3c86e7bd76eee97e.pack 
 4:23:37 read pack done 1.4GB in 36.58s (37.2MB/s)                              
 4:23:37  indexing done 7.6M objects in 36.58s (207.8k objects/s)               
 4:23:37 decompressing done 2.7GB in 36.58s (74.3MB/s)                          
 4:23:44     Resolving done 7.6M objects in 6.72s (1.1M objects/s)              
 4:23:44      Decoding done 95.6GB in 6.72s (14.2GB/s)                          
 4:23:47 writing index file done 212.8MB in 0.57s (376.5MB/s)                   
 4:23:47  create index file done 7.6M objects in 46.19s (164.5k objects/s)      
index: 8b434ea70e1bebaa529500824274c0154be903ac
pack: 22ea4a9baa332d5aee13a63a3c86e7bd76eee97e

real	0m46.311s
user	7m3.584s
sys	0m5.409s

You're right that it spends most of its time limited by decompression, then spends a few seconds using many CPUs, then finishes.

Regarding decompression performance, try replacing miniz_oxide with a better zlib decoder. Build with libz-sys, and then try substituting zlib-ng built with --zlib-compat. (I'm working on making that easier.) That should substantially improve decompression.

Also, as far as I can tell, don't pack files have many separately deflate-compressed streams, rather than one giant deflate-compressed stream? Couldn't you deflate multiple pieces in parallel?

@joshtriplett (Contributor) commented Aug 12, 2020

Quick informal benchmarks of DEFLATE implementations

For test data, I tarred up the contents of /usr/bin on my system, which produced a 462356480 byte tar file. I compressed that with gzip, and got a 160945647 byte tar.gz file. I used the following quick-and-dirty test code, based on the flate2 crate. (Note that it always decompresses the same data, rather than decompressing the data it compressed, to reduce variability. Also, I only tested the default compression level here, for a quick test.)

use std::io::Read;
use std::time::Instant;

use flate2::Compression;
use flate2::bufread::{GzDecoder, GzEncoder};

fn main() -> anyhow::Result<()> {
    let mut dest = Vec::with_capacity(1024*1024*1024);

    let uncompressed_data = std::fs::read("/tmp/usr-bin.tar")?;
    let time = Instant::now();
    let mut gz = GzEncoder::new(uncompressed_data.as_slice(), Compression::default());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Compressed {} bytes to {} bytes in {:?}", uncompressed_data.len(), dest.len(), elapsed);

    dest.clear();

    let compressed_data = std::fs::read("/tmp/usr-bin.tar.gz")?;
    let time = Instant::now();
    let mut gz = GzDecoder::new(compressed_data.as_slice());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Decompressed {} bytes to {} bytes in {:?}", compressed_data.len(), dest.len(), elapsed);

    Ok(())
}

For each test, I compiled flate2 with the appropriate feature flag.

miniz_oxide (default pure-Rust implementation)

Compressed 462356480 bytes to 161332179 bytes in 25.073941226s
Decompressed 160945647 bytes to 462356480 bytes in 1.802195335s

miniz-sys (miniz.c)

Compressed 462356480 bytes to 161332179 bytes in 24.158530166s
Decompressed 160945647 bytes to 462356480 bytes in 2.008726615s

zlib (system zlib on Debian, package version 1:1.2.11.dfsg-2)

Compressed 462356480 bytes to 160771801 bytes in 21.822209389s
Decompressed 160945647 bytes to 462356480 bytes in 1.809420268s

cloudflare_zlib (Note: major portability issues, only works on modern systems)

Compressed 462356480 bytes to 164343457 bytes in 11.124091826s
Decompressed 160945647 bytes to 462356480 bytes in 1.782134224s

zlib-ng (uses CPU feature detection; tested by compiling for zlib and setting LD_LIBRARY_PATH)

Compressed 462356480 bytes to 165659015 bytes in 9.892405882s
Decompressed 160945647 bytes to 462356480 bytes in 1.520923279s

Summary

Use zlib-ng. It's portable, and it's faster at both compression and decompression. I plan to make it easier to use from Rust. For now, I'd suggest using zlib, which makes it easy to test other implementations via LD_LIBRARY_PATH.
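Concretely, the backend swap is a Cargo feature switch on flate2, plus, for the zlib-ng test, pointing the dynamic linker at the compat build (paths and the exact feature names are illustrative - check the flate2 documentation for the features available in the version you use):

# Cargo.toml - use the libz-sys backed 'zlib' backend instead of the
# default pure-Rust miniz_oxide backend
[dependencies]
flate2 = { version = "1", default-features = false, features = ["zlib"] }

# Test a zlib-ng built with --zlib-compat without recompiling, by letting
# the dynamic linker pick it up instead of the system zlib:
LD_LIBRARY_PATH=/path/to/zlib-ng-compat/lib cargo run --release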

Byron added a commit that referenced this issue Aug 12, 2020
@Byron (Member Author) commented Aug 12, 2020

Whoop whoop, this feels like being knighted, and all the work totally paid off. Iteration after iteration after iteration, and it really was just Rust pushing me into doing "what's right", which also happens to be fast as a freebie.

You should consider making a static copy of that pack and index available somewhere, to allow reproducible tests rather than testing with the latest Linux pack each time.

I will try, is there a platform you would recommend for that? Thus far I was unable to obtain the tiny kernel pack, 1.4GB only, as the clones would always abort midway given my incredible 6kb/s download speed. Alternatively, if you have an easy way to share that file, I would be happy to give downloading that a try.

You may want to rate-limit the --verbose output, because it now adds a non-trivial fraction of the total time:

To me it looks like it's still a tiny fraction, but I might be misreading what is shown here. My disposition for this one is to leave it, as I am happy with the small overhead, and turning it off works so well. It's all monomorphic, and the branch predictor certainly has an easy time seeing that this thing is always off or always on :D.
It's kind of incredible that 96 threads are pounding that poor dashmap and it's all working smoothly. prodash was created to support this kind of load without threads having to limit themselves, so maybe there is something prodash can do to make progress lookups even faster. The lookup time is probably what costs the most here: in my single-threaded tests it managed 50 million progress updates per second, while all threads together make just 7.5 million in ~7s.

Watching htop, it looks like across those 10s, the middle 4-5 seconds have all 96 CPUs completely saturated (nicely done), while the first and last few seconds have just a few threads running. [emphasis mine]

This all comes down to the chunk size, which right now is static and computed using a fantastic algorithm that probably should be done by someone who knows math. As the total number of chunks can be recomputed, one could know that for the last N chunks (N = num cores) the work has to be split smaller to feed all threads. If that is implemented, I could imagine this ~6.2 number going down to ~5.2 even. Definitely something to look into. Maybe as a first simple option, there could be a 'chunk-size' flag to override the batch size and tune it based on experience. Maybe that way it's already possible to squeeze out another half a second or so.
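
For illustration only, here is a toy sketch of that trade-off (not gitoxide's actual chunking code; `chunk_size`, the 250 chunks-per-thread and the 1000 cap are made-up stand-ins):

fn chunk_size(num_items: usize, num_threads: usize, chunks_per_thread: usize) -> usize {
    // Aim for many small chunks per thread so uneven chunks cannot starve the others,
    // while clamping so every chunk has at least one item and none grows beyond 1000.
    (num_items / (num_threads * chunks_per_thread)).clamp(1, 1000)
}

fn main() {
    let base_objects = 1_200_000; // roughly the number of base objects in the kernel pack
    let threads = 96;
    let size = chunk_size(base_objects, threads, 250);
    println!("{} chunks of ~{} base objects each", base_objects / size, size);
    // The idea described above: once only `threads` chunks remain, hand out smaller
    // chunks so that no core idles while the last few large chunks finish.
}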

It's odd though that the first few seconds there isn't enough work available for all threads to start right away, and might indicate that the single-threaded producer is not fast enough.

With the less-memory algorithm, it breaks the 10s barrier now, albeit by using much more CPU time in the process:

It's great this one also got faster, even though the core algorithm is definitely unchanged. Probably it gets started more quickly now because the pack verification is done in parallel with everything else - thanks again for giving me the nudge in that direction!
This one probably won't get any faster as it relies entirely on the LRU cache with 64 entries per core for performance. It traverses the pack on a sorted list of pack offsets, ascending, which yields the best hit rate at low memory cost and quite fast preparation.
With a bigger and more performant LRU, there could be some gains here as well, but care should be taken to not make the 'less-memory' algorithm into a 'more-memory' one :D.
Whenever the cache is missed, objects have to be decompressed again, increasing the amount of unnecessary work done compared to the 'perfect' algorithm that is the less-time one. This is how it gets to burn the CPU more.

Also, as far as I can tell, don't pack files have many separately deflate-compressed streams, rather than one giant deflate-compressed stream? Couldn't you decompress multiple pieces in parallel?

Yes, but only if you know where these streams are. Right now the only way to find out (when only looking at the pack file) is to set the decompressor past the header of a pack entry and let it decompress till done. This marks the beginning of the next entry. If it were somehow possible to find entries more quickly, an index could be built much faster. Index traversal is already limited only by sequential/burst IO speeds, as it skips through the pack to read entry headers only. It can do that because it knows where each entry begins (and it's faster than seeking :D).
I was thinking that maybe there is a way to pattern-match bytes with some probability and find the beginnings or ends of zlib streams, without ever missing an entry. If that were the case, one could probably screen the pack at IO speeds. A fun fact: nobody actually wants to decompress the objects during indexing - gitoxide, too, just decompresses them to learn the beginning of the next entry, and the decompression result is thrown away. I did have a version that could keep these results in memory, but it was simply not worth it, as the cost in memory is too great at 7.5 million objects, and for some reason I never got an inkling of additional performance when doing that.
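
For readers following along, a minimal sketch of reading a pack-v2 entry header (a sketch, not gitoxide's code): the header yields the object type and the decompressed size, but not the compressed size, which is why finding the next entry requires inflating the stream to its end.

fn parse_entry_header(pack: &[u8], offset: usize) -> (u8, u64, usize) {
    let mut cursor = offset;
    let mut byte = pack[cursor];
    cursor += 1;
    let kind = (byte >> 4) & 0b111;          // 1..=4: base objects, 6: ofs-delta, 7: ref-delta
    let mut size = (byte & 0b1111) as u64;   // low 4 bits of the *decompressed* size
    let mut shift = 4;
    while byte & 0b1000_0000 != 0 {          // MSB set: another size byte follows
        byte = pack[cursor];
        cursor += 1;
        size |= ((byte & 0b0111_1111) as u64) << shift;
        shift += 7;
    }
    // Deltas are followed by their base location, and then comes the zlib stream.
    // Its compressed length is not recorded anywhere, hence the need to decompress.
    (kind, size, cursor)
}

fn main() {
    // A hand-crafted header: type 1 (commit), decompressed size 21, two header bytes.
    let pack = [0b1001_0101u8, 0b0000_0001];
    assert_eq!(parse_entry_header(&pack, 0), (1, 21, 2));
}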

Since you seem to be involved with zlib much more than I am, let me restate that finding the beginning or end of a zlib stream reliably without decompressing it would be a major thing for git's clone speeds.

And while writing this, the quick zlib comparison came in: Thanks so much! Switching to zlib-ng will be very impactful!
For now I will not try it, to leave you a chance to make it easier to use from Rust. Thus far I managed to avoid pulling in Flate2 to maintain better control and reduce compile times, writing the relatively little control code on top to support streaming and 'all-at-once' decompression. That makes it a bit harder to try new backends, but it's definitely something worth investing time in, as we might be looking at 18.5% faster decompression, while getting a 2.5x boost for compression.
Please note that it's also worth optimizing for keeping the time to reset the decompressor/compressor low - miniz-oxide currently has a very expensive reset that is very measurable when dealing with the kernel pack.

Getting a better zlib decompressor/compressor is definitely high up on the list and I would be happy to allocate some time for it once you give the go-ahead. Ideally it's possible to avoid pulling in Flate2, but I'd do it if you think it's the right thing to do after all (as long as streams can be re-used thanks to a fast reset).

Thanks again, it's great to enjoy your support, and you are having quite an impact on this young project, shaping it for the better!

@joshtriplett
Copy link
Contributor

joshtriplett commented Aug 12, 2020

Whoop whoop, this feels like being knighted, and all the work totally paid off. Iteration after iteration after iteration, and it really was just Rust pushing me into doing "what's right", which also happens to be fast as a freebie.

You're doing incredible work, and I'm glad it's working out so well.

You should consider making a static copy of that pack and index available somewhere, to allow reproducible tests rather than testing with the latest Linux pack each time.

I will try, is there a platform you would recommend for that? Thus far I was unable to obtain the tiny kernel pack, 1.4GB only, as the clones would always abort midway given my incredible 6kb/s download speed. Alternatively, if you have an easy way to share that file, I would be happy to give downloading that a try.

Sure: <redacted by @Byron>

I'd suggest downloading with curl and making sure you can resume in case it gets aborted. Please let me know when you have them so I can remove them. I'll hang onto those same objects for future testing consistency.

SHA256 checksums:

231a2186704d7c2f92713857c9690b0d0eb745d790f0f2f54b0575700179e416  pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
5b69ded6054dd8650c593cd303ff8054591a71ce83d7195fdf5f3d8fca88784a  pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.pack

You may want to rate-limit the --verbose output, because it now adds a non-trivial fraction of the total time:

To me it looks like it's still a tiny fraction, but I might misread what is shown here.

It adds 0.6s to a 10.6s build, which is a 5.7% increase. Not huge, but it's a non-trivial amount of measurement overhead. That said, I completely agree that you don't need to prioritize it. Long-term, there'd be value in improving it just a little bit, since people may want a simple progress indicator for long operations, without needing the full detail you currently capture; however, so many other things are higher priority.

My disposition for this one is to leave it, as I am happy with the small overhead, and that turning it off works so well. It's all monomorphic and the branch predictor certainly has an easy time seeing that this thing is always off or always on :D.
It's kind of incredible that 96 threads are pounding that poor dashmap and it's all working smoothly. prodash was created to support this kind of load without threats having to limit themselves, so maybe there is something prodash can do to make progress lookups even faster. The lookup time is probably what's costing here, and in my single-threaded tests it got 50million progress updates per second, and all threads together just make 7.5 million in ~7s.

Interesting! And yeah, I'm impressed with how good the progress data is. If you ever want to work on it, we should do some actual profiling on the 96-way to see where the overhead is. (It'd be easy to capture perf profiling data, for instance.) And profiling would be a good idea for many other optimizations on the git implementation as well, once we're out of algorithmic improvements and similar. :)

Watching htop, it looks like across those 10s, the middle 4-5 seconds have all 96 CPUs completely saturated (nicely done), while the first and last few seconds have just a few threads running. [emphasis mine]

This all comes down to the chunk size, which right now is static and computed using a fantastic algorithm that probably should be done by someone who knows math. As the total number of chunks can be recomputed, one could know that for the last N chunks (N = num cores) the work has to be split smaller to feed all threads. If that is implemented, I could imagine this ~6.2 number going down to ~5.2 even. Definitely something to look into. Maybe as a first simple option, there could be a 'chunk-size' flag to override the batch size and tune it based on experience. Maybe that way it's already possible to squeeze out another half a second or so.

I did find that algorithm when I was looking for the chunk size, though I didn't dig into the details. As a quick hack, I tried dropping the upper number from 1000 to 250, which made no apparent difference in performance.

What number would you expect to see the --progress interface showing as the number of objects per thread?

It's odd though that the first few seconds there isn't enough work available for all threads to start right away, and might indicate that the single-threaded producer is not fast enough.

When I looked at the full --progress interface, it looks like the threads hadn't even started by that point.

With the less-memory algorithm, it breaks the 10s barrier now, albeit by using much more CPU time in the process:

It's great this one also got faster, even though the core algorithm is definitely unchanged. Probably it gets started more quickly now because the pack verification is done in parallel with everything else - thanks again for giving me the nudge in that direction!
This one probably won't get any faster as it relies entirely on the LRU cache with 64 entries per core for performance. It traverses the pack on a sorted list of pack offsets, ascending, which yields the best hit rate at low memory cost and quite fast preparation.
With a bigger and more performant LRU, there could be some gains here as well, but care should be taken to not make the 'less-memory' algorithm into a 'more-memory' one :D.
Whenever the cache is missed, objects have to be decompressed again, increasing the amount of unnecessary work done compared to the 'perfect' algorithm that is the less-time one. This is how it gets to burn the CPU more.

Makes perfect sense.

Would it be possible, with some care, to use the index to figure out in advance which objects will be needed again and which ones won't? Could you compute a small DAG of objects you need for deltas (without storing the objects themselves), and use that to decide the order in which you process objects? See the sketch below.
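
A minimal sketch of what such a precomputation could look like, assuming a hypothetical input of (entry offset, optional base offset) pairs gathered from the entry headers; nothing of this exists in gitoxide:

use std::collections::HashMap;

// `entries` pairs each entry's own pack offset with the offset of its delta base, if any.
// A cache could drop a base object the moment its remaining use count reaches zero.
fn base_use_counts(entries: &[(u64, Option<u64>)]) -> HashMap<u64, u32> {
    let mut counts = HashMap::new();
    for &(_offset, base) in entries {
        if let Some(base_offset) = base {
            *counts.entry(base_offset).or_insert(0) += 1;
        }
    }
    counts
}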

There's a related problem I'd be extremely interested in a solution to: could you reconstruct objects incrementally, and not reconstruct the entire object at once? For instance, if I needed the first 4k of a blob, could you reconstruct just that 4k, without reconstructing the entire object? (And then reconstruct the next 4k, and the next 4k, on request, as efficiently as possible but without keeping around the whole object?)

Also, as far as I can tell, don't pack files have many separately deflate-compressed streams, rather than one giant deflate-compressed stream? Couldn't you decompress multiple pieces in parallel?

Yes, but only if you know where these streams are. Right now the only way to find out (when only looking at the pack file) is to set the decompressor past the header of a pack entry and let it decompress till done. This marks the beginning of the next entry. If it were somehow possible to find entries more quickly, an index could be built much faster. Index traversal is already limited only by sequential/burst IO speeds, as it skips through the pack to read entry headers only. It can do that because it knows where each entry begins (and it's faster than seeking :D).

Ah, I see. So, anything other than indexing can use the index file to know where all the objects are and decompress them in parallel, but indexing itself has to decompress sequentially to find those offsets? Makes sense.

It's unfortunate that the pack file format stores the decompressed size of each object, but not the compressed size.

I was thinking that maybe there is a way to pattern-match bytes with some probability and find the beginnings or ends of zlib streams, without ever missing an entry. If that were the case, one could probably screen the pack at IO speeds. A fun fact: nobody actually wants to decompress the objects during indexing - gitoxide, too, just decompresses them to learn the beginning of the next entry, and the decompression result is thrown away. I did have a version that could keep these results in memory, but it was simply not worth it, as the cost in memory is too great at 7.5 million objects, and for some reason I never got an inkling of additional performance when doing that.

Since you seem to be involved with zlib much more than I am, let me restate that finding the beginning or end of a zlib stream reliably without decompressing it would be a major thing for git's clone speeds.

Oooh, that's a fun idea. I don't think that it's possible to find the end of the stream in constant time, and I don't think pattern-matching bytes would be possible. But I think that with some care, it would be possible to rapidly skip through a DEFLATE-compressed stream without decompressing it, if you don't actually care about the data itself and you only want to find the end of it. You could skip many of the actual DEFLATE steps, and just decode the symbols from the Huffman tables and skip over match/distance repeats without constructing the decompressed data.

See https://en.wikipedia.org/wiki/DEFLATE and https://tools.ietf.org/html/rfc1951 for details on the DEFLATE format.

I filed zlib-ng/zlib-ng#714 requesting an implementation of this in zlib-ng.

And while writing this, the quick zlib comparison came in: Thanks so much! Switching to zlib-ng will be very impactful!
For now I will not try it, to leave you a chance to make it easier to use from Rust. Thus far I managed to avoid pulling in Flate2 to maintain better control and reduce compile times, writing the relatively little control code on top to support streaming and 'all-at-once' decompression. That makes it a bit harder to try new backends, but it's definitely something worth investing time in, as we might be looking at 18.5% faster decompression, while getting a 2.5x boost for compression.
Please note that it's also worth optimizing for keeping the time to reset the decompressor/compressor low - miniz-oxide currently has a very expensive reset that is very measurable when dealing with the kernel pack.

Getting a better zlib decompressor/compressor is definitely high up on the list and I would be happy to allocate some time for it once you give the go-ahead. Ideally it's possible to avoid pulling in Flate2, but I'd do it if you think it's the right thing to do after all (as long as streams can be re-used thanks to a fast reset).

As far as I know, I'm not aware of flate2 adding any significant overhead, and it provides fairly low-level interfaces in addition to high-level ones. If there's a good reason to, you could use libz-sys directly, but that's a less safe interface. Either way, if you port to libz-sys or to a crate like flate2 that's based on libz-sys, that'll make it trivial to switch to zlib-ng later, as well as making it easy to test zlib-ng now via LD_LIBRARY_PATH.

As far as I can tell, I think zlib's reset should be efficient and not involve reallocating or zeroing any state.

I do think you should go ahead and switch to something that's based on libz-sys underneath, whichever one works best for you.

EDIT: and using flate2 would give you the ability to pick between performance and pure-Rust without having to implement that support yourself.

Thanks again, it's great to enjoy your support, and you are having quite an impact on this young project, shaping it for the better!

Thank you for all the incredible work you're doing; I'm more than happy to keep benchmarking and keep cheering you on, as well as providing the occasional suggestion on approaches. :)

@Byron
Copy link
Member Author

Byron commented Aug 13, 2020

I'd suggest downloading with curl and making sure you can resume in case it gets aborted. Please let me know when you have them so I can remove them. I'll hang onto those same objects for future testing consistency.

The download is complete and I will use these files from here on out. It seems the main difference from the ones produced by GitHub is a greater zlib compression factor.

Interesting! And yeah, I'm impressed with how good the progress data is. If you ever want to work on it, we should do some actual profiling on the 96-way to see where the overhead is. (It'd be easy to capture perf profiling data, for instance.) And profiling would be a good idea for many other optimizations on the git implementation as well, once we're out of algorithmic improvements and similar. :)

I couldn't agree more! Learning from the previous iteration (the one that led to 0.3), I should really try not to be smart until there are some actual measurements. After all, that first version of the streaming code contained quite a bit of complexity just to be able to keep decompressed streams in memory, for zero benefit except boosting the data structure size considerably (even if that feature wasn't used). Luckily, I feel not much time was lost, and fortunately it was never painful to implement in the first place :D (Iβ™₯️Rust).

What number would you expect to see the --progress interface showing as the number of objects per thread?

I think every thread gets 50 base objects and will traverse the tree from there. The actual amount of work will differ greatly between the threads, so there should be a lot of variance. I chose a small amount to avoid starvation, and with 1.2 million base objects, there should be enough chunks to keep most threads busy all the time.

When I looked at the full --progress interface, it looks like the threads hadn't even started by that point.

That's an interesting observation, and it seems to imply that finding a way to spawn many threads faster would be a possible micro-optimization here. If every thread took 10ms to start, we would need about a second to start 96 of them, so the math seems to pan out. Alternatively, it's contention when fetching work - when starting, they all want work at the same time.

Would it be possible, with some care, to use the index to figure out in advance which objects will be needed again and which ones won't? Could you compute a small DAG of objects you need for deltas (without storing the objects themselves), and use that to decide the order you process objects in?

I think that's worth an attempt - it's easy to inject a different cache implementation. It's worth noting that there is tension between having a smarter cache that adds latency to an algorithm that otherwise is instantly producing results.

There's a related problem I'd be extremely interested in a solution to: could you reconstruct objects incrementally, and not reconstruct the entire object at once? For instance, if I needed the first 4k of a blob, could you reconstruct just that 4k, without reconstructing the entire object? (And then reconstruct the next 4k, and the next 4k, on request, as efficiently as possible but without keeping around the whole object?)

That's an interesting one, and I dreamt about streaming deltified objects a long time ago πŸ˜…. There even is an attempt to do that written in C, which I believe even works at least superficially.

The problem would still be latency and memory consumption. The way I see it, one would have to merge all deltas into one, a relatively costly operation for longer chains that costs at least one allocation. Then there is the base object, which is needed for most chunks (presumably), as deltas try their best not to copy new data. Keeping around the 'whole object', which I assume here means the base object, isn't necessary, but most of the time it will have to be loaded to resolve portions of the delta.

Seeing how fast the current implementation is, I am absolutely sure that 'streaming delta resolution' would be slower and cost more compute, and I feel that this would really only be interesting for servers that want to stream a lot of deltified objects at a reduced memory footprint, trading CPU time for it and, to some extent, IO if they chose to reload the base object on each 4kB request. The more I think about it, the more I believe implementing this as a net win for the user requires a lot of prior research to identify which tradeoffs should be made.
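
For context, the deltas being discussed use the documented git pack delta format: two size headers followed by copy-from-base and insert-literal instructions. A rough sketch of applying a single delta (illustrative only, error handling omitted, not gitoxide's implementation):

fn read_varint(data: &[u8], pos: &mut usize) -> u64 {
    // Little-endian base-128: 7 bits per byte, MSB signals a continuation byte.
    let (mut value, mut shift) = (0u64, 0);
    loop {
        let byte = data[*pos];
        *pos += 1;
        value |= ((byte & 0x7f) as u64) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            break value;
        }
    }
}

fn apply_delta(base: &[u8], delta: &[u8]) -> Vec<u8> {
    let mut pos = 0;
    let _base_size = read_varint(delta, &mut pos); // should match base.len()
    let target_size = read_varint(delta, &mut pos);
    let mut out = Vec::with_capacity(target_size as usize);
    while pos < delta.len() {
        let op = delta[pos];
        pos += 1;
        if op & 0x80 != 0 {
            // Copy instruction: bits 0-3 say which offset bytes follow, bits 4-6 which size bytes.
            let (mut ofs, mut size) = (0usize, 0usize);
            for i in 0..4 {
                if op & (1 << i) != 0 {
                    ofs |= (delta[pos] as usize) << (8 * i);
                    pos += 1;
                }
            }
            for i in 0..3 {
                if op & (1 << (4 + i)) != 0 {
                    size |= (delta[pos] as usize) << (8 * i);
                    pos += 1;
                }
            }
            if size == 0 {
                size = 0x10000;
            }
            out.extend_from_slice(&base[ofs..ofs + size]);
        } else {
            // Insert instruction: `op` literal bytes follow inside the delta itself.
            out.extend_from_slice(&delta[pos..pos + op as usize]);
            pos += op as usize;
        }
    }
    out
}

fn main() {
    let base = b"hello world";
    // copy "world", insert " ", copy "hello"
    let delta = [0x0b, 0x0b, 0x91, 0x06, 0x05, 0x01, b' ', 0x90, 0x05];
    assert_eq!(apply_delta(base, &delta), b"world hello".to_vec());
}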

I filed zlib-ng/zlib-ng#714 requesting an implementation of this in zlib-ng.

Awesome, I am eagerly watching and promise that before doing anything related to creating packs, I will make sure zlib-ng can be toggled on. The current zlib module (in gitoxide) is nothing I really like; it's at the level of "happy it actually works and I hope to not have to touch it again", so it's certainly necessary to take another close look and improve it.

As far as I know, I'm not aware of flate2 adding any significant overhead, and it provides fairly low-level interfaces in addition to high-level ones. If there's a good reason to, you could use libz-sys directly, but that's a less safe interface. Either way, if you port to libz-sys or to a crate like flate2 that's based on libz-sys, that'll make it trivial to switch to zlib-ng later, as well as making it easy to test zlib-ng now via LD_LIBRARY_PATH.

Maybe the way to get a comfortable high-level interface is to switch to Flate2. Maybe I was hesitant for the wrong reasons - now that I have a custom implementation, it should be easy to throw it out and replace it with Flate2 to see if this negatively impacts performance after all. Maybe, maybe, not using it was a bit of a premature optimization, so definitely something I will look at as well. I think as long as there is streaming inflate and deflate, fast resets, and ways to deflate or inflate everything at once into pre-allocated buffers, gitoxide would be good to go.
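
As a rough idea of what that low-level usage could look like with flate2 (a sketch under the assumption that each entry's compressed bytes and decompressed size are already known; `inflate_entries` is a made-up name, not gitoxide or flate2 API beyond the calls shown):

use flate2::{Decompress, FlushDecompress, Status};

// `entries` is a hypothetical list of (compressed zlib stream, decompressed size) pairs.
fn inflate_entries(entries: &[(&[u8], usize)]) -> Result<Vec<Vec<u8>>, flate2::DecompressError> {
    let mut inflater = Decompress::new(true); // `true`: streams carry a zlib header, as in packs
    let mut objects = Vec::with_capacity(entries.len());
    for &(compressed, decompressed_size) in entries {
        let mut out = Vec::with_capacity(decompressed_size);
        // decompress_vec appends into the Vec's spare capacity, so the capacity above is the limit.
        let status = inflater.decompress_vec(compressed, &mut out, FlushDecompress::Finish)?;
        assert_eq!(status, Status::StreamEnd);
        objects.push(out);
        // Reusing one inflate state and resetting it is what matters at 7.5 million entries;
        // ideally no allocation happens here.
        inflater.reset(true);
    }
    Ok(objects)
}

Whether `decompress_vec` or the plain `decompress` into a `&mut [u8]` works better is something to measure; the important part is that the inflate state is reused instead of recreated per entry.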

It's unfortunate that the pack file format stores the decompressed size of each object, but not the compressed size.

Indeed, and I think it might be worth looking into what pack format V3 is going to be like. If that's fixed there, there might not be huge payoffs in making pack format V2 faster.

Thank you for all the incredible work you're doing; I'm more than happy to keep benchmarking and keep cheering you on, as well as providing the occasional suggestion on approaches. :)

Yes, let's do that πŸ™Œ!

Byron added a commit that referenced this issue Aug 13, 2020
Keeping the current iteration in tasks.md more focused on what to accomplish.

Related to #1
@joshtriplett
Copy link
Contributor

What number would you expect to see the --progress interface showing as the number of objects per thread?

I think every thread gets 50 base objects and will traverse the tree from there. The actual amount of work will differ greatly between the threads, so there should be a lot of variance. I chose a small amount to avoid starvation, and with 1.2 million base objects, there should be enough chunks to keep most threads busy all the time.

When I saw the last few threads finishing up, the --progress output seemed to be saying they had a batch of around 8000 objects. So, I think something might be going wrong with batching.

When I looked at the full --progress interface, it looks like the threads hadn't even started by that point.

That's an interesting observation, and it seems to imply that finding a way to spawn many threads faster would be a possible micro-optimization here. If every thread took 10ms to start, we would need about a second to start 96 of them, so the math seems to pan out. Alternatively, it's contention when fetching work - when starting, they all want work at the same time.

No, it's not that only some of the threads had started, it's that none of the threads had started, and it wasn't 1s, it was several full seconds. As far as I can tell, gitoxide spent several seconds not processing objects at all; it was doing other processing that wasn't running a thread per CPU.

There's a related problem I'd be extremely interested in a solution to: could you reconstruct objects incrementally, and not reconstruct the entire object at once? For instance, if I needed the first 4k of a blob, could you reconstruct just that 4k, without reconstructing the entire object? (And then reconstruct the next 4k, and the next 4k, on request, as efficiently as possible but without keeping around the whole object?)

That's an interesting one, and I dreamt about streaming deltified objects a long time ago πŸ˜…. There even is an attempt to do that written in C, which I believe even works at least superficially.

The problem would still be latency and memory consumption. The way I see it, one would have to merge all deltas into one, a relatively costly operation for longer chains that costs at least one allocation. Then there is the base object, which is needed for most chunks (presumably), as deltas try their best not to copy new data. Keeping around the 'whole object', which I assume here means the base object, isn't necessary, but most of the time it will have to be loaded to resolve portions of the delta.

Seeing how fast the current implementation is, I am absolutely sure that 'streaming delta resolution' would be slower and cost more compute, and I feel that this would really only be interesting for servers that want to stream a lot of deltified objects at a reduced memory footprint, trading CPU time for it and, to some extent, IO if they chose to reload the base object on each 4kB request. The more I think about it, the more I believe implementing this as a net win for the user requires a lot of prior research to identify which tradeoffs should be made.

I don't think you'd have to keep the base object around; it should be possible to just keep the deflate streams active for an object currently being streamed. Tiny deltas (those smaller than a zlib state) could be kept in memory, while larger deltas or base objects could be kept as zlib streams plus the most recent chunk of data. Each time the caller asks for more data, you could process the deltas and walk the zlib streams forward. (That's assuming that deltas reference the base object more-or-less sequentially, otherwise this might well be a loss.) The goal would be to maintain a kind of "cursor" that can walk forward through an object that might be several MB but isn't being read all at once. (The whole pack would already be mmapped, and the corresponding index loaded.)

I don't know if this would work well or not, but it seems worth an experiment.

@Byron
Copy link
Member Author

Byron commented Aug 14, 2020

When I saw the last few threads finishing up, the --progress output seemed to be saying they had a batch of around 8000 objects. So, I think something might be going wrong with batching.

I think what you probably saw is a thread ending up with 8000 objects to handle, based on the 50 base objects it got. During traversal, threads use unbounded progress reporting, as the actual number isn't known ahead of time except that it's at least 50. It might be that for some reason the chunk-size computation is off; that would be a bug. Tests seem to indicate that it shouldn't happen for kernel-pack sized requests, though.

No, it's not that only some of the threads had started, it's that none of the threads had started, and it wasn't 1s, it was several full seconds. As far as I can tell, gitoxide spent several seconds not processing objects at all; it was doing other processing that wasn't running a thread per CPU.

Aaah, good to know, this restores my faith in the performance of spawning threads :D. Indeed, it takes a moment until the index is built, and all of that should be visible. It should never appear to be doing nothing; that would be a bug as well.
The line renderer is built to not display progress in the first second so as not to clutter up fast runs; maybe that's related as well. The TUI, however, will always display all progress without delay.

I don't think you'd have to keep the base object around; it should be possible to just keep the deflate streams active for an object currently being streamed. Tiny deltas (those smaller than a zlib state) could be kept in memory, while larger deltas or base objects could be kept as zlib streams plus the most recent chunk of data. Each time the caller asks for more data, you could process the deltas and walk the zlib streams forward. (That's assuming that deltas reference the base object more-or-less sequentially, otherwise this might well be a loss.) The goal would be to maintain a kind of "cursor" that can walk forward through an object that might be several MB but isn't being read all at once. (The whole pack would already be mmapped, and the corresponding index loaded.)

I don't know if this would work well or not, but it seems worth an experiment.

I see, I hadn't even thought of keeping the zlib streams around under the given conditions; it makes perfect sense and seems to go the extra mile.

I think once the zlib handling is refactored to be nicer and support zlib-ng, such an experiment could be run. My suggestion is to do thorough analysis in an issue beforehand to think about the involved tradeoffs and the desired outcome, maybe along with a test that triggers current, unwanted behaviour that one tries to improve.

@joshtriplett
Copy link
Contributor

The latest version of the flate2 crate supports building with the zlib-ng-compat feature to enable zlib-ng in zlib-compat mode. That should substantially improve performance.

flate2 = { version = "1.0.17", features = ["zlib-ng-compat"], default-features = false }

@Byron
Copy link
Member Author

Byron commented Aug 19, 2020

I can't wait to try it out! By the looks of it, after this iteration (clone) there will be an iteration to get basic push working.
For that, there will be the first stab at building packs, which is a good motivation to overhaul the current zlib implementation and support more backends, one way or another.

@joshtriplett
Copy link
Contributor

The current verifying and indexing code would also benefit from faster decompression.

@Byron
Copy link
Member Author

Byron commented Aug 20, 2020

Absolutely, they will benefit 'for free' when the changes have landed.
All code for this currently rests in git-odb, and I hope this can be turned into something that can feature-toggle between miniz-oxide and the libz-sys crate, potentially with vendored (and static) zlib-ng enabled.

Maybe that would even be trivial for you to try out, given your knowledge in the field, and maybe you would even conclude that flate2 is all this project needs. I avoided it because of its complexity, but if it can do fast resets and won't allocate during a reset, I think it will be suitable.

@Byron
Copy link
Member Author

Byron commented Dec 5, 2020

Results on a MacBook Air M1, using INTEL binaries:

➜  gitoxide git:(main) ./target/release/gixp --verbose  pack-verify  tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
 15:17:37 SHA1 of index done 212.8MB in 0.44s (480.9MB/s)
 15:17:37 collecting sorted index done 7.6M entries in 1.38s (5.5M entries/s)
 15:17:40                indexing done 7.6M objects in 2.29s (3.3M objects/s)
 15:17:40            SHA1 of pack done 1.4GB in 3.69s (369.1MB/s)
 15:18:23               Resolving done 7.6M objects in 43.14s (176.2k objects/s)
 15:18:23                Decoding done 95.6GB in 43.14s (2.2GB/s)
➜  gitoxide git:(main) ./target/release/gixp --verbose  pack-verify  tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
 15:19:32 SHA1 of index done 212.8MB in 0.43s (498.3MB/s)
 15:19:32 collecting sorted index done 7.6M entries in 1.22s (6.2M entries/s)
 15:19:34            SHA1 of pack done 1.4GB in 2.72s (500.0MB/s)
 15:19:35                indexing done 7.6M objects in 2.26s (3.4M objects/s)
 15:20:17               Resolving done 7.6M objects in 42.69s (178.0k objects/s)
 15:20:17                Decoding done 95.6GB in 42.69s (2.2GB/s)
➜  gitoxide git:(main)

And for comparison, the same with the git binary, also INTEL.

➜  gitoxide git:(main)  time git -c pack.threads=8 index-pack --verify -v --show-resolving-progress tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.pack
Indexing objects: 100% (7600359/7600359), done.
Resolving deltas: 100% (6396700/6396700), done.
git -c pack.threads=8 index-pack --verify -v --show-resolving-progress   280.29s user 22.96s system 434% cpu 1:09.71 total

@Byron
Copy link
Member Author

Byron commented Dec 6, 2020

And here is the result with ARM binaries on a MacBook Air with M1 chip.

➜  gitoxide git:(main) βœ— ./target/release/gixp --verbose  pack-verify  tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
 12:53:09 SHA1 of index done 212.8MB in 0.27s (778.6MB/s)
 12:53:10 collecting sorted index done 7.6M entries in 0.94s (8.1M entries/s)
 12:53:12            SHA1 of pack done 1.4GB in 2.87s (475.0MB/s)
 12:53:12                indexing done 7.6M objects in 2.32s (3.3M objects/s)
 12:53:39               Resolving done 7.6M objects in 26.44s (287.5k objects/s)
 12:53:39                Decoding done 95.6GB in 26.44s (3.6GB/s)
➜  gitoxide git:(main) βœ— ./target/release/gixp --verbose  pack-verify  tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
 12:53:42 SHA1 of index done 212.8MB in 0.25s (844.9MB/s)
 12:53:43 collecting sorted index done 7.6M entries in 0.81s (9.4M entries/s)
 12:53:44            SHA1 of pack done 1.4GB in 1.55s (876.7MB/s)
 12:53:45                indexing done 7.6M objects in 1.90s (4.0M objects/s)
 12:54:11               Resolving done 7.6M objects in 26.52s (286.6k objects/s)
 12:54:11                Decoding done 95.6GB in 26.52s (3.6GB/s)

I had to run the git verification twice just to assure myself that I could trust my eyes: the ARM git binary is significantly faster than the one I was testing previously, despite still being a little slower than gitoxide.

➜  gitoxide git:(main) βœ— time git -c pack.threads=8 index-pack --verify -v --show-resolving-progress tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.pack
Indexing objects: 100% (7600359/7600359), done.
Resolving deltas: 100% (6396700/6396700), done.
git -c pack.threads=8 index-pack --verify -v --show-resolving-progress   92.66s user 21.61s system 353% cpu 32.283 total
➜  gitoxide git:(main) βœ— time git -c pack.threads=8 index-pack --verify -v --show-resolving-progress tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.pack
Indexing objects: 100% (7600359/7600359), done.
Resolving deltas: 100% (6396700/6396700), done.
git -c pack.threads=8 index-pack --verify -v --show-resolving-progress   92.79s user 22.30s system 355% cpu 32.390 total
➜  gitoxide git:(main) βœ— file `which git`
/usr/bin/git: Mach-O universal binary with 2 architectures: [x86_64:Mach-O 64-bit executable x86_64] [arm64e:Mach-O 64-bit executable arm64e]
/usr/bin/git (for architecture x86_64): Mach-O 64-bit executable x86_64
/usr/bin/git (for architecture arm64e): Mach-O 64-bit executable arm64e
➜  gitoxide git:(main) βœ— git --version
git version 2.24.3 (Apple Git-128)
➜  gitoxide git:(main) βœ—

What's incredible is that git's user time indicates it's using no more than 4 threads, keeping its total user time (and possibly its efficiency) at ~93s, whereas gitoxide uses all cores and far more user time for only a small gain.

➜  gitoxide git:(main) βœ— time ./target/release/gixp --verbose  pack-verify  tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
 13:59:25 SHA1 of index done 212.8MB in 0.25s (856.5MB/s)
 13:59:25 collecting sorted index done 7.6M entries in 0.82s (9.2M entries/s)
 13:59:26            SHA1 of pack done 1.4GB in 1.54s (883.4MB/s)
 13:59:27                indexing done 7.6M objects in 2.15s (3.5M objects/s)
 13:59:54               Resolving done 7.6M objects in 26.37s (288.3k objects/s)
 13:59:54                Decoding done 95.6GB in 26.37s (3.6GB/s)
./target/release/gixp --verbose pack-verify   207.23s user 1.62s system 707% cpu 29.535 total

When running again and limiting the process to 4 threads, the user time is still significantly higher, making git a clear winner when it comes to efficiency no matter what.

➜  gitoxide git:(main) βœ— time ./target/release/gixp -t 4 --verbose  pack-verify  tests/fixtures/repos/linux.git/objects/pack/pack-3ee05b0f4e4c2cb59757c95c68e2d13c0a491289.idx
 14:01:29 SHA1 of index done 212.8MB in 0.26s (823.4MB/s)
 14:01:30 collecting sorted index done 7.6M entries in 0.84s (9.0M entries/s)
 14:01:31            SHA1 of pack done 1.4GB in 1.61s (847.6MB/s)
 14:01:32                indexing done 7.6M objects in 2.18s (3.5M objects/s)
 14:02:09               Resolving done 7.6M objects in 36.69s (207.2k objects/s)
 14:02:09                Decoding done 95.6GB in 36.69s (2.6GB/s)
./target/release/gixp -t 4 --verbose pack-verify   150.37s user 1.38s system 380% cpu 39.879 total

@GitoxideLabs GitoxideLabs locked and limited conversation to collaborators Feb 26, 2021
@Byron Byron unpinned this issue Oct 7, 2021
