[performance] beating git in git verify-pack
#1
Comments
[screenshot: git verify-pack statistics]
note that the statistics in the image above show a delta chain length of at most 1, as we have a 'perfect' cache due to the tree-like delta traversal. We are now faster than git for the very first time, a win that didn't come easily and certainly isn't perfect: git uses fewer system calls and doesn't seem to rely visibly on memory-mapping packs for random access, and it seems to use far less memory because of that. However, our implementation scales perfectly and isn't artificially limited to three threads (unless this is somewhat covertly overridden by the …). Since implementing new variants of how to do this kind of work quickly is fairly straightforward, this might not even be the end of it. |
@joshtriplett If you feel like setting a new speed record, you could try running the latest version against the kernel pack (and publish your results here :) ). This time, and with three threads, the Rust version seems to be equal or a little faster, gaining more and more of an advantage the more cores you add. Git indeed does not scale well and gets slower beyond three cores. Now there is a … (My current plan is to eventually post a link to this issue on reddit) |
A few more tests show that the … My expectation would nonetheless be that it will scale well with more cores. |
[screenshots: git verify-pack timings for gitoxide and for git]
According to this… git-oxide is now 27.6/19.5 = 1.41x faster than git in this particular task with three threads, whereas it's 27.6/16.0 = 1.72x faster with 4 threads. And it could scale with more cores. I wonder what the difference in algorithm is that causes git to use less memory (or not rely on memory maps) but fail to scale. |
Just tested. I made sure the packfile and the index were in cache first; cached reads of them off disk take about 0.23s and 0.04s, respectively. Is git's
Note on git-oxide: its help says it can take either a .pack file or a .idx file, but when run on a .pack it doesn't seem to do anything visible except read the pack. To get anything interesting I had to run it on the index. I also found that with 96 threads, the less-memory algorithm was much faster; it uses more user time, but finishes in half the real time, and pulls off more than 10GB/s. Results with git-oxide:
You've definitely got a scaling problem in the default less-time algorithm, because with 48 threads instead of 96, it finishes faster:
By contrast, the less-memory algorithm ran just barely slower with 48 threads; that suggests a scaling problem as well, but not quite as bad of one:
Results at 24 threads:
12 threads:
And 6 threads:
Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking? Also, the
A bunch of threads with 0-2 entries looks like a problematic bottleneck. Hope that helps. |
Thanks so much for taking the time! Very informative, and very valuable information…! Right now I wouldn't know if this can get any faster, or if at some point the memory subsystem of the machine is simply overloaded when too many threads compete. It accesses the memory-mapped pack, decompresses the data, and computes SHA1 and CRC32 for all entries, and without debugging on such a machine I doubt any further optimizations can be done reliably. My answers:
At least that's what I think, more or less. The idea is to build a tree of deltas to allow traversing the chain from base to delta, and not from delta to base. This allows for a cache implementation that is perfect, i.e. a delta can always be applied to its base from cache.
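To make that idea concrete, here is a minimal sketch of the base-to-delta traversal with hypothetical types and closures (not gitoxide's actual data structures): every child delta receives its already-resolved base, so no base ever has to be looked up or resolved twice.

```rust
// Sketch only: traverse a delta tree from base to delta, so each child can be
// applied directly against an in-memory, already-resolved base.
struct Node {
    pack_offset: u64,     // where this entry lives in the pack
    children: Vec<usize>, // indices of deltas that use this entry as base
}

struct Tree {
    nodes: Vec<Node>,
    roots: Vec<usize>, // undeltified (base) objects
}

fn traverse(
    tree: &Tree,
    resolve: &impl Fn(u64) -> Vec<u8>,              // decompress entry data at offset
    apply_delta: &impl Fn(&[u8], &[u8]) -> Vec<u8>, // (base, delta) -> object
    mut consume: impl FnMut(&[u8]),                 // e.g. compute SHA1/CRC32
) {
    fn visit(
        tree: &Tree,
        index: usize,
        base: Option<&[u8]>,
        resolve: &impl Fn(u64) -> Vec<u8>,
        apply_delta: &impl Fn(&[u8], &[u8]) -> Vec<u8>,
        consume: &mut impl FnMut(&[u8]),
    ) {
        let data = resolve(tree.nodes[index].pack_offset);
        // The resolved base travels down the stack, acting as a "perfect cache".
        let object = match base {
            Some(base) => apply_delta(base, &data),
            None => data,
        };
        consume(&object);
        for &child in &tree.nodes[index].children {
            visit(tree, child, Some(&object), resolve, apply_delta, &mut *consume);
        }
    }
    for &root in &tree.roots {
        visit(tree, root, None, resolve, apply_delta, &mut consume);
    }
}
```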
Without having studied git's code in detail, it looks like it decompresses objects while indexing them, storing some of them in memory. To me it seems it doesn't need an index for this to work.
Yes, related to the above. A valid pack always has an index, and I think it's beneficial to make use of it. The reason git doesn't do it to that extent might be that
Indeed, the parallelisation is done using a super simple fan-in-fan-out map-reduce using nothing more than std threads and channels. In a way, this already is work stealing, but it can only steal entire chunks of work. Right now, the chunk size is rather big, and the smaller it is, the more one pays for the channel synchronisation. Probably some tuning is possible there. Maybe it's worth trying to have chunks as small as 10 objects - for all I know, chunks of a single object (i.e. no chunking) visibly slow it down due to sync overhead. |
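A rough sketch of the fan-out/fan-in shape described in the previous comment, using nothing but std threads and channels; the chunk size, worker count, and the per-chunk "work" are placeholders, not gitoxide's actual code:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Sketch: chunks of work go out over one channel, results come back over another.
fn process_in_chunks(items: Vec<u64>, workers: usize, chunk_size: usize) -> usize {
    let (work_tx, work_rx) = mpsc::channel::<Vec<u64>>();
    let work_rx = Arc::new(Mutex::new(work_rx));
    let (result_tx, result_rx) = mpsc::channel::<usize>();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let work_rx = Arc::clone(&work_rx);
        let result_tx = result_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Pull one whole chunk; smaller chunks balance load better but pay
            // more channel synchronization per object.
            let chunk = {
                let rx = work_rx.lock().unwrap();
                rx.recv()
            };
            match chunk {
                Ok(chunk) => result_tx.send(chunk.len()).unwrap(), // placeholder work
                Err(_) => break, // work channel closed: we're done
            }
        }));
    }
    drop(result_tx); // only workers should keep result senders alive

    for chunk in items.chunks(chunk_size) {
        work_tx.send(chunk.to_vec()).unwrap();
    }
    drop(work_tx); // lets workers shut down once the queue drains

    let total = result_rx.iter().sum();
    for handle in handles {
        handle.join().unwrap();
    }
    total
}

fn main() {
    let items: Vec<u64> = (0..1_000).collect();
    println!("processed {} objects", process_in_chunks(items, 4, 50));
}
```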
Regarding the …: maybe using a different channel implementation can speed this up - right now … |
I have played around with the relevant variables a little and came up with a better (read: higher) chunk size for the 'less-time' algorithm, while opening the code up a little for smarter heuristics as time passes. Ideally, it finds chunk sizes that keep threads busy for no more than around a second - that way starvation won't really be noticeable, while keeping the synchronization cost low. |
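As a sketch of such a heuristic (all numbers are illustrative assumptions, not what gitoxide computes): aim for roughly one second of work per chunk, but never make chunks so large that some threads would sit idle near the end.

```rust
/// Sketch of a chunk-size heuristic. `objects_per_second` is an assumed
/// per-thread throughput; the target is ~1s of work per chunk, while still
/// producing at least a few chunks per thread so no thread starves at the end.
fn chunk_size(total_objects: usize, threads: usize, objects_per_second: usize) -> usize {
    let ideal = objects_per_second; // ~1 second of work per chunk
    let max_for_fairness = (total_objects / (threads * 4)).max(1); // >= 4 chunks per thread
    ideal.min(max_for_fairness).max(1)
}

fn main() {
    // 1.2 million base objects (as mentioned later in this thread), 4 threads,
    // and a made-up per-thread throughput of 500k objects/s.
    println!("{}", chunk_size(1_200_000, 4, 500_000));
}
```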
Closing this issue as |
It could be done, certainly if optimizing for speed is the goal. However, I thought it's not necessary to start the per-object checks if the SHA1 of the whole pack already turns out to be wrong. Maybe there could be a flag to simply skip the SHA1 checks and proceed with the per-object checks. Something like that could certainly be added though if it should get a little faster overall. |
On Wed, Jul 22, 2020 at 10:00:09PM -0700, Sebastian Thiel wrote:
> @joshtriplett
> > Side note: is there any reason you can't compute the sha1s of the pack and index in parallel, and in parallel with the subsequent checking?
> It could be done, certainly if optimizing for speed is the goal. However, I thought it's not necessary to start the per-object checks if the SHA1 of the whole pack already turns out to be wrong.
> Maybe there could be a flag to simply skip the SHA1 checks and proceed with the per-object checks.
> Something like that could certainly be added though if it should get a little faster overall.
I'm less concerned about the performance of verify-pack in particular,
and more thinking that in general git-oxide should try to do as much as
possible in parallel.
|
With further optimizations, I now get down to 90s for verifying the kernel pack, compared to three minutes for git. The same data structure is used for receiving/indexing packs too, which allows cloning faster than git, with … The above time was achieved with four threads, as opposed to the three that git uses by default. For comparison, here is the same task with git. I think one major boost … My reasoning for doing it like this is that … |
@Byron Nice! I would love to see a faster |
@joshtriplett The current main (6725104) now does everything in parallel; even the initial file hashes run while the traversal is already happening. On my machine, it's not actually faster as both tasks compete for IO, so I am considering reverting it in favour of simpler code. However, I wouldn't want to do that before getting some more data from a bigger machine - the new indexing algorithm should shatter the previous results as it is entirely lock-free, yielding a 2.7x speedup on my machine alone for verifying the kernel pack. It should be particularly rewarding to try out creating an index for a pack as well (see issue #5), which is 1.7x faster than git on my machine. A whopping 60s is spent on indexing, and there … Everything related to … Bonus: here is how long this takes in single-threaded mode (which can be produced with … |
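The "file hashes run while the traversal is already happening" part can be pictured like this; a sketch only, assuming the RustCrypto sha1 crate rather than whatever gitoxide uses internally:

```rust
use sha1::{Digest, Sha1};
use std::thread;

// Sketch: compute the pack's trailing SHA1 in the background while the
// (placeholder) per-object traversal runs on the calling thread.
fn verify(pack_path: std::path::PathBuf) -> std::io::Result<()> {
    let hash_thread = thread::spawn(move || -> std::io::Result<Vec<u8>> {
        let data = std::fs::read(&pack_path)?;
        // A pack's trailing checksum covers everything before the final 20 bytes.
        let payload = &data[..data.len().saturating_sub(20)];
        Ok(Sha1::digest(payload).to_vec())
    });

    // ... per-object traversal would run here, competing with the hash for IO ...

    let pack_hash = hash_thread.join().expect("hashing thread panicked")?;
    println!("pack checksum: {:02x?}", pack_hash);
    Ok(())
}
```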
You should consider making a static copy of that pack and index available somewhere, to allow reproducible tests rather than testing with the latest Linux pack each time. |
This is incredibly impressive scalability now.
You may want to rate-limit the
Watching With the
Pack indexing:
You're right that it spends most of its time limited by decompression, then spends a few seconds using many CPUs, then finishes. Regarding decompression performance, try replacing … Also, as far as I can tell, don't pack files have many separately deflate-compressed streams, rather than one giant deflate-compressed stream? Couldn't you decompress multiple pieces in parallel? |
Quick informal benchmarks of DEFLATE implementations

For test data, I tarred up the contents of … and used the following program:

```rust
use std::io::Read;
use std::time::Instant;

use flate2::Compression;
use flate2::bufread::{GzDecoder, GzEncoder};

fn main() -> anyhow::Result<()> {
    let mut dest = Vec::with_capacity(1024 * 1024 * 1024);

    // Compression pass: read the tarball through a gzip encoder.
    let uncompressed_data = std::fs::read("/tmp/usr-bin.tar")?;
    let time = Instant::now();
    let mut gz = GzEncoder::new(uncompressed_data.as_slice(), Compression::default());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Compressed {} bytes to {} bytes in {:?}", uncompressed_data.len(), dest.len(), elapsed);

    dest.clear();

    // Decompression pass: read the pre-compressed tarball through a gzip decoder.
    let compressed_data = std::fs::read("/tmp/usr-bin.tar.gz")?;
    let time = Instant::now();
    let mut gz = GzDecoder::new(compressed_data.as_slice());
    gz.read_to_end(&mut dest)?;
    let elapsed = time.elapsed();
    eprintln!("Decompressed {} bytes to {} bytes in {:?}", compressed_data.len(), dest.len(), elapsed);

    Ok(())
}
```

For each test, I compiled …

miniz (default pure-Rust implementation):
Compressed 462356480 bytes to 161332179 bytes in 25.073941226s

miniz-sys (…
|
Whoop whoop, this feels like being knighted, and all the work totally paid off. Iteration after iteration after iteration, and it really was just Rust pushing me into doing "what's right", which also happens to be fast as a freebie.
I will try; is there a platform you would recommend for that? Thus far I was unable to obtain the tiny kernel pack, only 1.4GB, as the clones would always abort midway given my incredible 6kb/s download speed. Alternatively, if you have an easy way to share that file, I would be happy to give downloading it a try.
To me it looks like it's still a tiny fraction, but I might be misreading what is shown here. My inclination for this one is to leave it, as I am happy with the small overhead and with how well turning it off works. It's all monomorphic and the branch predictor certainly has an easy time seeing that this thing is always off or always on :D.
This all comes down to the chunk size, which right now is static and computed using a fantastic algorithm that probably should be written by someone who knows math. As the total number of chunks can be recomputed, I could imagine detecting that we are in the last N chunks (N = num cores) and splitting those smaller to feed all threads. If that were implemented, I could imagine this ~6.2 number going down to even ~5.2. Definitely something to look into. Maybe as a first simple option, there could be a 'chunk-size' flag to override the batch size and tune it based on experience. Maybe that way it's already possible to squeeze out another half a second or so. It's odd, though, that in the first few seconds there isn't enough work available for all threads to start right away, which might indicate that the single-threaded producer is not fast enough.
It's great this one also got faster, even though the core algorithm is definitely unchanged. Probably it gets started more quickly now because the pack verification is done in parallel with everything else - thanks again for giving me the nudge in that direction!
Yes, but only if you know where these streams are. Right now the only way to find out (when only looking at the pack file) is to set the decompressor past the header of a pack entry and let it decompress till done; this marks the beginning of the next entry. If it were somehow possible to find entries more quickly, an index could be built much faster. Index traversal already does this, limited only by sequential/burst IO speeds, by skipping through the pack to read entry headers only. It can do that because it knows where each entry begins (and it's faster than seeking :D). Since you seem to be involved with zlib much more than I am, let me restate that finding the beginning or end of a zlib stream reliably without decompressing it would be a major thing for clone speeds of git. And while writing this, the quick zlib comparison came in: thanks so much! Switching to zlib-ng will be very impactful! Getting a better zlib decompressor/compressor is definitely high up on the list and I would be happy to allocate some time for it once you give the go. Ideally it's possible to avoid pulling in Flate2, but I'd do it if you think it's the right thing to do after all (as long as streams can be re-used thanks to a fast reset). Thanks again, it's great enjoying your support, and you are having quite an impact on this young project, shaping it for the better! |
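For reference, the header-skipping mentioned above works because a pack entry's header is self-delimiting even though its zlib stream is not. A sketch of decoding such a header, based on the documented pack v2 entry format (not gitoxide's code):

```rust
/// Sketch: decode a git pack v2 entry header starting at `data[0]`.
/// Returns (object type id, decompressed size, header length in bytes).
/// The compressed size is NOT stored, which is exactly why finding the next
/// entry requires inflating the current one.
fn entry_header(data: &[u8]) -> (u8, u64, usize) {
    let mut byte = data[0];
    let kind = (byte >> 4) & 0b111; // 1..=4 = commit/tree/blob/tag, 6/7 = ofs/ref delta
    let mut size = (byte & 0b1111) as u64;
    let mut shift = 4;
    let mut consumed = 1;
    while byte & 0b1000_0000 != 0 {
        byte = data[consumed];
        size |= ((byte & 0b0111_1111) as u64) << shift;
        shift += 7;
        consumed += 1;
    }
    (kind, size, consumed)
}
```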
You're doing incredible work, and I'm glad it's working out so well.
Sure: <redacted by @Byron> I'd suggest downloading with curl and making sure you can resume in case it gets aborted. Please let me know when you have them so I can remove them. I'll hang onto those same objects for future testing consistency. SHA256 checksums:
It adds 0.6s to a 10.6s build, which is a 5.7% increase. Not huge, but it's a non-trivial amount of measurement overhead. That said, I completely agree that you don't need to prioritize it. Long-term, there'd be value in improving it just a little bit, since people may want a simple progress indicator for long operations, without needing the full detail you currently capture; however, so many other things are higher priority.
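One cheap way to get such rate limiting (a sketch, not what gitoxide's progress handling actually does): workers only bump an atomic counter, and a single display thread decides how often anything gets formatted and written.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Sketch: workers pay only a relaxed atomic increment; the display thread
// controls how often progress is actually rendered.
fn main() {
    let counter = Arc::new(AtomicUsize::new(0));
    let total = 1_000_000;

    let progress = {
        let counter = Arc::clone(&counter);
        thread::spawn(move || loop {
            let done = counter.load(Ordering::Relaxed);
            eprintln!("{}/{} objects", done, total);
            if done >= total {
                break;
            }
            thread::sleep(Duration::from_millis(250)); // the rate limit lives here
        })
    };

    // Stand-in for the real worker threads.
    for _ in 0..total {
        counter.fetch_add(1, Ordering::Relaxed);
    }
    progress.join().unwrap();
}
```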
Interesting! And yeah, I'm impressed with how good the progress data is. If you ever want to work on it, we should do some actual profiling on the 96-way to see where the overhead is. (It'd be easy to capture
I did find that algorithm when I was looking for the chunk size, though I didn't dig into the details. As a quick hack, I tried dropping the … What number would you expect to see the
When I looked at the full
Makes perfect sense. Would it be possible, with some care, to use the index to figure out in advance which objects will be needed again and which ones won't? Could you compute a small DAG of objects you need for deltas (without storing the objects themselves), and use that to decide the order you process objects in? There's a related problem I'd be extremely interested in a solution to: could you reconstruct objects incrementally, and not reconstruct the entire object at once? For instance, if I needed the first 4k of a blob, could you reconstruct just that 4k, without reconstructing the entire object? (And then reconstruct the next 4k, and the next 4k, on request, as efficiently as possible but without keeping around the whole object?)
Ah, I see. So, anything other than indexing can use the index file to know where all the objects are and decompress them in parallel, but indexing itself has to decompress sequentially to find those offsets? Makes sense. It's unfortunate that the pack file format stores the decompressed size of each object, but not the compressed size.
Oooh, that's a fun idea. I don't think that it's possible to find the end of the stream in constant time, and I don't think pattern-matching bytes would be possible. But I think that with some care, it would be possible to rapidly skip through a DEFLATE-compressed stream without decompressing it, if you don't actually care about the data itself and you only want to find the end of it. You could skip many of the actual DEFLATE steps, and just decode the symbols from the Huffman tables and skip over match/distance repeats without constructing the decompressed data. See https://en.wikipedia.org/wiki/DEFLATE and https://tools.ietf.org/html/rfc1951 for details on the DEFLATE format. I filed zlib-ng/zlib-ng#714 requesting an implementation of this in zlib-ng.
As far as I know, I'm not aware of … As far as I can tell, I think zlib's reset should be efficient and not involve reallocating or zeroing any state. I do think you should go ahead and switch to something that's based on … EDIT: and using flate2 would give you the ability to pick between performance and pure-Rust without having to implement that support yourself.
Thank you for all the incredible work you're doing; I'm more than happy to keep benchmarking and keep cheering you on, as well as providing the occasional suggestion on approaches. :) |
The download is complete and I will use these files from here on out. It seems the main difference from the ones produced by GitHub is a greater zlib compression factor.
I couldn't agree more! Learning from the previous iteration (the one that led to 0.3), I should really try not to be smart until there are some actual measurements. After all, that first version of the streaming code contained quite some complication to be able to keep decompressed streams in memory, for zero benefit except for boosting the data structure size considerably (even if that feature wasn't used). Luckily, I feel not much time was lost, and fortunately it was never painful to implement it in the first place :D (I
I think every thread gets 50 base objects and will traverse the tree from there. The actual amount of work to do will differ greatly between the threads, there should be a lot of variance. I chose a small amount to avoid starvation, and with 1.2 million base objects, there should be enough chunks to keep most threads busy all the time.
That's an interesting observation, and it seems to imply that finding a way to spawn many threads faster would be a possible micro-optimization here. If every thread took 10ms to start, we would need roughly a second just to start all 96 of them (96 x 10ms ≈ 1s).
I think that's worth an attempt - it's easy to inject a different cache implementation. It's worth noting, though, that a smarter cache adds latency to an algorithm that otherwise produces results instantly.
That's an interesting one, and I dreamt about streaming deltified objects a long time ago. There even is an attempt to do that written in C, which I believe works at least superficially. The problem would still be latency and memory consumption. The way I see it, one would have to merge all deltas into one, a relatively costly operation for longer chains that costs at least one allocation. Then there is the base object, which is needed for most chunks (presumably), as deltas try their best not to copy new data. Keeping around the 'whole object', which I assume here is the base object, isn't necessary, but most of the time it will have to be loaded to resolve portions of the delta. Seeing how fast the current implementation is, I am absolutely sure that 'streaming delta resolution' will be slower and cost more compute, and I feel that this would really only be interesting for servers that want to stream a lot of deltified objects at a reduced memory footprint, trading CPU time for it and, to some extent, IO if they chose to reload the base object on each 4kB request. The more I think about it, the more I believe implementing this as a net win for the user requires a lot of prior research to identify which tradeoffs should be made.
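For context on what resolving "portions of the delta" would mean, here is a sketch of how a whole git delta is applied today, based on the documented delta format (error handling omitted); a streaming variant would have to replay these copy/insert instructions lazily instead of all at once.

```rust
/// Sketch: apply a git pack delta to its base. The delta starts with two
/// varint sizes (base size, result size), followed by copy and insert
/// instructions. Panics instead of returning errors to keep the sketch short.
fn apply_delta(base: &[u8], delta: &[u8]) -> Vec<u8> {
    fn varint(data: &[u8], pos: &mut usize) -> u64 {
        let (mut value, mut shift) = (0u64, 0);
        loop {
            let byte = data[*pos];
            *pos += 1;
            value |= ((byte & 0x7f) as u64) << shift;
            shift += 7;
            if byte & 0x80 == 0 {
                return value;
            }
        }
    }

    let mut pos = 0;
    let expected_base_len = varint(delta, &mut pos) as usize;
    let result_len = varint(delta, &mut pos) as usize;
    assert_eq!(expected_base_len, base.len(), "delta does not match this base");

    let mut out = Vec::with_capacity(result_len);
    while pos < delta.len() {
        let cmd = delta[pos];
        pos += 1;
        if cmd & 0x80 != 0 {
            // Copy from base: low 4 bits select offset bytes, next 3 bits select size bytes.
            let (mut offset, mut size) = (0usize, 0usize);
            for bit in 0..4 {
                if cmd & (1 << bit) != 0 {
                    offset |= (delta[pos] as usize) << (8 * bit);
                    pos += 1;
                }
            }
            for bit in 0..3 {
                if cmd & (1 << (4 + bit)) != 0 {
                    size |= (delta[pos] as usize) << (8 * bit);
                    pos += 1;
                }
            }
            if size == 0 {
                size = 0x10000;
            }
            out.extend_from_slice(&base[offset..offset + size]);
        } else if cmd != 0 {
            // Insert the next `cmd` bytes from the delta itself.
            out.extend_from_slice(&delta[pos..pos + cmd as usize]);
            pos += cmd as usize;
        } else {
            panic!("delta command 0 is reserved");
        }
    }
    assert_eq!(out.len(), result_len);
    out
}
```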
Awesome, I am eagerly watching and promise that before doing anything related to creating packs, I will ensure zlib-ng can be toggled on. The current zlib module (in
Maybe the way to get a comfortable high-level interface is to switch to Flate2. Maybe I was hesitant for the wrong reasons - now that I have a custom implementation, it should be easy to throw it out and replace it with Flate2 to see if this negatively impacts performance after all. Maybe, maybe, not using it was a bit of a premature optimization, so definitely something I will look at as well. I think as long as there is streaming inflate and deflate, fast resets, and ways to deflate or inflate everything at once into pre-allocated buffers,
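If flate2 were adopted, the "fast resets" requirement could look roughly like this; a sketch that assumes flate2's Decompress type and zlib-framed pack entries, not a description of gitoxide's current code:

```rust
use flate2::{Decompress, FlushDecompress, Status};

// Sketch: one Decompress state reused across many pack entries via reset(),
// so per-entry inflation does not reallocate zlib state.
fn inflate_entries(entries: &[&[u8]]) -> Result<Vec<Vec<u8>>, flate2::DecompressError> {
    let mut state = Decompress::new(true); // true: expect a zlib header, as in packs
    let mut out = Vec::with_capacity(entries.len());
    for compressed in entries {
        // A real caller would size this from the entry header's decompressed size.
        let mut buf = Vec::with_capacity(compressed.len() * 4);
        let status = state.decompress_vec(compressed, &mut buf, FlushDecompress::Finish)?;
        assert!(matches!(status, Status::StreamEnd), "entry larger than the guessed buffer");
        out.push(buf);
        state.reset(true); // cheap reuse for the next entry
    }
    Ok(out)
}
```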
Indeed, and I think it might be worth looking into what pack format V3 is going to be like. If that's fixed there, there might not be huge payoffs in making pack format V2 faster.
Yes, let's do that! |
Keeping the current iteration in tasks.md more focused on what to accomplish. Related to #1
When I saw the last few threads finishing up, the
No, it's not that only some of the threads had started, it's that none of the threads had started, and it wasn't 1s, it was several full seconds. As far as I can tell, gitoxide spent several seconds not processing objects at all; it was doing other processing that wasn't running a thread per CPU.
I don't think you'd have to keep the base object around; it should be possible to just keep the deflate streams active for an object currently being streamed. Tiny deltas (those smaller than a zlib state) could be kept in memory, while larger deltas or base objects could be kept as zlib streams plus the most recent chunk of data. Each time the caller asks for more data, you could process the deltas and walk the zlib streams forward. (That's assuming that deltas reference the base object more-or-less sequentially, otherwise this might well be a loss.) The goal would be to maintain a kind of "cursor" that can walk forward through an object that might be several MB but isn't being read all at once. (The whole pack would already be mmapped, and the corresponding index loaded.) I don't know if this would work well or not, but it seems worth an experiment. |
I think what you probably saw is a thread ending up with 8000 objects to handle, based on the 50 base objects it got. During traversal, threads use unbounded progress reporting, as the actual number isn't known ahead of time except that it's at least 50. It might be that for some reason the chunksize computation is off, that would be a bug. Tests seem to indicate that it shouldn't happen though for kernel-pack sized requests.
Aaah, good to know, this restores my faith in the performance of spawning threads :D. Indeed, it takes a moment until the index is built, and all of that should be visible. It should never appear to be doing nothing; that would be a bug as well.
I see; I hadn't even thought of keeping the zlib streams around under the given conditions - it makes perfect sense and goes the extra mile. I think once the zlib handling is refactored to be nicer and support zlib-ng, such an experiment could be run. My suggestion is to do a thorough analysis in an issue beforehand to think about the involved tradeoffs and the desired outcome, maybe along with a test that triggers the current, unwanted behaviour that one tries to improve. |
The latest version of the
|
I can't wait to try it out! By the looks of it, after this iteration (clone) there will be an iteration to get basic push working. |
The current verifying and indexing code would also benefit from faster decompression. |
Absolutely, they will benefit 'for free' when the changes have landed. Maybe that would even be trivial for you to try out, given your knowledge in the field, and maybe you would even conclude that |
Results on a MacBook Air M1, using INTEL binaries:
And for comparison, the same with the
|
And here is the result with ARM binaries on a MacBook Air with M1 chip.
I had to run the
What's incredible is that the effective user time of git indicates it's using no more than 4 threads, bringing its user time (and possibly efficiency) down to ~93s, whereas gitoxide uses all cores and more user time only for a small gain.
When running it again and limiting the process to 4 threads, the user time is still significantly higher, making git a clear winner when it comes to efficiency no matter what.
|
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
When trying to beat git verify-pack…
Attempt 1
I remembered timings on a cold cache that indicated something around 5:50min for git to run verify-pack on the Linux kernel pack. However, it turns out that if the environment is a little more controlled, git is still considerably faster than us, despite us using an LRU cache and despite using multiple cores quite efficiently.
Observation
Git uses a streaming pack approach which is optimized to apply objects inversely. It works by
We work using a memory mapped file which is optimized for random access, but won't be very fast for this kind of workload.
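Concretely, the memory-mapped random access described above amounts to something like this (a sketch using the memmap2 crate; the offsets would come from the .idx file, and this is not gitoxide's actual code):

```rust
use memmap2::Mmap;
use std::fs::File;

// Sketch: map the pack once, then any thread can slice into it at an offset
// taken from the index - random access, no per-object read() calls.
fn map_pack(path: &str, offsets: &[u64]) -> std::io::Result<()> {
    let file = File::open(path)?;
    // Safety: the pack is assumed not to be modified while mapped.
    let pack = unsafe { Mmap::map(&file)? };
    for &offset in offsets {
        let entry = &pack[offset as usize..]; // entry header + zlib stream start here
        println!("first header byte at {}: {:#010b}", offset, entry[0]);
    }
    Ok(())
}
```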
How to fix
Wait until we have implemented a streaming pack as well and try again, having the same algorithmic benefits possibly paired with more efficient memory handling.
Git for some reason limits itself to 3 threads, even though we do benefit from having more threads, so we could be faster just because of that.
The streaming (indexing) phase of reading a pack can be parallelised in case we have a pack on disk, and it should be easy to implement if the index data structure itself is thread-safe (but it might not be worth the complexity or memory overhead, let's see).