Research/quantify performance envelopes of multiple CDC algorighms #227

ribasushi · 2019-12-04T19:38:01Z

The text was updated successfully, but these errors were encountered:

DonaldTsang · 2019-12-06T16:16:21Z

My ideas for IPFS challenge:

The whole of Wikipedia + other MediaWiki sites
All the books in Project Gutenberg + LibriVox + Europeana + Digital Public Library of America
All Prime numbers in http://www.primos.mat.br/ (uncompressed)
All digits of Pi
- Decimal https://drive.google.com/drive/folders/1L_HnNULhHSuDabD036H94pGdD-XbKhLy?usp=sharing
- Hex https://drive.google.com/drive/folders/1PzUPd4gxQjJ5HbPRxZFyFvCwmMq1Dy3Z?usp=sharing
The Whole Linux + All the repos in Ubuntu/Debian PPA, Arch AUR and Fedora RPM

aidanhs · 2019-12-23T21:45:36Z

Hi! I've been reading up on IPFS recently and stumbled across a mention of chunking, which got me interested because they're sort of a side hobby (manifesting in a repo I occasionally maintain to compare rolling checksum implementations - https://github.com/aidanhs/rollsum-tests). Searches led me to ipfs-inactive/archives#137, then ipfs/go-ipfs-chunker#18 and now here!

I don't know how interesting/useful my input may be, so I'll just make a series of comments - happy to elaborate on any, or feel free to ignore 😄

1 - speed of Rabin

There's a lot of discussion of the current rabin implementation being slow. I've not seen quantification of this, other than ipfs-inactive/archives#137. Unfortunately the discussion is a bit muddled but there's mention of a 16 hour run for 61GB (modarchive). This number isn't credible to me as being caused by Rabin - the benchmarking I've done (results available on my repo) indicates that the IPFS Rabin with a 256KB average chunk (IPFSRA256) takes ~8s for a ~1.5GB file.

Don't get me wrong, there's improvement to be had! With my SSD and a clean cache I can tar a 9GB directory of ~90k files and send to /dev/null in 10s (or /dev/zero in 1min10s) - it'd be nice to have something on a par. But I'd be interested to see measurements indicating it's a limiting factor.

2 - purpose of CDC

Being a newcomer to IPFS I'm probably missing context (having just read a bunch of issues), but there seems to be quite a high expectation of chunking on deduplication and the like.

I personally would expect it to deduplicate A) files that are identical and B) 'items' within other files (.tar, .a, disk images, maybe object sections within executables)...and fixed size chunking already does A! So I would expect essentially no space saving for everything in the comment above, apart from the Linux repos.

A different way to look at it is that CDC "avoids the pathologically poor cases of fixed size chunking and handles archives well", rather than making it order of magnitude better in the general case. I've seen a number of issues about format-specific chunking which sounds awesome, but is orthogonal to (and possibly partially in conflict with) CDC. To be clear, I'm a huge fan of CDC (you can probably tell), but I do think it's important to set expectations.

3 - chunking configuration and algorithm choice

Based on the above, we want chunks on average a fair amount smaller than the typical file size...otherwise, in case B the chunks will span multiple embedded items and be dependent on their ordering in the container!

IPFS currently uses 512KB - this is pretty large for CDC (attic uses 64KB, anything using bupsplit (perkeep, bup) uses 8KB). Based on the above, it would then be best suited for files approx >2MB (small files would still work, just wouldn't benefit). I realise there may be other influencing factors, e.g. IPFS performance with small files.

For the algorithm itself, you'd like something fast! But there's more than that - it'd also be nice to have something that has a small standard deviation of chunk sizes. If there's a large standard deviation you're more regularly hitting the min or max limits and losing the benefits of CDC. The current IPFS buzhash relies heavily on the chunk limits - it seems to actually be configured for 128KB splits but the aggressive limits raise the average to 256KB. My current results don't show this, but it's basically on a par with Gear from rsroll (RSROGE256) for standard deviation. bupsplit modified to target 256KB chunks is a fair amount better, but I've not yet done the analysis to figure if what that 'better' means.

4 - final misc notes

bupsplit with 8KB would be my default 'easy' recommendation - there are multiple language implementations to verify against, it's being used by multiple people and it has a decent heritage (from rsync)
- the best bupsplit implementations take ~2.2s per GB, rather than ~1.2s for buzhash - this feels pretty marginal to me, and given how well you've optimised buzhash I feel you can equal the best bupsplit implementations
FastCDC is mentioned in (WIP) records + merkledag specs #7. I've not yet had a chance to investigate it but I've seen it described as an improvement on Gear and would probably be what I'd go for if I really needed to get that edge over bupsplit
I strongly recommend going with something that has an existing competing implementation so you can feed PB of data in and see if they diverge. When I was implementing Rabin a long time ago, every single implementation out there bar one was incorrect (matching the experience of Rabin Fingerprinting notes#1 (comment))

(cc ipfs/go-ipfs-chunker#13 which mentions bup)

DonaldTsang · 2019-12-24T11:52:03Z

@aidanhs if it is possible read the lists in gilbertchen/benchmarking#13 and gilbertchen/benchmarking#14 for a consensus in chunking please, because I would like to know what chunk size is the most common.

ribasushi · 2020-03-23T06:12:37Z

For those following along: this weekend I finally managed to arrange incoming streams in a way that allows for building trees in parallel, while still being able to spit out a DAG 100% converging with go-ipfs. This, combined with ipfs/kubo#7011 will finally allow for the next stage of this endeavor: namely to be able to quantify how much pressure varying chunks sizes can exert on the bitswap subsystem, and how does this translate into end-user experience.

The tool itself should show up in its final version very early in April 🤞

Some very rough / unscientific numbers taken from a 4-core i7-4770HQ CPU @ 2.20GHz MacBook Pro (Retina, 15-inch, Mid 2015) ingesting the 5GiB test datastream

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time ../go-ipfs/cmd/ipfs/ipfs add -nq --upgrade-cidv0-in-output 
bafybeicby6ax3dbkxolfypowaizkhmbkrd4xzpkmmoez2wsf6jidso7fhi
       21.68 real        19.94 user         1.36 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--upgrade-cidv0-in-output" --no-stats --single-threaded
{ "substream":      0, "size":5368709120, "CID":"bafybeicby6ax3dbkxolfypowaizkhmbkrd4xzpkmmoez2wsf6jidso7fhi" },
       19.51 real        16.54 user         1.17 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--upgrade-cidv0-in-output" --no-stats
{ "substream":      0, "size":5368709120, "CID":"bafybeicby6ax3dbkxolfypowaizkhmbkrd4xzpkmmoez2wsf6jidso7fhi" },
        6.21 real        29.79 user         1.73 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time ../go-ipfs/cmd/ipfs/ipfs add -nq --cid-version=1 --trickle
bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe
       19.50 real        17.37 user         1.23 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats --single-threaded
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
       19.47 real        16.47 user         1.16 sys

admin:~/devel/ipfs-dagger$ zstd -qdck test/data/repeat_5GiB.zst | time bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats 
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
        6.14 real        29.20 user         1.68 sys

ribasushi · 2020-03-25T09:36:57Z

If one moves to proper OS/hardware ( linux / AMD Ryzen 7 3700X 8-Core )

root@rescue ~/DAGger # zstd -qdck test/data/repeat_5GiB.zst | time -p bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
real 2.24
user 6.58
sys 1.40

with zstandard out of the way ( it tops out at about 3GiB/s )

root@rescue ~/DAGger # cat repeat_5GiB.decompressed | time -p bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
real 2.03
user 6.40
sys 1.33

And with the pipe out of the way ( buffer max out at 16MiB on linux )

root@rescue ~/DAGger # time -p bin/ipfs-dagger --ipfs-add-command="--cid-version=1 --trickle" --no-stats < repeat_5GiB.decompressed 
{ "substream":      0, "size":5368709120, "CID":"bafybeibkk3ztvggbxaev5wcfggiphqaknwj7h656ojmhaxry4oacx6pqfe" },
real 1.36
user 6.06
sys 0.65

Not too bad 😈

dbaarda · 2020-05-29T01:41:42Z

bupsplit with 8KB would be my default 'easy' recommendation - there are multiple language implementations to verify against, it's being used by multiple people and it has a decent heritage (from rsync)

As one of the maintainers of librsync who wrote one of the early "bupsplit" implementations many people have copied, I've recently introduced a RabinKarp polyhash as the new default instead of to the "bupsplit" rollsum because my analysis showed "bupsplit" had terrible hash distribution and collisions, particularly for small block/window sizes (<16K) and ASCII data, and particularly in the low "s1" end of the digest. Using rabinkarp made calculating signatures about 10% more expensive because they are more expensive to calculate, but makes calculating deltas about 10% faster because of significantly less rollsum hash collisions, and calculating deltas is the expensive operation (for 1GB files, adds about 1sec to calculating sigs, saves more than 10secs calculating deltas).

See https://github.com/dbaarda/librsync-tests/blob/master/RESULTS.rst for details.

The poor distribution probably doesn't matter quite as much for chunking, but with the tiny windows used for chunking (32 bytes?) and checking the LSB bits of the digest, you may be hitting the degenerate worst-case behavior for bupsplit. This would probably manifest as a chunk-size distribution that varies considerably with different file types (ASCII vs random) and doesn't reflect the target average block size.

I also found cyclic-poly AKA buzhash had terrible collision rates for ASCII data... again, probably not something that matters for chunking.

See the following for details. https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst

dbaarda · 2021-02-17T01:34:42Z

FTR I've done some;more testing/analysis of chunker algorithms after reading the FastCDC paper and got some surprising results here;

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst

Also, for a definitive summary of everything related to IPFS deduplication, read this;

https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda

dbaarda · 2021-05-29T01:38:27Z

Another update; I've added "Regression Chunking" from a Microsoft paper to my chunking analysis that has an interesting technique for reducing the affects of truncation at the max_len.

I also added tests/analysis of Gear and small sliding windows used for chunking to my rollsum analysis.

safinaskar · 2023-07-27T19:49:15Z

I wrote deduplicator, which is x4 faster when storing than other solutions and x2 faster when extracting than other solutions. But my deduplicator doesn't do any CDC, it simply splits data into fixed sized chunks. Such great speed was achieved thanks to blake3 and rayon ( https://crates.io/crates/rayon ). Also I did benchmark for various deduplicators. See this thread: borgbackup/borg#7674 . Especially last comment with newest benchmark and my newest deduplicator: borgbackup/borg#7674 (comment) . Yes, I know that you are interested in CDC, but I still think you can take some ideas

dbaarda · 2023-07-28T21:54:55Z

Of course a fixed size fixed chunkier is going to be fast, and on VM images will probably even give you OK de-duplication. This is because VM images are filesystems that are arranged into fixed size blocks containing files that are mostly binaries, libraries, and compressed data. These kinds of files are either identical, or completely different, so they either have all or none of the same blocks on disk, and a fixed size chunkier can easily find them.

However, a fixed size chunker fails very badly on many other simple cases. Get a big file, add one byte at the start, and a fixed chunkier will get 0% de-duplication. Or tar all the files on those VM images so they are no longer nicely aligned into fixed-size disk blocks, get nearly 0% de-duplication, etc.

safinaskar · 2023-07-29T17:26:07Z

I added comparison between CDC-based tools: borgbackup/borg#7674 (comment) .

Note that casync, desync and borg all use buzhash. So they use exactly same algorithm (buzhash + same chunk size + zstd level 3). And yet they give very different performance. These means that some of these tools greatly under-perform. Make sure not to repeat these problems in IPFS.

Also note that borg is single-threaded and yet it beats parallel desync. This mean that desync does something horribly wrong. Make sure not to repeat this in IPFS!

Possibly whole thread may contain some important info

safinaskar · 2023-07-30T10:52:58Z

Okay, so here is list of Github issues I ~~spammed~~ wrote in last few days on this topic (i. e. fast fixed-sized and CDC-based deduplication). I hope they provide great insight to everyone interested in fast deduplicated storage.
borgbackup/borg#7674
systemd/casync#259
folbricht/desync#243
#227
dpc/rdedup#222
opencontainers/umoci#256

This was referenced Dec 4, 2019

Incorporate results of CDC parameterization research into go-ipfs ipfs/kubo#6781

Open

Collection of findings ipfs/go-ipfs-chunker#18

Closed

ribasushi mentioned this issue Dec 17, 2019

chunker: infinite loop in the Rabin chunker ipfs/boxo#353

Open

ribasushi mentioned this issue Jan 21, 2020

rabin chunker - object size limit exceeded ipfs/kubo#6841

Closed

This was referenced Jul 30, 2023

casync decompresses x1.5 faster than borg on same config (and other benchmarks) borgbackup/borg#7674

Open

Project is dead, please put that info on main page systemd/casync#259

Closed

I benchmarked desync and alternatives folbricht/desync#243

Closed

safinaskar mentioned this issue Jul 30, 2023

[rfc] OCIv2 implementation opencontainers/umoci#256

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Research/quantify performance envelopes of multiple CDC algorighms #227

Research/quantify performance envelopes of multiple CDC algorighms #227

ribasushi commented Dec 4, 2019 •

edited

Loading

DonaldTsang commented Dec 6, 2019

aidanhs commented Dec 23, 2019

DonaldTsang commented Dec 24, 2019

ribasushi commented Mar 23, 2020 •

edited

Loading

ribasushi commented Mar 25, 2020

dbaarda commented May 29, 2020 •

edited

Loading

dbaarda commented Feb 17, 2021

dbaarda commented May 29, 2021 •

edited

Loading

safinaskar commented Jul 27, 2023

dbaarda commented Jul 28, 2023

safinaskar commented Jul 29, 2023

safinaskar commented Jul 30, 2023

Research/quantify performance envelopes of multiple CDC algorighms #227

Research/quantify performance envelopes of multiple CDC algorighms #227

Comments

ribasushi commented Dec 4, 2019 • edited Loading

DonaldTsang commented Dec 6, 2019

aidanhs commented Dec 23, 2019

DonaldTsang commented Dec 24, 2019

ribasushi commented Mar 23, 2020 • edited Loading

ribasushi commented Mar 25, 2020

dbaarda commented May 29, 2020 • edited Loading

dbaarda commented Feb 17, 2021

dbaarda commented May 29, 2021 • edited Loading

safinaskar commented Jul 27, 2023

dbaarda commented Jul 28, 2023

safinaskar commented Jul 29, 2023

safinaskar commented Jul 30, 2023

ribasushi commented Dec 4, 2019 •

edited

Loading

ribasushi commented Mar 23, 2020 •

edited

Loading

dbaarda commented May 29, 2020 •

edited

Loading

dbaarda commented May 29, 2021 •

edited

Loading