Measure how much the chunk cache changes in big real life repositories. #4827
Some hints:
Ok, so first this backup (chunks added to an archive before and after):
default chunker params:
attic chunker params:
And another..
default:
attic default:
I suspect this is due to what @elho suggested on IRC.

Yeah, if one backs up mostly the same data, all refcounts go up, spoiling all dedup... Additional chunks also get spread randomly over the whole hashtable space, spoiling dedup as well (to a lesser extent).
To have better dedup, the data in the hashtable would need to be de-mangled.

Currently: id rc s cs, id rc s cs, ...
Better: id, id, ... + rc, rc, ... + s, s, ... + cs, cs, ...

All refcounts (rc) would still change, but in quite some cases only like 1, 1, ... -> 2, 2, ... and thus be easily compressible. The id, s and cs arrays would stay mostly stable. IIRC, the idea of separate arrays for the hashtable was already there, but it is not implemented yet...
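To make the layout point concrete, here is a small Python sketch (not borg code; the entry layout, field widths and the fixed-size "chunking" below are simplifying assumptions purely for illustration). It serializes the same table once interleaved (array of structs) and once as separate arrays (struct of arrays), bumps every refcount as a second backup of mostly-unchanged data would, and counts how many blocks differ in each layout:

```python
import struct

# fake chunk index entries: (id, refcount, size, csize) -- widths/values are made up
entries = [(i.to_bytes(32, "little"), 1, 4096, 2048) for i in range(1000)]
bumped  = [(cid, rc + 1, s, cs) for (cid, rc, s, cs) in entries]  # same data, refcounts +1

def aos(rows):
    # "array of structs": id rc s cs, id rc s cs, ...
    return b"".join(cid + struct.pack("<III", rc, s, cs) for (cid, rc, s, cs) in rows)

def soa(rows):
    # "struct of arrays": id, id, ... + rc, rc, ... + s, s, ... + cs, cs, ...
    ids, rcs, ss, css = zip(*rows)
    return [b"".join(ids),
            struct.pack(f"<{len(rcs)}I", *rcs),
            struct.pack(f"<{len(ss)}I", *ss),
            struct.pack(f"<{len(css)}I", *css)]

def changed_blocks(old, new, block=4096):
    # crude stand-in for chunking: split into fixed-size blocks, count new ones
    split = lambda blob: {blob[o:o + block] for o in range(0, len(blob), block)}
    return len(split(new) - split(old))

print("AoS blocks changed:", changed_blocks(aos(entries), aos(bumped)))
print("SoA blocks changed:", sum(changed_blocks(o, n) for o, n in zip(soa(entries), soa(bumped))))
```

With the interleaved layout every block contains some refcounts and therefore changes; with separate arrays only the refcount array changes, so almost all of the serialized cache dedups against the previous snapshot.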
While discussing ways to avoid cache sync on IRC, @ThomasWaldmann came up with a simple idea for using current borg to measure how much transfer a simple chunk-based "store the cache into the repository" approach would take.

The basic idea is to see whether mostly existing code can store the chunk cache into the repository with good enough results to justify further investigation.
The procedure: take a repository with a fairly large chunk cache that is only accessed from a single location, and use borg to back up just that repository's chunk cache to a local dummy repository. Using --stats gives an indication of how much traffic saving the cache would add, and also how much data a different location would need to transfer to catch up to the current cache (catching up with multiple backups in one go might or might not need less data transfer). A rough sketch of the commands is shown below.
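A minimal sketch of such a measurement run, assuming the chunk cache of the big repository lives at the default location ~/.cache/borg/<repo_id>/chunks (the <repo_id> placeholder and the paths are assumptions to adjust to your setup):

```console
# local dummy repository that only receives snapshots of the chunk cache
$ borg init --encryption=none /tmp/cache-measure

# after each real backup of the big repository, snapshot its chunks cache;
# the --stats output (deduplicated size of this archive) approximates the
# extra transfer that storing the cache in the repository would cause
$ borg create --stats /tmp/cache-measure::cache-{now} ~/.cache/borg/<repo_id>/chunks
```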
It would be interesting to see this with some different chunker parameters. I would expect a fixed-size chunking setting to be the prime target, but of course all chunker settings can be tested, for example as sketched below.
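For instance, separate dummy repositories (so the runs do not dedup against each other) could be used to compare chunker settings; 19,23,21,4095 are borg's buzhash defaults, 10,23,16,4095 are the old attic defaults, and the fixed-size chunker is only an option if the borg version in use already supports it:

```console
$ borg init --encryption=none /tmp/cache-measure-default
$ borg create --stats --chunker-params 19,23,21,4095 \
      /tmp/cache-measure-default::cache-{now} ~/.cache/borg/<repo_id>/chunks

$ borg init --encryption=none /tmp/cache-measure-attic
$ borg create --stats --chunker-params 10,23,16,4095 \
      /tmp/cache-measure-attic::cache-{now} ~/.cache/borg/<repo_id>/chunks

# only with a borg version that has the fixed-size chunker:
$ borg init --encryption=none /tmp/cache-measure-fixed
$ borg create --stats --chunker-params fixed,4194304 \
      /tmp/cache-measure-fixed::cache-{now} ~/.cache/borg/<repo_id>/chunks
```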