Measure how much the chunk cache changes in big real life repositories. #4827

Closed
textshell opened this issue Nov 4, 2019 · 4 comments · Fixed by #8332

Comments

@textshell
Member

While discussing ways to avoid cache sync on IRC, @ThomasWaldmann came up with a simple idea for using current borg to measure how much data transfer a simple chunk-based "store the cache in the repository" approach would take.

The goal is to see whether storing the chunk cache in the repository, using mostly existing code, produces results good enough to justify further investigation.

The basic idea is to take a repository with a fairly large chunk cache that is only accessed from a single location, and to use borg to back up just the chunk cache of that repository to a local dummy repository. Using --stats gives an indication of the amount of traffic that saving the cache would add, and also of how much data a different location would need to transfer to catch up to the current cache. (Catching up with multiple backups in one go might or might not need less data transfer.)

It would be interesting to see this with some different chunker parameters. I would expect a fixed-size chunking setting to be the prime target, but of course all chunker settings can be tested.
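
A minimal sketch of the measurement procedure described above, assuming borg 1.x command-line syntax (`borg init`, `borg create --stats --chunker-params`). The helper `measurement_commands`, the dummy repository path and the archive names are hypothetical, `REPOID` is left as a placeholder, and the two parameter strings are assumed to be the borg 1.x default and the attic-compatible setting; check them against your borg version. The sketch only prints the command lines, it does not run borg.

```python
from pathlib import Path

# Assumed chunker parameter strings (CHUNK_MIN_EXP, CHUNK_MAX_EXP,
# HASH_MASK_BITS, HASH_WINDOW_SIZE); verify against your borg version.
CHUNKER_PARAMS = {
    "default": "19,23,21,4095",
    "attic": "10,23,16,4095",
}


def measurement_commands(chunks_cache, dummy_repo, label, run):
    """Return the borg command lines for one measurement run.

    chunks_cache: path of the chunks cache file to back up
    dummy_repo:   local throwaway repository (hypothetical path)
    label:        key into CHUNKER_PARAMS
    run:          run number; run 1 also initialises the repository
    """
    commands = []
    if run == 1:
        commands.append(["borg", "init", "--encryption=none", str(dummy_repo)])
    commands.append([
        "borg", "create", "--stats",
        "--chunker-params", CHUNKER_PARAMS[label],
        f"{dummy_repo}::cache-{label}-{run}",
        str(chunks_cache),
    ])
    return commands


# REPOID is a placeholder; the home-directory prefix is an assumption.
cache = Path.home() / ".cache" / "borg" / "REPOID" / "chunks"
for run in (1, 2):  # run 1 before, run 2 after a real backup of the main repository
    for cmd in measurement_commands(cache, "/tmp/cache-test-attic", "attic", run):
        print(" ".join(cmd))
```

Using one dummy repository per chunker setting keeps the "All archives" totals of the different settings from mixing.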

@ThomasWaldmann
Member

Some hints:

  • one chunks cache hash table entry is:
    • chunkid (256 bit == 32 Bytes)
    • refcount, size, csize (3 * 32bit == 12 Bytes)
    • total: 44 Bytes
  • the chunks cache file is .cache/borg/REPOID/chunks
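
The entry size in the hints above can be checked with a small sketch. The field order follows the list; the little-endian byte order and the helper names `pack_entry` and `payload_bytes` are assumptions for illustration, not borg's actual code.

```python
import struct

# Assumed layout: 32-byte chunk id followed by three little-endian
# unsigned 32-bit integers (refcount, size, csize).
ENTRY = struct.Struct("<32sIII")


def pack_entry(chunk_id, refcount, size, csize):
    """Serialize one hash table entry into its 44-byte form."""
    return ENTRY.pack(chunk_id, refcount, size, csize)


def payload_bytes(n_entries):
    """Bytes occupied by n_entries entries, ignoring empty buckets and headers."""
    return n_entries * ENTRY.size


entry = pack_entry(bytes(32), 1, 4096, 2048)
print(ENTRY.size, len(entry))    # 44 44
print(payload_bytes(1_000_000))  # 44000000
```

The file on disk can be larger than this payload, since a hash table also contains unused buckets.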

@d7415
Contributor

d7415 commented Nov 5, 2019

Ok, so first this backup (the chunks file was added to an archive before and after it):

                       Original size      Compressed size    Deduplicated size
This archive:              183.34 GB            174.00 GB            295.08 MB
All archives:               18.22 TB             16.08 TB            285.53 GB

                       Unique chunks         Total chunks
Chunk index:                 5941664            315009217

default chunker params:

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              370.09 MB            272.59 MB            272.59 MB
All archives:              370.09 MB            272.59 MB            272.59 MB

                       Unique chunks         Total chunks
Chunk index:                     146                  146
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              370.09 MB            272.78 MB            272.78 MB
All archives:              740.18 MB            545.37 MB            545.37 MB

                       Unique chunks         Total chunks
Chunk index:                     292                  292
------------------------------------------------------------------------------

attic chunker params:

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              370.32 MB            272.98 MB            272.98 MB
All archives:              370.32 MB            272.98 MB            272.98 MB

                       Unique chunks         Total chunks
Chunk index:                    5456                 5456
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              370.32 MB            273.17 MB            273.17 MB
All archives:              740.64 MB            546.15 MB            546.15 MB

                       Unique chunks         Total chunks
Chunk index:                   10909                10909
------------------------------------------------------------------------------

And another..

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:                1.38 TB              1.29 TB             10.21 GB
All archives:              460.93 TB            434.37 TB              1.04 TB

                       Unique chunks         Total chunks
Chunk index:                 2515058            760671321
------------------------------------------------------------------------------

default chunker params:

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              188.82 MB            117.46 MB            117.46 MB
All archives:              188.82 MB            117.46 MB            117.46 MB

                       Unique chunks         Total chunks
Chunk index:                      68                   68
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              188.82 MB            118.06 MB            118.06 MB
All archives:              377.64 MB            235.52 MB            235.52 MB

                       Unique chunks         Total chunks
Chunk index:                     135                  135
------------------------------------------------------------------------------

attic chunker params:

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              188.94 MB            117.68 MB            117.68 MB
All archives:              188.94 MB            117.68 MB            117.68 MB

                       Unique chunks         Total chunks
Chunk index:                    2852                 2852
------------------------------------------------------------------------------

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              188.94 MB            118.27 MB            118.27 MB
All archives:              377.88 MB            235.95 MB            235.95 MB

                       Unique chunks         Total chunks
Chunk index:                    5633                 5633
------------------------------------------------------------------------------
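
To make the numbers above easier to compare, here is a short sketch that copies the second-run figures from each pair of cache archives and computes how much of the second archive was stored as new data. The labels "repo 1" and "repo 2" and the helper `new_fraction` are only illustrative names.

```python
# (label, unique chunks after run 1, unique chunks after run 2,
#  compressed MB of run 2, deduplicated MB of run 2), copied from the stats above
RUNS = [
    ("repo 1, default", 146, 292, 272.78, 272.78),
    ("repo 1, attic", 5456, 10909, 273.17, 273.17),
    ("repo 2, default", 68, 135, 118.06, 118.06),
    ("repo 2, attic", 2852, 5633, 118.27, 118.27),
]


def new_fraction(compressed_mb, deduplicated_mb):
    """Share of the second archive that had to be stored as new data."""
    return deduplicated_mb / compressed_mb


for label, unique_1, unique_2, comp, dedup in RUNS:
    added = unique_2 - unique_1
    print(f"{label}: {added} new unique chunks, "
          f"{new_fraction(comp, dedup):.0%} of the second archive stored as new data")
```

In every run the deduplicated size equals the compressed size and the total chunk count equals the unique chunk count, so no chunk of the first cache archive was reused by the second.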

I suspect this is due to what @elho suggested on IRC:

with the refcount in there, all chunks except those whose data was overwritten/deleted on the system (as opposed to e.g. only modified, which might be what one may first think of) will be incremented by at least 1 with every archive creation and so even with a 44 byte chunker

@ThomasWaldmann
Member

Yeah, if one backs up mostly the same data, all refcounts go up, spoiling all dedup...

Additional chunks are also spread randomly over the whole hashtable space, which spoils dedup too (to a lesser extent).
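
The refcount effect can be reproduced with a toy model. This is only a sketch under stated assumptions: entries are packed back to back in the 44-byte layout from the hints (no empty buckets), the table is cut into fixed 4096-byte blocks (a hypothetical block size), and every refcount goes up by one, as when mostly the same data is backed up again.

```python
import hashlib
import random
import struct

ENTRY = struct.Struct("<32sIII")  # id, refcount, size, csize (assumed layout)
BLOCK = 4096                       # hypothetical fixed chunk size


def make_entries(n, seed=0):
    """Create n entries with random ids and a refcount of 1."""
    rng = random.Random(seed)
    return [(rng.getrandbits(256).to_bytes(32, "big"), 1, 4096, 2048)
            for _ in range(n)]


def serialize(entries):
    """Interleaved layout: id rc s cs, id rc s cs, ..."""
    return b"".join(ENTRY.pack(*e) for e in entries)


def block_hashes(data, block=BLOCK):
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def changed_fraction(before, after):
    """Fraction of blocks in `after` that do not occur in `before`."""
    old = set(block_hashes(before))
    new = block_hashes(after)
    return sum(h not in old for h in new) / len(new)


entries = make_entries(10_000)
bumped = [(cid, rc + 1, size, csize) for cid, rc, size, csize in entries]
print(changed_fraction(serialize(entries), serialize(bumped)))  # 1.0
```

Because a refcount occurs every 44 bytes, every block contains at least one changed value, so no block of the old table can be reused.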

@ThomasWaldmann
Member

To have better dedup, the data in the hashtable would need to be de-mangled:

Currently: id rc s cs, id rc s cs, ...

Better: id, id, ... + rc, rc, ... + s, s, ... + cs, cs, ...

All refcounts (rc) would still change, but in quite a few cases the change would look like 1, 1, ... -> 2, 2, ... and thus be easily compressible.

The id, s and cs arrays would stay mostly stable.

IIRC, the idea of separate arrays for the hashtable was already there, but is not implemented yet...
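
A toy comparison for the separate-array layout sketched above, under the same kind of assumptions as the measurements in this thread: fixed 4096-byte blocks (a hypothetical size), no empty buckets, and all refcounts going from 1 to 2. The helper names are illustrative and this is not how borg serializes its hashtable.

```python
import hashlib
import random
import struct
import zlib

BLOCK = 4096  # hypothetical fixed chunk size


def make_entries(n, seed=0):
    """Create n entries (id, rc, s, cs) with random ids and sizes, rc = 1."""
    rng = random.Random(seed)
    return [(rng.getrandbits(256).to_bytes(32, "big"), 1,
             rng.randrange(1, 1 << 20), rng.randrange(1, 1 << 20))
            for _ in range(n)]


def columns(entries):
    """Separate arrays: id, id, ... + rc, rc, ... + s, s, ... + cs, cs, ..."""
    n = len(entries)
    return {
        "id": b"".join(e[0] for e in entries),
        "rc": struct.pack(f"<{n}I", *(e[1] for e in entries)),
        "s": struct.pack(f"<{n}I", *(e[2] for e in entries)),
        "cs": struct.pack(f"<{n}I", *(e[3] for e in entries)),
    }


def blocks(data, block=BLOCK):
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def changed_blocks(before, after):
    """Per array: (blocks of `after` not present in `before`, total blocks)."""
    result = {}
    for name in after:
        old = set(blocks(before[name]))
        new = blocks(after[name])
        result[name] = (sum(h not in old for h in new), len(new))
    return result


entries = make_entries(10_000)
bumped = [(cid, rc + 1, s, cs) for cid, rc, s, cs in entries]
old_cols, new_cols = columns(entries), columns(bumped)
print(changed_blocks(old_cols, new_cols))
print(len(new_cols["rc"]), len(zlib.compress(new_cols["rc"])))
```

In this model only the rc array produces new blocks, and that array compresses to a small fraction of its size because it repeats the same value. Newly added chunks would still disturb the id, s and cs arrays to some degree.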
