
borgception #474

Closed
ThomasWaldmann opened this issue Dec 7, 2015 · 31 comments · Fixed by #8332

@ThomasWaldmann
Member

ThomasWaldmann commented Dec 7, 2015

I thought a bit about how to optimize the chunks cache and just wanted to document one weird idea.

The issue with the chunks cache is that it needs to match the overall repository state (== have up-to-date information about all chunks in all archives, including refcount, size, csize). When backing up multiple machines into the same repo, creating an archive from one machine invalidates the chunks caches on all other machines, and they then need to resync their chunks cache with the repo, which is expensive.

So the idea is to also store the chunk index in the repo, so that out-of-sync clients can just fetch it from there.

But:

  • the index can be large (way larger than the segment size)
  • when using a raw hashtable, up to 75% of the bucket space could be unused
  • the index has additional information about the chunks, so we should not store it unencrypted if the repo is encrypted
  • the index should match the chunks in the repo, so storing it should not create its own chunks in the repo

So we need:

  • chunking of the index into smaller pieces
  • compression (the unused bucket space is mostly binary zeros, AFAIK)
  • encryption

This pretty much sounds like we should just back up the index of repo A into a related, but separate, borg repository A'. :-)
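
A minimal sketch of such a pipeline (illustrative only, not borg's actual format; PIECE_SIZE and the cipher stand-in are assumptions):

# Illustrative sketch -- split the serialized chunk index into pieces,
# compress each piece (unused hashtable buckets are mostly zero bytes,
# so they compress very well), then encrypt it before storing it as an
# object in the separate "meta" repository.
import zlib

PIECE_SIZE = 4 * 1024 * 1024  # hypothetical, chosen to stay below segment size

def encrypt(data: bytes, key: bytes) -> bytes:
    # Identity stand-in so the sketch runs; real code would use an AEAD
    # cipher (e.g. AES-GCM), keyed from the repo key material.
    return data

def index_to_pieces(serialized_index: bytes, key: bytes):
    """Yield compressed, encrypted pieces of the chunk index."""
    for off in range(0, len(serialized_index), PIECE_SIZE):
        piece = serialized_index[off:off + PIECE_SIZE]
        yield encrypt(zlib.compress(piece), key)

For a feel of the compression win: zlib.compress(b"\x00" * PIECE_SIZE) shrinks 4 MiB of empty buckets down to a few KiB.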


💰 there is a bounty for this

@ThomasWaldmann ThomasWaldmann changed the title borg inception borgception Dec 7, 2015
@level323

level323 commented Dec 7, 2015

I don't think it's a weird idea at all.

I was thinking along similar lines, actually.

I was initially "what iffing" the idea of storing the chunks cache inside the main repo, but then saw benefits in storing it in a special dedicated repo - because this opens the possibility of exploiting the known structure of the chunks cache to minimise the size of its repo (and minimise transfer time over slow links).

I think similar fresh approaches can be used to optimise the size of other objects, such as the archive metadata and the files cache, to optimise synchronisation of this data between machines in a multi-machine-backing-up-to-central-repo use case.


@RonnyPfannschmidt
Contributor

How about having per-segment index chunks - then each segment would have encrypted index update instructions, and those could be removed/regenerated as well.

@RonnyPfannschmidt
Contributor

A yet different method could be to just have 2 files per segment: one for encrypted metadata, one for the blob data it adds, in order.

The whole chunk index can then be reconstructed from the metadata.
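
A sketch of that reconstruction, under assumed semantics (each segment's metadata blob lists the chunk IDs it added, in segment order; all names are hypothetical):

# Fold over all per-segment metadata blobs, in order, to rebuild the
# full chunk index with refcounts.
from collections import Counter

def rebuild_chunk_index(segment_metadata_blobs):
    """segment_metadata_blobs: iterable of per-segment chunk-ID lists."""
    refcounts = Counter()
    for added_chunk_ids in segment_metadata_blobs:
        refcounts.update(added_chunk_ids)
    return refcounts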

@ThomasWaldmann
Member Author

@level323 The files index does not need to be shared. It just remembers the mtime/size/inode/chunks info for all files of the last backup, so it can quickly skip those files next time.
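
A minimal sketch of that skip logic, with an assumed cache entry layout of (mtime_ns, size, inode, chunk_ids):

# Sketch only; borg's real files cache layout differs in detail.
import os

def cached_chunks_if_unchanged(path, files_cache):
    """Return the cached chunk list if the file looks unchanged, else None."""
    entry = files_cache.get(path)  # assumed: (mtime_ns, size, inode, chunk_ids)
    if entry is None:
        return None
    mtime_ns, size, inode, chunk_ids = entry
    st = os.stat(path, follow_symlinks=False)
    if (st.st_mtime_ns, st.st_size, st.st_ino) == (mtime_ns, size, inode):
        return chunk_ids  # unchanged: reuse the chunk list, skip re-reading
    return None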

@RonnyPfannschmidt What's the point of a per-segment index if you have to merge 100,000 of them to get the full index?

@RonnyPfannschmidt
Contributor

@ThomasWaldmann You only need to deal with tens of thousands of them on a full index regen.

The multi-machine use case would be much less computationally intensive, since one only has to obtain the increments.

@ThomasWaldmann
Member Author

@RonnyPfannschmidt Well, a segment has 5MB, so a 500GB repo has 100,000 segments. That's just an example, but a quite realistic one. Of course, it can be more or less, depending on how much data you have.

But I still don't see how your suggestion would be efficient. Just for comparison: the normal, uncached and quite slow repo rebuild goes through ALL archives and ALL files, and uses each file item's chunk list stored in the metadata as the increment. (The item metadata is stored clustered together in a few segments, not together with the file content data.)

BTW, an incremental "just add on top of what we already have" approach for the chunks cache only works as long as nothing is removed. Because if something is removed, the information about it is also gone, so we can't subtract. (see the PRs)
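
A toy illustration of the subtraction problem (hypothetical chunk IDs):

# Add-only increments work until something is deleted.
from collections import Counter

repo_index = Counter()
archive_a = ["c1", "c2"]      # chunk IDs referenced by archive A
archive_b = ["c2", "c3"]

repo_index.update(archive_a)  # incremental "add on top": fine
repo_index.update(archive_b)  # fine; c2 now has refcount 2

# Deleting archive A requires subtracting its chunk list. If that list
# is already gone along with the archive, the refcounts cannot be fixed
# up incrementally and the index must be rebuilt from scratch.
repo_index.subtract(archive_a)  # only possible while A's list still exists
repo_index = +repo_index        # drop entries whose count fell to zero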

@RonnyPfannschmidt
Contributor

If each segment also tracks the removals, then the chunks index will match the current state after applying a segment. The main problem would be correcting the reference segment of the current state on a vacuum - which is quite hard, since segments are combined on vacuum.
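
A sketch of that incremental catch-up, assuming each segment's metadata records both adds and removes (hypothetical structure):

# An out-of-sync client applies only the segments written since its
# last known segment, instead of rebuilding the whole index.
def catch_up(refcounts, new_segments):
    """new_segments: iterable of (added_ids, removed_ids), in segment order."""
    for added, removed in new_segments:
        for cid in added:
            refcounts[cid] = refcounts.get(cid, 0) + 1
        for cid in removed:
            refcounts[cid] -= 1
            if refcounts[cid] == 0:
                del refcounts[cid]
    # Caveat from above: a vacuum combines/renumbers segments, so the
    # client's "last known segment" may no longer be a valid resume point.
    return refcounts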

@pszxzsd

pszxzsd commented Dec 9, 2015

No idea if it's applicable to the problem, but ZeroDB (docs), an end-to-end encrypted database, recently went open source as a Python implementation.

@RonnyPfannschmidt
Contributor

It's highly unrelated, and a completely different kind of data store that is not suitable for borg's needs.

@callegar

callegar commented Jan 31, 2022

Out of curiosity, and with no intention whatsoever to be pressing, it would be great to hear if there is progress on this idea.

If I understand correctly, the advantage of the cache comes when multiple hosts are backed up to the same repo (which would help when a lot of overlap is expected). I get the impression that when a single host is backed up to a repo, one can work without a cache at all, with no performance penalty, using the rm -rf chunks.archive.d ; touch chunks.archive.d trick. Is this correct?

At the same time, I now see in the FAQ that backing up multiple hosts to the same server is discouraged (see https://borgbackup.readthedocs.io/en/stable/faq.html#can-i-backup-from-multiple-servers-into-a-single-repository).

From these premises, I would like to ask whether borgception is still being pursued as a way to make the multiple-hosts-to-one-repo model workable, or whether it is being dropped as unneeded, since the single-host-to-single-repo model would not get any real benefit from it.

@FabioPedretti
Contributor

I think this was somewhat addressed with BORG_FILES_CACHE_SUFFIX, see #5433 and #5602.

@ThomasWaldmann
Member Author

ThomasWaldmann commented Jan 31, 2022

@callegar I currently have no near-future plans to implement this.

Priority now is on getting 1.2 out (beta 4 released, rc1 next), then fixing any potential issues with that.

After that: working on the helium milestone (crypto improvements). Maybe the crypto improvements can solve the security issue that is one reason why multi-client repos are discouraged.

Caching is another reason why multi-client is currently sub-optimal, especially if you have many archives in the repo (== many files in chunks.archive.d or many archives to fetch).

@FabioPedretti The files cache is a purely local, per-machine thing. The multi-client cache coherency issue requiring a cache resync concerns the chunks index (which contains information about the chunks in the repository, like presence, size, refcount).

@callegar

callegar commented Jan 31, 2022

@FabioPedretti I need to learn something, then! Can you please clarify the usage of this suffix option? From the word "FILES" I was expecting it to be related to the files cache, not to the chunks cache.

My current issue is that I have some laptops that contain a subset of the stuff stored on a desktop, plus a lot of common stuff. Hence, it initially appeared convenient to back them and the desktop machine all up to the same repo. However, doing so without the chunks cache has quickly grown way too time consuming, particularly for the laptops, where the speed of fetching the data needed to sync the cache is quite sensitive to wifi congestion. On the other hand, using the cache means that each client pays a storage price, and that this price grows linearly not only with the number of backups made on that client, but also with the backups made by the other clients. For about 60GB of stuff being backed up from each laptop, this already means ~10GB of cache on each laptop, which is clearly unsustainable.

Before taking decisions such as splitting the repos, drastically reducing the number of backups being kept, or moving to a different arrangement, I am trying to understand whether I have missed some borg feature that could help, or whether waiting for borgception could make sense.

Never mind - while I was writing this, I got the answer from @ThomasWaldmann with the update I was looking for. Thanks!

@tomchiverton

My cache (on the machine I am backing up) is currently 10% of the size of the repository data folder (on the server).

Is any work ever going to happen on being able to do away with this, or at least improve the trade-off?

@ThomasWaldmann
Member Author

If you'd rather not have that cache, see the FAQ for how to get rid of it (search for chunks.archive.d).

It is not required, but it can speed up re-syncing of the chunks index if it gets out of sync (especially if access to the repo is rather slow).

@tomchiverton

Oh, I saw that, and I've just done it, after borg filled my disk with its cache :) As it seems to be a common issue, an env variable or command line option to simply disable it might be an idea, as we're getting on for a decade since this was noticed as an issue?

But I refuse to believe even a 50 GB repo can have >8 GB of cacheable metadata...

I'm doing a prune & compact on it now, but that's taking forever... when it's done, is it worth turning the cache back on and reporting back on whether it helps keep the cache under control?

@ThomasWaldmann
Member Author

The chunks.archive.d/ cache is only relevant to chunks index resyncing - borg tells you when it does that (and it only does that if the local chunks index is out of sync with the repo, e.g. because another client has modified the repo).

Good idea with the env var; maybe that could be done in borg 1.4-maint. Please open a separate ticket for that.

@tomchiverton

The issue for a simple flag to totally disable the cache is #8280.

As far as I know, I'm not doing a chunks resync, and borg isn't telling me I am. I'm doing something like this on the client:

export BORG_REPO=ssh://[email protected]/~/borg
export PATH=$PATH:/usr/local/bin

borg create --compression auto,zstd \
        $BORG_REPO::{hostname}-{user}-{utcnow} \
        /etc/ /usr/share/wordpress/wp-content/ \
        /var/lib/mailman/archives/ \
        /home/ec2-user /root \
        /mnt/data/home/ /mnt/data/var/spool/mail 

I have borgbackup-1.2.8-1.fc39.x86_64 on the (Fedora 39) server, and 1.1.18 (from pip?) on the (Amazon Linux) client.

Are these both as massively outdated as "1.4-maint" makes me think they are?!

@ThomasWaldmann
Member Author

About borg releases:

1.2.8 is the latest release from the 1.2-maint branch and quite recent.

1.4.0 is the first release of the 1.4-maint branch, intended as a refresh/modernisation of 1.2.

There never was a 1.3 release, but we used that version for some tags or alphas.

@tomchiverton

My client won't update:

[root@cloud ~]# ls -laht /usr/local/bin/borg 
lrwxrwxrwx 1 root root 23 Aug  9  2023 /usr/local/bin/borg -> /root/borg-env/bin/borg
[root@cloud ~]# source borg-env/bin/activate
(borg-env) [root@cloud ~]# pip install -U borgbackup 
Requirement already satisfied: borgbackup in ./borg-env/lib64/python3.7/site-packages (1.1.18)
Requirement already satisfied: packaging in ./borg-env/lib/python3.7/site-packages (from borgbackup) (23.1)

[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip

@FelixSchwarz
Contributor

Are these both as massively outdated as "1.4-maint" makes me think they are?!

Fedora maintainer here: Fedora rawhide provides 1.4.x. Regarding 1.4 in F40, please see my comment in the borg 1.4 release discussion.

@FelixSchwarz
Contributor

My client won't update:

borg 1.2 requires Python 3.8+ (the pip output above shows the client's virtualenv is on Python 3.7).

@ThomasWaldmann
Member Author

@tomchiverton Please don't use this ticket for all your issues; stay on topic.

@callegar

The chunks.archive.d/ cache is only relevant to chunks index resyncing - borg tells you when it does that (and it only does that if the local chunks index is out of sync with the repo, e.g. because another client has modified the repo).

Another client modifying the repo is still strongly discouraged in 1.4 and the forthcoming version 2, right?

If so, wouldn't it be better to have this cache disabled by default, and enabled on an opt-in basis for those whose workflows need frequent resyncs?

Is working with multiple clients on the same repo on the drawing board for future releases (possibly with no need for that big cache)?

@ThomasWaldmann
Member Author

borg 1.4 is the same as 1.2 concerning the few pros and some cons of using a repo from multiple clients.

If one does not access a repo from multiple borg clients, that cache will never be built - but if you do, it will be built (except when using that hack from the FAQ).

borg2 (master branch) already has improved crypto, so at least there are no crypto issues speaking against multiple clients per repo. Cache coherency requiring a cache resync is currently still the same in master.

@jdchristensen
Contributor

My cache (on the machine I am backing up) is currently 10% of the size of the repository data folder (on the server).

That seems high to me. I checked one of mine, and it is 3%. You could just delete it all and let it get recreated, and see if it is smaller.

Another client modifying the repo is still strongly discouraged in 1.4 and the forthcoming version 2, right?

Multiple clients writing to the same repo works fine. I use it all of the time. The cache needs to get synced, but that takes just a minute or two.

If so, wouldn't it be better to have this cache disabled by default, and enabled on an opt-in basis for those whose workflows need frequent resyncs?

As @ThomasWaldmann just mentioned, borg already does this, only creating the cache when a cache resync is needed, which shouldn't happen if only one client is accessing the repo. If you have a cache, then a resync must have been needed. If you don't want the cache, you can just delete it.

@tomchiverton

I removed the cache folder, but it came back as soon as I did the next backup.
There is only a single client backing up to this $BORG_REPO.

How does the client detect that it needs one? Could I unset this somehow?

I can't easily upgrade either end to see if there is a bug in 1.1.x/1.2.x, sadly.

@jdchristensen
Contributor

jdchristensen commented Jul 11, 2024

I removed the cache folder, but it came back as soon as I did the next backup. There is only a single client backing up to this $BORG_REPO.

Borg needs the cached chunks and files in .cache/borg/ID/chunks and .../files. But it should only need the chunks.archive.d folder when doing a resync. So if you delete the contents of that folder, and don't need a resync, I don't think borg should put anything there. (And if you want to force it not to put anything there, you can use the trick in the FAQ. But I'm just making the point that borg's default behaviour shouldn't put files there.)

If you deleted the chunks file itself, then it would have had to do a resync, which would probably then fill the chunks.archive.d folder.

Just out of curiosity, after the rebuild, how big is the cache compared to the repo?

@tomchiverton

Just out of curiosity, after the rebuild, how big is the cache compared to the repo?

The prune completed after several hours.

I was also able to complete a compact using a newer borg on the server side.

I've removed the chunks.archive.d file, and my backup command above is now both faster and not generating a ballooning cache size:

19M     .cache/borg/5cdf44e2224351ae834a0e908a683ba859781b393e3dcf58d43f465653b9296c/chunks
4.0K    .cache/borg/5cdf44e2224351ae834a0e908a683ba859781b393e3dcf58d43f465653b9296c/config
9.6M    .cache/borg/5cdf44e2224351ae834a0e908a683ba859781b393e3dcf58d43f465653b9296c/files
4.0K    .cache/borg/5cdf44e2224351ae834a0e908a683ba859781b393e3dcf58d43f465653b9296c/README

and the server claims

4.5K    /exports/cloud-backup/borg/config
18G     /exports/cloud-backup/borg/data
29K     /exports/cloud-backup/borg/hints.10250
12M     /exports/cloud-backup/borg/index.10250
4.5K    /exports/cloud-backup/borg/integrity.10250
512     /exports/cloud-backup/borg/README

Note the massive drop in data size from ~55 gig to 18.

@ThomasWaldmann
Member Author

chunks.archive.d/ size is roughly proportional to the archive count in the repo (it holds one cached chunk index per archive).
So if you pruned a lot of archives, that cache will need less space.
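
As a rough size model (assumptions: one cached per-archive chunk index per archive, at a hypothetical 5 MiB each), the archive count from the prune log below already explains a multi-GiB cache:

# Back-of-the-envelope estimate; the per-archive index size is an assumption.
archives = 1912               # archive count, from the prune output below
per_archive_index_mib = 5     # hypothetical average size of one cached index
print(archives * per_archive_index_mib / 1024, "GiB")  # ~9.3 GiB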

@tomchiverton

chunks.archive.d/ size is roughly proportional to the archive count

Oh, that's super useful, and nothing I read about the borg cache mentioned this clearly enough for it to stick in my head. Worth adding to the FAQ?

Pruning archive: cloud-root-2023-08-09T17:42:01       Wed, 2023-08-09 17:42:04 [5fcef2834984869f0f46759345be18829c5300afdee027d9f192f1af05da980d] (1912/1912)
Remote: compaction freed about 39.08 GB repository space.
