borgception #474
Comments
I don't think it's a weird idea at all. I was thinking along similar lines, actually. I was initially "what iffing" the idea of storing the chunks cache inside […] I think similar fresh approaches can be used to optimise the size of other […]
How about having per-segment index chunks? Then each segment would have encrypted index update instructions, and those could be removed/regenerated as well.
A yet different method could be to just have two files per segment: one for encrypted metadata, one for the blob data it adds, in order. The whole chunk index can then be reconstructed from the metadata.
@level323 the files index does not need to be shared. It just remembers the mtime/size/inode/chunks info for all files of the last backup, so it can quickly skip those files next time. @RonnyPfannschmidt what's the point of a per-segment index if you have to merge 100,000 of them to get the full index?
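For illustration, a minimal sketch of how that skipping behaviour can be tuned on newer borg versions (1.1+), where --files-cache controls which attributes are compared; "$REPO" is a placeholder:

```bash
# Minimal sketch, assuming a borg >= 1.1 client; "$REPO" is a placeholder.
# Default comparison is ctime,size,inode; mtime can be used instead if ctime
# changes too often on the source filesystem:
borg create --files-cache=mtime,size,inode "$REPO"::home-{now} /home

# Disable the files cache entirely (every file is read and chunked again,
# so only chunk-level dedup still avoids re-uploading known data):
borg create --files-cache=disabled "$REPO"::home-{now} /home
```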
@ThomasWaldmann you only need to deal with tens of thousands on a full index regen. The multi-machine use case would be much less intensive on computation, since one only has to obtain increments.
@RonnyPfannschmidt well, a segment has 5MB, so a 500GB repo has 100,000 segments. That's just an example, but a quite realistic one; of course it can be more or less depending on how much data you have. But I still don't see how your suggestion would be efficient. Just for comparison: the normal, uncached and quite slow repo rebuild goes through ALL archives and ALL files and uses the file items' chunk lists stored in the metadata as increments (the item metadata is stored clustered together in a few segments, not together with the file content data). BTW, an incremental "just add on top what we already have" approach for the chunks cache only works as long as nothing is removed: if something is removed, the information about it is gone too, so we can't subtract (see the PRs).
If each segment also tracks the removals, then the chunks index will match the current state after applying a segment. The main problem would be to correct the reference segment of the current state on a vacuum, which is quite hard, since segments will be combined on vacuum.
It's highly unrelated and a completely different kind of data store, not suitable for borg's needs.
Out of curiosity, and with no intention whatsoever to be pressing, it would be great to hear if there is progress on this idea. If I understand correctly, the advantage of the cache is when multiple hosts are backed up to the same repo (which would help when a lot of overlap is expected). I get the impression that when a single host is backed up to a repo one can work without a cache at all, without performance penalty, using the […] At the same time I now see in the FAQ that backing up multiple hosts to the same server is discouraged (at https://borgbackup.readthedocs.io/en/stable/faq.html#can-i-backup-from-multiple-servers-into-a-single-repository). From these premises, I would like to ask whether borgception is still being pursued as a way to make the multiple-hosts-to-one-repo model workable, or is being dropped as unneeded because the single-host-to-single-repo model would not get any real benefit from it.
@callegar I currently have no near-future plans to implement this. Priority now is on getting 1.2 out (b4 released, rc1 next), then fixing any potential issues with that. After that: working on the helium milestone (crypto improvements). Maybe the crypto improvements can solve the security issue that is the reason multi-client repos are discouraged. The caching is another reason why multi-client is sub-optimal currently, especially if you have many archives in the repo (== many files in chunks.archive.d or many archives to fetch). @FabioPedretti the files cache is a purely local, per-machine thing. The multi-client cache coherency issue needing a cache resync is the chunks index (which contains information about the chunks in the repository, like presence, size, refcount).
@FabioPedretti I need to learn something, then! Can you please clarify the usage of this suffix option? From the "FILES" word I was expecting it to be related to the files cache, not to the chunks cache. My current issue is that I have some laptops that contain a subset of the stuff stored on a desktop plus a lot of common stuff. Hence, it initially appeared convenient to back them and the desktop machine all up to the same repo. However, doing so without the chunks cache has quickly grown way too time consuming, particularly for the laptops, where the speed of fetching the data needed to sync the cache is quite sensitive to wifi congestion. On the other hand, using the cache means that each client pays a storage price, and that this price grows linearly not only with the number of backups made on that client but also with the backups made by the other clients. For about 60GB of stuff being backed up from each laptop this already means ~10GB of cache on each laptop, which is clearly unsustainable. Before taking decisions such as splitting the repos, drastically reducing the number of backups being kept, or moving to a different arrangement, I am trying to understand whether I have missed some borg feature that could help, or whether waiting for borgception could make sense. Never mind: while I was writing this I got the answer from @ThomasWaldmann with the update I was looking for. Thanks!
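If the suffix option mentioned above is the BORG_FILES_CACHE_SUFFIX environment variable (an assumption, since the referenced comment is not quoted here), its use would look roughly like this; note it only affects the local files cache, not the shared chunks index this thread is about:

```bash
# Assumption: the option in question is BORG_FILES_CACHE_SUFFIX (available in
# newer borg releases). It selects a separately named local files cache per
# backup job; it does not reduce the chunks cache discussed in this ticket.
export BORG_FILES_CACHE_SUFFIX=homedirs
borg create "$REPO"::home-{now} /home

export BORG_FILES_CACHE_SUFFIX=system
borg create "$REPO"::etc-{now} /etc
```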
My cache (on the machine I am backing up) is currently 10% of the size of the repository data folder (on the server). Is any work ever going to happen on being able to do away with this, or at least improve the trade-off?
If you'd rather not have that cache, see the FAQ about how to get rid of it (search for chunks.archive.d). It is not required, but it can speed up re-syncing of the chunks index if it gets out of sync (especially if access to the repo is rather slow).
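For reference, that FAQ workaround boils down to replacing chunks.archive.d with an empty file so borg cannot repopulate it; a minimal sketch, with placeholder paths and IDs:

```bash
# Sketch of the FAQ workaround; paths and the repo ID are placeholders.
ls ~/.cache/borg/                       # one subdirectory per repository ID
REPO_ID=0123abcd                        # replace with the ID found above

rm -rf ~/.cache/borg/$REPO_ID/chunks.archive.d
touch  ~/.cache/borg/$REPO_ID/chunks.archive.d   # an empty *file* prevents re-creation

# Trade-off: future chunks index resyncs must read archive metadata from the
# repository again instead of using the local per-archive copies.
```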
Oh, I saw that, and I've just done it, after borg filled my disk with its cache :) As it seems to be a common issue, an env variable or command line option to simply disable it might be an idea, as we're getting on for a decade since noticing it was an issue? But I refuse to believe even a 50 gig repo can have >8 gig of cacheable metadata... I'm doing a prune & compact on it now, but that's taking forever... When it finishes, is it worth turning the cache back on and reporting back whether it helps keep the cache under control?
The chunks.archive.d/ cache is only relevant to chunks index resyncing - borg tells you when it does that (and it only does that if the local chunks index is out of sync with the repo, e.g. because another client has modified the repo). Good idea with the env var; maybe that could be done in borg 1.4-maint. Please open a separate ticket for that.
The issue for a simple flag to disable the cache totally is #8280. As far as I know, I'm not intending to do a chunks resync, and borg isn't telling me I am. I'm doing something like this on the client: […]
I have borgbackup-1.2.8-1.fc39.x86_64 on the (Fedora 39) server, and 1.1.18 (from pip?) on the (Amazon Linux) client. Are these both as massively outdated as "1.4-maint" makes me think they are?!?
About borg releases: 1.2.8 is the latest release from the 1.2-maint branch and quite recent. 1.4.0 is the first release of the 1.4-maint branch and is intended as a refresh/modernisation of 1.2. There never was a 1.3 release, but we used that version for some tags or alphas.
My client won't update: […]
Fedora maintainer here: Fedora rawhide provides 1.4.x. Regarding 1.4 in F40, please see my comment in the borg 1.4 release discussion.
borg 1.2 requires Python 3.8+ |
@tomchiverton please don't use this ticket for all your issues, stay on topic. |
Another client modifying the repo is still strongly discouraged in 1.4 and the forthcoming version 2, right? If so, wouldn't it be better to have this cache disabled by default and enabled on an opt-in basis for those whose workflows need frequent resyncs? Is working with multiple clients on the same repo on the drawing board for future releases (possibly with no need for that big cache)?
borg 1.4 is the same as 1.2 concerning the few pros and some cons of using a repo from multiple clients. If one does not access a repo from multiple borg clients, that cache will never be built - but if you do, it will be built (except when using that hack from the FAQ). borg2 (master branch) already has improved crypto, so at least there are no crypto issues speaking against multiple clients per repo. Cache coherency requiring a cache resync is currently still the same in master.
That seems high to me. I checked one of mine, and it is 3%. You could just delete it all and let it get recreated, and see if it is smaller.
Multiple clients writing to the same repo works fine. I use it all of the time. The cache needs to get synced, but that takes just a minute or two.
As @ThomasWaldmann just mentioned, borg already does this, only creating the cache when a cache resync is needed, which shouldn't happen if only one client is accessing the repo. If you have a cache, then a resync must have been needed. If you don't want the cache, you can just delete it.
I removed the cache folder, and it came back as soon as I did the next backup. How does the client detect that it needs one? Could I unset this somehow? I can't easily upgrade either end to see if there is a bug in 1.1.x/1.2.x, sadly.
Borg needs the cached chunks and files in .cache/borg/ID/chunks and .../files, but it should only need the chunks.archive.d folder when doing a resync. So if you delete the contents of that folder and don't need a resync, I don't think borg should put anything there. (And if you want to force it not to put anything there, you can use the trick in the FAQ. But I'm just making the point that borg's default behaviour shouldn't put files there.) If you deleted the chunks file itself, then it would have had to do a resync, which would probably then fill the chunks.archive.d folder. Just out of curiosity, after the rebuild, how big is the cache compared to the repo?
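To make that layout concrete, a rough sketch of a per-repo cache directory as described above (an assumption pieced together from this thread, not authoritative documentation; REPO_ID is a placeholder):

```bash
# Rough sketch of the per-repo cache layout discussed above; REPO_ID is a
# placeholder and the annotations are assumptions based on this thread.
ls ~/.cache/borg/$REPO_ID/
# chunks            - chunks index: per-chunk presence, size, refcount
# files             - files cache: mtime/size/inode/chunk list of the last backup
# chunks.archive.d  - per-archive chunk index copies, only read during a resync
# config, README    - identification/metadata for this cache directory
```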
The prune completed after several hours. I was also able to complete a compact using a newer borg on the server side. I've removed the […]
and the server claims […]
Note the massive drop in […]
chunks.archive.d/ size is roughly proportional to the archive count in the repo. |
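A quick way to check that proportionality on one's own setup (a sketch with placeholder names; borg list prints one line per archive):

```bash
# Quick check of "cache size ~ archive count"; REPO and REPO_ID are placeholders.
borg list "$REPO" | wc -l                          # number of archives in the repo
du -sh ~/.cache/borg/$REPO_ID/chunks.archive.d     # size of the per-archive index copies
# Dividing the two gives a rough per-archive cost; pruning archives should
# shrink this folder accordingly.
```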
Oh, that's super useful, and nothing I read about the Borg cache mentioned this enough for it to stick in my head. Worth adding to the FAQ?
I thought a bit about how to optimize the chunks cache and just wanted to document one weird idea.
The issue with the chunks cache is that it needs to match the overall repository state (== have up-to-date information about all chunks in all archives, including refcount, size, csize). When backing up multiple machines into the same repo, creating an archive from one machine invalidates all chunks caches on the other machines, and they need to resync their chunks cache with the repo, which is expensive.
So, there is the idea to also store the chunks index in the repo, so all out-of-sync clients can just fetch the index from the repo.
But:
So we need:
This pretty much sounds like we should just backup the index of repo A into a related, but separate borg repository A'. :-)
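A manual approximation of that A/A' idea, purely as a sketch (this is not built-in behaviour, and all repo names, IDs and paths below are assumptions):

```bash
# Conceptual sketch only - not something borg does today; all names below are
# placeholders. The idea: after backing up into repo A, also back up the local
# chunks index into a small separate repo A', so other clients can fetch it
# instead of doing a full resync.
REPO=ssh://backup@host/./repo-A               # data repository A
INDEX_REPO=ssh://backup@host/./repo-A-index   # "meta" repository A'
REPO_ID=0123abcd                              # local cache dir id for repo A

borg create "$INDEX_REPO"::chunks-{now} ~/.cache/borg/$REPO_ID/chunks

# An out-of-sync client would extract the newest chunks-* archive from A'
# instead of rebuilding the index from every archive in A - glossing over the
# locking and consistency questions this ticket is actually about.
```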
💰 there is a bounty for this