
Unnecessary chunks cache sync? #7278

Closed · jdchristensen opened this issue Jan 17, 2023 · 7 comments · Fixed by #8332

@jdchristensen (Contributor)
Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes. There's a chance this is related to #7274, but I think it is probably different.

Is this a BUG / ISSUE report or a QUESTION?

BUG

System information. For client/server mode post info for both machines.

Ubuntu 22.04

Your borg version (borg -V).

borg 1.2.0. Also happens with master branch borg2.

Operating system (distribution) and version.

Ubuntu 22.04

Hardware / network configuration, and filesystems used.

N/A

How much data is handled by borg?

N/A

Full borg commandline that led to the problem (leave away excludes and passwords)

See below.

Describe the problem you're observing.

When a host accesses a borg repo after another host has accessed it, sometimes archives have their chunk indexes fetched and rebuilt even though they should already be in the local cache. I first noticed it in #7277 (with borg2), where the line

Fetching and building archive index for host1test19

shouldn't be there, since that archive was made on the same host. Then I reproduced it with borg 1.2.0, using the steps shown below.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

borg-test-sync$ mkdir host1 host2 test-data
borg-test-sync$ echo foo > test-data/foo
borg-test-sync$ export BORG_BASE_DIR=host1
borg-test-sync$ export BORG_REPO=test.repo
borg-test-sync$ borg init -e none
borg-test-sync$ borg create -v ::${BORG_BASE_DIR}-test1 ./test-data
Creating archive at "test.repo::host1-test1"
borg-test-sync$ borg create -v ::${BORG_BASE_DIR}-test2 ./test-data
Creating archive at "test.repo::host1-test2"
borg-test-sync$ export BORG_BASE_DIR=host2
borg-test-sync$ borg create -v ::${BORG_BASE_DIR}-test3 ./test-data
Warning: Attempting to access a previously unknown unencrypted repository!
Do you want to continue? [yN] y
Creating archive at "test.repo::host2-test3"
Synchronizing chunks cache...
Archives: 2, w/ cached Idx: 0, w/ outdated Idx: 0, w/o cached Idx: 2.
Fetching and building archive index for host1-test1 ...
Merging into master chunks index ...
Fetching and building archive index for host1-test2 ...
Merging into master chunks index ...
Done.
borg-test-sync$ borg create -v ::${BORG_BASE_DIR}-test4 ./test-data
Creating archive at "test.repo::host2-test4"
borg-test-sync$ export BORG_BASE_DIR=host1
borg-test-sync$ borg create -v ::${BORG_BASE_DIR}-test5 ./test-data
Creating archive at "test.repo::host1-test5"
Synchronizing chunks cache...
Archives: 4, w/ cached Idx: 0, w/ outdated Idx: 0, w/o cached Idx: 4.  <<<<<<
Fetching and building archive index for host1-test1 ...  *************
Merging into master chunks index ...
Fetching and building archive index for host1-test2 ...  *************
Merging into master chunks index ...
Fetching and building archive index for host2-test3 ...
Merging into master chunks index ...
Fetching and building archive index for host2-test4 ...
Merging into master chunks index ...
Done.

In this case, both archives made on "host1" have their indexes fetched, even though we are currently on host1 (see the two lines marked with "*************"). Also, the line marked with "<<<<<<" shows that borg thinks nothing is cached locally. In #7277, only one archive that shouldn't need it gets synced, so it is less extreme there. Not sure what is going on.

ThomasWaldmann added this to the 1.2.4 milestone Jan 18, 2023
@jdchristensen (Contributor, Author)

What's going on is that the borg client doesn't cache per-archive chunk indexes when it creates new archives. It only caches that data when it fetches it remotely after another host has added to the repo.

I think this is correct behaviour for the commands I showed above, because host1 initially has no way to predict that the repo will be accessed by multiple hosts. So I think it's good that it didn't cache things, as the cache can get large.

But if you iterate the above commands, alternating creates between host1 and host2, then it seems odd that the clients don't add to the cached chunk indexes when they create new archives. It means that a client later fetches data that it once had locally, which is wasteful.

So this is more of a feature request than a bug: if chunks.archive.d is in use and an archive is created, add the relevant info to the cache. I'm not sure whether this is trivial, or if the data would need to be massaged into the right format.

(Also, I verified this behaviour with 1.2.3. Not sure why I was still running 1.2.0.)
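
To make the behaviour concrete, here is a minimal sketch of the sync decision, under a simplified model (my own names and pickle serialization, not borg's actual code or on-disk format). The key point: creating an archive only updates the master index and never writes into chunks.archive.d, so on the next sync even locally created archives take the expensive fetch branch.

import os
import pickle

def sync_chunks_cache(repo_archive_ids, fetch_archive_index,
                      cache_dir="chunks.archive.d"):
    # Build the master chunks index from per-archive indexes (dicts mapping
    # chunk id -> (refcount, size)).  fetch_archive_index(archive_id) is a
    # stand-in for the expensive remote fetch + index build.
    os.makedirs(cache_dir, exist_ok=True)
    master = {}
    for archive_id in repo_archive_ids:
        path = os.path.join(cache_dir, archive_id)
        if os.path.exists(path):
            # "w/ cached Idx": cheap local read
            with open(path, "rb") as f:
                per_archive = pickle.load(f)
        else:
            # "w/o cached Idx": remote round trip, then cache for next time
            per_archive = fetch_archive_index(archive_id)
            with open(path, "wb") as f:
                pickle.dump(per_archive, f)
        # merge into the master chunks index
        for chunk_id, (refcount, size) in per_archive.items():
            old_refcount = master.get(chunk_id, (0, size))[0]
            master[chunk_id] = (old_refcount + refcount, size)
    return master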

@ThomasWaldmann (Member)

Correct: a client that is alone on a repo would never need the chunks.archive.d/* single-archive chunk indexes that are used to generate the overall master chunks index.

Sorry, I initially didn't read the issue report carefully, I only had a quick glance (overlooking that you simulated 2 hosts).

What you've seen is currently expected behaviour.

@ThomasWaldmann (Member) commented Mar 5, 2023

Hmm, I am thinking about closing this as "works as expected":

  • local archive creation does not update the chunks.archive.d per-archive chunks index cache.
  • if a repo is only used by 1 client, updating it would do work that is never needed.
  • chunks.archive.d is mainly a means to accelerate resyncing an out-of-sync chunks cache (assuming a locally cached per-archive index can be processed faster than fetching that data from a remote repo and building the index in memory), which is a multi-client-per-repo thing.
  • having the master chunks index and a per-archive chunks index in memory at the same time could need 2x the memory; maybe we want to avoid this for normal archive creation (a rough estimate follows below).
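
For a rough sense of scale on that last point, a back-of-envelope estimate, assuming ~44 bytes per index entry (a 32-byte chunk id plus three 32-bit values, roughly the borg 1.2 entry layout) and ignoring hash table load factor and overhead:

ENTRY_BYTES = 44  # assumed: 32-byte chunk id + 3 x 32-bit values (refcount, size, csize)

def index_mib(num_chunks, entry_bytes=ENTRY_BYTES):
    return num_chunks * entry_bytes / 2**20

repo_chunks = 10_000_000     # chunks in the whole repo -> master index size
archive_chunks = 10_000_000  # worst case: one archive references every chunk

print(f"master index:      ~{index_mib(repo_chunks):.0f} MiB")     # ~420 MiB
print(f"per-archive index: ~{index_mib(archive_chunks):.0f} MiB")  # ~420 MiB, i.e. 2x total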

ThomasWaldmann self-assigned this Mar 5, 2023
@jdchristensen (Contributor, Author)

We could regard the existence of a non-empty chunks.archive.d as a hint that multiple clients are accessing the repo, so we may as well save the data we already have locally. About the memory use, I guess it depends on whether we need to create an entirely new hashtable, or can just compress and write out one that we already have. I don't know exactly what is stored, how it is created, or how big it is.
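
A sketch of that heuristic (hypothetical helper, not a borg API; pickle again stands in for the real index format): after creating an archive, write out its chunk index only if chunks.archive.d already has entries, i.e. only when there is evidence of other clients.

import os
import pickle

def maybe_cache_archive_index(archive_id, archive_chunk_index,
                              cache_dir="chunks.archive.d"):
    # archive_chunk_index: dict of chunk id -> (refcount, size) for the
    # chunks referenced by the just-created archive.
    if not os.path.isdir(cache_dir) or not os.listdir(cache_dir):
        return  # no sign of other clients: skip the disk cost
    with open(os.path.join(cache_dir, archive_id), "wb") as f:
        pickle.dump(archive_chunk_index, f)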

@ThomasWaldmann (Member) commented Mar 5, 2023

@jdchristensen Ah yes, a non-empty dir is a good indication.

But we do not have that separate per-archive chunks index data when borg creates an archive; it only updates the master chunks index (which contains the chunk information for all archives / for the whole repo).

Creating a one-archive chunks index would mean maintaining a 2nd HT, or computing a "diff" between the updated and the previous master chunks index.
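
The "diff" variant could look like this sketch (hypothetical, not borg code; note it requires keeping the previous master index around, which is exactly the 2x-memory concern): a chunk belongs to the new archive iff its refcount grew during the create, and the delta is its refcount within that archive.

def per_archive_index_from_diff(master_before, master_after):
    # Both arguments are master chunk indexes: dicts mapping
    # chunk id -> (refcount, size).
    per_archive = {}
    for chunk_id, (refcount, size) in master_after.items():
        old_refcount = master_before.get(chunk_id, (0, 0))[0]
        delta = refcount - old_refcount
        if delta > 0:  # this archive added 'delta' references to the chunk
            per_archive[chunk_id] = (delta, size)
    return per_archive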

@jdchristensen (Contributor, Author)

Presumably a one-archive chunks index would be a lot smaller than the full chunks index, so we aren't really talking about 2x the memory, right? And if we're only creating it to write it to disk, we could create it in compact form, paying no attention to the hash function. Is there already code that does this when a client notices that it is missing chunks index data from one or more archives?

@ThomasWaldmann (Member)

2x is the worst case for a 100% deduplication rate (so the one-archive chunks index is the same size as the all-archives chunks index).

"A lot smaller" might only happen if you have a lot of archives and a lot of changes between your archives.

Those per-archive chunk indexes are only used for the chunks index resync, nowhere else (IIRC).
