
FileExistsError/FileNotFoundError when locking #5828

Closed
pgerber opened this issue Jun 9, 2021 · 9 comments

@pgerber
Contributor

pgerber commented Jun 9, 2021

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

bug

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

$ borg -V
borg 1.1.16

Operating system (distribution) and version.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

Hardware / network configuration, and filesystems used.

BTRFS on target host

Describe the problem you're observing.

I've been seeing locking-related errors for some time; see the backtraces below. So far, I've only seen this issue on one out of two servers. The only difference I'm aware of is that the affected machine uses BTRFS as its filesystem.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

No, but the issue reappears sporadically.

Include any warning/errors/backtraces from the system logs

Traceback (most recent call last):
  File "borg/locking.py", line 130, in acquire
FileExistsError: [Errno 17] File exists: '/backup/borg_repos/backup01/lock.exclusive'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "borg/remote.py", line 247, in serve
  File "borg/remote.py", line 375, in open
  File "borg/repository.py", line 190, in __enter__
  File "borg/repository.py", line 421, in open
  File "borg/locking.py", line 350, in acquire
  File "borg/locking.py", line 363, in _wait_for_readers_finishing
  File "borg/locking.py", line 134, in acquire
  File "borg/locking.py", line 159, in kill_stale_lock
FileNotFoundError: [Errno 2] No such file or directory: '/backup/borg_repos/backup01/lock.exclusive'

Borg server: Platform: Linux backup02 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Borg server: Linux: debian bullseye/sid 
Borg server: Borg: 1.1.16  Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
Borg server: PID: 1808518  CWD: /backup/borg_repos/backup01
Borg server: sys.argv: ['borg', 'serve', '--restrict-to-path', '/backup/borg_repos/backup01']
Borg server: SSH_ORIGINAL_COMMAND: 'borg serve --umask=077 --error'
Platform: Linux backup01 5.4.0-73-generic #82-Ubuntu SMP Wed Apr 14 17:39:42 UTC 2021 x86_64
Linux: debian bullseye/sid 
Borg: 1.1.16  Python: CPython 3.7.9 msgpack: 0.5.6.+borg1
PID: 67423  CWD: /root
sys.argv: ['borg', 'create', '--error', '--compression', 'zstd', '--lock-wait', '43200', '--noatime', '--patterns-from', '/etc/borg/patterns-main', '--', '[email protected]:~::main-{now:%Y-%m-%dT%H:%M:%S}']
SSH_ORIGINAL_COMMAND: None
@ThomasWaldmann
Member

Do you have scripts running that use borg break-lock?

Do you have multiple borg client processes running against that one repo? At the same time?

@ThomasWaldmann
Member

Analyzing this:

  • locking:130: the exclusive lock directory (ELD) already exists when we try to create it
  • locking:134: we only get here if it is NOT our own lock; we now try to kill stale locks
  • locking:159: the ELD is gone when we try to look into it

So, as the ELD is gone at locking:159, something must have removed it (which is why I was asking about break-lock and other borg processes that may have raced to delete a stale lock).
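
A minimal sketch of how that race plays out, assuming a simplified mkdir-based exclusive lock (illustrative only, not borg's actual locking code; the in-handler rmdir simulates the competing process):

import os
import tempfile

# Stand-in for the exclusive lock directory (ELD) of a repo.
lock_dir = os.path.join(tempfile.mkdtemp(), "lock.exclusive")
os.mkdir(lock_dir)  # pretend another borg process currently holds the lock

try:
    # like locking:130 -- creating the ELD fails because it already exists
    os.mkdir(lock_dir)
except FileExistsError:
    # like locking:134 -- it is not our own lock, so we go to kill stale locks.
    # Simulate the race: some other process removes the ELD right now.
    os.rmdir(lock_dir)
    # like locking:159 -- looking into the now-missing ELD raises
    # FileNotFoundError, giving the chained traceback from the report.
    os.listdir(lock_dir)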

@ThomasWaldmann
Member

Hmm, no matter how this happened, I think you have found a bug here.

borg should not crash when trying to look into an exclusive lock directory that no longer exists.
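
A sketch of the kind of defensive handling that implies, assuming a listdir-based stale-lock check (my illustration, not the actual patch):

import os

def kill_stale_lock(lock_path):
    # Sketch only: if the exclusive lock directory vanished while we were
    # about to inspect it, another process already cleaned it up, so treat
    # that as handled instead of crashing with FileNotFoundError.
    try:
        entries = os.listdir(lock_path)
    except FileNotFoundError:
        return True  # already gone, nothing stale left to kill
    # ... otherwise inspect `entries` and remove locks held by dead processes ...
    return not entries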

ThomasWaldmann added this to the 1.1.17 milestone Jun 9, 2021
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jun 9, 2021
@ThomasWaldmann
Member

Note: the master branch locking code recently got improvements/fixes by @alfredo08154711 and it already handles this case.

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Jun 9, 2021
@alfredo08154711
Contributor

alfredo08154711 commented Jun 9, 2021 via email

@ThomasWaldmann
Member

ThomasWaldmann commented Jun 9, 2021

hi @alfredo08154711 - no action is strictly required, but maybe you want to have a look at the PR.

yes, your fixes are only in the master branch (and thus in borg 1.2).

i first wanted to have that tested there before doing big changes to 1.1 locking. not sure how much testing the 1.2 betas have gotten so far - a lot of people might hold back and use only a stable release rather than additionally testing a beta.

about the with-lock --shared PR: will try to review it, but I'd also like other users/developers to review it and give feedback.

do you mean #4120?

@alfredo08154711
Contributor

alfredo08154711 commented Jun 9, 2021 via email

ThomasWaldmann added a commit that referenced this issue Jun 9, 2021
handle crash due to kill stale lock race, fixes #5828
@pgerber
Contributor Author

pgerber commented Jun 10, 2021

Do you have scripts running that use borg break-lock?

No, break-lock is not used.

Do you have multiple borg client processes running against that one repo? At the same time?

Yes.

According to the crontab file, two jobs are actually started simultaneously:

# create backup "main" at [email protected]:~
30 21 * * * root  borg create --error --compression zstd --lock-wait 43200 --noatime --patterns-from /etc/borg/patterns-main -- '[email protected]:~'::main-{now:\%Y-\%m-\%dT\%H:\%M:\%S}

# prune backup "main" at [email protected]:~
30 20 * * * root  borg prune '[email protected]:~' --lock-wait 43200 --keep-within 7d --keep-daily 7 --keep-weekly 4 --prefix main-

# create backup "postgres" at [email protected]:~
02 22 * * * root  borg create --error --compression zstd --lock-wait 43200 --noatime --patterns-from /etc/borg/patterns-postgres -- '[email protected]:~'::postgres-{now:\%Y-\%m-\%dT\%H:\%M:\%S}

# prune backup "postgres" at [email protected]:~
30 20 * * 1-5 root  borg prune '[email protected]:~' --lock-wait 43200 --keep-daily 7 --keep-weekly 4 --prefix postgres-

# check repo at [email protected]:~
00 23 * * 0 root  borg check --lock-wait 43200 '[email protected]:~' >/dev/null 2>&1

# intended for use by monitoring and health checks
0 12 * * * root  borg info --glob '*' --lock-wait 43200 --last 300 --json '[email protected]:~' >/var/lib/tocco-borg/info-1.tmp && mv /var/lib/tocco-borg/info-1.tmp /var/lib/tocco-borg/info-1

I guess that could explain why it only happens with one server. I can't see any other simultaneous executions of borg.

Note: master branch locking code did get recently get improvements/fixes by @alfredo08154711 and it already handles this case.

I noticed the errors a while back and saw that a refactoring of the locking mechanism was ongoing. So, I refrained from reporting the issue under the assumption that it would likely be fixed by that change. I assumed the change went into 1.1.16, but I guess I was wrong about that.

I'm thinking I should just update the one server to borg 1.2 and see if that helps. Considering that only the secondary backup server is affected, the risk should be manageable.

@ThomasWaldmann
Copy link
Member

OK, 1.1 was fixed recently, master has been fixed for a while already, all done here.

@pgerber 1.2 is still in beta, be careful. testing is very welcome of course.
