
Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors #1142

Open
kortschak opened this issue Mar 23, 2021 · 19 comments
Labels
Bug Notifications desktop, email, GUI/tray status messages User-Callback

Comments

@kortschak
Contributor

kortschak commented Mar 23, 2021

I have a Back in Time profile that stores to a NAS that is powered down for most of the time. The NAS is set to wake with a WOL packet that is sent by a user-callback which waits after the WOL packet has been sent then returns a success status if the NAS comes up within the timeout. This all works very nicely since #654 was fixed.
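For context, the wake-on-LAN step such a user-callback performs can be sketched in a few lines of Python. This is a hypothetical stand-in for the Go script discussed below, not its actual code; the MAC and broadcast address are placeholders:

```python
import socket

def build_wol_packet(mac: str) -> bytes:
    """A WOL magic packet: 6 bytes of 0xFF followed by the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    return b"\xff" * 6 + mac_bytes * 16

def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet; the NAS's NIC listens for it even when powered down."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(build_wol_packet(mac), (broadcast, port))
```

After sending the packet, the callback still has to poll until the NAS is actually reachable before returning success, which is where the race discussed in this issue comes from.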

However, the backup to the NAS is scheduled for the middle of the night and often in the morning I see a notification like so

[screenshot of the "Can't find snapshots folder" desktop notification]

despite the backup having succeeded. This is due to this code:

```python
if self.config.PLUGIN_MANAGER.hasGuiPlugins and self.config.notify():
    self.setTakeSnapshotMessage(1,
        _('Can\'t find snapshots folder.\nIf it is on a removable drive please plug it.') +
        '\n' +
        gettext.ngettext('Waiting %s second.', 'Waiting %s seconds.', 30) % 30,
        30)
```
which makes sense in the context of attended backups, but not in the context of unattended scheduled backups, where there is no user to follow the instructions. In the latter context it leaves doubt in the mind of the user as to whether the backup succeeded. The issue appears to arise from a race between backintime and the system mounting the remote, since the user-callback waits until the NAS is up before it returns. backintime tests for the presence of the remote mount once each second, but first notifies the user that the remote mount is not present.

In looking into this issue, I found that backintime also logs as an ERROR to syslog for checks where the remote mount is not present (actually it checks whether it is not a directory):

backintime/common/config.py

Lines 1356 to 1360 in 9310acc

```python
if not os.path.isdir(self.snapshotsFullPath(profile_id)):
    logger.error("%s does not exist"
                 %self.snapshotsFullPath(profile_id),
                 self)
    return False
```
This results in misleading ERROR log lines in syslog.

Mar 22 01:00:02 client backintime (user/2): INFO: Lock
Mar 22 01:01:52 client backintime (user/2): INFO: user-callback returned 'user-callback: 2021/03/22 01:00:20 sending wake packet#012user-callback: 2021/03/22 01:01:52 server ready'
Mar 22 01:01:53 client backintime (user/2): INFO: mount ssh: [email protected]:/data/backups on /home/user/.local/share/backintime/mnt/XXXXXXXX/mountpoint
Mar 22 01:01:53 client backintime (user/2): INFO: Take a new snapshot. Profile: 2 remote
Mar 22 01:01:53 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:54 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:57 client backintime (user/2): INFO: Call rsync to take the snapshot
Mar 22 02:23:40 client backintime (user/2): WARNING: Command "rsync --recursive ...
Mar 22 02:23:40 client backintime (user/2): WARNING: CONTINUE: --exclude...
Mar 22 02:23:40 client backintime (user/2): INFO: Save config file
Mar 22 02:23:40 client backintime (user/2): INFO: Save permissions
Mar 22 02:26:27 client backintime (user/2): INFO: Create info file
Mar 22 02:26:31 client backintime (user/2): INFO: [smart remove] remove snapshots: [20210315-010002-225]
Mar 22 02:46:50 client backintime (user/2): INFO: unmount ssh: [email protected]:/data/backups from /home/user/.local/share/backintime/mnt/XXXXXXXX/mountpoint
Mar 22 02:46:50 client backintime (user/2): INFO: Unlock

ISTM that the check here should be a silent check for whether the path exists (if it still doesn't exist after the 30 seconds of checking has elapsed, it finally errors out), and only if it does exist should it check that it's a directory, erroring out if not. The fix for this is simple and I will send a PR.
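As a sketch of the proposed logic, here is a hypothetical free function in place of config.canBackup() (the real method builds the path via self.snapshotsFullPath(profile_id)): a missing path is reported silently, because the caller's 30-second retry loop is expected to handle it, while an existing non-directory is logged as a real error.

```python
import logging
import os

logger = logging.getLogger("backintime")

def can_backup(snapshots_path: str) -> bool:
    """Return True if the snapshots folder is ready to receive a backup."""
    if not os.path.exists(snapshots_path):
        # Not an error yet: the caller polls for up to 30 seconds and
        # errors out itself only after the timeout has elapsed.
        return False
    if not os.path.isdir(snapshots_path):
        # This, on the other hand, is a genuine misconfiguration.
        logger.error("%s exists but is not a directory", snapshots_path)
        return False
    return True
```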

The fix for the underlying issue is less clear to me; ISTM that perhaps the code should be aware of whether the backup was invoked by a user or is the result of an unattended schedule, only asking for user input if a user is likely to be present (i.e. not in the case of a scheduled backup).

@kortschak kortschak changed the title Scheduled backup ask for user input and non-error states are logged to syslog as ERROR Scheduled backup asks for user input and non-error states are logged to syslog as ERROR Mar 23, 2021
@colintedford

colintedford commented Mar 23, 2021 via email

@kortschak
Contributor Author

kortschak commented Mar 23, 2021

Yes, this is why I felt this needed more discussion. Perhaps an *.schedule.unattended boolean configuration would make sense?

@buhtz
Member

buhtz commented Sep 26, 2022

@kortschak @colintedford There is a newly forming maintenance team and we are reviewing all issues. Is this problem still relevant for you? Can you reproduce it with a newer or the latest release? Did you find a solution?

Tag: Bug

This depends on a refactoring of the whole logging mechanism. If I understood the implications correctly, this is not a blocking or severe bug, and it can't be fixed in isolation. Because of that I vote to close. In any case, the logging mechanism will be recreated using Python's own logging package. This will take time and won't happen in the next release because it entails some breaking changes.

@kortschak
Contributor Author

kortschak commented Sep 26, 2022

This is still relevant. I run the version at #1143 and have not moved from there because of the impact that that bug has on my use. I don't see any changes between that branch and master that would affect the behaviour here (the relevant code in config.py has been touched, but not in a way that would fix this).

All the requirements to fix the issue are described here, but they depend on design decisions, which is why I did not send a fix. I don't think the bug should be closed until the issue no longer exists, and I would say there is a fundamental issue of logging non-error states as errors due to invalid logic, not a logging issue.

@buhtz
Member

buhtz commented Sep 26, 2022

Please let me paraphrase this to see if I understand it.

  • Your backup destination is a NAS that is not online 24/7
  • Your backup/snapshot job has a user-callback script that uses wake-on-LAN to wake up the NAS before the backup starts.

Does your script wait until the NAS has booted (e.g. by checking via ping)? Looking at the timestamps of your log output, it doesn't seem to wait.

  • No matter whether that error appears or not, your backups complete?

I run the version at #1143 and have not moved from there because of the impact that that bug has on my use.

What is the "impact" in your use case? It sounds like you can't do backups without that PR. Am I right?

@colintedford

Not relevant to me now because I'm not currently using backintime. I'm glad to hear about the team, tho.

@kortschak
Contributor Author

kortschak commented Sep 26, 2022

Your backup destination is a NAS that is not online 24/7

Correct.

Your backup/snapshot job has a user-callback script that uses wake-on-LAN to wake up the NAS before the backup starts.

This is the case. There is a callback to this program https://github.com/kortschak/bit-user-callback/blob/master/bit-user-callback.go using the following configuration

{
  "iwconfig-path": "/sbin/iwconfig",
  "verbose": false,
  "profile": "remote",
  "essid": "SSID",
  "server": "http://nas.host.address/",
  "wake-mac": "xx:xx:xx:xx:xx:xx",
  "wake-delay": "20s",
  "wake-timeout": "10m0s",
  "wake-remote": "255.255.255.255:9"
}

You can see here that the wake code waits 10 minutes for the NAS to come up before failing out (this is this loop). The NAS takes less than 2 minutes to boot.
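The shape of that wait loop can be sketched in Python (the URL and timings here are placeholders; the actual implementation is the linked Go code, which also verifies the wireless network first):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll an HTTP endpoint until it answers 200 OK or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; fall through to the sleep and retry
        time.sleep(interval)
    return False
```

As noted later in the thread, a 200 from the NAS's web portal does not guarantee that the filesystem (or sshd) is ready, which is the suspected cause of the early failures.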

No matter if that error appears or not your backups are done?

I think that is correct; this issue is over a year old, so the details are fuzzy.

What is the "impact" in your use-case? It sounds like you can't do backups without that PR? Am I right?

Error logs result in system notifications. Without this change, BIT can spam notifications badly enough to make the GNOME notification bar useless (in pathological cases it can even make GNOME unresponsive; that doesn't happen in this case, but it can when other errors occur). That is a different, related, and bigger issue that may be fixed by your plans to refactor the logging system, if you keep track of the number of ERROR-level log messages you send.

@buhtz
Member

buhtz commented Sep 27, 2022

Maybe my next question just shows that I'm not deep enough into the BIT code.

If your user-callback waits up to 10 minutes, the NAS is guaranteed to be up.

Then why does the error message happen? Doesn't BIT wait for the user-callback script to finish?

@kortschak
Contributor Author

kortschak commented Sep 27, 2022

There are two issues here. The first is shown in the screenshot and is a result of the code in the first block; it ~always happens and is not an error, but spams the notifications. I don't know why it fails to find what it is looking for, but my first guess would be that while the NAS is up and responding to HTTP requests, the filesystem may be slow to respond early on. The second is the error logging; I haven't seen that for over a year because I run a version that fixes it, so I don't recall the details of when and why it happens.

@buhtz
Member

buhtz commented Sep 27, 2022

Now we come closer.

There are two issues here, the first is shown in the screenshot and is a result of the code in the first block, this ~always happens and is not an error, but spams the notifications.

I understand and agree.

I don't know why it fails to find what it is looking for

That is the important question for me. In my understanding (without reviewing the BIT code) this error shouldn't happen, because your user-callback brings up the NAS and waits long enough.

, but my first guess would be that while the NAS is up and responding to HTTP request, perhaps the FS is slow to respond early. The second is the error logging and I haven't seen that for over a year because I run a version that fixes it; I don't recall the details of when and why it happens.

@emtiu
Member

emtiu commented Sep 29, 2022

Dan, thanks for your patience and clarity in dealing with this issue, and for taking the time to explain your troubles and your proposed solutions. First off, let's agree that (as you explained above) you're dealing with two separate annoyances:

1. Spurious ERRORs in syslog

There are errors thrown by the canBackup function:

backintime/common/config.py

Lines 1356 to 1357 in 88d19d4

```python
if not os.path.isdir(self.snapshotsFullPath(profile_id)):
    logger.error("%s does not exist"
```

which cause an annoyance in your logs, because they indicate a problem that resolves itself after a few seconds. That is discussed at length in #1143 (comment). It's not what this Issue is about.

Moreover, it becomes irrelevant, once we fix …:

2. Spurious desktop notification

You're receiving a desktop notification that says Can't find snapshots folder. If it is on a removable drive […] every 30 seconds, when in fact your user-callback script is still in the middle of waiting for the NAS to come up (which takes about 90 seconds, judging from your logs). This is a race condition, and should be fixed. That's what this Issue is about.

If you agree, I think the title of this Issue should be:
Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors

… and that would be our new starting point for eliminating the annoyance for you and, hopefully, others :)

@emtiu emtiu added Feedback needs user response, may be closed after timeout without a response Discussion decision or consensus needed labels Sep 29, 2022
@emtiu emtiu added User-Callback and removed Discussion decision or consensus needed labels Oct 7, 2022
@emtiu emtiu changed the title Scheduled backup asks for user input and non-error states are logged to syslog as ERROR Snapshotting should wait for user-callback to prepare/mount target filesystem before throwing errors Oct 7, 2022
@emtiu emtiu added Bug and removed Feedback needs user response, may be closed after timeout without a response labels Oct 7, 2022
emtiu added a commit that referenced this issue Oct 7, 2022
Do not automatically log an error if the target drive is not accessible, as we might simply be waiting for it to come online. In turn, add a debug message to the 30s wait loop.
@emtiu
Member

emtiu commented Oct 8, 2022

//cc @aryoda, who is looking into improving user-callback functionality.

There's a long-ish discussion here, but you really only need this summary from my above comment:

Spurious desktop notification: You're receiving a desktop notification that says Can't find snapshots folder. If it is on a removable drive […] every 30 seconds, when in fact your user-callback script is still in the middle of waiting for the NAS to come up (which takes about 90 seconds, judging from your logs). This is a race condition, and should be fixed.

@aryoda
Contributor

aryoda commented Oct 8, 2022

@emtiu I saw some (just merged) PRs for this, but I think they "only" fix the notification spamming, right?

So if there is still work to do I need to understand what exactly.

TL;DR:

What exactly should be changed now from your user point-of-view (after the above PRs are already merged in)?

I see only one thing (but I jumped onto this issue very late):

  • Do not send the "imperative" user notification Can't find snapshots folder. If it is on a removable drive please plug it. from cron jobs (since it is sent before the timeout loop, which normally heals the missing "mount", and since a user is not expected to read and understand this).

```python
if not self.config.canBackup(profile_id):
    if self.config.PLUGIN_MANAGER.hasGuiPlugins and self.config.notify():
        self.setTakeSnapshotMessage(1,
            _('Can\'t find snapshots folder.\nIf it is on a removable drive please plug it.') +
            '\n' +
            gettext.ngettext('Waiting %s second.', 'Waiting %s seconds.', 30) % 30,
            30)
    counter = 0
    for counter in range(0, 30):
        logger.debug("Cannot start snapshot yet: target directory not accessible. Waiting 1s.")
        time.sleep(1)
        if self.config.canBackup():
            break
    if counter != 0:
        logger.info("Waited %d seconds for target directory to be available", counter)

if not self.config.canBackup(profile_id):
    logger.warning('Can\'t find snapshots folder!', self)
    self.config.PLUGIN_MANAGER.error(3) #Can't find snapshots directory (is it on a removable drive ?)
```

Anything missing?

Details

The race condition can only happen if two or more processes are working in parallel, e.g.

  1. WOL in the user-callback (incl. mounting the drive if this is done in there)
  2. BiT snapshots.backup() polling for a correctly mounted target drive.

@kortschak

The NAS is set to wake with a WOL packet that is sent by a user-callback which waits
after the WOL packet has been sent then returns a success status if the NAS comes up within the timeout.

I assume this is a sync call (it waits for the NAS to be up). I have not checked your Go code (cool script BTW) in detail,
but I think it "only" reacts to the "mount" signal 7:

# Taken from user-callback script in GO: https://github.com/kortschak/bit-user-callback/blob/master/bit-user-callback.go
# user-callback ignores profile id and only acts for reason 7 [= mount]

The issue appears to arise due to a race between backintime and the remote being mounted by the system
since the user-callback waits until the NAS is up before it returns.

I think the user-callback for WOL is a sync call, and there could be a race between sshd starting on the NAS and BiT trying to mount over SSH; at least I can see in the logs that you are using SSH in BiT:

Mar 22 01:00:02 client backintime (user/2): INFO: Lock
Mar 22 01:01:52 client backintime (user/2): INFO: user-callback returned 'user-callback: 2021/03/22 01:00:20 sending wake packet#012user-callback: 2021/03/22 01:01:52 server ready'
Mar 22 01:01:53 client backintime (user/2): INFO: mount ssh: [email protected]:/data/backups on /home/user/.local/share/backintime/mnt/XXXXXXXX/mountpoint
Mar 22 01:01:53 client backintime (user/2): INFO: Take a new snapshot. Profile: 2 remote
Mar 22 01:01:53 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:54 client backintime (user/2): ERROR: /home/user/.local/share/backintime/mnt/2_28727/backintime/client/user/remote does not exist
Mar 22 01:01:57 client backintime (user/2): INFO: Call rsync to take the snapshot

I think this could be fixed in the user-callback by also waiting for sshd (the SSH server) to be up
before returning from the user-callback.
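A sketch of that extra wait, assuming a plain TCP connect to port 22 is an acceptable proxy for "sshd is accepting connections" (host, port, and timings are placeholders):

```python
import socket
import time

def wait_for_sshd(host: str, port: int = 22,
                  timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to the SSH port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # sshd (or at least its listener) is up
        except OSError:
            time.sleep(interval)  # refused or unreachable; retry
    return False
```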

If BiT calls the user-callback to ask for a mount (reason 7 signal ) it should rely on an available mount to do further processing
when the user-callback returns (unless the user-callback returns with an error).

The additional 30-second timeout for BiT-side mounting of SSH (or whatever is configured in the profile) comes on top of that,
and informing the user with an error only after timing out (instead of spamming notifications during the countdown)
is, I think, implemented now via the above PRs.

ISTM that perhaps the code should be aware of whether the backup has been invoked by a user
or is the result of an unattended schedule, only asking for user input if there is a user likely to be present
(i.e. not in the case of scheduled backup).

I assume with "user input" you mean the "user notification" in your Gnome notification bar when logged in
which shows:

Can't find snapshots folder. If it is on a removable drive please plug it.

If really something different pops up requiring user interaction please provide more details or a screen shot here.

I fully agree that a (head-less) scheduled backup should only write to logs and send user notifications only at the end (and only if a user is logged in).

@kortschak
Contributor Author

I assume this is sync call (waits for the NAS beeing up)

Yes. You can see the core logic for that here https://github.com/kortschak/bit-user-callback/blob/5fc406cfd52bb6279884127e2c10f4726c63c7cf/bit-user-callback.go#L335-L361

The basis for assessing whether the NAS is up is that it responds with a 200 on a get to the web portal for the NAS. A reasonable human would expect that a success here would mean the fs was up, ... but software.

The remainder is there to ensure that the device is on the correct wireless network, and for config handling.

think it "only" reacts on the "mount" signal 7:

This is correct.

https://github.com/kortschak/bit-user-callback/blob/5fc406cfd52bb6279884127e2c10f4726c63c7cf/bit-user-callback.go#L323-L325

I assume with "user input" you mean the "user notification" in your Gnome notification bar when logged in

Yes.

I fully agree that a (head-less) scheduled backup should only write to logs and send user notifications only at the end (and only if a user is logged in).

Partially. Logs should be written, but user notification absolutely should not be used in a headless situation. The logs that should be written should be at the appropriate level (I think this is fixed now).

@aryoda
Contributor

aryoda commented Oct 8, 2022

Logs should be written, but user notification absolutely should not be used in a headless situation.

Yes, in 100% headless situations a user notification is simply not possible, and BiT offers the notification plugin for this headless use case (e.g. to send emails).

If a cron job is started under a certain user account and the user is logged in to a desktop environment when the job finishes, I would not consider this a headless situation (e.g. Ubuntu informs me via notifications, when I am logged in, that some scheduled updates were not possible, e.g. snap security updates).

IMHO it would be very helpful for me as a BiT user if scheduled backup jobs (in the user's cron) gave me a notification if a backup failed (or, generally, that it was executed and what the result was), so that I am reminded that it worked (or not). Otherwise I would have to actively check the logs (if I think of it), or could believe backups work when they don't.

Anyhow these are the options I see:

A) Implementing a "cron-zero-notification" change:

  1. A backup started via cron could be recognized in this code (e.g. by setting a transient flag as you proposed earlier):

```python
def backupJob(args):
    """
    Command for taking a new snapshot in background. Mainly used for cronjobs.
    This will run the snapshot inside a daemon and detach from it. It will
    return immediately back to commandline.
```

  2. The only and central code location where user notifications are sent is in the notifyplugin.py.
    Here all messages could be suppressed if the flag from 1. is set (and maybe redirected to the log, if not already implemented).

Note: I was surprised that the GUI-plugin notifyplugin.py is always loaded (even if only backintime CLI was started). This seems inconsistent...

Loads all plugins from python source code files that are stored
in one of these plugin sub folders in the installation
root folder:
'plugins', 'common/plugins', 'qt/plugins'

B) Alternative: Implementing a "if cron then notify only about final backup status" change:

  1. Same as 1. above

  2. All existing pluginmanager.message() calls must be changed to consider the flag from 1.
    (and future implementations must stick to this, but that can easily be forgotten due to the decentralized nature of this logic).

    There is currently only one code location containing this call: snapshots.setTakeSnapshotsMessage():

    self.config.PLUGIN_MANAGER.message(profile_id, profile_name, type_id, message, timeout)

    Unfortunately, changing this call to consider the flag value from 1. is not the right way,
    since setTakeSnapshotsMessage() is called from about 20 other locations to send
    notifications that either have nothing to do with snapshots (e.g. "Saving config file", "removing old snapshots")
    or inform about a non-final state of taking a snapshot (e.g. "Taking snapshot").

    I would modify setTakeSnapshotsMessage() by appending a new argument IsFinalResultStatus = False
    and modify only the calls that represent the final backup result (OK or error) to pass IsFinalResultStatus = True.
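To make option B concrete, here is a toy model with hypothetical names (Notifier stands in for the snapshot/plugin-manager plumbing, scheduled for the transient cron flag from step 1, and the sent list for PLUGIN_MANAGER.message()):

```python
class Notifier:
    """Toy model: under cron, only final-result messages reach the user."""

    def __init__(self, scheduled: bool):
        self.scheduled = scheduled  # transient flag set when started from cron
        self.sent = []              # stands in for PLUGIN_MANAGER.message()

    def set_take_snapshot_message(self, type_id: int, message: str,
                                  is_final_result_status: bool = False) -> None:
        if self.scheduled and not is_final_result_status:
            return  # unattended run: suppress intermediate notifications
        self.sent.append((type_id, message))

cron = Notifier(scheduled=True)
cron.set_take_snapshot_message(0, "Taking snapshot")                             # suppressed
cron.set_take_snapshot_message(0, "Saving config file")                          # suppressed
cron.set_take_snapshot_message(0, "Snapshot finished", is_final_result_status=True)
```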

@kortschak @emtiu @buhtz Any opinions?

@kortschak
Contributor Author

Yeah, I should clarify, notifications are reasonable, but expecting user interaction is not.

The final outcome notification looks like the best approach to me. I agree that it is easily possible for the user to not notice that a backup has failed otherwise.

@emtiu
Member

emtiu commented Oct 8, 2022

@aryoda wrote:

If BiT calls the user-callback to ask for a mount (reason 7 signal ) it should rely on an available mount to do further processing
when the user-callback returns (unless the user-callback returns with an error).

Yes, this is what I consider the core of this issue. If this was done correctly, @kortschak would not be seeing the unwelcome notifications.

But other considerations regarding notifications also come into play:

@aryoda wrote:

IMHO it would be very helpful for me as BiT user if scheduled backup jobs (in user's cron) gave me a notification if a backup failed (or generally that it was executed and what the result was) so that I am reminded that it worked (or not). Otherwise I had to actively check the logs (if I think of it) or could think backups work but they don't.

Yes, this problem (silent failure) has been mentioned in #450, and I've also experienced it myself. Different Issue ;)

@aryoda wrote:

A) Implementing a "cron-zero-notification" change:

I think this is what #850 also wishes for.

On the other hand, users might wish for that notification: It would remind them to connect a drive (for which they have 30 seconds), then run a snapshot, and then the drive might be removed again.

B) Alternative: Implementing a "if cron then notify only about final backup status" change:

Yes, users might very well want to be notified of the final result of a snapshot by desktop notification, but currently, this does not happen.

The "Saving config file" etc. notifications are "BiT-internal" (shown in the GUI and the tray icon), but not as desktop notifications.

In summary, a lot could be done to improve notifications. It would probably be best to have per-profile configuration options like:

  • "Notify when drive is not connected"
  • "Notify when snapshot fails"
  • "Notify when snapshot succeeds".

But this would require a major reworking, including the integration of the If it is on a removable drive please plug it. desktop notification with the "BiT-internal" notifications in the GUI and the tray icon.

For the moment, I think several other issues and bugs are more important. But this Issue gives a good summary.

@emtiu emtiu added the Notifications desktop, email, GUI/tray status messages label Oct 8, 2022
@aryoda
Contributor

aryoda commented Oct 9, 2022

For the moment, I think several other issues and bugs are more important. But this Issue gives a good summary.

I agree putting this issue into the backlog (esp. since the plugin and notification systems need a complete overhaul...)

@aryoda
Contributor

aryoda commented Oct 9, 2022

@emtiu

Yes, users might very well want to be notified of the final result of a snapshot by desktop notification,
but currently, this does not happen.

THX, I didn't realize that only selected messages are sent as desktop notifications, and just checked why...

The notifyplugin does a notify-send only for errors (argument level = 1; a nice example of a "magic number", BTW ;-)):

```python
def message(self, profile_id, profile_name, level, message, timeout):
    if 1 == level:
        cmd = ['notify-send']
```
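For illustration only, the magic number could be given a name, e.g. with an IntEnum (hypothetical names; the actual plugin compares against the literal 1):

```python
from enum import IntEnum

class MessageLevel(IntEnum):
    """Hypothetical named levels for the plugin message() API."""
    ERROR = 1
    INFO = 2

def should_notify(level: int) -> bool:
    # Mirrors the current behaviour: notify-send fires only for errors.
    return level == MessageLevel.ERROR
```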
