-
-
Notifications
You must be signed in to change notification settings - Fork 30.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create an issue when database backups fail because the system runs out of resources #109020
Conversation
…urces If the backup takes too long, the system can run out of resources to store the event queue. When this happens we logged a warning in the log which was easy to miss, and users sometimes found out their backup had failed at the worst possible time (when they needed to restore it.)
Hey there @home-assistant/core, mind taking a look at this pull request as it has been labeled with an integration ( Code owner commandsCode owners of
|
It would be good to create an issue when the system storage space falls under a certain amount, say, 10 or 15% (if that is not already the case), so that users can treat storage issues well in advance (certainly before backups start failing). |
That is not in scope for this PR |
I understand that but we are just treating the symptom here (backups failing), rather than the cause (low / no space). While I like the idea, I feel that it doesn't seem like a solution that scales well. Are we going to add alerts for every other area of HA that fails due to low / no space too? Another example could be a HAOS or HA Core docker image that does not have enough space to be able to update. We could instead check available disk space before creating a backup or performing an upgrade instead and if there is not enough space available, fail to start the backup / upgrade process with an exception. |
This PR has nothing to do with low space, its about the lock being held too long because the backup itself it taking too long. Resources in this case mean: cpu time, memory, I/O, or anything else that would make the backup too slow. It's intentionally generic because we honestly don't know why, and it's something the user is going to have to investigate. After the supervisor PR is merged hopefully this will only happen in very extreme cases. If the backup fails because it runs out of space, they're already going to get a notification in the UI. The problem here is the backup looks like it's successful but it really wasn't since we gave up the lock and now the database back up is no longer good but the rest of the backup is ok. The alternative would to make it a hard failure but that might put the user in an even a worse state because then they can't even back up core. I think it's better to be able to make a back up even without the database than get nothing at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Lol, isoformat seemed super friendly to me, but I'm obviously not the user in this case. I'll pick a more human readable format |
I think |
Repair issues can't send datetime or timestamps to the frontend, only strings AFAIK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @bdraco 👍
../Frenck
🤞 that this is rare as I think it's going to be, but if it's not, I guess it'll motivate to improve the Backup performance more. Certainly better to know than vs finding out the hard way We still have a lot of opportunities to reduce I/O and disk writes which also save storage lifetime that haven't been explored yet. |
Thanks |
This should wait to merge until after home-assistant/supervisor#4843 so we don't have a have a lot of issues in the queue that we can do little about.Proposed change
If the backup takes too long, the system can run out of resources to store the event queue because we have to queue them in memory until the database backup finishes and the write lock is released. When this happens we logged a warning in the log which was easy to miss, and users sometimes found out their database backup had failed at the worst possible time (when they needed to restore it.)
The repair issue increases the chance the user will see the problem without forcing the rest of their backup to hard fail (since it only affects the database). This is especially relevant if they're doing backups via API to do an update, or using an add-on like the Google Drive Backup or other Backup system. For testing I did backups via api and verified I got the error in the UI.
This issue is expected to be extremely rare after home-assistant/supervisor#4843
Type of change
Additional information
Checklist
ruff format homeassistant tests
)If user exposed functionality or configuration variables are added/changed:
If the code communicates with devices, web services, or third-party tools:
Updated and included derived files by running:
python3 -m script.hassfest
.requirements_all.txt
.Updated by running
python3 -m script.gen_requirements_all
..coveragerc
.To help with the load of incoming pull requests: