Create an issue when database backups fail because the system runs out of resources #109020

bdraco · 2024-01-28T19:46:49Z

~~This should wait to merge until after home-assistant/supervisor#4843 so we don't have a have a lot of issues in the queue that we can do little about.~~

Proposed change

If the backup takes too long, the system can run out of resources to store the event queue because we have to queue them in memory until the database backup finishes and the write lock is released. When this happens we logged a warning in the log which was easy to miss, and users sometimes found out their database backup had failed at the worst possible time (when they needed to restore it.)

The repair issue increases the chance the user will see the problem without forcing the rest of their backup to hard fail (since it only affects the database). This is especially relevant if they're doing backups via API to do an update, or using an add-on like the Google Drive Backup or other Backup system. For testing I did backups via api and verified I got the error in the UI.

This issue is expected to be extremely rare after home-assistant/supervisor#4843

Type of change

Dependency upgrade
Bugfix (non-breaking change which fixes an issue)
New integration (thank you!)
New feature (which adds functionality to an existing integration)
Deprecation (breaking change to happen in the future)
Breaking change (fix/feature causing existing functionality to break)
Code quality improvements to existing code or addition of tests

Additional information

This PR fixes or closes issue: fixes #
This PR is related to issue: ERROR: The recorder backlog queue reached the maximum size of 70918 events #105987 Database queue backlog reached #106300
Link to documentation pull request:

Checklist

The code change is tested and works locally.
Local tests pass. Your PR cannot be merged unless tests pass
There is no commented out code in this PR.
I have followed the development checklist
I have followed the perfect PR recommendations
The code has been formatted using Ruff (ruff format homeassistant tests)
Tests have been added to verify that the new code works.

If user exposed functionality or configuration variables are added/changed:

Documentation added/updated for www.home-assistant.io

If the code communicates with devices, web services, or third-party tools:

The manifest file has all fields filled out correctly.
Updated and included derived files by running: python3 -m script.hassfest.
New or updated dependencies have been added to requirements_all.txt.
Updated by running python3 -m script.gen_requirements_all.
For the updated dependencies - a link to the changelog, or at minimum a diff between library versions is added to the PR description.
Untested files have been added to .coveragerc.

To help with the load of incoming pull requests:

I have reviewed two other open pull requests in this repository.

…urces If the backup takes too long, the system can run out of resources to store the event queue. When this happens we logged a warning in the log which was easy to miss, and users sometimes found out their backup had failed at the worst possible time (when they needed to restore it.)

home-assistant · 2024-01-28T19:46:54Z

Hey there @home-assistant/core, mind taking a look at this pull request as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of recorder can trigger bot actions by commenting:

@home-assistant close Closes the pull request.
@home-assistant rename Awesome new title Renames the pull request.
@home-assistant reopen Reopen the pull request.
@home-assistant unassign recorder Removes the current integration label and assignees on the pull request, add the integration domain after the command.
@home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the pull request.
@home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the pull request.

homeassistant/components/recorder/strings.json

codyc1515 · 2024-01-28T20:31:22Z

It would be good to create an issue when the system storage space falls under a certain amount, say, 10 or 15% (if that is not already the case), so that users can treat storage issues well in advance (certainly before backups start failing).

bdraco · 2024-01-28T20:33:58Z

It would be good to create an issue when the system storage space falls under a certain amount, say, 10 or 15% (if that is not already the case), so that users can treat storage issues well in advance (certainly before backups start failing).

That is not in scope for this PR

codyc1515 · 2024-01-28T20:35:56Z

I understand that but we are just treating the symptom here (backups failing), rather than the cause (low / no space). While I like the idea, I feel that it doesn't seem like a solution that scales well.

Are we going to add alerts for every other area of HA that fails due to low / no space too? Another example could be a HAOS or HA Core docker image that does not have enough space to be able to update.

We could instead check available disk space before creating a backup or performing an upgrade instead and if there is not enough space available, fail to start the backup / upgrade process with an exception.

bdraco · 2024-01-28T20:39:17Z

This PR has nothing to do with low space, its about the lock being held too long because the backup itself it taking too long.

Resources in this case mean: cpu time, memory, I/O, or anything else that would make the backup too slow. It's intentionally generic because we honestly don't know why, and it's something the user is going to have to investigate. After the supervisor PR is merged hopefully this will only happen in very extreme cases.

If the backup fails because it runs out of space, they're already going to get a notification in the UI.

The problem here is the backup looks like it's successful but it really wasn't since we gave up the lock and now the database back up is no longer good but the rest of the backup is ok. The alternative would to make it a hard failure but that might put the user in an even a worse state because then they can't even back up core. I think it's better to be able to make a back up even without the database than get nothing at all.

codyc1515

LGTM

frenck · 2024-01-30T17:45:13Z

I think the PR is good, but can we make the datetime a little bit more end-user friendly?

:)

bdraco · 2024-01-30T17:47:44Z

Lol, isoformat seemed super friendly to me, but I'm obviously not the user in this case. I'll pick a more human readable format

bdraco · 2024-01-30T17:50:18Z

I think %H:%M:%S is sufficient here

codyc1515 · 2024-01-30T17:51:37Z

Is it not straightforward to use these?

bdraco · 2024-01-30T17:53:25Z

Is it not straightforward to use these?

Repair issues can't send datetime or timestamps to the frontend, only strings AFAIK

frenck

Thanks, @bdraco 👍

../Frenck

bdraco · 2024-01-30T18:14:37Z

🤞 that this is rare as I think it's going to be, but if it's not, I guess it'll motivate to improve the Backup performance more. Certainly better to know than vs finding out the hard way

We still have a lot of opportunities to reduce I/O and disk writes which also save storage lifetime that haven't been explored yet.

bdraco · 2024-01-30T18:16:53Z

Thanks

home-assistant bot added cla-signed core integration: recorder small-pr PRs with less than 30 lines. labels Jan 28, 2024

home-assistant bot added the Quality Scale: internal label Jan 28, 2024

bdraco mentioned this pull request Jan 28, 2024

Significantly speed up creating backups with isal via zlib-fast home-assistant/supervisor#4843

Merged

16 tasks

codyc1515 reviewed Jan 28, 2024

View reviewed changes

homeassistant/components/recorder/strings.json Outdated Show resolved Hide resolved

bdraco added 2 commits January 28, 2024 10:31

fixup tests

75db1bd

fixup tests

e5178c9

bdraco changed the title ~~Create an issue when backups fail because the system runs out of resources~~ Create an issue when database backups fail because the system runs out of resources Jan 28, 2024

codyc1515 approved these changes Jan 28, 2024

View reviewed changes

bdraco marked this pull request as ready for review January 28, 2024 22:56

bdraco requested a review from a team as a code owner January 28, 2024 22:56

format %H:%M:%S

8c346b7

format %H:%M:%S

d89e4f3

frenck approved these changes Jan 30, 2024

View reviewed changes

bdraco merged commit a222447 into dev Jan 30, 2024
53 checks passed

bdraco deleted the backup_failed_repair_issue branch January 30, 2024 20:23

github-actions bot locked and limited conversation to collaborators Jan 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create an issue when database backups fail because the system runs out of resources #109020

Create an issue when database backups fail because the system runs out of resources #109020

bdraco commented Jan 28, 2024 •

edited

Loading

home-assistant bot commented Jan 28, 2024

codyc1515 commented Jan 28, 2024

bdraco commented Jan 28, 2024

codyc1515 commented Jan 28, 2024 •

edited

Loading

bdraco commented Jan 28, 2024 •

edited

Loading

codyc1515 left a comment

frenck commented Jan 30, 2024

bdraco commented Jan 30, 2024

bdraco commented Jan 30, 2024

codyc1515 commented Jan 30, 2024

bdraco commented Jan 30, 2024

frenck left a comment

bdraco commented Jan 30, 2024 •

edited

Loading

bdraco commented Jan 30, 2024

Create an issue when database backups fail because the system runs out of resources #109020

Create an issue when database backups fail because the system runs out of resources #109020

Conversation

bdraco commented Jan 28, 2024 • edited Loading

Proposed change

Type of change

Additional information

Checklist

home-assistant bot commented Jan 28, 2024

codyc1515 commented Jan 28, 2024

bdraco commented Jan 28, 2024

codyc1515 commented Jan 28, 2024 • edited Loading

bdraco commented Jan 28, 2024 • edited Loading

codyc1515 left a comment

Choose a reason for hiding this comment

frenck commented Jan 30, 2024

bdraco commented Jan 30, 2024

bdraco commented Jan 30, 2024

codyc1515 commented Jan 30, 2024

bdraco commented Jan 30, 2024

frenck left a comment

Choose a reason for hiding this comment

bdraco commented Jan 30, 2024 • edited Loading

bdraco commented Jan 30, 2024

bdraco commented Jan 28, 2024 •

edited

Loading

codyc1515 commented Jan 28, 2024 •

edited

Loading

bdraco commented Jan 28, 2024 •

edited

Loading

bdraco commented Jan 30, 2024 •

edited

Loading