Try to recover the system if available size in critical #2325

Merged 2 commits on Jan 22, 2024

@patrickelectric patrickelectric commented Jan 20, 2024

Once the system reaches a critical state where the disk is full, recovery becomes almost impossible. The system is unstable, and the backend services and frontend may not work or be available for the user to recover it at all. As a last resort, we delete the BlueOS logs if the filesystem's available space is critical.

Fix #2323

@joaoantoniocardoso joaoantoniocardoso left a comment

The code seems correct, and a 100 MB threshold feels okay since this is before the system starts up.

How I imagine this patch acting:

  1. some subsystem spams a log, depleting the disk space
  2. the user feels the system is unusable
  3. the user reboots the system
  4. assuming there is some space left for the kernel and services, the system boots
  5. critical space is detected
  6. blueos logs are removed

I can see this patch not working if the kernel or docker fails to start because of insufficient disk space.
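
To make the flow above concrete, here is a minimal Python sketch of steps 5 and 6. It is not the code from this patch: the function names, the `root` and `log_dir` arguments, and the choice to delete every file under the log directory are assumptions made for illustration. Only the 100 MB threshold comes from the discussion.

```python
import shutil
from pathlib import Path

# 100 MB, the threshold mentioned in the review; everything else here is hypothetical.
CRITICAL_FREE_BYTES = 100 * 1024 * 1024


def is_space_critical(path: str, threshold: int = CRITICAL_FREE_BYTES) -> bool:
    """Return True when the filesystem holding `path` has less free space than `threshold`."""
    return shutil.disk_usage(path).free < threshold


def remove_logs(log_dir: str) -> int:
    """Delete every regular file under `log_dir` and return the number of bytes removed."""
    freed = 0
    # Materialize the listing first so deletions do not disturb the directory walk.
    for entry in list(Path(log_dir).rglob("*")):
        if entry.is_file():
            size = entry.stat().st_size
            entry.unlink()
            freed += size
    return freed


def recover_if_critical(root: str, log_dir: str, threshold: int = CRITICAL_FREE_BYTES) -> int:
    """Meant to run once at startup: clear the log directory only when space is critical."""
    if not is_space_critical(root, threshold):
        return 0
    return remove_logs(log_dir)
```

The check is kept separate from the deletion so that the same test could later feed a user notification or telemetry event, as suggested below.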

It would be nice to:

  • inform the user of what happened
  • inform us via telemetry system

Now, to really protect the system, we should put the logs into a separate partition from the root (/) and make the log manager service responsible for cleaning it when necessary.

patrickelectric commented Jan 20, 2024

Check: #2327, #2323, #2326, #1015

Docker is able to start, but everything after that just results in unstable behavior.
Some of the points you suggested are already tracked as issues; others are relevant to recovering the system.

  • We should delete old logs when doing the rotation and noticing that the disk space is almost full.
  • We should stop logging if the disk space is almost full.
  • We should clean up old Docker containers that are not being used.
  • We should clean up old Docker artifacts that are not being used.
  • We should allow the user to delete all unused Docker images.
  • We should warn the user through Cockpit that the companion computer's disk is almost full.
  • We should erase older tlog or bin files if the disk is almost full.
  • We should warn the user through the BlueOS header that the disk is almost full and in a critical state.
  • We may not allow the user to arm the vehicle if the disk is almost full.
  • We may do some of these steps automatically to try to recover the system once it starts.
  • We may need a page like Filelight in BlueOS to help identify the root of such problems.
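
As a rough illustration of the "delete old logs" and "erase older tlog or bin files" points, the sketch below removes files oldest first until enough space is free. It is only a sketch under assumptions: `prune_oldest` and its parameters are hypothetical names, it looks at a single flat directory, and it uses modification time as the measure of age. The free-space check is passed in as a callable so the function can be exercised without filling a real disk.

```python
from pathlib import Path
from typing import Callable, List


def prune_oldest(log_dir: str, free_bytes: Callable[[], int], target_free: int) -> List[str]:
    """Delete files in `log_dir`, oldest first, until `free_bytes()` reaches `target_free`.

    Returns the names of the files that were removed, in deletion order.
    """
    # Sort by modification time so the oldest files are removed first.
    files = sorted(
        (p for p in Path(log_dir).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    removed = []
    for path in files:
        if free_bytes() >= target_free:
            break  # enough space recovered, keep the newer files
        path.unlink()
        removed.append(path.name)
    return removed
```

On a real system the callable could be something like `lambda: shutil.disk_usage(log_dir).free` (after `import shutil`), with `target_free` set somewhat above the critical threshold so the cleanup does not run again immediately.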

@patrickelectric patrickelectric merged commit ccfec05 into bluerobotics:master Jan 22, 2024
6 checks passed
Successfully merging this pull request may close these issues.

We should delete old logs if the space available on the system is close the be full