Mark workers associated with failed systemd units as stopped #23182

Merged
Fryguy merged 2 commits into ManageIQ:master from agrare:mark_workers_for_failed_units_stopped
Sep 27, 2024

Conversation

agrare
Member

@agrare agrare commented Sep 11, 2024

If we start a systemd unit and it fails, the miq_worker record associated with it can be left in "creating" and never cleaned up.

When we stop and clean up any failed systemd units, we should also mark any associated miq_worker records as stopped so that they can be cleaned up by the `clean_worker_records` method.

INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Disabling failed unit files: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#cleanup_failed_systemd_services) Stopping worker records for failed units: [opentofu-runner.service]
INFO -- evm: MIQ(MiqServer::WorkerManagement::Systemd#clean_worker_records) SQL Record for Worker [OpentofuWorker] with ID: [71], PID: [], GUID: [46e4cdf4-22b8-426>
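The flow in the log lines above can be sketched roughly as follows. This is a minimal, in-memory illustration and not the code in this PR: `WorkerRecord`, `stop_workers_for_failed_units`, `clean_stopped_records`, and the `example-worker.service` unit are hypothetical names made up for the example, and the real miq_worker records are database-backed rather than plain structs.

```ruby
# Hypothetical stand-in for a miq_worker record (assumption: the real
# record is a database-backed model, not a Struct).
WorkerRecord = Struct.new(:id, :type, :unit_name, :status)

# Mark every worker whose systemd unit is in the failed list as stopped.
# Returns the records that were changed.
def stop_workers_for_failed_units(workers, failed_unit_names)
  failed = workers.select { |w| failed_unit_names.include?(w.unit_name) }
  failed.each { |w| w.status = "stopped" }
  failed
end

# Drop records already marked stopped. This loosely mirrors the later
# cleanup pass described above; it is not the real clean_worker_records.
def clean_stopped_records(workers)
  workers.reject { |w| w.status == "stopped" }
end

workers = [
  WorkerRecord.new(71, "OpentofuWorker", "opentofu-runner.service", "creating"),
  # Hypothetical healthy worker, included only for contrast.
  WorkerRecord.new(72, "ExampleWorker", "example-worker.service", "started")
]

stopped   = stop_workers_for_failed_units(workers, ["opentofu-runner.service"])
remaining = clean_stopped_records(workers)

puts "Stopped worker IDs: #{stopped.map(&:id).inspect}"
puts "Remaining worker IDs: #{remaining.map(&:id).inspect}"
```

The point of splitting this into two steps is that the failed-unit cleanup only changes the record's status, and the existing cleanup pass remains the single place where records are removed.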

TODO

  • Live test on an appliance

Fixes ManageIQ/manageiq-providers-embedded_terraform#59

@agrare agrare added the bug label Sep 11, 2024
@miq-bot miq-bot added the wip label Sep 11, 2024
@agrare agrare changed the title [WIP] Mark workers associated with failed systemd units as stopped Mark workers associated with failed systemd units as stopped Sep 11, 2024
@agrare agrare removed the wip label Sep 11, 2024
@agrare agrare changed the title Mark workers associated with failed systemd units as stopped [WIP] Mark workers associated with failed systemd units as stopped Sep 18, 2024
@agrare agrare added the wip label Sep 18, 2024
If a systemd unit has failed but there is still a miq_worker record
associated with it, we should mark that worker record as stopped. The
record will then be cleaned up by the subsequent `clean_worker_records` method.
@agrare agrare force-pushed the mark_workers_for_failed_units_stopped branch from b1e30ad to 728e223 Compare September 27, 2024 14:52
@miq-bot
Member

miq-bot commented Sep 27, 2024

Checked commits agrare/manageiq@2906f85~...728e223 with ruby 3.1.5, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. 🏆

@agrare agrare changed the title [WIP] Mark workers associated with failed systemd units as stopped Mark workers associated with failed systemd units as stopped Sep 27, 2024
@agrare agrare added core/workers and removed wip labels Sep 27, 2024
@agrare
Member Author

agrare commented Sep 27, 2024

Okay, I ran a live test on a master appliance build with this applied. I enabled the embedded_terraform role first and set the container_image later, and confirmed that the failed workers are marked stopped and later deleted. Once the container_image setting is set properly, the worker pulls the correct image the next time it starts up. Taking out of WIP.

@Fryguy Fryguy merged commit de72e9e into ManageIQ:master Sep 27, 2024
8 checks passed
@Fryguy Fryguy self-assigned this Sep 27, 2024
Development

Successfully merging this pull request may close these issues.

OpentofuWorker record stuck in "creating" even though service is failed
3 participants