-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1916169: storeCurrentConfigOnDisk after os changes #2922
Bug 1916169: storeCurrentConfigOnDisk after os changes #2922
Conversation
Skipping CI for Draft Pull Request. |
Background: I think I'd summarize this "write current config" thing is a workaround for not having #1190 Basically it was trying to track the intention of switching to a new configuration by writing to the current |
+1 for layering |
Currently the MachineConfig being applied is saved to disk before OS changes are applied. If the node loses power while OS changes are being applied, the MCO incorrectly concludes from the config stored on disk that the update has been applied. Instead, wait until the OS changes have been made to write the config to disk. I manually verified that this shouldn't break anything: getCurrentConfigOnDisk is the only function that accesses the config on disk, and it is only called in the following code paths: syncNode->runPreflightConfigDriftCheck syncNode->startConfigDriftMonitor performPostConfigChangeAction->startConfigDriftMonitor checkStateOnFirstRun syncNode->prepUpdateFromCluster runOnceFromMachineConfig->prepUpdateFromCluster None of those functions are called between where the config is currently stored to disk and where I'm moving it to, so this change should be safe. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1916169
d45a297
to
d324dac
Compare
@mkenigs: This pull request references Bugzilla bug 1916169, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code and Matthew's assessment, I think this code lgtm
Just to make sure, since the original BZ is quite long, do we foresee this being able to close all non-MCO reboot races? I presume no, but this should leave us in a proper "error" state instead of having it not report errors but also not be updated?
Do you have any specific races in mind that this wouldn't solve? I don't feel like I know whether or not this fixes all races and would have to look into https://issues.redhat.com/browse/MCO-156 in more depth That BZ is kinda vague but this would fix the only failure it explicitly describes |
Not off the top of my head. I think we can use 156 to pursue follow ups if they arise. /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkenigs, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
8 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
7 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
/retest-required Please review the full test history for this PR and help us cut down flakes. |
@mkenigs: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@mkenigs: All pull requests linked via external trackers have merged: Bugzilla bug 1916169 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Currently the MachineConfig being applied is saved to disk before OS
changes are applied. If the node loses power while OS changes are being
applied, the MCO incorrectly concludes from the config stored on disk
that the update has been applied. Instead, wait until the OS changes
have been made to write the config to disk.
I manually verified that this shouldn't break anything:
getCurrentConfigOnDisk is the only function that accesses the config on
disk, and it is only called in the following code paths:
syncNode->runPreflightConfigDriftCheck
syncNode->startConfigDriftMonitor
performPostConfigChangeAction->startConfigDriftMonitor
checkStateOnFirstRun
syncNode->prepUpdateFromCluster
runOnceFromMachineConfig->prepUpdateFromCluster
None of those functions are called between where the config is currently
stored to disk and where I'm moving it to, so this change should be
safe.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1916169
- How to verify it
Run the reproducer script from https://bugzilla.redhat.com/show_bug.cgi?id=1916169 before and after the change. Without the change, the node does not end up with a realtime kernel, but with the change, the switch to realtime kernel is correctly performed