Bug 1916169: storeCurrentConfigOnDisk after os changes #2922

mkenigs · 2022-01-20T23:40:44Z

Currently the MachineConfig being applied is saved to disk before OS
changes are applied. If the node loses power while OS changes are being
applied, the MCO incorrectly concludes from the config stored on disk
that the update has been applied. Instead, wait until the OS changes
have been made to write the config to disk.
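
To make the ordering problem concrete, here is a minimal, self-contained sketch. It is not the MCO implementation: the `node` type, its fields, and the `update`, `applyOSChanges`, and `looksUpdated` helpers are hypothetical stand-ins that only model "record the config on disk" versus "apply OS changes", with a flag that simulates losing power during the OS step.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a hypothetical in-memory stand-in for a machine; none of these
// names come from the MCO code base.
type node struct {
	bootedOS     string // OS actually deployed on the node
	configOnDisk string // name of the MachineConfig recorded on disk
}

var errPowerLoss = errors.New("simulated power loss during OS update")

// applyOSChanges simulates the OS update step. When interrupted is true it
// fails before the OS is changed, as if the node lost power.
func (n *node) applyOSChanges(targetOS string, interrupted bool) error {
	if interrupted {
		return errPowerLoss
	}
	n.bootedOS = targetOS
	return nil
}

// update applies a config. storeFirst=true models the old ordering (record
// the config, then change the OS); storeFirst=false models the ordering
// proposed in this PR (change the OS, then record the config).
func (n *node) update(cfgName, targetOS string, storeFirst, interrupted bool) error {
	if storeFirst {
		n.configOnDisk = cfgName
	}
	if err := n.applyOSChanges(targetOS, interrupted); err != nil {
		return err
	}
	if !storeFirst {
		n.configOnDisk = cfgName
	}
	return nil
}

// looksUpdated mimics a post-restart check that trusts the on-disk record.
func (n *node) looksUpdated(cfgName string) bool {
	return n.configOnDisk == cfgName
}

func main() {
	for _, storeFirst := range []bool{true, false} {
		n := &node{bootedOS: "standard-kernel", configOnDisk: "rendered-old"}
		_ = n.update("rendered-new", "realtime-kernel", storeFirst, true)
		fmt.Printf("storeFirst=%v looksUpdated=%v bootedOS=%s\n",
			storeFirst, n.looksUpdated("rendered-new"), n.bootedOS)
	}
}
```

With the old ordering, the interrupted node still claims to be on the new config although its OS never changed. With the new ordering, the on-disk record stays at the old config, so the update is retried.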

I manually verified that this shouldn't break anything:
getCurrentConfigOnDisk is the only function that accesses the config on
disk, and it is only called in the following code paths:
syncNode->runPreflightConfigDriftCheck
syncNode->startConfigDriftMonitor
performPostConfigChangeAction->startConfigDriftMonitor
checkStateOnFirstRun
syncNode->prepUpdateFromCluster
runOnceFromMachineConfig->prepUpdateFromCluster

None of those functions are called between where the config is currently
stored to disk and where I'm moving it to, so this change should be
safe.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1916169

- How to verify it
Run the reproducer script from https://bugzilla.redhat.com/show_bug.cgi?id=1916169 before and after the change. Without the change, the node does not end up with a realtime kernel; with the change, the switch to the realtime kernel is performed correctly.

openshift-ci · 2022-01-20T23:40:46Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

cgwalters · 2022-01-21T15:54:36Z

Background: I think I'd summarize this "write current config" thing as a workaround for not having #1190.

Basically it was trying to track the intention of switching to a new configuration by writing to the current /etc. But if we delegate that whole thing to ostree, we are always either in the new config or the old.

mkenigs · 2022-01-21T16:16:47Z

> But if we delegate that whole thing to ostree, we are always either in the new config or the old.

+1 for layering

openshift-ci · 2022-01-24T17:46:54Z

@mkenigs: This pull request references Bugzilla bug 1916169, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.10.0) matches configured target release for branch (4.10.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

yuqi-zhang

Looking at the code and Matthew's assessment, I think this code lgtm

Just to make sure, since the original BZ is quite long: do we expect this to close all non-MCO reboot races? I presume not, but it should at least leave us in a proper "error" state, rather than a node that reports no errors yet is not actually updated?

mkenigs · 2022-01-24T20:44:37Z

Do you have any specific races in mind that this wouldn't solve? I don't feel like I know whether or not this fixes all races and would have to look into https://issues.redhat.com/browse/MCO-156 in more depth

That BZ is kinda vague but this would fix the only failure it explicitly describes

yuqi-zhang · 2022-01-24T21:39:46Z

Not off the top of my head. I think we can use 156 to pursue follow ups if they arise.

/lgtm

openshift-ci · 2022-01-24T21:40:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkenigs, yuqi-zhang

Needs approval from an approver in each of these files:

~~OWNERS~~ [yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2022-01-24T23:06:13Z

/retest-required