-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARO-9990: Update etchosts controller to use ForceReconcilation flag #3837
Conversation
826a99d
to
b94da17
Compare
Functional testing passed. |
/azp run ci,e2e |
Azure Pipelines successfully started running 2 pipeline(s). |
@@ -141,7 +156,7 @@ func (r *EtcHostsMachineConfigReconciler) SetupWithManager(mgr ctrl.Manager) err | |||
Complete(r) | |||
} | |||
|
|||
func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, mcps ...mcv1.MachineConfigPool) error { | |||
func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, c client.Client, allowReconcile bool, mcps ...mcv1.MachineConfigPool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c client.Client
is not used. (I don't know why CI doesn't complain about it)
Also I don't think we need to pass allowReconcile
here.
We need it for the dnsmasq one because it still wants to reconcile some things even if it's false.
ARO-RP/pkg/operator/controllers/dnsmasq/cluster_controller.go
Lines 134 to 147 in bd9af03
// If we are allowed to reconcile the resources, then we run Ensure to | |
// create or update. If we are not allowed to reconcile, we do not want to | |
// perform any updates, but we do want to perform initial configuration. | |
if allowReconcile { | |
return ch.Ensure(ctx, resources...) | |
} else { | |
for _, i := range resources { | |
err := c.Create(ctx, i.(client.Object)) | |
// Since we are only creating, ignore AlreadyExists | |
if err != nil && !kerrors.IsAlreadyExists(err) { | |
return fmt.Errorf("error creating client object: %w", err) | |
} | |
} | |
} |
According to reconcileMachineConfigs
, it does nothing when allowReconcile
is false.
Why don't we just stop reconciling at the beginning of Reconcile()
if allowReconcile
is false ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the Client stuff, if it can be removed let's do that.
My best guess is that Lisa was following convention in other controllers: allowReconcile only matters on calls to Ensure
, a "problem" function in this space in regards to upgrades, so I can see why the code was implemented this way to get it across the line. I think if we wanted to make this more clear, we could have some follow-up work to move Ensure
into Reconcile
(probably for every controller - not just this one), with all other functions building up a resources
object to pass into Ensure
- how does that sound? That would also prevent us from having to pass dh
all over the place. Alternatively, we could have the dh
object itself have an allowReconcile
property, so that Ensure
will know for itself when it should/shouldn't apply resources into a cluster... many choices, but they seem out of scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I did functional testing I found that calling client.Create
caused reboots so I took it out. My guess is that for the dnsmasq controller, we actually dont have clusters missing the aro-dns
MachineConfig so it didnt trip during that testing. I can remove the client entirely now yeah.
When we call dh.Ensure
we do cause reboots, so we want to put this behind the allowReconcile
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we wanted to make this more clear, we could have some follow-up work to move Ensure into Reconcile (probably for every controller - not just this one), with all other functions building up a resources object to pass into Ensure - how does that sound? That would also prevent us from having to pass dh all over the place.
We can card this up but to my knowledge this allowReconcile
approach is going to be EOL with SyncSets. Eventually this will move into syncsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allowReconcile only matters on calls to Ensure
It's true,
I think if we wanted to make this more clear, we could have some follow-up work to move Ensure into Reconcile (probably for every controller - not just this one), with all other functions building up a resources object to pass into Ensure - how does that sound?
but I don't understand why we need to move Ensure
to Reconcile
.
I just think that the logic from r.AllowRebootCausingReconciliation(ctx, instance)
to Ensure
is not used if allowReconcile
is false (sorry if this understanding is wrong).
If it stops reconciling just after r.AllowRebootCausingReconciliation(ctx, instance)
, it will be more readable because we don't have to follow the lines until Ensure
, and also there will be (very) slight improvement in performance.
Also AllowRebootCausingReconciliation
is used only in dnsmasq and etchosts.
And the dnsmasq has the reason not to stop earlier (because it has to create some things), so I don't think we have the convention we should follow here.
We can card this up but to my knowledge this allowReconcile approach is going to be EOL with SyncSets. Eventually this will move into syncsets.
If so, we might not necessarily have to change the code, but still I'd say just changing the code here is doable and worth doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree it is not worth refactoring code that is deemed temporary and already went through a code review under a different author.
You seem to had a fundamental mistunderstanding of what allowReconcile
is doing. This must be used to stop from Ensure or Create being called to have the desired effect.
The code in dnsmasq doesnt actually have the desired effect. It will cause reboots. That was a miss on the PR reviews of the previous author, Amber.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You seem to had a fundamental mistunderstanding of what allowReconcile is doing. This must be used to stop from Ensure or Create being called to have the desired effect.
That's what I understand.
That's why I'm asking you why these codes are needed to run even though it doesn't Ensure or Create.
ARO-RP/pkg/operator/controllers/etchosts/machineconfig_controller.go
Lines 160 to 178 in 3762a79
var resources []kruntime.Object | |
for _, mcp := range mcps { | |
resource, err := EtcHostsMachineConfig(instance.Spec.Domain, instance.Spec.APIIntIP, instance.Spec.GatewayDomains, instance.Spec.GatewayPrivateEndpointIP, role) | |
if err != nil { | |
return err | |
} | |
err = dynamichelper.SetControllerReferences([]kruntime.Object{resource}, &mcp) | |
if err != nil { | |
return err | |
} | |
resources = append(resources, resource) | |
} | |
err := dynamichelper.Prepare(resources) | |
if err != nil { | |
return err | |
} |
@@ -891,7 +892,7 @@ var _ = Describe("ARO Operator - etchosts", func() { | |||
}).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) | |||
} | |||
|
|||
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) { | |||
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) && instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reach this condition?
The default value of etchostsManaged
is true. If we don't change the value elsewhere, we won't reach here.
Also it runs just once in each e2e test run, so only one of those conditions can be run, and others can't.
I think we should split this test cases into multiple test cases according to operator flags and test each behaviour in each test case not using conditional branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer the test as-is, because it is future-proof to our E2E test expanding out from the rigid structure it is in today (e.g. us performing permutation tests on various clusters with different features set in each one).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more of an issue of the E2E suite not being extensive and less about the test itself. The condition is correct, because when; 1) etchostsManaged is set to false the MachineConfigs are deleted and 2) etchostsManaged is set to true the MachineConfigs are created.
If we aren't reaching this in E2E test suite its because we need to rethink how we are running these tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we change operator flags before running each test case?
I think we can because we can change cluster operator flags by running oc edit command.
If we can change the flags then, we can set them before each test, and we can test all scenarios.
Each test case should focus on one scenario, and this test case doesn't follow the principle.
Otherwise the purpose of the test cases will be unclear.
If we had to wait for the change in how to run E2E, we can rewrite this test case like below:
It("must not reconcile machine configs when enabled == false", func(ctx context.Context) {
if instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsEnabled) {
Skip("skipping because enabled == true")
}
// ...
}
It("must reconcile machine configs when managed == true and ...", func(ctx context.Context) {
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) ||
!instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) {
Skip("skipping because managed == false || forceReconciliation == false")
}
// ...
}
It("must remove machine configs when managed == false and ...", func(ctx context.Context) {
if instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) ||
!instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) {
Skip("skipping because managed == true || forceReconciliation == false")
}
// ...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This doesn't show the actual functionality of the controller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which question is No? You mean we can't change operator flags before running each test case?
Also I want you to explain which part is wrong to get better understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - if you can remove the unnecessary Client from that function header, all the better. Thanks Lisa, great work!
@@ -141,7 +156,7 @@ func (r *EtcHostsMachineConfigReconciler) SetupWithManager(mgr ctrl.Manager) err | |||
Complete(r) | |||
} | |||
|
|||
func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, mcps ...mcv1.MachineConfigPool) error { | |||
func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, c client.Client, allowReconcile bool, mcps ...mcv1.MachineConfigPool) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the Client stuff, if it can be removed let's do that.
My best guess is that Lisa was following convention in other controllers: allowReconcile only matters on calls to Ensure
, a "problem" function in this space in regards to upgrades, so I can see why the code was implemented this way to get it across the line. I think if we wanted to make this more clear, we could have some follow-up work to move Ensure
into Reconcile
(probably for every controller - not just this one), with all other functions building up a resources
object to pass into Ensure
- how does that sound? That would also prevent us from having to pass dh
all over the place. Alternatively, we could have the dh
object itself have an allowReconcile
property, so that Ensure
will know for itself when it should/shouldn't apply resources into a cluster... many choices, but they seem out of scope for this PR.
@@ -891,7 +892,7 @@ var _ = Describe("ARO Operator - etchosts", func() { | |||
}).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed()) | |||
} | |||
|
|||
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) { | |||
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) && instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer the test as-is, because it is future-proof to our E2E test expanding out from the rigid structure it is in today (e.g. us performing permutation tests on various clusters with different features set in each one).
b94da17
to
3762a79
Compare
I've removed the unused client. Now that we aren't calling create we don't need to pass it along. |
/azp run ci,e2e |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run e2e |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, thanks for this!
/azp run e2e |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run e2e |
Azure Pipelines successfully started running 1 pipeline(s). |
e2e is panicking in the pre-test suite. I've kicked it off again. |
e2e is passed. Merging. |
Which issue this PR addresses:
Fixes ARO-9990
What this PR does / why we need it:
This updates the new etchosts controller to wait until the cluster is updating to reconcile the machine configs. This is needed to roll out the controller to the fleet without causing a fleet wide rebooting action.
Test plan for issue:
Is there any documentation that needs to be updated for this PR?
N/A
How do you know this will function as expected in production?