Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARO-9990: Update etchosts controller to use ForceReconcilation flag #3837

Merged
merged 1 commit into from
Sep 17, 2024

Conversation

lranjbar
Copy link
Collaborator

Which issue this PR addresses:

Fixes ARO-9990

What this PR does / why we need it:

This updates the new etchosts controller to wait until the cluster is updating to reconcile the machine configs. This is needed to roll out the controller to the fleet without causing a fleet wide rebooting action.

Test plan for issue:

  • New tests added to the controllers
  • Functional operator testing

Is there any documentation that needs to be updated for this PR?

N/A

How do you know this will function as expected in production?

  • New tests added to the controllers
  • Functional operator testing

@lranjbar
Copy link
Collaborator Author

Functional testing passed.

@lranjbar lranjbar added ready-for-review next-release To be included in the next RP release rollout loki labels Sep 11, 2024
@lranjbar
Copy link
Collaborator Author

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@@ -141,7 +156,7 @@ func (r *EtcHostsMachineConfigReconciler) SetupWithManager(mgr ctrl.Manager) err
Complete(r)
}

func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, mcps ...mcv1.MachineConfigPool) error {
func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, c client.Client, allowReconcile bool, mcps ...mcv1.MachineConfigPool) error {
Copy link
Collaborator

@bitoku bitoku Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c client.Client is not used. (I don't know why CI doesn't complain about it)

Also I don't think we need to pass allowReconcile here.
We need it for the dnsmasq one because it still wants to reconcile some things even if it's false.

// If we are allowed to reconcile the resources, then we run Ensure to
// create or update. If we are not allowed to reconcile, we do not want to
// perform any updates, but we do want to perform initial configuration.
if allowReconcile {
return ch.Ensure(ctx, resources...)
} else {
for _, i := range resources {
err := c.Create(ctx, i.(client.Object))
// Since we are only creating, ignore AlreadyExists
if err != nil && !kerrors.IsAlreadyExists(err) {
return fmt.Errorf("error creating client object: %w", err)
}
}
}

According to reconcileMachineConfigs, it does nothing when allowReconcile is false.
Why don't we just stop reconciling at the beginning of Reconcile() if allowReconcile is false ?

Copy link
Collaborator

@SudoBrendan SudoBrendan Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the Client stuff, if it can be removed let's do that.

My best guess is that Lisa was following convention in other controllers: allowReconcile only matters on calls to Ensure, a "problem" function in this space in regards to upgrades, so I can see why the code was implemented this way to get it across the line. I think if we wanted to make this more clear, we could have some follow-up work to move Ensure into Reconcile (probably for every controller - not just this one), with all other functions building up a resources object to pass into Ensure - how does that sound? That would also prevent us from having to pass dh all over the place. Alternatively, we could have the dh object itself have an allowReconcile property, so that Ensure will know for itself when it should/shouldn't apply resources into a cluster... many choices, but they seem out of scope for this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I did functional testing I found that calling client.Create caused reboots so I took it out. My guess is that for the dnsmasq controller, we actually dont have clusters missing the aro-dns MachineConfig so it didnt trip during that testing. I can remove the client entirely now yeah.

When we call dh.Ensure we do cause reboots, so we want to put this behind the allowReconcile flag.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we wanted to make this more clear, we could have some follow-up work to move Ensure into Reconcile (probably for every controller - not just this one), with all other functions building up a resources object to pass into Ensure - how does that sound? That would also prevent us from having to pass dh all over the place.

We can card this up but to my knowledge this allowReconcile approach is going to be EOL with SyncSets. Eventually this will move into syncsets.

Copy link
Collaborator

@bitoku bitoku Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allowReconcile only matters on calls to Ensure

It's true,

I think if we wanted to make this more clear, we could have some follow-up work to move Ensure into Reconcile (probably for every controller - not just this one), with all other functions building up a resources object to pass into Ensure - how does that sound?

but I don't understand why we need to move Ensure to Reconcile.
I just think that the logic from r.AllowRebootCausingReconciliation(ctx, instance) to Ensure is not used if allowReconcile is false (sorry if this understanding is wrong).
If it stops reconciling just after r.AllowRebootCausingReconciliation(ctx, instance), it will be more readable because we don't have to follow the lines until Ensure, and also there will be (very) slight improvement in performance.

Also AllowRebootCausingReconciliation is used only in dnsmasq and etchosts.
And the dnsmasq has the reason not to stop earlier (because it has to create some things), so I don't think we have the convention we should follow here.

We can card this up but to my knowledge this allowReconcile approach is going to be EOL with SyncSets. Eventually this will move into syncsets.

If so, we might not necessarily have to change the code, but still I'd say just changing the code here is doable and worth doing.

Copy link
Collaborator Author

@lranjbar lranjbar Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree it is not worth refactoring code that is deemed temporary and already went through a code review under a different author.

You seem to had a fundamental mistunderstanding of what allowReconcile is doing. This must be used to stop from Ensure or Create being called to have the desired effect.

The code in dnsmasq doesnt actually have the desired effect. It will cause reboots. That was a miss on the PR reviews of the previous author, Amber.

Copy link
Collaborator

@bitoku bitoku Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to had a fundamental mistunderstanding of what allowReconcile is doing. This must be used to stop from Ensure or Create being called to have the desired effect.

That's what I understand.
That's why I'm asking you why these codes are needed to run even though it doesn't Ensure or Create.

var resources []kruntime.Object
for _, mcp := range mcps {
resource, err := EtcHostsMachineConfig(instance.Spec.Domain, instance.Spec.APIIntIP, instance.Spec.GatewayDomains, instance.Spec.GatewayPrivateEndpointIP, role)
if err != nil {
return err
}
err = dynamichelper.SetControllerReferences([]kruntime.Object{resource}, &mcp)
if err != nil {
return err
}
resources = append(resources, resource)
}
err := dynamichelper.Prepare(resources)
if err != nil {
return err
}

@@ -891,7 +892,7 @@ var _ = Describe("ARO Operator - etchosts", func() {
}).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())
}

if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) {
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) && instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reach this condition?
The default value of etchostsManaged is true. If we don't change the value elsewhere, we won't reach here.
Also it runs just once in each e2e test run, so only one of those conditions can be run, and others can't.

I think we should split this test cases into multiple test cases according to operator flags and test each behaviour in each test case not using conditional branches.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the test as-is, because it is future-proof to our E2E test expanding out from the rigid structure it is in today (e.g. us performing permutation tests on various clusters with different features set in each one).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of an issue of the E2E suite not being extensive and less about the test itself. The condition is correct, because when; 1) etchostsManaged is set to false the MachineConfigs are deleted and 2) etchostsManaged is set to true the MachineConfigs are created.

If we aren't reaching this in E2E test suite its because we need to rethink how we are running these tests.

Copy link
Collaborator

@bitoku bitoku Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we change operator flags before running each test case?
I think we can because we can change cluster operator flags by running oc edit command.
If we can change the flags then, we can set them before each test, and we can test all scenarios.

Each test case should focus on one scenario, and this test case doesn't follow the principle.
Otherwise the purpose of the test cases will be unclear.
If we had to wait for the change in how to run E2E, we can rewrite this test case like below:

It("must not reconcile machine configs when enabled == false", func(ctx context.Context) {
  if instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsEnabled) {
    Skip("skipping because enabled == true")
  }
  // ...
}
It("must reconcile machine configs when managed == true and ...", func(ctx context.Context) {
  if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) || 
     !instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) {
    Skip("skipping because managed == false || forceReconciliation == false")
  }
  // ...
}
It("must remove machine configs when managed == false and ...", func(ctx context.Context) {
  if instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) || 
     !instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) {
      Skip("skipping because managed == true || forceReconciliation == false")
  }
  // ...
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. This doesn't show the actual functionality of the controller.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which question is No? You mean we can't change operator flags before running each test case?

Also I want you to explain which part is wrong to get better understanding.

SudoBrendan
SudoBrendan previously approved these changes Sep 12, 2024
Copy link
Collaborator

@SudoBrendan SudoBrendan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - if you can remove the unnecessary Client from that function header, all the better. Thanks Lisa, great work!

@@ -141,7 +156,7 @@ func (r *EtcHostsMachineConfigReconciler) SetupWithManager(mgr ctrl.Manager) err
Complete(r)
}

func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, mcps ...mcv1.MachineConfigPool) error {
func reconcileMachineConfigs(ctx context.Context, instance *arov1alpha1.Cluster, role string, dh dynamichelper.Interface, c client.Client, allowReconcile bool, mcps ...mcv1.MachineConfigPool) error {
Copy link
Collaborator

@SudoBrendan SudoBrendan Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the Client stuff, if it can be removed let's do that.

My best guess is that Lisa was following convention in other controllers: allowReconcile only matters on calls to Ensure, a "problem" function in this space in regards to upgrades, so I can see why the code was implemented this way to get it across the line. I think if we wanted to make this more clear, we could have some follow-up work to move Ensure into Reconcile (probably for every controller - not just this one), with all other functions building up a resources object to pass into Ensure - how does that sound? That would also prevent us from having to pass dh all over the place. Alternatively, we could have the dh object itself have an allowReconcile property, so that Ensure will know for itself when it should/shouldn't apply resources into a cluster... many choices, but they seem out of scope for this PR.

@@ -891,7 +892,7 @@ var _ = Describe("ARO Operator - etchosts", func() {
}).WithContext(ctx).WithTimeout(DefaultEventuallyTimeout).Should(Succeed())
}

if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) {
if !instance.Spec.OperatorFlags.GetSimpleBoolean(etchostsManaged) && instance.Spec.OperatorFlags.GetSimpleBoolean(forceReconciliation) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer the test as-is, because it is future-proof to our E2E test expanding out from the rigid structure it is in today (e.g. us performing permutation tests on various clusters with different features set in each one).

@lranjbar
Copy link
Collaborator Author

lranjbar commented Sep 12, 2024

When I did functional testing I found that calling client.Create caused reboots so I took it out. My guess is that for the dnsmasq controller, we actually dont have clusters missing the aro-dns MachineConfig so it didnt trip during that testing. I can remove the client entirely now yeah.

I've removed the unused client. Now that we aren't calling create we don't need to pass it along.

@lranjbar
Copy link
Collaborator Author

/azp run ci,e2e

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@SudoBrendan
Copy link
Collaborator

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@jhoreman jhoreman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, thanks for this!

@SudoBrendan
Copy link
Collaborator

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@hlipsig
Copy link
Contributor

hlipsig commented Sep 16, 2024

/azp run e2e

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@hlipsig
Copy link
Contributor

hlipsig commented Sep 16, 2024

e2e is panicking in the pre-test suite. I've kicked it off again.

@hlipsig
Copy link
Contributor

hlipsig commented Sep 17, 2024

e2e is passed. Merging.

@hlipsig hlipsig merged commit 48ff446 into master Sep 17, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
loki next-release To be included in the next RP release rollout ready-for-review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants