Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Startup Taint Removal Feature #1588

Merged
merged 1 commit into from
May 15, 2023

Conversation

ConnorJC3
Copy link
Contributor

@ConnorJC3 ConnorJC3 commented May 3, 2023

Is this a bug fix or adding new feature?

New feature

What is this PR about? / Why do we need it?

Implements a feature to remove a taint on driver startup to alleviate potential race conditions. Supercedes #1581, all credit for the design and initial implementation to @gtxu.

This PR differs from the original in a few meaningful ways:

  • The CLI option is removed, users do not need to enable this feature (it is always enabled)
  • Because of this, failing to connect to the k8s API is a soft failure that only logs on log level 4+ (because otherwise it would inaccurately log an error in scenarios such as using alternative COs like Nomad)

What testing is done?

CI/Manual/New unit tests added

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 3, 2023
@k8s-ci-robot k8s-ci-robot requested review from gtxu and torredil May 3, 2023 14:20
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 3, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label May 3, 2023
@ConnorJC3 ConnorJC3 force-pushed the startup-taint branch 2 times, most recently from cb25723 to c710a9a Compare May 3, 2023 14:31
@ConnorJC3 ConnorJC3 requested review from rdpsin and hanyuel and removed request for gtxu May 3, 2023 14:33
return nil
}

patchRemoveTaints := []JSONPatch{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@torredil mentioned this, but can we use https://pkg.go.dev/k8s.io/kubernetes/pkg/util/taints#RemoveTaint instead of JSON patch?

Copy link
Contributor Author

@ConnorJC3 ConnorJC3 May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally thought that was possible, but sadly, not really:

Firstly, RemoveTaint requires a specific effect, so we'd have to repeat it for all three Effects (and the code wouldn't be very future proof, we'd have to update it whenever a new taint effect is added).

Secondly, and more importantly, RemoveTaint only updates the local representation of the node, it doesn't actually make any calls to the k8s API. Thus, you then need to push the entire node object, which (1) is way less efficient than just patching the taints and (2) possibly even introduces a race condition (if the node is modified by something else between us downloading it and attempting to update it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there concerns with this implementation?

node.Spec.Taints = taintsToKeep
err = clientset.CoreV1().Nodes().Update(context.Background(), node, metav1.UpdateOptions{})
if err != nil {
	return err
}
return nil

Copy link
Contributor Author

@ConnorJC3 ConnorJC3 May 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see my comment above:

Thus, you then need to push the entire node object, which (1) is way less efficient than just patching the taints and (2) possibly even introduces a race condition (if the node is modified by something else between us downloading it and attempting to update it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(2) possibly even introduces a race condition (if the node is modified by something else between us downloading it and attempting to update it).

K8s requires that objects submitted in update requests contain a resourceVersion, the K8s API server verifies this field and will reject the request if there is a conflict which should prevent race conditions.

(1) is way less efficient than just patching the taints.

👍

/lgtm

Copy link
Contributor

@rdpsin rdpsin May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating the patch manually, can we DeepCopy the Node object, delete the taints, and use strategicpatch.CreateTwoWayMergePatch to create the patch data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could although I'd rather not unless we have a strong reason to because it eats an unnecessary amount of RAM and CPU creating a second copy of the node, and transforming both to JSON for the patch.

That said, I'm open to either option: I'll leave this open for other feedback.

@rdpsin
Copy link
Contributor

rdpsin commented May 3, 2023

Why did we remove this CLI option?

@ConnorJC3
Copy link
Contributor Author

Why did we remove this CLI option?

Adding a knob to enable/disable this feature is unnecessary, because it has no impact (besides a single call to the k8s API on startup, which we already make for node metadata anyways) on users that don't use it.

@rdpsin
Copy link
Contributor

rdpsin commented May 3, 2023

I'd still prefer an option for the feature, even if the default is behavior is true. There may be cases where customers may want to explicitly opt out of the feature.

@ConnorJC3
Copy link
Contributor Author

/retest

1 similar comment
@ConnorJC3
Copy link
Contributor Author

/retest

pkg/driver/node.go Outdated Show resolved Hide resolved
pkg/driver/node.go Outdated Show resolved Hide resolved
return nil
}

patchRemoveTaints := []JSONPatch{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there concerns with this implementation?

node.Spec.Taints = taintsToKeep
err = clientset.CoreV1().Nodes().Update(context.Background(), node, metav1.UpdateOptions{})
if err != nil {
	return err
}
return nil

@torredil
Copy link
Member

torredil commented May 3, 2023

Regarding the CLI option, this is a good mental model to follow:

@wmesard: Every configuration option represents a failure of the software to do the right thing automatically. Every configuration option needs to be documented and protected by unit tests, thereby increasing the cognitive load of user and developer alike. Sometimes they are necessary, but only as a last resort.

Co-authored-by: Gengtao Xu <[email protected]>
Signed-off-by: Connor Catlett <[email protected]>
@ConnorJC3
Copy link
Contributor Author

/retest

@hanyuel
Copy link
Contributor

hanyuel commented May 5, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2023
@torredil
Copy link
Member

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2023
@ConnorJC3
Copy link
Contributor Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants