Log error in remediation #128

Merged: 2 commits, Feb 18, 2024

mshitrit (Member) commented Feb 13, 2024

ECOPROJECT-1505

  • Retry in case of an update conflict, and log it as info instead of as an error
  • Add a "generate all" make job

openshift-ci bot (Contributor) commented Feb 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mshitrit (Member Author) commented

/test ?

openshift-ci bot (Contributor) commented Feb 13, 2024

@mshitrit: The following commands are available to trigger required jobs:

  • /test 4.12-ci-bundle-my-bundle
  • /test 4.12-images
  • /test 4.12-openshift-e2e
  • /test 4.12-test
  • /test 4.13-ci-bundle-my-bundle
  • /test 4.13-images
  • /test 4.13-openshift-e2e
  • /test 4.13-test
  • /test 4.14-ci-bundle-my-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test
  • /test 4.15-ci-bundle-my-bundle
  • /test 4.15-images
  • /test 4.15-openshift-e2e
  • /test 4.15-test

Use /test all to run all jobs.

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mshitrit (Member Author) commented

/test 4.15-openshift-e2e

@mshitrit mshitrit changed the title [WIP] Log error in remediation Log error in remediation Feb 13, 2024
if err := utils.RemoveTaint(r.Client, far.Name, taint); err != nil {
	if apiErrors.IsConflict(err) {
		r.Log.Info("Failed to remove taint from node due to node update, retrying... ,", "node name", node.Name, "taint key", taint.Key, "taint effect", taint.Effect)
		return ctrl.Result{RequeueAfter: time.Second}, nil
Member

Why requeue after one second rather than immediately? This conflict should not happen multiple times, right?

Member Author

It gives the update a bit of time to finish; there is no point in stressing the system and spamming the log.

Member

But how frequently does it happen, and will it happen several times in a row if we don't wait? (That is, if you managed to reproduce this race.)
My only concern is interfering with the exponential back-off, which we "lose" when we set an exact requeue time. I'm not sure it is relevant, though, as this back-off would only kick in after an earlier error that is probably unrelated to removing taints.
Up to you; waiting one second is still a reasonable time for fast FAR execution.

Member Author

Well, if we want to use the exponential back-off mechanism we must return an error, which will also be logged as such.
IIUC, logging an error is something we want to avoid, since this behavior is expected and not uncommon.

Member

"Requeue: True also results in rate limited requeue"

I agree, this is what I originally meant. But IIRC Michael suspects that this error is recurrent (I haven't seen proof of that); it seems more like a temporary error to me.

sed -r -i "s|createdAt: .*|createdAt: \"\"|;" ${BUNDLE_CSV}

.PHONY: full-gen
full-gen: go-verify manifests generate manifests fmt bundle fix-imports bundle-reset ## generates all automatically generated content
Member

NIT: there is a redundant space between manifests and generate

razo7 (Member) commented Feb 15, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Feb 15, 2024
@mshitrit mshitrit marked this pull request as ready for review February 18, 2024 07:51
@openshift-merge-bot openshift-merge-bot bot merged commit 53e506d into medik8s:main Feb 18, 2024
18 checks passed