[Bug] 409 conflict error may occur when updating cr in the test #902

Closed
2 tasks done
Yicheng-Lu-llll opened this issue Feb 9, 2023 · 0 comments · Fixed by #904
Labels
bug Something isn't working

Yicheng-Lu-llll commented Feb 9, 2023

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ci

What happened + What you expected to happen

When make test is run repeatedly, the following error occurs intermittently (the 409 error appeared 14 times in 100 runs):




                Status: "Failure",
                Message: "Operation cannot be fulfilled on rayservices.ray.io \"rayservice-sample\": the object has been modified; please apply your changes to the latest version and try again",
                Reason: "Conflict",
                Details: {
                    Name: "rayservice-sample",
                    Group: "ray.io",
                    Kind: "rayservices",
                    UID: "",
                    Causes: nil,
                    RetryAfterSeconds: 0,
                },
                Code: 409,

From the k8s api-conventions document:

Kubernetes leverages the concept of resource versions to achieve optimistic concurrency. All Kubernetes resources have a "resourceVersion" field as part of their metadata. This resourceVersion is a string that identifies the internal version of an object that can be used by clients to determine when objects have changed. When a record is about to be updated, its version is checked against a pre-saved value, and if it doesn't match, the update fails with a StatusConflict (HTTP status code 409).
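To make the quoted mechanism concrete, here is a minimal stdlib-only sketch of the version check. It is a toy simulation, not the API server's code: `Store`, `Object`, and `ErrConflict` are hypothetical names, and the integer counter is a simplification (real resourceVersion values are opaque strings that clients should not parse).

```go
package main

import (
	"errors"
	"strconv"
	"sync"
)

// ErrConflict stands in for the 409 StatusConflict returned by the API server.
var ErrConflict = errors.New("conflict: the object has been modified")

// Object is a hypothetical stand-in for a Kubernetes resource.
type Object struct {
	ResourceVersion string
	Replicas        int
}

// Store is a toy in-memory "API server" that holds a single object.
type Store struct {
	mu  sync.Mutex
	obj Object
}

// NewStore returns a store whose object starts at resourceVersion "1".
func NewStore() *Store {
	return &Store{obj: Object{ResourceVersion: "1"}}
}

// Get returns a copy of the stored object, including its current version.
func (s *Store) Get() Object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.obj
}

// Update writes obj only if its ResourceVersion matches the stored one.
// On success the stored version is bumped; on mismatch ErrConflict is returned.
func (s *Store) Update(obj Object) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if obj.ResourceVersion != s.obj.ResourceVersion {
		return ErrConflict
	}
	next, err := strconv.Atoi(s.obj.ResourceVersion)
	if err != nil {
		return err
	}
	obj.ResourceVersion = strconv.Itoa(next + 1)
	s.obj = obj
	return nil
}
```

If two callers each Get the object and then both call Update, the first Update succeeds and the second gets ErrConflict, because its copy still carries the old version. That is the same shape as the failure in the test output above.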

This error is most likely caused by another writer (such as another client) modifying the object between the test's last Get and its Update, so the resourceVersion the test submits is already stale.

From the k8s api-conventions document:

In the case of a conflict, the correct client action at this point is to GET the resource again, apply the changes afresh, and try submitting again.

This suggests a retry strategy (although the document places more emphasis on reading first and then writing).
Applying this strategy in the operator is questionable (it may be better to simply fail the update and let the next reconciliation decide), but in my view it is a good fit for the tests.

So a possible solution (open to discussion) is to wrap every update operation in the tests with RetryOnConflict.
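The GET, re-apply, and resubmit loop can be sketched with the standard library alone. This is an illustration of the pattern, not client-go's implementation: `retryOnConflict`, `fakeServer`, `resource`, and `errConflict` are hypothetical names, and the sketch uses a fixed attempt count instead of the backoff that client-go's retry.RetryOnConflict takes.

```go
package main

import "errors"

// errConflict stands in for a 409 conflict error from the API server.
var errConflict = errors.New("conflict")

// resource is a hypothetical object with a version used for the conflict check.
type resource struct {
	version int
	value   string
}

// fakeServer is a toy stand-in for the API server holding one resource.
type fakeServer struct {
	current resource
}

// get returns a copy of the current resource.
func (f *fakeServer) get() resource {
	return f.current
}

// update succeeds only when r carries the current version, then bumps it.
func (f *fakeServer) update(r resource) error {
	if r.version != f.current.version {
		return errConflict
	}
	r.version++
	f.current = r
	return nil
}

// retryOnConflict calls fn up to maxAttempts times. It retries only when fn
// returns a conflict; nil and any other error are returned immediately.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = fn()
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}
```

The important detail is that fn must fetch the object again on every attempt and apply the change to that fresh copy. Retrying the Update with the same stale object would fail with a conflict every time.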

How do others deal with Update?

  1. client-go already provides a helper function, RetryOnConflict, along with an example that uses the strategy above. RetryOnConflict retries until its backoff is exhausted or the function returns something other than a 409 conflict error.

  2. azure-databricks-operator (also link) takes the plain approach: no retry, just call Update and check whether it succeeded.

Reproduction script

Run make test several times.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!