
TestClusterResourceSetReconciler test is flaky #10854

Closed
Sunnatillo opened this issue Jul 10, 2024 · 15 comments
Assignees
Labels
area/ci Issues or PRs related to ci kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Sunnatillo
Contributor

Sunnatillo commented Jul 10, 2024

Which jobs are flaking?

capi-test-main

Which tests are flaking?

TestClusterResourceSetReconciler

Since when has it been flaking?

Most likely after merging this PR: #10656

Testgrid link

edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1810533764078505984

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. area/ci Issues or PRs related to ci needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 10, 2024
@sbueringer
Member

cc @fabriziopandini Should be the same as the one I mentioned last time

@sbueringer
Member

@Sunnatillo I think the link is pointing to the wrong job

@Sunnatillo
Contributor Author

> @Sunnatillo I think the link is pointing to the wrong job

Fixed now.
Also another occurrence:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-main/1807033132835147776

@fabriziopandini
Member

@jimmidyson PTAL
I will also take a look, because I recently made changes to ownerReference management in the same area of the code base.

It is important to nail this down before release

@jimmidyson
Member

I can't see results before 30th June to check for flakiness before #10756 was merged on 28th June. Is that data available? The first flake appeared in f57b8c8, which was merged after that PR too, but as I said, I can't tell whether the build was stable before that PR or not 😞

@chrischdi
Member

This page here should allow you to go further back:

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main?buildId=

As an alternative, you could try filtering at k8s-triage.

@jimmidyson
Member

Thanks @chrischdi!

I've gone back to the time of merge of #10656 and I can only see failures for this test after #10756 was merged so I'm pretty sure that introduced the flakiness. I'll take a closer look if I can find time to try to help figure out what's going on.

@Sunnatillo
Contributor Author

Sunnatillo commented Jul 15, 2024

Increasing the timeout helps to solve the issue.
I spent some time debugging this and found that the resources do eventually get created, but sometimes the timeout is not long enough.

@fabriziopandini
Member

fabriziopandini commented Jul 15, 2024

The fact that a ClusterResourceSet binding takes so long to reach a stable state isn't ideal.

The issue is that we re-queue on API conflicts, so subsequent reconciliations are affected by the exponential backoff delay, which grows quickly (plus the other side of the coin: many reconciliations happen in a very short sequence at the beginning of the backoff sequence).

TL;DR exponential backoff should be used to handle errors, not to pace how controllers reach a stable state.

I have submitted #10869 to get rid of the exponential backoff and all of the noise that API conflicts were adding to the logs, and documented the problem.

But in this case too (like the timeout increase in the test), this is only a mitigation.

@fabriziopandini fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 17, 2024
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024
@sbueringer
Member

Let's see if the test is stable now, and close the issue in 1-2 days if it is.

(xref: #10868 (comment))

@Sunnatillo
Contributor Author

Link to check for new occurrences: https://storage.googleapis.com/k8s-triage/index.html?text=Should_handle_applying_multiple_ClusterResourceSets_concurrently_to_the_same_cluster&job=.*cluster-api.*main&xjob=.*-e2e-.*%7C.*-provider-.*

It has not occurred today; it occurred often before the fix. We can close this issue and the PR.

@Sunnatillo
Contributor Author

Sunnatillo commented Jul 18, 2024

I ran the test 200 times and all runs passed. I would say we are safe to close this issue.
Before the fix, it was failing within 50 runs.

@sbueringer
Member

Great! Thx for testing

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

> Great! Thx for testing
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
