
TestClusterResourceSetReconciler test is flaky #10854

Closed
Sunnatillo opened this issue Jul 10, 2024 · 15 comments
Assignees
Labels
area/ci Issues or PRs related to ci kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Sunnatillo
Contributor

Sunnatillo commented Jul 10, 2024

Which jobs are flaking?

capi-test-main

Which tests are flaking?

TestClusterResourceSetReconciler

Since when has it been flaking?

Most likely after merging this PR: #10656

Testgrid link

edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main/1810533764078505984

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. area/ci Issues or PRs related to ci needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 10, 2024
@sbueringer
Member

cc @fabriziopandini Should be the same as the one I mentioned last time

@sbueringer
Member

@Sunnatillo I think the link is pointing to the wrong job

@Sunnatillo
Contributor Author

> @Sunnatillo I think the link is pointing to the wrong job

Fixed now.
Also another occurrence:
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-main/1807033132835147776

@fabriziopandini
Member

@jimmidyson PTAL
I will also take a look, because I recently made changes to ownerReference management in the same area of the code base.

It is important to nail this down before release

@jimmidyson
Member

I can't see results before 30th June to check for flakiness before #10756 was merged on 28th June. Is that data available? The first flake appeared in f57b8c8, which was merged after that PR too, but as I said, I can't tell whether the build was stable before that PR or not 😞

@chrischdi
Member

This page here should allow you to go further back:

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-mink8s-main?buildId=

As an alternative, you could try filtering at k8s-triage.

@jimmidyson
Member

Thanks @chrischdi!

I've gone back to the time of merge of #10656 and I can only see failures for this test after #10756 was merged so I'm pretty sure that introduced the flakiness. I'll take a closer look if I can find time to try to help figure out what's going on.

@Sunnatillo
Contributor Author

Sunnatillo commented Jul 15, 2024

Increasing the timeout helps to solve the issue.
I spent some time debugging this and found that the resources do eventually get created, but sometimes the timeout is not long enough.

@fabriziopandini
Member

fabriziopandini commented Jul 15, 2024

The fact that a ClusterResourceSet binding takes so long to reach a stable state isn't ideal.

The issue is that we re-queue on API conflicts, so subsequent reconciliations are affected by the exponential backoff delay, which grows quickly (plus the other side of the coin: many reconciliations happen in a very short sequence at the beginning of the backoff sequence).

TL;DR exponential backoff should be used to handle errors, not to pace how controllers reach a stable state.

I have submitted #10869 to get rid of the exponential backoff and all of the noise that API conflicts were adding to the logs, and documented the problem.

But in this case too (like the timeout increase in the test), this is only a mitigation.

@fabriziopandini fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 17, 2024
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024
@sbueringer
Member

Let's see if the test is stable now, and close the issue in 1-2 days if it is.

(xref: #10868 (comment))

@Sunnatillo
Contributor Author

Link to check for new occurrences: https://storage.googleapis.com/k8s-triage/index.html?text=Should_handle_applying_multiple_ClusterResourceSets_concurrently_to_the_same_cluster&job=.*cluster-api.*main&xjob=.*-e2e-.*%7C.*-provider-.*

It has not occurred today; it occurred often before the fix. We can close this issue and the PR.

@Sunnatillo
Contributor Author

Sunnatillo commented Jul 18, 2024

I ran the test 200 times and all runs passed. I would say we are safe to close this issue.
Before the fix, it was failing within 50 runs.

@sbueringer
Member

Great! Thx for testing

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

> Great! Thx for testing
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
