
[Flaking test] Kueue when Creating a Job With Queueing Should run with prebuilt workload #3051

Open
mimowo opened this issue Sep 16, 2024 · 7 comments

mimowo (Contributor) commented Sep 16, 2024

What happened:

The test flaked: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-kueue-test-e2e-main-1-28/1834993639869124608


What you expected to happen:

No flakes.

How to reproduce it (as minimally and precisely as possible):

Repeat the CI build.

Anything else we need to know?:

End To End Suite: kindest/node:v1.28.9: [It] Kueue when Creating a Job With Queueing Should run with prebuilt workload (6s)

```
[FAILED] Timed out after 5.000s.
The function passed to Eventually failed at /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/singlecluster/e2e_test.go:185 with:
Expected
    <[]v1.OwnerReference | len:0, cap:0>: nil
to contain element matching
    <*matchers.BeComparableToMatcher | 0xc0005792c0>: {
        Expected: <v1.OwnerReference>{
            APIVersion: "",
            Kind: "",
            Name: "test-job",
            UID: "d2e2bd31-0087-4759-a306-253b308d837f",
            Controller: nil,
            BlockOwnerDeletion: nil,
        },
        Options: [
            <*cmp.pathFilter | 0xc0006aa870>{
                core: {},
                fnc: 0x633f20,
                opt: <cmp.ignore>{core: {}},
            },
        ],
    }
In [It] at: /home/prow/go/src/kubernetes-sigs/kueue/test/e2e/singlecluster/e2e_test.go:190 @ 09/14/24 16:39:35.489
```
mimowo added the kind/bug label Sep 16, 2024
mimowo (Contributor, Author) commented Sep 16, 2024

/kind flake
/cc @mbobrovskyi @trasc

k8s-ci-robot added the kind/flake label Sep 16, 2024
IrvingMg (Contributor) commented:
/assign

IrvingMg (Contributor) commented:
Digging into this issue, the cause seems to be that the objects are processed out of order: the prebuilt Workload is created after the Job that is supposed to adopt it, as the logs show:

```
2024-09-14T16:39:30.49936039Z stderr F 2024-09-14T16:39:30.499137821Z	LEVEL(-2)	jobframework/reconciler.go:313	Reconciling Job	{"controller": "job", "controllerGroup": "batch", "controllerKind": "Job", "Job": {"name":"test-job","namespace":"e2e-n4shr"}, "namespace": "e2e-n4shr", "name": "test-job", "reconcileID": "80a0f899-0e23-438a-9956-b95691055017", "job": "e2e-n4shr/test-job", "gvk": "batch/v1, Kind=Job"}
2024-09-14T16:39:30.49939532Z stderr F 2024-09-14T16:39:30.499216661Z	LEVEL(-3)	jobframework/reconciler.go:381	The workload is nil, handle job with no workload	{"controller": "job", "controllerGroup": "batch", "controllerKind": "Job", "Job": {"name":"test-job","namespace":"e2e-n4shr"}, "namespace": "e2e-n4shr", "name": "test-job", "reconcileID": "80a0f899-0e23-438a-9956-b95691055017", "job": "e2e-n4shr/test-job", "gvk": "batch/v1, Kind=Job"}
2024-09-14T16:39:30.49962255Z stderr F 2024-09-14T16:39:30.49943252Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:563	Workload create event	{"workload": {"name":"prebuilt-wl","namespace":"e2e-n4shr"}, "queue": "main", "status": "pending"}
```

A fix could be to set up a retry for the retrieval of the workload (see the sketch below). However, during my test runs I found that this test only seems to flake on Kubernetes 1.28, which reaches End of Life at the end of October.
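A minimal sketch of that retry, assuming the Job references the prebuilt Workload through the `kueue.x-k8s.io/prebuilt-workload-name` label; this is not the actual Kueue code, and the type and function names are made up for illustration:

```go
package sketch

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// JobReconciler is an illustrative stand-in for the job framework reconciler.
type JobReconciler struct {
	client client.Client
}

// reconcilePrebuilt fetches the prebuilt Workload a Job points at. If the
// Workload does not exist yet (the out-of-order case seen in the logs above),
// it requeues instead of concluding that the Job has no workload.
func (r *JobReconciler) reconcilePrebuilt(ctx context.Context, job *batchv1.Job) (ctrl.Result, error) {
	wlName, hasPrebuilt := job.Labels["kueue.x-k8s.io/prebuilt-workload-name"]
	if !hasPrebuilt {
		return ctrl.Result{}, nil
	}
	var wl kueue.Workload
	err := r.client.Get(ctx, client.ObjectKey{Namespace: job.Namespace, Name: wlName}, &wl)
	switch {
	case apierrors.IsNotFound(err):
		// The Workload may simply not have been created yet; retry shortly.
		return ctrl.Result{RequeueAfter: time.Second}, nil
	case err != nil:
		return ctrl.Result{}, err
	}
	// ... adopt wl: set the owner reference to the Job, etc. ...
	return ctrl.Result{}, nil
}
```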

@mimowo @alculquicondor WDYT?

mimowo (Contributor, Author) commented Sep 18, 2024

Can you investigate whether this is just a test issue or if it can also affect the runtime?

alculquicondor (Contributor) commented:
Interesting, I would think that this could happen in any k8s version. Maybe it's just very hard to reproduce in general?

In any case, it sounds like this could happen in production. The solution should be to trigger another job sync when the corresponding Workload object appears. Don't we have event handlers for that?

IrvingMg (Contributor) commented:
> Interesting, I would think that this could happen in any k8s version. Maybe it's just very hard to reproduce in general?
>
> In any case, it sounds like this could happen in production. The solution should be to trigger another job sync when the corresponding Workload object appears. Don't we have event handlers for that?

Yes, we have a watcher for workloads and batch jobs, and we could do the same for jobs waiting for a prebuilt workload (see the sketch below). To do that, though, we would need to add a watcher for every job type that supports prebuilt workloads.
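For illustration, such a watch could be wired with controller-runtime roughly like this (signatures follow controller-runtime v0.15+; the linear label scan and all names are placeholders, not the actual Kueue wiring):

```go
package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// JobReconciler is an illustrative stand-in for the job framework reconciler.
type JobReconciler struct {
	client client.Client
}

// Reconcile is a stub so the sketch satisfies reconcile.Reconciler.
func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager watches Workloads in addition to Jobs, so that a Workload
// create event re-enqueues any Job whose prebuilt-workload label names it.
func (r *JobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&batchv1.Job{}).
		Watches(&kueue.Workload{}, handler.EnqueueRequestsFromMapFunc(
			func(ctx context.Context, wl client.Object) []reconcile.Request {
				// Enqueue every Job in the namespace that points at this
				// Workload. A field index would avoid this linear scan.
				var jobs batchv1.JobList
				if err := r.client.List(ctx, &jobs, client.InNamespace(wl.GetNamespace())); err != nil {
					return nil
				}
				var reqs []reconcile.Request
				for i := range jobs.Items {
					if jobs.Items[i].Labels["kueue.x-k8s.io/prebuilt-workload-name"] == wl.GetName() {
						reqs = append(reqs, reconcile.Request{
							NamespacedName: client.ObjectKeyFromObject(&jobs.Items[i]),
						})
					}
				}
				return reqs
			})).
		Complete(r)
}
```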

Another option would be implementing the retry, since that would only require modifying the job framework reconciler.

alculquicondor (Contributor) commented:
I think we can add an indexer on the Job for the prebuilt-workload name, so that it can be used to find the Job for a Workload when the Workload appears; a sketch follows.
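A hedged sketch of that indexer using controller-runtime's FieldIndexer (the index key and helper names are invented for illustration):

```go
package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// prebuiltWorkloadIndexKey is an arbitrary index name used below.
const prebuiltWorkloadIndexKey = "job-prebuilt-workload-name"

// setupIndexes registers a field index mapping each Job to the name of the
// prebuilt Workload it references; typically called with mgr.GetFieldIndexer().
func setupIndexes(ctx context.Context, indexer client.FieldIndexer) error {
	return indexer.IndexField(ctx, &batchv1.Job{}, prebuiltWorkloadIndexKey,
		func(obj client.Object) []string {
			if name, ok := obj.GetLabels()["kueue.x-k8s.io/prebuilt-workload-name"]; ok {
				return []string{name}
			}
			return nil
		})
}

// jobsForWorkload replaces the linear scan in the watch handler: it lists
// only the Jobs whose indexed prebuilt-workload name matches the Workload.
func jobsForWorkload(ctx context.Context, c client.Client, wl client.Object) []reconcile.Request {
	var jobs batchv1.JobList
	if err := c.List(ctx, &jobs, client.InNamespace(wl.GetNamespace()),
		client.MatchingFields{prebuiltWorkloadIndexKey: wl.GetName()}); err != nil {
		return nil
	}
	reqs := make([]reconcile.Request, 0, len(jobs.Items))
	for i := range jobs.Items {
		reqs = append(reqs, reconcile.Request{NamespacedName: client.ObjectKeyFromObject(&jobs.Items[i])})
	}
	return reqs
}
```

With the index in place, the map function from the previous sketch can simply return `jobsForWorkload(ctx, r.client, wl)`.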
