[Bug] "should be able to update all Pods to Running" is still flaky. #894

Closed
kevin85421 opened this issue Feb 4, 2023 · 4 comments · Fixed by #898
Labels
bug Something isn't working

Comments

@kevin85421
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ci

What happened + What you expected to happen

After #893 was merged, this test is still flaky (see the link for more details). Only "should be able to update all Pods to Running" fails; all Pods do become Running before the start of the test "cluster's .status.state should be updated to 'ready' shortly after all Pods are Running", so that later test still passes. We may need to increase the timeout of the failing test.

@Yicheng-Lu-llll will update the test and run it more times to check its stability.
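For reference, "increase the timeout" would roughly mean widening the Eventually window in the flaky assertion, along these lines (a sketch only; countRunningPods is a hypothetical helper and the 60-second value is illustrative, not the actual test code):

	// Sketch only: give the controller more time to move every Pod to Running
	// before the assertion fails. countRunningPods is a hypothetical helper
	// that lists the cluster's Pods and counts those in the Running phase.
	Eventually(func() (int, error) {
		return countRunningPods(ctx, k8sClient, "default", rayCluster.Name)
	}, 60*time.Second, 500*time.Millisecond).Should(Equal(expectedPodCount))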

Reproduction script

Run make test several times.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Feb 4, 2023
@davidxia
Contributor

davidxia commented Feb 4, 2023

Dang, this is unfortunate. I wonder if there's a way to get the true status of the Pods more reliably, either by forcing a local cache refresh or by bypassing the cache. Or maybe checking that the Pod statuses are Running isn't needed, and "cluster's .status.state should be updated to 'ready' shortly after" is the only test we need to keep.

@Yicheng-Lu-llll
Contributor

Yicheng-Lu-llll commented Feb 5, 2023

Dang, this is unfortunate. I wonder if there's a way to get the true status of the Pods more reliably, either by forcing a local cache refresh or by bypassing the cache. Or maybe checking that the Pod statuses are Running isn't needed, and "cluster's .status.state should be updated to 'ready' shortly after" is the only test we need to keep.

It seems that there is a way to do that.

The client.New doc says:

New returns a new Client using the provided config and Options. The returned client reads and writes directly from the server (it doesn't use object caches). It understands how to work with normal types (both custom resources and aggregated/built-in resources), as well as unstructured types.

The kubebuilder writing-tests doc says:

Note that we set up both a “live” k8s client and a separate client from the manager. This is because when making assertions in tests, you generally want to assert against the live state of the API server. If you use the client from the manager (k8sManager.GetClient), you’d end up asserting against the contents of the cache instead, which is slower and can introduce flakiness into your tests. We could use the manager’s APIReader to accomplish the same thing, but that would leave us with two clients in our test assertions and setup (one for reading, one for writing), and it’d be easy to make mistakes.
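(For context, the manager's APIReader mentioned above would look roughly like this; it also bypasses the cache, but it leaves the tests with two clients. This is only a sketch, and headPodName is a hypothetical variable:)

	// Sketch: mgr.GetAPIReader() returns a client.Reader that reads directly
	// from the API server instead of from the manager's cache.
	apiReader := k8sManager.GetAPIReader()
	pod := &corev1.Pod{}
	err := apiReader.Get(ctx, client.ObjectKey{Namespace: "default", Name: headPodName}, pod)
	Expect(err).ToNot(HaveOccurred())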

So the idea is to use a k8sClient created from client.New instead of one from k8sManager.GetClient.

In the current implementation, we use a k8sClient from k8sManager.GetClient:

	k8sClient = mgr.GetClient()
	Expect(k8sClient).ToNot(BeNil())

Thus, I removed the k8sClient from k8sManager.GetClient above and used a k8sClient created from client.New instead:

	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).ToNot(HaveOccurred())
	Expect(k8sClient).ToNot(BeNil())
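With the live client, test assertions read the true state from the API server, for example (the Pod name here is illustrative):

	// Sketch: this Get goes straight to the API server, so the assertion sees
	// the Pod's true phase rather than a possibly stale cached copy.
	pod := &corev1.Pod{}
	Expect(k8sClient.Get(ctx, client.ObjectKey{Namespace: "default", Name: headPodName}, pod)).To(Succeed())
	Expect(pod.Status.Phase).To(Equal(corev1.PodRunning))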

In a 2-core CPU, 7 GB RAM VM (to simulate GitHub's standard Linux runner), I ran the test 100 times:
the "[Bug] should be able to update all Pods to Running" failure never happened again.

If this makes sense (see also this example, which does a similar thing), I can make a PR.

Of the 100 runs, 86 succeeded and 14 failed with the following errors (not sure whether we should ignore this small proportion or dig into why, but I think it is a separate problem):
UPDATE: This is also because we need to use Eventually instead of Expect, given that creation may not happen immediately (see similar examples). I can also make a PR for this.

[Fail] Inside the default namespace When creating a raycluster [It] should update a raycluster object 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:279

[Fail] Inside the default namespace When creating a raycluster [It] should have only 1 running worker 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:286

[Fail] Inside the default namespace When creating a rayservice [It] should perform a zero-downtime update after a code change. 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/rayservice_controller_test.go:354

@kevin85421
Member Author

Thanks, @davidxia and @Yicheng-Lu-llll!

@Yicheng-Lu-llll

Using client.New() seems promising. One question: why do we still need to use Eventually after we use client.New() to read/write to the Kubernetes API Server directly? Does that mean writing to the Kubernetes API Server is asynchronous?

@Yicheng-Lu-llll
Contributor

Thanks, @davidxia and @Yicheng-Lu-llll!

@Yicheng-Lu-llll

Using client.New() seems promising. One question: why do we still need to use Eventually after we use client.New() to read/write to the Kubernetes API Server directly? Does that mean writing to the Kubernetes API Server is asynchronous?

According to this comment:

After creating this CronJob, let's check that the CronJob's Spec fields match what we passed in. Note that, because the k8s apiserver may not have finished creating a CronJob after our Create() call from earlier, we will use Gomega’s Eventually() testing function instead of Expect() to give the apiserver an opportunity to finish creating our CronJob.

writing to the Kubernetes API Server is asynchronous.
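So for the remaining failures, the fix would roughly be wrapping the Get in Eventually rather than asserting once with Expect, along these lines (a sketch only; the rayiov1alpha1 alias and the object name are illustrative):

	// Sketch: Expect(...) asserts exactly once and fails if the object has not
	// been persisted yet; Eventually(...) keeps retrying until the timeout.
	myRayCluster := &rayiov1alpha1.RayCluster{}
	Eventually(func() error {
		return k8sClient.Get(ctx, client.ObjectKey{Namespace: "default", Name: "raycluster-sample"}, myRayCluster)
	}, 10*time.Second, 500*time.Millisecond).Should(Succeed())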
