[Bug] "should be able to update all Pods to Running" is still flaky. #894

Closed
kevin85421 opened this issue Feb 4, 2023 · 4 comments · Fixed by #898
Labels
bug Something isn't working

Comments

@kevin85421
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ci

What happened + What you expected to happen

After #893 was merged, this test is still flaky (see the link for more details). Only "should be able to update all Pods to Running" fails; all Pods do become Running before the start of the test "cluster's .status.state should be updated to 'ready' shortly after all Pods are Running", so that later test still passes. We may need to increase the timeout of the failing test.

@Yicheng-Lu-llll will update the test and run it more times to check its stability.
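For reference, "increase the timeout" would roughly mean widening the Eventually window in the flaky assertion, along these lines (a sketch only; countRunningPods is a hypothetical helper and the 60-second value is illustrative, not the actual test code):

	// Sketch only: give the controller more time to move every Pod to Running
	// before the assertion fails. countRunningPods is a hypothetical helper
	// that lists the cluster's Pods and counts those in the Running phase.
	Eventually(func() (int, error) {
		return countRunningPods(ctx, k8sClient, "default", rayCluster.Name)
	}, 60*time.Second, 500*time.Millisecond).Should(Equal(expectedPodCount))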

Reproduction script

Run make test several times.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Feb 4, 2023
@davidxia
Contributor

davidxia commented Feb 4, 2023

Dang, this is unfortunate. I wonder if there's a way to get the true status of the Pods more reliably, either by forcing a local cache refresh or by bypassing the cache. Or maybe checking that the Pod statuses are Running isn't needed, and "cluster's .status.state should be updated to 'ready' shortly after" is the only test we need to keep.

@Yicheng-Lu-llll
Contributor

Yicheng-Lu-llll commented Feb 5, 2023

Dang, this is unfortunate. I wonder if there's a way to get the true status of the Pods more reliably, either by forcing a local cache refresh or by bypassing the cache. Or maybe checking that the Pod statuses are Running isn't needed, and "cluster's .status.state should be updated to 'ready' shortly after" is the only test we need to keep.

It seems that there is a way to do that.

The client.New doc says:

New returns a new Client using the provided config and Options. The returned client reads and writes directly from the server (it doesn't use object caches). It understands how to work with normal types (both custom resources and aggregated/built-in resources), as well as unstructured types.

The kubebuilder writing-tests doc says:

Note that we set up both a “live” k8s client and a separate client from the manager. This is because when making assertions in tests, you generally want to assert against the live state of the API server. If you use the client from the manager (k8sManager.GetClient), you’d end up asserting against the contents of the cache instead, which is slower and can introduce flakiness into your tests. We could use the manager’s APIReader to accomplish the same thing, but that would leave us with two clients in our test assertions and setup (one for reading, one for writing), and it’d be easy to make mistakes.
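(For context, the manager's APIReader mentioned above would look roughly like this; it also bypasses the cache, but it leaves the tests with two clients. This is only a sketch, and headPodName is a hypothetical variable:)

	// Sketch: mgr.GetAPIReader() returns a client.Reader that reads directly
	// from the API server instead of from the manager's cache.
	apiReader := k8sManager.GetAPIReader()
	pod := &corev1.Pod{}
	err := apiReader.Get(ctx, client.ObjectKey{Namespace: "default", Name: headPodName}, pod)
	Expect(err).ToNot(HaveOccurred())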

So the idea is to use a k8sClient created from client.New instead of one from k8sManager.GetClient.

In the current implementation, we use a k8sClient from k8sManager.GetClient:

	k8sClient = mgr.GetClient()
	Expect(k8sClient).ToNot(BeNil())

Thus, I removed the k8sClient from k8sManager.GetClient above and used a k8sClient created from client.New instead:

	k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
	Expect(err).ToNot(HaveOccurred())
	Expect(k8sClient).ToNot(BeNil())
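With the live client, test assertions read the true state from the API server, for example (the Pod name here is illustrative):

	// Sketch: this Get goes straight to the API server, so the assertion sees
	// the Pod's true phase rather than a possibly stale cached copy.
	pod := &corev1.Pod{}
	Expect(k8sClient.Get(ctx, client.ObjectKey{Namespace: "default", Name: headPodName}, pod)).To(Succeed())
	Expect(pod.Status.Phase).To(Equal(corev1.PodRunning))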

In a 2-core CPU, 7 GB RAM VM (to simulate GitHub's standard Linux runner), I ran the test 100 times:
the "[Bug] should be able to update all Pods to Running" failure never happened again.

If this makes sense (see also this example, which does a similar thing), I can make a PR.

Of the 100 runs, 86 succeeded and 14 failed with the following errors (not sure whether we should ignore this small proportion or dig into why, but I think it is a separate problem):
UPDATE: This is also because we need to use Eventually instead of Expect, given that creation may not happen immediately (see similar examples). I can also make a PR for this.

[Fail] Inside the default namespace When creating a raycluster [It] should update a raycluster object 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:279

[Fail] Inside the default namespace When creating a raycluster [It] should have only 1 running worker 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/raycluster_controller_test.go:286

[Fail] Inside the default namespace When creating a rayservice [It] should perform a zero-downtime update after a code change. 
/home/lyc/Desktop/kuberay/ray-operator/controllers/ray/rayservice_controller_test.go:354

@kevin85421
Member Author

Thanks, @davidxia and @Yicheng-Lu-llll!

@Yicheng-Lu-llll

Using client.New() seems promising. One question: why do we still need to use Eventually after we use client.New() to read/write to the Kubernetes API Server directly? Does that mean writing to the Kubernetes API Server is asynchronous?

@Yicheng-Lu-llll
Contributor

Thanks, @davidxia and @Yicheng-Lu-llll!

@Yicheng-Lu-llll

Using client.New() seems promising. One question: why do we still need to use Eventually after we use client.New() to read/write to the Kubernetes API Server directly? Does that mean writing to the Kubernetes API Server is asynchronous?

According to this comment:

After creating this CronJob, let's check that the CronJob's Spec fields match what we passed in. Note that, because the k8s apiserver may not have finished creating a CronJob after our Create() call from earlier, we will use Gomega’s Eventually() testing function instead of Expect() to give the apiserver an opportunity to finish creating our CronJob.

writing to the Kubernetes API Server is asynchronous.
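So for the remaining failures, the fix would roughly be wrapping the Get in Eventually rather than asserting once with Expect, along these lines (a sketch only; the rayiov1alpha1 alias and the object name are illustrative):

	// Sketch: Expect(...) asserts exactly once and fails if the object has not
	// been persisted yet; Eventually(...) keeps retrying until the timeout.
	myRayCluster := &rayiov1alpha1.RayCluster{}
	Eventually(func() error {
		return k8sClient.Get(ctx, client.ObjectKey{Namespace: "default", Name: "raycluster-sample"}, myRayCluster)
	}, 10*time.Second, 500*time.Millisecond).Should(Succeed())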
