
Allow E2E tests to run with arbitrary k8s cluster #1306

Merged: 1 commit into ray-project:master on Sep 13, 2023

Conversation

jiripetrlik (Contributor)

Why are these changes needed?

Currently the E2E tests run only with a Kind cluster. The goal of this PR is to make these tests more configurable and allow them to run with any Kubernetes cluster. It is now possible to log in to an arbitrary cluster using the kubectl/oc command and run the tests against it. For example:

oc login ...
EXTERNAL_CLUSTER=true RAY_IMAGE=rayproject/ray:2.5.0 OPERATOR_IMAGE=kuberay/operator:nightly python3 tests/compatibility-test.py

Related issue number

Closes #1284

Checks

  • [ ] I've made sure the tests are passing.
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Manual tests
    • [x] This PR is not tested :(

@jiripetrlik (Contributor Author)

@kevin85421
Hello, can you please provide initial feedback?

@kevin85421 self-requested a review on Aug 9, 2023
@kevin85421 self-assigned this on Aug 9, 2023

def create_kind_cluster(self, kind_config=None) -> None:
# def create_kind_cluster(self, kind_config=None) -> None:
Member:

Forgot to remove this line?

tests/framework/utils.py (resolved comment)
return self.client_dict

def cleanup(self, namespace = "default") -> None:
config.load_kube_config()
Member:

Why do we need to load_kube_config again if we have already loaded the config in initialize_cluster?

def __init__(self) -> None:
self.client_dict = {}

def client_dict(self):
Member:

This function seems to be unused.


def cleanup(self, namespace = "default") -> None:
config.load_kube_config()
api_extensions = client.ApiextensionsV1Api()
Member:

Using multiple Kubernetes client instances will cause flakiness. Please reuse the client from self.client_dict and ensure there is only one Kubernetes client instance in this process.
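For illustration, here is a minimal sketch of what that suggestion could look like, assuming a hypothetical ExternalClusterManager whose initialize_cluster loads the kubeconfig once and caches the clients in a dict. The names are illustrative, not the PR's actual utils.py code:

```python
from kubernetes import client, config

class ExternalClusterManager:
    """Sketch only: a cluster manager that reuses one set of K8s clients."""

    def __init__(self) -> None:
        self.k8s_client_dict = {}

    def initialize_cluster(self) -> None:
        # Load the kubeconfig exactly once and build every client here.
        config.load_kube_config()
        self.k8s_client_dict = {
            "CoreV1Api": client.CoreV1Api(),
            "AppsV1Api": client.AppsV1Api(),
            "CustomObjectsApi": client.CustomObjectsApi(),
            "ApiextensionsV1Api": client.ApiextensionsV1Api(),
        }

    def cleanup(self, namespace="default") -> None:
        # Reuse the clients built in initialize_cluster instead of
        # constructing fresh ones (and reloading the kubeconfig) here.
        api_extensions = self.k8s_client_dict["ApiextensionsV1Api"]
        crds = api_extensions.list_custom_resource_definition()
        # ... delete the test resources using the cached clients ...
```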

def client_dict(self):
return self.client_dict

def cleanup(self, namespace = "default") -> None:
Member:

Why not delete the cluster directly?

def client_dict(self):
return self.client_dict

def cleanup(self, namespace = "default") -> None:
Member:

We should also close Kubernetes clients and cleanup self.client_dict.
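A rough sketch of that suggestion, reusing the hypothetical k8s_client_dict from the sketch above; ApiClient.close() is available in recent releases of the kubernetes Python client (assumption):

```python
def cleanup(self, namespace="default") -> None:
    # ... delete the namespaced test resources first ...

    # Close every cached Kubernetes client and drop the references so a
    # later initialize_cluster can start from a clean state.
    for k8s_client in self.k8s_client_dict.values():
        k8s_client.api_client.close()  # each generated *Api wraps an ApiClient
    self.k8s_client_dict = {}
```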


time.sleep(15)

if "kuberay-operator" in [deployment.metadata.name for deployment in apps_v1.list_namespaced_deployment(namespace).items]:
Member:

Why do we need this if we will run helm uninstall below?

apps_v1 = client.AppsV1Api()
custom_objects_api = client.CustomObjectsApi()
crds = api_extensions.list_custom_resource_definition()
for crd in crds.items:
Member:

Custom resources should be cleaned up by the CREvent instead of the cluster manager.

namespace = cr["metadata"]["namespace"]
custom_objects_api.delete_namespaced_custom_object(group, version, namespace, plural, name)

time.sleep(15)
Member:

Can we not hardcode the sleep time?
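One way to avoid the fixed sleep is to poll until the resources are actually gone, bounded by a timeout. A small illustrative helper follows; the helper names, the "rayclusters" plural, and the timeout values are assumptions, not part of the PR:

```python
import time

from kubernetes import client, config

def wait_until(condition, timeout_s=120, interval_s=5):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if condition():
            return
        time.sleep(interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")

def no_ray_crs_left(custom_objects_api, namespace, plural="rayclusters"):
    """True when no Ray custom resources of the given kind remain."""
    crs = custom_objects_api.list_namespaced_custom_object(
        "ray.io", "v1alpha1", namespace, plural)
    return len(crs["items"]) == 0

# Usage sketch:
# config.load_kube_config()
# api = client.CustomObjectsApi()
# wait_until(lambda: no_ray_crs_left(api, "default"))
```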

@jiripetrlik (Contributor Author)

Thank you @kevin85421 for your review! I tried to address all your comments, except that I do not use CREvent for deleting Ray custom resources; instead I use a cleanup method that can be shared by all tests. I'm sorry for the late reply; it was due to my vacation.

@kevin85421 (Member)

The sample YAML tests seem to fail consistently. Would you mind running the sample YAML tests locally? See https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#running-configuration-tests-locally for more details.

@jiripetrlik force-pushed the 1284-arbitrary-cluster branch 3 times, most recently from 1cec918 to 60bbf1e on September 7, 2023 21:50
tests/framework/utils.py (resolved comment)
"""Check whether cluster exists or not"""
return (
shell_subprocess_run(
"kubectl cluster-info --context kind-kind", check=False
Member:

kind-kind is the default context for Kind.

)

def __delete_all_crs(self, group, version, namespace, plural):
custom_objects_api = self.k8s_client_dict[CONST.K8S_CR_CLIENT_KEY]
Member:

The custom resource lifecycles are better controlled by CREvent. In my opinion, the cleanup at the cluster level is not necessary. We can remove this function.

self.cleanup_timeout = 120

def cleanup(self, namespace = "default") -> None:
self.__delete_all_crs("ray.io", "v1alpha1", namespace, "rayservices")
Member:

The custom resource lifecycles are better controlled by CREvent. In my opinion, the cleanup at the cluster level is not necessary. We can remove this function.

)
self.cleanup_timeout = 120

def cleanup(self, namespace = "default") -> None:
Member:

In ClusterManager's definition, cleanup needs to delete the Kubernetes cluster, but this function does not do that.

@jiripetrlik (Contributor Author)

Hello @kevin85421, thank you for the review. The reason I do not want to delete the cluster itself in ExternalClusterManager's cleanup method is that I want to run the tests even with clusters which I cannot easily provision or tear down. For example, provisioning and tearing down an OpenShift cluster may take half an hour or so, which is probably not feasible with so many tests. Instead, I've decided to implement a full cluster-wide cleanup. This behavior is enabled by EXTERNAL_CLUSTER=true. If you do not enable it, the behavior of the tests is the same as before: a Kind cluster is created and deleted exactly as it was previously. See the cleanup method of KindClusterManager.
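For illustration, a hedged sketch of the switch described above, with stand-in classes for the real KindClusterManager and ExternalClusterManager in tests/framework/utils.py (this is not the PR's exact code):

```python
import os

class KindClusterManager:
    """Stand-in: creates a Kind cluster for the tests and deletes it afterwards."""

class ExternalClusterManager:
    """Stand-in: reuses an existing cluster; cleanup only removes test resources."""

def get_cluster_manager():
    """Pick the cluster manager based on the EXTERNAL_CLUSTER env var."""
    if os.getenv("EXTERNAL_CLUSTER", "").lower() == "true":
        # Reuse an already-provisioned cluster (e.g. an OpenShift cluster
        # that is too expensive to create and tear down per test run).
        return ExternalClusterManager()
    # Default behavior is unchanged: provision a throwaway Kind cluster.
    return KindClusterManager()
```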

@kevin85421 (Member)

> cleanup method is that I want to run tests even with clusters which I can not easily provision or tear down. For example to provision and tear down some Openshift cluster may take half an hour or something, which is probably not doable with so many tests.

Thank you for the explanation! It makes sense to me. Some CR YAMLs define not only the CR but also other resources, such as a ConfigMap. Some tests may have side effects after the cluster-level cleanup, but it is fine to open follow-up PRs to fix those if any arise.

@kevin85421 (Member) left a review:

LGTM

@kevin85421 (Member)

The RayService e2e tests are known to be flaky. It is not related to this PR.

@kevin85421 kevin85421 merged commit 59d703f into ray-project:master Sep 13, 2023
15 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
Allow E2E tests to run with arbitrary k8s cluster
Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close these issues:
  • Allow E2E tests to run with arbitrary Kubernetes cluster

2 participants