Allow E2E tests to run with arbitrary k8s cluster #1306
Conversation
@kevin85421
tests/framework/utils.py (outdated)

    def create_kind_cluster(self, kind_config=None) -> None:
    # def create_kind_cluster(self, kind_config=None) -> None:

Forgot to remove the commented-out line?
        return self.client_dict

    def cleanup(self, namespace = "default") -> None:
        config.load_kube_config()

Why do we need to load_kube_config again if we have already loaded the config in initialize_cluster?
tests/framework/utils.py (outdated)

    def __init__(self) -> None:
        self.client_dict = {}

    def client_dict(self):

This function seems to be unused.
tests/framework/utils.py (outdated)

    def cleanup(self, namespace = "default") -> None:
        config.load_kube_config()
        api_extensions = client.ApiextensionsV1Api()

Using multiple Kubernetes client instances will cause flakiness. Please reuse the client from self.client_dict and ensure there is only one Kubernetes client instance in this process.
tests/framework/utils.py (outdated)

    def client_dict(self):
        return self.client_dict

    def cleanup(self, namespace = "default") -> None:

Why not delete the cluster directly?
tests/framework/utils.py (outdated)

    def client_dict(self):
        return self.client_dict

    def cleanup(self, namespace = "default") -> None:

We should also close the Kubernetes clients and clean up self.client_dict.
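For illustration, a small helper sketch showing one way to do that, reusing the self.client_dict layout from the snippet above; close_k8s_clients is a hypothetical name, and the exact attribute path on the generated Kubernetes client may differ between client library versions:

```python
def close_k8s_clients(client_dict: dict) -> dict:
    """Close every Kubernetes API client in `client_dict` and return an empty dict."""
    for k8s_client in client_dict.values():
        # Clear the urllib3 connection pool and shut the ApiClient down so no
        # sockets or threads leak between test runs.
        k8s_client.api_client.rest_client.pool_manager.clear()
        k8s_client.api_client.close()
    return {}

# Usage sketch inside cleanup():
#     self.client_dict = close_k8s_clients(self.client_dict)
```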
tests/framework/utils.py (outdated)

        time.sleep(15)

        if "kuberay-operator" in [deployment.metadata.name for deployment in apps_v1.list_namespaced_deployment(namespace).items]:

Why do we need this if we will run helm uninstall below?
tests/framework/utils.py (outdated)

        apps_v1 = client.AppsV1Api()
        custom_objects_api = client.CustomObjectsApi()
        crds = api_extensions.list_custom_resource_definition()
        for crd in crds.items:

Custom resources should be cleaned up by the CREvent instead of the cluster manager.
tests/framework/utils.py (outdated)

            namespace = cr["metadata"]["namespace"]
            custom_objects_api.delete_namespaced_custom_object(group, version, namespace, plural, name)

        time.sleep(15)

Can we not hardcode the sleep time?
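One possible way to avoid the fixed sleep is a generic polling helper; wait_for_condition and list_remaining_crs are hypothetical names used only for illustration, and the 120-second default mirrors the cleanup_timeout that appears later in the diff:

```python
import time

def wait_for_condition(check, timeout_s: int = 120, interval_s: float = 1.0) -> None:
    """Poll `check()` until it returns True instead of sleeping for a fixed time."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if check():
            return
        time.sleep(interval_s)
    raise TimeoutError(f"Condition not met within {timeout_s} seconds")

# Usage sketch: wait until the namespace has no Ray custom resources left,
# where `list_remaining_crs` is a hypothetical helper.
# wait_for_condition(lambda: not list_remaining_crs(namespace), timeout_s=120)
```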
(force-pushed a164211 to 4a8e49d)
Thank you @kevin85421 for your review! I tried to address all your comments, except that I do not use CREvent for deleting Ray custom resources; instead, I use a cleanup method that can be general for all tests. I'm sorry for the late reply, it was due to my vacation.
The sample YAML tests seem to fail consistently. Would you mind running the sample YAML tests locally? See https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#running-configuration-tests-locally for more details.
(force-pushed 1cec918 to 60bbf1e)
tests/framework/utils.py (outdated)

        """Check whether cluster exists or not"""
        return (
            shell_subprocess_run(
                "kubectl cluster-info --context kind-kind", check=False

kind-kind is the default context for Kind.
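As a small variation on the method quoted above, the context name could be made a parameter instead of hardcoding the Kind default; this sketch assumes shell_subprocess_run returns the process exit code, as the surrounding snippet implies:

```python
    def check_cluster_exist(self, context: str = "kind-kind") -> bool:
        """Check whether the cluster behind the given kubeconfig context is reachable."""
        return (
            shell_subprocess_run(
                f"kubectl cluster-info --context {context}", check=False
            )
            == 0
        )
```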
        )

    def __delete_all_crs(self, group, version, namespace, plural):
        custom_objects_api = self.k8s_client_dict[CONST.K8S_CR_CLIENT_KEY]

The custom resource lifecycles are better controlled by CREvent (link). In my opinion, the cleanup at the cluster level is not necessary. We can remove this function.
        self.cleanup_timeout = 120

    def cleanup(self, namespace = "default") -> None:
        self.__delete_all_crs("ray.io", "v1alpha1", namespace, "rayservices")

The custom resource lifecycles are better controlled by CREvent (link). In my opinion, the cleanup at the cluster level is not necessary. We can remove this function.
        )
        self.cleanup_timeout = 120

    def cleanup(self, namespace = "default") -> None:

In ClusterManager's definition, cleanup needs to delete the Kubernetes cluster, but this function does not do that.
Hello @kevin85421, thank you for the review. The reason I do not delete the cluster itself in ExternalClusterManager's cleanup method is that I want to be able to run the tests even against clusters that I cannot easily provision or tear down. For example, provisioning and tearing down an OpenShift cluster may take half an hour or more, which is probably not feasible with this many tests. Instead, I've implemented a full cluster-wide cleanup. This behavior is enabled by EXTERNAL_CLUSTER=true; if you do not enable it, the tests behave exactly as before, creating and deleting a Kind cluster the same way as before. See the cleanup method of KindClusterManager.
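To make that split concrete, here is a rough sketch of how the two managers relate, assuming (as described above) that EXTERNAL_CLUSTER=true selects the external-cluster path; the class names mirror the ones mentioned in this thread, but the bodies are placeholders rather than the PR's actual implementation:

```python
import os

class KindClusterManager:
    """Creates a Kind cluster before the tests and deletes it afterwards."""
    def cleanup(self, namespace: str = "default") -> None:
        ...  # delete the Kind cluster, same behavior as before this PR

class ExternalClusterManager:
    """Reuses whatever cluster kubectl/oc is currently logged into."""
    def cleanup(self, namespace: str = "default") -> None:
        ...  # cluster-wide cleanup only: operator, CRDs, Ray custom resources

def get_cluster_manager():
    """Select the cluster manager based on the EXTERNAL_CLUSTER flag."""
    if os.getenv("EXTERNAL_CLUSTER", "").lower() == "true":
        return ExternalClusterManager()
    return KindClusterManager()
```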
(force-pushed 60bbf1e to a37388f)
Thank you for the explanation! It makes sense to me. Some CR YAMLs define not only the CR but also other resources, such as a ConfigMap, so some tests may still have side effects after the cluster-level cleanup.
LGTM
The RayService e2e tests are known to be flaky. It is not related to this PR.
Allow E2E tests to run with arbitrary k8s cluster
Why are these changes needed?
Currently the E2E tests run only with a Kind cluster. The goal of this PR is to make these tests more configurable and allow them to run with any Kubernetes cluster. It is now possible to log in to an arbitrary cluster using the kubectl/oc command and run the tests against it, for example against an existing OpenShift cluster with EXTERNAL_CLUSTER=true enabled.
Related issue number
Closes #1284
Checks