Operator oc support #1185

Closed
wants to merge 216 commits

Conversation

blublinsky
Contributor

Why are these changes needed?

The current implementation of the operator allows creating an Ingress to expose the dashboard and remote job submission APIs from outside the cluster. In the case of OpenShift, Routes are typically used in place of Ingress.

This PR implements two main things:

  1. During operator startup, the implementation checks whether the platform is OpenShift or plain Kubernetes
  2. If ingress is requested, then depending on the platform, either an Ingress (plain Kubernetes) or a Route (OpenShift) is created (sketched below)
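
For illustration, a minimal sketch of this startup flow, assuming the PR's getClusterType plus hypothetical reconcile helpers (the EnableIngress field approximates the RayCluster head-group flag; none of this is the PR's exact code):

// Detect the platform once at operator startup.
isOpenShift := getClusterType(logger)

// Later, when reconciling a cluster that requests external exposure:
if instance.Spec.HeadGroupSpec.EnableIngress != nil && *instance.Spec.HeadGroupSpec.EnableIngress {
    if isOpenShift {
        // OpenShift: expose the head service via a Route.
        err = r.reconcileRoute(ctx, instance)
    } else {
        // Plain Kubernetes: expose the head service via an Ingress.
        err = r.reconcileIngress(ctx, instance)
    }
}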

Related issue number

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • This PR is not tested :(

Member

@kevin85421 left a comment

This is technical debt. KubeRay should not create an Ingress for users. I should deprecate this at some point.

The preferred method is to provide YAML files, such as the one found at https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.separate-ingress.yaml, and document it in https://github.com/ray-project/kuberay/blob/master/docs/guidance/ingress.md.

Sorry about the inconvenience.

@blublinsky
Contributor Author

Sorry, @kevin85421. It can be a very useful feature for us moving forward. The issue is that we plan to create Ray clusters per job execution. We also want to do it using KFP - no human interaction. Finally, we want to be able to do remote cluster creation and job submission. For this, we do need to be able to create ingress/route. Can we discuss some compromise solution on Monday?

@kevin85421
Member

It can be a very useful feature for us moving forward. The issue is that we plan to create Ray clusters per job execution. We also want to do it using KFP - no human interaction. Finally, we want to be able to do remote cluster creation and job submission.

Q1: Does KFP stand for Kubeflow pipeline?
Q2: Why not add the ingress spec and RayCluster spec in a single YAML file?

@blublinsky
Contributor Author

blublinsky commented Jun 24, 2023

Q1: Does KFP stand for Kubeflow pipeline?
Yes
Q2: Why not add the ingress spec and RayCluster spec in a single YAML file?
This will make things far more complex:

  1. Clean-up is not coordinated by the operator.
  2. It can lead to errors if the name of the service is not defined correctly; the operator ensures that it is correct.
  3. If someone accidentally deletes an ingress, the operator will restore it.

These are all the good things that the operator provides.

I think we do need to revisit the ingress deprecation. Let's talk more about it. I have my own view on the final architecture. Let's compare notes.

@anishasthana
Contributor

I think I agree that the KubeRay operator creating an ingress/route for the cluster makes sense. The only thing we need to be careful about is auth. Creating routes that are exposed to external traffic by default could be dangerous -- we should figure out how to protect these clusters too.

@blublinsky
Contributor Author

@kevin85421 any progress on this?

@kevin85421
Member

@blublinsky Would you mind fixing the CI failures? I will try to review it today. You can refer to https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#consistency-check for more details about the failure. Thanks!

@blublinsky
Contributor Author

@kevin85421 I think I fixed my errors

Member

@kevin85421 left a comment

By the way, would you mind rebasing this PR with the master branch? Thanks!

Member

@kevin85421 left a comment

  1. Would you mind rebasing with the master branch?

  2. Let's try to get this PR merged before the v0.6.0 feature freeze, which is expected to be on July 10 or 11.

@anishasthana Would you mind reviewing this PR, especially the OpenShift-related part?


// Check where we are running. We are trying to distinguish here whether
// this is a vanilla Kubernetes cluster or OpenShift.
func getClusterType(logger logr.Logger) bool {
Member

How about using an environment variable for the KubeRay operator to determine which Kubernetes distribution is being used, instead of relying on automatic detection? Based on my experience, making assumptions about users' Kubernetes control planes can often lead to issues. In addition, an environment variable will also make it easier for us to maintain backward compatibility.

Member

e.g. K8S_DISTRIBUTIONS:

Case 1: it is not defined or VANILLA => vanilla Kubernetes
Case 2: OPENSHIFT => OpenShift
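
A minimal sketch of reading such a switch, assuming the K8S_DISTRIBUTIONS name and the two cases proposed above (the helper itself is hypothetical):

import (
    "os"
    "strings"
)

// isOpenShiftFromEnv implements the proposed switch: unset or "VANILLA"
// means vanilla Kubernetes; "OPENSHIFT" means OpenShift.
func isOpenShiftFromEnv() bool {
    return strings.EqualFold(os.Getenv("K8S_DISTRIBUTIONS"), "OPENSHIFT")
}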

Member

You can refer to ENABLE_INIT_CONTAINER_INJECTION in #1069 for more details.

Contributor Author

Sorry, in my experience the use of an env variable is far more error-prone than automatic detection.

Contributor Author

In the current implementation, getClusterType checks whether anything from the OpenShift namespace is installed. It always defaults to plain vanilla Kubernetes. I also just added an env variable to override the default OpenShift behavior.

Member

I think I agree with automatic detection

#1185 (review)

environment variables offer more control but tend to result in user-error more often.

I believe only IBM/RedHat will use this feature, so we shouldn't subject other users to the risks of this code path.
In addition, there is no feature gate for this change.

Member

If you still want to use automatic detection, please add a feature flag. In addition, would you mind replying to my question (3) about the implementation in #1185 (review)?

Member

The function needs to check if a string ends with .openshift.io, like strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io"). I'm not sure if this is the standard way to do it, but it seems a bit hacky to me at first glance.

Contributor Author

I believe only IBM/RedHat will use this feature, so we shouldn't subject other users to the risks of this code path.
In addition, there is no feature gate for this change.

There are more users of OpenShift than IBM and RedHat, so it is relevant to all OpenShift users. We already have a flag bypassing this function. What else exactly do you want me to do?

Contributor Author

would you mind replying to my question (3) about the implementation in https://github.com/ray-project/kuberay/pull/1185#pullrequestreview-1525067293?

Yes, this is a standard way of using this check. See https://developers.redhat.com/blog/2020/09/11/5-tips-for-developing-kubernetes-operators-with-the-new-operator-sdk for an example.
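
For reference, a minimal sketch of the API-group check under discussion, using the client-go discovery client (error handling simplified; an illustration of the approach, not this PR's exact code):

import (
    "strings"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// isOpenShift reports whether any served API group ends with ".openshift.io",
// which indicates an OpenShift cluster. On any error it defaults to vanilla
// Kubernetes, matching the default behavior described above.
func isOpenShift(config *rest.Config) bool {
    client, err := discovery.NewDiscoveryClientForConfig(config)
    if err != nil {
        return false
    }
    apiGroupList, err := client.ServerGroups()
    if err != nil {
        return false
    }
    for i := range apiGroupList.Groups {
        if strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io") {
            return true
        }
    }
    return false
}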

}

func TestBuildRouteForHeadService(t *testing.T) {
route, err := BuildRouteForHeadService(*instanceWithRouteEnabled)
Member

Although currently BuildRouteForHeadService will only read instanceWithRouteEnabled, it is better to perform a deep copy of instanceWithRouteEnabled to avoid side effects in the future. I expect that instanceWithRouteEnabled will be used in multiple tests. If the value is different after each test, it will prevent the tests from covering the correct code paths.

Contributor Author

Huh? BuildRouteForHeadService should never modify the source.

Member

BuildRouteForHeadService should never modify source

You can refer to my comment https://github.com/ray-project/kuberay/pull/1185/files#r1255244213.

I would say that it is a good practice for every gopher to avoid side effects caused by future changes. For more details on this topic, you can refer to this article. In addition, you can find related comments frequently in Kubernetes source code.

Contributor Author

In general, I would agree that a shallow copy is bad. In this particular case, BuildRouteForHeadService is part of CR processing; if it modifies the CR, we have a much bigger problem. Also compare it with other tests, for example the ingress test.

Contributor

@blublinsky, perhaps inlining the variable in the call to BuildRouteForHeadService might be a better solution. I think it should assuage @kevin85421's concerns and also slightly simplify the reading of the code (a subjective opinion).
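
To make the two suggestions concrete, the test body could look like either sketch below (DeepCopy is the method code-generated for CRD types; newRouteEnabledCluster is a hypothetical helper):

// Option 1 (deep copy): later tests see a pristine fixture even if
// BuildRouteForHeadService ever starts mutating its input.
instance := instanceWithRouteEnabled.DeepCopy()
route, err := BuildRouteForHeadService(*instance)

// Option 2 (inline): build a fresh instance at the call site, so no
// shared state is passed at all.
// route, err := BuildRouteForHeadService(*newRouteEnabledCluster())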

@blublinsky
Contributor Author

@kevin85421 It has been a while. Is it time to merge this one?

@kevin85421
Member

@blublinsky I reviewed this PR last week and noticed that the commits were disorganized. This hasn't been addressed yet. It's challenging for me to review a PR with 102 file changes, so I'll review it once the commits have been cleaned up.

@blublinsky
Contributor Author

@kevin85421 What do you mean by disorganized? I keep rebasing it to catch up with the changes. What exactly do you want me to do? Just tell me.

@kevin85421
Member

@blublinsky It shows that this PR updates 102 files, but this PR actually updates less than 10 files.

[Screenshot from Aug 29, 2023 showing the PR's files-changed count]

@blublinsky mentioned this pull request on Aug 29, 2023
@blublinsky
Contributor Author

@kevin85421 replaced this PR with #1371. Feel free to close this one and approve the new one.

@blublinsky closed this on Aug 30, 2023
@blublinsky deleted the operator_oc_support branch on August 30, 2023