Operator oc support #1185

Closed
wants to merge 216 commits

Conversation

blublinsky
Contributor

Why are these changes needed?

The current implementation of the operator allows creating an Ingress to expose the dashboard and remote job submission APIs from outside the cluster. In the case of OpenShift, Routes are typically used in place of Ingress.

This PR implements two main things:

  1. During operator startup, the implementation checks whether the platform is OpenShift or plain Kubernetes
  2. If ingress is requested, then depending on the platform, either an Ingress (plain Kubernetes) or a Route (OpenShift) is created (sketched below)
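
For illustration, a minimal sketch of this startup flow, assuming the PR's getClusterType plus hypothetical reconcile helpers (the EnableIngress field approximates the RayCluster head-group flag; none of this is the PR's exact code):

// Detect the platform once at operator startup.
isOpenShift := getClusterType(logger)

// Later, when reconciling a cluster that requests external exposure:
if instance.Spec.HeadGroupSpec.EnableIngress != nil && *instance.Spec.HeadGroupSpec.EnableIngress {
    if isOpenShift {
        // OpenShift: expose the head service via a Route.
        err = r.reconcileRoute(ctx, instance)
    } else {
        // Plain Kubernetes: expose the head service via an Ingress.
        err = r.reconcileIngress(ctx, instance)
    }
}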

Related issue number

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • This PR is not tested :(

Member

@kevin85421 left a comment

This is technical debt. KubeRay should not create an Ingress for users. I should deprecate this at some point.

The preferred method is to provide YAML files, such as the one found at https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.separate-ingress.yaml, and document it in https://github.com/ray-project/kuberay/blob/master/docs/guidance/ingress.md.

Sorry about the inconvenience.

@blublinsky
Contributor Author

Sorry, @kevin85421. It can be a very useful feature for us moving forward. The issue is that we plan to create Ray clusters per job execution. We also want to do it using KFP - no human interaction. Finally, we want to be able to do remote cluster creation and job submission. For this, we do need to be able to create ingress/route. Can we discuss some compromise solution on Monday?

@kevin85421
Member

It can be a very useful feature for us moving forward. The issue is that we plan to create Ray clusters per job execution. We also want to do it using KFP - no human interaction. Finally, we want to be able to do remote cluster creation and job submission.

Q1: Does KFP stand for Kubeflow pipeline?
Q2: Why not add the ingress spec and RayCluster spec in a single YAML file?

@blublinsky
Contributor Author

blublinsky commented Jun 24, 2023

Q1: Does KFP stand for Kubeflow pipeline?
Yes
Q2: Why not add the ingress spec and RayCluster spec in a single YAML file?
This will make things far more complex:

  1. Clean-up is not coordinated by the operator.
  2. It can lead to errors if the name of the service is not defined correctly; the operator ensures that it is correct.
  3. If someone accidentally deletes an ingress, the operator will restore it.

These are all the good things that the operator provides.

I think we do need to revisit the ingress deprecation. Let's talk more about it. I have my own view on the final architecture. Let's compare notes.

@anishasthana
Contributor

I think I agree that the KubeRay operator creating an ingress/route for the cluster makes sense. The only thing we need to be careful about is auth. Creating routes that are exposed to external traffic by default could be dangerous -- we should figure out how to protect these clusters too.

@blublinsky
Contributor Author

@kevin85421 any progress on this?

@kevin85421
Member

@blublinsky Would you mind fixing the CI failures? I will try to review it today. You can refer to https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#consistency-check for more details about the failure. Thanks!

@blublinsky
Contributor Author

@kevin85421 I think I fixed my errors

Member

@kevin85421 left a comment

By the way, would you mind rebasing this PR with the master branch? Thanks!

Member

@kevin85421 left a comment

  1. Would you mind rebasing with the master branch?

  2. Let's try to get this PR merged before the v0.6.0 feature freeze, which is expected to be on July 10 or 11.

@anishasthana Would you mind reviewing this PR, especially the OpenShift-related part?


// Check where we are running. We are trying to distinguish here whether
// this is a vanilla Kubernetes cluster or OpenShift.
func getClusterType(logger logr.Logger) bool {
Member

How about using an environment variable for the KubeRay operator to determine which Kubernetes distribution is being used, instead of relying on automatic detection? Based on my experience, making assumptions about users' Kubernetes control planes can often lead to issues. In addition, an environment variable will also make it easier for us to maintain backward compatibility.

Member

e.g. K8S_DISTRIBUTIONS:

Case 1: it is not defined or VANILLA => vanilla Kubernetes
Case 2: OPENSHIFT => OpenShift
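
A minimal sketch of reading such a switch, assuming the K8S_DISTRIBUTIONS name and the two cases proposed above (the helper itself is hypothetical):

import (
    "os"
    "strings"
)

// isOpenShiftFromEnv implements the proposed switch: unset or "VANILLA"
// means vanilla Kubernetes; "OPENSHIFT" means OpenShift.
func isOpenShiftFromEnv() bool {
    return strings.EqualFold(os.Getenv("K8S_DISTRIBUTIONS"), "OPENSHIFT")
}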

Member

You can refer to ENABLE_INIT_CONTAINER_INJECTION in #1069 for more details.

Contributor Author

Sorry, in my experience the use of an env variable is far more error-prone than automatic detection.

Contributor Author

In the current implementation, getClusterType checks whether anything from the OpenShift namespace is installed. It always defaults to plain vanilla Kubernetes. I also just added an env variable to override the default OpenShift behavior.

Member

I think I agree with automatic detection

#1185 (review)

environment variables offer more control but tend to result in user-error more often.

I believe only IBM/RedHat will use this feature, so we shouldn't subject other users to the risks of this code path.
In addition, there is no feature gate for this change.

Member

If you still want to use automatic detection, please add a feature flag. In addition, would you mind replying to my question (3) about the implementation in #1185 (review)?

Member

The function needs to check if a string ends with .openshift.io, like strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io"). I'm not sure if this is the standard way to do it, but it seems a bit hacky to me at first glance.

Contributor Author

I believe only IBM/RedHat will use this feature, so we shouldn't subject other users to the risks of this code path.
In addition, there is no feature gate for this change.

There are more users of OpenShift than IBM and RedHat, so it is relevant to all OpenShift users. We already have a flag bypassing this function. What else exactly do you want me to do?

Contributor Author

would you mind replying to my question (3) about the implementation in https://github.com/ray-project/kuberay/pull/1185#pullrequestreview-1525067293?

Yes, this is a standard way of using this check. See https://developers.redhat.com/blog/2020/09/11/5-tips-for-developing-kubernetes-operators-with-the-new-operator-sdk for an example.
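
For reference, a minimal sketch of the API-group check under discussion, using the client-go discovery client (error handling simplified; an illustration of the approach, not this PR's exact code):

import (
    "strings"

    "k8s.io/client-go/discovery"
    "k8s.io/client-go/rest"
)

// isOpenShift reports whether any served API group ends with ".openshift.io",
// which indicates an OpenShift cluster. On any error it defaults to vanilla
// Kubernetes, matching the default behavior described above.
func isOpenShift(config *rest.Config) bool {
    client, err := discovery.NewDiscoveryClientForConfig(config)
    if err != nil {
        return false
    }
    apiGroupList, err := client.ServerGroups()
    if err != nil {
        return false
    }
    for i := range apiGroupList.Groups {
        if strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io") {
            return true
        }
    }
    return false
}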

}

func TestBuildRouteForHeadService(t *testing.T) {
route, err := BuildRouteForHeadService(*instanceWithRouteEnabled)
Member

Although currently BuildRouteForHeadService will only read instanceWithRouteEnabled, it is better to perform a deep copy of instanceWithRouteEnabled to avoid side effects in the future. I expect that instanceWithRouteEnabled will be used in multiple tests. If the value is different after each test, it will prevent the tests from covering the correct code paths.

Contributor Author

Huh? BuildRouteForHeadService should never modify the source.

Member

BuildRouteForHeadService should never modify source

You can refer to my comment https://github.com/ray-project/kuberay/pull/1185/files#r1255244213.

I would say that it is a good practice for every gopher to avoid side effects caused by future changes. For more details on this topic, you can refer to this article. In addition, you can find related comments frequently in Kubernetes source code.

Contributor Author

In general, I would agree that a shallow copy is bad. In this particular case, BuildRouteForHeadService is part of CR processing; if it modifies the CR, we have a much bigger problem. Also compare it with other tests, for example the ingress test.

Contributor

@blublinsky, perhaps inlining the variable in the call to BuildRouteForHeadService might be a better solution. I think it should assuage @kevin85421's concerns and also slightly simplify the reading of the code (a subjective opinion).
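
To make the two suggestions concrete, the test body could look like either sketch below (DeepCopy is the method code-generated for CRD types; newRouteEnabledCluster is a hypothetical helper):

// Option 1 (deep copy): later tests see a pristine fixture even if
// BuildRouteForHeadService ever starts mutating its input.
instance := instanceWithRouteEnabled.DeepCopy()
route, err := BuildRouteForHeadService(*instance)

// Option 2 (inline): build a fresh instance at the call site, so no
// shared state is passed at all.
// route, err := BuildRouteForHeadService(*newRouteEnabledCluster())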

@blublinsky
Contributor Author

@kevin85421 It has been a while. Is it time to merge this one?

@kevin85421
Member

@blublinsky I reviewed this PR last week and noticed that the commits were disorganized. This hasn't been addressed yet. It's challenging for me to review a PR with 102 file changes, so I'll review it once the commits have been cleaned up.

@blublinsky
Contributor Author

@kevin85421 What do you mean by disorganized? I keep rebasing it to catch up with the changes. What exactly do you want me to do? Just tell me.

@kevin85421
Member

@blublinsky It shows that this PR updates 102 files, but this PR actually updates less than 10 files.

[Screenshot from Aug 29, 2023 showing the PR's files-changed count]

@blublinsky mentioned this pull request on Aug 29, 2023
@blublinsky
Contributor Author

@kevin85421 replaced this PR with #1371. Feel free to close this one and approve the new one.

@blublinsky closed this on Aug 30, 2023
@blublinsky deleted the operator_oc_support branch on August 30, 2023