Operator oc support #1185
Conversation
This is technical debt. KubeRay should not create an Ingress for users, and we plan to deprecate this behavior at some point.
- Release v0.5.0 doc validation part 2 #999 => Remove built-in ingress support from doc
- Deprecate built in ingress #911
The preferred method is to provide YAML files, such as the one found at https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.separate-ingress.yaml, and document it in https://github.com/ray-project/kuberay/blob/master/docs/guidance/ingress.md.
Sorry about the inconvenience.
Sorry, @kevin85421. It can be a very useful feature for us moving forward. The issue is that we plan to create Ray clusters per job execution. We also want to do it using KFP, with no human interaction. Finally, we want to be able to do remote cluster creation and job submission. For this, we do need to be able to create an ingress/route. Can we discuss a compromise solution on Monday?
Q1: Does KFP stand for Kubeflow Pipelines?
I think we do need to revisit the ingress deprecation. Let's talk more about it. I have my own view on the final architecture. Let's compare notes.
I think I agree that it makes sense for the KubeRay operator to create an ingress/route for the cluster. The only thing I think we need to be careful about is auth. Creating routes that are exposed to external traffic by default could be dangerous -- we should figure out how to protect these clusters too.
@kevin85421 Any progress on this?
@blublinsky Would you mind fixing the CI failures? I will try to review it today. You can refer to https://github.com/ray-project/kuberay/blob/master/ray-operator/DEVELOPMENT.md#consistency-check for more details about the failure. Thanks!
@kevin85421 I think I fixed my errors.
By the way, would you mind rebasing this PR with the master branch? Thanks!
Would you mind rebasing with the master branch?
Let's try to get this PR merged before the v0.6.0 feature freeze, which is expected to be on July 10 or 11.
@anishasthana Would you mind reviewing this PR, especially the OpenShift-related part?
// Check where we are running. We are trying to distinguish here whether
// this is a vanilla Kubernetes cluster or OpenShift.
func getClusterType(logger logr.Logger) bool {
How about using an environment variable for the KubeRay operator to determine which Kubernetes distribution is being used, instead of relying on automatic detection? Based on my experience, making assumptions about users' Kubernetes control planes can often lead to issues. In addition, an environment variable will also make it easier for us to maintain backward compatibility.
e.g. K8S_DISTRIBUTIONS:
- Case 1: it is not defined or set to VANILLA => vanilla Kubernetes
- Case 2: OPENSHIFT => OpenShift
(a minimal sketch of reading such a variable is shown below)
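A minimal sketch of how the operator could read such a variable, assuming the name K8S_DISTRIBUTIONS and the two values described above; the package and function names are only illustrative, not code from this PR:

```go
package detection

import (
	"os"
	"strings"
)

// isOpenShiftFromEnv decides the distribution purely from the
// K8S_DISTRIBUTIONS environment variable: unset or "VANILLA" means
// vanilla Kubernetes (case 1), "OPENSHIFT" means OpenShift (case 2).
func isOpenShiftFromEnv() bool {
	switch strings.ToUpper(os.Getenv("K8S_DISTRIBUTIONS")) {
	case "OPENSHIFT":
		return true
	default: // "", "VANILLA", or anything unrecognized
		return false
	}
}
```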
You can refer to ENABLE_INIT_CONTAINER_INJECTION in #1069 for more details.
Sorry, in my experience the use of an environment variable is by far more error-prone compared to automatic detection.
In the current implementation, getClusterType checks whether anything from OpenShift is installed in the cluster. It always defaults to plain vanilla Kubernetes. I also just added an environment variable to override the default OpenShift behavior.
I think I agree with automatic detection; environment variables offer more control but tend to result in user error more often.
I believe only IBM/RedHat will use this feature, so we shouldn't subject other users to the risks of this code path.
In addition, there is no feature gate for this change.
If you still want to use automatic detection, please add a feature flag. In addition, would you mind replying to my question (3) about the implementation in #1185 (review)?
The function needs to check if a string ends with .openshift.io, like strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io"). I'm not sure if this is the standard way to do it, but it seems a bit hacky to me at first glance.
I believe only IBM/RedHat will use this feature, so we shouldn't subject other users to the risks of this code path.
In addition, there is no feature gate for this change.
There are more users of OpenShift than IBM and RedHat, so it is relevant to all OpenShift users. We already have a flag bypassing this function. What else exactly do you want me to do?
would you mind replying to my question (3) about the implementation in https://github.com/ray-project/kuberay/pull/1185#pullrequestreview-1525067293?
Yes, this is a standard way of using this check. See https://developers.redhat.com/blog/2020/09/11/5-tips-for-developing-kubernetes-operators-with-the-new-operator-sdk for an example
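For reference, a rough sketch of this discovery-based approach (not the exact code in this PR; the package name, function name, error handling, and client construction are illustrative):

```go
package detection

import (
	"strings"

	"github.com/go-logr/logr"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// isOpenShiftCluster returns true if any API group served by the cluster
// ends with ".openshift.io"; on any error it falls back to assuming
// vanilla Kubernetes.
func isOpenShiftCluster(cfg *rest.Config, logger logr.Logger) bool {
	client, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		logger.Error(err, "cannot create discovery client; assuming vanilla Kubernetes")
		return false
	}
	apiGroupList, err := client.ServerGroups()
	if err != nil {
		logger.Error(err, "cannot list API groups; assuming vanilla Kubernetes")
		return false
	}
	for i := range apiGroupList.Groups {
		if strings.HasSuffix(apiGroupList.Groups[i].Name, ".openshift.io") {
			return true
		}
	}
	return false
}
```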
}
func TestBuildRouteForHeadService(t *testing.T) {
	route, err := BuildRouteForHeadService(*instanceWithRouteEnabled)
Although currently BuildRouteForHeadService will only read instanceWithRouteEnabled, it is better to perform a deep copy of instanceWithRouteEnabled to avoid side effects in the future. I expect that instanceWithRouteEnabled will be used in multiple tests; if its value changes after a test, it will prevent later tests from covering the correct code paths.
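A sketch of what this suggestion could look like in the test above; DeepCopy is generated for the CRD types, while the assertions here are only illustrative and not the actual test body:

```go
func TestBuildRouteForHeadService(t *testing.T) {
	// Operate on a deep copy so that neither this test nor any future change
	// to BuildRouteForHeadService can mutate the shared fixture
	// instanceWithRouteEnabled and affect other tests.
	cluster := instanceWithRouteEnabled.DeepCopy()

	route, err := BuildRouteForHeadService(*cluster)
	if err != nil {
		t.Fatalf("BuildRouteForHeadService returned an error: %v", err)
	}
	if route == nil {
		t.Fatal("expected a non-nil Route")
	}
}
```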
Huh? BuildRouteForHeadService should never modify source.
BuildRouteForHeadService should never modify source
You can refer to my comment https://github.com/ray-project/kuberay/pull/1185/files#r1255244213. I would say that it is good practice for every gopher to avoid side effects caused by future changes. For more details on this topic, you can refer to this article. In addition, you can find related comments frequently in the Kubernetes source code.
In general, I would agree that a shallow copy is bad. In this particular case, BuildRouteForHeadService is part of CR processing. If it modifies the CR, we have a much bigger problem. Also, compare it with other tests, for example the ingress test.
@blublinsky, perhaps inlining the variable in the call to BuildRouteForHeadService might be a better solution. I think it should assuage @kevin85421's concerns and also slightly simplify the reading of the code (a subjective opinion).
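A rough illustration of the inlining idea; the CRD type name rayv1alpha1.RayCluster and the field values are placeholders, since the actual fixture in this PR is not shown here:

```go
// Instead of:
//   route, err := BuildRouteForHeadService(*instanceWithRouteEnabled)
// build the fixture inline in the call, so there is no shared
// package-level variable that other tests could observe or mutate.
route, err := BuildRouteForHeadService(rayv1alpha1.RayCluster{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "raycluster-sample", // placeholder
		Namespace: "default",           // placeholder
	},
})
```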
Signed-off-by: Boris Lublinsky <[email protected]>
# Conflicts:
#	helm-chart/kuberay-apiserver/values.yaml
RayJob with custom head service name
Redefine the behavior for deleting Pods and stop listening to Kubernetes events (ray-project#1341)
Upgrade to Go 1.19
@kevin85421 It has been a while. Is it time to merge this one?
@blublinsky I reviewed this PR last week and noticed that the commits were disorganized. This hasn't been addressed yet. It's challenging for me to review a PR with 102 file changes, so I'll review it once the commits have been cleaned up.
@kevin85421 What do you mean by disorganized? I keep rebasing it to catch up with the changes. What exactly do you want me to do? Just tell me.
@blublinsky It shows that this PR updates 102 files, but it actually updates fewer than 10 files.
@kevin85421 Replaced this PR with #1371.
Why are these changes needed?
The current implementation of the operator allows creating an Ingress to expose the dashboard and remote job submission APIs from outside the cluster. On OpenShift, Routes are typically used in place of Ingresses.
This PR implements two main things:
- Automatic detection of whether the operator is running on a vanilla Kubernetes cluster or on OpenShift (getClusterType), with an environment variable to override the default behavior.
- Creation of an OpenShift Route for the head service (BuildRouteForHeadService) instead of an Ingress when running on OpenShift.
A rough sketch of the Route-building idea is shown below.
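As a hedged illustration of the Route half of this (not the exact code in this PR), a Route for the head service could be built from the OpenShift API types roughly like this; the package name, function name, label key, and port name are placeholders:

```go
package route

import (
	routev1 "github.com/openshift/api/route/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildHeadRoute sketches an OpenShift Route that exposes the dashboard
// port of a Ray head service; all names and labels are placeholders.
func buildHeadRoute(clusterName, namespace, headSvcName string) *routev1.Route {
	return &routev1.Route{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-head-route",
			Namespace: namespace,
			Labels:    map[string]string{"ray.io/cluster": clusterName},
		},
		Spec: routev1.RouteSpec{
			To: routev1.RouteTargetReference{
				Kind: "Service",
				Name: headSvcName,
			},
			Port: &routev1.RoutePort{
				// Assumed service port name for the Ray dashboard.
				TargetPort: intstr.FromString("dashboard"),
			},
		},
	}
}
```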
Related issue number
Checks