Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kuberay integration #2383

Merged
merged 18 commits into from
May 12, 2023
Merged

Kuberay integration #2383

merged 18 commits into from
May 12, 2023

Conversation

juliusvonkohout
Copy link
Member

Which issue is resolved by this Pull Request:
Resolves kubeflow/kubeflow#6680

@kevin85421

Description of your changes:

Checklist:

  • Unit tests pass:
    Make sure you have installed kustomize == 3.2.1
    1. make generate-changed-only
    2. make test

@kevin85421
Copy link
Contributor

kevin85421 commented Feb 16, 2023

Progress

I will keep updating the progress in this comment.

  • Makefile: Transform the latest release KubeRay Helm chart (i.e. 0.4.0 for now) to Kustomize.

    • Create a manifest for KubeRay operator and CRD.
    • Run test.sh
  • README.md: [Status: TODO]

    • Install the manifest in one command.
    • Introduction of Ray
    • Architecture
    • Example
    • UPGRADE
  • Test:

    • test.sh
    • GitHub Actions (needs to be tested)
  • RayCluster example YAML

Screenshot

Screen Shot 2023-02-27 at 5 50 23 PM

@juliusvonkohout
Copy link
Member Author

Does 0.4.0 have the securitycontext set everywhere?

@kevin85421
Copy link
Contributor

Does 0.4.0 have the securitycontext set everywhere?

Yes, the PR ray-project/kuberay#752 also includes in the release 0.4.0. See here for more details.

@juliusvonkohout
Copy link
Member Author

Does 0.4.0 have the securitycontext set everywhere?

Yes, the PR ray-project/kuberay#752 also includes in the release 0.4.0. See here for more details.

Are they also enforced everywhere?

@juliusvonkohout
Copy link
Member Author

i do not see the securitycontext in raycluster_example.yaml or resources.yaml. Shouldnt there be

          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault

everywhere as shown in https://github.com/ray-project/kuberay/pull/750/files ?

@kevin85421
Copy link
Contributor

i do not see the securitycontext in raycluster_example.yaml or resources.yaml. Shouldnt there be

          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault

everywhere as shown in https://github.com/ray-project/kuberay/pull/750/files ?

Good catch. Will update it today.

@kevin85421
Copy link
Contributor

This PR is ready for review cc @juliusvonkohout @richardsliu. Could you approve the CI workflow? Thanks!

cc some KubeRay folks @architkulkarni @DmitriGekhtman @gvspraveen @Yicheng-Lu-llll @mjconnor @Jeffwan

@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Feb 28, 2023

i do not see the securitycontext in raycluster_example.yaml or resources.yaml. Shouldnt there be

          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
            runAsNonRoot: true
            seccompProfile:
              type: RuntimeDefault

everywhere as shown in https://github.com/ray-project/kuberay/pull/750/files ?

Good catch. Will update it today.

  1. What about the initcontainer securitycontext?
  2. Is there a reason for the logfile in the repository?
  3. We have to check https://github.com/kubeflow/manifests/blob/master/proposals/20220926-contrib-component-guidelines.md#component-requirements for conformance
  4. the test failed
Run cd contrib/ray/
./test.sh
customresourcedefinition.apiextensions.k8s.io/rayclusters.ray.io serverside-applied
customresourcedefinition.apiextensions.k8s.io/rayjobs.ray.io serverside-applied
customresourcedefinition.apiextensions.k8s.io/rayservices.ray.io serverside-applied
serviceaccount/kuberay-operator serverside-applied
role.rbac.authorization.k8s.io/kuberay-operator serverside-applied
clusterrole.rbac.authorization.k8s.io/kuberay-operator serverside-applied
clusterrole.rbac.authorization.k8s.io/rayjob-editor-role serverside-applied
clusterrole.rbac.authorization.k8s.io/rayjob-viewer-role serverside-applied
clusterrole.rbac.authorization.k8s.io/rayservice-editor-role serverside-applied
clusterrole.rbac.authorization.k8s.io/rayservice-viewer-role serverside-applied
rolebinding.rbac.authorization.k8s.io/kuberay-operator serverside-applied
clusterrolebinding.rbac.authorization.k8s.io/kuberay-operator serverside-applied
service/kuberay-operator serverside-applied
deployment.apps/kuberay-operator serverside-applied
deployment.apps/kuberay-operator condition met
NAME                                READY   STATUS    RESTARTS   AGE
kuberay-operator-5b8cd69758-nsvlv   1/1     Running   0          18s
raycluster.ray.io/kubeflow-raycluster created
pod/kubeflow-raycluster-head-z44k6 condition met
error: timed out waiting for the condition on pods/kubeflow-raycluster-worker-small-group-sfxch
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec]
make: *** [Makefile:11: test] Error 2
Error: Process completed with exit code 2.

@kevin85421
Copy link
Contributor

kevin85421 commented Feb 28, 2023

What about the initcontainer securitycontext?

Added. Btw, what's the reason to add securityContext here? The namespaces created by Kubeflow (example/) do not apply PSS. Why do we set the securityContext by default rather than making users configure that?

Is there a reason for the logfile in the repository?

Removed.

We have to check https://github.com/kubeflow/manifests/blob/master/proposals/20220926-contrib-component-guidelines.md#component-requirements for conformance

I have already checked the following:

  • README.md
  • There must be an OWNERS file with at least 2 users
    • We have the file OWNERS, but the file seems to not include "users" (@juliusvonkohout)
  • The component must work with the latest version of Kubeflow, and its dependencies
    • Maybe you can do that part. You are more familiar with Kubeflow. (@juliusvonkohout)
  • There must be an UPGRADE.md file that documents any instructions users need to follow when applying manifests of a newer version
  • There needs to be sufficient work on testing
    • Test exists, but we need to make sure it passes on CI.
    • I opened a PR to my fork, and the CI workflow can be triggered without approval. It passes (link).

the test failed

Does KubeRay use free plan GitHub Actions? If so, the CPU may be insufficient. I tried to reduce the CPU request.

[UPDATE] It passes after I reduce the CPU request (link).

Others

  • I fetched your master branch and rebase it. Everything goes wrong, so I need to force push to make the branch back. Would you mind adding your comments again?

  • I prefer to use Fig. 1 compared with Fig. 2. The example in README primarily focuses on Jupyter Notebook / Kubeflow Central Dashboard / KubeRay operator. Hence, Fig. 2 includes a lot of information that is unrelated to the example. This will make users confused.

    Fig. 1 architecture
    Fig. 2 architecture

Any thoughts? Thanks!

@juliusvonkohout
Copy link
Member Author

"Added. Btw, what's the reason to add securityContext here? The namespaces created by Kubeflow (example/) do not apply PSS. Why do we set the securityContext by default rather than making users configure that?" Setting and enforcing are two different things. We want to at least set the securitycontext

@juliusvonkohout
Copy link
Member Author

Figure 1 will be wrong quite soon. We rename everything to workbenches. Jupyter notebooks is also wrong it is a Notebook inside a Jupyterlab like a car and gasoline, file and server. Furthermore we also want to call it from KFP directly. We can include both images, but then please use the SVG version and rename to workbench/Jupyterlab, get rid of the notebook term and add KFP.

@juliusvonkohout
Copy link
Member Author

And thanks for your amazing effort so far

@kevin85421
Copy link
Contributor

kevin85421 commented Mar 1, 2023

Setting and enforcing are two different things. We want to at least set the securitycontext

Got it. Although the configuration has no relationship with this integration, it is nice to have.

Figure 1 will be wrong quite soon. We rename everything to workbenches.

I will add some sentences to explain it, but I prefer not to change the name of "Kubeflow Central Dashboard" in Figure 1. The example is based on release 1.6, and if we introduce new terms from future releases, users may feel confused. You can open a new PR to update it if you want the example to support the future release.

  • Add some sentences to explain it.

Jupyter notebooks is also wrong it is a Notebook inside a Jupyterlab like a car and gasoline, file and server.

Fixed.

Furthermore we also want to call it from KFP directly. We can include both images, but then please use the SVG version and rename to workbench/Jupyterlab, get rid of the notebook term and add KFP.

KFP (Kubeflow pipeline) seems not to be a necessary component for users to understand this example. It may make users feel confused. Again, you can open a new PR later to add a section to describe how to launch a JupyterLab instance via KFP.

please use the SVG version

Done

@kevin85421
Copy link
Contributor

Add KFP to the diagram

Done

Do you see Istio initcontainers and sidecars (proxy) for ray head, worker etc?

No, maybe I can try to put the Istio integration as an item for next quarter.

@juliusvonkohout
Copy link
Member Author

No, maybe I can try to put the Istio integration as an item for next quarter.

This might be problematic, because in a full Kubeflow they are injected by default.
So why are they not injected in your installation?

https://istio.io/latest/docs/setup/additional-setup/sidecar-injection/ is the documentation

since profile namespaces have the istio-injection=enabled label they will be added by default.

Or did you annotate the ray pods with sidecar.istio.io/inject="false" ?

Or are you not in a Kubeflow profile namespace?

@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Mar 27, 2023

NAMESPACE=default definitely looks wrong. We need to test in a namespace where also your Jupyterlabs from Kubeflow are running

@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Apr 4, 2023

@kimwnasptd please approve. The istio enablement is for a second iteration.

@juliusvonkohout
Copy link
Member Author

Istio support for ray is tracked here ray-project/kuberay#1005

@juliusvonkohout
Copy link
Member Author

juliusvonkohout commented Apr 4, 2023

@kevin85421 after #2432 i can approve

@juliusvonkohout juliusvonkohout changed the title WIP: Kuberay integration Kuberay integration Apr 28, 2023
@kimwnasptd
Copy link
Member

Thanks for the great work @juliusvonkohout! Awesome to see Ray start being part of Kubeflow

/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: juliusvonkohout, kimwnasptd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ray on Kubeflow - Asking for user and contributor interest / comments
3 participants