
Ray: restricted Pod Security Standards for enterprise security and Kubeflow integration #29665

Closed
juliusvonkohout opened this issue Oct 25, 2022 · 11 comments
Labels: enhancement (Request for new feature and/or capability), infra (autoscaler, ray client, kuberay related issues)

Comments

@juliusvonkohout commented Oct 25, 2022

Description

@Jeffwan @DmitriGekhtman related to kubeflow/kubeflow#6680 and ray-project/kuberay#502

You can build the OCI images however you want; you just need to adhere to the official Kubernetes "Restricted" Pod Security Standard: https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted .

If you have a good architecture, enforcing the restricted profile on your namespace should be enough. You can also set the pod securityContext manually to the values from the restricted set. The most important points are to block anything that starts with host*, to block privilege escalation, and to not run as root. If your images crash, please check whether you forgot to set proper file permissions on the working directories. If you need help on that level, feel free to reach out on the Kubeflow Slack or LinkedIn. I can also provide PodSecurityPolicies if your clusters are below Kubernetes 1.23.
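For reference, a pod securityContext that satisfies the restricted profile typically looks like the sketch below. The pod name, image, and UID/GID values are illustrative; match them to your own image.

```yaml
# Sketch of a restricted-PSS-compliant securityContext for a Ray pod.
# Names and UID/GID values are illustrative, not Ray's official settings.
apiVersion: v1
kind: Pod
metadata:
  name: ray-worker-example
spec:
  securityContext:
    runAsNonRoot: true        # blocks running as root
    runAsUser: 1000
    runAsGroup: 100
    fsGroup: 100
    seccompProfile:
      type: RuntimeDefault    # required by the restricted profile
  containers:
    - name: ray-worker
      image: rayproject/ray:latest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]       # drop all capabilities, incl. SYS_ADMIN
```

Note that none of the host* fields (hostNetwork, hostPID, hostIPC, hostPath volumes) appear, which is exactly what the restricted profile forbids.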

To properly build your images there are thousands of guides out there, but the most common points are: use USER 1000:0 at the end of the Dockerfiles when building the OCI image; make sure working directories are created with 777 file permissions; do not use and instead drop all capabilities such as SYS_ADMIN and SYS_CHROOT (the restricted profile does this for you); use networking ports above 1024; do not use insecure setuid or setgid binaries; and so on — the same things you would do for any proper Linux userspace application.
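A minimal sketch of these Dockerfile conventions, assuming a generic Python base image (the base image, paths, and port are illustrative, not Ray's actual Dockerfile):

```dockerfile
# Illustrative Dockerfile tail following the conventions above.
FROM python:3.10-slim

# Create the working directory with permissive permissions so an
# arbitrary non-root UID in the root group (GID 0) can write to it.
RUN mkdir -p /home/ray && chmod 777 /home/ray
WORKDIR /home/ray

# Listen on an unprivileged port (>1024).
EXPOSE 8265

# Run as a non-root user in the root group.
USER 1000:0
```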

As long as all pods (worker/head/raylet) run under the restricted PSS, that is fine. This is needed to make the cluster harder to compromise, as described in the Kubernetes PSS documentation linked above. Adding isolation within the worker/raylet pod is not essential (though still desirable in general) for the Kubeflow integration, since users will only have access to their own on-demand KubeRay clusters in their own namespaces. They can damage the clusters in their own namespace anyway if they wish to do so.

This can of course happen in parallel with the other integration tasks.

So far I have only checked the KubeRay implementation here: #14077 (comment). Please point me to the other implementation that you want to use instead, and to the Dockerfiles for the corresponding OCI images.

Use case

Official Kubernetes Enterprise security standards and integration with Kubeflow.

@juliusvonkohout juliusvonkohout added the enhancement Request for new feature and/or capability label Oct 25, 2022
@DmitriGekhtman DmitriGekhtman added the infra autoscaler, ray client, kuberay, related issues label Oct 25, 2022
@DmitriGekhtman (Contributor)

Adding isolation within the Worker/Raylet

This part is beyond near-term scope; for now, "Ray node == K8s pod" is the only architecture we can feasibly maintain.

Enterprise security standards

This is something the Ray team must make sure to nail down.

Adding the experts from the Ray team to comment on secure image building.
Any thoughts, @ijrsvt @aslonnie @thomasdesr @simon-mo ?

@simon-mo (Contributor)

Is there a scanner/tool we can use to check (and therefore plan to fix) the gaps in our current Docker images?

@juliusvonkohout (Author) commented Oct 25, 2022

Is there a scanner/tool we can use to check (and therefore plan out to fix) the gap in our current docker images?

Yes. Let's start with https://kubernetes.io/docs/concepts/security/pod-security-admission/ — use a modern cluster (1.24+) with that enabled and you can audit any violations. Kubernetes 1.24–1.25 is what Kubeflow 1.7 will require, and it supports the Pod Security admission controller.

Here is an example, but with the wrong profile (baseline instead of restricted): https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/
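Adapted to the restricted profile, the namespace labels from that page would look like this (the namespace name is illustrative):

```yaml
# Namespace labels that enforce, audit, and warn on the restricted profile.
apiVersion: v1
kind: Namespace
metadata:
  name: ray-user-namespace   # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Using `audit` and `warn` alongside `enforce` surfaces violations in the audit log and in kubectl warnings, which helps when triaging non-compliant pods.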

@juliusvonkohout (Author)

You need to add USER 1000:0 at the end of your Dockerfiles. Then run your CI/CD tests, and I bet you will already get some error messages regarding file permissions. I think that is the first step; afterwards I would focus on PSS.

@simon-mo (Contributor)

I believe we already meet the requirements regarding UID, GID, ports, etc.:

ARG RAY_UID=1000
ARG RAY_GID=100

We can go test against a modern cluster and audit violations. I think there is a misunderstanding here: Ray, on its critical path and with the default configuration, does not run as root at all and does not use podman or any other container tools.

The snippet you linked in the thread refers to an experimental feature that has not yet been recommended to any user.

container_driver = "podman"
container_command = [
    container_driver,
    "run",
    "-v",
    self._ray_tmp_dir + ":" + self._ray_tmp_dir,
    "--cgroup-manager=cgroupfs",
    "--network=host",
    "--pid=host",
    "--ipc=host",
    "--env-host",
]

This means Ray deployments on K8s should not require any privileges.

@juliusvonkohout (Author)

@simon-mo that sounds amazing. Yes, please test it in enforcing mode with some workloads, and then you can also advertise in the main GitHub README that it runs under the Kubernetes restricted PSS. This goes for the operator as well as the cluster.

If that is solved, the integration into Kubeflow is rather straightforward: just some boring controller and UI writing, RBAC rules, Istio policies, etc. for automation. This pod security part was worrying me the most ;-) Regarding the Notebook integration (Notebooks will be renamed to Workbenches), we will also find something simple.

@DmitriGekhtman (Contributor)

We can use this workload https://docs.ray.io/en/master/cluster/kubernetes/examples/ml-example.html as the example.
@simon-mo would you mind assigning someone to carry out the check?

@DmitriGekhtman (Contributor)

cc @sihanwang41 @kevin85421

@kevin85421 (Member) commented Nov 18, 2022

I will take a look at this issue.

Action items:

(1) Run KubeRay on Kubernetes v1.25 (KubeRay currently supports v1.19–v1.24). By the way, Pod Security Admission became stable in Kubernetes v1.25.

(2) Create a namespace with the label pod-security.kubernetes.io/enforce: restricted (Example)

(3) Check, with a pod that uses SYS_ADMIN or runs as root, that the policies are really enforced.

(4) Deploy a RayCluster in that namespace.

(5) Check whether some restricted policies are violated or not.

(6) Run an example E2E workload.

(7) Check whether some restricted policies are violated or not.
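For step (3), a quick sanity check is a pod that should be rejected in a namespace enforcing the restricted profile, e.g. (a sketch; pod name and image are illustrative):

```yaml
# This pod requests the SYS_ADMIN capability and runs as root,
# so the Pod Security admission controller should reject it in a
# namespace labeled pod-security.kubernetes.io/enforce: restricted.
apiVersion: v1
kind: Pod
metadata:
  name: pss-negative-test
spec:
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]
      securityContext:
        runAsUser: 0            # root: forbidden by restricted
        capabilities:
          add: ["SYS_ADMIN"]    # forbidden by restricted (and baseline)
```

If this pod is admitted, the namespace labels are not being enforced and the later checks would be meaningless.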

@juliusvonkohout (Author)

Alright, now that #31563 is merged, I think only ray-project/kuberay#866 is missing. We can then start the integration into Kubeflow on January 30th, if you have time then, @kevin85421.

@kevin85421 (Member)

We decided to integrate with Kubeflow without a Docker image update from Ray. See kubeflow/manifests#2383 for more details.
