
Ray: restricted Pod Security Standards for enterprise security and Kubeflow integration #29665

Closed
juliusvonkohout opened this issue Oct 25, 2022 · 11 comments
Labels: enhancement (Request for new feature and/or capability), infra (autoscaler, ray client, kuberay related issues)

Comments

@juliusvonkohout commented Oct 25, 2022

Description

@Jeffwan @DmitriGekhtman related to kubeflow/kubeflow#6680 and ray-project/kuberay#502

You can build the OCI images however you want; you just need to adhere to the official Kubernetes "Restricted" Pod Security Standard: https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted .

If you have a good architecture, enforcing the restricted profile on your namespace should be enough. You can also set the pod securityContext manually to the values from the restricted set. The most important points are to block anything that starts with host*, to block privilege escalation, and to not run as root. If your images crash, please check whether you forgot to set proper file permissions on the working directories. If you need help on that level, feel free to reach out on the Kubeflow Slack or LinkedIn. I can also provide PodSecurityPolicies if your clusters are below Kubernetes 1.23.
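For reference, a pod securityContext that satisfies the restricted profile typically looks like the sketch below. The pod name, image, and UID/GID values are illustrative; match them to your own image.

```yaml
# Sketch of a restricted-PSS-compliant securityContext for a Ray pod.
# Names and UID/GID values are illustrative, not Ray's official settings.
apiVersion: v1
kind: Pod
metadata:
  name: ray-worker-example
spec:
  securityContext:
    runAsNonRoot: true        # blocks running as root
    runAsUser: 1000
    runAsGroup: 100
    fsGroup: 100
    seccompProfile:
      type: RuntimeDefault    # required by the restricted profile
  containers:
    - name: ray-worker
      image: rayproject/ray:latest
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]       # drop all capabilities, incl. SYS_ADMIN
```

Note that none of the host* fields (hostNetwork, hostPID, hostIPC, hostPath volumes) appear, which is exactly what the restricted profile forbids.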

To properly build your images there are thousands of guides out there, but the most common points are: use USER 1000:0 at the end of the Dockerfiles when building the OCI image; make sure working directories are created with 777 file permissions; do not use and instead drop all capabilities such as SYS_ADMIN and SYS_CHROOT (the restricted profile does this for you); use networking ports above 1024; do not use insecure setuid or setgid binaries; and so on — the same things you would do for any proper Linux userspace application.
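A minimal sketch of these Dockerfile conventions, assuming a generic Python base image (the base image, paths, and port are illustrative, not Ray's actual Dockerfile):

```dockerfile
# Illustrative Dockerfile tail following the conventions above.
FROM python:3.10-slim

# Create the working directory with permissive permissions so an
# arbitrary non-root UID in the root group (GID 0) can write to it.
RUN mkdir -p /home/ray && chmod 777 /home/ray
WORKDIR /home/ray

# Listen on an unprivileged port (>1024).
EXPOSE 8265

# Run as a non-root user in the root group.
USER 1000:0
```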

As long as all pods (worker/head/raylet) run under the restricted PSS, that is fine. This is needed to make the cluster harder to compromise, as described in the Kubernetes PSS documentation linked above. Adding isolation within the worker/raylet pod is not essential (though still desirable in general) for the Kubeflow integration, since users will only have access to their own on-demand KubeRay clusters in their own namespaces. They can damage the clusters in their own namespace anyway if they wish to do so.

This can of course happen in parallel with the other integration tasks.

So far I have only checked the KubeRay implementation here: #14077 (comment). Please point me to the other implementation that you want to use instead, and to the Dockerfiles for the corresponding OCI images.

Use case

Official Kubernetes Enterprise security standards and integration with Kubeflow.

@juliusvonkohout juliusvonkohout added the enhancement Request for new feature and/or capability label Oct 25, 2022
@DmitriGekhtman DmitriGekhtman added the infra autoscaler, ray client, kuberay, related issues label Oct 25, 2022
@DmitriGekhtman (Contributor)

Adding isolation within the Worker/Raylet

This part is beyond near-term scope; for now, "Ray node == K8s pod" is the only architecture we can feasibly maintain.

Enterprise security standards

This is something the Ray team must make sure to nail down.

Adding the experts from the Ray team to comment on secure image building.
Any thoughts, @ijrsvt @aslonnie @thomasdesr @simon-mo ?

@simon-mo (Contributor)

Is there a scanner/tool we can use to check (and therefore plan to fix) the gaps in our current Docker images?

@juliusvonkohout (Author) commented Oct 25, 2022

Is there a scanner/tool we can use to check (and therefore plan out to fix) the gap in our current docker images?

Yes. Let's start with https://kubernetes.io/docs/concepts/security/pod-security-admission/ — use a modern cluster (1.24+) with that enabled and you can audit any violations. Kubernetes 1.24–1.25 is what Kubeflow 1.7 will require, and it supports the Pod Security admission controller.

Here is an example, but with the wrong profile (baseline instead of restricted): https://kubernetes.io/docs/tasks/configure-pod-container/enforce-standards-namespace-labels/
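Adapted to the restricted profile, the namespace labels from that page would look like this (the namespace name is illustrative):

```yaml
# Namespace labels that enforce, audit, and warn on the restricted profile.
apiVersion: v1
kind: Namespace
metadata:
  name: ray-user-namespace   # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```

Using `audit` and `warn` alongside `enforce` surfaces violations in the audit log and in kubectl warnings, which helps when triaging non-compliant pods.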

@juliusvonkohout (Author)

You need to add USER 1000:0 at the end of your Dockerfiles. Then run your CI/CD tests, and I bet you will already get some error messages regarding file permissions. I think that is the first step; afterwards I would focus on PSS.

@simon-mo (Contributor)

I believe we already meet the requirements regarding UID, GID, ports, etc.:

ARG RAY_UID=1000
ARG RAY_GID=100

We can go test against a modern cluster and audit violations. I think there is a misunderstanding here: Ray, on its critical path and with the default configuration, does not run as root at all and does not use podman or any other container tools.

The snippet you linked in the thread refers to an experimental feature that has not yet been recommended to any user.

container_driver = "podman"
container_command = [
    container_driver,
    "run",
    "-v",
    self._ray_tmp_dir + ":" + self._ray_tmp_dir,
    "--cgroup-manager=cgroupfs",
    "--network=host",
    "--pid=host",
    "--ipc=host",
    "--env-host",
]

This means Ray deployments on K8s should not require any privileges.

@juliusvonkohout (Author)

@simon-mo that sounds amazing. Yes, please test it in enforcing mode with some workloads, and then you can also advertise in the main GitHub README that it runs under the Kubernetes restricted PSS. This goes for the operator as well as the cluster.

If that is solved, the integration into Kubeflow is rather straightforward: just some boring controller and UI writing, RBAC rules, Istio policies, etc. for automation. This pod security part was worrying me the most ;-) Regarding the Notebook integration (Notebooks will be renamed to Workbenches), we will also find something simple.

@DmitriGekhtman (Contributor)

We can use this workload https://docs.ray.io/en/master/cluster/kubernetes/examples/ml-example.html as the example.
@simon-mo would you mind assigning someone to carry out the check?

@DmitriGekhtman (Contributor)

cc @sihanwang41 @kevin85421

@kevin85421 (Member) commented Nov 18, 2022

I will take a look at this issue.

Action items:

(1) Run KubeRay on Kubernetes v1.25 (KubeRay currently supports v1.19–v1.24). By the way, Pod Security Admission became stable in Kubernetes v1.25.

(2) Create a namespace with the label pod-security.kubernetes.io/enforce: restricted (Example)

(3) Check, with a pod that uses SYS_ADMIN or runs as root, that the policies are really enforced.

(4) Deploy a RayCluster in that namespace.

(5) Check whether some restricted policies are violated or not.

(6) Run an example E2E workload.

(7) Check whether some restricted policies are violated or not.
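For step (3), a quick sanity check is a pod that should be rejected in a namespace enforcing the restricted profile, e.g. (a sketch; pod name and image are illustrative):

```yaml
# This pod requests the SYS_ADMIN capability and runs as root,
# so the Pod Security admission controller should reject it in a
# namespace labeled pod-security.kubernetes.io/enforce: restricted.
apiVersion: v1
kind: Pod
metadata:
  name: pss-negative-test
spec:
  containers:
    - name: test
      image: busybox
      command: ["sleep", "3600"]
      securityContext:
        runAsUser: 0            # root: forbidden by restricted
        capabilities:
          add: ["SYS_ADMIN"]    # forbidden by restricted (and baseline)
```

If this pod is admitted, the namespace labels are not being enforced and the later checks would be meaningless.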

@juliusvonkohout (Author)

Alright, now that #31563 is merged, I think only ray-project/kuberay#866 is missing. We can then start the integration into Kubeflow on January 30th, if you have time then, @kevin85421.

@kevin85421 (Member)

We decided to integrate with Kubeflow without a Docker image update from Ray. See kubeflow/manifests#2383 for more details.
