Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controller should work in namespaced or role-only mode #1295

Closed
philwinder opened this issue Aug 6, 2020 · 14 comments
Closed

Controller should work in namespaced or role-only mode #1295

philwinder opened this issue Aug 6, 2020 · 14 comments

Comments

@philwinder
Copy link

/kind feature

In multi-tenant clusters, I don't want users to have cluster-level access to objects. The crude approach is to replace all ClusterRoles (and bindings) with Roles. This works fine for the 0.9.0 UI, because it falls back to requesting experiments/etc. from the current namespace.

However, many parts of the controller codebase assume that there is only a single instance of the controller installed and also need cluster-scope, because that was how it was designed. For example, Katib installs a MutatingWebhookConfiguration that points to the namespace where Katib is installed. But the name of the MWC is hardcoded, so a subsequent install of Katib overwrites the MWC.

I'm sure there's more too. I suspect this requires fundamental changes in the design/operation of the controller.

This represents a security risk for my clients, because all users can query the controller API and observe the status/results/etc. of the jobs from other namespaces, even ones they don't have access to.

For future people that find this issue, if this isn't implemented, then the only workaround is to have a single centralised controller, and many UI's in each namespace. And accept the security risk.

@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.59
area/katib 0.80

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@andreyvelich
Copy link
Member

Thank you for pointing this @philwinder.

I believe it was designed by that because Katib is installed as part of Kubeflow. In Kubeflow you should submit Experiments only in your user profile namespace. Also see this: #162.

On the other hand, Katib can be also stand-alone deployment where not giving cluster-role access is possible.
That requires some component changes.

Regarding to the webhook, how we should handle multi-user usage with more than one controller in namespace-role access?

/cc @gaocegege @johnugeorge

@philwinder
Copy link
Author

Hi @andreyvelich. Just stumbled across my own issue again. :-)

Yes, you're right, and that is a problem with Kubeflow too. It expects a single kubeflow instance and the ability to create/control namespaces. For many of my clients that is not acceptable. Namespaces are locked down.

But yes, a simple solution would be to provide the ability to create a webhook with the namespace name in the name of the webhook, or the ability to customize the name of the webhook. That way, katib can install multiple webhooks without clashing/overwriting.

Another alternative is to add an option to disable the mutating webhook entirely! After a lot of hacking around, I've found that if you disable some of the cluster scoped roles, it accidentally disables the webhook controller. So it doesn't even install mutating webhook. Depending on what you disable, parts of katib work. Other's don't.

Any thoughts?

Phil

@philwinder
Copy link
Author

Just a quick update on this to help future me/people. In the end I have written a script/pod that proactively scrapes for the hardcoded mutating and validating configs, deletes them, then creates new ones with non-clashing equivalents.

@andreyvelich
Copy link
Member

@philwinder If we disable mutating webhook, metrics collector doesn't work.
We can add webhook YAMLs to the manifests once user deploys Katib.
In that case, controller doesn't need to deploy webhooks.

Do you want to attend one of the upcoming AutoML meetings to discuss questions of using Katib without cluster-role access ?

/cc @gaocegege @johnugeorge

@nlarusstone
Copy link

I have a similar, but slightly different ask -- I want Katib to have true multi-user support. For me, that means that I can have different configurations for Katib in each namespace (I'm ok with having a centralized controller though).

This would allow me to customize, for example, the Katib configmap in each namespace to use a different service account for each namespace. This would also entail running experiments in separate namespaces (which is useful for having resource quotas and proper separation of concerns).

I was told to post here following this discussion on slack: https://kubeflow.slack.com/archives/C018PMV53NW/p1604351060029800?thread_ts=1604341870.028800&cid=C018PMV53NW

@andreyvelich
Copy link
Member

/priority p2

@philwinder
Copy link
Author

philwinder commented Nov 3, 2020

Hi @andreyvelich ,
Thanks for your comments.

If we disable mutating webhook, metrics collector doesn't work.

Yes, I understand. I was proposing that we stop hardcoding the name of the webooks. In my case, I am deleting the created webhooks and recreating them with team-friendly names. That works.

We can add webhook YAMLs to the manifests once user deploys Katib. In that case, controller doesn't need to deploy webhooks.

Yes, that would work too. Although that would stop the UI-generated job from working, since that expects the webhook to mutate the request. But yes, adding a flag to disable them, then adding a flag to the UI to "add metrics to this job..." would work. I prefer this because then katib is not having to make global changes.

attend one of the upcoming AutoML meetings

I'm happy to attend if you want me to, but I have no more questions, thank you.

Dear @nlarusstone,
Yes, I agree. There's a bigger story here. For example, removing the mandated need for cluster-scope access for the controller. That and the UI shouldn't NEED cluster-scope access to do it's job. But this is a wider problem with Kubeflow. Nearly all KF components have this problem. Kubeflow was never intended to run multiple instances on the same cluster. A major oversight IMO.

Thanks all,
Phil

@nlarusstone
Copy link

@philwinder I agree that Kubeflow isn't intended to run multiple instances on the same cluster, but there is definitely an intent to allow multi-user isolation (see docs here: https://www.kubeflow.org/docs/components/multi-tenancy/). While Katib supports that on the UI, I would like to see more in depth support of it on the cluster level by allowing us to schedule resources in separate namespaces. I'm fine with having a single Kubeflow instance that coordinates the running of resources in separate namespaces (and therefore wouldn't need cluster-scoped access).

@andreyvelich
Copy link
Member

andreyvelich commented Nov 3, 2020

Yes, that would work too. Although that would stop the UI-generated job from working, since that expects the webhook to mutate the request. But yes, adding a flag to disable them, then adding a flag to the UI to "add metrics to this job..." would work. I prefer this because then katib is not having to make global changes.

@philwinder What do you mean by: "stop the UI-generated job from working"?
I believe, we create webhook when Katib controller is created.

Yes, I agree. There's a bigger story here. For example, removing the mandated need for cluster-scope access for the controller. That and the UI shouldn't NEED cluster-scope access to do it's job. But this is a wider problem with Kubeflow. Nearly all KF components have this problem. Kubeflow was never intended to run multiple instances on the same cluster. A major oversight IMO.

If our community wants to see role-level access in Katib, we can make it for the standalone installation.
When you use Katib outside of Kubeflow.

@philwinder
Copy link
Author

Hi @andreyvelich,
Sorry, I mean that when you specify an experiment, it has to be mutated in order to work. I.e. you need to inject the metrics collector. If it doesn't go through that mutation then you won't see results.

we create webhook when Katib controller is created

Not quite, the webhook is created by the controller itself, in the go code. It's not created by the user during the install of the manifests, for instance. If it was created during installation via a manifest, then we could modify the name of the webhooks. As it stands, we can't alter any of the behaviour of the webhook creation.

Thanks!

@andreyvelich
Copy link
Member

Not quite, the webhook is created by the controller itself, in the go code. It's not created by the user during the install of the manifests, for instance. If it was created during installation via a manifest, then we could modify the name of the webhooks. As it stands, we can't alter any of the behaviour of the webhook creation.

Yes, that is correct. Probably we should add the manifests for the webhooks, like mentioned here: kubeflow/manifests#1379.

The question is, how we make sure that user's webhooks have the unique names.

@stale
Copy link

stale bot commented Jun 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale
Copy link

stale bot commented Jul 8, 2021

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants