Ensure scheduler, worker pods on correct node pool #536

Closed

Conversation

@TomAugspurger (Member)

Two changes here:

1. Ensure that worker pods are on preemptible nodes.
   This requires that the k8s cluster have nodes with the
   `k8s.dask.org_dedicated:worker` label. Otherwise workers won't start.
   Do we want this in the base config used for all our hubs?
2. Ensure that scheduler pods are *not* on preemptible nodes.

Right now, we can have scheduler pods end up in the preemptible nodes in
the dask pool

```
$ kubectl describe pod -n dev-staging dask-gateway-tomaugspurger-scheduler-df5cb0c595ca4c80adc12a82c84e7150 | grep pool
Node:         gke-dev-pangeo-io-cluster-dask-pool-f89fa71c-rh7b/10.128.0.82
  Normal  Scheduled  3m4s  default-scheduler                                           Successfully assigned dev-staging/dask-gateway-tomaugspurger-scheduler-df5cb0c595ca4c80adc12a82c84e7150 to gke-dev-pangeo-io-cluster-dask-pool-f89fa71c-rh7b
  ...
```

By removing the toleration, they won't end up there. I think this means
they'll always end up in the `core-pool`, which is currently not set up
to autoscale. We'll need to adjust that before merging this. @jhamman, does
putting schedulers in the `core-pool` sound OK? If so, is it OK to autoscale
it, or should we set up a dedicated dask-scheduler pool that's not preemptible?
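
For illustration, a minimal sketch of what the worker-side pinning could look like in the Helm values. The `worker.extraPodConfig` path and the `worker`-valued toleration are assumptions (mirroring the `scheduler` block quoted later in this thread); the `k8s.dask.org_dedicated: worker` node label is the one named above:

```
# Hypothetical sketch, not the exact diff in this PR.
pangeo:
  dask-gateway:
    gateway:
      clusterManager:
        worker:
          extraPodConfig:
            # Only place worker pods on nodes carrying the dedicated-worker label.
            nodeSelector:
              k8s.dask.org_dedicated: worker
            # Tolerate the taint on the (preemptible) dask worker pool.
            tolerations:
              - key: "k8s.dask.org/dedicated"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
```

Note that a toleration alone only permits scheduling onto the tainted pool; it is the label selector (or an equivalent affinity) that actually forces worker pods onto it.
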
@tjcrone (Contributor) commented Feb 10, 2020

@TomAugspurger, I think because the branch you are trying to merge is so far behind staging, and there was one change to fix the hubploy issues (see #533), your PR will not pass the initial checks. I would consider rebasing your branch and trying again. You could also merge upstream/staging if you do not want to rebase.

@TomAugspurger (Member, Author)

Thanks, hopefully fixed now.

@TomAugspurger changed the title from "Ensure scheduler, worker pods on correct node pool" to "[WIP]: Ensure scheduler, worker pods on correct node pool" on Feb 10, 2020
@jhamman mentioned this pull request Mar 2, 2020
@TomAugspurger (Member, Author)

I made the `core-pool` autoscalable using the GCP web UI. I think this should be good to go.

It'll need a bit of testing to make sure I got the taints / tolerations correct to ensure that the scheduler doesn't end up in the worker or jupyter pools.
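
For context on what that test checks: a scheduler pod stays out of the worker (and, analogously, the jupyter) pool only if that pool's nodes carry a `NoSchedule` taint the scheduler does not tolerate. An illustrative node-spec fragment for the dask worker pool (key and value assumed from the discussion above, not copied from the cluster config):

```
# A scheduler pod with no matching toleration is repelled by NoSchedule,
# so it cannot land on nodes carrying this taint.
spec:
  taints:
    - key: k8s.dask.org/dedicated
      value: worker
      effect: NoSchedule
```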

@TomAugspurger changed the title from "[WIP]: Ensure scheduler, worker pods on correct node pool" to "Ensure scheduler, worker pods on correct node pool" on Mar 12, 2020
@jhamman (Member) commented Mar 13, 2020

Thanks @TomAugspurger. This is currently blocked by #560, but otherwise it should be good to go.

@TomAugspurger (Member, Author)

Huh, so perhaps this PR is unnecessary now. In `pangeo-deploy/values.yaml` we already have

```
pangeo:
  dask-gateway:
    gateway:
      clusterManager:
        scheduler:
          extraPodConfig:
            tolerations:
              - key: "k8s.dask.org/dedicated"
                operator: "Equal"
                value: "scheduler"
                effect: "NoSchedule"
```

which IIUC matches the taint we added to the new scheduler node pool in #569.
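
For that toleration to matter, the taint on the new scheduler node pool has to match it exactly in key, value, and effect. Presumably (the actual change lives in #569 and is not quoted here) it looks something like:

```
# Assumed taint on the dedicated scheduler node pool, matching the
# toleration quoted above.
spec:
  taints:
    - key: k8s.dask.org/dedicated
      value: scheduler
      effect: NoSchedule
```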

@TomAugspurger (Member, Author)

Closing, since nothing more should be required.

@TomAugspurger deleted the `scheduler-core-pool` branch June 23, 2020 18:05