Ensure scheduler, worker pods on correct node pool #536

Closed

Conversation

@TomAugspurger (Member)

Two changes here:

1. Ensure that worker pods are on preemptible nodes.
   This requires that the k8s cluster have nodes with the
   `k8s.dask.org_dedicated:worker` label. Otherwise workers won't start.
   Do we want this in the base config used for all our hubs?
2. Ensure that scheduler pods are *not* on preemptible nodes.

Right now, we can have scheduler pods end up in the preemptible nodes in
the dask pool

```
$ kubectl describe pod -n dev-staging dask-gateway-tomaugspurger-scheduler-df5cb0c595ca4c80adc12a82c84e7150 | grep pool
Node:         gke-dev-pangeo-io-cluster-dask-pool-f89fa71c-rh7b/10.128.0.82
  Normal  Scheduled  3m4s  default-scheduler                                           Successfully assigned dev-staging/dask-gateway-tomaugspurger-scheduler-df5cb0c595ca4c80adc12a82c84e7150 to gke-dev-pangeo-io-cluster-dask-pool-f89fa71c-rh7b
  ...
```

By removing the toleration, they won't end up there. I think this means
they'll always end up in the `core-pool`, which is currently not set up
to autoscale. We'll need to adjust that before merging this. @jhamman, does
putting schedulers in the `core-pool` sound OK? If so, is it OK to autoscale
it, or should we set up a dedicated dask-scheduler pool that's not preemptible?
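
For illustration, a minimal sketch of what the worker-side pinning could look like in the Helm values. The `worker.extraPodConfig` path and the `worker`-valued toleration are assumptions (mirroring the `scheduler` block quoted later in this thread); the `k8s.dask.org_dedicated: worker` node label is the one named above:

```
# Hypothetical sketch, not the exact diff in this PR.
pangeo:
  dask-gateway:
    gateway:
      clusterManager:
        worker:
          extraPodConfig:
            # Only place worker pods on nodes carrying the dedicated-worker label.
            nodeSelector:
              k8s.dask.org_dedicated: worker
            # Tolerate the taint on the (preemptible) dask worker pool.
            tolerations:
              - key: "k8s.dask.org/dedicated"
                operator: "Equal"
                value: "worker"
                effect: "NoSchedule"
```

Note that a toleration alone only permits scheduling onto the tainted pool; it is the label selector (or an equivalent affinity) that actually forces worker pods onto it.
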
@tjcrone (Contributor) commented Feb 10, 2020

@TomAugspurger, I think because the branch you are trying to merge is so far behind staging, and there was one change to fix the hubploy issues (see #533), your PR will not pass the initial checks. I would consider rebasing your branch and trying again. You could also merge upstream/staging if you do not want to rebase.

@TomAugspurger (Member, Author)

Thanks, hopefully fixed now.

@TomAugspurger changed the title from "Ensure scheduler, worker pods on correct node pool" to "[WIP]: Ensure scheduler, worker pods on correct node pool" on Feb 10, 2020
@jhamman mentioned this pull request Mar 2, 2020
@TomAugspurger (Member, Author)

I made the `core-pool` autoscalable using the GCP web UI. I think this should be good to go.

It'll need a bit of testing to make sure I got the taints / tolerations correct to ensure that the scheduler doesn't end up in the worker or jupyter pools.
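
For context on what that test checks: a scheduler pod stays out of the worker (and, analogously, the jupyter) pool only if that pool's nodes carry a `NoSchedule` taint the scheduler does not tolerate. An illustrative node-spec fragment for the dask worker pool (key and value assumed from the discussion above, not copied from the cluster config):

```
# A scheduler pod with no matching toleration is repelled by NoSchedule,
# so it cannot land on nodes carrying this taint.
spec:
  taints:
    - key: k8s.dask.org/dedicated
      value: worker
      effect: NoSchedule
```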

@TomAugspurger changed the title from "[WIP]: Ensure scheduler, worker pods on correct node pool" to "Ensure scheduler, worker pods on correct node pool" on Mar 12, 2020
@jhamman (Member) commented Mar 13, 2020

Thanks @TomAugspurger. This is currently blocked by #560, but otherwise it should be good to go.

@TomAugspurger (Member, Author)

Huh, so perhaps this PR is unnecessary now. In `pangeo-deploy/values.yaml` we already have

```
pangeo:
  dask-gateway:
    gateway:
      clusterManager:
        scheduler:
          extraPodConfig:
            tolerations:
              - key: "k8s.dask.org/dedicated"
                operator: "Equal"
                value: "scheduler"
                effect: "NoSchedule"
```

which IIUC matches the taint we added to the new scheduler node pool in #569.
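
For that toleration to matter, the taint on the new scheduler node pool has to match it exactly in key, value, and effect. Presumably (the actual change lives in #569 and is not quoted here) it looks something like:

```
# Assumed taint on the dedicated scheduler node pool, matching the
# toleration quoted above.
spec:
  taints:
    - key: k8s.dask.org/dedicated
      value: scheduler
      effect: NoSchedule
```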

@TomAugspurger (Member, Author)

Closing, since nothing more should be required.

@TomAugspurger deleted the `scheduler-core-pool` branch June 23, 2020 18:05