
Cannot connect to cluster in 2023.6.0 #751

Closed · Artimi opened this issue Jun 28, 2023 · 6 comments

@Artimi (Contributor) commented Jun 28, 2023

Describe the issue:

I cannot connect to a running cluster with version 2023.6.0. The problem appears to be with obtaining valid credentials somewhere.

Minimal Complete Verifiable Example:
With the operator installed this way:

helm install -n dask-ns dask-kubernetes-operator dask-kubernetes-operator --set image.tag=2023.6.0 --set=rbac.cluster=false --set kopfArgs="{--namespace=dask-ns}"

and with the deployment permissions patched as described in #749,

running this code

from dask_kubernetes.operator import KubeCluster
import time
import random
import joblib


def square(x):
    time.sleep(random.expovariate(1.5))
    return x**2


def main():
    cluster = KubeCluster(
        name="my-dask-cluster",
        image="ghcr.io/dask/dask:2023.6.0-py3.11",
        namespace="dask-ns",
        env={"EXTRA_PIP_PACKAGES": "joblib"},
        shutdown_on_close=True,
    )
    print("Cluster created")
    cluster.scale(1)
    client = cluster.get_client()
    print("Client", client)
    joblib.parallel_backend(
        "dask", client=client, pure=False, wait_for_workers_timeout=60
    )

    results = joblib.Parallel(n_jobs=2)(
        joblib.delayed(square)(arg) for arg in range(10)
    )
    print(results)
    client.close()


if __name__ == "__main__":
    main()

produces this exception on the command line:

$ python k8s_cluster.py
╭─────────────────── Creating KubeCluster 'my-dask-cluster' ───────────────────╮
│                                                                              │
│   DaskCluster                                                      Pending   │
│   Scheduler Pod                                                          -   │
│   Scheduler Service                                                      -   │
│   Default Worker Group                                             Created   │
│                                                                              │
│ ⠴ Waiting for controller to action cluster                                   │
╰──────────────────────────────────────────────────────────────────────────────╯
Traceback (most recent call last):
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 334, in _create_cluster
    await self._wait_for_controller()
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 450, in _wait_for_controller
    raise TimeoutError(
TimeoutError: Dask Cluster resource not actioned after 60 seconds, is the Dask Operator running?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/artimi/playground/dask/k8s_cluster.py", line 38, in <module>
    main()
  File "/Users/artimi/playground/dask/k8s_cluster.py", line 14, in main
    cluster = KubeCluster(
              ^^^^^^^^^^^^
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 244, in __init__
    self.sync(self._start)
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/distributed/utils.py", line 349, in sync
    return sync(
           ^^^^^
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/distributed/utils.py", line 416, in sync
    raise exc.with_traceback(tb)
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/distributed/utils.py", line 389, in f
    result = yield future
             ^^^^^^^^^^^^
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/tornado/gen.py", line 769, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 280, in _start
    await self._create_cluster()
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 336, in _create_cluster
    await self._close()
  File "/Users/artimi/playground/dask/.venv/lib/python3.11/site-packages/dask_kubernetes/operator/kubecluster/kubecluster.py", line 726, in _close
    if time.time() > start + timeout:
                     ~~~~~~^~~~~~~~~

And this exception in the operator:

[2023-06-28 10:50:11,211] kopf.objects         [ERROR   ] [dask-ns/my-dask-cluster-scheduler] Handler 'handle_scheduler_service_status/status' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 371, in handle_scheduler_service_status
    cluster = await DaskCluster.get(
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 137, in get
    api = await kr8s.asyncio.api()
  File "/usr/local/lib/python3.10/site-packages/kr8s/asyncio/_api.py", line 31, in api
    return await _f(
  File "/usr/local/lib/python3.10/site-packages/kr8s/asyncio/_api.py", line 29, in _f
    return await _cls(**kwargs, bypass_factory=True)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 53, in f
    await self.auth
  File "/usr/local/lib/python3.10/site-packages/kr8s/_auth.py", line 44, in f
    await self.reauthenticate()
  File "/usr/local/lib/python3.10/site-packages/kr8s/_auth.py", line 56, in reauthenticate
    raise ValueError("Unable to find valid credentials")
ValueError: Unable to find valid credentials

The exception is also visible in the Events of the scheduler's Kubernetes Service:

$ k describe svc my-dask-cluster-scheduler
...
Events:
  Type   Reason   Age   From  Message
  ----   ------   ----  ----  -------
  Error  Logging  25s   kopf  Handler 'handle_scheduler_service_status/status' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 11...n3.10/site-packages/kr8s/asyncio/_api.py", line 29, in _f
    return await _cls(**kwargs, bypass_factory=True)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 53, in f
    await self.auth
  File "/usr/local/lib/python3.10/site-packages/kr8s/_auth.py", line 44, in f
    await self.reauthenticate()
  File "/usr/local/lib/python3.10/site-packages/kr8s/_auth.py", line 56, in reauthenticate
    raise ValueError("Unable to find valid credentials")
ValueError: Unable to find valid credentials

Anything else we need to know?:

Environment:

  • Dask version: 2023.6.0
  • Python version: 3.11.3
  • Operating System: macOS
  • Install method (conda, pip, source): pip

@jacobtomlinson (Member) commented

It looks like kr8s is unable to find the service account credentials that are being provided by the Dask Operator helm chart. I'm investigating now.
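
A quick way to check this from inside the operator pod is to look for the standard in-cluster credential sources directly (a rough diagnostic sketch; the environment variables and token mount path below are general Kubernetes conventions, not anything specific to dask-kubernetes or kr8s):

# Diagnostic sketch: report whether the usual in-cluster credential sources exist.
import os
from pathlib import Path

SA_DIR = Path("/var/run/secrets/kubernetes.io/serviceaccount")  # default token mount

def check_in_cluster_credentials() -> None:
    print("KUBERNETES_SERVICE_HOST:", os.environ.get("KUBERNETES_SERVICE_HOST", "<not set>"))
    print("KUBERNETES_SERVICE_PORT:", os.environ.get("KUBERNETES_SERVICE_PORT", "<not set>"))
    for name in ("token", "ca.crt", "namespace"):
        path = SA_DIR / name
        print(f"{path}: {'present' if path.exists() else 'missing'}")

if __name__ == "__main__":
    check_in_cluster_credentials()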

@bstadlbauer (Collaborator) commented

@jacobtomlinson Should we add a smoke test that installs the operator helm chart and spins up a simple cluster? 🤔 Or is there a reason this wasn't caught?
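
A rough sketch of the kind of smoke test being proposed (assuming the helm chart is already installed into the test cluster; the test name, worker count, and timeout are illustrative, not from an existing test):

from dask_kubernetes.operator import KubeCluster
from distributed import Client


def test_smoke_simple_cluster():
    # Spin up a tiny cluster against the operator installed from the helm chart
    # and make sure a trivial task round-trips through it.
    with KubeCluster(name="smoke-test", n_workers=1) as cluster:
        with Client(cluster) as client:
            assert client.submit(lambda x: x + 1, 41).result(timeout=120) == 42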

@jacobtomlinson (Member) commented

Yeah I think that's a great idea.

@jacobtomlinson (Member) commented

OK, I've fixed the bug in kr8s and released v0.6.0. Then in #752 I've bumped our dependency here and tested the helm chart locally using the pytest-kind cluster.

Build the dev image

$ docker build -t ghcr.io/dask/dask-kubernetes-operator:dev -f dask_kubernetes/operator/deployment/Dockerfile .

Load the image into the kind cluster

$ kind load docker-image --name pytest-kind ghcr.io/dask/dask-kubernetes-operator:dev

Install the helm chart

$ cd dask_kubernetes/operator/deployment/helm/

$ helm install --generate-name ./dask-kubernetes-operator --set image.tag=dev

Create a cluster

In [1]: from dask_kubernetes.operator import KubeCluster
   ...: cluster = KubeCluster(name="foo", n_workers=1)
╭───────────────────────── Creating KubeCluster 'foo' ─────────────────────────╮
│                                                                              │
│   DaskCluster                                                      Running   │
│   Scheduler Pod                                                    Running   │
│   Scheduler Service                                                Created   │
│   Default Worker Group                                             Created   │
│                                                                              │
│ ⠇ Getting dashboard URL                                                      │
╰──────────────────────────────────────────────────────────────────────────────╯


I'm going to cut another release here and things should be good!

@bstadlbauer (Collaborator) commented

Thank you! I've been working on the automated test, but my kind cluster wouldn't come up while I was on the train 🤦

@jacobtomlinson (Member) commented

2023.6.1 is up on PyPI. Conda Forge is still stuck on 2023.3.2 due to some packaging issues, but it will skip right over this problem, so I'm going to close this out.
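
To confirm a given environment has picked up the fix, checking the installed version is enough (a small sketch using only the standard library; 2023.6.0 is the affected release and 2023.6.1 the fixed one):

# Print the installed dask-kubernetes version; anything older than 2023.6.1
# (e.g. 2023.6.0) still ships the kr8s credentials bug described above.
from importlib.metadata import version

print("dask-kubernetes:", version("dask-kubernetes"))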

Looking forward to seeing the tests @bstadlbauer.
