
kubeflow-dashboard channels 1.8/edge and latest/edge are not functional when deployed with Juju 3.5 #188

Open
NohaIhab opened this issue May 9, 2024 · 11 comments
Labels
bug Something isn't working

Comments


NohaIhab commented May 9, 2024

Bug Description

Note: This behavior is seen with 1.8/edge and latest/edge, but not 1.8/stable.
This issue is blocking the CI of PR canonical/dex-auth-operator#187.
This came up during the CI updates to Juju 3.5, specifically in the dex-auth integration CI, where test_login times out when logging in to the dashboard.
When trying to access the dashboard from the browser, the request fails with this error:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 111

The response status is 503 Service Unavailable and the server header is istio-envoy.

Observing the istio-ingressgateway workload logs:

[2024-05-09T10:48:49.956Z] "GET /?next=%2F HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure,delayed_connect_error:_111} - "delayed_connect_error:_111" 0 168 64 - "172.31.18.195" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/124.0.6367.155 Safari/537.36" "60b6e7d2-4ee7-461f-aef4-9d510109ccbf" "10.64.140.43.nip.io" "10.1.199.227:8082" outbound|8082||kubeflow-dashboard.kubeflow.svc.cluster.local - 10.1.199.236:8080 172.31.18.195:3212 - -

A delayed connect error 111 typically means the application port is not open.
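For reference (my own addition, not from the original report): errno 111 on Linux is ECONNREFUSED, the error returned when nothing is listening on the target port, which is why Envoy reports it as a connection failure. Python's errno module confirms the mapping:

```python
import errno

# On Linux, errno 111 is ECONNREFUSED: the TCP connection attempt was
# actively refused because no process is listening on the destination port,
# which matches Envoy's "delayed connect error: 111".
print(errno.errorcode[111])
assert errno.ECONNREFUSED == 111
```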

and observing the kubeflow-dashboard workload logs:

2024-05-09T09:42:41.298Z [pebble] HTTP API server listening on ":38813".
2024-05-09T09:42:41.299Z [pebble] Started daemon.
2024-05-09T09:43:07.129Z [pebble] GET /v1/plan?format=yaml 230.782µs 200
2024-05-09T09:43:07.148Z [pebble] POST /v1/layers 291.684µs 200
2024-05-09T09:43:07.163Z [pebble] POST /v1/services 7.638296ms 202
2024-05-09T09:43:07.170Z [pebble] GET /v1/notices?timeout=30s 25.7898025s 200
2024-05-09T09:43:07.170Z [pebble] Service "serve" starting: /usr/bin/npm start
2024-05-09T09:43:07.646Z [serve] 
2024-05-09T09:43:07.646Z [serve] > [email protected] start
2024-05-09T09:43:07.646Z [serve] > npm run serve
2024-05-09T09:43:07.646Z [serve] 
2024-05-09T09:43:08.112Z [serve] glob error [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-05-09T09:43:08.112Z [serve]   errno: -13,
2024-05-09T09:43:08.112Z [serve]   code: 'EACCES',
2024-05-09T09:43:08.112Z [serve]   syscall: 'scandir',
2024-05-09T09:43:08.112Z [serve]   path: '/root/.npm/_logs'
2024-05-09T09:43:08.112Z [serve] }
2024-05-09T09:43:08.114Z [serve] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-05-09T09:43:08.114Z [serve] npm WARN logfile  error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-05-09T09:43:08.114Z [serve] npm WARN logfile   errno: -13,
2024-05-09T09:43:08.115Z [serve] npm WARN logfile   code: 'EACCES',
2024-05-09T09:43:08.115Z [serve] npm WARN logfile   syscall: 'scandir',
2024-05-09T09:43:08.115Z [serve] npm WARN logfile   path: '/root/.npm/_logs'
2024-05-09T09:43:08.115Z [serve] npm WARN logfile }
2024-05-09T09:43:08.118Z [serve] 
2024-05-09T09:43:08.118Z [serve] > [email protected] serve
2024-05-09T09:43:08.118Z [serve] > node dist/server.js
2024-05-09T09:43:08.118Z [serve] 
2024-05-09T09:43:08.177Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.012609377s 200
2024-05-09T09:43:08.177Z [pebble] GET /v1/notices?after=2024-05-09T09%3A43%3A07.164028187Z&timeout=30s 1.006409154s 200
2024-05-09T09:43:08.896Z [serve] Initializing Kubernetes configuration
2024-05-09T09:43:08.904Z [serve] [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T09:43:08.904Z [serve]     at Object.openSync (fs.js:462:3)
2024-05-09T09:43:08.904Z [serve]     at Object.readFileSync (fs.js:364:35)
2024-05-09T09:43:08.904Z [serve]     at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-05-09T09:43:08.904Z [serve]     at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-05-09T09:43:08.904Z [serve]     at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-05-09T09:43:08.904Z [serve]     at /app/dist/server.js:46:28
2024-05-09T09:43:08.904Z [serve]     at Generator.next (<anonymous>)
2024-05-09T09:43:08.904Z [serve]     at /app/dist/server.js:8:71
2024-05-09T09:43:08.904Z [serve]     at new Promise (<anonymous>)
2024-05-09T09:43:08.904Z [serve]     at __awaiter (/app/dist/server.js:4:12) {
2024-05-09T09:43:08.904Z [serve]   errno: -13,
2024-05-09T09:43:08.904Z [serve]   syscall: 'open',
2024-05-09T09:43:08.904Z [serve]   code: 'EACCES',
2024-05-09T09:43:08.904Z [serve]   path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T09:43:08.904Z [serve] }
2024-05-09T09:43:08.946Z [pebble] Service "serve" stopped unexpectedly with code 0
2024-05-09T09:43:08.946Z [pebble] Service "serve" on-success action is "restart", waiting ~500ms before restart (backoff 1)

We can see that the npm server is not able to start, and the pebble service is constantly restarting, with permission-denied errors when accessing the files:

  • /var/run/secrets/kubernetes.io/serviceaccount/token
  • /root/.npm/_logs

The kubeflow dashboard is down in this case, but the charm status still shows up as active. This is problematic because the charm's reported status is not representative of the workload's actual state.

To Reproduce

  1. Deploy the following charms from latest/edge or corresponding 1.8/edge channels:
dex-auth 
istio-ingressgateway
istio-pilot
kubeflow-dashboard
kubeflow-profiles
oidc-gatekeeper
  2. Try to access the dashboard by visiting the public-url

Environment

juju 3.5/stable
microk8s 1.25-strict/stable

Relevant Log Output

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 111

Additional Context

No response

NohaIhab added the bug label on May 9, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5651.

This message was autogenerated


NohaIhab commented May 9, 2024

To gather more information, I've tested:

  • the 1.8/edge dashboard charm on juju 3.1, and it was functioning correctly.
  • the 1.8/stable dashboard charm on juju 3.5, this also functioned correctly.

This indicates that the 1.8/edge charm differs from 1.8/stable in a way that breaks specifically with Juju 3.5.

It's worth noting that we are using the rock in 1.8/edge, but not in 1.8/stable.

The Juju 3.5 release notes include the change juju/juju#17070; I'm curious whether it is related to the errors we're seeing with the rock that we are using.


NohaIhab commented May 9, 2024

Comparing the pod definitions when deploying with Juju 3.4 and 3.5, this securityContext is additional in 3.5:

  securityContext:
    fsGroup: 170
    supplementalGroups:
    - 170


ca-scribner commented May 9, 2024

I can reproduce with:

  • juju 3.5/stable
  • microk8s 1.25-strict/stable
  • deployed with:
juju deploy kubeflow-dashboard --channel 1.8/edge --trust
juju deploy kubeflow-profiles --channel 1.8/stable --trust
juju relate kubeflow-dashboard kubeflow-profiles

But in my case, kubeflow-dashboard's charm goes to Error because ops's pebble client raises an exception during charm execution:

2024-05-09T15:18:02.818Z [container-agent] 2024-05-09 15:18:02 INFO juju-log Pebble plan updated with new configuration, replaning
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Error in sys.excepthook:
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self.emit(record)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/log.py", line 41, in emit
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self.model_backend.juju_log(record.levelname, self.format(record))
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     return fmt.format(record)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     record.exc_text = self.formatException(record.exc_info)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     traceback.print_exception(ei[0], ei[1], tb, None, sio)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     for line in TracebackException(
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/traceback.py", line 617, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     yield from self.format_exception_only()
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     stype = smod + '.' + stype
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
2024-05-09T15:18:03.849Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 
2024-05-09T15:18:03.849Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Original exception was:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 212, in _update_layer
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self.container.replan()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/model.py", line 1915, in replan
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self._pebble.replan_services()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/pebble.py", line 1680, in replan_services
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     return self._services_action('replan', [], timeout, delay)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/pebble.py", line 1761, in _services_action
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     raise ChangeError(change.err, change)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed ops.pebble.ChangeError: cannot perform the following tasks:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed - Start service "serve" (cannot start service: exited quickly with code 0)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed ----- Logs from task 0 -----
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 2024-05-09T15:18:03Z INFO Most recent service output:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     (...)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     > [email protected] serve
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     > node dist/server.js
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     Initializing Kubernetes configuration
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at Object.openSync (fs.js:462:3)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at Object.readFileSync (fs.js:364:35)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at /app/dist/server.js:46:28
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at Generator.next (<anonymous>)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at /app/dist/server.js:8:71
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at new Promise (<anonymous>)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at __awaiter (/app/dist/server.js:4:12) {
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       errno: -13,
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       syscall: 'open',
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       code: 'EACCES',
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     }
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 2024-05-09T15:18:03Z ERROR cannot start service: exited quickly with code 0
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed -----
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed The above exception was the direct cause of the following exception:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 304, in <module>
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     main(KubeflowDashboardOperator)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/main.py", line 441, in main
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     _emit_charm_event(charm, dispatcher.event_name)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     event_to_emit.emit(*args, **kwargs)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 344, in emit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     framework._emit(event)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 841, in _emit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self._reemit(event_path)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 930, in _reemit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     custom_handler(event)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 284, in main
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self._update_layer()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 214, in _update_layer
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     raise GenericCharmRuntimeError("Failed to replan") from e
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed <unknown>GenericCharmRuntimeError: Failed to replan

@ca-scribner (Contributor)

I can also reproduce this behaviour directly on Kubernetes with the rock charmedkubeflow/kubeflow-central-dashboard (source rockcraft.yaml) by:

rockcraft pack
skopeo --insecure-policy copy oci-archive:kubeflow-central-dashboard_1.8_amd64.rock docker-daemon:centraldashboard:rock
docker save centraldashboard:rock > tempimage
microk8s ctr image import tempimage
rm tempimage
kubectl apply -f broken-dashboard-rock.yaml

where broken-dashboard-rock.yaml is:

apiVersion: v1
kind: Pod
metadata:
  name: kubeflow-dashboard-with-security-context
spec:
  automountServiceAccountToken: true
  securityContext:
    fsGroup: 170
    supplementalGroups:
    - 170
  containers:
  - name: kubeflow-dashboard
    image: centraldashboard:rock
    securityContext:
      runAsGroup: 0
      runAsUser: 0
    volumeMounts:
    - name: sec-ctx-vol
      mountPath: /data/demo
    - mountPath: /charm/bin/pebble
      name: charm-data
      readOnly: true
      subPath: charm/bin/pebble
    - mountPath: /charm/container
      name: charm-data
      subPath: charm/containers/kubeflow-dashboard
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b4hms
      readOnly: true
  volumes:
  - name: sec-ctx-vol
    emptyDir: {}
  - emptyDir: {}
    name: charm-data
  - name: kube-api-access-b4hms
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

(note that I'm reusing the ServiceAccount that juju created above for kubeflow-dashboard)

This raises the same errors as @NohaIhab mentions above. If we remove the securityContexts, the app works normally.

Looking at the serviceaccount token that npm reports we cannot access, we see that it is in group 170:

kubectl exec kubeflow-dashboard-with-security-context -- ls -l /var/run/secrets/kubernetes.io/serviceaccount/token
lrwxrwxrwx 1 root 170 12 May  9 17:51 /var/run/secrets/kubernetes.io/serviceaccount/token -> ..data/token

which makes sense because setting Pod.spec.securityContext.fsGroup: 170 makes all volume mounts owned by group 170.
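The resulting access behaviour can be sketched with a simplified POSIX read-permission check (my own illustration, not code from the charm or pebble; it assumes the resolved token file ends up mode 0640 owned by root:170, and ignores ACLs and capabilities):

```python
import stat

def can_read(mode, file_uid, file_gid, uid, gid, supplementary):
    """Simplified POSIX read check: owner bits, then group bits, then other."""
    if uid == file_uid:
        return bool(mode & stat.S_IRUSR)
    if gid == file_gid or file_gid in supplementary:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

MODE = 0o640          # token file: readable by owner root and group 170 only
DAEMON_UID = 584792   # the rock's _daemon_ user that pebble runs services as

# Without group 170 among its supplementary groups, the service hits EACCES:
print(can_read(MODE, 0, 170, DAEMON_UID, DAEMON_UID, []))
# With group 170 granted (as Kubernetes does for the entrypoint), it can read:
print(can_read(MODE, 0, 170, DAEMON_UID, DAEMON_UID, [170]))
```

This matches the observed symptom: only processes carrying group 170 can read the fsGroup-owned token.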


ca-scribner commented May 9, 2024

Interestingly, if we do the same example as above but use upstream's image (kubeflownotebookswg/centraldashboard:v1.8.0), the dashboard starts successfully. Summary of some attempts:

  • kubeflownotebookswg/centraldashboard:v1.8.0 with container securityContext: {runAsGroup: 0; runAsUser: 0}:
    • dashboard starts successfully (can access serviceaccount token)
    • kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- whoami shows root
    • kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- groups shows root 170 ... (some other groups)
  • kubeflownotebookswg/centraldashboard:v1.8.0 without container securityContext:
    • dashboard starts successfully (can access serviceaccount token)
    • kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- whoami shows node (which is expected because the upstream image sets USER node)
    • kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- groups shows node 170

So for the above cases, whether the upstream image is executed as root or node, both appear to have group 170 added to their existing groups. Strangely, typing groups node or groups root in those sessions does not show 170 in the list, which makes me think Kubernetes adds that group to the running process on the fly rather than to the user definition.

Interestingly, if I kubectl exec (which gives me the root user) or runc exec -u 584792 (which gives me the rockcraft _daemon_ user) and then type groups, I see that 170 is in my current groups and I can view the serviceaccount token. So in these ad hoc sessions, the users are granted access to group 170 just like the upstream docker images above.

A theory of what is going on ...

My guess is that we're hitting trouble because rocks run their services via pebble: the pebble entrypoint is likely getting the added group 170, like we see with the docker images, but pebble does not pass this group down to the processes it executes. So when pebble runs our service as the _daemon_ user, that process does not have group 170 and thus cannot access the serviceaccount token.

Next steps

  • try doing security context with uid/gid 584792 (the same id that pebble executes our services with). Does that allow our dashboard service to access the serviceaccount token? update: Yes! See next comment
  • make a rock where pebble executes a script that reports some debug info, like groups and whoami just to see what it is running. Could also do some cat /var/run/secretss... etc in that script to see what it says

@ca-scribner (Contributor)

Tried deploying the rock pod with the following changes (full yaml given below):

  • set fsGroup=584792 (to match the _daemon_ user that pebble uses to execute services in the rock)
  • remove the container securityContext (it didn't matter whether it was there or not - I tried both and got the same result)
pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kubeflow-dashboard-with-security-context-as-584792
spec:
  automountServiceAccountToken: true
  securityContext:
    fsGroup: 584792
    supplementalGroups:
    - 584792
  serviceAccount: kubeflow-dashboard
  serviceAccountName: kubeflow-dashboard
  containers:
  - name: kubeflow-dashboard
    image: kubeflow-central-dashboard:1.8
    securityContext:
      runAsGroup: 584792
      runAsUser: 584792
    volumeMounts:
    - name: sec-ctx-vol
      mountPath: /data/demo
    - mountPath: /charm/bin/pebble
      name: charm-data
      readOnly: true
      subPath: charm/bin/pebble
    - mountPath: /charm/container
      name: charm-data
      subPath: charm/containers/kubeflow-dashboard
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b4hms
      readOnly: true
  volumes:
  - name: sec-ctx-vol
    emptyDir: {}
  - emptyDir: {}
    name: charm-data
  - name: kube-api-access-b4hms
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

So I think this reinforces the theory above that Kubernetes natively grants the fsGroup to the entrypoint's process, but pebble does not do the same for its children.
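One detail consistent with this theory: supplementary groups are inherited across a plain fork/exec, so a launcher only loses them if it deliberately resets credentials (e.g. via setgroups/initgroups for the target user) before starting its child. A small sketch of the default inheritance (my own illustration, not pebble code):

```python
import os
import subprocess
import sys

# A child started with plain fork/exec inherits the parent's supplementary
# groups unchanged. They only disappear if the launcher explicitly resets
# credentials (setgroups()/initgroups()) for the target user before exec,
# which would drop a group like 170 that Kubernetes granted only to the
# container's first process.
parent_groups = set(os.getgroups())
child_stdout = subprocess.run(
    [sys.executable, "-c", "import os; print(*os.getgroups())"],
    capture_output=True, text=True, check=True,
).stdout.split()
child_groups = set(map(int, child_stdout))
print(parent_groups == child_groups)
```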


DnPlas commented May 22, 2024

Based on @NohaIhab's suggestion to use the upstream image instead of the rock, I have pushed two temporary patches (#191 and #192) to help alleviate the issue and unblock other workflows.

Also, based on some comments from the Juju team, it seems the issue is rather on their side, so there is not much more we can do in the charm and rock.


DnPlas commented May 23, 2024

FYI, this issue has been reported in juju, let's keep an eye on https://bugs.launchpad.net/juju/+bug/2066517.

DnPlas added a commit that referenced this issue May 23, 2024
…image (#178)" (#191)

This reverts commit 2811dcb.

This is a temporary workaround to help avoid #188. This change must be reverted once the fix for it is available. Please note the fix depends on https://bugs.launchpad.net/juju/+bug/2066517.
DnPlas added a commit that referenced this issue May 23, 2024
…image (#178)" (#192)

This reverts commit 2ad140e.

This is a temporary workaround to help avoid #188. This change must be reverted once the fix for it is available. Please note the fix depends on https://bugs.launchpad.net/juju/+bug/2066517.

NohaIhab commented Jun 4, 2024

When deploying the dashboard charm from latest/edge now (this deploys the rock due to #194), we are seeing the following.
In Juju 3.5.0, the charm is active but it is not functioning correctly, and the workload logs show:

2024-06-04T15:09:52.430Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:09:52.912Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent 
2024-06-04T15:09:52.917Z [kubeflow-dashboard] 
2024-06-04T15:09:52.918Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:09:52.918Z [kubeflow-dashboard] npm ERR!     /root/.npm/_logs/2024-06-04T15_09_52_732Z-debug-0.log
2024-06-04T15:09:52.928Z [pebble] Service "kubeflow-dashboard" stopped unexpectedly with code 254
2024-06-04T15:09:52.928Z [pebble] Service "kubeflow-dashboard" on-failure action is "restart", waiting ~30s before restart (backoff 7)
2024-06-04T15:10:01.913Z [pebble] Service "serve" starting: /usr/bin/npm start
2024-06-04T15:10:02.645Z [serve] 
2024-06-04T15:10:02.646Z [serve] > [email protected] start
2024-06-04T15:10:02.646Z [serve] > npm run serve
2024-06-04T15:10:02.646Z [serve] 
2024-06-04T15:10:03.393Z [serve] glob error [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-06-04T15:10:03.393Z [serve]   errno: -13,
2024-06-04T15:10:03.393Z [serve]   code: 'EACCES',
2024-06-04T15:10:03.393Z [serve]   syscall: 'scandir',
2024-06-04T15:10:03.393Z [serve]   path: '/root/.npm/_logs'
2024-06-04T15:10:03.393Z [serve] }
2024-06-04T15:10:03.396Z [serve] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-06-04T15:10:03.397Z [serve] npm WARN logfile  error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   errno: -13,
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   code: 'EACCES',
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   syscall: 'scandir',
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   path: '/root/.npm/_logs'
2024-06-04T15:10:03.397Z [serve] npm WARN logfile }
2024-06-04T15:10:03.402Z [serve] 
2024-06-04T15:10:03.402Z [serve] > [email protected] serve
2024-06-04T15:10:03.402Z [serve] > node dist/server.js
2024-06-04T15:10:03.402Z [serve] 
2024-06-04T15:10:04.426Z [serve] Initializing Kubernetes configuration
2024-06-04T15:10:04.432Z [serve] [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-06-04T15:10:04.432Z [serve]     at Object.openSync (fs.js:462:3)
2024-06-04T15:10:04.432Z [serve]     at Object.readFileSync (fs.js:364:35)
2024-06-04T15:10:04.432Z [serve]     at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-06-04T15:10:04.432Z [serve]     at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-06-04T15:10:04.432Z [serve]     at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-06-04T15:10:04.432Z [serve]     at /app/dist/server.js:46:28
2024-06-04T15:10:04.432Z [serve]     at Generator.next (<anonymous>)
2024-06-04T15:10:04.432Z [serve]     at /app/dist/server.js:8:71
2024-06-04T15:10:04.432Z [serve]     at new Promise (<anonymous>)
2024-06-04T15:10:04.432Z [serve]     at __awaiter (/app/dist/server.js:4:12) {
2024-06-04T15:10:04.432Z [serve]   errno: -13,
2024-06-04T15:10:04.432Z [serve]   syscall: 'open',
2024-06-04T15:10:04.432Z [serve]   code: 'EACCES',
2024-06-04T15:10:04.432Z [serve]   path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-06-04T15:10:04.432Z [serve] }
2024-06-04T15:10:04.486Z [pebble] Service "serve" stopped unexpectedly with code 0
2024-06-04T15:10:04.486Z [pebble] Service "serve" on-success action is "restart", waiting ~30s before restart (backoff 7)
2024-06-04T15:10:16.859Z [pebble] GET /v1/notices?after=2024-06-04T15%3A09%3A16.846321248Z&timeout=30s 30.000718606s 200

In Juju 3.5.1, the charm went to error, and we can see in the workload logs:

kubectl logs -nkubeflow kubeflow-dashboard-0 -c kubeflow-dashboard
2024-06-04T15:18:20.917Z [pebble] HTTP API server listening on ":38813".
2024-06-04T15:18:20.917Z [pebble] Started daemon.
2024-06-04T15:18:51.009Z [pebble] GET /v1/notices?timeout=30s 30.000488277s 200
2024-06-04T15:19:12.203Z [pebble] GET /v1/plan?format=yaml 644.109µs 200
2024-06-04T15:19:12.228Z [pebble] POST /v1/layers 785.985µs 200
2024-06-04T15:19:12.257Z [pebble] POST /v1/services 11.122061ms 202
2024-06-04T15:19:12.263Z [pebble] GET /v1/notices?timeout=30s 21.254288538s 200
2024-06-04T15:19:12.263Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:19:12.806Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:19:12.808Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:19:12.808Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:19:12.809Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:19:12.813Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:19:12.814Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:19:12.814Z [kubeflow-dashboard] npm ERR! enoent 
2024-06-04T15:19:12.820Z [kubeflow-dashboard] 
2024-06-04T15:19:12.821Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:19:12.821Z [kubeflow-dashboard] npm ERR!     /var/lib/pebble/default/.npm/_logs/2024-06-04T15_19_12_546Z-debug-0.log
2024-06-04T15:19:12.837Z [pebble] Change 1 task (Start service "kubeflow-dashboard") failed: cannot start service: exited quickly with code 254
2024-06-04T15:19:12.850Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A12.257312723Z&timeout=30s 585.796055ms 200
2024-06-04T15:19:12.850Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 592.081916ms 200
2024-06-04T15:19:20.371Z [pebble] GET /v1/plan?format=yaml 488.842µs 200
2024-06-04T15:19:20.402Z [pebble] POST /v1/layers 264.427µs 200
2024-06-04T15:19:20.419Z [pebble] POST /v1/services 7.468515ms 202
2024-06-04T15:19:20.426Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A12.837810917Z&timeout=30s 7.57522781s 200
2024-06-04T15:19:20.426Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:19:20.999Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:19:21.000Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:19:21.000Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:19:21.001Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent 
2024-06-04T15:19:21.004Z [kubeflow-dashboard] 
2024-06-04T15:19:21.005Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:19:21.005Z [kubeflow-dashboard] npm ERR!     /var/lib/pebble/default/.npm/_logs/2024-06-04T15_19_20_762Z-debug-0.log
2024-06-04T15:19:21.019Z [pebble] Change 2 task (Start service "kubeflow-dashboard") failed: cannot start service: exited quickly with code 254
2024-06-04T15:19:21.035Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 614.686733ms 200
2024-06-04T15:19:21.035Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A20.419724889Z&timeout=30s 608.795658ms 200
2024-06-04T15:19:33.805Z [pebble] GET /v1/plan?format=yaml 1.599755ms 200
2024-06-04T15:19:33.831Z [pebble] POST /v1/layers 261.411µs 200
2024-06-04T15:19:33.849Z [pebble] POST /v1/services 6.548049ms 202
2024-06-04T15:19:33.859Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A21.019663273Z&timeout=30s 12.823734778s 200

It looks like in Juju 3.5.1 we no longer see the service account token "Permission denied" error that we were seeing in Juju 3.5.0 (see the Juju bug that was fixed).
BUT we are seeing a different error: no such file or directory, open '//package.json'. We need to investigate whether this is a separate issue and, if so, file it separately.
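The `//package.json` path hints that `npm start` is being launched from `/` with no working directory set, so npm resolves `package.json` relative to the filesystem root. A minimal, hypothetical sketch of a Pebble service layer that pins the working directory — the `/app` path is an assumption based on the stack traces above, not confirmed:

```yaml
# Hypothetical sketch: the service fails with ENOENT on '//package.json'
# when started from '/'. Setting working-dir to the app directory
# (assumed here to be /app) would let 'npm start' find package.json.
services:
  kubeflow-dashboard:
    override: replace
    summary: Kubeflow central dashboard
    command: npm start
    working-dir: /app
```

Whether the fix belongs in the charm's Pebble layer or in the rock's entrypoint is part of what needs investigating.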

On Juju 3.5.1, when refreshing the charm to use the upstream image instead, like:

juju refresh kubeflow-dashboard --resource oci-image=docker.io/kubeflownotebookswg/centraldashboard:v1.8.0

we no longer see any errors in the workload logs.
So we can safely proceed with fixing #194 by making sure the charm uses the resource specified in metadata.yaml, i.e. the upstream image.
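For reference, a hedged sketch of what the `oci-image` resource stanza in metadata.yaml could look like, pointing `upstream-source` at the same image used in the `juju refresh` above (the description text is assumed):

```yaml
# Sketch of the resource definition in metadata.yaml; upstream-source
# matches the image passed to 'juju refresh' above.
resources:
  oci-image:
    type: oci-image
    description: OCI image for the kubeflow-dashboard workload
    upstream-source: docker.io/kubeflownotebookswg/centraldashboard:v1.8.0
```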

@DnPlas (Contributor) commented Jun 19, 2024

I have tested the charm (1.8/stable) with the rock (charmedkubeflow/kubeflow-central-dashboard:1.8-5a3d6ea) and I'm having the same issue as @NohaIhab reported in #188 (comment):

2024-06-19T13:46:11.056Z [pebble] GET /v1/changes/4/wait?timeout=4.000s 426.897172ms 200
2024-06-19T13:46:17.949Z [pebble] GET /v1/notices?timeout=30s 30.001642379s 200
2024-06-19T13:46:47.952Z [pebble] GET /v1/notices?timeout=30s 30.001880459s 200
2024-06-19T13:46:54.960Z [pebble] GET /v1/plan?format=yaml 180.599µs 200
2024-06-19T13:46:54.981Z [pebble] POST /v1/layers 191.613µs 200
2024-06-19T13:46:54.997Z [pebble] POST /v1/services 10.006459ms 202
2024-06-19T13:46:55.006Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-19T13:46:55.370Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent
2024-06-19T13:46:55.374Z [kubeflow-dashboard]
2024-06-19T13:46:55.374Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-19T13:46:55.374Z [kubeflow-dashboard] npm ERR!     /root/.npm/_logs/2024-06-19T13_46_55_207Z-debug-0.log
2024-06-19T13:46:55.408Z [pebble] GET /v1/changes/5/wait?timeout=4.000s 410.115382ms 200
2024-06-19T13:47:17.953Z [pebble] GET /v1/notices?timeout=30s 30.000991436s 200

This could potentially be an issue with the rock rather than the charm, since the missing package.json points at how npm is invoked inside the image.

EDIT: I have filed canonical/kubeflow-rocks#104; it seems like this is a rocks issue rather than a Juju issue.
