
csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" #328

Open
max3903 opened this issue Jun 16, 2020 · 23 comments

Comments


max3903 commented Jun 16, 2020

What did you do? (required. The issue will be closed when not provided.)

I followed the documentation to add the do-block-storage plugin:

I added the secret successfully and ran:

kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

It fails on some snapshot-specific resources:

CustomResourceDefinition.apiextensions.k8s.io "volumesnapshots.snapshot.storage.k8s.io" is invalid: spec.version: Invalid value: "v1alpha1": must match the first version in spec.versions

I moved on (I believe it is fixed by #322) and tried to create a PVC.

What did you expect to happen?

I was expecting the PV to be created.

Configuration (MUST fill this out):

  • system logs:

https://gist.github.com/max3903/acb18527be1138a33d77f3eaaddb89a8

  • manifests, such as pvc, deployments, etc. you used to reproduce (see the reproduction sketch after this list):

secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: digitalocean
  namespace: kube-system
stringData:
  access-token: "3e8[...]ec5"

pvc.yaml:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: do-block-storage

  • CSI Version:

1.3.0

  • Kubernetes Version:

1.17

  • Cloud provider/framework version, if applicable (such as Rancher):

OKD 4.5
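
For reference, a minimal reproduction sketch, assuming the manifests above are saved locally as secret.yaml and pvc.yaml (the file names are just placeholders):

kubectl apply -f secret.yaml
kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml
kubectl apply -f pvc.yaml
# While the controller is crash-looping, the PVC should stay Pending:
kubectl get pvc jenkins-data
kubectl -n kube-system get pod csi-do-controller-0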


max3903 commented Jun 16, 2020

Other information:

I am using OKD 4.5 on Fedora CoreOS 31.

The pod csi-do-controller-0 remains in status CrashLoopBackOff.

4 out of 5 containers are in state Running but have this error message in the log:

connection.go:170] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and its logs say:

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

On the worker, the csi.sock is not in:

/var/lib/csi/sockets/pluginproxy/csi.sock

but in

/var/lib/kubelet/plugins/dobs.csi.digitalocean.com/csi.sock

@timoreimann

Hi @max3903

the error

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

is odd because it usually means that you are not running on DigitalOcean infrastructure (as the error indicates). However, I do see a DO region label on one of your Nodes. Can you confirm that you are indeed running on droplets? Can you connect to the metadata endpoint from your nodes?

What might also be good to know: did you try to apply the manifests on a cluster that had a previous version of the CSI driver installed already, or was this a first-time CSI installation attempt?
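
For reference, a quick way to check metadata connectivity would be something like this (a sketch; run it directly on a droplet, not from inside a pod):

curl -sS http://169.254.169.254/metadata/v1.json | head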


max3903 commented Jun 17, 2020

Hello @timoreimann

Yes, I am running on droplets built from a custom image: Fedora CoreOS 31 for DigitalOcean from https://getfedora.org/en/coreos/download?tab=cloud_operators&stream=stable

Yes, I can connect to the metadata endpoint from the 3 masters and 2 workers. That is actually how each droplet gets its hostname during installation:
See coreos/fedora-coreos-tracker#538

Yes, I tried to apply the manifests multiple times using different versions/URLs.
I tried 0.3.0 first, then latest, and finally 1.3.0.


max3903 commented Jun 17, 2020

So I ran:

oc delete -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v0.3.0.yaml

oc apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

I don't know if it helps, but the container created from the DaemonSet is working fine on the same node.

Only the one created from the StatefulSet is crashing...


timoreimann commented Jun 17, 2020

@max3903 The CSI driver in version 0.3.0 definitely does not support Kubernetes 1.17 (see also our support matrix). If you installed that first, the subsequent 1.3.0 installation most likely failed because of unsupported (and broken) leftovers from 0.3.0.

Can you try to install v1.3.0 from a clean slate, i.e., on a 1.17 cluster that does not come with any other (older) CSI driver versions installed beforehand?


max3903 commented Jun 17, 2020

Even after running:

oc delete -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v0.3.0.yaml

?


max3903 commented Jun 17, 2020

@timoreimann Installing the cluster was a pretty painful process that I would like to avoid repeating.

I removed all the csi* images from all the masters and workers:

podman image rm docker.io/digitalocean/do-csi-plugin:v1.3.0
podman image rm docker.io/digitalocean/do-csi-plugin:dev
podman image rm quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
podman image rm quay.io/k8scsi/csi-resizer:v0.3.0
podman image rm quay.io/k8scsi/csi-snapshotter:v1.2.2
podman image rm quay.io/k8scsi/csi-provisioner:v1.4.0
podman image rm quay.io/k8scsi/csi-attacher:v2.0.0

and installed the correct version (1.3.0). I still get the same error.

Which leftovers am I missing?

@timoreimann

Check for any snapshot-related CRDs that might be remaining (kubectl get crd) and delete them.
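
For example, something along these lines should do it (a sketch; the exact CRD names depend on what the earlier installation left behind):

kubectl get crd | grep snapshot.storage.k8s.io
kubectl delete crd volumesnapshots.snapshot.storage.k8s.io \
  volumesnapshotcontents.snapshot.storage.k8s.io \
  volumesnapshotclasses.snapshot.storage.k8s.io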


max3903 commented Jun 18, 2020

@timoreimann

I deleted them.

No errors when running:

kubectl apply -f https://raw.githubusercontent.com/digitalocean/csi-digitalocean/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml

Still the same behavior on the controller, i.e. the pod csi-do-controller-0 remains in status CrashLoopBackOff.

4 out of 5 containers are in state Running but have this error message in the log:

connection.go:170] Still connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock

The last one, csi-do-plugin (digitalocean/do-csi-plugin:v1.3.0), remains in state Waiting and its logs say:

couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json": dial tcp 169.254.169.254:80: connect: connection refused (are you running on DigitalOcean droplets?)

If I replace the args at https://github.com/digitalocean/csi-digitalocean/blob/master/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L194 with:

          args:
            - "--version"

I get this message in the logs of the container:

latest - 59e354368961c4688243fc083c94b963c276e5b4 (clean)

I tried to run the container on the worker:

$ podman run digitalocean/do-csi-plugin:v1.3.0 \
    --endpoint=unix:///var/lib/csi/sockets/pluginproxy/csi.sock \
    --url=https://api.digitalocean.com/ \
    --token=3e8****ec5
time="2020-06-18T00:05:44Z" level=info msg="removing socket" host_id=196466821 region=sfo3 socket=/var/lib/csi/sockets/pluginproxy/csi.sock version=latest
2020/06/18 00:05:44 failed to listen: listen unix /var/lib/csi/sockets/pluginproxy/csi.sock: bind: no such file or directory

I also tried to use curl to create a volume through the API from the same node and it worked:

curl -X POST -H "Content-Type: application/json" \
    -H "Authorization: Bearer 3e8***ec5" \
    -d '{"size_gigabytes":10, "name": "example", "description": "Block store for examples", "region": "sfo3", "filesystem_type": "ext4", "filesystem_label": "example"}' \
    "https://api.digitalocean.com/v2/volumes"

The container from the same image on the same node from the DaemonSet is still working fine:

time="2020-06-17T23:00:21Z" level=info msg="removing socket" host_id=196466821 region=sfo3 socket=/csi/csi.sock version=latest
time="2020-06-17T23:00:21Z" level=info msg="starting server" grpc_addr=/csi/csi.sock host_id=196466821 http_addr= region=sfo3 version=latest
time="2020-06-17T23:00:22Z" level=info msg="get plugin info called" host_id=196466821 method=get_plugin_info region=sfo3 response="name:\"dobs.csi.digitalocean.com\" vendor_version:\"latest\" " version=latest
time="2020-06-17T23:00:23Z" level=info msg="node get info called" host_id=196466821 method=node_get_info region=sfo3 version=latest

FYI, all droplets are Fedora CoreOS 31 in SFO3 with this workaround to set the hostname:
coreos/fedora-coreos-tracker#538


lucab commented Jun 18, 2020

The couldn't get metadata error is likely a red herring caused by the manual podman run, which does not set up the network namespace the same way the k8s manifest does.


max3903 commented Jun 18, 2020

@timoreimann With help from @lucab and @dustymabe, I got it working by adding:

      hostNetwork: true
      securityContext:
        privileged: true

in https://github.com/digitalocean/csi-digitalocean/blob/release-1.3/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L142

@timoreimann

@max3903 glad you figured it out. 🎉
Do I understand correctly that you needed to add the hostNetwork / privileged fields to the Controller service? (We do have them set on the Node service in the manifest.)

FWIW, the manifest you referenced (and had to amend) is what we use for our end-to-end tests as-is: we deploy it into a DOKS cluster and run upstream e2e tests against it. I'm confused why it didn't work for you, and wondering if there's perhaps something specific about OKD (or DOKS) that explains the difference in behavior?


max3903 commented Jun 18, 2020

@timoreimann Yes, on the controller.

@dustymabe mentioned that OpenShift has stricter security settings than base Kubernetes.

@dustymabe

@dustymabe mentioned that OpenShift has stricter security settings than base Kubernetes.

Typically that is the case. Unfortunately I don't have enough expertise to know what those extra security defaults are or if that's the cause of the issues here. I just know enough to bring up that it could be the cause.

@dustymabe

@timoreimann With help from @lucab and @dustymabe, I got it working by adding:

      hostNetwork: true
      securityContext:
        privileged: true

in https://github.com/digitalocean/csi-digitalocean/blob/release-1.3/deploy/kubernetes/releases/csi-digitalocean-v1.3.0.yaml#L142

This seems to be working for me with just the hostNetwork: true change. I don't think privileged: true is needed.

@timoreimann

Right, privileged mode should only be needed on the Node service, to allow mount propagation. I don't think we have it set in our Controller service manifest.


timoreimann commented Jul 4, 2020

If you'd like to submit a quick PR to document the need to run on host network in OKD (and perhaps leave a commented out hostNetwork: true field in the manifest), I'd be happy to review that.

@dustymabe

Thanks @timoreimann. Do you think it would make sense to do it by default instead of having it commented out?

@timoreimann

@dustymabe the only platform I'm aware of at this point that requires host networking to be enabled on the Controller service seems to be OKD, so I'm more inclined to keep it commented out for now.
If someone could find out more specifically why it's needed in OKD, though, we could better judge whether other platforms / systems may be affected as well.

@dustymabe

I changed the csi-do-plugin container within the pod to just sleep so I could exec in there and poke around.

/ # ip -4 -o a
1: lo    inet 127.0.0.1/8 scope host lo\       valid_lft forever preferred_lft forever
3: eth0    inet 10.129.0.61/23 brd 10.129.1.255 scope global eth0\       valid_lft forever preferred_lft forever
/ # busybox wget http://169.254.169.254/metadata/v1.json
Connecting to 169.254.169.254 (169.254.169.254:80)
wget: can't connect to remote host (169.254.169.254): Connection refused
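
For anyone wanting to repeat this kind of debugging, one possible way to put the plugin container to sleep and get a shell (a sketch; the container index in the JSON patch and the sleep duration are assumptions, adjust them to the actual manifest):

# Override the csi-do-plugin container command so the pod stops crash-looping:
oc -n kube-system patch statefulset/csi-do-controller --type json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/4/command", "value": ["sleep", "3600"]}]'
# Then exec into the container and poke around:
oc -n kube-system exec -it csi-do-controller-0 -c csi-do-plugin -- sh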

It might be worth noting that OKD uses OVN networking: https://docs.openshift.com/container-platform/4.5/networking/ovn_kubernetes_network_provider/about-ovn-kubernetes.html. Unfortunately I don't know much about the networking side so I'm a bit limited in understanding this.

As a temporary workaround, this patch command should work for users:

PATCH='
spec:
  template:
    spec:
      hostNetwork: true'
oc patch statefulset/csi-do-controller -n kube-system --type merge -p "$PATCH"
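
After the patch the controller pod should get recreated; a quick sanity check could look like this (a sketch):

oc -n kube-system get pod csi-do-controller-0 -o jsonpath='{.spec.hostNetwork}'
oc -n kube-system logs csi-do-controller-0 -c csi-do-plugin --tail=20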

Can we change the title of this to csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" so others can find it more easily?

max3903 changed the title from csi-do-controller-0 CrashLoopBackOff to csi-do-controller-0 CrashLoopBackOff: couldn't get metadata: Get "http://169.254.169.254/metadata/v1.json" on Jul 30, 2020

max3903 commented Jul 30, 2020

@dustymabe Done!


grumps commented Sep 5, 2020

👋 So I've run into this issue as well using K3s on DO. I was finally able to get things running with hostNetwork: true. I'm using the default network driver of flannel, but it does use containerd as the runtime.

@dustymabe

I can confirm that the workaround in #328 (comment) still works for me today.
