Pods using EBS-backed PVC sometimes get stuck. #38301

Closed
exarkun opened this issue Dec 7, 2016 · 9 comments

exarkun commented Dec 7, 2016

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):

No.

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):

unmount
umount
Error checking if mountpoint

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

BUG REPORT.

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:22:15Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:16:27Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 8 (jessie)
  • Kernel (e.g. uname -a): Linux ip-172-20-84-61 4.4.26-k8s #1 SMP Fri Oct 21 05:21:13 UTC 2016 x86_64 GNU/Linux
  • Install tools: kops Version git-e1a9aad
  • Others:

What happened:

I created a storageclass and a new PVC referencing it:

apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1beta1
  kind: StorageClass
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: '{"kind":"StorageClass","apiVersion":"storage.k8s.io/v1beta1","metadata":{"name":"normal","creationTimestamp":null},"provisioner":"kubernetes.io/aws-ebs","parameters":{"type":"gp2"}}'
    creationTimestamp: 2016-12-07T15:31:12Z
    name: normal
    resourceVersion: "2739806"
    selfLink: /apis/storage.k8s.io/v1beta1/storageclasses/normal
    uid: 2f28bfcc-bc92-11e6-b3c8-12e507f54388
  parameters:
    type: gp2
  provisioner: kubernetes.io/aws-ebs
kind: List
metadata: {}

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: '{"kind":"PersistentVolumeClaim","apiVersion":"v1","metadata":{"name":"infrastructure-foolscap-logs-pvc","creationTimestamp":null,"labels":{"app":"s4","component":"Infrastructure","provider":"LeastAuthority"},"annotations":{"volume.beta.kubernetes.io/storage-class":"normal"}},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"10G"}}},"status":{}}'
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-class: normal
  creationTimestamp: 2016-12-07T15:10:30Z
  labels:
    app: s4
    component: Infrastructure
    provider: LeastAuthority
  name: infrastructure-foolscap-logs-pvc
  namespace: staging
  resourceVersion: "2739819"
  selfLink: /api/v1/namespaces/staging/persistentvolumeclaims/infrastructure-foolscap-logs-pvc
  uid: 4b3c2fb4-bc8f-11e6-b3c8-12e507f54388
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10G
  volumeName: pvc-4b3c2fb4-bc8f-11e6-b3c8-12e507f54388
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  phase: Bound

And I updated my deployment to include a volume using this PVC and updated the deployment's template spec so that one of the containers would mount this volume. Then I deployed this with kubectl apply -f .... I made some tweaks and repeated this operation a few times. Behavior was as expected (EBS-backed PV created, pod started, container had PV mounted in it, data persisted across deployment updates).
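
For reference, the volume was wired into the Deployment roughly like this (a sketch: the Deployment name, container name, image, and mount path are hypothetical; only the volume name and claim name are the ones from this issue):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: s4-infrastructure        # hypothetical; matches the pod name prefix seen below
  namespace: staging
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: s4
    spec:
      containers:
      - name: foolscap-log-gatherer                 # hypothetical container name
        image: example/foolscap-log-gatherer:latest # hypothetical image
        volumeMounts:
        - name: log-gatherer-data
          mountPath: /foolscap-logs                 # hypothetical mount path
      volumes:
      - name: log-gatherer-data
        persistentVolumeClaim:
          claimName: infrastructure-foolscap-logs-pvc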

On the last deployment update (in which I changed the image used by some of the containers), the new pod failed to come up. The web UI reported

Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "s4-infrastructure-3171603516-8zj8k"/"staging". list of unattached/unmounted volumes=[log-gatherer-data]
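
The same timeout also shows up as events on the pod; a sketch of how to see it from the CLI, using the pod name and namespace from the message above:

kubectl describe pod s4-infrastructure-3171603516-8zj8k --namespace staging
kubectl get events --namespace staging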

What you expected to happen:

I expected a new pod to be created and its containers to start, and for the container using the log-gatherer-data volume to have the data it had before the deployment update.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

There are many mount/unmount errors in the kubelet journalctl log, attached.

logs.txt

The EBS volume backing the PVC is indeed attached to the node.
The mount state is:

sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,relatime,size=10240k,nr_inodes=479670,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,relatime,size=771468k,mode=755)
/dev/xvda1 on / type ext4 (rw,relatime,data=ordered)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=23,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
mqueue on /dev/mqueue type mqueue (rw,relatime)
rpc_pipefs on /run/rpc_pipefs type rpc_pipefs (rw,relatime)
/dev/xvdc on /mnt type ext3 (rw,relatime,data=ordered)
tmpfs on /var/lib/kubelet/pods/7cce5087-ab69-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-3mbvh type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/7d411a93-ab69-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-3mbvh type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/ae9bb673-ac0b-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-fp0o5 type tmpfs (rw,relatime)
/dev/xvdba on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-0f1ca7d3ab1426833 type ext4 (rw,relatime,data=ordered)
/dev/xvdba on /var/lib/kubelet/pods/ae9bb673-ac0b-11e6-b3c8-12e507f54388/volumes/kubernetes.io~aws-ebs/leastauthority-tweaks-kube-registry-pv type ext4 (rw,relatime,data=ordered)
/dev/xvdbc on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-0e80ac26be3edd63f type ext4 (rw,relatime,data=ordered)
/dev/xvdbb on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/vol-01b01d11a6b17e2de type ext4 (rw,relatime,data=ordered)
tmpfs on /var/lib/kubelet/pods/755e6718-ac11-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-zwvk5 type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/d81c5474-bbfb-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/web-secrets type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/d81c5474-bbfb-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-zwvk5 type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/d81c5474-bbfb-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/flapp-secrets type tmpfs (rw,relatime)
/dev/xvdbc on /var/lib/kubelet/pods/d81c5474-bbfb-11e6-b3c8-12e507f54388/volumes/kubernetes.io~aws-ebs/infrastructure-web-pv type ext4 (rw,relatime,data=ordered)
/dev/xvdbb on /var/lib/kubelet/pods/d81c5474-bbfb-11e6-b3c8-12e507f54388/volumes/kubernetes.io~aws-ebs/infrastructure-flapp-pv type ext4 (rw,relatime,data=ordered)
/dev/xvdbd on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1b/vol-04e25da2c73877960 type ext4 (rw,relatime,data=ordered)
tmpfs on /var/lib/kubelet/pods/4f92171a-bc98-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-36roi type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/f654cc46-bc9a-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/flapp-secrets type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/f654cc46-bc9a-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/web-secrets type tmpfs (rw,relatime)
tmpfs on /var/lib/kubelet/pods/f654cc46-bc9a-11e6-b3c8-12e507f54388/volumes/kubernetes.io~secret/default-token-36roi type tmpfs (rw,relatime)

The directory referenced by the stat error in the logs is empty:

admin@ip-172-20-84-61:~$ sudo ls -al /var/lib/kubelet/pods/938a5bbf-bc95-11e6-b3c8-12e507f54388/volumes/kubernetes.io~aws-ebs/
total 8
drwxr-x--- 2 root root 4096 Dec  7 15:57 .
drwxr-x--- 5 root root 4096 Dec  7 15:55 ..
admin@ip-172-20-84-61:~$ 
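
A quick way to confirm that nothing is mounted under that directory (a sketch; the path is the one from the stat error above):

sudo findmnt /var/lib/kubelet/pods/938a5bbf-bc95-11e6-b3c8-12e507f54388/volumes/kubernetes.io~aws-ebs/
# no output (non-zero exit status) means the directory is not a mount point
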
jingxu97 (Contributor) commented Dec 7, 2016

@exarkun the log you posted starting from 1207 16:30:01.276015, do you have older logs before that time? Thanks!

exarkun (Author) commented Dec 8, 2016

Probably. How much older would you like?

jingxu97 (Contributor) commented Dec 8, 2016

Basically starting from when you run the test

exarkun (Author) commented Dec 8, 2016

Hm. I picked a starting point just before the kubectl apply -f ... which resulted in the hung pod. The original deployment of the pod was days ago. I think there are several hundred megs of logs between that point and now.

exarkun (Author) commented Dec 12, 2016

All three deployment updates since I filed this issue have gotten stuck.

exarkun (Author) commented Dec 12, 2016

Note that while nothing is mounted at the /var/lib/kubelet/pods/... location, the EBS volume is attached to the instance and mounted elsewhere on the filesystem:

/dev/xvdbd on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-1b/vol-04e25da2c73877960 type ext4 (rw,relatime,data=ordered)
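
The attachment can also be confirmed from the AWS side (a sketch, assuming the AWS CLI is configured for this cluster's account and region):

aws ec2 describe-volumes --volume-ids vol-04e25da2c73877960 --query 'Volumes[0].Attachments'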

Also noteworthy: restarting kubelet on the affected node is sufficient to un-stuck the pod.
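
On a kops-provisioned Debian node the kubelet runs as a systemd unit, so the restart amounts to roughly the following (a sketch; the unit name may differ depending on how the node was set up):

sudo systemctl restart kubelet
sudo systemctl status kubelet   # confirm it came back up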

jingxu97 (Contributor) commented:

@exarkun I am thinking you might have hit issue #36269; the fix (#36840) is in release 1.4.7. Could you please check with the new version?

exarkun (Author) commented Dec 12, 2016

Ah, thanks. I'll try out 1.4.7 - probably tomorrow - and report back.

exarkun (Author) commented Dec 14, 2016

I've upgraded to 1.4.7. The error hasn't recurred since then. I'll close this for now. If it happens again I can re-open. Thanks.

exarkun closed this as completed Dec 14, 2016