Volume stuck in attaching state when using multiple PersistentVolumeClaim #36450

Closed
willis7 opened this issue Nov 8, 2016 · 21 comments
Labels: lifecycle/rotten, sig/storage

Comments

willis7 commented Nov 8, 2016

I'm using Kubernetes 1.4.5 with AWS EBS storage.

When I try to attach multiple volumes using PVCs, one of the volumes consistently gets stuck in the attaching state whilst the other attaches successfully.

Below are the definitions that I used.

sonar-persistence.yml

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: sonarqube-data
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: aws://eu-west-1a/vol-XXXXXXX
    fsType: ext4

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: sonarqube-extensions
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: aws://eu-west-1a/vol-XXXXXX
    fsType: ext4

sonar-claim.yml

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: sonarqube-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: sonarqube-extensions
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

sonar-deployment.yml

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: sonar
spec:
  replicas: 1
  template:
    metadata:
      name: sonar
      labels:
        name: sonar
    spec:
      containers:
        - image: sonarqube:lts
          args:
            - -Dsonar.web.context=/sonar
          name: sonar
          env:
            - name: SONARQUBE_JDBC_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-pwd
                  key: password
            - name: SONARQUBE_JDBC_URL
              value: jdbc:postgresql://sonar:5432/sonar
          ports:
            - containerPort: 9000
              name: sonar
          volumeMounts:
          - name: sonarqube-data
            mountPath: /opt/sonarqube/data
          - name: sonarqube-extensions
            mountPath: /opt/sonarqube/extensions
      volumes:
        - name: sonarqube-data
          persistentVolumeClaim:
            claimName: sonarqube-data
        - name: sonarqube-extensions
          persistentVolumeClaim:
            claimName: sonarqube-extensions

The data volume always attaches successfully, and perhaps coincidentally it is first in the list. I have tried this multiple times, but the result is always the same.

The error message is as follows:

Unable to mount volumes for pod "sonar-3504269494-tnzwo_default(2cc5292c-a5d4-11e6-bd99-0a82a8a86ebf)": timeout expired waiting for volumes to attach/mount for pod "sonar-3504269494-tnzwo"/"default". list of unattached/unmounted volumes=[sonarqube-data sonarqube-extensions]
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "sonar-3504269494-tnzwo"/"default". list of unattached/unmounted volumes=[sonarqube-data sonarqube-extensions]
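
The relevant attach/mount events can usually be pulled with commands like the following (a sketch; the pod and claim names are the ones from the manifests and error above):

# pod events show which volumes the kubelet is still waiting on
kubectl describe pod sonar-3504269494-tnzwo
# PVC status shows whether the claims are bound to the PVs at all
kubectl describe pvc sonarqube-data sonarqube-extensions
# cluster events often include the attach errors reported by the controller
kubectl get events -n default
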
eggie5 commented Nov 9, 2016

I am experiencing this too on GKE 1.4.5.

The pod mounts the PV initially; then after some time, ostensibly after the pod is moved to a new node, it can't remount the PV because the volume is still stuck on the last node.

justinsb added the sig/storage label and removed the area/kubectl label on Nov 15, 2016
gnufied commented Nov 16, 2016

@willis7 what do you see when you run kubectl describe pod sonar-3504269494-tnzwo -n default? Do you see any particular errors in the output?

willis7 commented Nov 16, 2016

@gnufied I don't still have this available, as I took another approach, but there were no errors beyond what I shared above.

gnufied commented Nov 18, 2016

@justinsb or @saad-ali I will take a stab at this. Do assign this to me, if it is not a problem.

whereisaaron commented Nov 23, 2016

We get this frequently with EBS PVC/PV volumes and see plenty of similar reports. It usually starts when recreating a Pod (the "Recreate" strategy). The old Pod is torn down and the PV unmounted, then the PV is mounted for the new Pod (on the same or a different worker). It seems like a quick unmount/mount cycle can trigger the 'stuck attaching' issue, which AWS blames on reusing device names (or perhaps reusing them too quickly):
https://aws.amazon.com/premiumsupport/knowledge-center/ebs-stuck-attaching/

A temporary fix is to tell AWS to force-detach the EBS volume, then wait; the new Pod will attach and recover within a few minutes. However, the next time you recreate that particular Pod you will almost certainly get the same stuck problem; once an instance+PV combo starts doing this, it seems to happen almost every time. The only long-term fix I have found/seen is to reboot the worker node or to delete and recreate the PVC/PV.
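
The force-detach itself can be done from the console or with the AWS CLI, roughly like this (the volume ID is a placeholder):

# forcibly detach the stuck EBS volume so it can be re-attached to the new node
aws ec2 detach-volume --volume-id vol-XXXXXXX --force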

It is a major hassle and we're looking to switch away from EBS to something more reliable for mounting, like EFS, NFS or GlusterFS.

I wondered about scaling the deployment to 0 instances first and waiting a while before redeploying. Not an attractive option though.
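
For the record, that would look something like this (deployment name taken from the example above, purely illustrative):

# stop the pod so the EBS volume detaches cleanly
kubectl scale deployment sonar --replicas=0
# wait until the volume shows as "available" in AWS, then bring the pod back
kubectl scale deployment sonar --replicas=1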

@saad-ali

> @justinsb or @saad-ali I will take a stab at this. Do assign this to me, if it is not a problem.

Thanks @gnufied

> We get this frequently with EBS PVC/PV volumes and see plenty of similar reports.

Sorry for the crappy experience! What version of kubernetes are you running?

I know @justinsb @jingxu97 have worked on a number of fixes to improve the AWS EBS experience. A big fix, #34859, went in to 1.4.6 and there are already fixes pending for 1.4.7: #36840

CC @kubernetes/sig-storage

rootfs commented Nov 23, 2016

@willis7
Can you provide kubectl describe pvc output and is it possible to share your controller log and kubelet log?

@saad-ali

@willis7 and @eggie5 Could you also try 1.4.6+ if you get a chance and see if you can reproduce it there?

eggie5 commented Nov 23, 2016

@saad-ali upgraded to 1.4.6 today, I'll keep an eye on it...

@saad-ali

@eggie5 Thanks!

@whereisaaron

@saad-ali no need to apologize; even before k8s, 'stuck attaching' was a known EBS condition (hence the AWS FAQ). It just came up less often before k8s, because it was much less common to unmount and remount EBS volumes between instances every few minutes the way a k8s CD deployment does :-)

Thanks for the tip about the upcoming patches by @justinsb and @jingxu97. We create clusters using CoreOS kube-aws; the latest release is 1.4.3 and master is 1.4.6, I think. I might test a 1.4.6 cluster if I can.

I see AWS EFS or similar as a more natural fit for smallish disk volumes for k8s anyway:

  • no need to decide/estimate volume sizes ahead of time or resize later
  • you can mount it on multiple nodes, allowing more options for rolling deployments

Unfortunately, EFS is taking its own sweet time to get to the southern hemisphere; Java-committee-process slow :-P
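
For what it's worth, once EFS is available in-region it can be mounted through the generic NFS volume plugin rather than awsElasticBlockStore; a minimal PV sketch (the filesystem DNS name below is a made-up placeholder):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: sonarqube-data-efs
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  nfs:
    # placeholder: replace with your EFS filesystem's mount target DNS name
    server: fs-12345678.efs.eu-west-1.amazonaws.com
    path: /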

willis7 commented Nov 23, 2016

Hey gang, @whereisaaron has summed up my scenario perfectly in his first post. I shall fire up another cluster, and see if this is resolved with the latest patches. Many thanks!

gnufied commented Nov 24, 2016

I tried reproducing this with the latest version and I think the situation has definitely improved. Here is my deployment file:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx2
spec:
  replicas: 1
  strategy:
    type: "Recreate"
  template:
    metadata:
      labels:
        run: nginx2
    spec:
      containers:
      - name: nginx2
        image: nginx
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: "/opt1"
          name: pvol1
        - mountPath: "/opt2"
          name: pvol2
        - mountPath: "/opt3"
          name: pvol3
        - mountPath: "/opt4"
          name: pvol4
      volumes:
      - name: pvol1
        persistentVolumeClaim:
          claimName: "gnufied-vol1"
      - name: pvol2
        persistentVolumeClaim:
          claimName: "gnufied-vol2"
      - name: pvol3
        persistentVolumeClaim:
          claimName: "gnufied-vol3"
      - name: pvol4
        persistentVolumeClaim:
          claimName: "gnufied-vol4"

and I bumped the nginx image version to trigger a new deployment. I couldn't reproduce the issue, so I think the situation has definitely improved in the latest version.
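
The image bump was done with something along these lines (the exact tag is arbitrary):

# changing the image tag is enough to trigger the Recreate rollout
kubectl set image deployment/nginx2 nginx2=nginx:1.11.5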

If those who are still seeing this problem can attach kubelet.log and kube-controller-manager.log, that would be very helpful for making this area more robust.
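
Assuming kubelet and the controller-manager run as systemd units (unit names vary by install; on static-pod control planes use kubectl logs against the kube-system pod instead), something like:

# on the affected worker node
journalctl -u kubelet --since "1 hour ago" > kubelet.log
# on the master
journalctl -u kube-controller-manager --since "1 hour ago" > kube-controller-manager.log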

@craigwillis85

So, can we use EFS instead of awsElasticBlockStore?

I see the docs mention that awsElasticBlockStore doesn't support nodes in different availability zones. My nodes are in the same region, but in different availability zones.

I'm guessing I can't use awsElasticBlockStore in this case? Or can I?

pajel commented Dec 14, 2016

Unfortunately, I don't think the situation has improved.
Our setup:
Ubuntu 16.04.1, kernel 4.4.0-47-generic
K8s: 1.4.6

After a pod gets rescheduled, its volume detaches correctly but then gets stuck in the attaching state. I checked with AWS support and got this response:

Unfortunately, the issue is on the underlying host side and not an Ubuntu problem. Restarting your instance causes the relevant information on our side to reset, so that makes the device available for use again. You can also achieve the same result by stopping the instance and starting it again, which moves your instance to a new underlying host. Without a restart, you can work around the issue by choosing a different device, or avoid the problem by making sure the volume is fully unmounted and no longer in use before detaching, but that's about it I'm afraid. I am sorry for any inconvenience.

The issue seems to be that k8s is not waiting for the unmount to fully finish before issuing a detach command to AWS. So the device name is not released yet.
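
One way to confirm this from the AWS side is to watch the volume's attachment record while the pod is rescheduled (the volume ID is a placeholder):

# shows the instance, device name, and attaching/attached/detaching state
aws ec2 describe-volumes --volume-ids vol-XXXXXXX --query 'Volumes[0].Attachments'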

It also looks like a duplicate of #31891.

@jingxu97

@pajel could you please check whether you have the same issue as #37662. That problem is fixed in release 1.4.7. If you think yours is different, please let me know more details about your issue and share the logs with us. Thanks!

pajel commented Dec 15, 2016

@jingxu97 thanks for your reply. #37662 seems to be different, as their EBS volume is attached but not picked up by k8s, while in our case the EBS volume is stuck in the attaching state.
However, #31891 seems like the exact same issue; even the logs are the same. I'll follow up there, thank you.

@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Dec 19, 2017
@whereisaaron

It looks like stuck volumes are just an intrinsic, unavoidable risk of using EBS to mount/unmount on a running instance, but in addition to rotating device names, there is further mitigation in place for k8s 1.9:

In v1.9 SIG AWS has improved stability of EBS support across the board. If a Volume is “stuck” in the attaching state to a node for too long a unschedulable taint will be applied to the node, so a Kubernetes admin can take manual steps to correct the error. Users are encouraged to ensure they are monitoring for the taint, and should consider automatically terminating instances in this state.
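
A quick way to keep an eye on that taint (I believe the key applied by the AWS cloud provider is NodeWithImpairedVolumes, but check the 1.9 release notes for your build):

# list every node along with any taints currently applied to it
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints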

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 18, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
