Is there any upgrade guide to 1.2 #391

Closed
jerry153fish opened this issue Mar 24, 2021 · 20 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jerry153fish

jerry153fish commented Mar 24, 2021

Hello all,

Are there any docs regarding upgrading to 1.2? Or will it just magically upgrade to 1.2 as long as we set up the service account?

@k8s-ci-robot
Contributor

@jerry153fish: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support

Hello all,

Are there any docs regarding upgrading to 1.2? Or will it just magically upgrade to 1.2 as long as we set up the service account?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johnjeffers

johnjeffers commented Apr 28, 2021

I'd like to know the answer to this question as well. I've tried upgrading twice and have had to roll back to 1.1 both times because pods have trouble both mounting and releasing the existing EFS PVCs. Rolling back fixes the problems immediately. I don't know what I'm missing.

I have added the serviceaccount and, as far as I can tell, set up the correct IAM permissions and assigned them to it. I have many other deployments in my cluster using IRSA permissions, so I know that works for other deployments.

@wongma7
Contributor

wongma7 commented Apr 28, 2021

It should just be a matter of running helm upgrade; we are indeed lacking docs on this, though.

@johnjeffers Regarding the issue with mounts hanging:

If you are able to, could you test the master tag of the driver available on Docker Hub (by adding --set image.repository=amazon/aws-efs-csi-driver --set image.tag=master to the helm install command)? The fact that simply rolling back from 1.2 fixes your issue is suspect.
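
For example, the full command would look something like this; the release name, chart reference, and namespace are placeholders from my own setup, so adjust them to however you installed the driver:

helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --set image.repository=amazon/aws-efs-csi-driver \
  --set image.tag=master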


Details:

The only major change in 1.2 that would affect all mounts is the bump of the efs-utils dependency (https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/CHANGELOG-1.x.md). I actually ended up rolling back this efs-utils change in the master branch, as I was in the process of debugging some CI flakiness with symptoms similar to what you are reporting. If you can confirm for me that the issue is NOT present in the master branch, I will release a 1.2.1 version of the driver that has the changes from 1.2 MINUS the efs-utils dependency change.

BTW, it sounds similar to #401 and #325.

@johnjeffers

@wongma7 Do you want me to try this with the 1.1.2 version of the Helm chart that I'm pinned to right now, or with the latest chart version? My current chart version won't create the controller deployment, only the daemonset.

@wongma7
Contributor

wongma7 commented Apr 28, 2021

@johnjeffers The chart you are pinned to right now, yes. The issue must be in the daemonset, and I'm trying to control for the efs-utils version while keeping everything else, including the chart, equal.

If even master doesn't work, one other tag to try is ba2b561. amazon/aws-efs-csi-driver:ba2b561

master has efs-utils v1.30.1
v1.2.0 has efs-utils v1.29.1
ba2b561 has efs-utils v1.28.2
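
If re-running helm is a hassle, you could also swap the image on the daemonset directly, roughly like this; the daemonset and container names are whatever your chart created (mine are efs-csi-node and efs-plugin):

kubectl -n kube-system set image daemonset/efs-csi-node efs-plugin=amazon/aws-efs-csi-driver:ba2b561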

I really appreciate you testing this out; I haven't had the chance to reproduce this issue.

@johnjeffers

Here's what I'm seeing:

I deploy the master tag and wait for the efs-csi-node daemonset rollout to complete.

Then, I attempt to delete a pod that mounts an EFS PVC. The pod gets stuck in Terminating, but a replacement pod does come up successfully, assuming the deployment uses a rolling update strategy. For a deployment that uses a replace strategy, it gets stuck in Terminating and no replacement pod starts, of course.

After 10 minutes or so, I force delete the pod that's stuck in terminating.

Subsequent deletes of the pods appear to behave normally. It's only the first delete, after the daemonset is updated, where I see the deleted pod get stuck in terminating.
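
(The force delete I'm running is just the usual one, something like the following, with placeholder names:)

kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force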

@wongma7
Contributor

wongma7 commented Apr 29, 2021

OK, thank you. I'm preparing a v1.2.1 release with efs-utils downgraded to v1.28.1 (#429), because if master doesn't work, then efs-utils v1.30.1 doesn't fix the issue either.

Subsequent deletes of the pods appear to behave normally

I assume you mean the replacement pods that get spawned by the deployment rollout.

This aligns with my basic understanding of what is happening. efs-utils takes care of maintaining the state of mounts. So it seems like, for whatever reason, volumes originally mounted by efs-utils v1.28.1/driver v1.1.1 are sometimes NOT able to be unmounted by efs-utils v1.29.1/driver v1.2.0, whereas subsequent mounts made by efs-utils v1.29.1/driver v1.2.0 ARE able to be unmounted by it.
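
If you want to double-check which efs-utils version a node pod is actually running, the mount watchdog logs it on startup, so something along these lines should show it (the pod name here is just an example from my own cluster):

kubectl exec -n kube-system efs-csi-node-bp2s9 -c efs-plugin -- grep version /var/log/amazon/efs/mount-watchdog.log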

@johnjeffers

I assume you mean the replacement pods that get spawned by the deployment rollout.

Yes. For example, I have a Grafana deployment that uses an EFS PVC. After I deployed the CSI driver with master, I tried deleting the Grafana pod. It hung in terminating until I --force deleted it. Subsequent deletes worked normally.

@jerry153fish
Author

Hi @wongma7, can we add the Kubernetes manifest docs as well? Thanks very much.

@johnjeffers

Hi @wongma7, can we add the Kubernetes manifest docs as well? Thanks very much.

They are here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/deploy/kubernetes
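
If you'd rather install from the manifests than the chart, I believe the kustomize overlay in that directory can be applied directly, roughly like this (double-check the overlay path and ref against the README):

kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"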

@johnjeffers

@wongma7 I have some more info about the problems I'm seeing with the new version (this is using master, not 1.2). When I checked in on things this morning, I saw that my Grafana pod is dead.

» k get po
NAME                      READY   STATUS    RESTARTS   AGE
grafana-f9bfddbd6-4xpqd   0/1     Running   0          16h

It says it's Running but I can't exec into it.

» k exec -ti grafana-f9bfddbd6-4xpqd -- bash
error: unable to upgrade connection: container not found ("grafana")

Here are the latest events:

» k get events
LAST SEEN   TYPE      REASON                 OBJECT                        MESSAGE
22m         Warning   Unhealthy              pod/grafana-f9bfddbd6-4xpqd   Readiness probe failed: Get "http://10.192.34.92:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
20m         Warning   Unhealthy              pod/grafana-f9bfddbd6-4xpqd   Liveness probe failed: Get "http://10.192.34.92:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
9m58s       Warning   Unhealthy              pod/grafana-f9bfddbd6-4xpqd   Readiness probe failed: Get "http://10.192.34.92:3000/api/health": dial tcp 10.192.34.92:3000: connect: connection refused
2s          Warning   FailedSync             pod/grafana-f9bfddbd6-4xpqd   error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
78s         Normal    TaintManagerEviction   pod/grafana-f9bfddbd6-4xpqd   Cancelling deletion of Pod grafana/grafana-f9bfddbd6-4xpqd

Rolling back to v1.1.1 fixes the problem immediately. As soon as the daemonset pods are replaced, Grafana comes back up in seconds.
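
(The rollback itself is just a plain helm rollback to whichever revision had 1.1.1; the release name and revision below are placeholders:)

helm history aws-efs-csi-driver -n kube-system
helm rollback aws-efs-csi-driver <revision> -n kube-system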

@johnjeffers

@wongma7 what's the status on this? Did 1.2.1 get rolled out with the downgraded efs-utils?

@wongma7
Contributor

wongma7 commented May 14, 2021

@johnjeffers Yes, sorry, I forgot to leave an update here! Helm chart 1.2.4 contains driver 1.2.1.

We (@kbasv) also managed to narrow down the issue, and it should be fixed in the latest version of efs-utils, v1.31.1. We won't be releasing that for a while though; of course, we'll regression test it first.
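
Picking up 1.2.1 should just be a matter of pinning the chart version, something like this; the release and repo names are placeholders for however you installed it:

helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --version 1.2.4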

@esalberg

esalberg commented Jun 9, 2021

@wongma7 - following up on the efs-utils version, I noticed that the version used in the new release 1.30 is efs-utils 1.30.2-1. Does this mean that the efs-utils issue was fixed prior to 1.31.1?

@kbasv

kbasv commented Jun 9, 2021

@esalberg The new release v1.3 should have efs-utils v1.31.1, and the fix was released as part of efs-utils v1.31.1.

@johnjeffers

It looks like 1.3.1 reintroduced the bad behavior. When I rolled out 1.3.1, pods with EFS volumes attached started failing. Rolling back to 1.2.1 restored things.

@wongma7
Contributor

wongma7 commented Jun 17, 2021

I couldn't reproduce this; upgrading from driver 1.2.1 to 1.3.1 worked for me (i.e. my Pod could continue to read and write before/after the upgrade). I was using the dynamic provisioning example: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/dynamic_provisioning

I did helm upgrade --install, but for debugging purposes you can also do an in-place upgrade of just one specific node plugin (i.e. kubectl edit the node pod to change the image from 1.2.1 to 1.3.1).

Please capture logs from the 1.3.1 efs-plugin while volumes appear to be stuck. For reference, here is what my 1.3.1 efs-plugin node pod logs after the upgrade; it successfully "resumes" the mount/TLS tunnel.

 k exec efs-csi-node-bp2s9 -n kube-system efs-plugin -- cat /var/log/amazon/efs/mount-watchdog.log 
Defaulting container name to efs-plugin.
Use 'kubectl describe pod/efs-csi-node-bp2s9 -n kube-system' to see all of the containers in this pod.
2021-06-17 20:07:54,011 - INFO - amazon-efs-mount-watchdog, version 1.31.1, is enabled and started
2021-06-17 20:07:54,017 - WARNING - TLS tunnel for fs-8fb2ae88.var.lib.kubelet.pods.42a76c52-fe74-4a8d-bf49-d367c45247d9.volumes.kubernetes.io~csi.pvc-cbc88c24-3ea8-42e2-8fab-953bdf78c097.mount.20141 is not running
2021-06-17 20:07:54,018 - INFO - Starting TLS tunnel: "/usr/bin/stunnel /var/run/efs/stunnel-config.fs-8fb2ae88.var.lib.kubelet.pods.42a76c52-fe74-4a8d-bf49-d367c45247d9.volumes.kubernetes.io~csi.pvc-cbc88c24-3ea8-42e2-8fab-953bdf78c097.mount.20141"
2021-06-17 20:07:54,024 - INFO - Started TLS tunnel, pid: 19

You can also try the troubleshooting script on the efs-plugin node pod: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting
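
To grab the plugin's own logs in addition to the watchdog log above, something like this should work (the pod name is from my cluster):

kubectl logs efs-csi-node-bp2s9 -n kube-system -c efs-plugin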

@johnjeffers

@wongma7 That was a false alarm. I had some unrelated symptoms that looked very similar to the previous problem, and I jumped to the wrong conclusion. Thank you for the quick reply, and my apologies!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale) on Sep 16, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Oct 16, 2021