Is there any upgrade guide to 1.2 #391

Closed
jerry153fish opened this issue Mar 24, 2021 · 20 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jerry153fish

jerry153fish commented Mar 24, 2021

Hello all,

Are there any docs regarding upgrading to 1.2? Or will it just magically upgrade to 1.2 as long as we set up the service account?

@k8s-ci-robot
Contributor

@jerry153fish: The label(s) triage/support cannot be applied, because the repository doesn't have them.

In response to this:

/triage support

Hello all,

Are there any docs regarding upgrading to 1.2? Or will it just magically upgrade to 1.2 as long as we set up the service account?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@johnjeffers

johnjeffers commented Apr 28, 2021

I'd like to know the answer to this question as well. I've tried upgrading twice and have had to roll back to 1.1 both times because pods have trouble both mounting and releasing the existing EFS PVCs. Rolling back fixes the problems immediately. I don't know what I'm missing.

I have added the serviceaccount and, as far as I can tell, set up the correct IAM permissions and assigned them to it. I have many other deployments in my cluster using IRSA permissions, so I know that works for other deployments.

@wongma7
Contributor

wongma7 commented Apr 28, 2021

It should just be a matter of running helm upgrade; we are indeed lacking docs on this, though.

@johnjeffers Regarding the issue with mounts hanging:

If you are able to, could you test the master tag of the driver available on Docker Hub (by adding --set image.repository=amazon/aws-efs-csi-driver --set image.tag=master to the helm install command)? The fact that simply rolling back from 1.2 fixes your issue is suspect.
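
For example, the full command would look something like this; the release name, chart reference, and namespace are placeholders from my own setup, so adjust them to however you installed the driver:

helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --set image.repository=amazon/aws-efs-csi-driver \
  --set image.tag=master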


Details:

The only major change in 1.2 that would affect all mounts is the bump of the efs-utils dependency (https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/CHANGELOG-1.x.md). I actually ended up rolling back this efs-utils change in the master branch, as I was in the process of debugging some CI flakiness with symptoms similar to what you are reporting. If you can confirm for me that the issue is NOT present in the master branch, I will release a 1.2.1 version of the driver that has the changes from 1.2 MINUS the efs-utils dependency change.

BTW, it sounds similar to #401 and #325.

@johnjeffers

@wongma7 Do you want me to try this with the 1.1.2 version of the Helm chart that I'm pinned to right now, or with the latest chart version? My current chart version won't create the controller deployment, only the daemonset.

@wongma7
Contributor

wongma7 commented Apr 28, 2021

@johnjeffers The chart you are pinned to right now, yes. The issue must be in the daemonset, and I'm trying to control for the efs-utils version while keeping everything else, including the chart, equal.

If even master doesn't work, one other tag to try is ba2b561. amazon/aws-efs-csi-driver:ba2b561

master has efs-utils v1.30.1
v1.2.0 has efs-utils v1.29.1
ba2b561 has efs-utils v1.28.2
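
If re-running helm is a hassle, you could also swap the image on the daemonset directly, roughly like this; the daemonset and container names are whatever your chart created (mine are efs-csi-node and efs-plugin):

kubectl -n kube-system set image daemonset/efs-csi-node efs-plugin=amazon/aws-efs-csi-driver:ba2b561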

I really appreciate you testing this out; I haven't had the chance to reproduce this issue.

@johnjeffers

Here's what I'm seeing:

I deploy the master tag and wait for the efs-csi-node daemonset rollout to complete.

Then, I attempt to delete a pod that mounts an EFS PVC. The pod gets stuck in Terminating, but a replacement pod does come up successfully, assuming the deployment uses a rolling update strategy. For a deployment that uses a replace strategy, it gets stuck in Terminating and no replacement pod starts, of course.

After 10 minutes or so, I force delete the pod that's stuck in terminating.

Subsequent deletes of the pods appear to behave normally. It's only the first delete, after the daemonset is updated, where I see the deleted pod get stuck in terminating.
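
(The force delete I'm running is just the usual one, something like the following, with placeholder names:)

kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force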

@wongma7
Contributor

wongma7 commented Apr 29, 2021

OK, thank you. I'm preparing a v1.2.1 release with efs-utils downgraded to v1.28.1 (#429), because if master doesn't work, then efs-utils v1.30.1 doesn't fix the issue either.

Subsequent deletes of the pods appear to behave normally

I assume you mean the replacement pods that get spawned by the deployment rollout.

This aligns with my basic understanding of what is happening. efs-utils takes care of maintaining the state of mounts. So it seems like, for whatever reason, volumes originally mounted by efs-utils v1.28.1/driver v1.1.1 are sometimes NOT able to be unmounted by efs-utils v1.29.1/driver v1.2.0, whereas subsequent mounts made by efs-utils v1.29.1/driver v1.2.0 ARE able to be unmounted by it.
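
If you want to double-check which efs-utils version a node pod is actually running, the mount watchdog logs it on startup, so something along these lines should show it (the pod name here is just an example from my own cluster):

kubectl exec -n kube-system efs-csi-node-bp2s9 -c efs-plugin -- grep version /var/log/amazon/efs/mount-watchdog.log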

@johnjeffers

I assume you mean the replacement pods that get spawned by the deployment rollout.

Yes. For example, I have a Grafana deployment that uses an EFS PVC. After I deployed the CSI driver with master, I tried deleting the Grafana pod. It hung in terminating until I --force deleted it. Subsequent deletes worked normally.

@jerry153fish
Author

Hi @wongma7, can we add the Kubernetes manifest docs as well? Thanks very much.

@johnjeffers

Hi @wongma7, can we add the Kubernetes manifest docs as well? Thanks very much.

They are here: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/deploy/kubernetes
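
If you'd rather install from the manifests than the chart, I believe the kustomize overlay in that directory can be applied directly, roughly like this (double-check the overlay path and ref against the README):

kubectl apply -k "github.com/kubernetes-sigs/aws-efs-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"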

@johnjeffers

@wongma7 I have some more info about the problems I'm seeing with the new version (this is using master, not 1.2). When I checked in on things this morning, I saw that my Grafana pod is dead.

» k get po
NAME                      READY   STATUS    RESTARTS   AGE
grafana-f9bfddbd6-4xpqd   0/1     Running   0          16h

It says it's Running but I can't exec into it.

» k exec -ti grafana-f9bfddbd6-4xpqd -- bash
error: unable to upgrade connection: container not found ("grafana")

Here are the latest events:

» k get events
LAST SEEN   TYPE      REASON                 OBJECT                        MESSAGE
22m         Warning   Unhealthy              pod/grafana-f9bfddbd6-4xpqd   Readiness probe failed: Get "http://10.192.34.92:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
20m         Warning   Unhealthy              pod/grafana-f9bfddbd6-4xpqd   Liveness probe failed: Get "http://10.192.34.92:3000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
9m58s       Warning   Unhealthy              pod/grafana-f9bfddbd6-4xpqd   Readiness probe failed: Get "http://10.192.34.92:3000/api/health": dial tcp 10.192.34.92:3000: connect: connection refused
2s          Warning   FailedSync             pod/grafana-f9bfddbd6-4xpqd   error determining status: rpc error: code = DeadlineExceeded desc = context deadline exceeded
78s         Normal    TaintManagerEviction   pod/grafana-f9bfddbd6-4xpqd   Cancelling deletion of Pod grafana/grafana-f9bfddbd6-4xpqd

Rolling back to v1.1.1 fixes the problem immediately. As soon as the daemonset pods are replaced, Grafana comes back up in seconds.
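
(The rollback itself is just a plain helm rollback to whichever revision had 1.1.1; the release name and revision below are placeholders:)

helm history aws-efs-csi-driver -n kube-system
helm rollback aws-efs-csi-driver <revision> -n kube-system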

@johnjeffers

@wongma7 what's the status on this? Did 1.2.1 get rolled out with the downgraded efs-utils?

@wongma7
Contributor

wongma7 commented May 14, 2021

@johnjeffers Yes, sorry, I forgot to leave an update here! Helm chart 1.2.4 contains driver 1.2.1.

We (@kbasv) also managed to narrow down the issue, and it should be fixed in the latest version of efs-utils, v1.31.1. We won't be releasing that for a while though; of course, we'll regression test it first.
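
Picking up 1.2.1 should just be a matter of pinning the chart version, something like this; the release and repo names are placeholders for however you installed it:

helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  --version 1.2.4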

@esalberg

esalberg commented Jun 9, 2021

@wongma7 - following up on the efs-utils version, I noticed that the version used in the new release 1.30 is efs-utils 1.30.2-1. Does this mean that the efs-utils issue was fixed prior to 1.31.1?

@kbasv

kbasv commented Jun 9, 2021

@esalberg The new release v1.3 should have efs-utils v1.31.1, and the fix was released as part of efs-utils v1.31.1.

@johnjeffers

It looks like 1.3.1 reintroduced the bad behavior. When I rolled out 1.3.1, pods with EFS volumes attached started failing. Rolling back to 1.2.1 restored things.

@wongma7
Contributor

wongma7 commented Jun 17, 2021

I couldn't reproduce this; upgrading from driver 1.2.1 to 1.3.1 worked for me (i.e. my Pod could continue to read and write before/after the upgrade). I was using the dynamic provisioning example: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/dynamic_provisioning

I did helm upgrade --install, but for debugging purposes you can also do an in-place upgrade of just one specific node plugin (i.e. kubectl edit the node pod to change the image from 1.2.1 to 1.3.1).

Please capture logs from the 1.3.1 efs-plugin while volumes appear to be stuck. For reference, here is what my 1.3.1 efs-plugin node pod logs after the upgrade; it successfully "resumes" the mount/TLS tunnel.

 k exec efs-csi-node-bp2s9 -n kube-system efs-plugin -- cat /var/log/amazon/efs/mount-watchdog.log 
Defaulting container name to efs-plugin.
Use 'kubectl describe pod/efs-csi-node-bp2s9 -n kube-system' to see all of the containers in this pod.
2021-06-17 20:07:54,011 - INFO - amazon-efs-mount-watchdog, version 1.31.1, is enabled and started
2021-06-17 20:07:54,017 - WARNING - TLS tunnel for fs-8fb2ae88.var.lib.kubelet.pods.42a76c52-fe74-4a8d-bf49-d367c45247d9.volumes.kubernetes.io~csi.pvc-cbc88c24-3ea8-42e2-8fab-953bdf78c097.mount.20141 is not running
2021-06-17 20:07:54,018 - INFO - Starting TLS tunnel: "/usr/bin/stunnel /var/run/efs/stunnel-config.fs-8fb2ae88.var.lib.kubelet.pods.42a76c52-fe74-4a8d-bf49-d367c45247d9.volumes.kubernetes.io~csi.pvc-cbc88c24-3ea8-42e2-8fab-953bdf78c097.mount.20141"
2021-06-17 20:07:54,024 - INFO - Started TLS tunnel, pid: 19

You can also try the troubleshooting script on the efs-plugin node pod: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/troubleshooting
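
To grab the plugin's own logs in addition to the watchdog log above, something like this should work (the pod name is from my cluster):

kubectl logs efs-csi-node-bp2s9 -n kube-system -c efs-plugin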

@johnjeffers

@wongma7 That was a false alarm. I had some unrelated symptoms that looked very similar to the previous problem, and I jumped to the wrong conclusion. Thank you for the quick reply, and my apologies!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale) on Sep 16, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label on Oct 16, 2021