
ebs CSI volume detach failure #431

Closed

zhoudayongdennis opened this issue Dec 24, 2019 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@zhoudayongdennis

/kind bug

What happened?

  1. Defined stateful pods in a Kubernetes cluster successfully. After rebooting the nodes one by one, some pods were stuck in ContainerCreating status, e.g.
    default app-ebs-1 0/1 ContainerCreating 0 21h
  2. Describing pod app-ebs-1 showed the following events:
    Events:
    Type Reason Age From Message

Warning FailedMount 35m (x122 over 21h) kubelet, ip-10-0-3-28.us-east-2.compute.internal Unable to attach or mount volumes: unmounted volumes=[ebspvc], unattached volumes=[default-token-4fdt8 ebspvc]: timed out waiting for the condition
Warning FailedMount 3m31s (x448 over 21h) kubelet, ip-10-0-3-28.us-east-2.compute.internal Unable to attach or mount volumes: unmounted volumes=[ebspvc], unattached volumes=[ebspvc default-token-4fdt8]: timed out waiting for the condition
Warning FailedAttachVolume 75s (x646 over 21h) attachdetach-controller AttachVolume.Attach failed for volume "pvc-fad1c767-22cf-11ea-9a1d-0661b881b6f6" : volume attachment is being deleted
  3. The attacher log contained the following message:
{"log":"I1224 05:49:56.456376 1 connection.go:184] GRPC error: rpc error: code = Internal desc = Could not detach volume "vol-0678c4ebfb20d577b" from node "i-00c71bf216a632245": could not detach volume "vol-0678c4ebfb20d577b" from node "i-00c71bf216a632245": IncorrectState: Volume 'vol-0678c4ebfb20d577b'is in the 'available' state.\n","stream":"stderr","time":"2019-12-24T05:49:56.456465393Z"}

What you expected to happen?
If the volume is already in the 'available' state, why does the driver need to perform a detach operation at all? And even if it does detach, it should return success for an available volume rather than failing as it does now, right?

How to reproduce it (as minimally and precisely as possible)?
a. Apply a StatefulSet with 3 replicas that uses the aws-ebs-csi StorageClass.
b. Reboot the node.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
    v1.16.4
  • Driver version:
    v0.4.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 24, 2019
@leakingtapan
Contributor

leakingtapan commented Dec 29, 2019

This should be fixed by #375

Have you tested the driver with the latest image tag?

@zhoudayongdennis
Author

zhoudayongdennis commented Jan 3, 2020

I just compared your change with the private change I made to work around the failure. I just want to confirm: the ErrNotFound return value will NOT be treated as a failure case, right?

Here is the change you made in the DetachDisk function of cloud.go:

@@ -401,6 +401,11 @@ func (c *cloud) DetachDisk(ctx context.Context, volumeID, nodeID string) error {

	_, err = c.ec2.DetachVolumeWithContext(ctx, request)
	if err != nil {
		if isAWSErrorIncorrectState(err) ||
			isAWSErrorInvalidAttachmentNotFound(err) ||
			isAWSErrorVolumeNotFound(err) {
			return ErrNotFound
		}
		return fmt.Errorf("could not detach volume %q from node %q: %v", volumeID, nodeID, err)
	}

Here is the private change I made in the same function:
	_, err = c.ec2.DetachVolumeWithContext(ctx, request)
	if err != nil {
		if !device.IsAlreadyAssigned {
			klog.Warningf("DetachDisk called on non-attached volume, ignore error: %s", volumeID)
			return nil
		}

		return fmt.Errorf("could not detach volume %q from node %q: %v", volumeID, nodeID, err)
	}
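
For reference, the isAWSError* helpers used in that diff are presumably thin wrappers that compare the AWS SDK error code against the EC2 API codes. A minimal sketch, assuming aws-sdk-go's awserr package, with the error-code strings inferred from the attacher log rather than copied from the repo:

package cloud

import (
	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isAWSError reports whether err is an awserr.Error carrying the given EC2 API error code.
func isAWSError(err error, code string) bool {
	if awsErr, ok := err.(awserr.Error); ok {
		return awsErr.Code() == code
	}
	return false
}

func isAWSErrorIncorrectState(err error) bool {
	// "IncorrectState" is the code visible in the attacher log above.
	return isAWSError(err, "IncorrectState")
}

func isAWSErrorInvalidAttachmentNotFound(err error) bool {
	// Assumed EC2 code for "attachment does not exist".
	return isAWSError(err, "InvalidAttachment.NotFound")
}

func isAWSErrorVolumeNotFound(err error) bool {
	// Assumed EC2 code for "volume does not exist".
	return isAWSError(err, "InvalidVolume.NotFound")
}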

@zhoudayongdennis
Author

zhoudayongdennis commented Jan 3, 2020

My project checks out the 0.4.0 branch instead of the master branch.

What's the difference between 0.4.0 and master? Should I use the master branch for the image build?

@zhoudayongdennis
Author

Is there any schedule for a new release?

@leakingtapan
Contributor

What's the difference between 0.4.0 and master?

Here is a list of changes: v0.4.0...master

Should I use the master branch for the image build?

Are you using it for testing purposes or in production? If it's for production, I would recommend waiting for the v0.5.0 release.

@zhoudayongdennis
Author

OK, I will wait for 0.5.0. Do you have a schedule for it?

@leakingtapan
Contributor

leakingtapan commented Jan 3, 2020

I just want to confirm: the ErrNotFound return value will NOT be treated as a failure case, right?

Yep. With the change, the driver will return success when detaching a NotFound volume. Could you test the container image with the latest tag and see if it fixes your issue?
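
In other words, once DetachDisk returns the ErrNotFound sentinel, the controller's unpublish path can treat it as "already detached". A rough sketch of that handling, assuming the standard CSI Go bindings and a hypothetical controllerService type wrapping the cloud client (not the driver's exact code):

package driver

import (
	"context"
	"errors"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// ErrNotFound mirrors the sentinel error returned by DetachDisk in the diff above.
var ErrNotFound = errors.New("resource was not found")

// cloudClient is a hypothetical narrow interface over the driver's cloud package.
type cloudClient interface {
	DetachDisk(ctx context.Context, volumeID, nodeID string) error
}

type controllerService struct {
	cloud cloudClient
}

func (d *controllerService) ControllerUnpublishVolume(ctx context.Context, req *csi.ControllerUnpublishVolumeRequest) (*csi.ControllerUnpublishVolumeResponse, error) {
	volumeID := req.GetVolumeId()
	nodeID := req.GetNodeId()

	if err := d.cloud.DetachDisk(ctx, volumeID, nodeID); err != nil {
		if errors.Is(err, ErrNotFound) {
			// The volume is already detached (or gone), so the unpublish is a
			// no-op: report success instead of an Internal error.
			return &csi.ControllerUnpublishVolumeResponse{}, nil
		}
		return nil, status.Errorf(codes.Internal, "could not detach volume %q from node %q: %v", volumeID, nodeID, err)
	}

	return &csi.ControllerUnpublishVolumeResponse{}, nil
}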

@SimonDreher

I am also interested in whether there is a planned date for the 0.5.0 release.

This is blocking us from migrating to Kubernetes v1.15, since there we need v0.4.0, and with this bug every deployment with persistent volumes breaks (until fixed manually) whenever its node dies.

If the 0.5.0 release will still take some time, would it be possible to cherry-pick the fix for this and publish a 0.4.1 release?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 5, 2020
@leakingtapan
Contributor

/close

as v0.5.0 is released

@k8s-ci-robot
Contributor

@leakingtapan: Closing this issue.

In response to this:

/close

as v0.5.0 is released

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
