Possible race condition in EBS volume creation #1951
Comments
So I just got another two of these today, right at the same time and in the same Kubernetes cluster. What I found really interesting was that the logs showed them both pointing to a single EBS volume that was created for a different PVC:
That PVC shows as being deleted at 11:46:11 in CloudTrail, which fits the pattern from last time.
Hi, thanks for the great writeup - we've been vaguely aware of this issue for some time and it's on our backlog to fix. In the meantime, deleting and recreating the affected PVC/VolumeSnapshot object should act as a workaround for the rare times you hit this issue.
For us it's happening almost all the time (maybe 95 out of 100 attempts) - any updates would be appreciated.
We're working on a fix for this issue. However, that failure rate is highly unusual and likely indicates your volumes are failing to create because of invalid parameters (for example, an invalid KMS key, or one the driver doesn't have access to).
Hi 👋🏻
I think I've found a race condition that happens very occasionally when creating a PV for a new PVC. The outcome of the race condition is a PVC that is permanently stuck in `Pending` and has to be deleted and recreated.

As a ballpark estimate, based on the number of `CreateVolume` calls per day in CloudTrail and how frequently we see this issue, I would guess that this happens on the order of 1 in 100,000 volume creations.

I've substituted the actual PV and EBS volume names below with placeholders like `pvc-foo` for ease of reading. The identifier of each PVC corresponds directly to the identifier of the EBS volume (i.e. PV `pvc-foo` corresponds to EBS volume `vol-foo`).

Summary
Based on CloudTrail events, it seems that aws-ebs-csi-driver sometimes creates and then shortly afterwards deletes an EBS volume that was meant to provide the PV for a new PVC.
Once that has happened, the controller repeatedly retries the creation of the EBS volume, but every retry fails because of how the API request to AWS is formed. Specifically:

- The `clientToken`, which is used as an idempotency key in the request to the AWS API, is generated from the name of the PV.
- The `clientToken` will therefore always be the same for a specific PVC, and the retries will fail indefinitely.

From the driver side, every retry results in the following error:
and in the CloudTrail event, the error message is:
All of the CloudTrail events have the same `userIdentity` (the service account assumed by the driver) and source IP address (the driver's pod IP).

Possible EBS volume ID confusion
One really weird thing I saw in the driver logs was this:

It seems to suggest that the driver was looking for an EBS volume with the ID `vol-bar` when creating the PV `pvc-foo`. That other volume belonged to a PV that was created about 10 minutes earlier and deleted while `pvc-foo` was in the process of being provisioned (see timeline below).

In our application, due to backup schedules, we do a fair amount of disk creation and deletion on the hour mark. I'm wondering if the increased load at those times is enough to trigger a race condition inside the driver.
Timeline
All times in UTC
04:50:03 - CloudTrail - EBS volume `vol-bar` created for `pvc-bar`
05:00:09 - CloudTrail - EBS volume `vol-foo` created for `pvc-foo`
05:00:13 - CloudTrail - EBS volume `vol-bar` deleted
05:00:14 - aws-ebs-csi-driver - Could not create PV `pvc-foo` because it couldn't find EBS volume `vol-bar` in AWS (see the section "Possible EBS volume ID confusion" above)
05:00:14 - CloudTrail - EBS volume `vol-foo` deleted
05:00:15 - CloudTrail - EBS volume `vol-foo` creation attempted again, but fails with `Client.IdempotentParameterMismatch` (this repeats indefinitely)

Note: I can't be sure of the order of events between aws-ebs-csi-driver and CloudTrail. In particular, the two events at 05:00:14 may have happened in a different order.
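The indefinite `Client.IdempotentParameterMismatch` loop at the end of the timeline follows from the token being deterministic. A minimal sketch of that property (the hashing scheme here is an assumption for illustration; the driver's actual derivation may differ):

```python
import hashlib


def client_token(pv_name: str) -> str:
    # Hypothetical derivation: a deterministic hash of the PV name. The
    # real driver's scheme may differ; what matters is that the same PV
    # name always yields the same token.
    return hashlib.sha256(pv_name.encode()).hexdigest()


# Every retry for the same PV sends an identical clientToken, so the AWS
# idempotency layer keeps matching it against the original (now-deleted)
# request and responds with Client.IdempotentParameterMismatch instead of
# creating a fresh volume.
first_try = client_token("pvc-foo")
retry = client_token("pvc-foo")
print(first_try == retry)  # prints True
```

This is why deleting and recreating the PVC works as a workaround: a new PV name produces a new token, breaking the collision.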
Software versions
Kubernetes: v1.26.12-eks-5e0fdde
aws-ebs-csi-driver: v1.28.0
Potentially related issues
I don't think any of these are the same issue as they affect every PVC created by the people reporting them. They seem more like config or compatibility issues on their end, whereas this seems more like a race condition.
More info
I'm happy to get any more info you need to debug this. I think CloudTrail events stick around for 90 days and our own logs should stick around for a decent amount of time too.
/kind bug