DependencyViolation: resource sg has a dependent object #194
Comments
Do you have details on what this ENI is?
This suggests that the ENI is created by Kubernetes. Are you sure this is not something that the Kubernetes load balancer support is doing?
Ahh - so the CNI allocates ENIs on instances dynamically as pods are scheduled. Likely related to that? https://github.com/aws/amazon-vpc-cni-k8s#eni-allocation
Correct. Pod IPs are attached via the ENI, so pods sticking around during deletion could be the root source of this. We've seen other instances of this problem in general, where Pulumi does not know about resources stood up and managed by k8s and/or EKS, creating similar scenarios.
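For anyone else digging into this, here is a minimal sketch (assuming the AWS SDK for JavaScript v3; the region and worker instance ID are placeholders) of how to list the ENIs attached to a node. ENIs allocated by the aws-k8s-cni plugin typically carry a description like `aws-K8S-<instance-id>`, which helps confirm whether the lingering ENI came from the CNI:

```typescript
import {
  EC2Client,
  DescribeNetworkInterfacesCommand,
} from "@aws-sdk/client-ec2";

// Placeholder values - substitute the region and worker instance ID from your cluster.
const ec2 = new EC2Client({ region: "us-west-2" });
const instanceId = "i-0123456789abcdef0";

async function listNodeEnis() {
  // Find every ENI currently attached to the worker instance.
  const resp = await ec2.send(
    new DescribeNetworkInterfacesCommand({
      Filters: [{ Name: "attachment.instance-id", Values: [instanceId] }],
    })
  );
  for (const eni of resp.NetworkInterfaces ?? []) {
    // CNI-created ENIs usually have a description of the form "aws-K8S-<instance-id>".
    console.log(
      eni.NetworkInterfaceId,
      eni.Status,
      eni.Description,
      (eni.PrivateIpAddresses ?? []).map((ip) => ip.PrivateIpAddress)
    );
  }
}

listNodeEnis().catch(console.error);
```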
Note: further deep dives lead me to conclude that this is a bug of sorts in AWS/EKS, k8s, and/or the aws-k8s-cni plugin.
We're seeing this:

Updates:

@lukehoban thoughts?
From the two most recent comments, I can't quite tell: is the assumption that this is a CNI-related issue, or a Kubernetes LB issue? There seem to be references to both, but they seem like different things with different fixes.
TBH I'm not sure. Both cases seem valid and possibly related, but I can't say definitively. Leaked ENIs have always been the common cause of the failure to delete the secgroup that produces the DependencyViolation. I mention the LB issues because they may have something to do with this issue, and they are a common denominator with a failed CI run in pulumi/examples#348.
Again, the cause of the 400 on the secgroup deletion request is a leaked ENI: it is in the available state but still associated with the secgroup, which holds up the secgroup deletion. Having seen this exact issue a couple of times now, aws-cni#69 seems to be the likely source. Here's the ENI attached to the secgroup from the CI output linked:
I wonder if for tests we should try setting
@lukehoban I set it. It does not seem that dropping the setting helped.
That's unfortunate... Looking into aws/amazon-vpc-cni-k8s#69 a bit more, as well as the other symptoms here and the IPAMD implementation, it does appear that this is the case we are hitting:

In particular, if we kill the instance at any point in this code, it looks like it will leak the ENI. My best guess for a workaround would be to wait for CNI warm pool population to quiesce before killing the instances/workloads. Can we just delay 5 minutes in the tear-down? We might also be able to get IPAMD logs to understand more about which ENIs it is trying to allocate, and when?
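To make the "wait for the warm pool to settle" idea concrete, a rough sketch of that kind of wait (not the actual test harness; the security-group ID, region, and timeout are assumptions) would poll until no ENIs reference the node security group before proceeding with the tear-down:

```typescript
import {
  EC2Client,
  DescribeNetworkInterfacesCommand,
} from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-west-2" });

// Poll until no ENIs reference the given security group, or until the timeout expires.
async function waitForEnisToDrain(sgId: string, timeoutMs = 5 * 60 * 1000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const resp = await ec2.send(
      new DescribeNetworkInterfacesCommand({
        Filters: [{ Name: "group-id", Values: [sgId] }],
      })
    );
    const remaining = resp.NetworkInterfaces ?? [];
    if (remaining.length === 0) {
      return; // Nothing references the secgroup anymore; safe to delete it.
    }
    console.log(`Waiting on ${remaining.length} ENI(s) still referencing ${sgId}...`);
    await new Promise((resolve) => setTimeout(resolve, 15_000));
  }
  throw new Error(`Timed out waiting for ENIs to detach from ${sgId}`);
}

// Hypothetical usage with a placeholder security-group ID:
// await waitForEnisToDrain("sg-0123456789abcdef0");
```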
Yes, I'll give this a shot. For context, in CloudTrail we're seeing the error coming from aws-cni's use of aws-sdk-go as it attempts to
I've tried tailing the aws-cni logs as well as kube-proxy on cluster tear-downs, and nothing meaningful is logged before the log stream gets clipped due to the tear-down. I've considered also adding fluentd to the setup to funnel container logs to CloudWatch, but there is always about a 1-minute delay to flush the logs to CW, and this would also expand the footprint of the test. That said, I'll look into it if it isn't too much trouble, and see if it can help gather more insight into what's taking place.
Hey, sorry about this. I can think of two ways to fix this issue.
Thanks for the reply and follow-up, @mogren - much appreciated!

This makes sense. IIUC though, I don't believe these settings deal with the leaked ENI issue that's causing the OP. Were there any follow-up issues to track these items?
@metral Opened a PR to handle this. Also released v1.5.3 with some more startup fixes. (aws/amazon-vpc-cni-k8s@65873cf might be of interest to you; it solves aws/amazon-vpc-cni-k8s#537.)
Thank you @mogren!
I appreciate the update! aws/amazon-vpc-cni-k8s@65873cf resolved our leaked ENI issue in v1.5.2. I'll make sure to check out the new fixes in v1.5.3 👍
We're hitting an issue on EKS cluster tear-downs: the security group deletion request fails (HTTP 400) due to a lingering ENI, created by AWS for the worker instance, that does not get deleted first.
A manual workaround is to go into the console and delete the ENI first; the sg can then be deleted as expected.
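For reference, a sketch of what that manual workaround looks like when scripted (the region and security-group ID are placeholders; this is not part of the provider's delete logic, just an illustration): find any leftover `available` ENIs still referencing the security group, delete them, then retry the security-group deletion:

```typescript
import {
  EC2Client,
  DescribeNetworkInterfacesCommand,
  DeleteNetworkInterfaceCommand,
  DeleteSecurityGroupCommand,
} from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-west-2" });

async function cleanUpAndDeleteSg(sgId: string) {
  // Look up ENIs that still reference the security group.
  const resp = await ec2.send(
    new DescribeNetworkInterfacesCommand({
      Filters: [{ Name: "group-id", Values: [sgId] }],
    })
  );
  for (const eni of resp.NetworkInterfaces ?? []) {
    // Only detached ("available") ENIs can be deleted directly.
    if (eni.Status === "available" && eni.NetworkInterfaceId) {
      console.log(`Deleting leaked ENI ${eni.NetworkInterfaceId}`);
      await ec2.send(
        new DeleteNetworkInterfaceCommand({
          NetworkInterfaceId: eni.NetworkInterfaceId,
        })
      );
    }
  }
  // With the dependent ENIs gone, the security group delete should succeed.
  await ec2.send(new DeleteSecurityGroupCommand({ GroupId: sgId }));
}

// Hypothetical usage:
// await cleanUpAndDeleteSg("sg-0123456789abcdef0");
```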
The only diff between this setup and a vanilla EKS cluster is that there is a workload deployed to k8s with a classic ELB, all of which gets torn down together on `pulumi destroy`; but the ENI shows it belongs to the instance, so I doubt it's related to the LB. Semi-related / same err message: hashicorp/terraform-provider-aws#1671 (comment).
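For context, a stripped-down sketch of the kind of setup involved (not the actual repro program; names and the container image are placeholders): an @pulumi/eks cluster plus a Deployment fronted by a classic ELB Service, all torn down together by `pulumi destroy`:

```typescript
import * as eks from "@pulumi/eks";
import * as k8s from "@pulumi/kubernetes";

// A minimal EKS cluster using the default node group and VPC settings.
const cluster = new eks.Cluster("demo-cluster");

const appLabels = { app: "nginx" };

// A simple workload deployed into the cluster...
const deployment = new k8s.apps.v1.Deployment("nginx", {
    spec: {
        selector: { matchLabels: appLabels },
        replicas: 2,
        template: {
            metadata: { labels: appLabels },
            spec: { containers: [{ name: "nginx", image: "nginx" }] },
        },
    },
}, { provider: cluster.provider });

// ...exposed via a classic ELB (Service of type LoadBalancer).
const service = new k8s.core.v1.Service("nginx", {
    spec: {
        type: "LoadBalancer",
        selector: appLabels,
        ports: [{ port: 80, targetPort: 80 }],
    },
}, { provider: cluster.provider });

export const kubeconfig = cluster.kubeconfig;
export const serviceHostname = service.status.loadBalancer.ingress[0].hostname;
```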