Timeout waiting for IPAM to start causing perpetual failure (1.6.1, EKS 1.15) #1055
This is curious indeed:
Basically, the instance metadata service is refusing the call. It could again be because of throttling, but this time on the instance metadata service rather than the EC2 API. The CNI does not call instance metadata very often, on average around once every 5 seconds. Is there some other service or pod on the node that makes much more frequent calls? Also:
This suggests that the node has a lot of processes running, or for some reason has very low limits set. Is the load on these nodes very high? |
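For anyone else debugging this, a minimal way to spot-check both theories from the node itself might look like the following (a sketch assuming shell access on the instance and IMDSv1-style access; with IMDSv2 enforced you would need to fetch a token first, and none of this is specific to the CNI):
curl -s http://169.254.169.254/latest/meta-data/local-ipv4   # should answer almost instantly; delays or errors hint at metadata throttling
ps -e --no-headers | wc -l                                   # rough count of processes running on the node
ulimit -u                                                    # max user processes for the current user
uptime                                                       # load averages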
I don't think there's anything else on-node that would be using that metadata service. The DaemonSets we have are nginx, datadog, consul and nodelocaldns, and of course kube-proxy. We do leverage Cluster Autoscaler (a single-instance Deployment) as well, which would call the API around ASGs, but I don't think that should interfere here.
If you feel strongly enough about this being part of the issue, I can do more research on the state of the node when the CNI restarted, but it's still in this restarting state. I still have this node available so please feel free to give me a slew of tests you'd like to see done. I feel like this has to do with a long-term change that's happened somehow. Additionally, the node has been drained and cordoned so there's just the DaemonSets running on it, hence less load while still having the CNI in a restarting state. |
Looks like I have a similar issue, despite having both ec2:DescribeNetworkInterfaces and sts:AssumeRoleWithWebIdentity explicitly attached to the node role. Please note "caused by: [nothing here]". |
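As a side note for anyone hitting the same symptom, a rough way to spot-check those permissions from the node is something like this (assumes the AWS CLI is installed and uses whatever credentials the node role provides; the region is just an example):
aws sts get-caller-identity                                            # shows which role/identity the calls are actually made with
aws ec2 describe-network-interfaces --max-items 5 --region us-west-2   # should succeed if ec2:DescribeNetworkInterfaces is attached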
@InAnimaTe Sorry for being late here. If you're still having the issue, we can jump on a call and debug it. |
@InAnimaTe Did you upload your logs somewhere? I tried to find the link but couldn't find it. |
Thanks for responding @SaranBalaji90 !! I did not upload logs, just the excerpts I've shared. The logs are in our support tickets, so I could pull them out if need be. We haven't upgraded, but the issue is much less of a problem now. I would be interested in possibly getting on a call to talk through this. However, I'm not entirely sure it makes sense if we're aware that newer versions fix this? |
Reopened #486 to discuss this further. |
Sounds good. I have not yet tested if this still occurs on newer CNI versions and have no real way to replicate it. Seems to be completely random, and hasn't happened to us in a little while anyways. |
Synced up with @InAnimaTe offline too. Just to update everyone here, we haven't made any changes to ipamd to fix this issue. Hopefully driving #486 to completion will help us to find the actual issue here. |
FWIW, I'm running on EKS 1.17 and amazon-k8s-cni:v1.6.3. When I attempt to roll my CNI DaemonSet (for example, changing the DaemonSet node tags), I've frequently seen this issue:
The only solution I've found has been to recycle the node, which makes this super painful. |
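For context, the kind of roll described above is roughly the following (plain kubectl; the k8s-app=aws-node label is the default one on the DaemonSet, and the pod name is a placeholder):
kubectl -n kube-system rollout restart daemonset/aws-node
kubectl -n kube-system get pods -l k8s-app=aws-node -w          # watch for pods going into CrashLoopBackOff
kubectl -n kube-system describe pod <stuck-aws-node-pod>        # readiness/liveness failures show up under Events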
@nigelellis Can you also paste the ipamd logs? |
@SaranBalaji90 how do I get the ipamd logs? |
@nigelellis https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni is the relevant doc. If you have access to the node, running sudo bash /opt/cni/bin/aws-cni-support.sh should give you an archive with all relevant logs. (Note, you might want to send this through AWS support or email, depending on the size of the log file.) |
Thanks, I'll see if I can reproduce it tomorrow.
|
I was able to get a repro. The final log line is:
The pod logs returned:
I was tailing the ipamd log and the container exits with:
Here's the pod spec:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/limit-ranger: 'LimitRanger plugin set: memory request for container
aws-node; cpu, memory limit for container aws-node'
kubernetes.io/psp: eks.privileged
creationTimestamp: "2020-09-02T20:54:17Z"
generateName: aws-node-
labels:
controller-revision-hash: 88769d4f9
k8s-app: aws-node
pod-template-generation: "9"
name: aws-node-569fm
namespace: kube-system
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: aws-node
uid: 87a32608-ed1a-11e8-8d8d-026b59889896
resourceVersion: "183343173"
selfLink: /api/v1/namespaces/kube-system/pods/aws-node-569fm
uid: 5a4e21da-b7dd-4f0f-a5fa-2c3039a31a04
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- ip-10-10-196-9.us-west-2.compute.internal
automountServiceAccountToken: true
containers:
- env:
- name: AWS_VPC_K8S_CNI_LOGLEVEL
value: DEBUG
- name: MY_NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4
imagePullPolicy: Always
name: aws-node
ports:
- containerPort: 61678
hostPort: 61678
name: metrics
protocol: TCP
resources:
limits:
cpu: 300m
memory: 300Mi
requests:
cpu: 10m
memory: 300Mi
securityContext:
allowPrivilegeEscalation: true
privileged: true
readOnlyRootFilesystem: false
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/opt/cni/bin
mountPropagation: None
name: cni-bin-dir
- mountPath: /host/etc/cni/net.d
mountPropagation: None
name: cni-net-dir
- mountPath: /host/var/log
mountPropagation: None
name: log-dir
- mountPath: /var/run/docker.sock
mountPropagation: None
name: dockersock
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: aws-node-token-vw5dm
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
nodeName: ip-10-10-196-9.us-west-2.compute.internal
priority: 2000001000
priorityClassName: system-node-critical
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: aws-node
serviceAccountName: aws-node
shareProcessNamespace: false
terminationGracePeriodSeconds: 30
tolerations:
- operator: Exists
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/network-unavailable
operator: Exists
volumes:
- hostPath:
path: /opt/cni/bin
type: ""
name: cni-bin-dir
- hostPath:
path: /etc/cni/net.d
type: ""
name: cni-net-dir
- hostPath:
path: /var/log
type: ""
name: log-dir
- hostPath:
path: /var/run/docker.sock
type: ""
name: dockersock
- name: aws-node-token-vw5dm
secret:
defaultMode: 420
secretName: aws-node-token-vw5dm
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-09-02T20:54:17Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2020-09-04T00:31:35Z"
message: 'containers with unready status: [aws-node]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2020-09-04T00:31:35Z"
message: 'containers with unready status: [aws-node]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2020-09-02T20:54:17Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://3fd7469cd1288c1ca964f50730d8d489692c046094d2f05c81615f07de4caec7
image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.4
imageID: docker-pullable://602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni@sha256:e01133675b0ac9857392593aaaf62e56dd8a28c1bf6a23eac34cd577a9c2da20
lastState:
terminated:
containerID: docker://3fd7469cd1288c1ca964f50730d8d489692c046094d2f05c81615f07de4caec7
exitCode: 1
finishedAt: "2020-09-04T00:31:35Z"
reason: Error
startedAt: "2020-09-04T00:30:59Z"
name: aws-node
ready: false
restartCount: 7
started: false
state:
waiting:
message: back-off 5m0s restarting failed container=aws-node pod=aws-node-569fm_kube-system(5a4e21da-b7dd-4f0f-a5fa-2c3039a31a04)
reason: CrashLoopBackOff
hostIP: 10.10.196.9
phase: Running
podIP: 10.10.196.9
podIPs:
- ip: 10.10.196.9
qosClass: Burstable
startTime: "2020-09-02T20:54:17Z"
To reproduce this issue:
kubectl -n kube-system exec -it aws-node-XXX -- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log
I'm running EKS kubelet v1.17.9-eks-4c6976 |
Thanks a lot @nigelellis, this should help us reproduce the issue. Is this the final line in the logs?
So reading from the docker shim is hanging and locking up ipamd, is that correct? What does |
@mogren yes, that's the last line in the log before the pod terminates with error 137. My read is it dies while reading from Docker. I'm running the stock AWS image. Looks like it's running Docker. Does that help? |
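A couple of on-node sanity checks that may help narrow down whether the Docker side is what's hanging (paths are the usual Docker defaults on EKS AMIs of that era; adjust if yours differ):
ls -l /var/run/docker.sock /var/run/dockershim.sock                        # both sockets should exist on a Docker-based node
sudo curl -s --unix-socket /var/run/docker.sock http://localhost/version   # the Docker API should answer promptly; a hang here points at the runtime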
@mogren any idea on when this issue might be resolved? LMK if I can help with more logs, etc. Thanks. |
@nigelellis Hey, sorry for the late reply on this. In the pod spec you pasted, I don't see unix:///var/run/dockershim.sock being mounted into the aws-node pod. The docker.sock is a Docker-specific API that was only used in CNI v1.5.x and earlier. The missing lines are aws-k8s-cni.yaml#L141-L142 and aws-k8s-cni.yaml#L156-L158. This was mentioned in the release notes for v1.6.0 and the next 3 releases when updating from earlier versions. Sorry for not noticing this right away. Could you please try adding /var/run/dockershim.sock? |
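If editing the manifest by hand is inconvenient, roughly the same change can also be applied as a patch. This is only a sketch: the volume name here is arbitrary, and the canonical lines are the aws-k8s-cni.yaml ones linked above.
kubectl -n kube-system patch daemonset aws-node --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/containers/0/volumeMounts/-",
   "value": {"mountPath": "/var/run/dockershim.sock", "name": "dockershim"}},
  {"op": "add", "path": "/spec/template/spec/volumes/-",
   "value": {"name": "dockershim", "hostPath": {"path": "/var/run/dockershim.sock"}}}
]'
The DaemonSet controller then recreates the aws-node pods with the new mount.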
Thank you, I'll try this out tomorrow.
|
@mogren - I can confirm that updating the daemonset to mount /var/run/dockershim.sock resolved the issue. |
Thanks a lot for confirming @nigelellis! There will be a v1.7.2 out very soon with a lot of fixes. The v1.7.x versions now have an init container as well, so the config has changed a lot since v1.6.x. |
@mogren will 1.7 be recommended for EKS 1.17? I still see 1.6.3 on the EKS upgrade guide. |
@nigelellis Yes, as soon as we make it the default for new clusters. 🙂 |
@InAnimaTe Are you still seeing this issue? Could you please try using v1.7.3? |
Hey @mogren so we have not seen this anymore in the past couple months in any of our environments. We are planning to upgrade to 1.7.2/3 in the next few weeks so I'll be looking out for these issues but please don't keep this open on my account. I can reopen if I see clear evidence of this happening again. Thanks! |
Great, thanks a lot @InAnimaTe, and please open a new ticket if you see any issues! |
We've been plagued for a while (at least through 1.5.x, maybe even before) with the CNI dying (not sure how) and never coming back up. This sort of issue causes a node to continue accepting workloads that can't provide any value because networking is broken.
For one of the nodes where this happened, we cordoned it and blocked CA from touching it for observation. All of the information below relates to the aws-node instance running on that node, aws-node-mq5bk.
We had case 7079142701 open with AWS, where they suggested #1019 (issue #1011) would fix the problem but couldn't assist beyond that. We've also reached out to our TAM to set up a meeting with your CNI team, and he's failed to respond in over a week.
Therefore, I'm writing this ticket to gather the available information from what I'm seeing, because I don't quite think this problem is like the others (see the list at the bottom). Please @mogren, you're my only hope!
Logs K8s provides (kubectl logs):
The support person found errors in the logs provided via eks-logs-collector.sh (ipamd.log):
(plugin.log):
A unique readiness probe failure from what I think is around the time the container crashed and ceased to come back (~Jun 3/4):
I then see this liveness probe failure about 20 seconds after:
Here's more container detail in its current state:
Other suggested issues: