
Pod startup connectivity issue when using calico and vpc-cni #493

Closed
mlsmaycon opened this issue Jun 3, 2019 · 10 comments · Fixed by #1629, getamis/terraform-ignition-kubernetes#53 or getamis/vishwakarma#143

@mlsmaycon
Contributor

mlsmaycon commented Jun 3, 2019

Kubernetes cluster

  • Kubernetes v1.12.6-eks-d69f1b
  • AWS CNI plugin (1.1 and 1.4)
  • Calico (3.1.3 and 3.3.6)
  • Kubernetes nodes: several instance groups with t2, r4, c4, and c5 instance types
  • Kubernetes node OS image: amazon-eks-node-1.12-v20190329

What is the issue

While running Calico for network policy, a pod is unable to connect to other IPs at startup for a period that varies from 300 ms up to 1.2 s.

This doesn't happen when the pod runs on a node where Calico's Felix daemon wasn't provisioned.

How we reproduce the issue:

Run a pod through a CronJob that does a network test, such as a ping to its own Kubernetes node. This is the example we used:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: mlsmaycon
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: mlsmaycon
        spec:
          containers:
          - name: mlsmaycon
            image: mlsmaycon/nettools
            imagePullPolicy: IfNotPresent
            args:
            - /bin/bash
            - -c
            - echo myPOD IP $MY_POD_IP;COUNT=0;date +"%Y-%m-%d %H:%M:%S,%3N";until timeout 0.1 ping -c1 $MY_HOST_IP; do echo Ping Failed on $(date +"%Y-%m-%d %H:%M:%S,%3N");COUNT=$((COUNT + 1));done;echo Ping failed $COUNT times;date +"%Y-%m-%d %H:%M:%S,%3N"
            env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: MY_HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            resources:
              requests:
                cpu: 100m
                memory: 128Mi
          restartPolicy: Never

It pings the host IP until it succeeds. I added a 100 ms timeout per attempt to get a more accurate measure of how long the pod couldn't reach the host IP.
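
For reference, the probe one-liner in the manifest above is easier to read when written with a YAML block scalar; this is functionally the same command:

args:
- /bin/bash
- -c
- |
  echo "myPOD IP $MY_POD_IP"
  COUNT=0
  date +"%Y-%m-%d %H:%M:%S,%3N"
  # Each ping attempt is capped at 100 ms so failures are counted quickly.
  until timeout 0.1 ping -c1 "$MY_HOST_IP"; do
    echo "Ping Failed on $(date +'%Y-%m-%d %H:%M:%S,%3N')"
    COUNT=$((COUNT + 1))
  done
  echo "Ping failed $COUNT times"
  date +"%Y-%m-%d %H:%M:%S,%3N"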

Result

POD OUTPUT: mlsmaycon-1559528520-jfccx 
myPOD IP 10.20.56.24
2019-06-03 02:22:07,312
Ping Failed on 2019-06-03 02:22:07,415
Ping Failed on 2019-06-03 02:22:07,517
Ping Failed on 2019-06-03 02:22:07,619
Ping Failed on 2019-06-03 02:22:07,721
Ping Failed on 2019-06-03 02:22:07,823
Ping Failed on 2019-06-03 02:22:07,926
Ping Failed on 2019-06-03 02:22:08,028
Ping Failed on 2019-06-03 02:22:08,130
Ping Failed on 2019-06-03 02:22:08,232
Ping Failed on 2019-06-03 02:22:08,335
Ping Failed on 2019-06-03 02:22:08,438
Ping Failed on 2019-06-03 02:22:08,545
PING 10.20.57.46 (10.20.57.46) 56(84) bytes of data.
64 bytes from 10.20.57.46: icmp_seq=1 ttl=255 time=0.064 ms

--- 10.20.57.46 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.064/0.064/0.064/0.000 ms
Ping failed 12 times
2019-06-03 02:22:08,548

It failed for 1.2 s before it could reach its own host.

Logs

  • ipamd.log
2019-06-03T02:22:06Z [INFO]  Add/Update for Pod mlsmaycon-1559528520-jfccx on my node, namespace = mtt-staging, IP =
2019-06-03T02:22:06Z [INFO]  Add/Update for Pod mlsmaycon-1559528520-jfccx on my node, namespace = mtt-staging, IP =
2019-06-03T02:22:06Z [INFO] Received AddNetwork for NS /proc/16434/ns/net, Pod mlsmaycon-1559528520-jfccx, NameSpace mtt-staging, Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d, ifname eth0
2019-06-03T02:22:06Z [DEBUG] AssignIPv4Address: IP address pool stats: total:10, assigned 2
2019-06-03T02:22:06Z [INFO] AssignPodIPv4Address Assign IP 10.20.56.24 to pod (name mlsmaycon-1559528520-jfccx, namespace mtt-staging container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d)
2019-06-03T02:22:06Z [DEBUG] VPC CIDR 10.10.0.0/20
2019-06-03T02:22:06Z [DEBUG] VPC CIDR 10.20.0.0/16
2019-06-03T02:22:06Z [INFO] Send AddNetworkReply: IPv4Addr 10.20.56.24, DeviceNumber: 2, err: <nil>
2019-06-03T02:22:08Z [INFO]  Add/Update for Pod mlsmaycon-1559528520-jfccx on my node, namespace = mtt-staging, IP = 10.20.56.24
2019-06-03T02:22:09Z [INFO]  Add/Update for Pod mlsmaycon-1559528520-jfccx on my node, namespace = mtt-staging, IP = 10.20.56.24
2019-06-03T02:22:09Z [INFO] Received DelNetwork for IP <nil>, Pod mlsmaycon-1559528520-jfccx, Namespace mtt-staging, Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d
2019-06-03T02:22:09Z [DEBUG] UnAssignIPv4Address: IP address pool stats: total:10, assigned 3, pod(Name: mlsmaycon-1559528520-jfccx, Namespace: mtt-staging, Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d)
2019-06-03T02:22:09Z [INFO] UnAssignIPv4Address: pod (Name: mlsmaycon-1559528520-jfccx, NameSpace mtt-staging Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d)'s ipAddr 10.20.56.24, DeviceNumber2
2019-06-03T02:22:09Z [INFO] Send DelNetworkReply: IPv4Addr 10.20.56.24, DeviceNumber: 2, err: <nil>
2019-06-03T02:22:09Z [INFO] Received DelNetwork for IP <nil>, Pod mlsmaycon-1559528520-jfccx, Namespace mtt-staging, Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d
2019-06-03T02:22:09Z [DEBUG] UnAssignIPv4Address: IP address pool stats: total:10, assigned 2, pod(Name: mlsmaycon-1559528520-jfccx, Namespace: mtt-staging, Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d)
2019-06-03T02:22:09Z [WARN] UnassignIPv4Address: Failed to find pod mlsmaycon-1559528520-jfccx namespace mtt-staging Container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d
2019-06-03T02:22:09Z [DEBUG] UnAssignIPv4Address: IP address pool stats: total:10, assigned 2, pod(Name: mlsmaycon-1559528520-jfccx, Namespace: mtt-staging, Container )
2019-06-03T02:22:09Z [WARN] UnassignIPv4Address: Failed to find pod mlsmaycon-1559528520-jfccx namespace mtt-staging Container
  • plugin.log
2019-06-03T02:22:06Z [INFO] Received CNI add request: ContainerID(85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d) Netns(/proc/16434/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=mtt-staging;K8S_POD_NAME=mlsmaycon-1559528520-jfccx;K8S_POD_INFRA_CONTAINER_ID=85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-06-03T02:22:06Z [INFO] Received add network response for pod mlsmaycon-1559528520-jfccx namespace mtt-staging container 85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d: 10.20.56.24, table 2, external-SNAT: true, vpcCIDR: [10.10.0.0/20 10.20.0.0/16]
2019-06-03T02:22:07Z [INFO] Added toContainer rule for 10.20.56.24/32
2019-06-03T02:22:07Z [INFO] Added rule priority 1536 from 10.20.56.24/32 table 2
2019-06-03T02:22:09Z [INFO] Starting CNI Plugin v1.4.0-3-g2647de75  ...
2019-06-03T02:22:09Z [INFO] Received CNI del request: ContainerID(85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d) Netns(/proc/16434/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=mtt-staging;K8S_POD_NAME=mlsmaycon-1559528520-jfccx;K8S_POD_INFRA_CONTAINER_ID=85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
2019-06-03T02:22:09Z [INFO] Delete toContainer rule for 10.20.56.24/32
2019-06-03T02:22:09Z [INFO] Delete Rule List By Src [{10.20.56.24 ffffffff}]
2019-06-03T02:22:09Z [INFO] Remove current list [[ip rule 1536: from 10.20.56.24/32 table 2]]
2019-06-03T02:22:09Z [INFO] Delete fromContainer rule for 10.20.56.24/32 in table 2
2019-06-03T02:22:09Z [INFO] Starting CNI Plugin v1.4.0-3-g2647de75  ...
2019-06-03T02:22:09Z [INFO] Received CNI del request: ContainerID(85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d) Netns(/proc/16434/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=mtt-staging;K8S_POD_NAME=mlsmaycon-1559528520-jfccx;K8S_POD_INFRA_CONTAINER_ID=85bdbe866890a6081b29b342ce40b94dae110c3b1cce8bcec13d26f3eb80e87d) Path(/opt/cni/bin) argsStdinData({"cniVersion":"","name":"aws-cni","type":"aws-cni","vethPrefix":"eni"})
  • typha.log
2019-06-03 02:22:06.185 [DEBUG][6] conversion.go 133: Pod is not scheduled. pod="mlsmaycon-1559528520-jfccx"
2019-06-03 02:22:06.185 [DEBUG][6] watcher.go 110: Event converted to a no-op resource="Pod"
2019-06-03 02:22:06.193 [DEBUG][6] conversion.go 203: Converting pod to WorkloadEndpoint pod=&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:mlsmaycon-1559528520-jfccx,GenerateName:mlsmaycon-1559528520-,Namespace:mtt-staging,SelfLink:/api/v1/namespaces/mtt-staging/pods/mlsmaycon-1559528520-jfccx,UID:61b3a06b-85a6-11e9-8cb3-0a954d2dbc68,ResourceVersion:28816236,Generation:0,CreationTimestamp:2019-06-03 02:22:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: mlsmaycon,controller-uid: 61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68,job-name: mlsmaycon-1559528520,role: backend,},Annotations:map[string]string{iam.amazonaws.com/role: role,},OwnerReferences:[{batch/v1 Job mlsmaycon-1559528520 61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68 0xc421276aa3 0xc421276aa4}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-fbnhh {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-fbnhh,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{mlsmaycon mlsmaycon/nettools [] [/bin/bash -c echo myPOD IP $MY_POD_IP;COUNT=0;date +"%Y-%m-%d %H:%M:%S,%3N";until timeout 0.1 ping -c1 $MY_HOST_IP; do echo Ping Failed on $(date +"%Y-%m-%d %H:%M:%S,%3N");COUNT=$((COUNT + 1));done;echo Ping failed $COUNT times;date +"%Y-%m-%d %H:%M:%S,%3N"]  [] [] [{MY_POD_IP  EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {MY_HOST_IP  &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] {map[] map[cpu:{{100 -3} {<nil>} 100m DecimalSI} memory:{{134217728 0} {<nil>}  BinarySI}]} [{default-token-fbnhh true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Never,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-40-57-46.us-east-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:&Affinity{NodeAffinity:&NodeAffinity{RequiredDuringSchedulingIgnoredDuringExecution:&NodeSelector{NodeSelectorTerms:[{[{nodegroup In [debug]}] []}],},PreferredDuringSchedulingIgnoredDuringExecution:[],},PodAffinity:nil,PodAntiAffinity:nil,},SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc421276c90} {node.kubernetes.io/unreachable Exists  NoExecute 0xc421276cb0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-06-03 02:22:06 +0000 UTC  }],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:Burstable,InitContainerStatuses:[],NominatedNodeName:,},}
2019-06-03 02:22:06.193 [DEBUG][6] conversion.go 187: Pod has no IP. ip=""

The block above repeated one more time in less than 300 ms before I got the following:

2019-06-03 02:22:08.388 [DEBUG][6] conversion.go 203: Converting pod to WorkloadEndpoint pod=&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:mlsmaycon-1559528520-jfccx,GenerateName:mlsmaycon-1559528520-,Namespace:mtt-staging,SelfLink:/api/v1/namespaces/mtt-staging/pods/mlsmaycon-1559528520-jfccx,UID:61b3a06b-85a6-11e9-8cb3-0a954d2dbc68,ResourceVersion:28816261,Generation:0,CreationTimestamp:2019-06-03 02:22:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: mlsmaycon,controller-uid: 61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68,job-name: mlsmaycon-1559528520,role: backend,},Annotations:map[string]string{iam.amazonaws.com/role: role,},OwnerReferences:[{batch/v1 Job mlsmaycon-1559528520 61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68 0xc421a4eb63 0xc421a4eb64}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-fbnhh {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-fbnhh,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{mlsmaycon mlsmaycon/nettools [] [/bin/bash -c echo myPOD IP $MY_POD_IP;COUNT=0;date +"%Y-%m-%d %H:%M:%S,%3N";until timeout 0.1 ping -c1 $MY_HOST_IP; do echo Ping Failed on $(date +"%Y-%m-%d %H:%M:%S,%3N");COUNT=$((COUNT + 1));done;echo Ping failed $COUNT times;date +"%Y-%m-%d %H:%M:%S,%3N"]  [] [] [{MY_POD_IP  EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.podIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}} {MY_HOST_IP  &EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:status.hostIP,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,}}] {map[] map[cpu:{{100 -3} {<nil>} 100m DecimalSI} memory:{{134217728 0} {<nil>}  BinarySI}]} [{default-token-fbnhh true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Never,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-40-57-46.us-east-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:&Affinity{NodeAffinity:&NodeAffinity{RequiredDuringSchedulingIgnoredDuringExecution:&NodeSelector{NodeSelectorTerms:[{[{nodegroup In [debug]}] []}],},PreferredDuringSchedulingIgnoredDuringExecution:[],},PodAffinity:nil,PodAntiAffinity:nil,},SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc421a4ed50} {node.kubernetes.io/unreachable Exists  NoExecute 0xc421a4ed70}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Running,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2019-06-03 02:22:06 +0000 UTC  } {Ready True 0001-01-01 00:00:00 +0000 UTC 2019-06-03 02:22:08 +0000 UTC  } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2019-06-03 02:22:08 +0000 UTC  } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2019-06-03 02:22:06 +0000 UTC  }],Message:,Reason:,HostIP:10.20.57.46,PodIP:10.20.56.24,StartTime:2019-06-03 02:22:06 +0000 UTC,ContainerStatuses:[{mlsmaycon {nil 
ContainerStateRunning{StartedAt:2019-06-03 02:22:07 +0000 UTC,} nil} {nil nil nil} true 0 mlsmaycon/nettools:latest docker-pullable://mlsmaycon/nettools@sha256:d4d9d8e242588be5c395cd23114afc33a1a43c8a1051e8e8d1063e0cf6fd0b50 docker://edde2db0e348ccf6b8ee789b10f80209eb5edb8905c19fec59831fd626d525a3}],QOSClass:Burstable,InitContainerStatuses:[],NominatedNodeName:,},}
2019-06-03 02:22:08.389 [DEBUG][6] conversion.go 183: PodIP field filled in. ip="10.20.56.24"
2019-06-03 02:22:08.389 [DEBUG][6] conversion.go 69: Using prefix to create a WorkloadEndpoint veth name prefix="eni"
2019-06-03 02:22:08.389 [DEBUG][6] watcher.go 117: Kubernetes event converted and sent to backend watcher resource="Pod"
2019-06-03 02:22:08.389 [DEBUG][6] watchercache.go 99: Reading event from results channel ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" RC=(<-chan api.WatchEvent)(0xc421b97680)
2019-06-03 02:22:08.389 [DEBUG][6] watchercache.go 361: Cache entry added, sending syncer update Key="WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)" ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints"
2019-06-03 02:22:08.389 [DEBUG][6] watchersyncer.go 234: Sending syncer updates (if any to send) NumUpdates=1
2019-06-03 02:22:08.389 [DEBUG][6] validation_filter.go 50: Validating KV pair. key=WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) value=&model.WorkloadEndpoint{State:"active", Name:"eni65250e73e92", ActiveInstanceID:"", Mac:(*net.MAC)(nil), ProfileIDs:[]string{"kns.mtt-staging", "ksa.mtt-staging.default"}, IPv4Nets:[]net.IPNet{net.IPNet{IPNet:net.IPNet{IP:net.IP{0xa, 0x28, 0x38, 0x18}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xff}}}}, IPv6Nets:[]net.IPNet(nil), IPv4NAT:[]model.IPNAT(nil), IPv6NAT:[]model.IPNAT(nil), Labels:map[string]string{"role":"backend", "projectcalico.org/namespace":"mtt-staging", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default", "app":"mlsmaycon", "controller-uid":"61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68", "job-name":"mlsmaycon-1559528520"}, IPv4Gateway:(*net.IP)(nil), IPv6Gateway:(*net.IP)(nil), Ports:[]model.EndpointPort{}, GenerateName:"mlsmaycon-1559528520-"}
2019-06-03 02:22:08.389 [DEBUG][6] cache.go 254: Got first update, peeking... update=[]api.Update{api.Update{KVPair:model.KVPair{Key:model.WorkloadEndpointKey{Hostname:"ip-10-40-57-46.us-east-2.compute.internal", OrchestratorID:"k8s", WorkloadID:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointID:"eth0"}, Value:(*model.WorkloadEndpoint)(0xc4205360f0), Revision:"28816261", TTL:0}, UpdateType:0x1}}
2019-06-03 02:22:08.389 [DEBUG][6] cache.go 243: Received updates. numUpdates=1
2019-06-03 02:22:08.389 [DEBUG][6] cache.go 268: Finished reading batch. numUpdates=1
2019-06-03 02:22:08.389 [DEBUG][6] cache.go 349: Comparing update newUpd=SerializedUpdate<Key:/calico/v1/host/ip-10-40-57-46.us-east-2.compute.internal/workload/k8s/mtt-staging%2fmlsmaycon-1559528520-jfccx/endpoint/eth0, Value:{"state":"active","name":"eni65250e73e92","active_instance_id":"","mac":null,"profile_ids":["kns.mtt-staging","ksa.mtt-staging.default"],"ipv4_nets":["10.20.56.24/32"],"ipv6_nets":null,"labels":{"app":"mlsmaycon","controller-uid":"61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68","job-name":"mlsmaycon-1559528520","projectcalico.org/namespace":"mtt-staging","projectcalico.org/orchestrator":"k8s","projectcalico.org/serviceaccount":"default","role":"backend"},"generate_name":"mlsmaycon-1559528520-"}, Revision:28816261, TTL:0s, UpdateType:1> oldUpd=<nil>
  • felix.log
2019-06-03 02:22:07.012 [INFO][37] route_table.go 442: Syncing routes: found unexpected route; ignoring due to grace period. dest=10.40.56.24/32 ifaceName="eni65250e73e92" ipVersion=0x4
2019-06-03 02:22:08.394 [DEBUG][37] keys.go 194: Path is a workload endpoint: /calico/v1/host/ip-10-40-57-46.us-east-2.compute.internal/workload/k8s/mtt-staging%2fmlsmaycon-1559528520-jfccx/endpoint/eth0
2019-06-03 02:22:08.394 [DEBUG][37] sync_client.go 320: Decoded update from Typha address="172.20.31.209:5473" connID=0x0 deserialized=api.Update{KVPair:model.KVPair{Key:model.WorkloadEndpointKey{Hostname:"ip-10-40-57-46.us-east-2.compute.internal", OrchestratorID:"k8s", WorkloadID:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointID:"eth0"}, Value:(*model.WorkloadEndpoint)(0xc420166a50), Revision:"28816261", TTL:0}, UpdateType:0x1} serialized=SerializedUpdate<Key:/calico/v1/host/ip-10-40-57-46.us-east-2.compute.internal/workload/k8s/mtt-staging%2fmlsmaycon-1559528520-jfccx/endpoint/eth0, Value:{"state":"active","name":"eni65250e73e92","active_instance_id":"","mac":null,"profile_ids":["kns.mtt-staging","ksa.mtt-staging.default"],"ipv4_nets":["10.40.56.24/32"],"ipv6_nets":null,"labels":{"app":"mlsmaycon","controller-uid":"61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68","job-name":"mlsmaycon-1559528520","projectcalico.org/namespace":"mtt-staging","projectcalico.org/orchestrator":"k8s","projectcalico.org/serviceaccount":"default","role":"backend"},"generate_name":"mlsmaycon-1559528520-"}, Revision:28816261, TTL:0s, UpdateType:1> type=""
2019-06-03 02:22:08.394 [DEBUG][37] sync_client.go 379: New message from Typha. address="172.20.31.209:5473" connID=0x0 envelope=syncproto.Envelope{Message:syncproto.MsgKVs{KVs:[]syncproto.SerializedUpdate{syncproto.SerializedUpdate{Key:"/calico/v1/host/ip-10-40-57-46.us-east-2.compute.internal/workload/k8s/mtt-staging%2fmlsmaycon-1559528520-jfccx/endpoint/eth0", Value:[]uint8{0x7b, 0x22, 0x73, 0x74, 0x61, 0x74, 0x65, 0x22, 0x3a, 0x22, 0x61, 0x63, 0x74, 0x69, 0x76, 0x65, 0x22, 0x2c, 0x22, 0x6e, 0x61, 0x6d, 0x65, 0x22, 0x3a, 0x22, 0x65, 0x6e, 0x69, 0x36, 0x35, 0x32, 0x35, 0x30, 0x65, 0x37, 0x33, 0x65, 0x39, 0x32, 0x22, 0x2c, 0x22, 0x61, 0x63, 0x74, 0x69, 0x76, 0x65, 0x5f, 0x69, 0x6e, 0x73, 0x74, 0x61, 0x6e, 0x63, 0x65, 0x5f, 0x69, 0x64, 0x22, 0x3a, 0x22, 0x22, 0x2c, 0x22, 0x6d, 0x61, 0x63, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x70, 0x72, 0x6f, 0x66, 0x69, 0x6c, 0x65, 0x5f, 0x69, 0x64, 0x73, 0x22, 0x3a, 0x5b, 0x22, 0x6b, 0x6e, 0x73, 0x2e, 0x6d, 0x74, 0x74, 0x2d, 0x73, 0x74, 0x61, 0x67, 0x69, 0x6e, 0x67, 0x22, 0x2c, 0x22, 0x6b, 0x73, 0x61, 0x2e, 0x6d, 0x74, 0x74, 0x2d, 0x73, 0x74, 0x61, 0x67, 0x69, 0x6e, 0x67, 0x2e, 0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x22, 0x5d, 0x2c, 0x22, 0x69, 0x70, 0x76, 0x34, 0x5f, 0x6e, 0x65, 0x74, 0x73, 0x22, 0x3a, 0x5b, 0x22, 0x31, 0x30, 0x2e, 0x34, 0x30, 0x2e, 0x35, 0x36, 0x2e, 0x32, 0x34, 0x2f, 0x33, 0x32, 0x22, 0x5d, 0x2c, 0x22, 0x69, 0x70, 0x76, 0x36, 0x5f, 0x6e, 0x65, 0x74, 0x73, 0x22, 0x3a, 0x6e, 0x75, 0x6c, 0x6c, 0x2c, 0x22, 0x6c, 0x61, 0x62, 0x65, 0x6c, 0x73, 0x22, 0x3a, 0x7b, 0x22, 0x61, 0x70, 0x70, 0x22, 0x3a, 0x22, 0x6d, 0x6c, 0x73, 0x6d, 0x61, 0x79, 0x63, 0x6f, 0x6e, 0x22, 0x2c, 0x22, 0x63, 0x6f, 0x6e, 0x74, 0x72, 0x6f, 0x6c, 0x6c, 0x65, 0x72, 0x2d, 0x75, 0x69, 0x64, 0x22, 0x3a, 0x22, 0x36, 0x31, 0x62, 0x32, 0x64, 0x31, 0x63, 0x37, 0x2d, 0x38, 0x35, 0x61, 0x36, 0x2d, 0x31, 0x31, 0x65, 0x39, 0x2d, 0x38, 0x63, 0x62, 0x33, 0x2d, 0x30, 0x61, 0x39, 0x35, 0x34, 0x64, 0x32, 0x64, 0x62, 0x63, 0x36, 0x38, 0x22, 0x2c, 0x22, 0x6a, 0x6f, 0x62, 0x2d, 0x6e, 0x61, 0x6d, 0x65, 0x22, 0x3a, 0x22, 0x6d, 0x6c, 0x73, 0x6d, 0x61, 0x79, 0x63, 0x6f, 0x6e, 0x2d, 0x31, 0x35, 0x35, 0x39, 0x35, 0x32, 0x38, 0x35, 0x32, 0x30, 0x22, 0x2c, 0x22, 0x70, 0x72, 0x6f, 0x6a, 0x65, 0x63, 0x74, 0x63, 0x61, 0x6c, 0x69, 0x63, 0x6f, 0x2e, 0x6f, 0x72, 0x67, 0x2f, 0x6e, 0x61, 0x6d, 0x65, 0x73, 0x70, 0x61, 0x63, 0x65, 0x22, 0x3a, 0x22, 0x6d, 0x74, 0x74, 0x2d, 0x73, 0x74, 0x61, 0x67, 0x69, 0x6e, 0x67, 0x22, 0x2c, 0x22, 0x70, 0x72, 0x6f, 0x6a, 0x65, 0x63, 0x74, 0x63, 0x61, 0x6c, 0x69, 0x63, 0x6f, 0x2e, 0x6f, 0x72, 0x67, 0x2f, 0x6f, 0x72, 0x63, 0x68, 0x65, 0x73, 0x74, 0x72, 0x61, 0x74, 0x6f, 0x72, 0x22, 0x3a, 0x22, 0x6b, 0x38, 0x73, 0x22, 0x2c, 0x22, 0x70, 0x72, 0x6f, 0x6a, 0x65, 0x63, 0x74, 0x63, 0x61, 0x6c, 0x69, 0x63, 0x6f, 0x2e, 0x6f, 0x72, 0x67, 0x2f, 0x73, 0x65, 0x72, 0x76, 0x69, 0x63, 0x65, 0x61, 0x63, 0x63, 0x6f, 0x75, 0x6e, 0x74, 0x22, 0x3a, 0x22, 0x64, 0x65, 0x66, 0x61, 0x75, 0x6c, 0x74, 0x22, 0x2c, 0x22, 0x72, 0x6f, 0x6c, 0x65, 0x22, 0x3a, 0x22, 0x62, 0x61, 0x63, 0x6b, 0x65, 0x6e, 0x64, 0x22, 0x7d, 0x2c, 0x22, 0x67, 0x65, 0x6e, 0x65, 0x72, 0x61, 0x74, 0x65, 0x5f, 0x6e, 0x61, 0x6d, 0x65, 0x22, 0x3a, 0x22, 0x6d, 0x6c, 0x73, 0x6d, 0x61, 0x79, 0x63, 0x6f, 0x6e, 0x2d, 0x31, 0x35, 0x35, 0x39, 0x35, 0x32, 0x38, 0x35, 0x32, 0x30, 0x2d, 0x22, 0x7d}, Revision:"28816261", TTL:0, UpdateType:0x1}}}} type=""
2019-06-03 02:22:08.394 [DEBUG][37] validation_filter.go 50: Validating KV pair. key=WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) value=&model.WorkloadEndpoint{State:"active", Name:"eni65250e73e92", ActiveInstanceID:"", Mac:(*net.MAC)(nil), ProfileIDs:[]string{"kns.mtt-staging", "ksa.mtt-staging.default"}, IPv4Nets:[]net.IPNet{net.IPNet{IPNet:net.IPNet{IP:net.IP{0xa, 0x28, 0x38, 0x18}, Mask:net.IPMask{0xff, 0xff, 0xff, 0xff}}}}, IPv6Nets:[]net.IPNet(nil), IPv4NAT:[]model.IPNAT(nil), IPv6NAT:[]model.IPNAT(nil), Labels:map[string]string{"projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default", "role":"backend", "app":"mlsmaycon", "controller-uid":"61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68", "job-name":"mlsmaycon-1559528520", "projectcalico.org/namespace":"mtt-staging"}, IPv4Gateway:(*net.IP)(nil), IPv6Gateway:(*net.IP)(nil), Ports:[]model.EndpointPort(nil), GenerateName:"mlsmaycon-1559528520-"}
2019-06-03 02:22:08.395 [DEBUG][37] active_rules_calculator.go 125: Updating ARC with endpoint WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.395 [DEBUG][37] active_rules_calculator.go 253: Endpoint model.WorkloadEndpointKey{Hostname:"ip-10-40-57-46.us-east-2.compute.internal", OrchestratorID:"k8s", WorkloadID:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointID:"eth0"} now has profile IDs: [kns.mtt-staging ksa.mtt-staging.default]
2019-06-03 02:22:08.395 [DEBUG][37] dispatcher.go 93: Dispatching {{WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) 0xc420166a50 28816261 0s} 1}
2019-06-03 02:22:08.395 [DEBUG][37] dispatcher.go 93: Dispatching {{WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) 0xc420166a50 28816261 0s} 1}
2019-06-03 02:22:08.395 [DEBUG][37] label_inheritance_index.go 143: Updating InheritIndex with endpoint WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.395 [DEBUG][37] label_inheritance_index.go 214: Inherit index updating labels for WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.395 [INFO][37] calc_graph.go 361: Local endpoint updated id=WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] label_inheritance_index.go 287: Updating parents id=WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) newParents=[]*labelindex.parentData{(*labelindex.parentData)(0xc420743980), (*labelindex.parentData)(0xc420794480)} oldParents=[]*labelindex.parentData(nil)
2019-06-03 02:22:08.396 [DEBUG][37] label_inheritance_index.go 362: Flushing model.WorkloadEndpointKey{Hostname:"ip-10-40-57-46.us-east-2.compute.internal", OrchestratorID:"k8s", WorkloadID:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointID:"eth0"}
2019-06-03 02:22:08.396 [DEBUG][37] label_inheritance_index.go 377: Flushing update of item WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] label_inheritance_index.go 393: Scanning all (3) selectors against labels WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] named_port_index.go 209: Updating NamedPortIndex with endpoint WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] named_port_index.go 448: Updating endpoint/network set CIDRs=[]ip.CIDR{ip.V4CIDR{addr:ip.V4Addr{0xa, 0x28, 0x38, 0x18}, prefix:0x20}} endpointOrSetID=WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) newLabels=map[string]string{"app":"mlsmaycon", "controller-uid":"61b2d1c7-85a6-11e9-8cb3-0a954d2dbc68", "job-name":"mlsmaycon-1559528520", "projectcalico.org/namespace":"mtt-staging", "projectcalico.org/orchestrator":"k8s", "projectcalico.org/serviceaccount":"default", "role":"backend"} parentIDs=[]string{"kns.mtt-staging", "ksa.mtt-staging.default"} ports=[]model.EndpointPort(nil)
2019-06-03 02:22:08.396 [DEBUG][37] policy_resolver.go 159: Sending tier update for endpoint WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] policy_resolver.go 175: Checking if policy Policy(name=srt-dev/knp.default.apm-server-access) matches WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] policy_resolver.go 175: Checking if policy Policy(name=srt-dev/knp.default.backend-access) matches WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] policy_resolver.go 175: Checking if policy Policy(name=srt-system/knp.default.kibana-access) matches WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0)
2019-06-03 02:22:08.396 [DEBUG][37] policy_resolver.go 187: Endpoint tier update: WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) -> []
2019-06-03 02:22:08.397 [DEBUG][37] stats_collector.go 135: Host-specific key added host="ip-10-40-57-46.us-east-2.compute.internal" key=WorkloadEndpoint(node=ip-10-40-57-46.us-east-2.compute.internal, orchestrator=k8s, workload=mtt-staging/mlsmaycon-1559528520-jfccx, name=eth0) newCount=3
2019-06-03 02:22:08.400 [DEBUG][37] endpoint_mgr.go 212: Received message msg=id:<orchestrator_id:"k8s" workload_id:"mtt-staging/mlsmaycon-1559528520-jfccx" endpoint_id:"eth0" > endpoint:<state:"active" name:"eni65250e73e92" profile_ids:"kns.mtt-staging" profile_ids:"ksa.mtt-staging.default" ipv4_nets:"10.40.56.24/32" >
2019-06-03 02:22:08.400 [DEBUG][37] endpoint_mgr.go 470: Endpoint up, adding routes id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}
2019-06-03 02:22:08.400 [INFO][37] endpoint_mgr.go 408: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}
2019-06-03 02:22:08.400 [INFO][37] endpoint_mgr.go 435: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}
2019-06-03 02:22:08.400 [INFO][37] int_dataplane.go 630: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"mtt-staging/mlsmaycon-1559528520-jfccx" endpoint_id:"eth0" > endpoint:<state:"active" name:"eni65250e73e92" profile_ids:"kns.mtt-staging" profile_ids:"ksa.mtt-staging.default" ipv4_nets:"10.40.56.24/32" >
2019-06-03 02:22:08.401 [DEBUG][37] endpoint_mgr.go 305: Reporting endpoint status. dirtyEndpoints=set.mapSet{proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}:set.empty{}}
2019-06-03 02:22:08.401 [DEBUG][37] endpoint_mgr.go 322: Re-evaluating workload endpoint status workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}
2019-06-03 02:22:08.401 [DEBUG][37] route_table.go 437: Syncing routes: Found expected route. dest=10.40.56.24/32 ifaceName="eni65250e73e92" ipVersion=0x4
2019-06-03 02:22:08.401 [INFO][37] endpoint_mgr.go 350: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}
2019-06-03 02:22:08.401 [INFO][37] status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"}
2019-06-03 02:22:08.467 [DEBUG][37] daemon.go 821: New message from dataplane payload=id:<orchestrator_id:"k8s" workload_id:"mtt-staging/mlsmaycon-1559528520-jfccx" endpoint_id:"eth0" > status:<status:"up" >
2019-06-03 02:22:08.467 [INFO][37] status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"} ipVersion=0x4 status="up"
2019-06-03 02:22:08.467 [INFO][37] status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"mtt-staging/mlsmaycon-1559528520-jfccx", EndpointId:"eth0"} status="up"
2019-06-03 02:22:09.413 [DEBUG][37] sync_client.go 379: New message from Typha. address="172.20.31.209:5473" connID=0x0 envelope=syncproto.Envelope{Message:syncproto.MsgKVs{KVs:[]syncproto.SerializedUpdate{syncproto.SerializedUpdate{Key:"/calico/v1/host/ip-10-40-57-46.us-east-2.compute.internal/workload/k8s/mtt-staging%2fmlsmaycon-1559528520-jfccx/endpoint/eth0", Value:[]uint8(nil), Revision:"", TTL:0, UpdateType:0x3}}}} type=""

It is not clear to me whether the issue is caused by Typha being unable to verify that the pod got an IP address (perhaps due to a check interval), or by ipamd not making this information available in the datastore fast enough for Calico to program its rules.

@ikatson
Contributor

ikatson commented Jul 29, 2019

I have the same problem. It feels like it's caused by the asynchronous nature of Calico's policy programming.

Here's my theory: Calico's policies kick in after all the CNI plugins finish running. That races with the pod startup, so pods that start network interactions immediately almost always see a delay, because Calico has not injected all the rules yet.

The best scenario would be to somehow make the AWS VPC CNI plugin wait for Calico's policies, but unfortunately it looks like that won't work. I wrote a PoC CNI plugin today that just sleeps, to check whether Calico's Felix "sees" the pod's IP and starts injecting the rules before the CNI finishes execution. It looks like it doesn't.

The way I worked around it for now is with an initContainer that loops on a DNS query for kubernetes.default. As soon as the query succeeds, the init container exits. This guarantees (sort of) that all the "normal" containers have working networking by the time they start up. It made things much more stable; see the sketch below.
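
A minimal sketch of that initContainer, assuming an image that ships nslookup (the container name and busybox image are illustrative, not from the original comment):

spec:
  initContainers:
  - name: wait-for-network      # hypothetical name
    image: busybox:1.31         # assumed image; anything with nslookup works
    command:
    - sh
    - -c
    # Loop until in-cluster DNS answers; by then Calico has usually
    # finished programming the pod's rules.
    - until nslookup kubernetes.default.svc.cluster.local; do sleep 0.1; done
  containers:
  - name: app
    image: mlsmaycon/nettools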

@mlsmaycon
Contributor Author

It could be a problem with the asynchronous way Calico applies its policies, but strangely enough, running a similar setup on GKE didn't show this issue. Their network plugin certainly works in a different way.

I will test version 1.5.2, as it has an IPAM improvement:

Improvement - Reduce the wait time when checking for pods without IPs (#552, @mogren)

@jaypipes
Contributor

Hi @mlsmaycon, did you ever get a chance to test later versions of the plugin and see if this issue went away?

@mogren
Contributor

mogren commented Jan 29, 2020

Closing because of no updates.

@mogren mogren closed this as completed Jan 29, 2020
@gmatev

gmatev commented Jan 30, 2020

@mogren Fairly certain we are running into this on:

EKS Platform version: eks.7
Kubernetes version: 1.14
Calico node version: 3.8.1

Calico was installed following the official AWS instructions, pointing to https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.5/config/v1.5/calico.yaml

Very consistently, we see a connectivity issue when the first pod launches on a newly scaled-up node.

@mogren mogren reopened this Jan 30, 2020
@mogren
Contributor

mogren commented Jan 30, 2020

@gmatev Thanks for the update, still something we need to reproduce then.

@NikitaGandhi

Facing a similar issue as mentioned above by @gmatev: consistently seeing connectivity issues when a pod launches on a newly scaled-up node.

@joseapena

joseapena commented May 31, 2020

I opened a similar issue with Calico. They replied that this is caused by kubelet being very slow to update the IP addresses in the Kubernetes API. The Calico CNI driver has implemented a workaround that updates the IP address directly instead of relying on kubelet. Would it be possible for the VPC CNI driver to implement the same workaround?

Here is the Calico issue: projectcalico/calico#3530

@mogren
Contributor

mogren commented Aug 10, 2020

The Calico CNI driver has implemented a workaround consisting of updating the IP address directly instead of relying on Kubelet. Would it be possible for the VPC CNI driver to implement the same workaround?

I am sure it is possible, and it might be worth taking a look at how much work this would be. We just have to be sure that it can be done without any impact on clusters not using Calico.

@jayanthvn
Contributor

jayanthvn commented Nov 4, 2020

Hi,

As @caseydavenport mentioned in projectcalico/calico#3530, the AWS CNI is not aware of when Calico has finished setting up policy unless there is an endpoint it can query for the init status.

Thank you!
