
flake: timeout reached waiting for pod IDs in ipcache of Cilium pod #361

Closed
michi-covalent opened this issue Jun 24, 2021 · 9 comments
Labels
area/CI Continuous Integration testing issue or flake · ci/flake Issues tracking failing (integration or unit) tests.


michi-covalent commented Jun 24, 2021

flake instances

symptoms

connectivity check fails with an error like this:

Error: Connectivity test failed: timeout reached waiting for pod IDs in ipcache of Cilium pod kube-system/cilium-wbrzq

other notes

a related upstream issue: cilium/cilium#16542

@michi-covalent michi-covalent added the area/CI Continuous Integration testing issue or flake label Jun 24, 2021
tklauser commented Jul 5, 2021

tklauser commented Jul 7, 2021

Occurred again on master in the multicluster workflow: https://github.com/cilium/cilium-cli/runs/3005090578?check_suite_focus=true

I'll open a PR enabling debugging options for connectivity tests to maybe get some more insight into what part of (*ConnectivityTest).waitForIPCache is failing.

tklauser added a commit that referenced this issue Jul 7, 2021
This should give us some additional information to debug flakes, e.g.
for #361

Signed-off-by: Tobias Klauser <[email protected]>
tklauser added a commit that referenced this issue Jul 8, 2021
This should give us some additional information to debug flakes, e.g.
for #361

Signed-off-by: Tobias Klauser <[email protected]>
jrajahalme added a commit that referenced this issue Jul 19, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
jrajahalme added a commit that referenced this issue Jul 19, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
@jrajahalme

Hit in #442: https://github.com/cilium/cilium-cli/actions/runs/1045185846

In this case the failure is due to this check timing out:

2021-07-19T12:44:51.662536416Z ⌛ [gke_***_us-west2-a_cilium-cilium-cli-1045185846-mesh-1] Waiting for Cilium pod kube-system/cilium-pkl65 to have all the pod IPs in eBPF ipcache...
2021-07-19T12:44:52.077224782Z 🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-fnfjr in ipcache, retrying...

The pod cilium-test/client-7b7bf54b85-fnfjr has the IP 10.20.1.4:

NAMESPACE     NAME                                                             READY   STATUS    RESTARTS   AGE   IP            NODE                                                  NOMINATED NODE   READINESS GATES
cilium-test   client-7b7bf54b85-fnfjr                                          1/1     Running   0          15m   10.20.1.4     gke-cilium-cilium-cli-10-default-pool-bb3439e4-wgb1   <none>           <none>

Looking at the kube-system/cilium-pkl65 (cluster2) ipcache in test artifacts, it has the pod IP:

IP PREFIX/ADDRESS   IDENTITY
10.20.1.4/32        79723 0 10.168.0.77    

Since this is a flake, it is possible that the ipcache entry was populated right after the last check. To see if something else is going on, #444 adds the error string to the appropriate debug messages.

jrajahalme added a commit that referenced this issue Jul 19, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
tklauser pushed a commit that referenced this issue Aug 10, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
bmcustodio commented Aug 30, 2021

I've spotted this in a GKE multi-cluster test:

⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Cilium pod kube-system/cilium-fdz6b to have all the pod IPs in eBPF ipcache...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-qwnsc in ipcache, retrying...
(...)
Connectivity test failed: timeout reached waiting for pod IDs in ipcache of Cilium pod kube-system/cilium-fdz6b

Deleting the reported pod and restarting the Cilium agents did not seem to help:

cilium connectivity test --context gke_cilium-dev_europe-west6-b_bruno-gke-1 --multi-cluster gke_cilium-dev_europe-west6-b_bruno-gke-2 --debug
ℹ️  Monitor aggregation detected, will skip some flow validation steps
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Creating namespace for connectivity check...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Creating namespace for connectivity check...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying echo-same-node service...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying echo-other-node service...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying same-node deployment...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying client deployment...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying client2 deployment...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Deploying echo-other-node service...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Deploying other-node deployment...
🐛 Validating Deployments...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Waiting for deployments [echo-other-node] to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-42db6 to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-775pw to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-prfgp to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-4dtlm to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Cilium pod kube-system/cilium-pq7hd to have all the pod IPs in eBPF ipcache...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client2-666976c95b-775pw in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...

Error: Connectivity test failed: timeout reached waiting for pod IDs in ipcache of Cilium pod kube-system/cilium-pq7hd

I am attaching a sysdump of each cluster:

@nbusseneau

I think this is a good suggestion from @tklauser:

Given that the connectivity tests worked without the ipcache check in place, maybe a short term solution would be to delete/comment out that check and then assign a SIG to investigate how to properly implement the ipcache check?

michi-covalent added a commit that referenced this issue Aug 30, 2021
Temporarily disabling the ipcache check while we investigate what's
causing the flake.

Ref: #361

Signed-off-by: Michi Mutsuzaki <[email protected]>
michi-covalent added a commit that referenced this issue Aug 30, 2021
Add `--skip-ip-cache-check` flag with the default set to true. This is
meant to be a temporary flag while we investigate what's causing the flake.

Ref: #361

Signed-off-by: Michi Mutsuzaki <[email protected]>
michi-covalent added a commit that referenced this issue Aug 31, 2021
Add `--skip-ip-cache-check` flag with the default set to true. This is
meant to be a temporary flag while we investigate what's causing the flake.

Ref: #361

Signed-off-by: Michi Mutsuzaki <[email protected]>
tklauser added a commit that referenced this issue Sep 1, 2021
Follow-up for #503 to address
#503 (comment)

Also add a comment so we don't forget to re-enable the check again once
issue #361 is resolved.

Signed-off-by: Tobias Klauser <[email protected]>
tklauser added a commit that referenced this issue Sep 1, 2021
Follow-up for #503 to address
#503 (comment)

Also add a comment so we don't forget to re-enable the check again once
issue #361 is resolved.

Signed-off-by: Tobias Klauser <[email protected]>
nbusseneau commented Sep 1, 2021

Following community meeting: closing since we disabled the ipcache check in the CLI.
If we see flakes around ipcache issues, we can continue investigating and revisit implementing the ipcache check.

@nbusseneau nbusseneau added the ci/flake Issues tracking failing (integration or unit) tests. label Sep 8, 2021
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
This should give us some additional information to debug flakes, e.g.
for cilium#361

Signed-off-by: Tobias Klauser <[email protected]>
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
Add error to the debug output for ipcache check failures. This should
help solve cilium#361.

Signed-off-by: Jarno Rajahalme <[email protected]>
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
Add `--skip-ip-cache-check` flag with the default set to true. This is
meant to be a temporary flag while we investigate what's causing the flake.

Ref: cilium#361

Signed-off-by: Michi Mutsuzaki <[email protected]>
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
Follow-up for cilium#503 to address
cilium#503 (comment)

Also add a comment so we don't forget to re-enable the check again once
issue cilium#361 is resolved.

Signed-off-by: Tobias Klauser <[email protected]>
6 participants