
flake: timeout reached waiting for pod IDs in ipcache of Cilium pod #361

Closed
michi-covalent opened this issue Jun 24, 2021 · 9 comments
Labels
area/CI Continuous Integration testing issue or flake · ci/flake Issues tracking failing (integration or unit) tests.


michi-covalent commented Jun 24, 2021

flake instances

symptoms

connectivity check fails with an error like this:

Error: Connectivity test failed: timeout reached waiting for pod IDs in ipcache of Cilium pod kube-system/cilium-wbrzq

other notes

a related upstream issue: cilium/cilium#16542

@michi-covalent michi-covalent added the area/CI Continuous Integration testing issue or flake label Jun 24, 2021
tklauser commented Jul 5, 2021

tklauser commented Jul 7, 2021

Occurred again on master in the multicluster workflow: https://github.com/cilium/cilium-cli/runs/3005090578?check_suite_focus=true

I'll open a PR enabling debugging options for connectivity tests to maybe get some more insight into what part of (*ConnectivityTest).waitForIPCache is failing.

tklauser added a commit that referenced this issue Jul 7, 2021
This should give us some additional information to debug flakes, e.g.
for #361

Signed-off-by: Tobias Klauser <[email protected]>
tklauser added a commit that referenced this issue Jul 8, 2021
This should give us some additional information to debug flakes, e.g.
for #361

Signed-off-by: Tobias Klauser <[email protected]>
jrajahalme added a commit that referenced this issue Jul 19, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
jrajahalme added a commit that referenced this issue Jul 19, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
@jrajahalme

Hit in #442: https://github.com/cilium/cilium-cli/actions/runs/1045185846

In this case the failure is due to this check timing out:

2021-07-19T12:44:51.662536416Z ⌛ [gke_***_us-west2-a_cilium-cilium-cli-1045185846-mesh-1] Waiting for Cilium pod kube-system/cilium-pkl65 to have all the pod IPs in eBPF ipcache...
2021-07-19T12:44:52.077224782Z 🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-fnfjr in ipcache, retrying...

The pod cilium-test/client-7b7bf54b85-fnfjr has the IP 10.20.1.4:

NAMESPACE     NAME                                                             READY   STATUS    RESTARTS   AGE   IP            NODE                                                  NOMINATED NODE   READINESS GATES
cilium-test   client-7b7bf54b85-fnfjr                                          1/1     Running   0          15m   10.20.1.4     gke-cilium-cilium-cli-10-default-pool-bb3439e4-wgb1   <none>           <none>

Looking at the kube-system/cilium-pkl65 (cluster2) ipcache in test artifacts, it has the pod IP:

IP PREFIX/ADDRESS   IDENTITY
10.20.1.4/32        79723 0 10.168.0.77    

Since this is a flake, it is possible that the ipcache entry was populated right after the last check. To see if something else is going on, #444 adds the error string to the appropriate debug messages.

jrajahalme added a commit that referenced this issue Jul 19, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
tklauser pushed a commit that referenced this issue Aug 10, 2021
Add error to the debug output for ipcache check failures. This should
help solve #361.

Signed-off-by: Jarno Rajahalme <[email protected]>
bmcustodio commented Aug 30, 2021

I've spotted this in a GKE multi-cluster test:

⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Cilium pod kube-system/cilium-fdz6b to have all the pod IPs in eBPF ipcache...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-qwnsc in ipcache, retrying...
(...)
Connectivity test failed: timeout reached waiting for pod IDs in ipcache of Cilium pod kube-system/cilium-fdz6b

Deleting the reported pod and restarting the Cilium agents did not seem to help:

cilium connectivity test --context gke_cilium-dev_europe-west6-b_bruno-gke-1 --multi-cluster gke_cilium-dev_europe-west6-b_bruno-gke-2 --debug
ℹ️  Monitor aggregation detected, will skip some flow validation steps
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Creating namespace for connectivity check...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Creating namespace for connectivity check...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying echo-same-node service...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying echo-other-node service...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying same-node deployment...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying client deployment...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Deploying client2 deployment...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Deploying echo-other-node service...
✨ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Deploying other-node deployment...
🐛 Validating Deployments...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for deployments [client client2 echo-same-node] to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Waiting for deployments [echo-other-node] to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for CiliumEndpoint for pod cilium-test/client-7b7bf54b85-42db6 to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for CiliumEndpoint for pod cilium-test/client2-666976c95b-775pw to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-7967996674-prfgp to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-2] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-697d5d69b7-4dtlm to appear...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [gke_cilium-dev_europe-west6-b_bruno-gke-1] Waiting for Cilium pod kube-system/cilium-pq7hd to have all the pod IPs in eBPF ipcache...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client2-666976c95b-775pw in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...
🐛 Couldn't find client Pod cilium-test/client-7b7bf54b85-42db6 in ipcache, retrying...

Error: Connectivity test failed: timeout reached waiting for pod IDs in ipcache of Cilium pod kube-system/cilium-pq7hd

I am attaching a sysdump of each cluster:

@nbusseneau

I think this is a good suggestion from @tklauser:

Given that the connectivity tests worked without the ipcache check in place, maybe a short term solution would be to delete/comment out that check and then assign a SIG to investigate how to properly implement the ipcache check?

michi-covalent added a commit that referenced this issue Aug 30, 2021
Temporarily disabling the ipcache check while we investigate what's
causing the flake.

Ref: #361

Signed-off-by: Michi Mutsuzaki <[email protected]>
michi-covalent added a commit that referenced this issue Aug 30, 2021
Add `--skip-ip-cache-check` flag with the default set to true. This is
meant to be a temporary flag while we investigate what's causing the flake.

Ref: #361

Signed-off-by: Michi Mutsuzaki <[email protected]>
michi-covalent added a commit that referenced this issue Aug 31, 2021
Add `--skip-ip-cache-check` flag with the default set to true. This is
meant to be a temporary flag while we investigate what's causing the flake.

Ref: #361

Signed-off-by: Michi Mutsuzaki <[email protected]>
tklauser added a commit that referenced this issue Sep 1, 2021
Follow-up for #503 to address
#503 (comment)

Also add a comment so we don't forget to re-enable the check again once
issue #361 is resolved.

Signed-off-by: Tobias Klauser <[email protected]>
tklauser added a commit that referenced this issue Sep 1, 2021
Follow-up for #503 to address
#503 (comment)

Also add a comment so we don't forget to re-enable the check again once
issue #361 is resolved.

Signed-off-by: Tobias Klauser <[email protected]>
nbusseneau commented Sep 1, 2021

Following community meeting: closing since we disabled the ipcache check in the CLI.
If we see flakes around ipcache issues, we can continue investigating and revisit implementing the ipcache check.

@nbusseneau nbusseneau added the ci/flake Issues tracking failing (integration or unit) tests. label Sep 8, 2021
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
This should give us some additional information to debug flakes, e.g.
for cilium#361

Signed-off-by: Tobias Klauser <[email protected]>
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
Add error to the debug output for ipcache check failures. This should
help solve cilium#361.

Signed-off-by: Jarno Rajahalme <[email protected]>
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
Add `--skip-ip-cache-check` flag with the default set to true. This is
meant to be a temporary flag while we investigate what's causing the flake.

Ref: cilium#361

Signed-off-by: Michi Mutsuzaki <[email protected]>
aditighag pushed a commit to aditighag/cilium-cli that referenced this issue Apr 21, 2023
Follow-up for cilium#503 to address
cilium#503 (comment)

Also add a comment so we don't forget to re-enable the check again once
issue cilium#361 is resolved.

Signed-off-by: Tobias Klauser <[email protected]>
6 participants