CRI-O stuck in "Could not restore" when restarted under high node load #8673

lance5890 · 2024-10-15T05:54:59Z

What happened?

On one node, when the system load is high and we restart CRI-O, it gets stuck in the restore process for a long time. The logs show the following:

Oct 15 13:32:42 master-lharm-2 crio[3480]: time="2024-10-15 13:32:42.903497855+08:00" level=warning msg="Could not restore sandbox e456015ab35e79331beb0071d58ff312747bee64056d7abd4f746358a7401712: failed to Statfs \"/var/run/netns/f6300a46-4b30-4a3d-8c06-5d2bb5c67905\": no such file or directory"
Oct 15 13:32:43 master-lharm-2 crio[3480]: time="2024-10-15 13:32:43.339161797+08:00" level=warning msg="Deleting all containers under sandbox e456015ab35e79331beb0071d58ff312747bee64056d7abd4f746358a7401712 since it could not be restored"
Oct 15 13:33:15 master-lharm-2 crio[3480]: time="2024-10-15 13:33:15.733919215+08:00" level=warning msg="Could not restore sandbox e85249f52bd82fab8b187e5e6ff0e7f9f5e9244a12523baa971be8ba5d36df00: failed to Statfs \"/var/run/netns/202e8db9-b93d-4576-9e39-5b6589aef158\": no such file or directory"
Oct 15 13:33:16 master-lharm-2 crio[3480]: time="2024-10-15 13:33:16.327566397+08:00" level=warning msg="Deleting all containers under sandbox e85249f52bd82fab8b187e5e6ff0e7f9f5e9244a12523baa971be8ba5d36df00 since it could not be restored"
Oct 15 13:33:53 master-lharm-2 crio[3480]: time="2024-10-15 13:33:53.088736014+08:00" level=warning msg="Could not restore sandbox ea5dffa43cc58888f331c4542f7fa02fd87ce6e8722c018701f34adb3bbf2e4c: failed to Statfs \"/var/run/netns/8cee8b58-faca-4159-bae6-44119e7bfb7c\": no such file or directory"
Oct 15 13:33:53 master-lharm-2 crio[3480]: time="2024-10-15 13:33:53.471422710+08:00" level=warning msg="Deleting all containers under sandbox ea5dffa43cc58888f331c4542f7fa02fd87ce6e8722c018701f34adb3bbf2e4c since it could not be restored"
Oct 15 13:34:16 master-lharm-2 crio[3480]: time="2024-10-15 13:34:16.940943535+08:00" level=warning msg="Could not restore sandbox 92d39c21bd77d349068f1f6f8379267c40e77e4ebd981f5828c1ddbdf2662162: failed to Statfs \"/var/run/netns/fa2aacd2-986b-458c-96ef-2a4e231a00d2\": no such file or directory"
Oct 15 13:34:17 master-lharm-2 crio[3480]: time="2024-10-15 13:34:17.518848482+08:00" level=warning msg="Deleting all containers under sandbox 92d39c21bd77d349068f1f6f8379267c40e77e4ebd981f5828c1ddbdf2662162 since it could not be restored"
Oct 15 13:34:47 master-lharm-2 crio[3480]: time="2024-10-15 13:34:47.982638318+08:00" level=warning msg="Could not restore sandbox b23e853d33f32b930dc718be396ab2a632647979a76dabed6322a1c59fe2104d: failed to Statfs \"/var/run/netns/af5f5567-28a0-43fd-9072-32b7db1697d2\": no such file or directory"
Oct 15 13:34:48 master-lharm-2 crio[3480]: time="2024-10-15 13:34:48.174033605+08:00" level=warning msg="Deleting all containers under sandbox b23e853d33f32b930dc718be396ab2a632647979a76dabed6322a1c59fe2104d since it could not be restored"
Oct 15 13:35:14 master-lharm-2 crio[3480]: time="2024-10-15 13:35:14.167490731+08:00" level=warning msg="Could not restore sandbox a71024afae081939f8ddd2f386240de5fb1827bfab1c20319fbb72fdeeef398d: failed to Statfs \"/var/run/netns/787418dc-88a8-4b4e-a319-166232202cf6\": no such file or directory"
Oct 15 13:35:14 master-lharm-2 crio[3480]: time="2024-10-15 13:35:14.606607516+08:00" level=warning msg="Deleting all containers under sandbox a71024afae081939f8ddd2f386240de5fb1827bfab1c20319fbb72fdeeef398d since it could not be restored"
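
To quantify how long each failed restore takes, the gaps between successive "Could not restore sandbox" warnings can be measured. The following is a minimal sketch, not part of CRI-O: the function name `restore_failure_gaps` is hypothetical, it assumes the journal line format shown in the excerpt above, and it truncates timestamps to whole seconds.

```python
import re
from datetime import datetime

# Matches the time="..." field and the 64-hex-character sandbox ID in the
# "Could not restore sandbox" warnings, assuming the format in the excerpt.
PATTERN = re.compile(
    r'time="(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+[+-]\d{2}:\d{2}"'
    r'.*Could not restore sandbox ([0-9a-f]{64})'
)


def restore_failure_gaps(lines):
    """Return (sandbox_id, seconds_since_previous_failure) pairs.

    Hypothetical helper: the first failure has no predecessor, so it is
    omitted. Sub-second precision is dropped on purpose.
    """
    events = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((match.group(2), timestamp))
    return [
        (sandbox_id, (timestamp - previous).total_seconds())
        for (_, previous), (sandbox_id, timestamp) in zip(events, events[1:])
    ]
```

Applied to the excerpt above, the gaps between consecutive failures are 33, 38, 23, 31, and 27 seconds, so each unrestorable sandbox costs roughly half a minute on this node.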

ls /var/lib/containers/storage/overlay-containers | wc -l
5785

What did you expect to happen?

Even when the node is under high system load, CRI-O should not get stuck in the restore process for a long time.

How can we reproduce it (as minimally and precisely as possible)?

Under high system load, create many pods, then restart CRI-O.

Anything else we need to know?

No response

CRI-O and Kubernetes version

$ crio --version
1.25.8

$ kubectl version --output=json
# paste output here

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
5.15.131-3

Additional environment details (AWS, VirtualBox, physical, etc.)

physical

lance5890 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 15, 2024
