Make iptables initialization error non fatal #1497
Conversation
Thanks for your PR. The following commands are available:

/test-all
LGTM. I think that blocking in CNI Add is the right thing to do but hopefully it doesn't create additional issues.
nit: in the commit message, s/leading to more xtables lock competitor/leading to more xtables lock contention/
@@ -44,6 +45,10 @@ import (
	"github.com/vmware-tanzu/antrea/pkg/ovs/ovsconfig"
)

const (
	networkReadyTimeout = 30 * time.Second
what's the default kubelet / container runtime timeout for CNI Add? could you add the info as a comment?
runtime request timeout is 2 minutes: https://github.com/kubernetes/kubernetes/blob/fc87c5927ca00eee2b437dc77b64f993728da87b/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L451.
will add this information.
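For illustration, the relationship between the two timeouts could be sketched like this (`waitNetworkReady` is a hypothetical helper, not Antrea's actual code; only the 30-second constant and the 2-minute runtime request timeout come from the discussion above):

```go
package main

import (
	"fmt"
	"time"
)

// networkReadyTimeout bounds how long a CNI Add request waits for iptables
// initialization. It must stay well below the container runtime's request
// timeout (RuntimeRequestTimeout, 2 minutes by default) so the agent can
// return a clear error instead of letting the runtime cancel the call.
const networkReadyTimeout = 30 * time.Second

// waitNetworkReady blocks until networkReady is closed or the timeout expires.
func waitNetworkReady(networkReady <-chan struct{}) error {
	select {
	case <-networkReady:
		return nil
	case <-time.After(networkReadyTimeout):
		return fmt.Errorf("network is not ready after %v", networkReadyTimeout)
	}
}

func main() {
	ready := make(chan struct{})
	close(ready) // simulate iptables initialization having finished
	fmt.Println(waitNetworkReady(ready))
}
```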
pkg/agent/route/route_linux.go (Outdated)

func (c *Client) initIPTablesOnce(done func()) {
	defer done()
	for {
		time.Sleep(2 * time.Second)
curious about why you put a sleep before the first try?
this was a mistake, thanks for catching it. Now I understand why Initialize took more than 2 seconds in the integration test even though I didn't hold the lock..
/test-all
/test-all
Codecov Report
@@ Coverage Diff @@
## master #1497 +/- ##
==========================================
+ Coverage 68.16% 68.62% +0.46%
==========================================
Files 165 165
Lines 13107 13163 +56
==========================================
+ Hits 8934 9033 +99
+ Misses 3237 3193 -44
- Partials 936 937 +1
LGTM
LGTM, thanks for fixing this so quickly
/test-all
In large scale clusters, the xtables lock may be held by kubelet, kube-proxy, or portmap for a long time, especially when there are many Service rules in the nat table. antrea-agent may not be able to acquire the lock in a short time. If the agent blocks on the lock or exits, the CNI server won't be running, causing all CNI requests to fail. If the Pods' restart policy is Always and there are dead Pods, the container runtime will keep retrying the CNI calls, during which portmap is invoked first, leading to more xtables lock contention. This patch makes the iptables initialization error non-fatal and uses a goroutine to retry it until it succeeds. The agent will start the CNI server anyway and handle CNI Del requests, but won't handle CNI Add requests until the network is ready.
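The overall flow described above can be sketched as follows (all names here are illustrative, not Antrea's actual API): initialization failure is logged and retried in a goroutine rather than being fatal, and a channel closed on success gates CNI Add while CNI Del is served immediately:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type agent struct {
	networkReady chan struct{} // closed once iptables initialization succeeds
	closeOnce    sync.Once
}

// initIPTables retries initialization in the background instead of exiting
// the agent on failure.
func (a *agent) initIPTables(initialize func() error) {
	go func() {
		for {
			if err := initialize(); err == nil {
				a.closeOnce.Do(func() { close(a.networkReady) })
				return
			}
			// Non-fatal: log and retry instead of quitting the agent.
			time.Sleep(100 * time.Millisecond)
		}
	}()
}

// cmdAdd waits for the network to be ready before proceeding.
func (a *agent) cmdAdd() error {
	select {
	case <-a.networkReady:
		return nil // proceed with the Add
	case <-time.After(time.Second): // shortened timeout for the sketch
		return errors.New("network is not ready")
	}
}

// cmdDel is handled regardless of network readiness.
func (a *agent) cmdDel() error { return nil }

func main() {
	a := &agent{networkReady: make(chan struct{})}
	attempts := 0
	a.initIPTables(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("could not acquire xtables lock") // simulated
		}
		return nil
	})
	fmt.Println("Del:", a.cmdDel())
	fmt.Println("Add:", a.cmdAdd())
}
```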
@antoninbas @jianjuns I rebased on master and fixed a lint error, could you approve again?
/test-all
/test-windows-conformance
Fixes #1499