Fix agent nil panic in pure IPv6 cluster #2655
Conversation
/test-e2e
/test-ipv6-only-conformance
/test-ipv6-only-e2e
Codecov Report
@@ Coverage Diff @@
## main #2655 +/- ##
==========================================
- Coverage 60.66% 60.49% -0.18%
==========================================
Files 285 285
Lines 23006 23017 +11
==========================================
- Hits 13957 13923 -34
- Misses 7550 7598 +48
+ Partials 1499 1496 -3
Flags with carried forward coverage won't be shown.
/test-e2e
/test-ipv6-only-e2e
Force-pushed from 1577d3a to e0a6988
/test-ipv6-only-e2e
/test-ipv6-only-e2e
Force-pushed from d56d20f to b4046c7
/test-ipv6-only-e2e
/test-e2e
test/e2e/egress_test.go
Outdated
// skipIfNotIPv6Cluster(t)
if clusterInfo.podV6NetworkCIDR == "" {
Isn't that the same check as skipIfNotIPv6Cluster?
Yes, good catch, will fix it.
Done
test/e2e/egress_test.go
Outdated
for _, ipR := range ipRanges {
	ipVersion := fmt.Sprintf("-IPv%d", ipR.ipVersion)
	expectedTotal := ipR.expectedTotalIpNum
	t.Run(tt.name+ipVersion, func(t *testing.T) {
Aren't you going to have several tests with the same name here? It seems the name is the same for all ipRanges sharing the same IP family.
@wenqiq You should reply to comments that are not specific to a line on the conversation page. Replying here is confusing and makes the original comment hard to track.
Got it. Thanks for the reminder.
test/e2e/egress_test.go
Outdated
	return exists == false, nil
})
require.NoError(t, err, "Failed to check if IP exists on Node")
// assert.False(t, exists, "Found stale IP on Node")
remove if not needed
It seems that I commented out this line by mistake, will fix it.
test/e2e/framework.go
Outdated
var builder strings.Builder
pods, errGet := data.clientset.CoreV1().Pods(antreaNamespace).List(context.TODO(), metav1.ListOptions{
	LabelSelector: "app=antrea,component=antrea-agent",
})
if errGet != nil {
	builder.WriteString(errGet.Error())
}
for _, pod := range pods.Items {
	code, stdout, stderr, errCmd := provider.RunCommandOnNode(controlPlaneNodeName(), fmt.Sprintf("kubectl -n %s logs %s antrea-agent", antreaNamespace, pod.Name))
	builder.WriteString(fmt.Sprintf("RunCommandOnNode, code: %d, stdout: %s, stderr: %s, error: %v", code, stdout, stderr, errCmd))
}
return fmt.Errorf("restartAntreaAgentPods error: %v, logs: %s", err, builder.String())
}
this seems unrelated to the rest of the PR and it's unclear what the purpose is since there is no comment
if there is an issue with the e2e test framework, it should be addressed in a separate PR
When the agent fails to start, the framework only describes the agent Pods. I think we need to print the agent startup logs to make debugging easier.
Agree with @antoninbas. We already collect the component logs in a specific directory when a test fails. I'm not sure whether it collects logs when deployment fails, but even if it doesn't, it's easy to change its behavior to do so. Dumping the whole log into the error may mess up the test output.
If we don't have the same test environment, it's hard to find the cause of the panic, because the describe-pod output in the Jenkins CI just says 'Back-off...':
Normal Created 116s (x4 over 2m53s) kubelet Created container antrea-agent
Normal Started 116s (x4 over 2m53s) kubelet Started container antrea-agent
Warning BackOff 112s (x7 over 2m48s) kubelet Back-off restarting failed container
cmd/antrea-agent/agent.go
Outdated
} else if nodeConfig.NodeIPv6Addr != nil {
	nodeIP = nodeConfig.NodeIPv6Addr.IP
} else {
	return fmt.Errorf("NodeIPAddr in Node config invalid: %v", nodeConfig)
NodeIPAddr in Node config is invalid
or invalid NodeIPAddr in Node config
Done
Force-pushed from de06b87 to 8d9c26b
test/e2e/egress_test.go
Outdated
}
if egress.Spec.EgressIP != tt.expectedEgressIP {
	return false, nil
for _, ipR := range ipRanges {
The nested subtests don't seem necessary, and it's hard to understand what the test focuses on and expects.
Can you just add subtests for a few IPv6 cases here? e.g. "single matching Node with IPv6 range". I don't think we need to cover every scenario with every address family and every possible IP range type; just the typical ones are fine.
Done
test/e2e/egress_test.go
Outdated
require.NoError(t, err, "Failed to check if IP exists on Node")
assert.False(t, exists, "Found stale IP on Node")
})
for _, ipR := range ipRanges {
ditto
Done
Done
/test-all
LGTM
/test-e2e
@tnqn @antoninbas It seems all the related test checks have finished successfully. Do you have any more comments about this PR?
@wenqiq I see failure for dual-stack e2e tests. Is that expected?
/test-ipv6-e2e
/test-ipv6-e2e
/test-all
Dual-stack e2e tests failed.
/test-all
/test-ipv6-only-e2e
/test-ipv6-conformance
@wenqiq the failure about
@tnqn Branch 1.3 seems to be waiting for this PR, so I suggest merging Xu's PR now, ignoring the comment about adding a comment in the script.
@wenqiq DCO is missing in the second patch
Thanks @lzhecheng @tnqn @xliuxu, it seems I should rebase and merge my two commits after #2675 is merged.
/test-ipv6-e2e
#2675 has been merged, you can rebase now.
The nodeConfig.NodeIPv4Addr is nil, which would cause a panic in the agent when starting the agent with the Egress feature enabled in a pure IPv6 cluster. This change also adds Egress IPv6 test cases for dual-stack and pure IPv6 clusters. Related antrea-io#2196 Signed-off-by: Wenqi Qiu <[email protected]>
@@ -552,7 +717,7 @@ func (data *TestData) createEgress(t *testing.T, generateName string, matchExpre
 }

 func (data *TestData) waitForEgressRealized(egress *v1alpha2.Egress) (*v1alpha2.Egress, error) {
-	err := wait.PollImmediate(200*time.Millisecond, 3*time.Second, func() (done bool, err error) {
+	err := wait.PollImmediate(200*time.Millisecond, 5*time.Second, func() (done bool, err error) {
So 3 seconds is enough in some cases? I feel it may indicate some problem, but I don't think it's related to this PR, so I'm fine with the change. We'd better look at the logs to understand why the delay is so long.
3 seconds is enough in most cases. I have tested hundreds of times in my local env (dual-stack cluster) and it never failed. It failed once in the Jenkins CI workflow:
=== RUN TestEgress/testEgressNodeFailure/IPv6_cluster
egress_test.go:611:
Error Trace: egress_test.go:611
egress_test.go:589
Error: Received unexpected error:
timed out waiting for the condition
Test: TestEgress/testEgressNodeFailure/IPv6_cluster
So I changed it to 5 seconds.
/test-all
/test-ipv6-conformance
The only failure in "jenkins-ipv6-only-e2e" is not related to this PR and is going to be fixed by #2712:
Fixed agent panic because of nil NodeIP in pure IPv6 cluster
The nodeConfig.NodeIPv4Addr is nil, which would cause a panic in the agent when starting the agent with the Egress feature enabled in a pure IPv6 cluster.
Related #2436
Add Egress IPv6 test cases in dual-stack and pure IPv6 clusters. Related #2196
Signed-off-by: Wenqi Qiu [email protected]