
e2e cases failed quite often after testing TestFlowAggregator #1956

Closed
tnqn opened this issue Mar 15, 2021 · 6 comments · Fixed by #1959
Assignees
Labels
area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@tnqn
Member

tnqn commented Mar 15, 2021

Describe the bug
Many e2e tests failed after testing TestFlowAggregator regardless of where it ran:
Jenkins tests:
https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2180/console
https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2181/console
Kind tests:
https://pipelines.actions.githubusercontent.com/rI6se88BdHGIt3c6pWSIOE7hniT9sK2Yykq02f3adUskIOWLhd/_apis/pipelines/1/runs/41803/signedlogcontent/62?urlExpires=2021-03-15T10%3A33%3A53.1605722Z&urlSigningMethod=HMACV1&urlSignature=wdQSJpI%2FMRR0L8CQN8qzg%2Bz%2FD58ic2XdpXQm%2FEYNYVg%3D
https://pipelines.actions.githubusercontent.com/rI6se88BdHGIt3c6pWSIOE7hniT9sK2Yykq02f3adUskIOWLhd/_apis/pipelines/1/runs/41784/signedlogcontent/64?urlExpires=2021-03-15T11%3A51%3A10.5229666Z&urlSigningMethod=HMACV1&urlSignature=PSJmMVUGj9jRbGmfnS0AaHRIx%2FQSfewr3aClt6bhRTA%3D

The common pattern is that the antrea-agent cannot get up and running after TestFlowAggregator has run; the TestFlowAggregator case itself may succeed or fail:

2021-03-13T06:13:57.0285748Z === RUN   TestFlowAggregator
2021-03-13T06:13:57.0288577Z ##[error]    fixtures.go:133: Creating 'antrea-test' K8s Namespace
2021-03-13T06:13:57.0311588Z ##[error]    fixtures.go:96: Applying Antrea YAML
2021-03-13T06:13:57.9460506Z ##[error]    fixtures.go:100: Waiting for all Antrea DaemonSet Pods
2021-03-13T06:13:58.9598129Z ##[error]    fixtures.go:104: Checking CoreDNS deployment
2021-03-13T06:14:01.9906449Z ##[error]    fixtures.go:160: Applying flow aggregator YAML with ipfix collector address: 172.18.0.4:4739:tcp
2021-03-13T06:14:05.3566800Z ##[error]    fixtures.go:172: Deploying flow exporter with collector address: 10.96.63.61:4739:tcp
2021-03-13T06:15:36.9364958Z ##[error]    flowaggregator_test.go:110: Error when setting up test: error when restarting antrea-agent Pod: antrea-agent DaemonSet not ready within 1m30s
2021-03-13T06:15:36.9368127Z --- FAIL: TestFlowAggregator (99.91s)
2021-03-13T06:15:36.9369146Z === RUN   TestReplaceFieldValue
2021-03-13T06:15:36.9372023Z --- PASS: TestReplaceFieldValue (0.00s)
2021-03-13T06:15:36.9392459Z === RUN   TestIPSecTunnelConnectivity
2021-03-13T06:15:36.9395283Z ##[error]    fixtures.go:36: Skipping test for the 'kind' provider: IPSec tunnel does not work with Kind
2021-03-13T06:15:36.9398119Z --- SKIP: TestIPSecTunnelConnectivity (0.00s)
2021-03-13T06:15:36.9399464Z === RUN   TestIPSecDeleteStaleTunnelPorts
2021-03-13T06:15:36.9401897Z ##[error]    fixtures.go:36: Skipping test for the 'kind' provider: IPSec tunnel does not work with Kind
2021-03-13T06:15:36.9407904Z --- SKIP: TestIPSecDeleteStaleTunnelPorts (0.00s)
2021-03-13T06:15:36.9409278Z === RUN   TestNetworkPolicyStats
2021-03-13T06:15:36.9411329Z ##[error]    fixtures.go:133: Creating 'antrea-test' K8s Namespace
2021-03-13T06:15:36.9413605Z ##[error]    fixtures.go:96: Applying Antrea YAML
2021-03-13T06:17:08.0802565Z ##[error]    networkpolicy_test.go:40: Error when setting up test: error when waiting for antrea-agent rollout to complete
2021-03-13T06:17:08.0859094Z --- FAIL: TestNetworkPolicyStats (91.14s)
2021-03-13T06:17:08.0860619Z === RUN   TestDifferentNamedPorts
2021-03-13T06:17:08.0862750Z ##[error]    fixtures.go:133: Creating 'antrea-test' K8s Namespace
2021-03-13T06:17:08.0865191Z ##[error]    fixtures.go:96: Applying Antrea YAML
2021-03-13T06:18:39.0666049Z ##[error]    networkpolicy_test.go:155: Error when setting up test: error when waiting for antrea-agent rollout to complete
2021-03-13T06:18:39.0686473Z --- FAIL: TestDifferentNamedPorts (90.99s)
--- PASS: TestFlowAggregator (209.81s)
    --- PASS: TestFlowAggregator/IntraNodeFlows (11.19s)
    --- PASS: TestFlowAggregator/InterNodeFlows (11.24s)
    --- PASS: TestFlowAggregator/LocalServiceAccess (11.10s)
    --- PASS: TestFlowAggregator/RemoteServiceAccess (11.10s)
=== RUN   TestReplaceFieldValue
--- PASS: TestReplaceFieldValue (0.00s)
=== RUN   TestIPSecTunnelConnectivity
    fixtures.go:133: Creating 'antrea-test' K8s Namespace
    fixtures.go:96: Applying Antrea YAML
    fixtures.go:100: Waiting for all Antrea DaemonSet Pods
    ipsec_test.go:66: Error when setting up test: antrea-agent DaemonSet not ready within 1m30s
--- FAIL: TestIPSecTunnelConnectivity (91.18s)
=== RUN   TestIPSecDeleteStaleTunnelPorts
    fixtures.go:133: Creating 'antrea-test' K8s Namespace
    fixtures.go:96: Applying Antrea YAML
    ipsec_test.go:100: Error when setting up test: error when waiting for antrea-agent rollout to complete
--- FAIL: TestIPSecDeleteStaleTunnelPorts (91.16s)

To Reproduce
Run e2e test

Expected
A test case should clean up its side effect, not affecting other cases.
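The expectation above, that each case tears down what it set up, can be sketched with a LIFO cleanup stack (Go's testing package offers the same semantics via `t.Cleanup`). This is a minimal illustration, not Antrea's fixture code; `setupTest` and the printed fixture names are hypothetical.

```go
package main

import "fmt"

// setupTest deploys a fixture and registers its teardown on a cleanup stack.
// Hypothetical sketch: the fixture names are illustrative, not Antrea's API.
func setupTest(cleanups *[]func()) {
	fmt.Println("deploy flow aggregator YAML")
	*cleanups = append(*cleanups, func() {
		fmt.Println("delete flow aggregator YAML")
	})
}

func main() {
	var cleanups []func()
	setupTest(&cleanups)
	fmt.Println("run test case")
	// Run cleanups in reverse order so fixtures created last are torn
	// down first, leaving no side effects for the next test case.
	for i := len(cleanups) - 1; i >= 0; i-- {
		cleanups[i]()
	}
}
```

Running cleanups even when the case fails is what keeps one failing test from cascading into the rest of the suite.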

@tnqn tnqn added kind/bug Categorizes issue or PR as related to a bug. area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. labels Mar 15, 2021
@srikartati
Member

Thanks @tnqn for reporting this. This may have been caused by #1898.
Looking into this closely to figure out the exact reason.

@antoninbas
Contributor

antoninbas commented Mar 15, 2021

@srikartati maybe the hack that was used to ensure the correct restart order between agent and aggregator?
Please prioritize this so that PRs don't get stuck because of CI failures.

@antoninbas antoninbas added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 15, 2021
@srikartati
Member

@srikartati maybe the hack that was used to ensure correct restart order between agent and aggregator?
please prioritize this so that PRs don't get stuck because of CI failures.

Starting the flow aggregator with no sleep affects only its own YAML; in this case the flow aggregator started fine, and the error is that the Antrea agent did not restart properly during or after TestFlowAggregator. We need to look at the logs to see why the Antrea DaemonSet did not come up properly.

2021-03-13T06:14:01.9906449Z ##[error]    fixtures.go:160: Applying flow aggregator YAML with ipfix collector address: 172.18.0.4:4739:tcp
2021-03-13T06:14:05.3566800Z ##[error]    fixtures.go:172: Deploying flow exporter with collector address: 10.96.63.61:4739:tcp
2021-03-13T06:15:36.9364958Z ##[error]    flowaggregator_test.go:110: Error when setting up test: error when restarting antrea-agent Pod: antrea-agent DaemonSet not ready within 1m30s
2021-03-13T06:15:36.9368127Z --- FAIL: TestFlowAggregator (99.91s)

Tried to reproduce this locally to get hold of the agent logs when the error occurs, but I was not successful even after multiple runs. The test artifacts do not seem to store the logs when the error occurs, on either the Kind or the Jenkins runs. I was thinking of opening a dummy PR and turning on logsExportOnSuccess to fetch logs for CI runs. Any better ideas on saving logs?

@srikartati
Member

srikartati commented Mar 17, 2021

By dumping the kubectl describe output when the error happens, I see that the antrea-agent container goes into a CrashLoopBackOff state and doesn't start again:
Warning Unhealthy 5s (x8 over 75s) kubelet, antrea-e2e-for-pull-request-2200-5g9kg Readiness probe failed: Get "https://localhost:10350/readyz": dial tcp 127.0.0.1:10350: connect: connection refused

I think the root cause is that the correct certificates for the flow aggregator Service are not available when the antrea-agent restarts: we should always fetch the certificates when making a new connection to the flow aggregator Service. This was caused by a code change related to fetching certificates in the agent in PR #1714. Adding the fix now and testing to confirm.
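The "fetch on every connection" idea can be sketched as follows. This is a hypothetical illustration, not Antrea's actual code: `fetchCACert` stands in for whatever mechanism retrieves the aggregator's CA certificate (e.g. reading a ConfigMap), and the point is only that it is called inside `connect`, on every attempt, rather than once at agent startup.

```go
package main

import "fmt"

type conn struct{ caCert string }

// fetchCACert is a stand-in for retrieving the aggregator's CA certificate
// from the cluster; the name and return value are hypothetical.
func fetchCACert() (string, error) {
	return "fresh-ca-cert", nil
}

// connect fetches the certificate on every (re)connection attempt, so a
// restarted agent never dials the aggregator with stale credentials.
func connect() (*conn, error) {
	caCert, err := fetchCACert() // fetched here, not cached at startup
	if err != nil {
		return nil, err
	}
	return &conn{caCert: caCert}, nil
}

func main() {
	c, err := connect()
	if err != nil {
		fmt.Println("connection failed, will retry:", err)
		return
	}
	fmt.Println("connected with", c.caCert)
}
```

Fetching inside the retry loop makes each attempt self-contained, at the cost of one extra lookup per retry.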

@srikartati
Member

srikartati commented Mar 17, 2021

The log prior to the crash shows a DNS resolution failure in the error below. The intermittent nature is explained by the flow aggregator Service being up or not at that moment. This situation occurs more frequently with the newly added code, which includes the bug related to fetching security certificates introduced in the recent PR.

E0317 02:49:02.724945       1 process.go:116] Cannot the create the connection to the Collector flow-aggregator.flow-aggregator.svc:4739: dial tcp: lookup flow-aggregator.flow-aggregator.svc on 10.96.0.10:53: read udp 172.18.0.4:40549->10.96.0.10:53: i/o timeout
E0317 02:49:02.727441       1 exporter.go:195] Error when initializing flow exporter: error when starting exporter: error while initializing IPFIX exporting process: dial tcp: lookup flow-aggregator.flow-aggregator.svc on 10.96.0.10:53: read udp 172.18.0.4:40549->10.96.0.10:53: i/o timeout

The backtrace that caused the agent to crash is the following:

goroutine 500 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x10c
panic(0x1c24a20, 0x2e159c0)
	/usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/vmware-tanzu/antrea/pkg/ipfix.(*ipfixExportingProcess).CloseConnToCollector(0x0)
	/antrea/pkg/ipfix/ipfix_process.go:53 +0x22
github.com/vmware-tanzu/antrea/pkg/agent/flowexporter/exporter.(*flowExporter).Export(0xc0009012c0)
	/antrea/pkg/agent/flowexporter/exporter/exporter.go:198 +0x1ff
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000996c10)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc000996c10, 0x2126e20, 0xc00087ce70, 0x1, 0xc0000abb00)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156 +0xad
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000996c10, 0x3b9aca00, 0x0, 0x1fa2c01, 0xc0000abb00)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc000996c10, 0x3b9aca00, 0xc0000abb00)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90 +0x4d
created by github.com/vmware-tanzu/antrea/pkg/agent/flowexporter/exporter.(*flowExporter).Run
	/antrea/pkg/agent/flowexporter/exporter/exporter.go:184 +0x85

The root cause is in the code that checks whether the interface is nil. The interface itself is not nil even though the concrete pointer stored in it is nil, so the nil check passed and the nil pointer was dereferenced later, crashing the agent. Fixed it using the reflect package in PR #1959.
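The typed-nil pitfall described above can be reproduced in a few lines. This is a simplified sketch, not Antrea's code: the interface and struct names only mirror the ones in the backtrace, and `isNil` shows the general reflect-based check (the exact fix in #1959 may differ).

```go
package main

import (
	"fmt"
	"reflect"
)

type exportingProcess interface {
	CloseConnToCollector()
}

type ipfixExportingProcess struct{}

func (p *ipfixExportingProcess) CloseConnToCollector() {}

// newProcess returns a typed nil when initialization fails: the interface
// value is non-nil because it still carries the *ipfixExportingProcess type,
// even though the pointer inside it is nil.
func newProcess(fail bool) exportingProcess {
	var p *ipfixExportingProcess
	if !fail {
		p = &ipfixExportingProcess{}
	}
	return p
}

// isNil detects a typed nil by inspecting the value stored in the interface.
// reflect.Value.IsNil is only valid for nillable kinds (pointer, chan, map,
// slice, func, interface), which is all we pass here.
func isNil(i interface{}) bool {
	return i == nil || reflect.ValueOf(i).IsNil()
}

func main() {
	p := newProcess(true)
	fmt.Println(p == nil) // false: a plain nil check misses the typed nil
	fmt.Println(isNil(p)) // true: reflect sees the nil pointer inside
}
```

With only the `p == nil` check, the code would proceed to call `p.CloseConnToCollector()` on a nil receiver that dereferences it, which is exactly the panic in the backtrace.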

@srikartati
Member

srikartati commented Mar 17, 2021

Summarizing the fix: there are two bugs here.

  1. Certificates were not fetched right before connecting to the flow aggregator Service during retries, so a restarted agent could use stale ones.
  2. Any failure to create an exporter connection to the flow aggregator (because of the prior bug, or because the Service is unavailable and DNS resolution fails) crashed the Antrea agent.

Fixed both in #1959.

srikartati added a commit that referenced this issue Mar 17, 2021
…ests(#1959)

This commit fixes a bug in the Flow Exporter code introduced in PR #1714: the IPFIX exporting process interface in the Flow Exporter is assigned a typed nil (a nil pointer of the concrete struct ipfixExportingProcess), so comparing the interface against nil fails in the exporting process initialization code of the Flow Exporter.
This PR fixes that bug along with another issue in fetching certificates that exacerbates it when the Flow Exporter cannot connect to the Flow Aggregator.

Fixes #1956