e2e cases failed quite often after testing TestFlowAggregator #1956
Comments
@srikartati could this be caused by the hack that was used to ensure the correct restart order between the agent and the aggregator?
Flow aggregator starting with no sleep affects only its yaml, but in this case, flow aggregator started fine and the error is related to Antrea Agent not restarting again properly during
I tried to reproduce this locally to capture the agent logs at the moment the error occurs, but was not successful even after multiple runs. The test artifacts do not seem to include the logs from the time of the error, on either the Kind or the Jenkins runs. I was thinking of opening a dummy PR and turning on
By dumping the describe output when the error happens, I see that the antrea-agent container goes into
I think the root cause is that the correct certificates for the flow aggregator service are not available when the Antrea Agent restarts. We should always fetch the certificates when making a new connection to the flow aggregator service. This was caused by a code change related to fetching certificates in the agent in PR #1714. I am adding the fix now and testing to confirm.
The log prior to the crash shows a DNS resolution error (below). The failure is intermittent because the flow aggregator service may or may not be up when the agent restarts. It occurs more frequently with the newly added code, which includes the bug related to fetching security certificates introduced in the recent PR.
The backtrace from the agent crash is as follows:
The crash comes from the code that checks whether the interface is nil. The interface value itself is not nil, even though the concrete pointer stored in it is nil. The nil check therefore passes incorrectly, and the later use of the object crashes the agent. Fixed using the reflect package in PR #1959.
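To illustrate the nil-interface pitfall described above, here is a minimal, self-contained sketch. It is not the Antrea code: `exportingProcess`, `ipfixProcess`, `newProcess`, and `isNil` are hypothetical names chosen for this example, and the reflect-based check is only one plausible shape of the fix mentioned in PR #1959.

```go
package main

import (
	"fmt"
	"reflect"
)

// exportingProcess is a hypothetical stand-in for the interface
// that the Flow Exporter holds (not the real Antrea type).
type exportingProcess interface {
	Collector() string
}

// ipfixProcess is a hypothetical concrete implementation.
type ipfixProcess struct {
	collector string
}

// Collector dereferences p, so it panics when p is a nil pointer.
func (p *ipfixProcess) Collector() string {
	return p.collector
}

// newProcess returns a nil *ipfixProcess when the connection fails
// (assumption: this mimics a failed connection to the Flow Aggregator).
func newProcess(connected bool) *ipfixProcess {
	if !connected {
		return nil
	}
	return &ipfixProcess{collector: "flow-aggregator"}
}

// isNil reports whether v is a nil interface or an interface that
// wraps a nil pointer, map, slice, channel, or function.
func isNil(v interface{}) bool {
	if v == nil {
		return true
	}
	rv := reflect.ValueOf(v)
	switch rv.Kind() {
	case reflect.Ptr, reflect.Map, reflect.Slice, reflect.Chan, reflect.Func:
		return rv.IsNil()
	}
	return false
}

func main() {
	// Storing a nil *ipfixProcess in the interface yields a non-nil
	// interface value: it carries a type, but a nil pointer.
	var proc exportingProcess = newProcess(false)

	fmt.Println(proc == nil) // false: the naive check is fooled
	fmt.Println(isNil(proc)) // true: the reflect-based check is not
}
```

An alternative that avoids reflection is to never store a possibly-nil concrete pointer in the interface: check the pointer first and assign to the interface only when it is non-nil.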
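Summarizing the fix here: there are two bugs, both described above (the nil check on the exporting process interface in the Flow Exporter, and the certificate fetching when reconnecting to the Flow Aggregator).
Both are fixed in #1959.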
…ests(#1959)
This commit fixes the bug in Flow Exporter code. This bug was introduced in PR #1714. Because of this, the IPFIX exporting process interface in Flow Exporter is assigned with typecasted nil (typecasted to corresponding structure ipfixExportingProcess). Therefore, the code comparing with nil breaks in exporting process initialization code of Flow Exporter.
This PR fixes the bug along with another issue in fetching certificates that exacerbates this bug when the Flow Exporter cannot connect to the Flow Aggregator.
Fixes #1956
Describe the bug
Many e2e tests fail after TestFlowAggregator has run, regardless of where the tests run:
Jenkins tests:
https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2180/console
https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2181/console
Kind tests:
https://pipelines.actions.githubusercontent.com/rI6se88BdHGIt3c6pWSIOE7hniT9sK2Yykq02f3adUskIOWLhd/_apis/pipelines/1/runs/41803/signedlogcontent/62?urlExpires=2021-03-15T10%3A33%3A53.1605722Z&urlSigningMethod=HMACV1&urlSignature=wdQSJpI%2FMRR0L8CQN8qzg%2Bz%2FD58ic2XdpXQm%2FEYNYVg%3D
https://pipelines.actions.githubusercontent.com/rI6se88BdHGIt3c6pWSIOE7hniT9sK2Yykq02f3adUskIOWLhd/_apis/pipelines/1/runs/41784/signedlogcontent/64?urlExpires=2021-03-15T11%3A51%3A10.5229666Z&urlSigningMethod=HMACV1&urlSignature=PSJmMVUGj9jRbGmfnS0AaHRIx%2FQSfewr3aClt6bhRTA%3D
The common pattern is that antrea-agent cannot get up and running after TestFlowAggregator has run. The test case itself may succeed or fail.
To Reproduce
Run e2e test
Expected
A test case should clean up its side effects and not affect other cases.