CI: rerun always if any failure #26574
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
…stage-failed-within-same-build

* upstream/master: (36 commits)
  Revert "[CI] fight the flakiness with some retry option in the CI only for the Pull Requests (elastic#26617)" (elastic#26704)
  Packaging: linux/armv7 is not supported (elastic#26706)
  Cyberarkpas: Link to official docs on how to setup TLS (elastic#26614)
  Make network_direction, registered_domain and convert processors compatible with ES older than 7.13.0 (elastic#26676)
  Disable armv7 packaging (elastic#26679)
  [Heartbeat] use --params flag for synthetics (elastic#26674)
  Update dependent package to avoid downloading a suspicious file (elastic#26406)
  [mergify] set title and allow bp in any direction (elastic#26648)
  Fix memory leak in SQL helper when database is not available (elastic#26607)
  [CI] fight the flakiness with some retry option in the CI only for the Pull Requests (elastic#26617)
  [mergify] automate PRs that change the backport rules (elastic#26641)
  [Metricbeat] Add Airflow module in xpack (elastic#26220)
  chore: add-backport-next (elastic#26620)
  [metricbeat] Add state_job metricset (elastic#26479)
  CI: jenkins labels are less time consuming now (elastic#26613)
  [MetricBeat] [AWS] Fix aws metric tags with resourcegroupstaggingapi paginator (elastic#26385) (elastic#26443)
  Move openmetrics module to oss (elastic#26561)
  Skip flaky test TestFilestreamMetadataUpdatedOnRename (elastic#26609)
  [filebeat][fortinet] Use default add_locale for fortinet.firewall (elastic#26524)
  Enroll proxy settings (elastic#26514)
  ...
I vote to retry only the command directly at https://github.com/elastic/beats/blob/master/Jenkinsfile#L565, nothing else.
If we do so, we cannot guarantee the worker is in good shape, for the reasons below:
Also, we cannot skip genuine failures in stages such as linting or packaging, which normally are not flaky. #26736 is the one that contains the changes from your suggestion.
Most of the failures are test failures, which are covered by the retry.
Still, 10% of the Beats builds reuse a worker within the same build. We fixed the reused workspace with https://github.com/elastic/apm-pipeline-library/blob/7f03e76e64c3a615a3ccdc8b911fbd236112daa7/vars/withNode.groovy#L38-L39 but, as the numbers above show, worker reuse is still happening. So we can give your suggestion a go, though I'd like to add some further configuration to exclude linting and packaging from the retry.
Superseded by #26736. Let's try a simple approach, and if needed we can come back to this particular one.
This pull request is now in conflicts. Could you fix it? 🙏
/test
We have agreed to drop this approach to avoid adding more complexity to the pipeline, and to wait for the fix in the CI ecosystem.
What does this PR do?
Rerun any failed stages automatically up to 3 times. Those stages are:
For every beat.
It's excluded for:
In addition, a rerun.json file is archived with the stages that were retried and their parameters; this should help to debug how often retries are happening.
Why is it important?
To deal with flakiness, we built a way to rerun a given commit manually and skip the stages that had already succeeded.
This new proposal adds that logic to the pipeline, so every stage gets the chance to retry.
IMPORTANT: A genuine failure will also be retried up to 3 times :/ . I assume this is the price we pay to reduce flakiness. Potentially we could add a test analyser between retries, but it might add complexity and hurt maintainability.
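To illustrate the behaviour described above, here is a minimal Python sketch of the retry loop. It is not the actual Jenkinsfile change. The function name `run_stage`, the stage names, the record fields, and the contents of the excluded set are all hypothetical, and it reads "up to 3 times" as three retries after the first run.

```python
import json
from typing import Callable, Dict, FrozenSet, List


def run_stage(
    name: str,
    stage: Callable[[], bool],
    params: Dict[str, str],
    rerun_log: List[dict],
    excluded: FrozenSet[str] = frozenset(),
    max_retries: int = 3,
) -> bool:
    """Run a stage and retry it on failure, recording each retry.

    Excluded stages run exactly once. A genuine failure still consumes
    every retry, which is the cost called out in the description.
    """
    allowed = 0 if name in excluded else max_retries
    retries = 0
    while True:
        if stage():
            return True
        if retries >= allowed:
            return False
        retries += 1
        # Hypothetical record shape for the archived rerun.json file.
        rerun_log.append({"stage": name, "retry": retries, "params": params})


# A stage that fails twice and then passes, standing in for a flaky test run.
calls = {"n": 0}


def flaky() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


rerun_log: List[dict] = []
ok = run_stage(
    "metricbeat-unitTest",                    # hypothetical stage name
    flaky,
    {"beat": "metricbeat"},
    rerun_log,
    excluded=frozenset({"lint", "packaging"}),  # placeholder exclusions
)
rerun_json = json.dumps(rerun_log, indent=2)  # content to archive
```

Recording a retry only when one is actually scheduled keeps the log empty for stages that pass on the first run, so the archived file directly reflects how often retries happen.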
Issues
Tests
Failed stage
If a stage fails, its test results won't be archived as long as there are retries left.
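The archiving rule can be expressed as a small predicate. This is only a sketch under the assumption that results are archived for a passing run or for the last failed attempt; the function name `should_archive_results` is hypothetical.

```python
def should_archive_results(stage_passed: bool, retries_left: int) -> bool:
    """Archive test results only for the final outcome of a stage.

    A failed attempt with retries remaining is skipped, so intermediate
    failures do not show up in the archived test report.
    """
    return stage_passed or retries_left == 0
```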
Retried stages