Remove flaky integration test assertion #3917
Conversation
Asserting there are no errors in the logs from the Elastic-Agent and all Beats is flaky and does not ensure the Elastic-Agent is working correctly. The test already asserts the health of all components, so there is no need to look in the logs. The number of exceptions this assertion already carries is an example of how fragile it is. The Elastic-Agent life cycle is complex and some transient errors are expected; as the code evolves, those errors and messages will change, making any assertion on them flaky.
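For context, a minimal sketch of the kind of assertion under discussion, with every identifier and allowlist entry hypothetical (none of them come from the actual test suite): scan the ECS/NDJSON log lines and fail on any error-level entry not covered by a global allowlist. The allowlist is exactly where the fragility described above shows up, since every newly discovered transient error forces another entry.

```go
// Hypothetical sketch, not the repository's actual helper.
package logcheck

import (
	"bufio"
	"encoding/json"
	"io"
	"strings"
	"testing"
)

// allowedErrors is an illustrative allowlist; these strings are made up.
var allowedErrors = []string{
	"connection refused",
	"context canceled",
}

// assertNoUnexpectedErrors fails the test for any error-level log entry
// whose message is not matched by the allowlist.
func assertNoUnexpectedErrors(t *testing.T, logs io.Reader) {
	t.Helper()
	sc := bufio.NewScanner(logs)
	for sc.Scan() {
		var entry struct {
			Level   string `json:"log.level"`
			Message string `json:"message"`
		}
		if err := json.Unmarshal(sc.Bytes(), &entry); err != nil {
			continue // not a JSON log line, skip it
		}
		if entry.Level != "error" {
			continue
		}
		allowed := false
		for _, a := range allowedErrors {
			if strings.Contains(entry.Message, a) {
				allowed = true
				break
			}
		}
		if !allowed {
			t.Errorf("unexpected error log: %s", entry.Message)
		}
	}
}
```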
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
This pull request does not have a backport label. Could you fix it @belimawr? 🙏
Are we sure we want to remove this one?
I'd love to have an online discussion regarding this PR. I'm a big fan of doing everything needed to KEEP this code in.
The list is short IMHO. I expect it to grow to even 100+ entries, and that's also OK. I don't think it's an indication of the check being fragile; it's just that we have tests that expose the error flows, as expected, and that's perfectly fine.
Quality Gate passed. Kudos, no new issues were introduced! 0 new issues.
One more thing we can improve is to provide a list of expected errors per test, in addition to a global list like we have here. That will help make the list shorter.
The description is correct; this is not strictly a test of the code our team owns. It is essentially an alert that there is a new error in our logs. Broadly these failures fall into two categories:
1. Errors that are not actual errors and should not be logged at the error level.
2. Real bugs that only manifest in the logs, without making the agent or a component unhealthy.
Case 2 is the important part. What the test is trying to do is ensure we don't ship a bug to customers that was hiding in our logs without anybody noticing. We have done this before. Since this test is new, many of the errors are just falling into case 1. We can ignore these and fix them to not be at the error level. I agree that this test is extremely valuable to Elastic Agent as a product; I also agree that it causes some toil in our team to maintain it, and this may be a bit annoying. However, something in our QA process needs to do this job of looking for new errors, and right now the best place for that is this integration test. Note: we should also be looking for panics.
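A panic check, unlike a message check, can be wording-agnostic. A minimal sketch (hypothetical helper, not from the repository): the Go runtime always prints a line starting with `panic:` when a program panics, so a prefix scan is cheap and unambiguous.

```go
package logcheck

import (
	"bufio"
	"io"
	"strings"
	"testing"
)

// assertNoPanics fails the test if any log line starts with "panic:",
// the fixed prefix printed by the Go runtime on a panic.
func assertNoPanics(t *testing.T, logs io.Reader) {
	t.Helper()
	sc := bufio.NewScanner(logs)
	for sc.Scan() {
		if strings.HasPrefix(strings.TrimSpace(sc.Text()), "panic:") {
			t.Errorf("panic found in logs: %q", sc.Text())
		}
	}
}
```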
We already have an error log check when running the agent without installing it. If we want to have more checks on the agent and component logs, it should be done as an assertion after each test (with or without the agent installed), not just this one. Keeping only this specific test case as a safeguard is problematic.
If there is a specific need to keep error logs in check, it should be done in every test, and every test should be able to define exceptions/expected errors when testing error scenarios.
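A sketch of that shape, with every name hypothetical (`globalAllowlist`, `fetchErrorLogs`, and `checkLogsAfter` are all illustrative): a helper registered via `t.Cleanup` merges a global allowlist with per-test expected errors, so tests exercising error scenarios can declare what they intend to trigger.

```go
package logcheck

import (
	"strings"
	"testing"
)

// globalAllowlist holds errors tolerated everywhere (illustrative only).
var globalAllowlist = []string{"context canceled"}

// fetchErrorLogs is a placeholder for however a suite collects the
// error-level log messages emitted during the current test.
func fetchErrorLogs(t *testing.T) []string { return nil }

// checkLogsAfter registers a log assertion that runs when the test
// finishes; expectedErrors lists messages this specific test intends
// to provoke, on top of the global allowlist.
func checkLogsAfter(t *testing.T, expectedErrors ...string) {
	t.Helper()
	allowed := append(append([]string{}, globalAllowlist...), expectedErrors...)
	t.Cleanup(func() {
		for _, msg := range fetchErrorLogs(t) {
			ok := false
			for _, a := range allowed {
				if strings.Contains(msg, a) {
					ok = true
					break
				}
			}
			if !ok {
				t.Errorf("unexpected error log: %s", msg)
			}
		}
	})
}

// Usage inside a test that deliberately exercises an error path:
//
//	func TestOutputUnreachable(t *testing.T) {
//		checkLogsAfter(t, "connection refused")
//		// ... test body ...
//	}
```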
While I understand your point @cmacknz, I've never thought about this test like this, and it might just show there are different views on what this test is actually about beyond what the test name states. Besides, there are several issues that are either transient but that we want to know about, or that fall into 1.; they aren't supposed to be logged as errors. I think it'd be better to either look for ingestion-specific error logs in the Filebeat logs, or perhaps have a manual test responsible for checking all error logs in a given scenario and ensuring they're fine. That, I believe, would also help us better tune the error and warning logs we emit, as we could make the requirement/output of the manual test include a list of log-level corrections.
Thanks for all the feedback, folks! I'll respond to some, if not all, key points raised here by quoting them and adding my thoughts. At the end I'll add some other thoughts/ideas.
There is at least a 3rd category: errors that are expected, correctly logged at error level, and transient. An example I found today (and that led me to create this PR) is connection issues to Elasticsearch that happen when the Elastic-Agent is installed and running but has not enrolled in Fleet yet. In this case all Beats will try to connect to Elasticsearch.
100% agree with this; I believe it could be built into the test framework itself, as panics should never happen.
I agree. As the name TestLogIngestionFleetManaged states, this test is about log ingestion.
While I like the idea, I believe in reality it will only create a chance of flakiness in all tests, which is pretty much what is happening with TestLogIngestionFleetManaged.
I had never seen this test or this assertion like that. I really thought it was just a way to make sure the Elastic-Agent was running, rather than something to uncover bugs that went under our radar and did not make the Agent/Beats unhealthy.

My suggestion

Now I see the value of looking for any new error log entry in the normal execution of the Elastic-Agent as a way to uncover bugs/issues that were not caught by other tests and other ways of asserting the correct execution of the Elastic-Agent. As we know, looking for any error log entry in every test is prone to flakiness, so we could move this check into a single, dedicated test.
This way we will have a single test that will fail when a new unexpected, or expected, error log appears; any engineer having their PR fail on that will have a good direction to start investigating the failure. We can name it very clearly. This all could be done in this PR so it is merged as a single commit.
Agreed, this is better. The only reason this is confined to this one test is to test how annoying it will be to maintain. It would be better if every test had this check. It would be best if the product itself had this check and every true error appropriately updated the agent status.
Correct, this is not a real error, and it shouldn't be logged as such. I have a different view on connecting to Elasticsearch, though.
Isn't this pretty much exactly what this test does, except that the fixed set of logs is the Elastic Agent's own logs? Integration and E2E tests are not unit tests; they can and should do more than one job. To the earlier point, I am more in favour of doing this on every test rather than only one of them, or having the product do this as part of normal operation (this should be the true end goal).
I totally agree with this. At a quick glance at the current exceptions, I'd consider most of them bugs.
I always understood the focus of this test to be ensuring we can run the Elastic-Agent with a simple integration, more like an E2E test to make sure we didn't break anything in this basic flow.
I'm not against it, but we should try to make clear in the test error message and in the code itself what the purpose of this assertion is. I know I didn't understand it until now; maybe other people thought the same. My main point here is to make very clear what we expect of this assertion, so 6 months from now nobody will just see it as a flaky test again.
Ah yes, actually looking at the error message, it doesn't justify its existence. We should fix that and probably include the reason for the assertion in the test failure message.
I 100% agree that we should add the reason for the assertion to the test failure message.
Thinking a bit more about this PR, and to avoid turning it into a change to the testing framework, I think I could:
This pull request is now in conflicts. Could you fix it? 🙏
Putting back to draft for now.
This pull request is now in conflicts. Could you fix it? 🙏
Cleaning this up a bit, closing for now.
What does this PR do?
Remove flaky integration test assertion
Asserting there are no errors in the logs from the Elastic-Agent and all Beats is flaky and does not ensure the Elastic-Agent is working correctly. The test already asserts the health of all components, so there is no need to look in the logs.
The number of exceptions this assertion already carries is an example of how fragile it is. The Elastic-Agent life cycle is complex and some transient errors are expected; as the code evolves, those errors and messages will change, making any assertion on them flaky.
Why is it important?
It removes test flakiness.
Checklist
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added an entry in ./changelog/fragments using the [changelog tool](https://github.com/elastic/elastic-agent#changelog)
- [ ] I have added an integration test or an E2E test

Author's Checklist
TestLogIngestionFleetManaged is what ensures this PR does not introduce bugs.

How to test this PR locally
Just run the test:
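The exact command was elided from the original. A plausible invocation, assuming the test lives under the repository's integration-test package and is gated behind an `integration` build tag (both assumptions):

```sh
# Hypothetical invocation; package path and build tag are assumptions.
go test -v -tags integration -run TestLogIngestionFleetManaged ./testing/integration/...
```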
Related issues

Use cases

Screenshots

Logs

Questions to ask yourself