
Remove flaky integration test assertion #3917

Closed
wants to merge 1 commit

Conversation

belimawr
Contributor

What does this PR do?

Remove flaky integration test assertion

Asserting that there are no errors in the logs from the Elastic-Agent and all Beats is flaky and does not ensure the Elastic-Agent is working correctly. The test already asserts the health of all components, so there is no need to look at the logs.

The number of exceptions this no-log-errors assertion already requires is an example of how fragile it is. The Elastic-Agent life cycle is complex and some transient errors are expected; as the code evolves, those errors and messages will change, making assertions on them flaky.
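For illustration, here is a minimal Go sketch of the kind of assertion under discussion: query Elasticsearch for error-level log entries and fail unless each one matches a hard-coded exception list. The helper names, the query, and the exception entries are hypothetical, not the actual test code.

```go
// Hypothetical sketch of the assertion being removed: search the ingested
// logs for error-level entries and fail on any that is not on a known
// exception list. Index pattern, query, and exceptions are illustrative.
package integration

import (
	"bytes"
	"encoding/json"
	"net/http"
	"strings"
	"testing"
)

// knownTransientErrors is the exception list the description calls out as a
// symptom of fragility: it must grow every time a new transient error appears.
var knownTransientErrors = []string{
	"Failed to connect to backoff(elasticsearch", // hypothetical entry
	"context canceled",                           // hypothetical entry
}

func assertNoErrorLogs(t *testing.T, esURL string) {
	t.Helper()

	// Match any error-level log from the agent or its components.
	query := `{"query":{"term":{"log.level":"error"}},"size":100}`
	resp, err := http.Post(esURL+"/logs-*/_search", "application/json",
		bytes.NewBufferString(query))
	if err != nil {
		t.Fatalf("querying Elasticsearch: %v", err)
	}
	defer resp.Body.Close()

	var result struct {
		Hits struct {
			Hits []struct {
				Source struct {
					Message string `json:"message"`
				} `json:"_source"`
			} `json:"hits"`
		} `json:"hits"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		t.Fatalf("decoding search response: %v", err)
	}

	for _, hit := range result.Hits.Hits {
		msg := hit.Source.Message
		expected := false
		for _, known := range knownTransientErrors {
			if strings.Contains(msg, known) {
				expected = true
				break
			}
		}
		if !expected {
			t.Errorf("unexpected error log: %s", msg)
		}
	}
}
```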

Why is it important?

It removes test flakiness

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool (https://github.com/elastic/elastic-agent#changelog)
  • I have added an integration test or an E2E test

Author's Checklist

  • No flakiness on TestLogIngestionFleetManaged is what ensures this PR does not introduce bugs.

How to test this PR locally

Just run the test:

SNAPSHOT=true TEST_PLATFORMS="linux/amd64" mage -v integration:single TestLogIngestionFleetManaged



@belimawr belimawr added the Team:Elastic-Agent (Label for the Agent team) and skip-changelog labels Dec 14, 2023
@belimawr belimawr requested a review from a team as a code owner December 14, 2023 14:44
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Contributor

mergify bot commented Dec 14, 2023

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert
Contributor

Are we sure we want to remove this one?
I understand it's a bit fragile, but at least it allows us to spot when we introduce new unwanted errors in our logs.
cc @cmacknz what's your opinion?

@amitkanfer
Contributor

I'd love to have an online discussion regarding this PR. I'm a big fan of doing everything needed to KEEP this code in.

@amitkanfer
Contributor

The number of exceptions this no-log-errors assertion already requires is an example of how fragile it is. The Elastic-Agent life cycle is complex and some transient errors are expected; as the code evolves, those errors and messages will change, making assertions on them flaky.

The list is short IMHO... I expect it to grow to even 100+ and that's also OK. I don't think it's an indication of it being fragile; it's just that we have tests that expose the error flows, as expected, and that's perfectly fine.
The value-add of keeping this code is far greater than that of removing it.
I also don't see how this is "flaky": either errors are thrown, or not.


Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

@amitkanfer
Contributor

One more thing we can improve is to provide a list of expected errors per test, in addition to a global list like we have here... it will help make the list shorter.

@cmacknz
Member

cmacknz commented Dec 14, 2023

The description is correct: this is not strictly a test of the code our team owns.

It is essentially an alert that there is a new error in our logs. Broadly, these failures fall into two categories:

  1. The log message shouldn't be at the error level to begin with.
  2. The error is real and has identified a probable bug.

Case 2 is the important part. What the test is trying to do is ensure we don't ship a bug to customers that was hiding in our logs without anybody noticing. We have done this before. Since this test is new, many of the errors are just falling into case 1. We can ignore these and fix them to not be at the error level.

I agree that this test is extremely valuable to Elastic Agent as a product; I also agree that it causes some toil in our team to maintain and this may be a bit annoying.

However, something in our QA process needs to do this job of looking for new errors, and right now the best place for that is this integration test.

Note: we should also be looking for panics and not just error logs.

@pchila
Member

pchila commented Dec 14, 2023

We already have an error log check when running the agent without installing it via fixture.Run(), unless otherwise specified (there's an option to disable this check when creating the fixture).

If we want to have more checks on the agent and component logs, it should be done as an assertion after each test (with or without the agent installed), not just in this one.

Keeping only this specific testcase as a safeguard is:

  • annoying for all cases that fall into category 1
  • incomplete for all the other testcases that we run for which we are interested in error logs that fall in category 2

If there is a specific need to keep error logs in check, it should be done in every test, and every test should be able to define exceptions/expected errors when testing error scenarios (see the sketch below).
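To make the suggestion concrete, here is a sketch of how per-test expected errors could look, assuming a hypothetical fixture API with functional options; the real framework's fixture API may differ.

```go
// Hypothetical sketch of per-test expected errors: each test registers the
// error substrings it is allowed to produce, and the no-error check runs
// for every test via t.Cleanup. The fixture API here is invented for
// illustration and does not match the real framework.
package integration

import (
	"strings"
	"testing"
)

// Fixture stands in for the integration test fixture.
type Fixture struct {
	expectedErrors []string
}

// WithExpectedErrors lets a test declare errors it deliberately triggers.
func WithExpectedErrors(patterns ...string) func(*Fixture) {
	return func(f *Fixture) {
		f.expectedErrors = append(f.expectedErrors, patterns...)
	}
}

// NewFixture wires the error-log check into every test's cleanup phase.
func NewFixture(t *testing.T, opts ...func(*Fixture)) *Fixture {
	f := &Fixture{}
	for _, opt := range opts {
		opt(f)
	}
	t.Cleanup(func() {
		for _, line := range collectErrorLogs() {
			if !f.isExpected(line) {
				t.Errorf("unexpected error log: %s", line)
			}
		}
	})
	return f
}

func (f *Fixture) isExpected(line string) bool {
	for _, p := range f.expectedErrors {
		if strings.Contains(line, p) {
			return true
		}
	}
	return false
}

// collectErrorLogs would read agent and component logs; elided here.
func collectErrorLogs() []string { return nil }
```

A test exercising a failure path would then opt in explicitly, e.g. NewFixture(t, WithExpectedErrors("connection refused")), keeping any global exception list short.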

@AndersonQ
Member

It is essentially an alert that there is a new error in our logs. Broadly, these failures fall into two categories:

  1. The log message shouldn't be at the error level to begin with.
  2. The error is real and has identified a probable bug.

While I understand your point @cmacknz, I've never thought about this test like this, and it might just show there are different views on what this test is actually about. As the test name TestLogIngestionFleetManaged states, it should be testing that logs are properly ingested, which it can do well even if it does not check for error logs. Actually, checking for error logs on ES pretty much just shows the logs are ingested.

Besides, there are several issues that are either transient (but that we want to know about) or fall into category 1: they aren't supposed to be logged as errors.

I think it'd be better to either look for ingestion-specific errors in the Filebeat logs or perhaps have a manual test responsible for checking all error logs in a given scenario and ensuring they're fine. That, I believe, would also help us better tune the error and warning logs we emit, as we could make the output of the manual test include a list of log-level corrections.

@belimawr
Contributor Author

Thanks for all the feedback, folks! I'll respond to some, if not all, of the key points raised here by quoting them and adding my thoughts. At the end I'll add some other thoughts/ideas.

It is essentially an alert that there is a new error in our logs. Broadly, these failures fall into two categories:

  1. The log message shouldn't be at the error level to begin with.
  2. The error is real and has identified a probable bug.

There is at least a 3rd category: errors that are expected, transient, and correctly logged at the error level.

An example I found today (which led me to create this PR) is connection errors to Elasticsearch that happen when the Elastic-Agent is installed and running but has not yet enrolled in Fleet. In this case all Beats will try to connect to http://127.0.0.1:9200. This is a perfect example of an error that is correctly logged and expected to happen, but should stop once the Elastic-Agent is enrolled.

Note: we should also be looking for panic and not just error logs.

100% agree with this; I believe it could be built into the test framework itself, as panics should never happen.

I've never thought about this test like this, and it might just show there are different views on what this test is actually about. As the test name TestLogIngestionFleetManaged states, it should be testing that logs are properly ingested, which it can do well even if it does not check for error logs.

I agree. As the name TestLogIngestionFleetManaged suggests, it should be testing that logs are correctly ingested for Fleet-managed agents, nothing else.

If there is a specific need to keep error logs in check, it should be done in every test, and every test should be able to define exceptions/expected errors when testing error scenarios.

While I like the idea, I believe in reality it will only create a chance of flakiness in all tests, which is pretty much what is happening with TestLogIngestionFleetManaged.

It is essentially an alert that there is a new error in our logs.
(...)
I agree that this test is extremely valuable to Elastic Agent as a product; I also agree that it causes some toil in our team to maintain and this may be a bit annoying.

However, something in our QA process needs to do this job of looking for new errors, and right now the best place for that is this integration test.

I had never seen this test or this assertion like that. I really thought it was just a way to make sure the Elastic-Agent was running, rather than something to uncover bugs that went under our radar and did not make the Agent/Beats unhealthy.

My suggestion

Now I see the value of looking for any new error log entry in normal execution of the Elastic-Agent as a way to uncover bugs/issues that were not caught by other tests or other ways of asserting the correct execution of the Elastic-Agent.

As we know, looking for any log.level: error is inherently flaky and introduces toil (this is at least the second time I have had to add exceptions to this test). So instead of having TestLogIngestionFleetManaged do this, we remove this assertion from it and create a new test (sketched after this comment) that will:

  1. Install a Fleet-managed Elastic-Agent (we could also have standalone)
  2. Wait for the Elastic-Agent to be healthy
  3. Ingest a fixed number of logs (possibly also some metrics)
  4. Shut down or uninstall the Elastic-Agent
  5. Assert that there are no error logs and no panics.

This way we will have a single test that will fail when a new error log appears, expected or not, and any engineer whose PR fails on it will have a good direction to start investigating the failure. We can name it very clearly, something like TestEnsureNoErrorLogsAppearOnNormalExecution.

This all could be done on this PR so it is merged as a single commit.
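A skeleton of what that test could look like; every helper below is a placeholder for whatever the integration framework provides, and only the five-step shape comes from the list above.

```go
// Skeleton of the proposed test. Every helper is a stand-in; only the
// five-step shape comes from the proposal.
package integration

import "testing"

// agentHandle stands in for whatever handle the framework returns.
type agentHandle struct{}

func installFleetManagedAgent(t *testing.T) *agentHandle     { t.Helper(); return &agentHandle{} }
func waitUntilHealthy(t *testing.T, a *agentHandle)          { t.Helper() }
func ingestSampleLogs(t *testing.T, a *agentHandle, n int)   { t.Helper() }
func uninstallAgent(t *testing.T, a *agentHandle)            { t.Helper() }
func assertNoErrorLogsOrPanics(t *testing.T, a *agentHandle) { t.Helper() }

func TestEnsureNoErrorLogsAppearOnNormalExecution(t *testing.T) {
	// 1. Install a Fleet-managed Elastic-Agent.
	agent := installFleetManagedAgent(t)

	// 2. Wait for the Elastic-Agent to report healthy.
	waitUntilHealthy(t, agent)

	// 3. Ingest a fixed number of logs (possibly also some metrics).
	ingestSampleLogs(t, agent, 100)

	// 4. Shut down or uninstall the Elastic-Agent.
	uninstallAgent(t, agent)

	// 5. Assert that there are no error logs and no panics.
	assertNoErrorLogsOrPanics(t, agent)
}
```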

@cmacknz
Member

cmacknz commented Dec 14, 2023

If we want to have more checks on the agent and component logs, it should be done as an assertion after each test (with or without the agent installed), not just in this one.

Agreed, this is better. The only reason this is confined to this one test is to gauge how annoying it will be to maintain. It would be better if every test had this check. It would be best if the product itself had this check and every true error appropriately updated the agent status.

An example I found today (which led me to create this PR) is connection errors to Elasticsearch that happen when the Elastic-Agent is installed and running but has not yet enrolled in Fleet. In this case all Beats will try to connect to http://127.0.0.1:9200. This is a perfect example of an error that is correctly logged and expected to happen, but should stop once the Elastic-Agent is enrolled.

Correct, this is not a real error, and it shouldn't be logged as such. I have a different view that connecting to 127.0.0.1 before we actually know the real output destination is a bug. We should not attempt to connect to anything until we get the first configuration from the control protocol.

As we know, looking for any log.level: error is inherently flaky and introduces toil (this is at least the second time I have had to add exceptions to this test). So instead of having TestLogIngestionFleetManaged do this, we remove this assertion from it and create a new test that will:

Install a Fleet-managed Elastic-Agent (we could also have standalone)
Wait for the Elastic-Agent to be healthy
Ingest a fixed number of logs (possibly also some metrics)
Shut down or uninstall the Elastic-Agent
Assert that there are no error logs and no panics.

Isn't this pretty much exactly what this test does? Except the fixed number of logs are the logs of the Elastic agent itself?

Integration and E2E tests are not unit tests, they can and should do more than one job. To the earlier point, I am more in favour of doing this on every test rather than only one of them or having the product do this as part of normal operation (this should be the true end goal).

@belimawr
Contributor Author

Correct, this is not a real error, and it shouldn't be logged as such. I have a different view that connecting to 127.0.0.1 before we actually know the real output destination is a bug. We should not attempt to connect to anything until we get the first configuration from the control protocol.

I totally agree with this. At a quick glance at the current exceptions, I'd consider most of them bugs.

Isn't this pretty much exactly what this test does? Except the fixed number of logs are the logs of the Elastic agent itself?

I always understood the focus of this test to be ensuring we can run the Elastic-Agent with a simple integration, more like an E2E test to make sure we didn't break anything on this basic flow.

Integration and E2E tests are not unit tests, they can and should do more than one job. To the earlier point, I am more in favour of doing this on every test rather than only one of them or having the product do this as part of normal operation (this should be the true end goal).

I'm not against it, but we should try to make it clear in the test error message and the code itself what the purpose of this assertion is. I know I didn't understand it until now; maybe other people thought the same.

My main point here is to make it very clear what we expect of this assertion so that 6 months from now nobody will just see it as a flaky test again.

@cmacknz
Member

cmacknz commented Dec 14, 2023

I'm not against it, but we should try to make it clear in the test error message and the code itself what the purpose of this assertion is. I know I didn't understand it until now; maybe other people thought the same.

My main point here is to make it very clear what we expect of this assertion so that 6 months from now nobody will just see it as a flaky test again.

Ah yes, actually looking at the error message, it doesn't justify its existence. We should fix that and probably include the reason for the assertion in the test failure message.

@blakerouse
Contributor

I 100% agree that we should add to fixture.Install the same checks that are in fixture.Run. We should be ensuring in every test that no error is reported. If an error is reported in our logs, it is a bug, unless the test is specifically expecting or testing for that error.

@belimawr
Contributor Author

Thinking a bit more about this PR, and to avoid turning it into a change to the testing framework, I think I could:

  1. Rename TestLogIngestionFleetManaged to TestNoErrorLoggedInSimpleLogIngestCase
  2. Keep the "no error in logs" assertion, but improve the failure message and overall output when it fails (see the sketch below)
  3. Create a task to add this "no error in logs" concept as a default behaviour of the integration test framework
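For point 2, a hypothetical example of a failure message that carries the assertion's purpose, so a future reader does not dismiss a failure as plain flakiness; everything here is illustrative.

```go
// Hypothetical example of a self-documenting failure message for the
// "no error in logs" assertion, per the discussion above.
package integration

import (
	"strings"
	"testing"
)

func reportUnexpectedErrors(t *testing.T, unexpected []string) {
	t.Helper()
	if len(unexpected) == 0 {
		return
	}
	// Explain why the assertion exists, so a failure is not dismissed as
	// flakiness six months from now.
	t.Errorf("found %d unexpected error log(s). This assertion exists to catch "+
		"real bugs hiding in the logs, not to verify log ingestion. If an error "+
		"is expected and transient, add it to the exception list; if it should "+
		"not be logged at error level, fix it at the source.\n%s",
		len(unexpected), strings.Join(unexpected, "\n"))
}
```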

Contributor

mergify bot commented Dec 18, 2023

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-flakiness upstream/fix-flakiness
git merge upstream/main
git push upstream fix-flakiness

@pierrehilbert
Contributor

Putting this back to draft for now.

@pierrehilbert pierrehilbert marked this pull request as draft January 9, 2024 13:38
Contributor

mergify bot commented Apr 15, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-flakiness upstream/fix-flakiness
git merge upstream/main
git push upstream fix-flakiness

@jlind23
Contributor

jlind23 commented Jul 2, 2024

Cleaning this up a bit, closing for now.

@jlind23 jlind23 closed this Jul 2, 2024