
Remove flaky integration test assertion #3917

Closed
wants to merge 1 commit

Conversation

belimawr
Contributor

What does this PR do?

Remove flaky integration test assertion

Asserting that there are no errors in the logs from the Elastic-Agent and all Beats is flaky and does not ensure the Elastic-Agent is working correctly. The test already asserts the health of all components, so there is no need to look at the logs.

The number of exceptions this no-log-errors assertion already requires is an example of how fragile it is. The Elastic-Agent life cycle is complex and some transient errors are expected; as the code evolves, those errors and messages will change, making assertions on them flaky.
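For illustration, here is a minimal Go sketch of the kind of assertion under discussion: query Elasticsearch for error-level log entries and fail unless each one matches a hard-coded exception list. The helper names, the query, and the exception entries are hypothetical, not the actual test code.

```go
// Hypothetical sketch of the assertion being removed: search the ingested
// logs for error-level entries and fail on any that is not on a known
// exception list. Index pattern, query, and exceptions are illustrative.
package integration

import (
	"bytes"
	"encoding/json"
	"net/http"
	"strings"
	"testing"
)

// knownTransientErrors is the exception list the description calls out as a
// symptom of fragility: it must grow every time a new transient error appears.
var knownTransientErrors = []string{
	"Failed to connect to backoff(elasticsearch", // hypothetical entry
	"context canceled",                           // hypothetical entry
}

func assertNoErrorLogs(t *testing.T, esURL string) {
	t.Helper()

	// Match any error-level log from the agent or its components.
	query := `{"query":{"term":{"log.level":"error"}},"size":100}`
	resp, err := http.Post(esURL+"/logs-*/_search", "application/json",
		bytes.NewBufferString(query))
	if err != nil {
		t.Fatalf("querying Elasticsearch: %v", err)
	}
	defer resp.Body.Close()

	var result struct {
		Hits struct {
			Hits []struct {
				Source struct {
					Message string `json:"message"`
				} `json:"_source"`
			} `json:"hits"`
		} `json:"hits"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		t.Fatalf("decoding search response: %v", err)
	}

	for _, hit := range result.Hits.Hits {
		msg := hit.Source.Message
		expected := false
		for _, known := range knownTransientErrors {
			if strings.Contains(msg, known) {
				expected = true
				break
			}
		}
		if !expected {
			t.Errorf("unexpected error log: %s", msg)
		}
	}
}
```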

Why is it important?

It removes test flakiness

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool (https://github.com/elastic/elastic-agent#changelog)
  • I have added an integration test or an E2E test

Author's Checklist

  • No flakiness on TestLogIngestionFleetManaged is what ensures this PR does not introduce bugs.

How to test this PR locally

Just run the test:

SNAPSHOT=true TEST_PLATFORMS="linux/amd64" mage -v integration:single TestLogIngestionFleetManaged



@belimawr belimawr added the Team:Elastic-Agent (Label for the Agent team) and skip-changelog labels Dec 14, 2023
@belimawr belimawr requested a review from a team as a code owner December 14, 2023 14:44
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

Contributor

mergify bot commented Dec 14, 2023

This pull request does not have a backport label. Could you fix it @belimawr? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch, where \d is a digit

NOTE: backport-skip has been added to this pull request.

@pierrehilbert
Contributor

Are we sure we want to remove this one?
I understand it's a bit fragile, but at least it allows us to spot when we introduce new unwanted errors in our logs.
cc @cmacknz what's your opinion?

@amitkanfer
Contributor

I'd love to have an online discussion regarding this PR. I'm a big fan of doing everything needed to KEEP this code in.

@amitkanfer
Contributor

The number of exceptions this no-log-errors assertion already requires is an example of how fragile it is. The Elastic-Agent life cycle is complex and some transient errors are expected; as the code evolves, those errors and messages will change, making assertions on them flaky.

The list is short IMHO... I expect it to grow to even 100+ and that's also OK. I don't think it's an indication of it being fragile; it's just that we have tests that expose the error flows, as expected, and that's perfectly fine.
The value-add of keeping this code is far greater than that of removing it.
I also don't see how this is "flaky": either errors are thrown, or not.


Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarQube

@amitkanfer
Contributor

One more thing we can improve is to provide a list of expected errors per test, in addition to a global list like we have here... it will help make the list shorter.

@cmacknz
Member

cmacknz commented Dec 14, 2023

The description is correct: this is not strictly a test of the code our team owns.

It is essentially an alert that there is a new error in our logs. Broadly, these failures fall into two categories:

  1. The log message shouldn't be at the error level to begin with.
  2. The error is real and has identified a probable bug.

Case 2 is the important part. What the test is trying to do is ensure we don't ship a bug to customers that was hiding in our logs without anybody noticing. We have done this before. Since this test is new, many of the errors are just falling into case 1. We can ignore these and fix them to not be at the error level.

I agree that this test is extremely valuable to Elastic Agent as a product; I also agree that it causes some toil in our team to maintain and this may be a bit annoying.

However, something in our QA process needs to do this job of looking for new errors, and right now the best place for that is this integration test.

Note: we should also be looking for panics and not just error logs.

@pchila
Member

pchila commented Dec 14, 2023

We already have an error log check when running the agent without installing it via fixture.Run(), unless otherwise specified (there's an option to disable this check when creating the fixture).

If we want to have more checks on the agent and component logs, it should be done as an assertion after each test (with or without the agent installed), not just in this one.

Keeping only this specific testcase as a safeguard is:

  • annoying for all cases that fall into category 1
  • incomplete for all the other testcases that we run for which we are interested in error logs that fall in category 2

If there is a specific need to keep error logs in check, it should be done in every test, and every test should be able to define exceptions/expected errors when testing error scenarios (see the sketch below).
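To make the suggestion concrete, here is a sketch of how per-test expected errors could look, assuming a hypothetical fixture API with functional options; the real framework's fixture API may differ.

```go
// Hypothetical sketch of per-test expected errors: each test registers the
// error substrings it is allowed to produce, and the no-error check runs
// for every test via t.Cleanup. The fixture API here is invented for
// illustration and does not match the real framework.
package integration

import (
	"strings"
	"testing"
)

// Fixture stands in for the integration test fixture.
type Fixture struct {
	expectedErrors []string
}

// WithExpectedErrors lets a test declare errors it deliberately triggers.
func WithExpectedErrors(patterns ...string) func(*Fixture) {
	return func(f *Fixture) {
		f.expectedErrors = append(f.expectedErrors, patterns...)
	}
}

// NewFixture wires the error-log check into every test's cleanup phase.
func NewFixture(t *testing.T, opts ...func(*Fixture)) *Fixture {
	f := &Fixture{}
	for _, opt := range opts {
		opt(f)
	}
	t.Cleanup(func() {
		for _, line := range collectErrorLogs() {
			if !f.isExpected(line) {
				t.Errorf("unexpected error log: %s", line)
			}
		}
	})
	return f
}

func (f *Fixture) isExpected(line string) bool {
	for _, p := range f.expectedErrors {
		if strings.Contains(line, p) {
			return true
		}
	}
	return false
}

// collectErrorLogs would read agent and component logs; elided here.
func collectErrorLogs() []string { return nil }
```

A test exercising a failure path would then opt in explicitly, e.g. NewFixture(t, WithExpectedErrors("connection refused")), keeping any global exception list short.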

@AndersonQ
Member

It is essentially an alert that there is a new error in our logs. Broadly, these failures fall into two categories:

  1. The log message shouldn't be at the error level to begin with.
  2. The error is real and has identified a probable bug.

While I understand your point @cmacknz, I've never thought about this test like this, and it might just show there are different views on what this test is actually about. As the test name TestLogIngestionFleetManaged states, it should be testing that logs are properly ingested, which it can do well even if it does not check for error logs. Actually, checking for error logs on ES pretty much just shows the logs are ingested.

Besides, there are several issues that are either transient (but that we want to know about) or fall into category 1: they aren't supposed to be logged as errors.

I think it'd be better to either look for ingestion-specific errors in the Filebeat logs or perhaps have a manual test responsible for checking all error logs in a given scenario and ensuring they're fine. That, I believe, would also help us better tune the error and warning logs we emit, as we could make the output of the manual test include a list of log-level corrections.

@belimawr
Contributor Author

Thanks for all the feedback, folks! I'll respond to some, if not all, of the key points raised here by quoting them and adding my thoughts. At the end I'll add some other thoughts/ideas.

It is essentially an alert that there is a new error in our logs. Broadly, these failures fall into two categories:

  1. The log message shouldn't be at the error level to begin with.
  2. The error is real and has identified a probable bug.

There is at least a 3rd category: errors that are expected, transient, and correctly logged at the error level.

An example I found today (which led me to create this PR) is connection errors to Elasticsearch that happen when the Elastic-Agent is installed and running but has not yet enrolled in Fleet. In this case all Beats will try to connect to http://127.0.0.1:9200. This is a perfect example of an error that is correctly logged and expected to happen, but should stop once the Elastic-Agent is enrolled.

Note: we should also be looking for panic and not just error logs.

100% agree with this; I believe it could be built into the test framework itself, as panics should never happen.

I've never thought about this test like this, and it might just show there are different views on what this test is actually about. As the test name TestLogIngestionFleetManaged states, it should be testing that logs are properly ingested, which it can do well even if it does not check for error logs.

I agree. As the name TestLogIngestionFleetManaged suggests, it should be testing that logs are correctly ingested for Fleet-managed agents, nothing else.

If there is a specific need to keep error logs in check, it should be done in every test, and every test should be able to define exceptions/expected errors when testing error scenarios.

While I like the idea, I believe in reality it will only create a chance of flakiness in all tests, which is pretty much what is happening with TestLogIngestionFleetManaged.

It is essentially an alert that there is a new error in our logs.
(...)
I agree that this test is extremely valuable to Elastic Agent as a product; I also agree that it causes some toil in our team to maintain and this may be a bit annoying.

However, something in our QA process needs to do this job of looking for new errors, and right now the best place for that is this integration test.

I had never seen this test or this assertion like that. I really thought it was just a way to make sure the Elastic-Agent was running, rather than something to uncover bugs that went under our radar and did not make the Agent/Beats unhealthy.

My suggestion

Now I see the value of looking for any new error log entry in normal execution of the Elastic-Agent as a way to uncover bugs/issues that were not caught by other tests or other ways of asserting the correct execution of the Elastic-Agent.

As we know, looking for any log.level: error is inherently flaky and introduces toil (this is at least the second time I have had to add exceptions to this test). So instead of having TestLogIngestionFleetManaged do this, we remove this assertion from it and create a new test (sketched after this comment) that will:

  1. Install a Fleet-managed Elastic-Agent (we could also have standalone)
  2. Wait for the Elastic-Agent to be healthy
  3. Ingest a fixed number of logs (possibly also some metrics)
  4. Shut down or uninstall the Elastic-Agent
  5. Assert that there are no error logs and no panics.

This way we will have a single test that will fail when a new error log appears, expected or not, and any engineer whose PR fails on it will have a good direction to start investigating the failure. We can name it very clearly, something like TestEnsureNoErrorLogsAppearOnNormalExecution.

This all could be done on this PR so it is merged as a single commit.
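A skeleton of what that test could look like; every helper below is a placeholder for whatever the integration framework provides, and only the five-step shape comes from the list above.

```go
// Skeleton of the proposed test. Every helper is a stand-in; only the
// five-step shape comes from the proposal.
package integration

import "testing"

// agentHandle stands in for whatever handle the framework returns.
type agentHandle struct{}

func installFleetManagedAgent(t *testing.T) *agentHandle     { t.Helper(); return &agentHandle{} }
func waitUntilHealthy(t *testing.T, a *agentHandle)          { t.Helper() }
func ingestSampleLogs(t *testing.T, a *agentHandle, n int)   { t.Helper() }
func uninstallAgent(t *testing.T, a *agentHandle)            { t.Helper() }
func assertNoErrorLogsOrPanics(t *testing.T, a *agentHandle) { t.Helper() }

func TestEnsureNoErrorLogsAppearOnNormalExecution(t *testing.T) {
	// 1. Install a Fleet-managed Elastic-Agent.
	agent := installFleetManagedAgent(t)

	// 2. Wait for the Elastic-Agent to report healthy.
	waitUntilHealthy(t, agent)

	// 3. Ingest a fixed number of logs (possibly also some metrics).
	ingestSampleLogs(t, agent, 100)

	// 4. Shut down or uninstall the Elastic-Agent.
	uninstallAgent(t, agent)

	// 5. Assert that there are no error logs and no panics.
	assertNoErrorLogsOrPanics(t, agent)
}
```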

@cmacknz
Member

cmacknz commented Dec 14, 2023

If we want to have more checks on the agent and component logs, it should be done as an assertion after each test (with or without the agent installed), not just in this one.

Agreed, this is better. The only reason this is confined to this one test is to gauge how annoying it will be to maintain. It would be better if every test had this check. It would be best if the product itself had this check and every true error appropriately updated the agent status.

An example I found today (which led me to create this PR) is connection errors to Elasticsearch that happen when the Elastic-Agent is installed and running but has not yet enrolled in Fleet. In this case all Beats will try to connect to http://127.0.0.1:9200. This is a perfect example of an error that is correctly logged and expected to happen, but should stop once the Elastic-Agent is enrolled.

Correct, this is not a real error, and it shouldn't be logged as such. I have a different view that connecting to 127.0.0.1 before we actually know the real output destination is a bug. We should not attempt to connect to anything until we get the first configuration from the control protocol.

As we know, looking for any log.level: error is inherently flaky and introduces toil (this is at least the second time I have had to add exceptions to this test). So instead of having TestLogIngestionFleetManaged do this, we remove this assertion from it and create a new test that will:

Install a Fleet-managed Elastic-Agent (we could also have standalone)
Wait for the Elastic-Agent to be healthy
Ingest a fixed number of logs (possibly also some metrics)
Shut down or uninstall the Elastic-Agent
Assert that there are no error logs and no panics.

Isn't this pretty much exactly what this test does? Except the fixed number of logs are the logs of the Elastic agent itself?

Integration and E2E tests are not unit tests, they can and should do more than one job. To the earlier point, I am more in favour of doing this on every test rather than only one of them or having the product do this as part of normal operation (this should be the true end goal).

@belimawr
Contributor Author

Correct, this is not a real error, and it shouldn't be logged as such. I have a different view that connecting to 127.0.0.1 before we actually know the real output destination is a bug. We should not attempt to connect to anything until we get the first configuration from the control protocol.

I totally agree with this. At a quick glance at the current exceptions, I'd consider most of them bugs.

Isn't this pretty much exactly what this test does? Except the fixed number of logs are the logs of the Elastic agent itself?

I always understood the focus of this test to be ensuring we can run the Elastic-Agent with a simple integration, more like an E2E test to make sure we didn't break anything on this basic flow.

Integration and E2E tests are not unit tests, they can and should do more than one job. To the earlier point, I am more in favour of doing this on every test rather than only one of them or having the product do this as part of normal operation (this should be the true end goal).

I'm not against it, but we should try to make it clear in the test error message and the code itself what the purpose of this assertion is. I know I didn't understand it until now; maybe other people thought the same.

My main point here is to make it very clear what we expect of this assertion so that 6 months from now nobody will just see it as a flaky test again.

@cmacknz
Member

cmacknz commented Dec 14, 2023

I'm not against it, but we should try to make it clear in the test error message and the code itself what the purpose of this assertion is. I know I didn't understand it until now; maybe other people thought the same.

My main point here is to make it very clear what we expect of this assertion so that 6 months from now nobody will just see it as a flaky test again.

Ah yes, actually looking at the error message, it doesn't justify its existence. We should fix that and probably include the reason for the assertion in the test failure message.

@blakerouse
Contributor

I 100% agree that we should add to fixture.Install the same checks that are in fixture.Run. We should be ensuring in every test that no error is reported. If an error is reported in our logs, it is a bug, unless the test is specifically expecting or testing for that error.

@belimawr
Contributor Author

Thinking a bit more about this PR, and to avoid turning it into a change to the testing framework, I think I could:

  1. Rename TestLogIngestionFleetManaged to TestNoErrorLoggedInSimpleLogIngestCase
  2. Keep the "no error in logs" assertion, but improve the failure message and overall output when it fails (see the sketch below)
  3. Create a task to add this "no error in logs" concept as a default behaviour of the integration test framework
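For point 2, a hypothetical example of a failure message that carries the assertion's purpose, so a future reader does not dismiss a failure as plain flakiness; everything here is illustrative.

```go
// Hypothetical example of a self-documenting failure message for the
// "no error in logs" assertion, per the discussion above.
package integration

import (
	"strings"
	"testing"
)

func reportUnexpectedErrors(t *testing.T, unexpected []string) {
	t.Helper()
	if len(unexpected) == 0 {
		return
	}
	// Explain why the assertion exists, so a failure is not dismissed as
	// flakiness six months from now.
	t.Errorf("found %d unexpected error log(s). This assertion exists to catch "+
		"real bugs hiding in the logs, not to verify log ingestion. If an error "+
		"is expected and transient, add it to the exception list; if it should "+
		"not be logged at error level, fix it at the source.\n%s",
		len(unexpected), strings.Join(unexpected, "\n"))
}
```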

Contributor

mergify bot commented Dec 18, 2023

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-flakiness upstream/fix-flakiness
git merge upstream/main
git push upstream fix-flakiness

@pierrehilbert
Contributor

Putting this back to draft for now.

@pierrehilbert pierrehilbert marked this pull request as draft January 9, 2024 13:38
Contributor

mergify bot commented Apr 15, 2024

This pull request is now in conflict. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-flakiness upstream/fix-flakiness
git merge upstream/main
git push upstream fix-flakiness

@jlind23
Contributor

jlind23 commented Jul 2, 2024

Cleaning this up a bit, closing for now.

@jlind23 jlind23 closed this Jul 2, 2024