-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[filebeat][azure-blob-storage] - Fixed concurrency & flakey tests issue #36124
Conversation
Pinging @elastic/security-external-integrations (Team:Security-External Integrations) |
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
This pull request is now in conflicts. Could you fix it? 🙏
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you confirmed that the tests here fail without the fix?
@efd6 The concurrency error only happens when using multiple workers at scale with 1000s of records, I'm not sure how we can simulate that without an Integration test. |
Would it not be possible to make a mock handler that just responds with arbitrary random object data and then run a test with multiple workers pulling from that mock? There does not need to be any comparison of the values, just an absence of a concurrent map-write throw. |
Will try this and update. |
@efd6 I've update the PR with suitable concurrency tests having random blob generation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm that the new tests fail without the fix?
I have run the tests with the main branch's code in job.go and state.go, and the tests pass unless -race
is passed. In the case that -race
is passed, only the 5000 workers case fails. This failure is consistent, but appears to be purely a timeout effect caused by the added load the race detector imposes; increasing the timeout on the test to 200s allows it to pass.
@efd6 |
I suggest writing a new beat client implementation (these are small, so it should not be onerous) that mutates the event. This was the cause of the issue, so it should be detectable that way. |
@efd6 I've updated the pr with the suggested changes. For me 3000 workers is not timing out and is passing with the |
@efd6 I've updated the test case code. |
…ue (elastic#36124) ## Type of change - Bug ## What does this PR do? This PR fixes the concurrency issues present in the azure blob storage input and the flakey tests issue. ## Why is it important? Concurrent ops were failing at scale and this fix addresses that issue. ## Checklist - [x] My code follows the style guidelines of this project - [x] I have commented my code, particularly in hard-to-understand areas ~~- [ ] I have made corresponding changes to the documentation~~ ~~- [ ] I have made corresponding change to the default configuration files~~ - [x] I have added tests that prove my fix is effective or that my feature works - [x] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`. ## Author's Checklist <!-- Recommended Add a checklist of things that are required to be reviewed in order to have the PR approved --> - [ ] ## How to test this PR locally <!-- Recommended Explain here how this PR will be tested by the reviewer: commands, dependencies, steps, etc. --> ## Related issues - Relates elastic#35983 ## Use cases <!-- Recommended Explain here the different behaviors that this PR introduces or modifies in this project, user roles, environment configuration, etc. If you are familiar with Gherkin test scenarios, we recommend its usage: https://cucumber.io/docs/gherkin/reference/ --> ## Screenshots <!-- Optional Add here screenshots about how the project will be changed after the PR is applied. They could be related to web pages, terminal, etc, or any other image you consider important to be shared with the team. --> ## Logs <!-- Recommended Paste here output logs discovered while creating this PR, such as stack traces or integration logs, or any other output you consider important to be shared with the team. -->
Type of change
What does this PR do?
This PR fixes the concurrency issues present in the azure blob storage input and the flakey tests issue.
Why is it important?
Concurrent ops were failing at scale and this fix addresses that issue.
Checklist
- [ ] I have made corresponding changes to the documentation- [ ] I have made corresponding change to the default configuration filesCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
How to test this PR locally
Related issues
Use cases
Screenshots
Logs