[8.14](backport #39131) Fix concurrency bugs that could cause data loss in the aws-s3 input #39262
Merged
Conversation
Fix concurrency bugs that could cause data loss in the aws-s3 input (#39131)

This is a cleanup of concurrency and error handling in the `aws-s3` input that could cause several known bugs:

- Memory leaks ([1](elastic/integrations#9463), [2](#39052)). This issue was caused because the input could run several scans of its S3 bucket simultaneously, which led to the cleanup routine `s3Poller.Purge` being called many times concurrently. Inefficiencies in this function caused it to accumulate over time, creating many copies of the state data, which could overload process memory. Fixed by:
  * Changing the `s3Poller` run loop to run only one scan at a time and wait for it to complete before starting the next one (a simplified Go sketch of this pattern follows the commit message).
  * Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket's worth of metadata at once.
  * This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (page of ~1K events during bucket enumeration), so the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations ([3](#39065)). The most recent scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
  * Fixed by removing the bucket-wide last-modified check entirely. This feature was already risky, since objects with earlier creation timestamps can appear after ones with later timestamps, so there was always the possibility of missing objects. Since the value was calculated incorrectly and was discarded between runs, we can remove it without breaking compatibility and reimplement it more safely in the future if needed.
- Skipped data because rate limiting is treated as permanent failure ([4](#39114)). The input treats all error types the same, which causes many objects to be skipped for ephemeral errors.
  * Fixed by creating an error, `errS3DownloadFailure`, that is returned when a processing failure is caused by a download error. In this case, the S3 workers will not persist the failure to the `states` table, so the object will be retried on the next bucket scan. When this happens, the worker also sleeps (using an exponential backoff) before trying the next object.
  * Exponential backoff was also added to the bucket scanning loop for page listing errors, so the bucket scan is not restarted needlessly.

(cherry picked from commit e588628) # Conflicts: # x-pack/filebeat/input/awss3/input.go
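For orientation, here is a minimal Go sketch of the "one scan at a time, persist per object" pattern described above. It is not the actual Beats code: the `poller`, `store`, and `objectState` names and fields are invented for illustration, and the goroutine-per-object fan-out stands in for the input's real worker scheduling; only the control flow (strictly sequential scans, per-object state persistence instead of an end-of-scan bulk write) mirrors the description.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// objectState is a hypothetical stand-in for the per-object metadata the
// input keeps; only the fields needed for the sketch exist.
type objectState struct {
	Bucket, Key string
	Stored      bool
}

// store stands in for the persistent states table. Each object is written
// individually as soon as it finishes, instead of in one bulk write per scan.
type store struct {
	mu     sync.Mutex
	states map[string]objectState
}

func (s *store) persist(st objectState) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.states[st.Bucket+"/"+st.Key] = st
}

// poller is an illustrative analogue of the s3Poller described above.
type poller struct {
	interval time.Duration
	store    *store
	list     func(ctx context.Context) []objectState        // enumerate the bucket
	process  func(ctx context.Context, o objectState) error // download + publish one object
}

// run executes scans strictly one at a time: each scan must finish completely
// before the next one starts, so cleanup work can never pile up concurrently.
func (p *poller) run(ctx context.Context) {
	for ctx.Err() == nil {
		p.scan(ctx)
		select {
		case <-time.After(p.interval):
		case <-ctx.Done():
		}
	}
}

// scan processes every listed object and waits for all workers to finish.
// Each worker persists its own object's state on success; failures are
// skipped so the object is retried on the next scan.
func (p *poller) scan(ctx context.Context) {
	var wg sync.WaitGroup
	for _, obj := range p.list(ctx) {
		wg.Add(1)
		go func(o objectState) {
			defer wg.Done()
			if err := p.process(ctx, o); err != nil {
				return // not persisted: the object is revisited on the next scan
			}
			o.Stored = true
			p.store.persist(o) // per-object persistence, no end-of-scan bulk write
		}(obj)
	}
	wg.Wait()
}

func main() {
	p := &poller{
		interval: 200 * time.Millisecond,
		store:    &store{states: map[string]objectState{}},
		list: func(context.Context) []objectState {
			return []objectState{{Bucket: "demo", Key: "a"}, {Bucket: "demo", Key: "b"}}
		},
		process: func(context.Context, objectState) error { return nil },
	}
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	p.run(ctx)
	fmt.Println("persisted object states:", len(p.store.states))
}
```

The key property of this sketch is that `scan` returns only after every worker has finished, so the next scan, and any cleanup it performs, can never overlap with the previous one.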
mergify bot added the backport and conflicts (There is a conflict in the backported pull request) labels on Apr 29, 2024
Cherry-pick of e588628 has failed:
To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label on Apr 29, 2024
This pull request doesn't have a `Team:<area>` label.
zmoog approved these changes on Apr 29, 2024
This is a cleanup of concurrency and error handling in the `aws-s3` input that could cause several known bugs:

- Memory leaks: the input could run several scans of its S3 bucket simultaneously, which led to the cleanup routine `s3Poller.Purge` being called many times concurrently. Inefficiencies in this function caused it to accumulate over time, creating many copies of the state data which could overload process memory. Fixed by:
  * Changing the `s3Poller` run loop to only run one scan at a time, and wait for it to complete before starting the next one.
  * Having each object persist its own state after completing, so there is no longer any reason to track the detailed acknowledgment state of each listing and the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations: fixed by removing the bucket-wide last-modified check entirely.
- Skipped data because rate limiting was treated as permanent failure: fixed by creating an error, `errS3DownloadFailure`, that is returned when processing failure is caused by a download error. In this case, the S3 workers will not persist the failure to the `states` table, so the object will be retried on the next bucket scan. When this happens the worker also sleeps (using an exponential backoff) before trying the next object (see the sketch after this list).
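As a rough illustration of the retry behavior in the last bullet, the following Go sketch separates transient download failures from permanent ones. Only the `errS3DownloadFailure` name comes from the PR; `processObject`, the backoff constants, and the worker loop are hypothetical stand-ins for the real `aws-s3` worker code.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errS3DownloadFailure is the sentinel described in the PR: processing
// failures caused by a download error are wrapped in it so they can be told
// apart from permanent failures. Everything else here is an invented stand-in.
var errS3DownloadFailure = errors.New("s3 download failure")

// processObject simulates downloading and publishing one object; roughly a
// third of the calls fail with a (possibly transient) download error.
func processObject(key string) error {
	if rand.Intn(3) == 0 {
		return fmt.Errorf("object %q: %w", key, errS3DownloadFailure)
	}
	return nil
}

// worker walks the objects from one bucket scan. Download failures are not
// persisted (so the object is retried on the next scan) and trigger an
// exponential backoff before the next object is attempted.
func worker(keys []string) {
	const maxWait = 5 * time.Second
	wait := 250 * time.Millisecond

	for _, key := range keys {
		err := processObject(key)
		switch {
		case err == nil:
			// Success: only here would the object's state be persisted, so it
			// is not revisited on the next bucket scan.
			wait = 250 * time.Millisecond // reset the backoff after a success
		case errors.Is(err, errS3DownloadFailure):
			// Transient download failure: skip persistence and back off.
			fmt.Println("transient, will retry next scan:", err)
			time.Sleep(wait)
			wait *= 2
			if wait > maxWait {
				wait = maxWait
			}
		default:
			// Any other failure is treated as permanent for this object.
			fmt.Println("permanent:", err)
		}
	}
}

func main() {
	worker([]string{"logs/a.json", "logs/b.json", "logs/c.json"})
}
```

The same backoff idea is applied in the PR to the bucket-listing loop, so repeated page-listing errors do not restart the scan in a tight loop.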
Checklist

- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Results
Comparison when ingesting a bucket of 1.9 million objects (bucket/auth data redacted from the test configuration):
Without this PR
After ingesting 218K events in 1:15, ingestion stopped permanently.
With this PR
1.9 million events ingested in 3 hours. Ingestion then continues at a much lower rate as the input begins the next bucket scan, picking up new entries and retrying failures from the last pass.
This ingestion is now output-limited: the slowdown visible around 11:30 was caused by Elasticsearch-side throttling producing `429 Too Many Requests` responses, not by any issue with the input.

Related issues
- `aws-s3` input writes to Filebeat registry without proper synchronization #39052
- `aws-s3` input skips events and slows ingestion based on object creation timestamp #39065
- `aws-s3` input treats client rate limiting as permanent failure #39114

This is an automatic backport of pull request #39131 done by [Mergify](https://mergify.com).