
[8.14](backport #39131) Fix concurrency bugs that could cause data loss in the aws-s3 input #39262

Merged
faec merged 2 commits into 8.14 from mergify/bp/8.14/pr-39131 on Apr 29, 2024

Conversation

mergify[bot] (Contributor) commented Apr 29, 2024

This cleans up concurrency and error handling in the aws-s3 input, fixing several known bugs:

  • Memory leaks (elastic/integrations#9463, #39052). These occurred because the input could run several scans of its S3 bucket simultaneously, which led to the cleanup routine s3Poller.Purge being called many times concurrently. Inefficiencies in this function caused it to accumulate many copies of the state data over time, which could overload process memory. Fixed by:
    • Changing the s3Poller run loop to run only one scan at a time and wait for it to complete before starting the next one.
    • Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket's worth of metadata at once.
      • This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (a page of ~1K events during bucket enumeration), so the states helper object is now much simpler.
  • Skipped data due to buggy last-modified calculations (#39065). The most-recently-scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
    • Fixed by removing the bucket-wide last-modified check entirely. This feature was already risky, since objects with earlier creation timestamps can appear after ones with later timestamps, so it could always miss objects. Since the value was calculated incorrectly and was discarded between runs, it can be removed without breaking compatibility and reimplemented more safely in the future if needed.
  • Skipped data because rate limiting was treated as a permanent failure (#39114). The input treated all error types the same, which caused many objects to be skipped after ephemeral errors.
    • Fixed by creating an error, errS3DownloadFailure, that is returned when a processing failure is caused by a download error. In this case the S3 workers do not persist the failure to the states table, so the object is retried on the next bucket scan. When this happens the worker also sleeps (using an exponential backoff) before trying the next object; see the Go sketch after this list.
    • Exponential backoff was also added to the bucket-scanning loop for page-listing errors, so the bucket scan is not restarted needlessly.
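
For concreteness, here is a minimal Go sketch of the control flow described above. It is not the actual Beats implementation: the names (runPoller, stateStore, object, list, process) and the backoff caps are illustrative. It only shows the shape of the fix: one bucket scan at a time, per-object state persistence, transient download failures left unpersisted so the object is retried on the next scan, and exponential backoff on both listing and download errors.

// Package awss3sketch is an illustrative sketch, not the Beats implementation.
package awss3sketch

import (
    "context"
    "errors"
    "log"
    "time"
)

// errS3DownloadFailure marks transient failures (e.g. rate limiting) whose
// objects should be retried on the next bucket scan instead of being recorded
// as permanently failed.
var errS3DownloadFailure = errors.New("s3 object download failure")

// object stands in for one entry from a bucket listing.
type object struct{ key string }

// stateStore persists the completion state of a single object as soon as it
// finishes, rather than writing a whole bucket's worth of metadata at the end
// of a scan.
type stateStore interface {
    AddState(key string, stored bool) error
}

// runPoller runs one bucket scan at a time, waiting for each scan to complete
// before starting the next one.
func runPoller(ctx context.Context, list func() ([]object, error), process func(object) error, st stateStore) {
    listBackoff := time.Second
    for ctx.Err() == nil {
        objects, err := list()
        if err != nil {
            // Back off on page-listing errors instead of restarting the scan immediately.
            time.Sleep(listBackoff)
            if listBackoff < time.Minute {
                listBackoff *= 2
            }
            continue
        }
        listBackoff = time.Second

        downloadBackoff := time.Second
        for _, obj := range objects {
            err := process(obj)
            if errors.Is(err, errS3DownloadFailure) {
                // Transient failure: don't persist it, so the object is retried
                // on the next scan, and sleep before trying the next object.
                time.Sleep(downloadBackoff)
                if downloadBackoff < time.Minute {
                    downloadBackoff *= 2
                }
                continue
            }
            downloadBackoff = time.Second
            if err != nil {
                log.Printf("permanent failure processing %q: %v", obj.key, err)
            }
            // Success (or permanent failure): record this object's state now so
            // it is not reprocessed on the next scan.
            if stErr := st.AddState(obj.key, err == nil); stErr != nil {
                log.Printf("failed to persist state for %q: %v", obj.key, stErr)
            }
        }
    }
}

The real input uses its own states helper and backoff utilities; the sketch only shows where each decision is made.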

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Results

Comparison when ingesting a bucket of 1.9 million objects using the following configuration (bucket/auth data redacted):

filebeat.inputs:
- type: aws-s3
  number_of_workers: 200        # S3 object processing workers

output.elasticsearch:
  allow_older_versions: true    # allow sending to an older Elasticsearch version
  worker: 10                    # Elasticsearch output workers

queue.mem.flush.timeout: 0      # make queued events available to the output immediately

Without this PR

[screenshot: s3-bucket-before]

After ingesting 218K events in 1:15, ingestion stopped permanently.

With this PR

[screenshot: s3-bucket-after]

1.9 million events ingested in 3 hours. Ingestion then continues at a much lower rate as the input begins the next bucket scan, picking up new entries and retrying failures from the last pass.

This ingestion is now output-limited: the slowdown visible around 11:30 was caused by Elasticsearch-side throttling (429 Too Many Requests responses), not by any issue with the input.

Related issues


This is an automatic backport of pull request #39131 done by [Mergify](https://mergify.com).

Commit …#39131: the commit message repeats the PR description above.

(cherry picked from commit e588628)

# Conflicts:
#	x-pack/filebeat/input/awss3/input.go
mergify[bot] requested a review from a team as a code owner, April 29, 2024 12:41
mergify[bot] added the backport and conflicts (There is a conflict in the backported pull request) labels, Apr 29, 2024
mergify[bot] assigned faec, Apr 29, 2024
mergify[bot] (Contributor, Author) commented Apr 29, 2024

Cherry-pick of e588628 has failed:

On branch mergify/bp/8.14/pr-39131
Your branch is up to date with 'origin/8.14'.

You are currently cherry-picking commit e588628b24.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   x-pack/filebeat/input/awss3/input_benchmark_test.go
	modified:   x-pack/filebeat/input/awss3/s3.go
	modified:   x-pack/filebeat/input/awss3/s3_objects.go
	modified:   x-pack/filebeat/input/awss3/s3_objects_test.go
	modified:   x-pack/filebeat/input/awss3/s3_test.go
	modified:   x-pack/filebeat/input/awss3/state.go
	modified:   x-pack/filebeat/input/awss3/state_test.go
	modified:   x-pack/filebeat/input/awss3/states.go
	modified:   x-pack/filebeat/input/awss3/states_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   x-pack/filebeat/input/awss3/input.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

botelastic[bot] added the needs_team label (Indicates that the issue/PR needs a Team:* label), Apr 29, 2024

botelastic[bot] commented Apr 29, 2024

This pull request doesn't have a Team:<team> label.

faec requested a review from zmoog, April 29, 2024 12:48
elasticmachine (Collaborator) commented Apr 29, 2024

💚 Build Succeeded


Build stats

  • Duration: 137 min 49 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

faec merged commit fbc2db5 into 8.14, Apr 29, 2024
29 of 30 checks passed
faec deleted the mergify/bp/8.14/pr-39131 branch, April 29, 2024 16:34
Labels
backport · conflicts (There is a conflict in the backported pull request) · needs_team (Indicates that the issue/PR needs a Team:* label)