`aws-s3` input skips events and slows ingestion based on object creation timestamp #39065

faec · 2024-04-18T20:59:04Z

When polling an S3 bucket, the aws-s3 input keeps a sync timestamp marking the point in time up to which objects have been fully ingested. Objects with creation timestamps before that point are skipped when polling the bucket.

The sync timestamp is updated in `(*s3Poller).Purge`, which handles cleanup after a full scan of the bucket. `Purge` advances the timestamp based on the creation time of the objects ingested during the scan. However, if the scan encounters too many ephemeral errors (rate-limit warnings, network instability), it is restarted and `Purge` is called early. In that case, `Purge` advances the sync timestamp based on the objects that were processed, even though older objects may remain later in the scan that were never reached. Those objects are then skipped on subsequent passes. The result can be a dramatic slowdown and an eventual stop of ingestion, as an increasing number of objects are skipped on each pass.

A few things mask the severity:

- The way the sync timestamp is stored means it often doesn't persist between Filebeat restarts.
- The object timestamp collected on each page during the scan isn't the most recent timestamp; it's just some timestamp ahead of the current time, so the rate of slowdown is less severe than it could have been.
- It is relatively uncommon for the error state to be triggered on small buckets with modest ingestion speed, so when slowdown is observed it's attributed to bucket size.


elasticmachine · 2024-04-18T20:59:05Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

…#39131)

This is a cleanup of concurrency and error handling in the `aws-s3` input that could cause several known bugs:

- Memory leaks ([1](elastic/integrations#9463), [2](#39052)). This issue was caused because the input could run several scans of its s3 bucket simultaneously, which led to the cleanup routine `s3Poller.Purge` being called many times concurrently. Inefficiencies in this function caused it to accumulate over time, creating many copies of the state data which could overload process memory. Fixed by:
  * Changing the `s3Poller` run loop to only run one scan at a time, and wait for it to complete before starting the next one.
  * Having each object persist its own state after completing, instead of waiting until the end of a scan and writing an entire bucket worth of metadata at once.
    - This also allowed the removal of other metadata: there is no longer any reason to track the detailed acknowledgment state of each "listing" (page of ~1K events during bucket enumeration), so the `states` helper object is now much simpler.
- Skipped data due to buggy last-modified calculations ([3](#39065)). The most recent scanned timestamp was calculated incorrectly, causing the input to skip a growing number of events as ingestion progressed.
  * Fixed by removing the bucket-wide last modified check entirely. This feature was already risky, since objects with earlier creation timestamps can appear after ones with later timestamps, so there is always the possibility to miss objects. Since the value was calculated incorrectly and was discarded between runs, we can remove it without breaking compatibility and reimplement it more safely in the future if needed.
- Skipped data because rate limiting is treated as permanent failure ([4](#39114)). The input treats all error types the same, which causes many objects to be skipped for ephemeral errors.
  * Fixed by creating an error, `errS3DownloadFailure`, that is returned when processing failure is caused by a download error. In this case, the S3 workers will not persist the failure to the `states` table, so the object will be retried on the next bucket scan. When this happens the worker also sleeps (using an exponential backoff) before trying the next object.
  * Exponential backoff was also added to the bucket scanning loop for page listing errors, so the bucket scan is not restarted needlessly.

…#39131) (cherry picked from commit e588628) # Conflicts: # x-pack/filebeat/input/awss3/input.go

…ss in the `aws-s3` input (#39262) * Fix concurrency bugs that could cause data loss in the `aws-s3` input (#39131) (cherry picked from commit e588628) # Conflicts: # x-pack/filebeat/input/awss3/input.go * fix merge --------- Co-authored-by: Fae Charlton

faec added the Team:Elastic-Agent label Apr 18, 2024

faec self-assigned this Apr 18, 2024

faec mentioned this issue Apr 18, 2024

Meta: Improve performance and reliability of awss3 and awscloudwatch inputs #38956

Open

faec added the bug label Apr 18, 2024

This was referenced Apr 22, 2024

aws-s3 input's bucket polling accumulates state in the registry #39116

Open

Fix concurrency bugs that could cause data loss in the aws-s3 input #39131

Merged

faec closed this as completed in #39131 Apr 29, 2024

mergify bot mentioned this issue Apr 29, 2024

[8.14](backport #39131) Fix concurrency bugs that could cause data loss in the aws-s3 input #39262

Merged

