
🐛 Source Amazon S3: solve possible case of files being missed during incremental syncs #5365

Closed · Tracked by #10939
Phlair opened this issue Aug 12, 2021 · 4 comments · Fixed by #12568

Comments

Phlair (Contributor) commented Aug 12, 2021

Current Behavior

Based on Amazon S3's consistency model, we are protected against files changing as they are being synced: we should always read either the old or the new version of an object (never an error or corrupt data). This could mean records appear more than once, but we won't miss data.

However, that consistency model, coupled with the way S3 sets an object's last modified property (well explained in this Stack Overflow post), makes it feasible for a file to be unavailable during a sync and yet, once available, carry a last modified timestamp earlier than the value in our state. This means that on subsequent syncs the file is available, but we skip it due to the incremental filtering.

An example of how this could happen:

  • A user concurrently writes x files to S3 (with one of the files being significantly larger than the rest)
  • x-1 of the writes complete (all except the larger file)
  • An Airbyte sync begins
  • The sync finds all available files (x-1) and syncs them; state is updated at the end to the last modified date of the most recent file
  • At some point the larger file finishes writing and becomes available. Its last modified property represents the start of its write, not the end.
  • During the next Airbyte sync, the now-available larger file has a last modified property <= our value in state and is therefore skipped, meaning it never was, and never will be, ingested (the sketch below illustrates the filter that causes this).
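
To make the failure mode concrete, here is a minimal sketch of a naive cursor-based filter dropping exactly such a file (hypothetical names, not the connector's actual code):

```python
from datetime import datetime, timedelta

def files_to_sync(all_files: dict[str, datetime], cursor: datetime) -> list[str]:
    # Naive incremental filter: keep only files strictly newer than the cursor.
    return [url for url, last_modified in all_files.items() if last_modified > cursor]

# The large file's last modified marks the START of its slow write, so once it
# finally appears it can carry a timestamp older than the cursor saved by the
# previous sync -- and the filter above then drops it forever.
cursor = datetime(2021, 8, 12, 12, 0)  # saved at the end of the previous sync
late_file = {"s3://bucket/big-file.csv": cursor - timedelta(minutes=5)}
assert files_to_sync(late_file, cursor) == []  # silently skipped
```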

Ideal Behavior

Incremental syncs should never miss files. Any relevant file that is newly available should be synced.

Solution Ideas

  • Retain information in the state about every file (url + last modified) ever synced.
    • I think this would need the history of all time (rather than just the last sync), because multiple syncs could take place between the start of a file's write and its availability.
    • This is the bulletproof solution; however, it seems infeasible because of how large it could make the state (with the state size increasing forever).
  • Use a bloom filter (probably a cuckoo filter) to achieve the above in a much more memory-efficient way.
    • Since this is probabilistic, the balance here is between size of state vs. likelihood of false positives.
    • In theory this could still make the state too large, but the number of files N at which that would happen would be much higher thanks to the storage savings.
  • Grace period on each sync so that it only syncs files with last modified <= Tnow - n.
    • This solution is a compromise between consistency and availability. When n=0, the negative behaviour described above can happen; as n → ∞ the chances of it happening decrease, at the cost of freshness of the synced data.
    • e.g. n=1 day means data will only be missed if it takes > 1 day to write it to S3, but data synced through Airbyte will always be at least 1 day stale.

Note: for retaining information in state (directly or via a bloom/cuckoo filter), we could put a hard limit of X days of files to care about, to avoid a forever-increasing state size (see the sketch below).
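
For concreteness, a minimal sketch of the first idea combined with the note's hard limit (names and the 3-day window are illustrative, not the connector's actual state schema):

```python
from datetime import datetime, timedelta

HISTORY_WINDOW = timedelta(days=3)  # illustrative hard limit ("X days" in the note)

def filter_files(all_files: dict[str, datetime], state: dict) -> list[str]:
    """Sync a file if it is newer than the cursor, or if it falls inside the
    history window and we have no record of having synced it before."""
    cursor: datetime = state["cursor"]
    history: dict[str, str] = state["history"]  # url -> last-modified ISO string
    return [
        url
        for url, lm in all_files.items()
        if lm > cursor or (lm > cursor - HISTORY_WINDOW and url not in history)
    ]

def update_state(state: dict, synced: dict[str, datetime]) -> dict:
    cursor = max([state["cursor"], *synced.values()])
    history = {**state["history"], **{u: lm.isoformat() for u, lm in synced.items()}}
    # Prune entries older than the window so the state stays bounded.
    floor = cursor - HISTORY_WINDOW
    history = {u: iso for u, iso in history.items() if datetime.fromisoformat(iso) >= floor}
    return {"cursor": cursor, "history": history}
```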

@subodh1810 thanks for pointing this potential issue out and discussing solutions.

Are you willing to submit a PR?

yes, opening issue to discuss approach.

@Phlair Phlair added the type/bug Something isn't working label Aug 12, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Nov 19, 2021
@lazebnyi lazebnyi self-assigned this Apr 2, 2022
@lazebnyi lazebnyi changed the title from "🐛 Source S3: solve possible case of files being missed during incremental syncs" to "🐛 Source Amazon S3: solve possible case of files being missed during incremental syncs" Apr 2, 2022
sherifnada (Contributor) commented:

@lazebnyi

Some questions about the cuckoo filter approach:

  1. What is the expected false-positive rate? And what is the impact on syncs/performance? What size would the state object become?
  2. How does this compare to something like keeping a cursor field value plus the names of all files from the last 3 days? (Presumably 3 days is a safe enough period.) We could compress this list of files if that helps, too.

lazebnyi (Collaborator) commented:

@sherifnada Thanks for the questions.

I think the false-positive rate would be near 0.000001. I think we can use this implementation: https://github.com/huydhn/cuckoo-filter.
Sync performance shouldn't be affected. State size = capacity × bucket size × fingerprint size.

Lookup in a cuckoo filter is O(1) versus O(n) for a list, so for performance the cuckoo filter is the better choice.
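
As a back-of-the-envelope reading of that size formula, with purely illustrative numbers (not from the thread):

```python
# Rough state-size estimate from "capacity * bucket size * fingerprint"
# (all numbers below are assumptions for illustration only).
num_buckets = 2 ** 18       # ~262k buckets
bucket_size = 4             # fingerprint slots per bucket
fingerprint_bytes = 2       # 16-bit fingerprints

size_bytes = num_buckets * bucket_size * fingerprint_bytes
capacity = int(num_buckets * bucket_size * 0.95)  # ~95% load factor is typical

print(f"~{size_bytes // 1024} KiB of state tracks ~{capacity:,} files")
# -> ~2048 KiB of state tracks ~996,147 files
```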

sherifnada (Contributor) commented:

@lazebnyi What happens if we get a false positive? Do we miss data?

Could you say more about why this approach is better than keeping track of all files synced in the last N days and using that list to determine whether a file should be synced? That seems 100% deterministic and simpler to understand. At what scale does this become problematic?

We could even do an adaptive approach down the line where, if that object becomes too big, we transform it into a cuckoo filter or something.

Phlair (Contributor, Author) commented Apr 29, 2022

my take, fwiw...

"keeping track of all files synced in the last N days, and using that list to determine if a file should be synced"

The idea behind using a cuckoo filter is to do precisely this while being more storage-efficient.

"We could even do an adaptive approach down the line where, if that object becomes too big, we transform it to a cuckoo filter or something"

^ imo this sounds like a sensible approach. We should avoid introducing the complexity of a cuckoo filter unless we really have to, and by doing the simpler approach first we might find that other bottlenecks appear well before we hit state-storage limits.
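
A rough sketch of what that adaptive switch might look like (all names and the threshold are hypothetical; the compact branch stands in for whatever cuckoo-filter implementation gets chosen):

```python
import hashlib
import json

MAX_STATE_BYTES = 512 * 1024  # illustrative threshold, not a real Airbyte limit

def fingerprint(url: str) -> str:
    # Stable 4-byte fingerprint of the file URL (hashlib, not hash(), so it
    # survives across runs when persisted in state).
    return hashlib.sha1(url.encode()).hexdigest()[:8]

def serialize_history(history: dict[str, str]) -> dict:
    """Keep the plain file list while it is small; degrade to fingerprints
    (or a cuckoo filter, in a real implementation) once it grows too big."""
    plain = {"format": "plain", "files": history}
    if len(json.dumps(plain)) <= MAX_STATE_BYTES:
        return plain
    # The compact form trades a tiny false-positive chance (two URLs sharing
    # a fingerprint -> a new file wrongly treated as already synced) for size.
    return {"format": "compact", "fingerprints": sorted(fingerprint(u) for u in history)}
```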
