🐛 Source Amazon S3: solve possible case of files being missed during incremental syncs #5365
Comments
Some questions about the cuckoo filter approach:
@sherifnada Thanks for the questions. I think the error rate will be near 0.000001. I think we can use this implementation: https://github.com/huydhn/cuckoo-filter. Lookup in a cuckoo filter is O(1) versus O(n) for a list, so for performance the cuckoo filter is the better choice.
@lazebnyi What happens if we get a false positive? Do we miss data? Could you say more about why this approach is better than keeping track of all files synced in the last N days, and using that list to determine whether a file should be synced? That seems 100% deterministic and simpler to understand. At what scale does this become problematic? We could even take an adaptive approach down the line where, if that object becomes too big, we transform it into a cuckoo filter or something similar.
my take, fwiw...
The idea to use a cuckoo filter is to do precisely this while being more efficient with storage.
^ IMO this sounds like a sensible approach. We should avoid introducing the complexity of a cuckoo filter unless we really have to, and by doing the simpler approach first we might find that there are other bottlenecks long before we hit state storage limitations.
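The deterministic approach discussed above can be sketched roughly as follows. This is a hedged illustration, not Airbyte's actual connector code: the names (`SyncState`, `should_sync`, `record`) and the 3-day window are hypothetical. The key point is that within the retention window, membership in the synced-files list decides whether to sync, not the cursor alone, so a late-appearing file with an old last-modified timestamp is still picked up.

```python
from datetime import datetime, timedelta

HISTORY_DAYS = 3  # hypothetical retention window

class SyncState:
    """Hypothetical sketch of state that tracks recently synced files."""

    def __init__(self):
        self.cursor = datetime(1970, 1, 1)  # max last_modified seen so far
        self.synced = {}                    # file key -> last_modified

    def should_sync(self, key, last_modified):
        # Files far older than the window are assumed already handled.
        if last_modified < self.cursor - timedelta(days=HISTORY_DAYS):
            return False
        # Within the window, sync anything not yet recorded, even if its
        # last_modified is older than the cursor (the missed-file case).
        return self.synced.get(key) != last_modified

    def record(self, key, last_modified):
        self.synced[key] = last_modified
        self.cursor = max(self.cursor, last_modified)
```

With this shape, a file that only becomes visible after the cursor has already advanced past its timestamp is still synced, because it is absent from `synced`.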
Current Behavior
Based on Amazon S3's consistency model, we are protected against files changing as they are being synced: we should always get either the old or the new version, never an error or a corrupt file. This could mean records appear more than once, but we won't miss data.
However, that consistency model, coupled with the way S3 sets its last-modified property (well explained in this Stack Overflow post), makes it feasible for a file to be unavailable during a sync yet, once available, carry a last-modified timestamp earlier than the value in our state. This means that on subsequent syncs the file is available, but we skip it due to the incremental filtering.
An example of how this could happen:
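One plausible timeline, sketched in Python, shows the race described above. All timestamps and the file key are hypothetical; the premise (taken from the issue description) is that S3's last-modified timestamp can predate the moment the object becomes listable.

```python
from datetime import datetime

# Hypothetical timeline for a single object, e.g. "data.csv":
upload_started = datetime(2021, 8, 10, 12, 0, 0)   # S3 stamps last_modified here
sync_ran       = datetime(2021, 8, 10, 12, 0, 30)  # object not yet listable
state_cursor   = sync_ran                           # sync advances the cursor anyway

# When the object finally becomes listable, it keeps the earlier timestamp.
last_modified = upload_started

# The next incremental sync filters on last_modified >= state_cursor,
# so the file is skipped even though it was never synced.
will_be_synced = last_modified >= state_cursor
print(will_be_synced)  # False -> the file is silently missed
```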
Ideal Behavior
Incremental syncs should never miss files. Any relevant file that is newly available should be synced.
Solution Ideas
Note: for retaining information in state (directly or via a bloom filter), we could put a hard limit of X days on the files we care about, to avoid an ever-growing state size.
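The hard-limit idea can be sketched as a simple pruning step over the stored file history. The function name, the dict-shaped history, and the 3-day default are all hypothetical, but they show how state stays bounded regardless of bucket size:

```python
from datetime import datetime, timedelta

def prune_history(history, cursor, max_age_days=3):
    """Drop entries older than the window so state size stays bounded.

    history: hypothetical state shape, mapping file key -> last_modified.
    cursor:  the max last_modified seen so far.
    """
    floor = cursor - timedelta(days=max_age_days)
    return {key: ts for key, ts in history.items() if ts >= floor}
```

Running this after each sync keeps the per-file history to at most X days of entries, which is the trade-off point at which a probabilistic structure like a cuckoo filter would start to pay off.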
@subodh1810 thanks for pointing this potential issue out and discussing solutions.
Are you willing to submit a PR?
yes, opening issue to discuss approach.