🐛 Source Amazon S3: solve possible case of files being missed during incremental syncs #12568

lazebnyi
Collaborator

@lazebnyi lazebnyi commented May 4, 2022

What

#5365 - 🐛 Source Amazon S3: solve possible case of files being missed during incremental syncs

How

Retain in the state the _ab_source_file_url of every file synced in the last N days (N = 3).

The history of synced files is stored in a plain set rather than a cuckoo filter. The history entries are strings, so given the expected size of the state and the lookup complexity of a set, a set serves us better than a cuckoo filter, which has an error rate greater than 0.
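
A minimal sketch of the idea described above, not the connector's actual code. It assumes a hypothetical state layout in which the history maps a date string to the set of file URLs synced for that date; the names `add_to_history`, `prune_history`, and `should_sync` are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Dict, Set

HISTORY_DAYS = 3  # N from the description above
DATE_FORMAT = "%Y-%m-%d"  # assumed key format for the history


def add_to_history(history: Dict[str, Set[str]], file_url: str, last_modified: datetime) -> None:
    """Record a synced file URL under the day it was last modified."""
    history.setdefault(last_modified.strftime(DATE_FORMAT), set()).add(file_url)


def prune_history(history: Dict[str, Set[str]], cursor: datetime) -> Dict[str, Set[str]]:
    """Keep only the days that fall inside the N-day window before the cursor."""
    cutoff = (cursor - timedelta(days=HISTORY_DAYS)).date()
    return {
        day: urls
        for day, urls in history.items()
        if datetime.strptime(day, DATE_FORMAT).date() >= cutoff
    }


def should_sync(
    history: Dict[str, Set[str]], file_url: str, last_modified: datetime, cursor: datetime
) -> bool:
    """Decide whether a listed file still needs to be read.

    Files newer than the cursor are always read. Files inside the window are
    read only if their URL is not in the history. Anything older than the
    window is assumed to have been handled by an earlier sync.
    """
    if last_modified > cursor:
        return True
    in_window = last_modified.date() >= (cursor - timedelta(days=HISTORY_DAYS)).date()
    already_synced = any(file_url in urls for urls in history.values())
    return in_window and not already_synced
```

A set gives exact membership answers, so a file inside the window is never skipped because of a false positive, which is the risk a probabilistic filter would introduce.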

Pre-merge Checklist

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here

@github-actions github-actions bot added the area/connectors Connector related issues label May 4, 2022
@CLAassistant

CLAassistant commented May 5, 2022

CLA assistant check
All committers have signed the CLA.

@lazebnyi
Collaborator Author

lazebnyi commented May 9, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2293985957
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2293985957
🐛 https://gradle.com/s/xaxfl6yzxcqm6
Python short test summary info:

=========================== short test summary info ============================
FAILED test_incremental.py::TestIncremental::test_two_sequential_reads[inputs2]
FAILED test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs0]
FAILED test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs2]
=================== 3 failed, 41 passed in 93.84s (0:01:33) ====================

@lazebnyi
Collaborator Author

lazebnyi commented May 11, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2308114664
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2308114664
🐛 https://gradle.com/s/dsntizz7ikzuc

@lazebnyi
Collaborator Author

lazebnyi commented May 17, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2338746091
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2338746091
🐛 https://gradle.com/s/atcsum5yu5l2g

@lazebnyi
Collaborator Author

lazebnyi commented May 23, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2371223655
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2371223655
Python tests coverage:

Name                                                 Stmts   Miss  Cover
------------------------------------------------------------------------
source_acceptance_test/utils/__init__.py                 6      0   100%
source_acceptance_test/tests/__init__.py                 4      0   100%
source_acceptance_test/__init__.py                       2      0   100%
source_acceptance_test/tests/test_full_refresh.py       52      2    96%
source_acceptance_test/utils/asserts.py                 37      2    95%
source_acceptance_test/config.py                        75      6    92%
source_acceptance_test/utils/json_schema_helper.py     105     13    88%
source_acceptance_test/utils/common.py                  80     17    79%
source_acceptance_test/tests/test_incremental.py        85     25    71%
source_acceptance_test/utils/compare.py                 62     23    63%
source_acceptance_test/tests/test_core.py              285    106    63%
source_acceptance_test/base.py                          10      4    60%
source_acceptance_test/utils/connector_runner.py       110     48    56%
------------------------------------------------------------------------
TOTAL                                                  913    246    73%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/s3_utils.py                                                19      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source.py                                                  29      1    97%
source_s3/source_files_abstract/storagefile.py                       23      1    96%
source_s3/s3file.py                                                  37      2    95%
source_s3/source_files_abstract/formats/abstract_file_parser.py      35      2    94%
source_s3/source_files_abstract/stream.py                           200     13    94%
source_s3/stream.py                                                  43      3    93%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/source_files_abstract/file_info.py                         26      8    69%
source_s3/utils.py                                                   31     10    68%
source_s3/source_files_abstract/source.py                            37     14    62%
source_s3/source_files_abstract/spec.py                              43     22    49%
source_s3/source_files_abstract/formats/avro_parser.py               38     25    34%
source_s3/source_files_abstract/formats/parquet_parser.py            61     44    28%
-------------------------------------------------------------------------------------
TOTAL                                                               730    163    78%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/source_files_abstract/formats/abstract_file_parser.py      35      0   100%
source_s3/source.py                                                  29      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/formats/parquet_parser.py            61      3    95%
source_s3/source_files_abstract/formats/avro_parser.py               38      3    92%
source_s3/source_files_abstract/storagefile.py                       23      5    78%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/utils.py                                                   31      8    74%
source_s3/source_files_abstract/file_info.py                         26     10    62%
source_s3/source_files_abstract/source.py                            37     15    59%
source_s3/s3file.py                                                  37     18    51%
source_s3/source_files_abstract/spec.py                              43     22    49%
source_s3/source_files_abstract/stream.py                           200    103    48%
source_s3/s3_utils.py                                                19     13    32%
source_s3/stream.py                                                  43     30    30%
-------------------------------------------------------------------------------------
TOTAL                                                               730    248    66%

@lazebnyi lazebnyi requested review from evantahler and sherifnada and removed request for annalvova05 May 23, 2022 15:16
@lazebnyi lazebnyi marked this pull request as ready for review May 23, 2022 18:33
Contributor

@evantahler evantahler left a comment

I'd like to see a test that shows the new state.history is what you would expect

Contributor

@sherifnada sherifnada left a comment

Agreed with evan that we should have very thorough testing on this functionality -- it's pretty important that we:

  1. don't miss any files
  2. drop old files from the STATE object to ensure it doesn't bloat unnecessarily
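
The two requirements above could be checked with tests along these lines. This is a sketch under stated assumptions: `run_sync` is a hypothetical stand-in for the stream's incremental read, the state shape (`cursor` plus a `history` of URL to last-modified time) is invented for illustration, and the 3-day window comes from the PR description.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

WINDOW = timedelta(days=3)  # retention window, per the PR description


def run_sync(state: dict, listing: Dict[str, datetime]) -> Tuple[List[str], dict]:
    """Simulate one incremental sync over a listing of {url: last_modified}."""
    cursor = state.get("cursor")
    history = dict(state.get("history", {}))
    emitted = []
    for url, modified in sorted(listing.items()):
        # On the first sync there is no cursor, so every file is in scope.
        in_window = cursor is None or modified >= cursor - WINDOW
        if url not in history and in_window:
            emitted.append(url)
            history[url] = modified
    candidates = list(history.values()) + ([cursor] if cursor else [])
    new_cursor = max(candidates, default=None)
    if new_cursor is not None:
        # Drop entries older than the window so the state does not grow forever.
        history = {u: m for u, m in history.items() if m >= new_cursor - WINDOW}
    return emitted, {"cursor": new_cursor, "history": history}


def test_late_file_is_not_missed():
    listing = {"s3://bucket/b.csv": datetime(2022, 5, 9, 10, 0)}
    _, state = run_sync({}, listing)
    # c.csv shows up later but carries a timestamp earlier than the cursor.
    listing["s3://bucket/c.csv"] = datetime(2022, 5, 9, 9, 0)
    emitted, _ = run_sync(state, listing)
    assert emitted == ["s3://bucket/c.csv"]


def test_old_entries_are_dropped():
    listing = {
        "s3://bucket/a.csv": datetime(2022, 5, 1, 8, 0),
        "s3://bucket/b.csv": datetime(2022, 5, 9, 10, 0),
    }
    emitted, state = run_sync({}, listing)
    assert emitted == ["s3://bucket/a.csv", "s3://bucket/b.csv"]
    assert set(state["history"]) == {"s3://bucket/b.csv"}
```

The first test covers a file that a cursor-only comparison would skip; the second asserts that entries outside the window leave the state after the sync that read them.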

@lazebnyi
Collaborator Author

lazebnyi commented May 27, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2397240005
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2397240005
Python tests coverage:

Name                                                 Stmts   Miss  Cover
------------------------------------------------------------------------
source_acceptance_test/utils/__init__.py                 6      0   100%
source_acceptance_test/tests/__init__.py                 4      0   100%
source_acceptance_test/__init__.py                       2      0   100%
source_acceptance_test/tests/test_full_refresh.py       52      2    96%
source_acceptance_test/utils/asserts.py                 37      2    95%
source_acceptance_test/config.py                        75      6    92%
source_acceptance_test/utils/json_schema_helper.py     105     13    88%
source_acceptance_test/utils/common.py                  80     17    79%
source_acceptance_test/tests/test_incremental.py        85     25    71%
source_acceptance_test/utils/compare.py                 62     23    63%
source_acceptance_test/tests/test_core.py              285    106    63%
source_acceptance_test/base.py                          10      4    60%
source_acceptance_test/utils/connector_runner.py       110     48    56%
------------------------------------------------------------------------
TOTAL                                                  913    246    73%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/s3_utils.py                                                19      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source.py                                                  29      1    97%
source_s3/source_files_abstract/storagefile.py                       23      1    96%
source_s3/s3file.py                                                  37      2    95%
source_s3/source_files_abstract/formats/abstract_file_parser.py      35      2    94%
source_s3/source_files_abstract/stream.py                           217     14    94%
source_s3/stream.py                                                  43      3    93%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/source_files_abstract/file_info.py                         26      8    69%
source_s3/utils.py                                                   31     10    68%
source_s3/source_files_abstract/source.py                            37     14    62%
source_s3/source_files_abstract/spec.py                              43     22    49%
source_s3/source_files_abstract/formats/avro_parser.py               38     25    34%
source_s3/source_files_abstract/formats/parquet_parser.py            61     44    28%
-------------------------------------------------------------------------------------
TOTAL                                                               747    164    78%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/source_files_abstract/formats/abstract_file_parser.py      35      0   100%
source_s3/source.py                                                  29      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/formats/parquet_parser.py            61      3    95%
source_s3/source_files_abstract/formats/avro_parser.py               38      3    92%
source_s3/source_files_abstract/storagefile.py                       23      5    78%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/utils.py                                                   31      8    74%
source_s3/source_files_abstract/file_info.py                         26     10    62%
source_s3/source_files_abstract/source.py                            37     15    59%
source_s3/source_files_abstract/stream.py                           217     89    59%
source_s3/s3file.py                                                  37     18    51%
source_s3/source_files_abstract/spec.py                              43     22    49%
source_s3/s3_utils.py                                                19     13    32%
source_s3/stream.py                                                  43     30    30%
-------------------------------------------------------------------------------------
TOTAL                                                               747    234    69%

@lazebnyi
Collaborator Author

lazebnyi commented May 31, 2022

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2416661045
✅ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2416661045
Python tests coverage:

Name                                                 Stmts   Miss  Cover
------------------------------------------------------------------------
source_acceptance_test/utils/__init__.py                 6      0   100%
source_acceptance_test/tests/__init__.py                 4      0   100%
source_acceptance_test/__init__.py                       2      0   100%
source_acceptance_test/tests/test_full_refresh.py       52      2    96%
source_acceptance_test/utils/asserts.py                 37      2    95%
source_acceptance_test/config.py                        77      6    92%
source_acceptance_test/utils/json_schema_helper.py     105     13    88%
source_acceptance_test/tests/test_incremental.py       121     25    79%
source_acceptance_test/utils/common.py                  80     17    79%
source_acceptance_test/tests/test_core.py              294    106    64%
source_acceptance_test/utils/compare.py                 62     23    63%
source_acceptance_test/base.py                          10      4    60%
source_acceptance_test/utils/connector_runner.py       110     48    56%
------------------------------------------------------------------------
TOTAL                                                  960    246    74%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/s3file.py                                                  37      0   100%
source_s3/s3_utils.py                                                19      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source.py                                                  29      1    97%
source_s3/source_files_abstract/storagefile.py                       23      1    96%
source_s3/source_files_abstract/formats/abstract_file_parser.py      35      2    94%
source_s3/source_files_abstract/stream.py                           217     14    94%
source_s3/stream.py                                                  43      3    93%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/source_files_abstract/file_info.py                         26      8    69%
source_s3/utils.py                                                   31     10    68%
source_s3/source_files_abstract/source.py                            37     14    62%
source_s3/source_files_abstract/spec.py                              43     22    49%
source_s3/source_files_abstract/formats/avro_parser.py               38     25    34%
source_s3/source_files_abstract/formats/parquet_parser.py            61     44    28%
-------------------------------------------------------------------------------------
TOTAL                                                               747    162    78%
Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
source_s3/source_files_abstract/storagefile.py                       23      0   100%
source_s3/source_files_abstract/spec.py                              43      0   100%
source_s3/source_files_abstract/formats/parquet_spec.py               9      0   100%
source_s3/source_files_abstract/formats/csv_spec.py                  16      0   100%
source_s3/source_files_abstract/formats/avro_spec.py                  5      0   100%
source_s3/source_files_abstract/formats/abstract_file_parser.py      35      0   100%
source_s3/source.py                                                  29      0   100%
source_s3/s3file.py                                                  37      0   100%
source_s3/s3_utils.py                                                19      0   100%
source_s3/__init__.py                                                 2      0   100%
source_s3/source_files_abstract/formats/parquet_parser.py            61      1    98%
source_s3/stream.py                                                  43      1    98%
source_s3/source_files_abstract/source.py                            37      2    95%
source_s3/source_files_abstract/formats/avro_parser.py               38      3    92%
source_s3/source_files_abstract/file_info.py                         26      3    88%
source_s3/source_files_abstract/stream.py                           217     36    83%
source_s3/source_files_abstract/formats/csv_parser.py                76     18    76%
source_s3/utils.py                                                   31      8    74%
-------------------------------------------------------------------------------------
TOTAL                                                               747     72    90%

Build Passed

Test summary info:

All Passed

@codecov

codecov bot commented May 31, 2022

Codecov Report

❗ No coverage uploaded for pull request base (master@0c5cdc7). Click here to learn what that means.
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master   #12568   +/-   ##
=========================================
  Coverage          ?   90.36%           
=========================================
  Files             ?       18           
  Lines             ?      747           
  Branches          ?        0           
=========================================
  Hits              ?      675           
  Misses            ?       72           
  Partials          ?        0           

Powered by Codecov. Last update 0c5cdc7...3a8377d. Read the comment docs.

@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label May 31, 2022
@lazebnyi
Collaborator Author

lazebnyi commented May 31, 2022

/publish connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/2416841026
🚀 Successfully published connectors/source-s3
❌ Couldn't auto-bump version for connectors/source-s3

@lazebnyi lazebnyi temporarily deployed to more-secrets May 31, 2022 18:05 Inactive
@lazebnyi lazebnyi merged commit f9348b2 into master May 31, 2022
@lazebnyi lazebnyi deleted the lazebnyi/5365-s3-solve-possible-case-of-files-being-missed-during-incremental-syncs branch May 31, 2022 18:39
jscottpolevault pushed a commit to jscottpolevault/airbyte that referenced this pull request Jun 1, 2022
…incremental syncs (airbytehq#12568)

* Added history to state

* Deleted unused import

* Rollback abnormal state file

* Rollback abnormal state file

* Fixed type error issue

* Fix state issue

* Updated after review

* Bumped version
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

🐛 Source Amazon S3: solve possible case of files being missed during incremental syncs
5 participants