🎉 New Destination: Databricks #5998

Merged
tuliren merged 28 commits into master from liren/destination-databricks
Sep 14, 2021

tuliren
Contributor

@tuliren tuliren commented Sep 12, 2021

What

How

  • This is a copy destination that uses the Simba Spark JDBC driver.
  • To build it, download the driver and put it under the /lib directory.
  • This connector needs to be published as a private image.
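As a rough illustration of what a copy destination does (stage the data as files, load them into the target, then remove the staged files), here is a minimal Java sketch. `StagingArea`, `TableLoader`, and `CopySyncSketch` are hypothetical names used only for this example, not classes from this PR, and cleaning up in a `finally` block is an assumption about failure handling rather than documented connector behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the staging area; not a class from the connector. */
interface StagingArea {
  /** Writes the records as one staged file and returns its location. */
  String writeParquet(String stream, List<String> records);

  void delete(String location);
}

/** Hypothetical stand-in for the JDBC side that loads a staged file into a table. */
interface TableLoader {
  void load(String stream, String location);
}

final class CopySyncSketch {

  /** Stage the records, load the staged file, then always remove it. */
  static void sync(String stream, List<String> records, StagingArea staging, TableLoader loader) {
    String location = staging.writeParquet(stream, records);
    try {
      loader.load(stream, location);
    } finally {
      // Assumption: staged files are removed even when the load fails.
      staging.delete(location);
    }
  }

  /** Runs the flow against in-memory fakes and returns the recorded steps. */
  static List<String> demo() {
    List<String> log = new ArrayList<>();
    StagingArea staging = new StagingArea() {
      public String writeParquet(String stream, List<String> records) {
        log.add("stage " + stream + " (" + records.size() + " records)");
        return "staging/" + stream + ".parquet"; // made-up location for the demo
      }

      public void delete(String location) {
        log.add("delete " + location);
      }
    };
    TableLoader loader = (stream, location) -> log.add("load " + location + " into " + stream);
    sync("users", List.of("{\"id\":1}", "{\"id\":2}"), staging, loader);
    return log;
  }
}
```

Calling `CopySyncSketch.demo()` records the three steps in order: stage, load, delete.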

Recommended reading order

  1. databricks.md
  2. DatabricksDestination.java, DatabricksStreamCopier.java

Pre-merge Checklist

New Connector

Community member or Airbyter

  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions
  • Connector added to connector index like described here

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to Github CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here

@github-actions github-actions bot added the area/connectors and area/documentation labels Sep 12, 2021
@tuliren tuliren requested a review from Phlair September 12, 2021 16:30
@tuliren
Contributor Author

tuliren commented Sep 13, 2021

/test connector=connectors/destination-databricks

🕑 connectors/destination-databricks https://github.com/airbytehq/airbyte/actions/runs/1228154206
❌ connectors/destination-databricks https://github.com/airbytehq/airbyte/actions/runs/1228154206
🐛 https://gradle.com/s/tcq6746fc3pvi

@jrhizor jrhizor temporarily deployed to more-secrets September 13, 2021 04:28 Inactive
@tuliren
Contributor Author

tuliren commented Sep 13, 2021

/test and /publish can only be done locally, since the JDBC driver cannot be checked into the repo. All integration tests have passed locally:

[Screenshot: Screen Shot 2021-09-12 at 21 55 42]

@@ -36,7 +36,7 @@
   /**
    * Writes a value to a staging file for the stream.
    */
-  void write(UUID id, String jsonDataString, Timestamp emittedAt) throws Exception;
+  void write(UUID id, AirbyteRecordMessage recordMessage) throws Exception;
Contributor Author

This interface change is needed so that the stream copier can work with Parquet writer.
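The comment does not spell out why the Parquet writer needs the new signature. One plausible reading is that a columnar writer needs access to individual fields, which a pre-serialized JSON string hides. The sketch below illustrates that idea only; `RecordSketch`, `CopierSketch`, `RowCopierSketch`, and `ColumnCopierSketch` are hypothetical stand-ins, and `RecordSketch` is a deliberately simplified substitute for `AirbyteRecordMessage`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

/** Simplified stand-in for AirbyteRecordMessage; only the fields this sketch needs. */
final class RecordSketch {
  final Map<String, Object> data;
  final long emittedAtMillis;

  RecordSketch(Map<String, Object> data, long emittedAtMillis) {
    this.data = new TreeMap<>(data); // sorted copy so iteration order is deterministic
    this.emittedAtMillis = emittedAtMillis;
  }
}

/** Mirrors the shape of the new write signature, using the stand-in record type. */
interface CopierSketch {
  void write(UUID id, RecordSketch record) throws Exception;
}

/** Row-oriented writer: a flattened string per record is all it needs. */
final class RowCopierSketch implements CopierSketch {
  final List<String> lines = new ArrayList<>();

  public void write(UUID id, RecordSketch record) {
    lines.add(id + "," + record.data + "," + record.emittedAtMillis);
  }
}

/** Column-oriented writer: appends each field's value to that field's column. */
final class ColumnCopierSketch implements CopierSketch {
  final Map<String, List<Object>> columns = new TreeMap<>();

  public void write(UUID id, RecordSketch record) {
    record.data.forEach((field, value) ->
        columns.computeIfAbsent(field, k -> new ArrayList<>()).add(value));
  }
}
```

Both writers can still derive the string and timestamp that the old signature passed, so the structured parameter covers the previous use case as well.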

@tuliren tuliren changed the title 🎉 New Destination: Databricks 🎉 New Destination: Databricks (Cloud Only) Sep 13, 2021
Cannot use the jOOQ method directly because it incorrectly quotes the schema name.
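The commit title does not say what the incorrect quoting looked like. A common mismatch, offered here purely as an assumption, is a SQL builder emitting double-quoted identifiers while Spark SQL delimits identifiers with backticks. A minimal sketch of building the statement by hand follows; `IdentifierQuoting` and its methods are hypothetical and are not taken from the connector.

```java
/** Hypothetical helper for quoting identifiers by hand in Spark-style SQL. */
final class IdentifierQuoting {

  /** Wraps an identifier in backticks, doubling any backtick it contains. */
  static String quote(String identifier) {
    return "`" + identifier.replace("`", "``") + "`";
  }

  /** Builds a schema-creation statement with a backtick-quoted schema name. */
  static String createSchemaIfNotExists(String schema) {
    return "CREATE SCHEMA IF NOT EXISTS " + quote(schema);
  }
}
```

For example, `IdentifierQuoting.createSchemaIfNotExists("my schema")` yields a statement whose schema name is wrapped in backticks rather than double quotes.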
Contributor

@sherifnada sherifnada left a comment

nice!

"description": "The region of the S3 staging bucket to use if utilising a copy strategy.",
"enum": [
"",
"us-east-1",
Contributor

just confirming: these are all supported by DB?

Contributor Author

I don't see any region restrictions in their documentation.

docs/integrations/destinations/databricks.md

Data streams are first written as staging Parquet files on S3, and then loaded into Databricks tables. All the staging files will be deleted after the sync is done. For debugging purposes, here is the full path for a staging file:

```
Contributor

very helpful ascii art

@@ -89,6 +89,7 @@
| :--- | :--- |
| Azure Blob Storage | [![destination-azure-blob-storage](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fdestination-azure-blob-storage%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/destination-azure-blob-storage) |
| BigQuery | [![destination-bigquery](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fdestination-bigquery%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/destination-bigquery) |
| Databricks | [![destination-databricks](https://img.shields.io/endpoint?url=https%3A%2F%2Fdnsgjos7lj2fu.cloudfront.net%2Ftests%2Fsummary%2Fdestination-databricks%2Fbadge.json)](https://dnsgjos7lj2fu.cloudfront.net/tests/summary/destination-databricks) |
Contributor

should this be branded as delta lake? I'm not sure what the correct branding is

Contributor Author

Maybe databricks delta lake?

@michel-tricot
Contributor

That's amazing!

@tuliren tuliren changed the title 🎉 New Destination: Databricks (Cloud Only) 🎉 New Destination: Databricks Sep 14, 2021
@tuliren tuliren merged commit e837048 into master Sep 14, 2021
@tuliren tuliren deleted the liren/destination-databricks branch September 14, 2021 23:55
htrueman added a commit that referenced this pull request Sep 17, 2021
* Update check connection method

* #5796 silence printing full config when config validation fails (#5879)

* - #5796 silence printing full config when config validation fails

* fix unit tests after config validation check changes

Co-authored-by: Marcos Eliziario Santos <[email protected]>

* Format google-search-console schemas (#6047)

* Update ads_insights.json (#5946)

fix ads_insights schema according to [facebook docs](https://developers.facebook.com/docs/marketing-api/reference/adgroup/insights/) and my own data

* Bump connectors version + update docs (#6060)

* 🐛 Source Facebook Marketing: Convert values' types according to schema types (#4978)

* Convert values' types according to schema types

* Put streams back to `configured_catalog.json`

Put back `ads_insights` and `ads_insights_age_and_gender` streams.

* Pickup changes from #5946

* Implement change request + fix previous PR

* Update schema

* Remove items_type from convert_to_schema_types()

* Bump connectors version

* add oauth to connector_base dependencies (#6064)

* use spec when persisting source configs (#6036)

* switch most usages of writing sources to using specs

* fix other usages

* fix test

* only wait on the server in the scheduler, not the worker

* fix

* rephrase sanity check and remove stdout

* 🎉 Source Stripe: Add `PaymentIntents` stream (#6004)

* Add `PaymentIntents` stream

* Update docs

* Implement change request + few updates

Split `source.py` file into `source.py` and `streams.py` files.
Update `payment_intents.json` file.

* Bump connectors version + update docs

* Add skeleton for databricks destination (#5629)

Co-authored-by: Liren Tu <[email protected]>
Co-authored-by: LiRen Tu <[email protected]>

* Revert "Add skeleton for databricks destination (#5629)" (#6066)

This reverts commit 79256c4.

* 🎉 New Destination: Databricks (#5998)

Implement new destination connector for databricks delta lake.
Resolves #2075.

Co-authored-by: George Claireaux <[email protected]>
Co-authored-by: Sherif A. Nada <[email protected]>

* Source PostHog: add support for self-hosted instances (#6058)

* publish #6058 (#6059)

* Destination Kafka: correct spec json and data types in config (#6040)

* correct spec json and data types in config

* bump version

* correct tests

* correct config parser NPE

* format files

Co-authored-by: Marcos Marx <[email protected]>

* Fix or delete broken links (#6069)

* Fix more doc issues (#6072)

* 🎉 Added optional platform flag for build image script (#6000)

* Fix dependabot security alert. (#6073)

* Pin set value to greater than 4.0.1 to fix security warning.

* Format the rest of the connectors.

* add coverage report (#6045)

Co-authored-by: Dmytro Rezchykov <[email protected]>

* Fix the format of the data returned by Google Ads oauth to match the config accepted by the connector (#6032)

* update salesforce docs (#6081)

* 🎉 Source Github: add caching for all streams (#5949)

* Source Github: add checking for all streams

* bump version, update changelogs

* Disable automatic migration acceptance test (#5988)

- The automatic migration acceptance test no longer works because of the new Flyway migration system.
- The file-based migration system is being deprecated.

* 🎉 CDK: Add requests native authenticator support (#5731)

* Add requests native auth class

* Update init file.
Update type annotations.
Bump version.

* Update TokenAuthenticator implementation.
Update Oauth2Authenticator implementation.
Add CHANGELOG.md record.

* Update Oauth2Authenticator default value setting.
Update CHANGELOG.md

* Add requests native authenticator tests

* Add CDK requests native __call__ method tests.
Update CHANGELOG.md

* Add outdated auth deprecation messages

* Update requests native auth __call__ method tests

* Bump CDK version to 0.1.20

* Interface changes to support separating secrets from the config (#6065)

* Interface changes to support separating secrets from the config
* Cleanup from PR comments and whitespace

* Update log message for empty env variable (#6115)

Co-authored-by: Jared Rhizor <[email protected]>

* Bump Airbyte version from 0.29.17-alpha to 0.29.18-alpha (#6125)

Co-authored-by: davinchia <[email protected]>

* return auth spec in the API when getting definition specification (#6121)

* Ignore python test coverage files (#6144)

* CDK: support nested refs resolving (#6044)

Co-authored-by: Dmytro Rezchykov <[email protected]>

* feat: path for nested fields (#6130)

* feat: path for nested fields

* fix: clipRule error

* fix: remove field name

* Fix request middleware for ConnectionService (#6148)

* Jamakase/update onboarding flow (#5656)

* Doc explains normalization full-refresh implications (#6097)

* update docs

* add info in quickstart connection page

* update abhi comments

Co-authored-by: Marcos Marx <[email protected]>

* Fix migration validation issue (#6154)

Resolves #6151.

* Bump Airbyte version from 0.29.18-alpha to 0.29.19-alpha (#6156)

Co-authored-by: tuliren <[email protected]>

* Add information on which destinations support Incremental - Deduped History in their docs (#6031)

Co-authored-by: Abhi Vaidyanatha <[email protected]>

* Update Airbyte Spec acknowledgements. (#6155)

Co-authored-by: Abhi Vaidyanatha <[email protected]>

* Update new integration request

* Add back the migration acceptance test (#6163)

* 🎉 Create a Helm Chart For Airbyte (#5891)

See number #1868. This creates an initial helm chart for installing Airbyte in Kubernetes to make it easier for users who are more familiar with helm. It also includes GitHub actions to help continually test that the chart works in the most basic case.
All of the templates are based off of the kustomize folder, but minio and postgres have been removed in favor of adding the bitnami helm charts as dependencies since they have an active community and allow easily tweaking their install.

* Fix OAuth Summary strings (#6143)

Co-authored-by: Marcos Eliziario Santos <[email protected]>
Co-authored-by: Marcos Eliziario Santos <[email protected]>
Co-authored-by: oleh.zorenko <[email protected]>
Co-authored-by: Mauro <[email protected]>
Co-authored-by: Sherif A. Nada <[email protected]>
Co-authored-by: Jared Rhizor <[email protected]>
Co-authored-by: George Claireaux <[email protected]>
Co-authored-by: Liren Tu <[email protected]>
Co-authored-by: LiRen Tu <[email protected]>
Co-authored-by: coeurdestenebres <[email protected]>
Co-authored-by: Marcos Marx <[email protected]>
Co-authored-by: Marcos Marx <[email protected]>
Co-authored-by: Harsha Teja Kanna <[email protected]>
Co-authored-by: Davin Chia <[email protected]>
Co-authored-by: Dmytro <[email protected]>
Co-authored-by: Dmytro Rezchykov <[email protected]>
Co-authored-by: Yevhenii <[email protected]>
Co-authored-by: Jenny Brown <[email protected]>
Co-authored-by: davinchia <[email protected]>
Co-authored-by: Iakov Salikov <[email protected]>
Co-authored-by: Artem Astapenko <[email protected]>
Co-authored-by: tuliren <[email protected]>
Co-authored-by: Abhi Vaidyanatha <[email protected]>
Co-authored-by: Abhi Vaidyanatha <[email protected]>
Co-authored-by: Jonathan Stacks <[email protected]>
Co-authored-by: Christophe Duong <[email protected]>
htrueman added a commit that referenced this pull request Sep 17, 2021
* Add GET_FBA_INVENTORY_AGED_DATA data

* Add GET_MERCHANT_LISTINGS_ALL_DATA stream support

* Update schemas

* Update configured_catalog.json

* Update connector to airbyte-cdk

* Add amazon seller partner test creds

* Update state sample files

* Apply code format

* Update acceptance-test-config.yml

* Add dummy integration test

* Refactor auth signature.
Update streams.py

* Remove print_function import from auth.py

* Refactor source class.
Add pydantic spec.
PR fixes.

* Add dummy integration test

* Typing added.
Add _create_prepared_request docstring.

* Add extra streams and schemas

* Update docs and spec

* Post merge code fixes

* Fix test setup

* Fix test setup

* Add sample_state.json

* Update reports streams logics.
Update test and config files.

* Update tests config.
Small code style fixes.

* Add reports stream slices.
Update check_connection method.

* Post review fixes.

* Streams update

* Add reports document retrieval and decrypting.
Update schemas and configs.

* Add CSV parsing into result rows

* Update ReportsAmazonSPStream class to be the child of Stream class.
Update GET_FLAT_FILE_OPEN_LISTINGS_DATA and GET_MERCHANT_LISTINGS_ALL_DATA schemas.

* Schema updates

* Source check method updated

* Update ReportsAmazonSPStream retry report logics

* Update check_connection source method

* Update reports read_records method.
Update report schemas.

* Update streams.py

* Update acceptance tests config.
Add small code fixes.

* Update report read_records logics

* Add reports streams rate limit handling logics.
Add rate limit unit tests.

* Source Amazon SP: Update reports streams logics. (#5311)

* Bump source version.
Update source docs.

* Mock time.sleep in test_reports_stream_send_request_backoff_exception test

* Acceptance test basic_read test disabled

Co-authored-by: Marcos Eliziario Santos <[email protected]>
Co-authored-by: Marcos Eliziario Santos <[email protected]>
Co-authored-by: oleh.zorenko <[email protected]>
Co-authored-by: Mauro <[email protected]>
Co-authored-by: Sherif A. Nada <[email protected]>
Co-authored-by: Jared Rhizor <[email protected]>
Co-authored-by: George Claireaux <[email protected]>
Co-authored-by: Liren Tu <[email protected]>
Co-authored-by: LiRen Tu <[email protected]>
Co-authored-by: coeurdestenebres <[email protected]>
Co-authored-by: Marcos Marx <[email protected]>
Co-authored-by: Marcos Marx <[email protected]>
Co-authored-by: Harsha Teja Kanna <[email protected]>
Co-authored-by: Davin Chia <[email protected]>
Co-authored-by: Dmytro <[email protected]>
Co-authored-by: Dmytro Rezchykov <[email protected]>
Co-authored-by: Yevhenii <[email protected]>
Co-authored-by: Jenny Brown <[email protected]>
Co-authored-by: davinchia <[email protected]>
Co-authored-by: Iakov Salikov <[email protected]>
Co-authored-by: Artem Astapenko <[email protected]>
Co-authored-by: tuliren <[email protected]>
Co-authored-by: Abhi Vaidyanatha <[email protected]>
Co-authored-by: Abhi Vaidyanatha <[email protected]>
Co-authored-by: Jonathan Stacks <[email protected]>
Co-authored-by: Christophe Duong <[email protected]>
ghost pushed a commit to dandorazio/airbyte that referenced this pull request Sep 17, 2021

Successfully merging this pull request may close these issues.

New destination: Databricks Delta Lake
6 participants