feat(iceberg): Upgrade Iceberg ingestion source to pyiceberg 0.4.0 #8357

Merged: 42 commits, Aug 31, 2023

@cccs-eric (Contributor) commented Jul 3, 2023

This PR upgrades the Iceberg ingestion source to the new Iceberg Python API (pyiceberg).

The current Iceberg source relies on the Iceberg python_legacy package and only works for Iceberg tables from a HadoopCatalog stored in Azure Data Lake. pyiceberg has since replaced python_legacy and is now officially maintained as part of the Iceberg project. This upgrade introduces support for table catalogs and makes acryl-iceberg-legacy obsolete. The Iceberg source will now be able to ingest tables from Iceberg catalogs stored in S3, ADLFS, files, or any other storage layer supported by pyiceberg.
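To illustrate what catalog support means in practice, here is a minimal sketch of walking a catalog and turning table identifiers into dataset names. It is not the code from this PR: the helper names `iter_table_identifiers`, `dataset_name`, and `open_catalog` are hypothetical, and the only pyiceberg calls assumed are `load_catalog`, `Catalog.list_namespaces`, and `Catalog.list_tables`.

```python
from typing import Any, Iterator, Tuple

# An Iceberg identifier is a tuple of name parts, e.g. ("db", "table").
Identifier = Tuple[str, ...]


def iter_table_identifiers(catalog: Any) -> Iterator[Identifier]:
    """Yield every table identifier in the catalog, namespace by namespace.

    `catalog` is expected to behave like a pyiceberg Catalog, i.e. to expose
    list_namespaces() and list_tables(namespace).
    """
    for namespace in catalog.list_namespaces():
        for table_id in catalog.list_tables(namespace):
            yield tuple(table_id)


def dataset_name(table_id: Identifier) -> str:
    """Join the identifier parts into a dotted dataset name (hypothetical convention)."""
    return ".".join(table_id)


def open_catalog(name: str, **properties: str) -> Any:
    """Open a catalog through pyiceberg.

    The import is done lazily so the helpers above stay usable (and testable)
    on machines where pyiceberg is not installed.
    """
    from pyiceberg.catalog import load_catalog

    return load_catalog(name, **properties)


# Example usage, assuming a REST catalog at a placeholder URL:
#   catalog = open_catalog("default", type="rest", uri="http://localhost:8181")
#   for table_id in iter_table_identifiers(catalog):
#       print(dataset_name(table_id))
```

Because the storage layer is resolved by the catalog and its properties, the same loop should apply whether the tables live in S3, ADLFS, or local files.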

cccs-Dustin and others added 3 commits June 30, 2023 11:26
* CLDN-1784 - Migration to new pyiceberg SDK

* Python: update test case

* Change to fsspec instead of Azure filesystem

* Change how to find column count

* Use Iceberg visitor to build avro schema

* CLDN-1784 - Refactor code to pyiceberg

* Merge setup.py

* Fix linting errors

* Update code comments

* Added table format-version property to output

* Fix dataset_name parsing problem

* Use official pyiceberg release 0.3.0

* Change how we handle missing fields from schema during profiling
@github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Jul 3, 2023
@cccs-eric changed the title from "feat(iceberg): Upgrade Iceberg ingestion source to pyiceberg 0.3.0" to "feat(iceberg): Upgrade Iceberg ingestion source to pyiceberg 0.4.0" Jul 3, 2023

@Fokko (Contributor) left a comment

Looking great, @cccs-eric. Here are a first few comments; tomorrow I'll give it a spin against a REST catalog.

@anshbansal added the community-contribution label (PR or Issue raised by member(s) of DataHub Community) Jul 17, 2023
@asikowitz self-assigned this Jul 17, 2023

@cccs-eric (Contributor, Author) commented

I'm having a hard time configuring the build to skip the Iceberg source on Python 3.7. pyiceberg is adding support for 3.7, but that work won't be done until late summer or early fall.
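One possible way to gate a plugin on the interpreter version is sketched below. This is not the approach taken in this PR: the `ALL_EXTRAS` table and the `enabled_extras` helper are hypothetical, and the 3.8 minimum is an assumption inferred from the comment above.

```python
import sys
from typing import Dict, Set, Tuple

# Assumption: the Iceberg extra needs Python 3.8 or newer.
ICEBERG_MIN_PYTHON: Tuple[int, int] = (3, 8)

# Hypothetical extras table; a real setup.py would list many more plugins.
ALL_EXTRAS: Dict[str, Set[str]] = {
    "iceberg": {"pyiceberg"},
    "file": set(),
}


def enabled_extras(python_version: Tuple[int, int]) -> Dict[str, Set[str]]:
    """Return the extras to build and test on the given interpreter version."""
    extras = dict(ALL_EXTRAS)
    if python_version < ICEBERG_MIN_PYTHON:
        # Drop the plugin entirely so neither install nor tests reference it.
        extras.pop("iceberg")
    return extras


# Extras for the interpreter running this script.
current_extras = enabled_extras((sys.version_info.major, sys.version_info.minor))
```

The same condition could instead be expressed as a PEP 508 environment marker on the dependency (`pyiceberg; python_version >= "3.8"`), with the matching tests skipped through `pytest.mark.skipif(sys.version_info < (3, 8), reason=...)`.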

@Fokko mentioned this pull request Jul 30, 2023

@asikowitz (Collaborator) left a comment

Hi @cccs-eric, sorry for the long delay on the review. I think this is really close. I tried to make some small changes myself, but it looks like I don't have permissions for that (generally I do on community contribution branches, oops), so I've left some large code change comments instead.

If you give me permissions to make commits to your branch, I can also try to resolve the lint error myself!

metadata-ingestion/setup.py (resolved)
metadata-ingestion/tests/unit/test_iceberg.py (outdated, resolved)
metadata-ingestion/tests/unit/test_iceberg.py (outdated, resolved)

@vercel bot commented Aug 21, 2023

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

@asikowitz (Collaborator) commented

Please merge master into this branch to fix CI

@asikowitz merged commit 6fe60a2 into datahub-project:master Aug 31, 2023
57 checks passed

@cccs-eric (Contributor, Author) commented

Thanks @asikowitz

@asikowitz (Collaborator) commented

Thank you for the well-written contribution!
