feat(iceberg): Upgrade Iceberg ingestion source to pyiceberg 0.4.0 #8357
* CLDN-1784 - Migration to new pyiceberg SDK
* Python: update test case
* Change to fsspec instead of Azure filesystem
* Change how to find column count
* Use Iceberg visitor to build avro schema
* CLDN-1784 - Refactor code to pyiceberg
* Merge setup.py
* Fix linting errors
* Update code comments
* Added table format-version property to output
* Fix dataset_name parsing problem
* Use official pyiceberg release 0.3.0
* Change how we handle missing fields from schema during profiling
Looking great @cccs-eric. First few comments, tomorrow I'll give it a spin against a REST catalog
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_common.py
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_common.py
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py
metadata-ingestion/tests/integration/iceberg/docker-compose.yml
metadata-ingestion/tests/integration/iceberg/docker-compose.yml
…g_profiler.py Co-authored-by: Fokko Driesprong <[email protected]>
…g_profiler.py Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
I'm having a hard time configuring the build to skip the Iceberg source on Python 3.7. pyiceberg is adding support for 3.7, but that work won't be done until late summer or early fall.
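One common way to handle this kind of constraint is to register the plugin's extra only when the interpreter is new enough. The following is a minimal sketch of that idea, not the change made in this PR: the `build_extras` helper and the `ICEBERG_REQUIREMENTS` name are hypothetical, and the version pin is taken from the PR title.

```python
import sys
from typing import Dict, Set, Tuple

# Hypothetical pin for illustration; the PR title mentions pyiceberg 0.4.0.
ICEBERG_REQUIREMENTS: Set[str] = {"pyiceberg==0.4.0"}


def build_extras(python_version: Tuple[int, int]) -> Dict[str, Set[str]]:
    """Return an extras_require-style mapping for the given interpreter version.

    The "iceberg" extra is only registered on Python 3.8 or newer, so a
    Python 3.7 build never tries to resolve pyiceberg.
    """
    extras: Dict[str, Set[str]] = {}
    if python_version >= (3, 8):
        extras["iceberg"] = set(ICEBERG_REQUIREMENTS)
    return extras


# In a setup.py this mapping would be passed to setuptools.setup(extras_require=...).
extras_require = build_extras(sys.version_info[:2])
```

An alternative that keeps the extra defined everywhere is a PEP 508 environment marker on the requirement string, for example `pyiceberg==0.4.0; python_version >= "3.8"`. Either way, tests for the source would also need to be skipped on 3.7.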
Hi @cccs-eric, sorry for the long delay on the review. I think this is really close and I tried to make some small changes myself, but looks like I don't have permissions for that (generally I do on community contribution branches, oops) so I've left some large code change comments instead.
If you give me permissions to make commits to your branch, I can also try to resolve the lint error myself!
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_common.py
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg.py
metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_common.py
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Please merge master into this branch to fix CI
Thanks @asikowitz
Thank you for the well-written contribution!
This PR upgrades the Iceberg ingestion source to the new Iceberg Python API (pyiceberg).
The current Iceberg source relies on the Iceberg python_legacy package and only works for Iceberg tables from a HadoopCatalog stored in Azure Datalake. pyiceberg has since replaced python_legacy and is now officially maintained as part of the Iceberg project. This will introduce support for table catalogs and make acryl-iceberg-legacy obsolete. The Iceberg source will now be able to ingest tables from Iceberg catalogs stored in S3, ADLFS, files, or any other pyiceberg-supported storage layer.
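To illustrate what catalog-based discovery looks like, here is a rough sketch of walking a pyiceberg catalog and loading each table. It is not the code from this PR: `open_catalog` and `iter_tables` are hypothetical helpers, the catalog name and URI are placeholders, and the sketch assumes pyiceberg can infer the catalog type from the URI.

```python
from typing import Any, Iterator, Tuple


def open_catalog(name: str, uri: str) -> Any:
    """Open a catalog by URI. Requires pyiceberg to be installed."""
    # Imported lazily so the rest of this sketch works without pyiceberg.
    from pyiceberg.catalog import load_catalog

    # Assumption: the catalog type (e.g. REST) is inferred from the URI.
    return load_catalog(name, uri=uri)


def iter_tables(catalog: Any) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (identifier, table) for every table in every top-level namespace."""
    for namespace in catalog.list_namespaces():
        for identifier in catalog.list_tables(namespace):
            # load_table reads the table metadata from wherever the catalog
            # points (S3, ADLFS, local files, ...).
            yield tuple(identifier), catalog.load_table(identifier)
```

With a reachable catalog, something like `for ident, table in iter_tables(open_catalog("default", "http://localhost:8181")): print(ident, table.schema())` would print each table's schema; the storage backend is resolved by pyiceberg rather than by the ingestion source.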