Upgrade to pyiceberg #12

cccs-eric · 2022-12-26T14:07:26Z

Still a WIP, looking for code review.

metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg.py

metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_common.py

metadata-ingestion/src/datahub/ingestion/source/iceberg/iceberg_profiler.py

cccs-eric · 2023-01-05T14:43:57Z

@rdblue @Fokko Thank you both for the review, very much appreciated. The pyiceberg API is a nice upgrade to python_legacy!

cccs-eric · 2023-01-13T16:58:13Z

@cccs-tom @cccs-Dustin Here are the remaining things to check FYI:

Finalize review
Wait for new pyiceberg release (so it includes adlfs support that I added 2-3 weeks ago)
Troubleshoot why S3 does not work using the iceberg profiler integration test. I made it pass using PyArrow instead of FsSpec.
Update our fork to have the latest dependency changes from DataHub
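For reference, a minimal sketch of what switching the FileIO implementation from FsSpec to PyArrow could look like. It assumes the switch is made through pyiceberg's `py-io-impl` property. The helper name `catalog_properties` and the `warehouse` value are hypothetical, and how the integration test actually wires this up is not shown in this thread.

```python
# Property key that pyiceberg reads to choose a FileIO implementation.
PY_IO_IMPL = "py-io-impl"

# Fully qualified class names of the two FileIO implementations discussed above.
PYARROW_FILE_IO = "pyiceberg.io.pyarrow.PyArrowFileIO"
FSSPEC_FILE_IO = "pyiceberg.io.fsspec.FsspecFileIO"


def catalog_properties(use_pyarrow: bool = True, **extra: str) -> dict:
    """Hypothetical helper: build a properties dict that pins the FileIO class."""
    impl = PYARROW_FILE_IO if use_pyarrow else FSSPEC_FILE_IO
    return {PY_IO_IMPL: impl, **extra}


# Example: properties for an S3-backed warehouse read through PyArrow.
# The bucket name is a placeholder, not a value from this PR.
props = catalog_properties(use_pyarrow=True, warehouse="s3://example-bucket/warehouse")
```

Keeping the choice in a single property makes it easy to flip back to FsSpec once the S3 problem is understood.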

Once we migrate our platform to REST catalog, we will be able to remove the code that mimics the HadoopCatalog.

cccs-Dustin · 2023-01-16T15:50:59Z

lgtm! (didn't mean to approve the previous message, I will wait until the items on your checklist are completed before giving approval)

cccs-tom

lgtm!

cccs-Dustin · 2023-01-18T14:31:30Z

iceberg_common.py:

dataset_name = ".".join(s for s in strltrim(f.path, self.localfs).split("/") if s)
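A minimal sketch of what the quoted line computes. It assumes `strltrim` strips a leading prefix from a string and that `self.localfs` holds the root path of the warehouse. Both are assumptions, and the stand-in `strltrim` and the function `dataset_name_from_path` below are illustrative, not the code from this PR.

```python
def strltrim(to_trim: str, prefix: str) -> str:
    """Hypothetical stand-in: drop `prefix` from the start of `to_trim` if present."""
    return to_trim[len(prefix):] if to_trim.startswith(prefix) else to_trim


def dataset_name_from_path(path: str, root: str) -> str:
    """Turn a table path under `root` into a dot-separated dataset name."""
    relative = strltrim(path, root)
    # Filtering out empty segments absorbs leading, trailing and doubled slashes.
    return ".".join(s for s in relative.split("/") if s)


name = dataset_name_from_path("/warehouse/db/table/", "/warehouse")
```

The `if s` filter is what keeps a trailing slash or a doubled slash from producing an empty name component.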

* CLDN-1784 - Migration to new pyiceberg SDK
* Python: update test case
* Change to fsspec instead of Azure filesystem
* Change how to find column count
* Use Iceberg visitor to build avro schema
* CLDN-1784 - Refactor code to pyiceberg
* Merge setup.py
* Fix linting errors
* Update code comments
* Added table format-version property to output
* Fix dataset_name parsing problem
* Use official pyiceberg release 0.3.0
* Change how we handle missing fields from schema during profiling

Fokko reviewed Dec 26, 2022

cccs-eric marked this pull request as ready for review January 5, 2023 14:52

cccs-eric marked this pull request as draft January 5, 2023 14:52

cccs-eric marked this pull request as ready for review January 13, 2023 15:19

cccs-eric requested review from cccs-Dustin and cccs-tom January 13, 2023 15:20

cccs-Dustin approved these changes Jan 16, 2023

cccs-Dustin self-requested a review January 16, 2023 15:50

cccs-tom reviewed Jan 16, 2023

cccs-eric added 7 commits February 6, 2023 07:14

CLDN-1784 - Migration to new pyiceberg SDK

b7683ac

Python: update test case

b1dd0d1

Change to fsspec instead of Azure filesystem

13f84de

Change how to find column count

0f1e03a

Use Iceberg visitor to build avro schema

0c159c3

CLDN-1784 - Refactor code to pyiceberg

f870729

Merge setup.py

b0b069f

cccs-eric force-pushed the feature/CLDN-1784 branch from 06f8205 to b0b069f February 6, 2023 12:28

cccs-eric added 4 commits February 6, 2023 08:18

Fix linting errors

b2fd591

Update code comments

09aef9e

Added table format-version property to output

154ef1b

Fix dataset_name parsing problem

1dafc76

cccs-tom approved these changes Feb 6, 2023

Use official pyiceberg release 0.3.0

a4a6f54

cccs-Dustin approved these changes Feb 15, 2023

Change how we handle missing fields from schema during profiling

f3b54d0

cccs-Dustin merged commit 21ea0f6 into cccs-main Feb 16, 2023

cccs-Dustin deleted the feature/CLDN-1784 branch February 16, 2023 12:25
