Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to pyiceberg #12

Merged
merged 13 commits into from
Feb 16, 2023
Merged

Upgrade to pyiceberg #12

merged 13 commits into from
Feb 16, 2023

Conversation

cccs-eric
Copy link
Collaborator

Still a WIP, looking for code review.

@cccs-eric
Copy link
Collaborator Author

@rdblue @Fokko Thank you both for the review, very much appreciated. The pyiceberg API is a nice upgrade to python_legacy!

@cccs-eric cccs-eric marked this pull request as ready for review January 5, 2023 14:52
@cccs-eric cccs-eric marked this pull request as draft January 5, 2023 14:52
@cccs-eric cccs-eric marked this pull request as ready for review January 13, 2023 15:19
@cccs-eric
Copy link
Collaborator Author

@cccs-tom @cccs-Dustin Here are the remaining things to check FYI:

  1. Finalize review
  2. Wait for new pyiceberg release (so it includes adlfs support that I added 2-3 weeks ago)
  3. Troubleshoot why S3 does not work using the iceberg profiler integration test. I made it pass using PyArrow instead of FsSpec.
  4. Update our fork to have the latest dependency changes from DataHub

Once we migrate our platform to REST catalog, we will be able to remove the code that mimics the HadoopCatalog.

cccs-Dustin
cccs-Dustin approved these changes Jan 16, 2023
@cccs-Dustin cccs-Dustin self-requested a review January 16, 2023 15:50
@cccs-Dustin
Copy link
Member

lgtm! (didn't mean to approve the previous message, I will wait until the items on your checklist are completed before giving approval)

Copy link
Member

@cccs-tom cccs-tom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@cccs-Dustin
Copy link
Member

iceberg_common.py:

dataset_name = ".".join(s for s in strltrim(f.path, self.localfs).split("/") if s)

@cccs-Dustin cccs-Dustin merged commit 21ea0f6 into cccs-main Feb 16, 2023
@cccs-Dustin cccs-Dustin deleted the feature/CLDN-1784 branch February 16, 2023 12:25
cccs-eric added a commit that referenced this pull request Jul 3, 2023
* CLDN-1784 - Migration to new pyiceberg SDK

* Python: update test case

* Change to fsspec instead of Azure filesystem

* Change how to find column count

* Use Iceberg visitor to build avro schema

* CLDN-1784 - Refactor code to pyiceberg

* Merge setup.py

* Fix linting errors

* Update code comments

* Added table format-version property to output

* Fix dataset_name parsing problem

* Use official pyiceberg release 0.3.0

* Change how we handle missing fields from schema during profiling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants