Add iceberg datasource #46889
Signed-off-by: Dev <[email protected]>
Love the comments documenting some of the less obvious things in the code. A few questions and nitpicky comments.
Signed-off-by: Dev <[email protected]>
Thanks for responding / addressing comments. Lgtm!
Thanks for the contribution @dev-goyal !
Co-authored-by: Scott Lee <[email protected]> Signed-off-by: dev-goyal <[email protected]>
Signed-off-by: Dev <[email protected]>
9de0817 to 0031576
## Why are these changes needed?

This PR adds the capability to load an Iceberg table into a Ray Dataset. The existing PyIceberg functionality must first materialize the entire Iceberg table into a single `pyarrow` table, which is then converted to a Ray Dataset. This PR instead provides a streaming implementation, in which the Iceberg table is distributed across a Ray Dataset.

## Related issue number

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

---------

Signed-off-by: Dev <[email protected]>
Signed-off-by: dev-goyal <[email protected]>
Signed-off-by: Alan Guo <[email protected]>
Signed-off-by: tungh2 <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Galen Wang <[email protected]>
Signed-off-by: Shilun Fan <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: cristianjd <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: tungh2 <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Galen Wang <[email protected]>
Co-authored-by: Max van Dijck <[email protected]>
Co-authored-by: slfan1989 <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
Co-authored-by: Matthew Owen <[email protected]>
Co-authored-by: cristianjd <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
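
To illustrate the difference the description draws between materializing a whole table up front and distributing it as independent reads, here is a minimal pure-Python sketch. It is not the code from this PR, nor Ray's or PyIceberg's API: `FAKE_TABLE`, `read_all_eagerly`, and `plan_read_tasks` are hypothetical names, and an in-memory dictionary stands in for the table's data files.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, int]

# Hypothetical stand-in for an Iceberg table: data file path -> rows in that file.
FAKE_TABLE: Dict[str, List[Row]] = {
    "data/file-0.parquet": [{"id": 0}, {"id": 1}],
    "data/file-1.parquet": [{"id": 2}],
    "data/file-2.parquet": [{"id": 3}, {"id": 4}],
}


def read_all_eagerly(files: Sequence[str]) -> List[Row]:
    """Load every file into a single in-memory list before returning anything."""
    rows: List[Row] = []
    for path in files:
        rows.extend(FAKE_TABLE[path])
    return rows


def plan_read_tasks(
    files: Sequence[str], parallelism: int
) -> List[Callable[[], List[Row]]]:
    """Split the files into at most `parallelism` chunks and return one
    deferred read task per chunk. Nothing is read until a task is called,
    so each task could run on a different worker."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    n = min(parallelism, len(files))
    # Round-robin assignment of files to chunks.
    chunks = [list(files[i::n]) for i in range(n)]

    def make_task(chunk: List[str]) -> Callable[[], List[Row]]:
        def task() -> List[Row]:
            return [row for path in chunk for row in FAKE_TABLE[path]]

        return task

    return [make_task(chunk) for chunk in chunks]
```

In this sketch each task only touches its own chunk of files, so no single process has to hold the full table, which is the property the description attributes to the streaming approach.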
This is a very good PR, but I noticed that it only implements read operations. Are there any plans to support write operations in the future? 😃
@wengzhenjie Currently we don't have plans to add write operations, but we welcome community contributions and would be happy to help review!
Waiting on PyIceberg to support it as a first-class operation. Furthermore, waiting on this to be fixed