Add iceberg datasource #46889

Merged: 87 commits, Aug 15, 2024

@dev-goyal dev-goyal commented Jul 31, 2024

Why are these changes needed?

This PR adds the ability to load an Iceberg table into a Ray Dataset. The existing PyIceberg functionality must first materialize the entire Iceberg table into a single pyarrow table, which is then converted to a Ray Dataset. This PR instead provides a streaming implementation, in which the Iceberg table is distributed across a Ray Dataset.
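
To make the contrast above concrete, here is a minimal, library-free sketch of the two strategies. It deliberately uses neither Ray nor PyIceberg: `DataFile`, `materialize_all`, and `plan_read_tasks` are hypothetical names invented for illustration and are not the code added in this PR. It only assumes that a table scan can be broken down into a list of data files that can be read independently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple

Row = Dict[str, int]


@dataclass
class DataFile:
    """Hypothetical stand-in for one data file listed by a table scan."""

    path: str
    rows: Tuple[Row, ...]


def materialize_all(files: List[DataFile]) -> List[Row]:
    """Eager strategy: read every file into one in-memory table first."""
    table: List[Row] = []
    for data_file in files:
        table.extend(data_file.rows)
    return table


def plan_read_tasks(
    files: List[DataFile], parallelism: int
) -> List[Callable[[], Iterator[Row]]]:
    """Streaming strategy: split the files into independent, lazy read tasks.

    Each task only touches its own files when it is called, so tasks could be
    handed to separate workers instead of one process holding the whole table.
    """
    num_tasks = max(1, min(parallelism, len(files)))
    # Round-robin assignment of files to tasks.
    chunks = [files[i::num_tasks] for i in range(num_tasks)]

    def make_task(chunk: List[DataFile]) -> Callable[[], Iterator[Row]]:
        def task() -> Iterator[Row]:
            for data_file in chunk:
                yield from data_file.rows

        return task

    return [make_task(chunk) for chunk in chunks]
```

Both strategies yield the same rows. The difference is that `plan_read_tasks` returns work units that can be executed separately and lazily, which is the property a distributed dataset needs.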

Related issue number

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [x] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

@omatthew98 omatthew98 left a comment

Love the comments documenting some of the less obvious parts of the code. A few questions and nitpicks.

doc/source/data/loading-data.rst (outdated, resolved)
doc/source/data/loading-data.rst (resolved)
python/ray/data/_internal/datasource/iceberg_datasource.py (outdated, resolved)
python/ray/data/_internal/datasource/iceberg_datasource.py (outdated, resolved)
python/ray/data/read_api.py (resolved)

@omatthew98 omatthew98 left a comment

Thanks for responding to and addressing the comments. LGTM!

@scottjlee scottjlee left a comment

Thanks for the contribution, @dev-goyal!

@scottjlee scottjlee added the "go" label ("add ONLY when ready to merge, run all tests") on Aug 15, 2024
Signed-off-by: Dev <[email protected]>
@scottjlee scottjlee merged commit 14a095f into ray-project:master Aug 15, 2024
5 checks passed
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Aug 16, 2024
## Why are these changes needed?

This PR adds the capability to load an Iceberg table into a Ray Dataset.
Compared to the PyIceberg functionality, which can only materialize the
entire Iceberg table into a single `pyarrow` table first, which is then
converted to a Ray dataset, this PR allows a streaming implementation,
where the Iceberg table can be distributed into a Ray Dataset.

## Related issue number

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
   - [x] I've added any new APIs to the API Reference. For example, if I
     added a method in Tune, I've added it in `doc/source/tune/api/` under the
     corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Dev <[email protected]>
Signed-off-by: dev-goyal <[email protected]>
Signed-off-by: Alan Guo <[email protected]>
Signed-off-by: tungh2 <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Galen Wang <[email protected]>
Signed-off-by: Shilun Fan <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: cristianjd <[email protected]>
Signed-off-by: Justin Yu <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Sven Mika <[email protected]>
Co-authored-by: Alan Guo <[email protected]>
Co-authored-by: tungh2 <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
Co-authored-by: Galen Wang <[email protected]>
Co-authored-by: Max van Dijck <[email protected]>
Co-authored-by: slfan1989 <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Co-authored-by: Samuel Chan <[email protected]>
Co-authored-by: Matthew Owen <[email protected]>
Co-authored-by: cristianjd <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
@wengzhenjie

This is a very good PR, but I noticed that it only implements read operations. Are there any plans to support write operations in the future? 😃

@scottjlee

@wengzhenjie Currently we don't have plans to add write operations, but we welcome community contributions and would be happy to help review!

@dev-goyal dev-goyal commented Sep 12, 2024

We're waiting on PyIceberg to support writes as a first-class operation. We're also waiting on the following to be fixed:

apache/iceberg-python#1132

Labels
  • data: Ray Data-related issues
  • enhancement: Request for new feature and/or capability
  • go: add ONLY when ready to merge, run all tests
  • P0: Issues that should be fixed in short order