[Data] Add read API for Delta sharing tables #46072
Conversation
Force-pushed from 56fe0f1 to 8019e9b.
Thank you @brucebismarck for the contribution! Looks solid. I have some comments.
python/ray/data/read_api.py (outdated):

    version: Optional[int] = None,
    timestamp: Optional[str] = None,
    jsonPredicateHints: Optional[str] = None,
    parallelism: int = -1,
`parallelism` is deprecated; we can remove it.
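The usual way to retire an argument like this is to keep accepting it for a release while warning callers to migrate. Below is a hedged sketch of that pattern with a hypothetical `read_tables` wrapper; it is not Ray's actual deprecation machinery, just an illustration of the reviewer's point that `concurrency` supersedes `parallelism`.

```python
import warnings

def read_tables(url, *, parallelism=-1, concurrency=None):
    # Hypothetical wrapper: warn when the deprecated argument is passed,
    # so existing callers keep working while they migrate to `concurrency`.
    if parallelism != -1:
        warnings.warn(
            "`parallelism` is deprecated; use `concurrency` instead.",
            DeprecationWarning,
        )
    return concurrency

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_tables("profile#share.schema.table", parallelism=4)

print(issubclass(caught[0].category, DeprecationWarning))  # True
```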
python/ray/data/read_api.py (outdated):

    This function reads data from a Delta Sharing table specified by the URL.
    It supports various options such as limiting the number of rows, specifying
    a version or timestamp, and configuring parallelism and concurrency.
`configuring parallelism and concurrency` -> `configuring concurrency`.
python/ray/data/read_api.py
Outdated
ml_adhoc_data_sharing.ralph_user_pairs_with_seq_features_310d_p01_train", | ||
limit=100000, | ||
version=1, | ||
).materialize() |
`.materialize()` can be removed.
python/ray/data/read_api.py
Outdated
setup_delta_sharing_connections, | ||
) | ||
|
||
table, rest_client = setup_delta_sharing_connections(url) |
Can this be called inside `DeltaSharingDatasource.__init__` instead?
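The refactor the reviewer suggests moves the connection setup out of `read_api.py` and into the datasource's constructor, so callers only pass the URL. A minimal sketch, with a stub standing in for the real `setup_delta_sharing_connections` helper and a simplified `DeltaSharingDatasource`; the names mirror the PR but the bodies are illustrative, not Ray's internals.

```python
def setup_delta_sharing_connections(url):
    # Stub for the real helper, which parses the profile URL and builds a
    # Delta Sharing REST client. Here we just split off the profile part.
    profile, _, table_path = url.partition("#")
    return table_path, f"client-for-{profile}"

class DeltaSharingDatasource:
    def __init__(self, url, json_predicate_hints=None):
        # Connection setup happens here instead of in the read API,
        # keeping read_api.py free of Delta-Sharing-specific logic.
        self._table, self._rest_client = setup_delta_sharing_connections(url)
        self._json_predicate_hints = json_predicate_hints

ds = DeltaSharingDatasource("profile.json#share.schema.table")
print(ds._table)        # share.schema.table
print(ds._rest_client)  # client-for-profile.json
```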
    :param url: a url under the format "<profile>#<share>.<schema>.<table>"
    """
    from delta_sharing.protocol import DeltaSharingProfile, Table
Ditto: change to `_check_import`.
    @@ -0,0 +1,133 @@
    import json
Nit: for the file name, let's keep it consistent with `test_delta_sharing.py`.
    self.assertEqual(table.share, "share")
    self.assertEqual(table.schema, "schema")
    self.assertIsInstance(rest_client, DataSharingRestClient)
Is it possible to test `ray.data.read_delta_sharing_tables` as well?
Force-pushed from e5a16dc to fe364af.
Couple of questions:
Traceback (most recent call last):
Tested this in my dataset and it works.
You can also run the following script to fix the format:
The error is:
Ray CI does not install
checkin lint update read_api update read_api add docstring, ut etc. updates revert precommit-config checkin-last address comments update tests update tests Signed-off-by: Wenyue Liu <[email protected]>
Force-pushed from c077dc8 to 094d8ff.
Merging in master to attempt to resolve CI failures.
Looks like there are merge conflicts. @brucebismarck, can you review?
@anyscalesam |
Force-pushed from 460b552 to 8b6ddfb.
    if __name__ == "__main__":
        unittest.main()
We usually use pytest as the main function. Not sure if unittest also works well, but let's keep it consistent with the other files.
Suggested change:

    -if __name__ == "__main__":
    -    unittest.main()
    +if __name__ == "__main__":
    +    import sys
    +    sys.exit(pytest.main(["-v", __file__]))
    def __init__(
        self,
        url: str,
        jsonPredicateHints: Optional[str] = None,
I noticed some variables using the camelCase naming convention. Can you update them to snake_case? We follow Google's Python style guide.
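The rename asked for here (e.g. `jsonPredicateHints` to `json_predicate_hints`) can be done mechanically. A small helper sketching the conversion, useful for spotting every camelCase parameter in the diff; this is an illustration, not part of the PR:

```python
import re

def to_snake_case(name: str) -> str:
    # Insert an underscore before each uppercase letter (except a leading
    # one), then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(to_snake_case("jsonPredicateHints"))  # json_predicate_hints
```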
Force-pushed from fc8fb34 to e02de9a.
…ad task part Signed-off-by: Wenyue Liu <[email protected]>
Why are these changes needed?
Databricks prefers Delta Sharing over SQL statement execution for sharing data externally.
Using SQL statement execution (the current databricks_uc_datasource) has a 100 GB limit, beyond which data is truncated:
https://github.com/ray-project/ray/blob/master/python/ray/data/datasource/databricks_uc_datasource.py#L99-L103
This is a design decision by Databricks. Delta Sharing, however, has no such limit; I have tested locally pulling 180+ GB of data from Databricks into Ray Data.
With this change, transferring data from Databricks, one of the most widely used and trusted data management systems, to Ray becomes much easier. Otherwise, the best workaround is a Spark job that dumps the data to cloud storage (e.g., S3), with Ray reading it from there.
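The PR's docstring gives the table address format as "<profile>#<share>.<schema>.<table>". This minimal parser shows the pieces the new read API has to extract from that URL; it is an illustration of the format only, not Ray's actual implementation, and it assumes the table name holds the remainder after the first two dots.

```python
def parse_delta_sharing_url(url: str):
    # Split "<profile>#<share>.<schema>.<table>" into its four components.
    profile, sep, rest = url.partition("#")
    if not sep:
        raise ValueError('expected "<profile>#<share>.<schema>.<table>"')
    share, schema, table = rest.split(".", 2)
    return profile, share, schema, table

print(parse_delta_sharing_url("profile.json#share.schema.table"))
# ('profile.json', 'share', 'schema', 'table')
```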
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I add a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.