[Core] Support Arrow zerocopy serialization in object store #35110
Conversation
@jovany-wang, please help review, thanks!
Could you add a cross language unit test?
```cython
@@ -347,7 +347,7 @@ cdef class Pickle5Writer:
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef void write_to(self, const uint8_t[:] inband, uint8_t[:] data,
                       int memcopy_threads) nogil:
```
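For context, a minimal Python sketch (not code from this PR) of the out-of-band buffer mechanism that `Pickle5Writer` builds on: pickle protocol 5 keeps large buffers out of the inband stream, which is what lets the object store place them without copying.

```python
import pickle

# A large, contiguous payload to serialize.
payload = bytearray(b"x" * 1024)

# buffer_callback collects out-of-band buffers instead of embedding them
# in the pickle stream (list.append returns None, marking them out-of-band).
buffers = []
inband = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                      buffer_callback=buffers.append)

# The payload travels out of band, so the inband stream stays tiny.
assert len(inband) < len(payload)

# Deserialize by handing the buffers back; the data is viewed, not copied.
restored = pickle.loads(inband, buffers=buffers)
assert bytes(restored) == bytes(payload)
```

`Pickle5Writer.write_to` above implements the write side of this scheme in Cython, copying the inband stream and the collected buffers into one shared-memory region.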
Why change this?
Something went wrong with the GIL; let me update this comment after I do some more testing.
Added back `nogil` to all `write_to` functions. However, for the function at L567 we have to wrap the code block in `with gil`, since pyarrow functions are used there.
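For illustration, a Cython-style sketch (the body is hypothetical, not the PR's actual code) of the pattern described: the function stays `nogil`, and the GIL is re-acquired only around the Python-level pyarrow calls.

```cython
cdef void write_to(self, const uint8_t[:] inband, uint8_t[:] data,
                   int memcopy_threads) nogil:
    # Raw memory copies can run without holding the GIL.
    ...
    with gil:
        # pyarrow calls are Python-level, so the GIL must be
        # re-acquired for this block only.
        ...
```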
Thanks for your review! I will add more tests later.
Nice! So after this, we can also remove the …
Yes, we no longer need …
Signed-off-by: Deegue <[email protected]>
Added two cross language test cases and revised accordingly.
Gentle ping @ericl @jovany-wang for another review, thanks!
Some lint failures, which seem to be around …
LGTM, pending tests
Signed-off-by: Deegue <[email protected]>
Merged, thanks!
…ay-project#35110)" This reverts commit 158c2bf.
Thanks for your review, @ericl @jovany-wang @kira-lin! Sorry for the test failure, @rkooo567; let me check and fix it later.
Tested. This assert failure looks unrelated to this PR; gentle ping @rkooo567 @ericl for help, thanks! (See `ray/python/ray/tests/test_advanced_9.py`, lines 426 to 430 at 0b190ee.)
… store (ray-project#35110)" (ray-project#36000)" This reverts commit 822904b.
…ect#35110) Support Arrow in object store with zerocopy and improve performance.

We made a benchmark under the dataset [NYC TAXI FARE](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data), which has 8 columns and 55423855 rows in csv, 5.4G on disk. Here are the results:

| Java to Java | Java Write (ms) | Java Read (ms) |
| :---: | :---: | :---: |
| Before | 23,637 | 3,162 |
| After | 23,320 | 226 |

| Java to Python | Java Write (ms) | Python Read (ms) |
| :---: | :---: | :---: |
| Before | 28,771 | 2,645 |
| After | 25,864 | 8 |

| Python to Java | Python Write (ms) | Java Read (ms) |
| :---: | :---: | :---: |
| Before | 10,597 | 3,386 |
| After | 5,271 | 3,251 |

| Python to Python | Python Write (ms) | Python Read (ms) |
| :---: | :---: | :---: |
| Before | 9,113 | 988 |
| After | 5,636 | 66 |

Benchmark code:

```python
import ray, raydp, time
from pyarrow import csv
import sys

file_path = "FilePath_/train.csv"
# file_path = "FilePath_/train_tiny.csv"

if __name__ == '__main__':
    ray.init()
    write, read = sys.argv[1], sys.argv[2]
    assert write in ("java", "python") and read in ("java", "python"), \
        "Illegal arguments. Please use java or python"
    spark = raydp.init_spark('benchmark', 10, 5, '2G',
                             configs={"spark.default.parallelism": 50})
    if write == "java":
        df = spark.read.format("csv").option("header", "true") \
            .option("inferSchema", "true") \
            .load(f"file://{file_path}")
        print(df.count())
        start = time.time()
        blocks, _ = raydp.spark.dataset._save_spark_df_to_object_store(df, False)
        end = time.time()
        ds = ray.data.from_arrow_refs(blocks)
    elif write == "python":
        table = csv.read_csv(file_path)
        start = time.time()
        ds = ray.data.from_arrow(table)
        end = time.time()
        print(ds.num_blocks())
        ds = ds.repartition(50)
    print(f"{write} writing takes {end - start} seconds.")
    if read == "java":
        start = time.time()
        df = ds.to_spark(spark)
        end = time.time()
        print(df.count())
    elif read == "python":
        start = time.time()
        ray.get(ds.get_internal_block_refs())
        end = time.time()
    print(f"{read} reading takes {end - start} seconds.")
    raydp.stop_spark()
    ray.shutdown()
```

Signed-off-by: e428265 <[email protected]>
…ay-project#35110)" (ray-project#36000) This reverts commit 158c2bf. Signed-off-by: e428265 <[email protected]>
Why are these changes needed?
Support Arrow in object store with zerocopy and improve performance.
We made a benchmark under the dataset NYC TAXI FARE, which has 8 columns and 55423855 rows in csv, 5.4G on disk.
The detailed results and benchmark code are listed in the commit message above.
Related issue number
A follow-up PR of #20242, @kira-lin
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.