
[Core] Support Arrow zerocopy serialization in object store #35110

Merged
6 commits merged into ray-project:master on Jun 1, 2023

Conversation

Deegue (Contributor) commented May 6, 2023

Why are these changes needed?

Support Arrow tables in the object store with zero-copy serialization, improving performance.

We ran a benchmark on the [NYC TAXI FARE](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data) dataset, which has 8 columns and 55,423,855 rows in CSV form, 5.4 GB on disk.

Here are the results:

| Java to Java | Java Write (ms) | Java Read (ms) |
| :---: | :---: | :---: |
| Before | 23,637 | 3,162 |
| After | 23,320 | 226 |

| Java to Python | Java Write (ms) | Python Read (ms) |
| :---: | :---: | :---: |
| Before | 28,771 | 2,645 |
| After | 25,864 | 8 |

| Python to Java | Python Write (ms) | Java Read (ms) |
| :---: | :---: | :---: |
| Before | 10,597 | 3,386 |
| After | 5,271 | 3,251 |

| Python to Python | Python Write (ms) | Python Read (ms) |
| :---: | :---: | :---: |
| Before | 9,113 | 988 |
| After | 5,636 | 66 |
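The read-side wins come from mapping Arrow buffers directly out of shared memory instead of copying and deserializing bytes. As a minimal sketch of that path, using only public `ray` and `pyarrow` APIs (an illustration of the mechanism, not the benchmark itself):

```python
import time

import pyarrow as pa
import ray

ray.init()

# Put an Arrow table into the object store. With zero-copy
# serialization, ray.get can back the restored table's buffers
# with shared memory instead of a deserialized byte copy.
table = pa.table({"x": list(range(1_000_000))})
ref = ray.put(table)

start = time.time()
restored = ray.get(ref)
print(f"read took {time.time() - start:.4f} seconds")
assert restored.equals(table)

ray.shutdown()
```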

Benchmark code:

```python
import ray, raydp, time
from pyarrow import csv
import sys

file_path = "FilePath_/train.csv"
# file_path = "FilePath_/train_tiny.csv"

if __name__ == '__main__':
  ray.init()
  write, read = sys.argv[1], sys.argv[2]
  assert write in ("java", "python") and read in ("java", "python"), "Illegal arguments. Please use java or python"

  spark = raydp.init_spark('benchmark', 10, 5, '2G', configs={"spark.default.parallelism": 50})

  if write == "java":
    df = spark.read.format("csv").option("header", "true") \
            .option("inferSchema", "true") \
            .load(f"file://{file_path}")
    print(df.count())
    start = time.time()
    blocks, _ = raydp.spark.dataset._save_spark_df_to_object_store(df, False)
    end = time.time()
    ds = ray.data.from_arrow_refs(blocks)
  elif write == "python":
    table = csv.read_csv(file_path)
    start = time.time()
    ds = ray.data.from_arrow(table)
    end = time.time()
    print(ds.num_blocks())
    ds = ds.repartition(50)

  print(f"{write} writing takes {end - start} seconds.")

  if read == "java":
    start = time.time()
    df = ds.to_spark(spark)
    end = time.time()
    print(df.count())
  elif read == "python":
    start = time.time()
    ray.get(ds.get_internal_block_refs())
    end = time.time()

  print(f"{read} reading takes {end - start} seconds.")

  raydp.stop_spark()
  ray.shutdown()
```
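The script takes the writer and reader as positional arguments; for example, assuming it is saved as `benchmark.py`, `python benchmark.py java python` measures a Java (Spark) write followed by a Python read.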

Related issue number

A follow-up PR to #20242. cc @kira-lin

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

kira-lin (Contributor) commented May 8, 2023

@jovany-wang, please help review, thanks!

@jovany-wang jovany-wang self-assigned this May 8, 2023
jovany-wang (Contributor) left a comment:

Could you add a cross-language unit test?

(Review threads on src/ray/core_worker/lib/java/jni_utils.h and java/BUILD.bazel were marked resolved.)
```
@@ -347,7 +347,7 @@ cdef class Pickle5Writer:
@cython.boundscheck(False)
@cython.wraparound(False)
cdef void write_to(self, const uint8_t[:] inband, uint8_t[:] data,
                   int memcopy_threads) nogil:
```
Contributor:

Why change this?

Deegue (Contributor Author):

Something went wrong with the GIL; let me update this comment after I do some more testing.

Deegue (Contributor Author):

Added `nogil` back to all `write_to` functions.

However, for the function at line 567 we have to wrap the code block with `with gil`, since it calls pyarrow functions, which are Python-level calls that need the GIL held.

@jovany-wang added the @external-author-action-required label (alternate tag for PRs where the author doesn't have labeling permission) on May 8, 2023
Deegue (Contributor Author) commented May 9, 2023

> Could you add a cross-language unit test?

Thanks for your review! I will add more tests later.
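For illustration, here is a Python-side round-trip check of the kind such a test might include (a hypothetical sketch, not the actual tests added in this PR; the real cross-language tests also exercise the Java side):

```python
import pyarrow as pa
import ray

def test_arrow_table_roundtrip():
    ray.init(ignore_reinit_error=True)
    # Round-trip an Arrow table through the object store and verify
    # the contents survive serialization unchanged.
    table = pa.table({"a": list(range(100)), "b": [str(i) for i in range(100)]})
    restored = ray.get(ray.put(table))
    assert restored.schema.equals(table.schema)
    assert restored.equals(table)
```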

Deegue changed the title from "[Core] Support Arrow zerocopy serialization in object store" to "[WIP] [Core] Support Arrow zerocopy serialization in object store" on May 9, 2023
ericl (Contributor) commented May 9, 2023

Nice! So after this, we can also remove the bytes block type for Ray Data, right?

kira-lin (Contributor):

> Nice! So after this, we can also remove the bytes block type for Ray Data, right?

Yes, we no longer need the bytes block type for Ray Dataset after this PR.
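With Arrow tables serialized natively, a Ray Dataset's blocks can stay as `pyarrow.Table` objects end to end. A quick way to observe the block type, sketched with the same public APIs the benchmark above uses:

```python
import pyarrow as pa
import ray

ray.init(ignore_reinit_error=True)

# Build a dataset from an Arrow table and fetch its first block.
ds = ray.data.from_arrow(pa.table({"x": [1, 2, 3]}))
block = ray.get(ds.get_internal_block_refs()[0])
print(type(block))  # expected: a pyarrow.Table block rather than raw bytes
```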

Deegue (Contributor Author) commented May 18, 2023

Added two cross-language test cases, revised the Python-to-Python benchmark results in the PR description, and resolved all of the comments above. Thanks @kira-lin for the offline discussion.

Gentle ping @ericl @jovany-wang for another review, thanks!

Deegue changed the title from "[WIP] [Core] Support Arrow zerocopy serialization in object store" to "[Core] Support Arrow zerocopy serialization in object store" on May 18, 2023
@jovany-wang removed the @external-author-action-required label on May 19, 2023
ericl (Contributor) commented May 30, 2023

There are some lint failures, which seem to be around the `pa` import. This should also be rebased.

ericl (Contributor) left a comment:

LGTM, pending tests

@ericl ericl merged commit 158c2bf into ray-project:master Jun 1, 2023
ericl (Contributor) commented Jun 1, 2023

Merged, thanks!

rkooo567 (Contributor) commented Jun 1, 2023

(screenshot: test_advanced_9 failure)

Just a heads up: this breaks test_advanced_9, so I will revert the PR.

Deegue (Contributor Author) commented Jun 2, 2023

Thanks for your review, @ericl @jovany-wang @kira-lin! Sorry for the test failure, @rkooo567; let me check and fix it later.

Deegue (Contributor Author) commented Jun 5, 2023

Tested test_advanced_9 on a local device and it passed. (screenshot: local test run output)

This assert failure looks unrelated to this PR; gentle ping @rkooo567 @ericl for help, thanks! The failing assertion:

```python
import numpy  # noqa: F401
from threadpoolctl import threadpool_info

for pool_info in threadpool_info():
    assert pool_info["num_threads"] == 2
```

Deegue added a commit to Deegue/ray that referenced this pull request Jun 7, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023