Ray on spark implementation #28771

Merged
merged 126 commits into from
Dec 20, 2022

WeichenXu123 (Contributor) commented Sep 26, 2022

Signed-off-by: Weichen Xu [email protected]

Why are these changes needed?

REP: ray-project/enhancements#14

Commands to run tests:

Prerequisite

  • Only Linux (e.g. Ubuntu) is supported; macOS is not supported for now.
  • Spark version >= 3.3
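
The two prerequisites above can be checked programmatically before running the tests. The following is a minimal sketch; the helper name `meets_prerequisites` is hypothetical and not part of this PR, and it compares only the major and minor Spark version numbers.

```python
def meets_prerequisites(platform: str, spark_version: str) -> bool:
    """Return True if the platform is Linux and Spark is at least 3.3.

    Hypothetical helper for illustration only; not part of this PR.
    """
    # Compare only the major and minor components, e.g. "3.3.1" -> (3, 3).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return platform.startswith("linux") and (major, minor) >= (3, 3)
```

In a real environment it could be called as `meets_prerequisites(sys.platform, pyspark.__version__)`.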

Testing on a local machine (requires Linux)

  1. Install the Ray dev version in editable mode.

First, merge the latest master into the ray-on-spark branch. Then cd ray-repo and install the latest Ray dev wheel:

pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl

(For other Python versions, see https://docs.ray.io/en/latest/ray-overview/installation.html#install-nightlies.) Then run python python/ray/setup-dev.py -y, which links the installed Ray source directory to your local Ray repo code directory.

  2. Run the tests:

pytest python/ray/tests/spark/test_ray_on_spark.py -s

Testing on Databricks Runtime

Install the Ray package from my PR branch in a Databricks notebook:
https://e2-dogfood.staging.cloud.databricks.com/?o=6051921418418893#notebook/2110795073425386/command/2110795073425387
Then use my testing notebook:
https://e2-dogfood.staging.cloud.databricks.com/?o=6051921418418893#notebook/2110795073420548/command/2110795073420549

Debugging tips

Checking the output logs of Ray processes is painful, because the files are scattered across every Spark cluster worker node. For easier testing, create a Spark cluster with only one worker machine.
By default, the logs are then under the following local disk paths:

"Ray start" script output:

/tmp/ray-logs-{head-port}-XXXX

Other ray processes log output:

/tmp/ray-temp-{head-port}-XXXX/session_latest/logs

A warning message like the following is printed:

You can check ray head / worker starting script logs under local disk path /tmp/ray-logs-50233-070d, and you can check ray processes logs under local disk path /tmp/ray-temp-50233-070d/session_latest/logs.
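
If the paths need to be picked up programmatically, they can be pulled out of that warning message. The following is a minimal sketch that assumes the default /tmp locations and the message wording quoted above; the function name `extract_log_paths` and the dictionary keys are hypothetical.

```python
import re

# Matches the two default paths named in the warning message.
# Assumes the default /tmp prefix and the {head-port}-XXXX suffix format.
_LOG_PATHS = re.compile(
    r"(?P<start_logs>/tmp/ray-logs-\d+-\w+).*?"
    r"(?P<process_logs>/tmp/ray-temp-\d+-\w+/session_latest/logs)"
)


def extract_log_paths(message: str) -> dict:
    """Return the start-script and process log paths found in message."""
    match = _LOG_PATHS.search(message)
    if match is None:
        raise ValueError("no Ray log paths found in message")
    return match.groupdict()
```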

Note that these log paths are local to each Spark cluster node. For non-driver nodes, you therefore have to run a Spark job to collect the files, for example:

# List the files under the session directory on the node that runs the task.
# `sc` is the SparkContext provided by the notebook or Spark shell.
def list_session_dir(_):
    import os
    return os.listdir("/tmp/ray-temp-50233-070d/session_latest")

print(sc.parallelize([1], 1).map(list_session_dir).collect()[0])

# Read a specific log file on the node that runs the task.
def read_raylet_err(_):
    with open("/tmp/ray-temp-50233-070d/session_latest/logs/raylet.1.err") as f:
        return f.read()

print(sc.parallelize([1], 1).map(read_raylet_err).collect()[0])
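
Reading whole log files back to the driver can get large, so a variant that returns only the tail of each file may be handier. The following is a sketch under assumptions: `read_log_tails` and `collect_log_tails` are hypothetical helper names, `sc` is an existing SparkContext, and Spark decides where tasks run, so one task per node is not guaranteed.

```python
import os
import socket


def read_log_tails(log_dir, max_bytes=4096):
    """Return {file name: last max_bytes of text} for regular files in log_dir."""
    tails = {}
    if not os.path.isdir(log_dir):
        return tails
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as f:
            # Seek to at most max_bytes before the end of the file.
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - max_bytes))
            tails[name] = f.read().decode("utf-8", errors="replace")
    return tails


def collect_log_tails(sc, log_dir, num_tasks):
    """Run read_log_tails in num_tasks Spark tasks, keyed by host name.

    Hosts may be missing or visited more than once, because task
    placement is up to the Spark scheduler.
    """
    def mapper(_):
        return socket.gethostname(), read_log_tails(log_dir)

    return dict(sc.parallelize(range(num_tasks), num_tasks).map(mapper).collect())
```

For example, `collect_log_tails(sc, "/tmp/ray-temp-50233-070d/session_latest/logs", 4)` would return a dictionary of host names to log tails.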

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Weichen Xu <[email protected]>
@WeichenXu123
Contributor Author

CC @jjyao Ready for a first-pass review :)

num_spark_tasks,
head_options=None,
worker_options=None,
ray_temp_dir="/tmp/ray/temp",
Contributor Author

Provide a parameter to control the ray_temp_dir path. This is useful when the disk holding the default temp dir does not have enough capacity.
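
One way a caller might choose a value for ray_temp_dir is to check free disk space first. The following is a minimal sketch; the helper `pick_temp_dir` and its parameters are hypothetical and not part of this PR.

```python
import shutil


def pick_temp_dir(candidates, min_free_bytes):
    """Return the first existing directory with at least min_free_bytes free."""
    for path in candidates:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            continue  # path does not exist or is not accessible
        if free >= min_free_bytes:
            return path
    raise RuntimeError("no candidate directory has enough free space")
```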

context.barrier()
task_id = context.partitionId()

# TODO: remove temp dir when ray worker exits.
Contributor Author

@ericl @jjyao

Could you help add an option to the ray start script so that a Ray node deletes its temp directory when the node is killed?

Contributor

I don't think this is something we could offer except as best effort, which suggests it isn't the right mechanism for cleanup. Instead, why not set the temp dir to a known location and remove that in a wrapper script after the Ray worker exits?

Contributor Author

@ericl Yes, that makes sense. I now use a wrapper script that registers a SIGTERM handler, and the handler deletes the temp dir.
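
The wrapper approach can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `make_sigterm_handler` and `run_with_cleanup` are hypothetical names, and the cleanup is best effort because the child process may still be shutting down when the directory is removed.

```python
import shutil
import signal
import subprocess
import sys


def make_sigterm_handler(temp_dir, child=None):
    """Build a signal handler that stops the child and removes temp_dir."""
    def handler(signum, frame):
        if child is not None and child.poll() is None:
            # Ask the child to stop; do not wait for it inside the handler.
            child.terminate()
        shutil.rmtree(temp_dir, ignore_errors=True)
        sys.exit(0)
    return handler


def run_with_cleanup(command, temp_dir):
    """Run command and remove temp_dir on SIGTERM or when the command exits."""
    child = subprocess.Popen(command)
    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, make_sigterm_handler(temp_dir, child))
    try:
        return child.wait()
    finally:
        shutil.rmtree(temp_dir, ignore_errors=True)
```

A SIGKILL cannot be caught, so a directory left behind after a hard kill still needs separate cleanup.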

@ericl ericl merged commit e76ccee into ray-project:master Dec 20, 2022
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
6 participants