Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare socket file when start ray #3925

Merged
merged 3 commits into from
Feb 2, 2019
Merged

Conversation

guoyuhong
Copy link
Contributor

What do these changes do?

Sometimes, users want to specify their own socket file name. However, there are two cases that they could fail.

  1. The directory specified does not exist. In this case, ray should make the directory.
  2. The socket file exists before they start the cluster, which may due to that we are test failover functionality and send SIGKILL to raylet. In this case, ray should clean the uncleaned file.

For the first case, raylet will crash with:

libc++abi.dylib: terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::  error_info_injector<boost::system::system_error> >: bind: No such file or directory
*** Aborted at 1549001100 (unix time) try "date -d @1549001100" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff6972ab86) received by PID 76085 (TID 0x10cef95c0) stack trace: ***
    @     0x7fff697d5b3d _sigtramp

For the second case, raylet will crash with:

libc++abi.dylib: terminating with uncaught exception of type boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >: bind: Address already in use
*** Aborted at 1549001712 (unix time) try "date -d @1549001712" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7fff6972ab86) received by PID 77036 (TID 0x1161875c0) stack trace: ***
    @     0x7fff697d5b3d _sigtramp

Related issue number

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11390/
Test FAILed.

Copy link
Contributor

@jovany-wang jovany-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

python/ray/node.py Outdated Show resolved Hide resolved
Args:
socket_path (string): the socket file to prepare.
"""
if not os.path.isfile(socket_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not os.path.exist(socket_path) will be more explicit?

test/runtest.py Outdated
@@ -2607,3 +2607,20 @@ def test_pandas_parquet_serialization():
pd.DataFrame({"col1": [0, 1], "col2": [0, 1]}).to_parquet(filename)
# Clean up
shutil.rmtree(tempdir)


def test_socket_directory_none_existant(shutdown_only):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, existent. or maybe shorter test_socket_dir_not_existing

test/runtest.py Outdated
def test_socket_directory_none_existant(shutdown_only):
level1_name = ray.ObjectID(_random_string()).hex()
level2_name = ray.ObjectID(_random_string()).hex()
temp_raylet_socket_dir = "/tmp/{}/{}".format(level1_name, level2_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, let's put the socket under something like /tmp/ray/tests to avoid creating too many files in /tmp

if not os.path.isdir(path):
try_to_create_directory(path)
else:
os.remove(socket_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's a good idea to remove the existing socket file. If a raylet is still running and user starts ray again, this will break the previous raylet.
We can check if this socket file is still used by any processes before removing it. Or maybe simply raise an error if it already exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this function is used in raylet socket and plasma socket, it not easy to connect to it to check whether it is in use. I raise exception here. It is better than nothing. If we don't do this and start cluster using ray start, there will be no error after this command. We need to check it in raylet.err. With this exception, ray start will stop immediately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, raising the error earlier is better.
BTW, we can detect if a socket file is in use with the lsof command. But raising an error is okay as well.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11402/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/11403/
Test PASSed.

@guoyuhong
Copy link
Contributor Author

@robertnishihara The Linux Wheel test fails in all PRs. Shall we change the number of expecting wheels to 4 to unblock the test?

@raulchen raulchen merged commit 54cbb43 into ray-project:master Feb 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants