
[Core] Fix task submission never return when network partition happens #44692

Merged 1 commit (Apr 26, 2024)

Conversation

hongchaodeng (Member) commented Apr 11, 2024

This fixes a problem where the PushTask() gRPC call hangs when a network partition happens. The call hangs because, by default, gRPC sends at most two keepalive ping frames and then sends nothing further if no data frames are in flight. Meanwhile, the worker node took a forced shutdown and never sends a FIN back to the caller, so without ping frames gRPC never discovers the disconnection.
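For illustration, here is a sketch of the standard gRPC channel arguments that control this keepalive behavior (assuming Python's grpcio; the values below are illustrative, not Ray's actual settings):

```python
# Standard gRPC channel arguments controlling HTTP/2 keepalive pings.
# Values are illustrative only; they are not Ray's actual settings.
KEEPALIVE_OPTIONS = [
    # Send a keepalive ping after 1 s of transport inactivity.
    ("grpc.keepalive_time_ms", 1000),
    # Declare the connection dead if a ping goes unacknowledged for 1 s.
    ("grpc.keepalive_timeout_ms", 1000),
    # Keep pinging even when no RPCs are active.
    ("grpc.keepalive_permit_without_calls", 1),
    # 0 removes the default cap of 2 pings sent without data frames --
    # the cap behind the hang described above.
    ("grpc.http2.max_pings_without_data", 0),
]

def make_keepalive_channel(target):
    """Create an insecure channel with keepalive enabled (needs grpcio)."""
    import grpc  # lazy import so the options list is usable without grpcio
    return grpc.insecure_channel(target, options=KEEPALIVE_OPTIONS)
```

Ray surfaces the timeout side of this through settings such as the RAY_grpc_client_keepalive_timeout_ms variable seen later in this test.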

@fishbone fishbone changed the title [Core] Test Network Partition [Core] Fix task submission never return when network partition happens Apr 11, 2024
@hongchaodeng hongchaodeng force-pushed the fix-resubmit branch 6 times, most recently from 06f7e99 to 3964008 Compare April 12, 2024 02:18
@hongchaodeng hongchaodeng changed the title [Core] Fix task submission never return when network partition happens (DON'T MERGE) test network partition happens Apr 12, 2024
@hongchaodeng hongchaodeng force-pushed the fix-resubmit branch 8 times, most recently from 1a1043b to 5c3d99d Compare April 13, 2024 19:02
@hongchaodeng hongchaodeng changed the title (DON'T MERGE) test network partition happens [Core] Fix task submission never return when network partition happens Apr 17, 2024
@hongchaodeng hongchaodeng force-pushed the fix-resubmit branch 3 times, most recently from 4f3fa53 to be06d61 Compare April 17, 2024 22:59
@hongchaodeng hongchaodeng requested a review from jjyao April 18, 2024 00:32
@hongchaodeng hongchaodeng added the core Issues that should be addressed in Ray Core label Apr 18, 2024
@hongchaodeng hongchaodeng self-assigned this Apr 18, 2024
src/ray/rpc/grpc_server.cc (review thread resolved; outdated)
src/ray/rpc/grpc_server.cc (review thread resolved)
Comment on lines 95 to 122
def check_task_not_running():
    # `head` is the head-node container; query task states via the Ray CLI.
    output = head.exec_run(cmd="ray list tasks --format json")
    if output.exit_code == 0:
        tasks_json = json.loads(output.output)
        print("tasks_json:", json.dumps(tasks_json, indent=2))
        return all(task["state"] != "RUNNING" for task in tasks_json)
    return False

def check_task_state(n=0, state="RUNNING"):
    # Check that exactly `n` tasks are in the given state.
    output = head.exec_run(cmd="ray list tasks --format json")
    if output.exit_code == 0:
        tasks_json = json.loads(output.output)
        print("tasks_json:", json.dumps(tasks_json, indent=2))
        return n == sum(task["state"] == state for task in tasks_json)
    return False
Contributor

These could be deduplicated

Member Author

They have some subtle differences.

For example, the first check_task_running() not only checks whether the tasks are running but also that the number of task states equals 2. This ensures the network is stable and there are no task failures.

Refactoring them into a common function would make the code more complicated than simply:

    wait_for_condition(check_task_not_running)

Let's just keep it simple this way :)
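For reference, the deduplication the reviewer suggests could look roughly like this (a hypothetical refactor; count_tasks_in_state is an invented name, and `head` is the head-node container handle used in the snippet above):

```python
import json

def count_tasks_in_state(head, state):
    """Return how many tasks are in `state`, or None if the CLI call failed.

    Hypothetical helper; `head` is assumed to expose the same
    exec_run(cmd=...) API as the container object in the test above.
    """
    output = head.exec_run(cmd="ray list tasks --format json")
    if output.exit_code != 0:
        return None
    tasks = json.loads(output.output)
    return sum(task["state"] == state for task in tasks)
```

As the author notes, the two checkers assert subtly different things, so the test keeps them separate.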


SLEEP_TASK_SCRIPTS = """
import ray
ray.init(address="localhost:6379")
Collaborator

nit: just ray.init() is fine; it will connect to the existing cluster.

Member Author

Fixed

"RAY_grpc_client_keepalive_timeout_ms=1000",
],
)
sleep(3)
Collaborator

No need for this sleep, since you have wait_for_condition later on.

Member Author (Apr 26, 2024)

Previously the gRPC client would send only 2 ping frames when there were no data/header frames to send. The keepalive interval is 1 s, so after 3 s it would stop sending anything, and the test failed before the fix; hence the time.sleep(3).
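wait_for_condition here refers to Ray's test utility that polls a predicate until it holds. A simplified stand-in (the signature below is an assumption for illustration) behaves like this:

```python
import time

def wait_for_condition(condition, timeout=10, retry_interval_ms=100):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Simplified stand-in for Ray's test utility of the same name; the
    exact signature and error type here are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(retry_interval_ms / 1000.0)
    raise RuntimeError(f"Condition not met within {timeout} seconds.")
```

Because the helper retries on its own, a fixed sleep before it is usually redundant unless the test must first let the keepalive timers elapse, as discussed above.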

Comment on lines 76 to 80
sleep(2)
# kill the worker to simulate the spot shutdown
worker.kill()
print("Killed worker")
sleep(2)
Collaborator

why these sleeps?

Member Author

In these steps, we first observe that the node died. With the previous buggy behavior, the tasks would stay in the RUNNING state and hang forever. After the fix, we should observe the tasks transition to the PENDING_NODE_ASSIGNMENT state.

I also added comments explaining this.

@hongchaodeng hongchaodeng force-pushed the fix-resubmit branch 3 times, most recently from e805df9 to fd88d43 Compare April 26, 2024 15:37
# https://docker-py.readthedocs.io/en/stable/networks.html#docker.models.networks.Network.disconnect
network.disconnect(worker.name)
print("Disconnected network")
sleep(2)
Collaborator

Could you explain why we need to sleep 2 seconds here?

Member Author

Because the keepalive interval is set to 1 s and the timeout to 1 s, after 2 s the head node will know the worker is dead. From there we can continue running the checkers in the following code. Note that worker.kill() is a bit unnecessary in this case, though we also want to simulate the shutdown.
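The partition-then-kill sequence discussed in this thread can be sketched with the Docker SDK for Python. The container and network names below are hypothetical, and docker is imported lazily so the sketch can be inspected without the SDK installed:

```python
def partition_and_kill_worker(network_name="ray-net", worker_name="ray-worker"):
    """Simulate a network partition followed by a spot-instance shutdown.

    Hypothetical sketch: after disconnect(), the head node receives no
    further frames from the worker; after kill(), the worker never sends
    a FIN, so only keepalive pings can reveal the dead connection.
    """
    import docker  # lazy import: the SDK is only needed when actually run

    client = docker.from_env()
    network = client.networks.get(network_name)
    worker = client.containers.get(worker_name)
    network.disconnect(worker)  # cut the worker off from the head node
    worker.kill()               # force shutdown; no FIN reaches the caller
```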

hongchaodeng (Member Author)

@jjyao Ready to review. PTAL!

@hongchaodeng hongchaodeng force-pushed the fix-resubmit branch 2 times, most recently from cbbed25 to 80a0a9d Compare April 26, 2024 19:54
@jjyao jjyao merged commit b0a0d34 into ray-project:master Apr 26, 2024
5 checks passed
@hongchaodeng hongchaodeng deleted the fix-resubmit branch April 26, 2024 22:47