Commit

[spark] Refine comment in Starting ray worker spark task (ray-project#47670)

Signed-off-by: Weichen Xu <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
WeichenXu123 authored and ujjawal-khare committed Oct 15, 2024
1 parent 1acc942 commit c241475
Showing 1 changed file with 13 additions and 3 deletions.
16 changes: 13 additions & 3 deletions python/ray/util/spark/cluster_init.py
@@ -1595,11 +1595,21 @@ def ray_cluster_job_mapper(_):
extra_env=ray_worker_node_extra_envs,
)
except Exception as e:
# An exception is raised in the following 2 cases:
# (1)
# Starting the Ray worker node fails; `e` contains the detailed
# subprocess stdout/stderr output.
# (2)
# In autoscaling mode, when a Ray worker node goes down, the
# autoscaler tries to start a new Ray worker node if necessary,
# and it creates a new spark job to launch the Ray worker node
# process. Note the old spark job will reschedule the failed spark
# task and raise an error of "Starting Ray worker node twice with
# the same node id is not allowed".
#
# In either case (1) or (2), to avoid Spark triggering more spark
# task retries, we swallow the exception here to make the spark
# task exit normally.
_logger.warning(f"Ray worker node process exit, reason: {repr(e)}.")

yield 0
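The pattern in this diff can be sketched independently of the Ray/Spark internals: the task body catches any exception from the worker launch, logs it, and still yields a success value, so the task framework sees a normal exit and schedules no retries. A minimal sketch, where `make_task_mapper` and `start_worker` are hypothetical stand-ins (not names from the Ray codebase) for the real launch routine wrapped by `ray_cluster_job_mapper`:

```python
import logging

_logger = logging.getLogger(__name__)


def make_task_mapper(start_worker):
    """Wrap a worker-launch callable so the task never raises.

    `start_worker` is a hypothetical stand-in for the real launch
    routine. Any exception it raises is logged and swallowed so the
    surrounding task framework (Spark, in the original code) observes
    a normal task exit and does not trigger task retries.
    """

    def mapper(_):
        try:
            start_worker()
        except Exception as e:
            # Swallow the exception: letting it propagate would make
            # the task fail and trigger retries, which is undesirable
            # here (see the cases described in the comment above).
            _logger.warning(f"Ray worker node process exit, reason: {repr(e)}.")
        yield 0  # Signal normal completion to the task framework.

    return mapper
```

Usage: even when the launch callable raises, the mapper still yields its normal result:

```python
def boom():
    raise RuntimeError("launch failed")

list(make_task_mapper(boom)(None))  # → [0], no exception propagates
```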
