[core] Deadlock during task cancellation #29739

Closed
stephanie-wang opened this issue Oct 27, 2022 · 0 comments · Fixed by #29763
Assignees: stephanie-wang
Labels: bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order)
Milestone: Core Backlog

Comments

@stephanie-wang (Contributor)

What happened + What you expected to happen

I was running an AIR data ingest example in which the map tasks throw an exception, and the script hung in map_batches. ray stack shows the hang is due to a deadlock in task cancellation:

Thread 32280 (idle): "MainThread"
    syscall (libc-2.31.so)
    absl::lts_20211102::synchronization_internal::Waiter::Wait (ray/_raylet.so)
    AbslInternalPerThreadSemWait_lts_20211102 (ray/_raylet.so)
    absl::lts_20211102::Mutex::Block (ray/_raylet.so)
    absl::lts_20211102::Mutex::LockSlowLoop (ray/_raylet.so)
    absl::lts_20211102::Mutex::LockSlowWithDeadline (ray/_raylet.so)
    absl::lts_20211102::Mutex::LockSlow (ray/_raylet.so)
    ray::core::CoreWorker::GetResourceIDs (ray/_raylet.so)
    resource_ids (ray/_raylet.so)
    get_gpu_ids (ray/_private/worker.py:844)
    wrapper (ray/_private/client_mode_hook.py:105)
    _raylet_task_execution_handler (ray/_raylet.so)
    std::_Function_handler<ray::Status(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, bool*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool), ray::Status (*)(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string, std::string, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, bool*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool)>::_M_invoke (ray/_raylet.so)
    ray::core::CoreWorker::ExecuteTask (ray/_raylet.so)
...
Thread 32460 (idle)
    pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread-2.31.so)
    _raylet_kill_main_task (ray/_raylet.so)
    ray::core::CoreWorker::HandleCancelTask (ray/_raylet.so)
    std::_Function_handler<void (), ray::rpc::ServerCallImpl<ray::rpc::CoreWorkerServiceHandler, ray::rpc::CancelTaskRequest, ray::rpc::CancelTaskReply>::HandleRequest()::{lambda()#1}>::_M_invoke (ray/_raylet.so)
    EventTracker::RecordExecution (ray/_raylet.so)
    std::_Function_handler<void (), instrumented_io_context::post(std::function<void ()>, std::string)::{lambda()#1}>::_M_invoke (ray/_raylet.so)
    boost::asio::detail::completion_handler<std::function<void ()>, boost::asio::io_context::basic_executor_type<std::allocator<void>, (unsigned int)0> >::do_complete (ray/_raylet.so)
    boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
    boost::asio::detail::scheduler::run (ray/_raylet.so)
    boost::asio::io_context::run (ray/_raylet.so)
    ray::core::CoreWorker::RunIOService (ray/_raylet.so)
    execute_native_thread_routine (ray/_raylet.so)
    clone (libc-2.31.so)

ray-stack.txt
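
A rough sketch of the kind of workload described above (hypothetical, not the attached script; the dataset size, UDF, and exact API calls are placeholders): a Datasets map_batches UDF raises, Ray then tries to cancel the sibling map tasks, and with this bug the driver can hang instead of surfacing the exception.

    import ray

    ray.init()

    def failing_udf(batch):
        # Every map task fails, which triggers cancellation of the others.
        raise RuntimeError("boom")

    ds = ray.data.range(1000)
    # Expected behavior: the RuntimeError propagates to the driver.
    # With the deadlock, this call hangs inside map_batches instead.
    ds.map_batches(failing_udf).take(1)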

Versions / Dependencies

3.0dev

Reproduction script

Issue Severity

No response

stephanie-wang added the bug and P0 labels and removed the triage label Oct 27, 2022
stephanie-wang self-assigned this Oct 27, 2022
stephanie-wang added this to the Core Backlog milestone Oct 27, 2022
stephanie-wang added a commit that referenced this issue Oct 27, 2022
ray.cancel is executed by a background worker thread that invokes a Python callback to kill the current task. To avoid killing the wrong task, the worker holds the global CoreWorker lock during this process. However, the background thread also needs to acquire the GIL to call the kill callback, and the GIL is usually held by the main thread, which is executing the normal task code. If the main thread calls a CoreWorker method without first releasing the GIL, it blocks on the CoreWorker lock while holding the GIL, the cancellation thread blocks on the GIL while holding the CoreWorker lock, and the two threads deadlock.
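
As a standalone illustration of that cycle (a minimal sketch, not Ray code; the two threading.Lock objects merely stand in for the GIL and the CoreWorker lock), the lock-order inversion looks like this:

    import threading
    import time

    gil = threading.Lock()               # stand-in for the Python GIL
    core_worker_lock = threading.Lock()  # stand-in for the global CoreWorker lock

    def task_execution_thread():
        # Task code runs with the GIL held, then calls a CoreWorker method
        # (e.g. GetResourceIDs) that needs the CoreWorker lock, without
        # releasing the GIL first.
        with gil:
            time.sleep(0.1)  # give the cancel handler time to grab its lock
            got = core_worker_lock.acquire(timeout=1)
            print("task thread acquired CoreWorker lock:", got)
            if got:
                core_worker_lock.release()

    def cancel_handler_thread():
        # HandleCancelTask takes the CoreWorker lock, then needs the GIL to
        # invoke the Python kill callback: the opposite acquisition order.
        with core_worker_lock:
            got = gil.acquire(timeout=1)
            print("cancel thread acquired GIL:", got)
            if got:
                gil.release()

    t1 = threading.Thread(target=task_execution_thread)
    t2 = threading.Thread(target=cancel_handler_thread)
    t1.start(); t2.start()
    t1.join(); t2.join()
    # Typical output: the cancel thread fails to get the GIL, and the task
    # thread only gets the CoreWorker lock once that timeout releases it.
    # The timeouts are here only so the sketch terminates; with plain blocking
    # acquires the two threads wait on each other forever, which is the hang
    # shown in the ray stack output above.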

To avoid the deadlock, we can either always release the GIL before calling any CoreWorker method, or we can change task cancellation to not hold the CoreWorker lock. I chose the latter for the fix; while it is generally good to release the GIL before calling CoreWorker methods, it's hard to guarantee that we are always doing this. Therefore, modifying task cancellation seems safer.

This PR changes task cancellation to guard against the race condition in the Python code instead of relying on the CoreWorker lock. It does this by recording the current task ID during task execution and then checking it under a dedicated lock during task cancellation. This is guaranteed not to deadlock because we always acquire the GIL before the current-task-ID lock, so the two locks are taken in a consistent order.
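
A minimal sketch of that ordering (the names _current_task_lock, execute_task, etc. are illustrative, not the actual identifiers in the PR): both the executor and the cancellation path run as ordinary Python code, so each already holds the GIL before it touches the task-ID lock, and no GIL/CoreWorker-style cycle can form.

    import threading

    _current_task_lock = threading.Lock()  # hypothetical Python-side lock
    _current_task_id = None

    def execute_task(task_id, run):
        # Runs as normal Python code, i.e. with the GIL already held.
        global _current_task_id
        with _current_task_lock:
            _current_task_id = task_id
        try:
            run()
        finally:
            with _current_task_lock:
                _current_task_id = None

    def cancel_task(task_id, kill_main_task):
        # Also entered from Python with the GIL held; the task-ID lock is
        # always the second lock taken, and a stale cancel request for a
        # task that already finished is simply a no-op.
        with _current_task_lock:
            if _current_task_id == task_id:
                kill_main_task()
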
Related issue number

Closes #29739.

Signed-off-by: Stephanie Wang [email protected]
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022