[core] Deadlock during task cancellation #29739

Closed
stephanie-wang opened this issue Oct 27, 2022 · 0 comments · Fixed by #29763
Assignees: stephanie-wang
Labels: bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order)
Milestone: Core Backlog

Comments

@stephanie-wang (Contributor)

What happened + What you expected to happen

I was running an AIR data ingest example in which the map tasks throw an exception, and the script hung in map_batches. ray stack shows the hang is due to a deadlock in task cancellation:

Thread 32280 (idle): "MainThread"
    syscall (libc-2.31.so)
    absl::lts_20211102::synchronization_internal::Waiter::Wait (ray/_raylet.so)
    AbslInternalPerThreadSemWait_lts_20211102 (ray/_raylet.so)
    absl::lts_20211102::Mutex::Block (ray/_raylet.so)
    absl::lts_20211102::Mutex::LockSlowLoop (ray/_raylet.so)
    absl::lts_20211102::Mutex::LockSlowWithDeadline (ray/_raylet.so)
    absl::lts_20211102::Mutex::LockSlow (ray/_raylet.so)
    ray::core::CoreWorker::GetResourceIDs (ray/_raylet.so)
    resource_ids (ray/_raylet.so)
    get_gpu_ids (ray/_private/worker.py:844)
    wrapper (ray/_private/client_mode_hook.py:105)
    _raylet_task_execution_handler (ray/_raylet.so)
    std::_Function_handler<ray::Status(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string const&, std::string const&, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, bool*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool), ray::Status (*)(ray::rpc::Address const&, ray::rpc::TaskType, std::string, ray::core::RayFunction const&, std::unordered_map<std::string, double, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, double> > > const&, std::vector<std::shared_ptr<ray::RayObject>, std::allocator<std::shared_ptr<ray::RayObject> > > const&, std::vector<ray::rpc::ObjectReference, std::allocator<ray::rpc::ObjectReference> > const&, std::string, std::string, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::vector<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> >, std::allocator<std::pair<ray::ObjectID, std::shared_ptr<ray::RayObject> > > >*, std::shared_ptr<ray::LocalMemoryBuffer>&, bool*, bool*, std::vector<ray::ConcurrencyGroup, std::allocator<ray::ConcurrencyGroup> > const&, std::string, bool)>::_M_invoke (ray/_raylet.so)
    ray::core::CoreWorker::ExecuteTask (ray/_raylet.so)
...
Thread 32460 (idle)
    pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread-2.31.so)
    _raylet_kill_main_task (ray/_raylet.so)
    ray::core::CoreWorker::HandleCancelTask (ray/_raylet.so)
    std::_Function_handler<void (), ray::rpc::ServerCallImpl<ray::rpc::CoreWorkerServiceHandler, ray::rpc::CancelTaskRequest, ray::rpc::CancelTaskReply>::HandleRequest()::{lambda()#1}>::_M_invoke (ray/_raylet.so)
    EventTracker::RecordExecution (ray/_raylet.so)
    std::_Function_handler<void (), instrumented_io_context::post(std::function<void ()>, std::string)::{lambda()#1}>::_M_invoke (ray/_raylet.so)
    boost::asio::detail::completion_handler<std::function<void ()>, boost::asio::io_context::basic_executor_type<std::allocator<void>, (unsigned int)0> >::do_complete (ray/_raylet.so)
    boost::asio::detail::scheduler::do_run_one (ray/_raylet.so)
    boost::asio::detail::scheduler::run (ray/_raylet.so)
    boost::asio::io_context::run (ray/_raylet.so)
    ray::core::CoreWorker::RunIOService (ray/_raylet.so)
    execute_native_thread_routine (ray/_raylet.so)
    clone (libc-2.31.so)

ray-stack.txt
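
A rough sketch of the kind of workload described above (hypothetical, not the attached script; the dataset size, UDF, and exact API calls are placeholders): a Datasets map_batches UDF raises, Ray then tries to cancel the sibling map tasks, and with this bug the driver can hang instead of surfacing the exception.

    import ray

    ray.init()

    def failing_udf(batch):
        # Every map task fails, which triggers cancellation of the others.
        raise RuntimeError("boom")

    ds = ray.data.range(1000)
    # Expected behavior: the RuntimeError propagates to the driver.
    # With the deadlock, this call hangs inside map_batches instead.
    ds.map_batches(failing_udf).take(1)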

Versions / Dependencies

3.0dev

Reproduction script

Issue Severity

No response

stephanie-wang added the bug and P0 labels and removed the triage label Oct 27, 2022
stephanie-wang self-assigned this Oct 27, 2022
stephanie-wang added this to the Core Backlog milestone Oct 27, 2022
stephanie-wang added a commit that referenced this issue Oct 27, 2022
ray.cancel is executed by a background worker thread that invokes a Python callback to kill the current task. To avoid killing the wrong task, the worker holds the global CoreWorker lock during this process. However, the background thread also needs to acquire the GIL to call the kill callback, and the GIL is usually held by the main thread, which is executing the normal task code. If the main thread calls a CoreWorker method without first releasing the GIL, it blocks on the CoreWorker lock while holding the GIL, the cancellation thread blocks on the GIL while holding the CoreWorker lock, and the two threads deadlock.
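
As a standalone illustration of that cycle (a minimal sketch, not Ray code; the two threading.Lock objects merely stand in for the GIL and the CoreWorker lock), the lock-order inversion looks like this:

    import threading
    import time

    gil = threading.Lock()               # stand-in for the Python GIL
    core_worker_lock = threading.Lock()  # stand-in for the global CoreWorker lock

    def task_execution_thread():
        # Task code runs with the GIL held, then calls a CoreWorker method
        # (e.g. GetResourceIDs) that needs the CoreWorker lock, without
        # releasing the GIL first.
        with gil:
            time.sleep(0.1)  # give the cancel handler time to grab its lock
            got = core_worker_lock.acquire(timeout=1)
            print("task thread acquired CoreWorker lock:", got)
            if got:
                core_worker_lock.release()

    def cancel_handler_thread():
        # HandleCancelTask takes the CoreWorker lock, then needs the GIL to
        # invoke the Python kill callback: the opposite acquisition order.
        with core_worker_lock:
            got = gil.acquire(timeout=1)
            print("cancel thread acquired GIL:", got)
            if got:
                gil.release()

    t1 = threading.Thread(target=task_execution_thread)
    t2 = threading.Thread(target=cancel_handler_thread)
    t1.start(); t2.start()
    t1.join(); t2.join()
    # Typical output: the cancel thread fails to get the GIL, and the task
    # thread only gets the CoreWorker lock once that timeout releases it.
    # The timeouts are here only so the sketch terminates; with plain blocking
    # acquires the two threads wait on each other forever, which is the hang
    # shown in the ray stack output above.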

To avoid the deadlock, we can either always release the GIL before calling any CoreWorker method, or we can change task cancellation to not hold the CoreWorker lock. I chose the latter for the fix; while it is generally good to release the GIL before calling CoreWorker methods, it's hard to guarantee that we are always doing this. Therefore, modifying task cancellation seems safer.

This PR changes task cancellation to guard against the race condition in the Python code instead of relying on the CoreWorker lock. It does this by recording the current task ID during task execution and then checking it under a dedicated lock during task cancellation. This is guaranteed not to deadlock because we always acquire the GIL before the current-task-ID lock, so the two locks are taken in a consistent order.
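
A minimal sketch of that ordering (the names _current_task_lock, execute_task, etc. are illustrative, not the actual identifiers in the PR): both the executor and the cancellation path run as ordinary Python code, so each already holds the GIL before it touches the task-ID lock, and no GIL/CoreWorker-style cycle can form.

    import threading

    _current_task_lock = threading.Lock()  # hypothetical Python-side lock
    _current_task_id = None

    def execute_task(task_id, run):
        # Runs as normal Python code, i.e. with the GIL already held.
        global _current_task_id
        with _current_task_lock:
            _current_task_id = task_id
        try:
            run()
        finally:
            with _current_task_lock:
                _current_task_id = None

    def cancel_task(task_id, kill_main_task):
        # Also entered from Python with the GIL held; the task-ID lock is
        # always the second lock taken, and a stale cancel request for a
        # task that already finished is simply a no-op.
        with _current_task_lock:
            if _current_task_id == task_id:
                kill_main_task()
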
Related issue number

Closes #29739.

Signed-off-by: Stephanie Wang [email protected]
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022