Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ray deadlocks on shutdown after Daft execution #1591

Closed
jaychia opened this issue Nov 10, 2023 · 1 comment · Fixed by #1597
Closed

Ray deadlocks on shutdown after Daft execution #1591

jaychia opened this issue Nov 10, 2023 · 1 comment · Fixed by #1597
Assignees
Labels
bug Something isn't working

Comments

@jaychia
Copy link
Contributor

jaychia commented Nov 10, 2023

Describe the bug

After running some Daft code, shutting down Ray will deadlock.

^CException ignored in: <module 'threading' from '/Users/jaychia/miniconda/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/Users/jaychia/miniconda/lib/python3.10/threading.py", line 1567, in _shutdown
    lock.acquire()
KeyboardInterrupt: 

Looks like there is a thread attempting to acquire a lock somewhere and deadlocking, preventing clean shutdown of the program

To Reproduce

import daft

daft.context.set_runner_ray()

path = "s3://daft-public-data/test_fixtures/parquet-dev/mvp.parquet"
df = daft.read_parquet([path, path])
print("created df")
print(df.show())
@jaychia jaychia added the bug Something isn't working label Nov 10, 2023
@samster25
Copy link
Member

It doesn't look like a deadlock since Cntl-c is able to make it exit. I ran the following script:

import daft
import threading

daft.context.set_runner_ray()

path = "s3://daft-public-data/test_fixtures/parquet-dev/mvp.parquet"
df = daft.read_parquet([path, path])
print("created df")
print(df.show())
print("after show")

import sys, traceback, threading, time
thread_names = {t.ident: t.name for t in threading.enumerate()}

for thread_id, frame in sys._current_frames().items():
    print("Thread %s:" % thread_names.get(thread_id, thread_id))
    traceback.print_stack(frame)
    print()

and got this:

Thread fa2ef093-6f28-468c-b570-73ef99deaa16:
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 995, in _bootstrap
    self._bootstrap_inner()
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sammy/code/Daft/daft/runners/ray_runner.py", line 530, in _run_plan
    readies, _ = ray.wait(
  File "/Users/sammy/code/Daft/venv/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/Users/sammy/code/Daft/venv/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/Users/sammy/code/Daft/venv/lib/python3.11/site-packages/ray/_private/worker.py", line 2732, in wait
    ready_ids, remaining_ids = worker.core_worker.wait(

Thread ray_print_logs:
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 995, in _bootstrap
    self._bootstrap_inner()
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sammy/code/Daft/venv/lib/python3.11/site-packages/ray/_private/worker.py", line 801, in print_logs
    data = subscriber.poll()

Thread ray_listen_error_messages:
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 995, in _bootstrap
    self._bootstrap_inner()
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/[email protected]/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/sammy/code/Daft/venv/lib/python3.11/site-packages/ray/_private/worker.py", line 2022, in listen_error_messages
    _, error_data = worker.gcs_error_subscriber.poll()

Thread MainThread:
  File "/Users/sammy/code/Daft/repro.py", line 19, in <module>
    traceback.print_stack(frame)

It looks like it's our ray runner python background thread that is created by df.show() that causes the hang.

samster25 added a commit that referenced this issue Nov 13, 2023
…1597)

* adds `stop_plan` method on the Ray Scheduler when is called when the
`RayRunner.run_iter` goes out of scope.
* This fixes the issues where we have a hanging thread at the end of
execution
* closes: #1591
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants