[core] [data] Pandas has no attribute 'core' likely due to race with import thread #32435

Closed
ericl opened this issue Feb 10, 2023 · 1 comment · Fixed by #32447 or #33103
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P0 (Issues that should be fixed in short order)

Comments

ericl commented Feb 10, 2023

What happened + What you expected to happen

Running the script below on master quickly produces the following error:

(_map_task pid=465201) 2023-02-10 11:47:42,866	INFO worker.py:772 -- Task failed with retryable exception: TaskID(a4300300247f49faffffffffffffffffffffffff01000000).
(_map_task pid=465201) Traceback (most recent call last):
(_map_task pid=465201)   File "python/ray/_raylet.pyx", line 641, in ray._raylet.execute_dynamic_generator_and_store_task_outputs
(_map_task pid=465201)   File "python/ray/_raylet.pyx", line 2498, in ray._raylet.CoreWorker.store_task_outputs
(_map_task pid=465201)   File "/home/eric/Desktop/ray/python/ray/data/_internal/execution/operators/map_operator.py", line 353, in _map_task
(_map_task pid=465201)     m_out = BlockAccessor.for_block(b_out).get_metadata([], None)
(_map_task pid=465201)   File "/home/eric/Desktop/ray/python/ray/data/block.py", line 379, in for_block
(_map_task pid=465201)     import pandas
(_map_task pid=465201)   File "/home/eric/.local/lib/python3.8/site-packages/pandas/__init__.py", line 135, in <module>
(_map_task pid=465201)     from pandas import api, arrays, errors, io, plotting, testing, tseries
(_map_task pid=465201)   File "/home/eric/.local/lib/python3.8/site-packages/pandas/testing.py", line 6, in <module>
(_map_task pid=465201)     from pandas._testing import (
(_map_task pid=465201)   File "/home/eric/.local/lib/python3.8/site-packages/pandas/_testing/__init__.py", line 979, in <module>
(_map_task pid=465201)     cython_table = pd.core.common._cython_table.items()
(_map_task pid=465201) AttributeError: partially initialized module 'pandas' has no attribute 'core' (most likely due to a circular import)
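The error message itself can be reproduced in isolation, which may help when reading the traceback. The sketch below is a single-threaded illustration only: it does not reproduce the suspected race, and the package `demo_pkg` and the helper `import_partially_initialized` are hypothetical names made up for this example. It relies on the fact that a submodule is bound as an attribute on its parent package only after the submodule's import completes, so code that runs while the package's `__init__` is still executing sees a partially initialized module.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_partially_initialized(pkg_name="demo_pkg"):
    """Build a throwaway package, import it, and return the AttributeError raised."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / pkg_name
        pkg.mkdir()
        # __init__ imports `testing` before `core` has been bound on the package.
        (pkg / "__init__.py").write_text(
            f"from {pkg_name} import testing\nfrom {pkg_name} import core\n"
        )
        (pkg / "core.py").write_text("VALUE = 1\n")
        # `testing` reaches back into the package while it is still initializing.
        (pkg / "testing.py").write_text(
            f"import {pkg_name} as pd\ntable = pd.core.VALUE\n"
        )
        sys.path.insert(0, tmp)
        try:
            importlib.import_module(pkg_name)
        except AttributeError as exc:
            return exc
        finally:
            # Undo the side effects so the helper can be called again.
            sys.path.remove(tmp)
            for name in [m for m in sys.modules if m.split(".")[0] == pkg_name]:
                del sys.modules[name]
    return None


if __name__ == "__main__":
    print(import_partially_initialized())
```

In the issue, the same state appears to be observed from a different thread rather than through a genuine circular import, which would explain why the "circular import" hint in the message is misleading here.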

This is very likely related to the import thread, because the following patch, which delays the first pandas import by one second, fixes the problem:

diff --git a/python/ray/data/block.py b/python/ray/data/block.py
index 3de5992d85..9126d5c282 100644
--- a/python/ray/data/block.py
+++ b/python/ray/data/block.py
@@ -376,6 +376,11 @@ class BlockAccessor(Generic[T]):
     def for_block(block: Block) -> "BlockAccessor[T]":
         """Create a block accessor for the given block."""
         _check_pyarrow_version()
+        global pandas_ok
+        if not pandas_ok:
+            print("DELAYED PANDAS IMPORT")
+            time.sleep(1)
+            pandas_ok = True
         import pandas
         import pyarrow

Versions / Dependencies

master

Reproduction script

RAY_DATASET_USE_STREAMING_EXECUTOR=1 python test.py

import ray

for i in range(1000):
    ray.data.range(100).take(10)
    print(i)

Issue Severity

None

ericl self-assigned this on Feb 10, 2023

ericl commented Feb 10, 2023

Update:

  • Surprisingly, the sleep workaround doesn't work reliably.
  • Neither does acquiring the import thread lock while importing pandas in Datasets.

What does work is moving pandas to a top-level import.
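As a rough illustration of the difference between the two import placements, here is a minimal sketch. It is not the change made in the linked fixes, whose contents are not shown here. `json` stands in for `pandas` so the snippet runs without extra dependencies, and both function names are hypothetical.

```python
# Top-level import: executed once, when this module is first loaded,
# before any function defined below can be called.
import json  # stand-in for `import pandas`


def for_block_lazy(block):
    # Function-level import: the first call triggers the import, and that
    # call may happen while another thread is importing the same package.
    import json as lazy_json
    return lazy_json.dumps(block)


def for_block_top_level(block):
    # Uses the module imported above, which is already fully initialized
    # by the time this function can run.
    return json.dumps(block)


if __name__ == "__main__":
    print(for_block_lazy([1, 2]), for_block_top_level([1, 2]))
```

With the top-level form, the import cost is paid at module load instead of inside the first task that happens to need the module.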
