Revert "Revert "[Dataset] [DataFrame 2/n] Add pandas block format implementation (partial) (#20988) (#21661)" #21894
…lementation (partial) (ray-project#20988) (ray-project#21661)" This reverts commit fa5c167.
@jjyao Is it possible to run nightly tests with a PR build?
@kfstorm Yes, the PR build also generates a wheel that can be used in the nightly tests. Let me know if you don't have the instructions to do that.
Could we also add a test that "pandas" isn't imported when you import ray? This can go into e.g., test_basic.
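One way such a check could look, as a minimal sketch: it starts a fresh interpreter so that modules already loaded in the test process don't mask the result. The helper name `import_pulls_in` and the sentinel exit code are hypothetical, not part of Ray.

```python
import subprocess
import sys

# Hypothetical sentinel exit code meaning "the forbidden module was loaded".
_LOADED = 42


def import_pulls_in(package, forbidden):
    """Return True if importing `package` in a fresh interpreter loads `forbidden`."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        f"sys.exit({_LOADED} if {forbidden!r} in sys.modules else 0)\n"
    )
    result = subprocess.run([sys.executable, "-c", code])
    if result.returncode not in (0, _LOADED):
        # The import itself failed, so the check is inconclusive.
        raise RuntimeError(f"importing {package!r} failed")
    return result.returncode == _LOADED
```

A test along these lines would then assert `not import_pulls_in("ray", "pandas")`.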
//python/ray/data:tests/test_dataset TIMEOUT in 3 out of 3 in 900.0s
python/ray/_private/utils.py
Outdated
def lazy_import(name):
    if name in sys.modules:
        return
    spec = importlib.util.find_spec(name)
If the package doesn't exist (i.e. Pandas isn't installed), this will return None, which will cause the below spec.loader to fail with an AttributeError. We should check to see if this is None here and raise the canonical ModuleNotFoundError if so.
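A minimal sketch of the suggested guard, following the lazy-import recipe in the `importlib` documentation rather than the exact code in this PR. It is single-threaded only and does not address concurrent imports.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return `name` as a lazily loaded module, or raise ModuleNotFoundError."""
    if name in sys.modules:
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None:
        # find_spec returns None for a missing top-level package; raise the
        # canonical error instead of failing later on `spec.loader`.
        raise ModuleNotFoundError(f"No module named '{name}'", name=name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    # With LazyLoader, the real module body runs on first attribute access.
    loader.exec_module(module)
    return module
```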
cc @ericl pretty sure this is the source of the test error
I've fixed lazy import. Let's see the CI results again.
Yeah, that's strange. I'm still debugging.
Status update: This is a multi-threading issue. It can be reproduced with the script below:

import sys
import importlib.util
import threading
import time

def lazy_import(name):
    try:
        return sys.modules[name]
    except KeyError:
        spec = importlib.util.find_spec(name)
        if not spec:
            raise ModuleNotFoundError(f"No module named '{name}'", name=name)
        module = importlib.util.module_from_spec(spec)
        loader = importlib.util.LazyLoader(spec.loader)
        loader.exec_module(module)
        # It's possible that another thread has done `import <name>`.
        # We only update sys.modules if the module is not in it already.
        module2 = sys.modules.setdefault(name, module)
        print(id(module), id(module2), "t1")
        return module2

# t1 performs the lazy import.
t1 = threading.Thread(target=lazy_import, args=("pandas",), name="t1")

# t2 does a regular import right away, racing with t1.
def foo():
    # time.sleep(0.5)
    import pandas
    print(id(pandas), "t2")
    print(pandas.core.frame.DataFrame)

t2 = threading.Thread(target=foo, name="t2")

# t3 does a regular import after a short delay.
def bar():
    time.sleep(0.5)
    import pandas
    print(id(pandas), "t3")
    print(pandas.core.frame.DataFrame)

t3 = threading.Thread(target=bar, name="t3")

for t in [t1, t2, t3]:
    t.start()
for t in [t1, t2, t3]:
    t.join()

Run …
My guess is that when there are multiple threads trying to access any attribute of pandas, the underlying module loader's … This … So I came up with some options:
@ericl @clarkzinzow Any comments or ideas are highly appreciated.
@kfstorm what if we did something like this:
And the in
@ericl It seems that your approach eagerly loads pandas. It will slow down 'import ray'.
@kfstorm , not if you only call that inside PandasBlock methods. It doesn't need to be loaded at the file level.
@ericl Then what's the difference with "import pandas" in every method? Will it be slightly faster?
Yes, it will be faster.
@ericl @jjyao CI and nightly tests passed.
This reverts commit fa5c167.
Why are these changes needed?
This PR adds pandas block format support by implementing PandasRow, PandasBlockBuilder, and PandasBlockAccessor.
Note that sort_and_partition, combine, merge_sorted_blocks, and aggregate_combined_blocks in PandasBlockAccessor redirect to the arrow block format implementation for now. They'll be implemented in a later PR.
Related issue number
#20719
Checks
I've run scripts/format.sh to lint the changes in this PR.