[Core] Defer SIGINT interrupt during task argument deserialization. #30476
Force-pushed from 779a088 to 92ce39c.
ray._private.worker.global_worker.run_function_on_all_workers(
    register_non_reentrant_import_and_delegate_reducer,
)
@iycheng Using run_function_on_all_workers was the only way I could think of to make this test clean. Let me know if you have any other ideas using runtime environment plugins and the like!
@stephanie-wang @rickyyx @rkooo567 Ping for review!
Force-pushed from 3d8f8d8 to 53bf1aa.
… support Python 3.6.
# See https://github.com/ray-project/ray/issues/30453.
# NOTE (Clark): Signal handlers can only be registered on the
# main thread.
with DeferSigint.create_if_main_thread():
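For readers following the note in the snippet above: signal handlers can only be installed from the main thread, so a factory like create_if_main_thread presumably has to fall back to a no-op context elsewhere. The sketch below is an illustration of that idea only, not the code in this PR. The helper name context_if_main_thread and its make_context parameter are hypothetical, and contextlib.nullcontext needs Python 3.7+, so a 3.6-compatible version would need its own no-op context manager.

```python
import contextlib
import threading


def context_if_main_thread(make_context):
    """Hypothetical helper: build the real context manager only on the main thread.

    make_context is a zero-argument callable returning a context manager.
    On any other thread a no-op context is returned, because
    signal.signal() raises ValueError outside the main thread.
    """
    if threading.current_thread() is threading.main_thread():
        return make_context()
    return contextlib.nullcontext()
```

On a worker thread, `with context_if_main_thread(factory):` then simply runs its body without touching any signal handlers.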
Should we hijack this in Ray Core, or only in the Arrow serialization plugin?
This is a generic problem for Ray Core, so I don't think that we'd want to only fix it for the Arrow serialization plugin. Other serialization add-ons and dependencies that we import on import ray (e.g. NumPy) may suffer from non-reentrant imports either now or in the future, so if it's easy to solve this generically, I think that we should.
The fix looks simpler than I thought!
# Monkey patch signal.signal to raise an error if a SIGINT handler is registered
# within the context.
self.orig_signal = signal.signal
signal.signal = self._signal_monkey_patch
How reliable is this monkey patch?
Thinking about it again, an alternative is to defer sending signals while serialization is in progress:
https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/_raylet.pyx?L1210
This monkey patch should be pretty reliable unless signal is manually reloaded with importlib in the same process.
I should stress that this isn't technically needed for our current use of the context manager on task argument deserialization. It's just there to:
- return a nice error to users who are doing something really weird, like registering a signal handler in the deserialization side of a pickle reducer for a user-defined type,
- guard against future uses of this context manager elsewhere in Ray Core.
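To make the guard described above concrete, here is a minimal standalone sketch of patching signal.signal for the duration of a context so that SIGINT handler registration fails loudly. It is not the PR's implementation: the class name GuardSigintRegistration, the choice of ValueError, and the error message are assumptions made for illustration.

```python
import signal


class GuardSigintRegistration:
    """Hypothetical guard: reject SIGINT handler registration inside the context."""

    def __enter__(self):
        # Keep the real function so it can be delegated to and restored later.
        self._orig_signal = signal.signal
        signal.signal = self._patched_signal
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always restore the real signal.signal, even if the body raised.
        signal.signal = self._orig_signal
        return False

    def _patched_signal(self, signum, handler):
        if signum == signal.SIGINT:
            # ValueError is an arbitrary choice for this sketch.
            raise ValueError(
                "Registering a SIGINT handler is not allowed inside this context."
            )
        # Every other signal is passed through to the real function.
        return self._orig_signal(signum, handler)
```

The patch only covers callers that look up signal.signal through the module attribute at call time; code that saved a reference to the original function beforehand would bypass it.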
> Thinking about it again, an alternative is to defer sending signals while serialization is in progress:
> https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/_raylet.pyx?L1210
Yeah, I thought of that, but I thought that deferring the signal to the end of the code block in question would be cleaner than adding another mutex-guarded is_deserializing_args flag and relying on the task cancellation RPC retry.
In my testing, I think that I hit some hangs and silent failures when task cancellation was received before the task ID was set (i.e. with the RPC retry being hit), so I'm worried that there are some undiscovered bugs there. Going to dig a bit further into that to see if I can get a reproduction.
…o support Windows.
Failures look related?
@scv119 Hmm, weird, I would have expected that flaky Datasets test to be fixed by this PR, but maybe task cancellation is still hitting a code section that isn't properly handling the interrupt...
The Datasets build failure was determined to be an unrelated task cancellation bug that will be fixed in another PR, and the other test failures are confirmed to be unrelated, already-flaky tests, so I'm going to merge this.
Importing certain libraries (e.g. Arrow, Pandas, Torch) is not reentrant. Task cancellation occasionally interrupts the Arrow import that this deserialization add-on triggers during task argument deserialization, and we then try to import Arrow again when serializing the error. See here for an example failure: https://buildkite.com/ray-project/oss-ci-build-branch/builds/1115#018485e1-df32-480f-9c36-cc898341f0a2
This PR prevents this import reentrancy for the task cancellation case by deferring interrupts until task argument deserialization finishes, so we can be sure that the serialization-related imports have completed before the interrupt is processed.
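The deferral described above can be sketched as a small context manager that swaps in a recording SIGINT handler and replays the interrupt on exit. This is a simplified illustration under stated assumptions, not the DeferSigint class from this PR: the name DeferSigintSketch and its attributes are made up here, it must be entered on the main thread, and signal.raise_signal requires Python 3.8+.

```python
import signal


class DeferSigintSketch:
    """Hypothetical sketch: hold back SIGINT while the body runs, replay it on exit."""

    def __init__(self):
        self._pending = None       # (signum, frame) of a deferred SIGINT, if any
        self._orig_handler = None  # handler that was active before entering

    def _remember(self, signum, frame):
        # Record the interrupt instead of raising KeyboardInterrupt right away.
        self._pending = (signum, frame)

    def __enter__(self):
        # signal.signal() only works on the main thread.
        self._orig_handler = signal.signal(signal.SIGINT, self._remember)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the previous handler first, then deliver the deferred interrupt.
        signal.signal(signal.SIGINT, self._orig_handler)
        if self._pending is not None:
            if callable(self._orig_handler):
                self._orig_handler(*self._pending)
            elif self._orig_handler == signal.SIG_DFL:
                signal.raise_signal(signal.SIGINT)
        return False
```

With Python's default handler installed, a SIGINT that arrives inside `with DeferSigintSketch():` lets the body (for example, an import) run to completion and then surfaces as a KeyboardInterrupt when the block exits.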
Related issue number
Closes #30453
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.