Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Defer SIGINT interrupt during task argument deserialization. #30476

Conversation

clarkzinzow
Copy link
Contributor

Importing certain libraries (e.g. Arrow, Pandas, Torch) is not reentrant, and task cancellation is occasionally interrupting the Arrow import triggered via this deserialization add-on during task argument deserialization, which we are then trying to import again when serializing the error. See here for an example failure: https://buildkite.com/ray-project/oss-ci-build-branch/builds/1115#018485e1-df32-480f-9c36-cc898341f0a2

This PR prevents this import reentrancy from happening for the task cancellation case by deferring interrupts until after task argument deserialization finishes, so we can be sure that the serialization-related imports have finished before processing the interrupt.

Related issue number

Closes #30453

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

python/ray/_raylet.pyx Outdated Show resolved Hide resolved

ray._private.worker.global_worker.run_function_on_all_workers(
register_non_reentrant_import_and_delegate_reducer,
)
Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iycheng Using run_function_on_all_workers was the only way I could think of to make this test clean, lmk if you have any other ideas using runtime environment plugins and the like!

@clarkzinzow
Copy link
Contributor Author

@stephanie-wang @rickyyx @rkooo567 Ping for review!

@clarkzinzow clarkzinzow force-pushed the core/fix/defer-sigint-task-arg-deser branch from 3d8f8d8 to 53bf1aa Compare November 22, 2022 17:04
# See https://github.com/ray-project/ray/issues/30453.
# NOTE (Clark): Signal handlers can only be registered on the
# main thread.
with DeferSigint.create_if_main_thread():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should hijack this in ray core, or only in the arrow serialization plugin?

Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generic problem for Ray Core, so I don't think that we'd want to only fix it for the arrow serialization plugin. Other serialization add-ons and dependencies that we import on import ray (e.g. NumPy) may suffer from non-reentrant imports either now or in the future, so if it's easy to solve this generically, I think that we should.

@scv119
Copy link
Contributor

scv119 commented Nov 22, 2022

the fix looks simpler than I thought!

# Monkey patch signal.signal to raise an error if a SIGINT handler is registered
# within the context.
self.orig_signal = signal.signal
signal.signal = self._signal_monkey_patch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how reliable is this monkey patch?
think about it again, an alternative is to defer sending signals when serialization is in progress:
https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/_raylet.pyx?L1210

Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This monkey patch should be pretty reliable unless signal is manually reloaded with importlib in the same process.

I should stress that this isn't technically needed for our current use of the context manager on task argument deserialization, this is just to:

  1. return a nice error to users that are doing something really weird, like registering a signal handler in the deserialization-side of a user-defined type pickle reducer,
  2. guard against future uses of this context manager elsewhere in Ray Core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think about it again, an alternative is to defer sending signals when serialization is in progress:
https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/_raylet.pyx?L1210

Yeah I thought of that, but I thought that deferring the signal to the end of the code block in question would be cleaner than adding another mutex-guarded is_deserializing_args flag and relying on the task cancellation RPC retry.

In my testing, I think that I hit some hangs and silent failures when task cancellation was received before the task ID was set (i.e. with the RPC retry being hit), so I'm worried that there are some undiscovered bugs there. Going to dig a bit further into that to see if I can get a reproduction.

@scv119 scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 24, 2022
@scv119
Copy link
Contributor

scv119 commented Nov 24, 2022

failures look related?

@clarkzinzow
Copy link
Contributor Author

@scv119 Hmm weird, I would have expected that flaky Datasets test to be fixed by this PR, but maybe task cancellation is still hitting a code section that isn't properly handling the interrupt...

@xwjiang2010 xwjiang2010 added the release-blocker P0 Issue that blocks the release label Nov 29, 2022
@clarkzinzow
Copy link
Contributor Author

Datasets build failure determined to be an unrelated task cancellation bug that will be fixed in another PR, and other test failures are confirmed to be unrelated already-flaky tests, so I'm going to merge this.

@clarkzinzow clarkzinzow merged commit 4872416 into ray-project:master Nov 29, 2022
clarkzinzow added a commit to clarkzinzow/ray that referenced this pull request Nov 29, 2022
…ay-project#30476)

Importing certain libraries (e.g. Arrow, Pandas, Torch) is not reentrant, and task cancellation is occasionally interrupting the Arrow import triggered via this deserialization add-on during task argument deserialization, which we are then trying to import again when serializing the error. See here for an example failure: https://buildkite.com/ray-project/oss-ci-build-branch/builds/1115#018485e1-df32-480f-9c36-cc898341f0a2

This PR prevents this import reentrancy from happening for the task cancellation case by deferring interrupts until after task argument deserialization finishes, so we can be sure that the serialization-related imports have finished before processing the interrupt.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…ay-project#30476)

Importing certain libraries (e.g. Arrow, Pandas, Torch) is not reentrant, and task cancellation is occasionally interrupting the Arrow import triggered via this deserialization add-on during task argument deserialization, which we are then trying to import again when serializing the error. See here for an example failure: https://buildkite.com/ray-project/oss-ci-build-branch/builds/1115#018485e1-df32-480f-9c36-cc898341f0a2

This PR prevents this import reentrancy from happening for the task cancellation case by deferring interrupts until after task argument deserialization finishes, so we can be sure that the serialization-related imports have finished before processing the interrupt.

Signed-off-by: Weichen Xu <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…ay-project#30476)

Importing certain libraries (e.g. Arrow, Pandas, Torch) is not reentrant, and task cancellation is occasionally interrupting the Arrow import triggered via this deserialization add-on during task argument deserialization, which we are then trying to import again when serializing the error. See here for an example failure: https://buildkite.com/ray-project/oss-ci-build-branch/builds/1115#018485e1-df32-480f-9c36-cc898341f0a2

This PR prevents this import reentrancy from happening for the task cancellation case by deferring interrupts until after task argument deserialization finishes, so we can be sure that the serialization-related imports have finished before processing the interrupt.

Signed-off-by: tmynn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. release-blocker P0 Issue that blocks the release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Test cancel is flakey
8 participants