Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix shared provider unload crash #5553

Merged
merged 7 commits into from
Oct 21, 2020
Merged

Conversation

RyanUnderhill
Copy link
Member

Fixed version of original change: #5523 - Fixed a build error with ORT_MINIMAL_BUILD and a build test failure in Python

Short Description: On Linux, when calling dlclose() a library won't always be unloaded immediately (and have its global variables be destroyed). This causes a problem with shared providers, since they hold a KernelRegistry object in a static and they need to call into the core OnnxRuntime code to destroy it.

The fix adds a Shutdown() method to the shared providers and calls this on OrtEnv destruction in the core onnxruntime code. This way these globals are destroyed in a deterministic way before library unloading gets involved.

Long Description:Here's what would happen before this change on Linux:

Customer code dynamically loads onnxruntime.so
Customer code then adds TensorRT provider to session
Onnxruntime dynamically loads onnxruntime_provider_tensorrt.so due to (2)
Customer code destroys all onnxruntime resources and unloads the onnxruntime.so library
onnxruntime global variables are destroyed, one of which unloads onnxruntime_provider_tensorrt.so
onnxruntime fully unloads
onnxruntime_provider_tensorrt.so global variables are destroyed, and try to call into onnxruntime.so code (to release a KernelRegistry object)
Crash!
On Windows, unloading onnxruntime_provider_tensorrt.so causes it to unload immediately, but on Linux it doesn't unload until after onnxruntime.so unloads.

Even if dlclose() unloaded and shutdown the onnxruntime_provider_tensorrt.so library immediately the shutdown order is still undefined and could result in bad behavior. My change makes the shutdown order deterministic so that the OS ordering of global destruction and when the library unloads will work in both cases.

Secondly.. unloading libraries during library unload is bad app behavior on windows (should never FreeLibrary during DllMain). So this is just the right thing to do in general.

My first fix was just changing the code so that we unload libraries on OrtEnv destruction. This isn’t enough to fix the issue as on linux dlclose() doesn’t guarantee when it unloads a library, it just marks it for unload at some later point when it feels like it. So the explicit Shutdown() method is needed in the provider for that.

#5384 - Original customer bug

RyanUnderhill and others added 3 commits October 20, 2020 11:36
* Change shared providers so that they are shutdown before shared library unload
* Move UnloadSharedProviders declaration into a shared header to avoid bugs.
@RyanUnderhill RyanUnderhill requested a review from a team as a code owner October 20, 2020 20:36
void Unload() {
if (handle_) {
if (provider_)
provider_->Shutdown();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So just to confirm - as long as the provider shutdown is invoked (which cleans up the kernel registry) prior to the ORT shared library getting unloaded, the actual order of libraries (EP shared libraries and ORT itself) getting unloaded doesn't matter given that some Linux platforms are not always deterministic about the order of unloading libraries marked for unloading ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the provider's Shutdown() method does anything that needs to be done before library unload (like calling back into onnxruntime). It's hard to prove we're doing everything at the right time, but this gives us a mechanism to do it.

dlclose() is documented as not guaranteeing even unloading the library at all. Just that it can at some point in the future after calling dlclose().

@RyanUnderhill RyanUnderhill reopened this Oct 21, 2020
@RyanUnderhill RyanUnderhill merged commit 82c7a97 into master Oct 21, 2020
@RyanUnderhill RyanUnderhill deleted the ryanunderhill/fix_dlclose branch October 21, 2020 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants