-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix shared provider unload crash #5553
Conversation
* Change shared providers so that they are shutdown before shared library unload * Move UnloadSharedProviders declaration into a shared header to avoid bugs.
void Unload() { | ||
if (handle_) { | ||
if (provider_) | ||
provider_->Shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So just to confirm - as long as the provider shutdown is invoked (which cleans up the kernel registry) prior to the ORT shared library getting unloaded, the actual order of libraries (EP shared libraries and ORT itself) getting unloaded doesn't matter given that some Linux platforms are not always deterministic about the order of unloading libraries marked for unloading ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the provider's Shutdown() method does anything that needs to be done before library unload (like calling back into onnxruntime). It's hard to prove we're doing everything at the right time, but this gives us a mechanism to do it.
dlclose() is documented as not guaranteeing even unloading the library at all. Just that it can at some point in the future after calling dlclose().
Fixed version of original change: #5523 - Fixed a build error with ORT_MINIMAL_BUILD and a build test failure in Python
Short Description: On Linux, when calling dlclose() a library won't always be unloaded immediately (and have its global variables be destroyed). This causes a problem with shared providers, since they hold a KernelRegistry object in a static and they need to call into the core OnnxRuntime code to destroy it.
The fix adds a Shutdown() method to the shared providers and calls this on OrtEnv destruction in the core onnxruntime code. This way these globals are destroyed in a deterministic way before library unloading gets involved.
Long Description:Here's what would happen before this change on Linux:
Customer code dynamically loads onnxruntime.so
Customer code then adds TensorRT provider to session
Onnxruntime dynamically loads onnxruntime_provider_tensorrt.so due to (2)
Customer code destroys all onnxruntime resources and unloads the onnxruntime.so library
onnxruntime global variables are destroyed, one of which unloads onnxruntime_provider_tensorrt.so
onnxruntime fully unloads
onnxruntime_provider_tensorrt.so global variables are destroyed, and try to call into onnxruntime.so code (to release a KernelRegistry object)
Crash!
On Windows, unloading onnxruntime_provider_tensorrt.so causes it to unload immediately, but on Linux it doesn't unload until after onnxruntime.so unloads.
Even if dlclose() unloaded and shutdown the onnxruntime_provider_tensorrt.so library immediately the shutdown order is still undefined and could result in bad behavior. My change makes the shutdown order deterministic so that the OS ordering of global destruction and when the library unloads will work in both cases.
Secondly.. unloading libraries during library unload is bad app behavior on windows (should never FreeLibrary during DllMain). So this is just the right thing to do in general.
My first fix was just changing the code so that we unload libraries on OrtEnv destruction. This isn’t enough to fix the issue as on linux dlclose() doesn’t guarantee when it unloads a library, it just marks it for unload at some later point when it feels like it. So the explicit Shutdown() method is needed in the provider for that.
#5384 - Original customer bug