-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework context handling #2346
Rework context handling #2346
Conversation
This makes it cheaper to look up the origin device, even when the context may have been destroyed.
Even after the context has been destroyed.
It doesn't safe much memory. Also test that resetting the context actually frees it.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2346 +/- ##
==========================================
- Coverage 62.09% 60.33% -1.77%
==========================================
Files 155 155
Lines 14965 14926 -39
==========================================
- Hits 9293 9005 -288
- Misses 5672 5921 +249 ☔ View full report in Codecov by Sentry. |
@test @allocated(current_context()) == 0 | ||
@test @allocated(context()) == 0 | ||
@test @allocated(stream()) == 0 | ||
@test @allocated(device()) == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @KristofferC...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Straight to blacklist ;)
Alternatively, maybe I should just get rid of the ability to reset contexts; this functionality isn't even handled properly by NVIDIA's own libraries... |
Or, even better, we could just only support resetting contexts on 12+. That should make it possible to keep compatibility with older drivers, as long as people don't reset the device. |
Alright, CUDA 11.x support is back. Let's merge this once CI is green. |
Problem
CUDA contexts are annoying:
cuCtxCreate
,cuCtxGetCurrent
, etc)All this significantly complicates our ability to determine whether objects are safe to use and/or need to be finalized. Currently, we solve this by using some kind of factory method that is guaranteed to return a unique context object for every session of a context handle (i.e., after handle destruction and recreation, this method returns a different object despite the handle being identical). In combination with targeted invalidation of that object from all known APIs that destroy a context, that makes it possible to automatically determine context validity in all derived objects storing a reference. All this relies on multiple global dictionaries, which are slow and fragile (and have resulted in several issues with use with threads, and in finalizers where we can't take locks to safely access that global dict). It also isn't guaranteed to be correct, especially when cooperating with other software that may call context APIs.
Solution
CUDA 12.0 provides a new driver API,
cuCtxGetId
, which returns a monotonically incrementing identifier that does change when a context is destroyed and re-allocated. This greatly simplifies the design:This makes it possible to demote CuContext to a simple immutable type, and get rid of all context-related global state, improving thread- and finalizer-safety, while making it much cheaper to store context objects in derived resources.
The flip side: CUDA.jl will require a CUDA 12.x-compatible driver. This seems acceptable to me, given the improvements in this PR and the fact that CUDA 12 has been out for quite a while. People relying on CUDA 11.x can always keep using CUDA.jl 5.x. If needed, we can even make additional releases of CUDA.jl 5.x if backport PRs are suggested.
cc @vchuravy