Better error message when device not found #18749
Comments
Bindings/frontend layers could do this if they wanted to, but IREE's runtime library wouldn't because enumerating drivers/devices is very expensive. A frontend could make that choice if it accepts the caveats: loading CUDA into a process that is trying to use HIP, just to enumerate the available CUDA devices, is usually a bad move. If the issue is that the underlying driver can't be loaded at all (no CUDA/HIP implementation found), then we can't enumerate devices, and that needs to be reported separately.
I believe this is happening when trying to initialize a context that is missing executables for devices that are available to it. I'd be happy to do better error reporting at the frontend, even if expensive, but I don't know how to get the union of "you gave us this" and "I have this". Ideas?
If HAL implementations include useful info in Background is that I'm not sure anything we could produce in the lower levels is going to be more useful than what we currently do (dump the executable target as built by the compiler). There's not really such a thing as a "supported executable": it's really a big feature matrix, each embedded executable can sparsely support anything in that space, and we may have multiple embedded executables. E.g. a hardware device/driver may support 3 subarchs but have different extensions on each, and the compiled executables in the vmfb may not exactly match any of them. We can't have the runtime tell the user what compiler flags to use because the runtime shouldn't/can't know about them, and any number of compile flags may map to viable executables at runtime. It's better than SIGILL (what you'd get from a normal native executable), a
A dedicated status would at least give a fighting chance of saying something. As with most things like this, I'm thinking more of field supportability: we're going to get these error messages in bug reports, and being able to redirect/close when we see it is useful. Right now, people are like "is my GPU working?". I need to get this more into "the thing I'm running is incompatible", with bonus points if that has some extra details that let us respond to the issues with a "there's your problem (close issue)".
I'll get that new status out in a sec.
This allows error handling code to detect cases where the program is incompatible with the hosting environment. One day when status payloads are implemented we could attach exactly why in a programmatically accessible way but for now a hosting application can dump the whole supported environment/topology/etc. Progress on #18749.
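As a rough illustration of how a hosting application might use such a status, here is a minimal Python sketch. Everything in it is hypothetical: `StatusCode`, `RuntimeStatusError`, and `explain_failure` are names invented for this example and are not IREE's API, and the target strings are placeholders. It also compares targets by exact name, which is a simplification, since compatibility is really a feature matrix rather than a name match.

```python
import enum


class StatusCode(enum.Enum):
    # Hypothetical codes for illustration only; not IREE's actual enum.
    NOT_FOUND = "NOT_FOUND"
    INCOMPATIBLE = "INCOMPATIBLE"


class RuntimeStatusError(Exception):
    """Hypothetical exception carrying a status code from the runtime."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def explain_failure(error, module_targets, device_targets):
    """Turn a runtime error into a message suited to a bug report."""
    if error.code is not StatusCode.INCOMPATIBLE:
        return str(error)
    # The hosting application opts in to dumping both sides of the mismatch.
    # Assumption: targets are compared by exact name, which is a simplification.
    missing = sorted(set(device_targets) - set(module_targets))
    return (
        f"{error}\n"
        "The program is incompatible with the hosting environment.\n"
        f"  program was compiled for: {sorted(module_targets)}\n"
        f"  environment provides:     {sorted(device_targets)}\n"
        f"  devices with no matching executable: {missing}"
    )


err = RuntimeStatusError(StatusCode.INCOMPATIBLE, "no compatible executable found")
print(explain_failure(err, ["gfx942"], ["gfx1100", "gfx942"]))
```

The point of the sketch is only that a distinct code lets the frontend separate "the thing I'm running is incompatible" from "is my GPU working?" before deciding how much detail to dump.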
Request description
In this issue,
nod-ai/SHARK-Platform#264
I encountered an error message that looked like
It would be very nice if, upon encountering an error like this, IREE could enumerate the available devices and report something like "you requested device X but we only have devices Y, Z, and W. Did you mean to call function f with argument device=Y instead of device=X?"
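A minimal sketch of what such a message could look like, written in Python for a frontend layer. The function name `device_not_found_message` and the device strings are hypothetical, and the list of available devices is assumed to be supplied by the caller, since nothing here enumerates real drivers. The "did you mean" suggestion uses the standard library's `difflib.get_close_matches`.

```python
import difflib


def device_not_found_message(requested, available):
    """Build a 'device not found' message that lists what is available."""
    if not available:
        return f"Device '{requested}' not found and no devices are available."
    listing = ", ".join(sorted(available))
    message = f"Device '{requested}' not found. Available devices: {listing}."
    # Suggest the most similar available name, if one is reasonably close.
    close = difflib.get_close_matches(requested, available, n=1, cutoff=0.6)
    if close:
        message += f" Did you mean device={close[0]}?"
    return message


# Placeholder device names; a real frontend would pass its own list.
print(device_not_found_message("hip://1", ["hip://0", "local-task"]))
```

Keeping the enumeration outside the function leaves the decision to pay for it, and to accept the caveats of loading drivers, with the caller.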
What component(s) does this issue relate to?
Runtime
Additional context
No response