Better error message when device not found #18749

Open
renxida opened this issue Oct 10, 2024 · 5 comments

Labels
enhancement ➕ New feature or request

Comments

@renxida
Contributor

renxida commented Oct 10, 2024

Request description

In nod-ai/SHARK-Platform#264, I encountered an error message that looked like:

ValueError: <vm>:0: NOT_FOUND; HAL device `__device_0` not found or unavailable: #hal.device.target<"hip", {legacy_sync}, [#hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx1100", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<WMMA_F32_16x16x16_F16>, <WMMA_F16_16x16x16_F16>, <WMMA_I32_16x16x16_I8>], subgroup_size_choices = [32, 64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647]>>, ukernels = "none"}>]>; 

It would be very nice if, upon encountering an error like this, IREE could enumerate the available devices and report something like "you requested device X but only devices Y, Z, and W are available; did you mean to call function f with device=Y instead of device=X?"

What component(s) does this issue relate to?

Runtime

Additional context

No response

renxida added the enhancement ➕ New feature or request label Oct 10, 2024
@benvanik
Collaborator

Bindings/frontend layers could do this if they wanted, but IREE's runtime library wouldn't, because enumerating drivers/devices is very expensive. A frontend could make that choice if it accepts the caveats: loading CUDA into a process just to enumerate available CUDA devices when you're trying to use HIP, etc., is usually a bad move. If the issue is that the underlying driver can't be loaded at all (no CUDA/HIP implementation found), then we can't enumerate devices and that needs to be reported separately.
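A minimal sketch of what such a frontend-level check might look like, restricted to the one driver the user actually asked for so no unrelated drivers get loaded. This assumes the iree.runtime Python bindings' query_available_drivers(), get_driver(), and HalDriver.query_available_devices() behave as described here; the describe_available_devices helper is hypothetical, not an existing API.

```python
import iree.runtime as ireert

def describe_available_devices(driver_name: str) -> str:
    """Lists devices for a single driver without loading any other drivers."""
    if driver_name not in ireert.query_available_drivers():
        return f"driver '{driver_name}' is not available in this runtime build"
    driver = ireert.get_driver(driver_name)
    devices = driver.query_available_devices()
    if not devices:
        return f"driver '{driver_name}' loaded but reports no devices"
    return "\n".join(f"  {d}" for d in devices)

# Example: a frontend could append this to the ValueError shown above.
print(describe_available_devices("hip"))
```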

@stellaraccident
Collaborator

I believe this is happening when trying to initialize a context that is missing executables for devices that are available to it. I'd be happy to do better error reporting at the frontend, even if expensive, but I don't know how to get the union of "you gave us this" and "I have this". Ideas?

@benvanik
Collaborator

If HAL implementations include useful info in iree_hal_driver_dump_device_info then the binding/hosting layer could (effectively) do a --dump_devices. We could have a special status for that (IREE_STATUS_INCOMPATIBLE) so that the dump is only emitted when it's a device-not-found issue, and it could include the entire topology/etc.

Background is that I'm not sure anything we could produce in the lower levels would be more useful than what we currently do (dump the executable target as built by the compiler). There's not really such a thing as a "supported executable": it's really a big feature matrix, each embedded executable can sparsely support anything in that space, and we may have multiple embedded executables. E.g. a hardware device/driver may support 3 subarchs but have different extensions on each, and the compiled executables in the vmfb may not exactly match any of them. We can't have the runtime tell the user what compiler flags to use because the runtime shouldn't/can't know about them, and any number of compile flags may map to viable executables at runtime. It's better than SIGILL (what you'd get from a normal native executable), a CUDA_ERROR_NO_BINARY_FOR_GPU code, etc., at least :)

@stellaraccident
Collaborator

A dedicated status would at least give a fighting chance of saying something.

As with most things like this, I'm thinking more of field supportability: we're going to get these error messages in bug reports, and being able to redirect/close them when we see this is useful. Right now people ask "is my GPU working?". I need to get this more toward "the thing I'm running is incompatible", with bonus points if it carries some extra detail that gives us a fighting chance to respond with "there's your problem" (close issue).

@benvanik
Collaborator

I'll get that new status out in a sec.

benvanik added a commit that referenced this issue Oct 10, 2024
This allows error handling code to detect cases where the program is
incompatible with the hosting environment. One day when status payloads
are implemented we could attach exactly why in a programmatically
accessible way but for now a hosting application can dump the whole
supported environment/topology/etc.

Progress on #18749.
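Once that status code is surfaced through the bindings, error handling could key off it directly; until then, a hosting application could approximate the commit's suggestion by matching on the error text and dumping what the runtime can actually see. A rough sketch under those assumptions (the exception types, message matching, and run_with_device_diagnostics wrapper are illustrative, not an existing API):

```python
import iree.runtime as ireert

def run_with_device_diagnostics(invoke, driver_name: str):
    """Runs `invoke()`; on a device-lookup failure, dumps the visible devices."""
    try:
        return invoke()
    except (RuntimeError, ValueError) as e:
        message = str(e)
        # Stopgap: string matching until the bindings expose status codes such
        # as IREE_STATUS_INCOMPATIBLE programmatically.
        if "NOT_FOUND" in message or "not found or unavailable" in message:
            try:
                devices = ireert.get_driver(driver_name).query_available_devices()
                listing = "\n".join(f"  {d}" for d in devices) or "  (none)"
            except Exception as enum_error:
                listing = f"  (could not enumerate: {enum_error})"
            print(f"Requested device unavailable; devices visible to driver "
                  f"'{driver_name}':\n{listing}")
        raise
```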