Better error message when device not found #18749

Open
renxida opened this issue Oct 10, 2024 · 5 comments

Labels
enhancement ➕ New feature or request

Comments

@renxida
Contributor

renxida commented Oct 10, 2024

Request description

In nod-ai/SHARK-Platform#264, I encountered an error message that looked like:

ValueError: <vm>:0: NOT_FOUND; HAL device `__device_0` not found or unavailable: #hal.device.target<"hip", {legacy_sync}, [#hal.executable.target<"rocm", "rocm-hsaco-fb", {iree.gpu.target = #iree_gpu.target<arch = "gfx1100", features = "", wgp = <compute =  fp64|fp32|fp16|int64|int32|int16|int8, storage =  b64|b32|b16|b8, subgroup =  shuffle|arithmetic, dot =  dp4xi8toi32, mma = [<WMMA_F32_16x16x16_F16>, <WMMA_F16_16x16x16_F16>, <WMMA_I32_16x16x16_I8>], subgroup_size_choices = [32, 64], max_workgroup_sizes = [1024, 1024, 1024], max_thread_count_per_workgroup = 1024, max_workgroup_memory_bytes = 65536, max_workgroup_counts = [2147483647, 2147483647, 2147483647]>>, ukernels = "none"}>]>; 

It would be very nice if, upon encountering an error like this, IREE could enumerate the available devices and report something like "you requested device X but only devices Y, Z, and W are available; did you mean to call function f with device=Y instead of device=X?"

What component(s) does this issue relate to?

Runtime

Additional context

No response

renxida added the enhancement ➕ New feature or request label Oct 10, 2024
@benvanik
Collaborator

Bindings/frontend layers could do this if they wanted, but IREE's runtime library wouldn't, because enumerating drivers/devices is very expensive. A frontend could make that choice if it accepts the caveats: loading CUDA into a process just to enumerate available CUDA devices when you're trying to use HIP, etc., is usually a bad move. If the issue is that the underlying driver can't be loaded at all (no CUDA/HIP implementation found), then we can't enumerate devices and that needs to be reported separately.
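A minimal sketch of what such a frontend-level check might look like, restricted to the one driver the user actually asked for so no unrelated drivers get loaded. This assumes the iree.runtime Python bindings' query_available_drivers(), get_driver(), and HalDriver.query_available_devices() behave as described here; the describe_available_devices helper is hypothetical, not an existing API.

```python
import iree.runtime as ireert

def describe_available_devices(driver_name: str) -> str:
    """Lists devices for a single driver without loading any other drivers."""
    if driver_name not in ireert.query_available_drivers():
        return f"driver '{driver_name}' is not available in this runtime build"
    driver = ireert.get_driver(driver_name)
    devices = driver.query_available_devices()
    if not devices:
        return f"driver '{driver_name}' loaded but reports no devices"
    return "\n".join(f"  {d}" for d in devices)

# Example: a frontend could append this to the ValueError shown above.
print(describe_available_devices("hip"))
```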

@stellaraccident
Collaborator

I believe this is happening when trying to initialize a context that is missing executables for devices that are available to it. I'd be happy to do better error reporting at the frontend, even if expensive, but I don't know how to get the union of "you gave us this" and "I have this". Ideas?

@benvanik
Collaborator

If HAL implementations include useful info in iree_hal_driver_dump_device_info then the binding/hosting layer could (effectively) do a --dump_devices. We could have a special status for that (IREE_STATUS_INCOMPATIBLE) so that the dump is only emitted when it's a device-not-found issue, and it could include the entire topology/etc.

Background is that I'm not sure anything we could produce in the lower levels would be more useful than what we currently do (dump the executable target as built by the compiler). There's not really such a thing as a "supported executable": it's really a big feature matrix, each embedded executable can sparsely support anything in that space, and we may have multiple embedded executables. E.g. a hardware device/driver may support 3 subarchs but have different extensions on each, and the compiled executables in the vmfb may not exactly match any of them. We can't have the runtime tell the user what compiler flags to use because the runtime shouldn't/can't know about them, and any number of compile flags may map to viable executables at runtime. It's better than SIGILL (what you'd get from a normal native executable), a CUDA_ERROR_NO_BINARY_FOR_GPU code, etc., at least :)

@stellaraccident
Collaborator

A dedicated status would at least give a fighting chance of saying something.

As with most things like this, I'm thinking more of field supportability: we're going to get these error messages in bug reports, and being able to redirect/close them when we see this is useful. Right now people ask "is my GPU working?". I need to get this more toward "the thing I'm running is incompatible", with bonus points if it carries some extra detail that gives us a fighting chance to respond with "there's your problem" (close issue).

@benvanik
Collaborator

I'll get that new status out in a sec.

benvanik added a commit that referenced this issue Oct 10, 2024
This allows error handling code to detect cases where the program is
incompatible with the hosting environment. One day when status payloads
are implemented we could attach exactly why in a programmatically
accessible way but for now a hosting application can dump the whole
supported environment/topology/etc.

Progress on #18749.
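Once that status code is surfaced through the bindings, error handling could key off it directly; until then, a hosting application could approximate the commit's suggestion by matching on the error text and dumping what the runtime can actually see. A rough sketch under those assumptions (the exception types, message matching, and run_with_device_diagnostics wrapper are illustrative, not an existing API):

```python
import iree.runtime as ireert

def run_with_device_diagnostics(invoke, driver_name: str):
    """Runs `invoke()`; on a device-lookup failure, dumps the visible devices."""
    try:
        return invoke()
    except (RuntimeError, ValueError) as e:
        message = str(e)
        # Stopgap: string matching until the bindings expose status codes such
        # as IREE_STATUS_INCOMPATIBLE programmatically.
        if "NOT_FOUND" in message or "not found or unavailable" in message:
            try:
                devices = ireert.get_driver(driver_name).query_available_devices()
                listing = "\n".join(f"  {d}" for d in devices) or "  (none)"
            except Exception as enum_error:
                listing = f"  (could not enumerate: {enum_error})"
            print(f"Requested device unavailable; devices visible to driver "
                  f"'{driver_name}':\n{listing}")
        raise
```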