Design Questions, Topics

These are some of the known weaker aspects of the current implementation:

  • PrepareKernel / PrepareMultiKernel assume that shader parameters can be summarized as: N input buffers, 1 output buffer, and optionally 1 custom param type. Is there a way to support more complex param types without adding too much complexity to PrepareKernel? This doesn't come up as often in ML, where everything is basically an array buffer, but it may in other GPGPU use cases (simulation, offline rendering). A rough sketch of one direction follows this list.

  • Currently LaunchMultiKernel dispatches kernels asynchronously; what's the best way to expose synchronization between kernels without introducing too much complexity? One option is sketched below.

  • Should we consolidate PrepareKernel / LaunchKernel into a special case (n = 1 shader) of PrepareMultiKernel and LaunchMultiKernel? I'm leaning towards yes; a wrapper sketch is included below.

  • Builds: CMake + FetchContent + the dependence on Elie Michel's repo is (1) more fragile than I'd like and (2) probably not handling caching and local builds as well as it could. Are there ways to make the build less brittle, faster, and better at caching?

  • Known issue: matmul likely needs a lot more optimization to be competitive.

  • dawn vs. wgpu: wgpuInstanceProcessEvents and deviceLostCallbackInfo both seem to be handled differently between the two. Need to investigate further.

  • Any other suggestions welcome!
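
On the parameter-type question: one possible direction is to describe shader parameters as an ordered list of binding descriptors rather than the fixed N-inputs / 1-output / optional-params shape. The sketch below is hypothetical, not the current gpu.cpp API; `BindingKind`, `BindingDesc`, and this `PrepareKernel` overload are invented names for illustration.

```cpp
// Hypothetical sketch, not the current gpu.cpp API: describe shader
// parameters as an ordered list of binding descriptors instead of the
// fixed "N inputs, 1 output, optional params" shape.
#include <cstddef>
#include <vector>

enum class BindingKind { ReadOnlyStorage, ReadWriteStorage, Uniform };

struct BindingDesc {
  BindingKind kind;   // how the shader declares the resource
  const void* data;   // host-side data to upload (may be nullptr)
  size_t size;        // size in bytes
};

struct ShaderCode;  // stand-ins for the library's real types
struct Kernel;

// One entry point covering inputs, outputs, and uniform params alike;
// the existing PrepareKernel convention could become a wrapper over it.
Kernel PrepareKernel(const ShaderCode& shader,
                     const std::vector<BindingDesc>& bindings);
```

This keeps PrepareKernel's surface small while letting callers express uniform buffers or extra outputs without new overloads for every shape.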
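For the kernel-synchronization question: one low-complexity option is a "wait for the queue" primitive placed between submissions, built on wgpuQueueOnSubmittedWorkDone. This is only a sketch against the callback-style webgpu.h; the exact signature of that call has varied across Dawn and wgpu-native releases (which ties into the dawn vs. wgpu item above), and `WaitForQueue` is a hypothetical helper name, not part of the current API.

```cpp
// Sketch: block the host until all previously submitted work on a
// queue finishes. Assumes the callback-style webgpu.h header; exact
// signatures differ between Dawn and wgpu-native versions.
#include <atomic>

#include <webgpu/webgpu.h>

void WaitForQueue(WGPUInstance instance, WGPUQueue queue) {
  std::atomic<bool> done{false};
  wgpuQueueOnSubmittedWorkDone(
      queue,
      [](WGPUQueueWorkDoneStatus /*status*/, void* userdata) {
        static_cast<std::atomic<bool>*>(userdata)->store(true);
      },
      &done);
  // Pump async callbacks until the work-done callback fires; this is
  // one of the spots where dawn and wgpu currently behave differently.
  while (!done.load()) {
    wgpuInstanceProcessEvents(instance);
  }
}
```

Note that WebGPU already orders work submitted to a single queue, so buffer dependencies between successive dispatches are handled implicitly; a CPU-side wait like this is mainly needed when the host has to read results back or schedule work conditionally.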
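On consolidating the single-kernel entry points: if we go that way, PrepareKernel / LaunchKernel could survive as thin wrappers over the multi-kernel path, roughly like the following. The types and signatures here are stand-ins, not the actual gpu.cpp declarations.

```cpp
// Sketch of the consolidation: keep the single-kernel entry points as
// thin wrappers over the multi-kernel path. Types and signatures are
// stand-ins, not the actual gpu.cpp declarations.
#include <vector>

struct Context {};
struct KernelDesc {};
struct MultiKernel {};

MultiKernel PrepareMultiKernel(Context& ctx,
                               const std::vector<KernelDesc>& descs);
void LaunchMultiKernel(Context& ctx, MultiKernel& mk);

// Single-kernel prepare/launch as the n = 1 special case.
inline MultiKernel PrepareKernel(Context& ctx, const KernelDesc& desc) {
  return PrepareMultiKernel(ctx, {desc});
}

inline void LaunchKernel(Context& ctx, MultiKernel& k) {
  LaunchMultiKernel(ctx, k);
}
```

The main cost is keeping the multi-kernel structures cheap enough that the n = 1 case pays no noticeable overhead.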
