
(rocm/wmma) GFX1100 - SDXL CLIP compilation fails trying to use a wmma intrinsic that accumulates in f16 #17807

Closed
monorimet opened this issue Jul 3, 2024 · 3 comments
Assignees
Labels
bug 🐞 Something isn't working

Comments

@monorimet
Collaborator

What happened?

iree.compiler.tools.binaries.CompilerToolError: Error invoking IREE compiler tool iree-compile.exe
Error code: 3
Diagnostics:
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16
Please report issues to https://github.com/iree-org/iree/issues and include the crash backtrace.
Stack dump:
0.      Running pass 'CallGraph Pass Manager' on module 'encode_prompts$async_dispatch_8'.
1.      Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@"encode_prompts$async_dispatch_8_batch_matmul_transpose_b_12x64x64x64_f16"'
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16
Exception Code: 0x80000003
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16


Invoked with:
 iree-compile.exe C:\V\iree-build\compiler\bindings\python\iree\compiler\tools\..\_mlir_libs\iree-compile.exe - --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=./shark_tmp/core-reproducer.mlir --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-global-opt-propagate-transposes=true --iree-opt-outer-dim-concat=true --iree-vm-target-truncate-unsupported-floats --iree-llvmgpu-enable-prefetch=true --iree-opt-data-tiling=false --iree-opt-const-eval=false --iree-opt-aggressively-propagate-transposes=true --iree-flow-enable-aggressive-fusion --iree-global-opt-enable-fuse-horizontal-contractions=true --iree-codegen-gpu-native-math-precision=true --iree-codegen-llvmgpu-use-vector-distribution=true --iree-codegen-llvmgpu-enable-transform-dialect-jit=false --iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-global-opt-raise-special-ops, util.func(iree-preprocessing-pad-to-intrinsics)) --iree-codegen-transform-dialect-library=./vmfbs\attention_and_matmul_spec_wmma.mlir

Need more information? Set IREE_SAVE_TEMPS=/some/dir in your environment to save all artifacts and reproducers.

attention_and_matmul_spec_wmma.mlir: https://sharkpublic.blob.core.windows.net/sharkpublic/specs/no_pad/attention_and_matmul_spec_wmma.mlir

clip MLIR:
https://sharkpublic.blob.core.windows.net/sharkpublic/ean/sdxl-turbine/debug/stable_diffusion_xl_base_1_0_bs1_64_fp16_prompt_encoder_rocm.mlir

Steps to reproduce your issue

Set up the latest iree-compiler from the IREE main branch (issue reproduced on version iree-compiler-20240703.943).

Download the artifacts (MLIR files) from the Azure links above.

Run the compile command:

iree-compile --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=./shark_tmp/core-reproducer.mlir --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-global-opt-propagate-transposes=true --iree-opt-outer-dim-concat=true --iree-vm-target-truncate-unsupported-floats --iree-llvmgpu-enable-prefetch=true --iree-opt-data-tiling=false --iree-opt-const-eval=false --iree-opt-aggressively-propagate-transposes=true --iree-flow-enable-aggressive-fusion --iree-global-opt-enable-fuse-horizontal-contractions=true --iree-codegen-gpu-native-math-precision=true --iree-codegen-llvmgpu-use-vector-distribution=true --iree-codegen-llvmgpu-enable-transform-dialect-jit=false --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-global-opt-raise-special-ops, util.func(iree-preprocessing-pad-to-intrinsics))' --iree-codegen-transform-dialect-library=./vmfbs/attention_and_matmul_spec_wmma.mlir stable_diffusion_xl_base_1_0_bs1_64_fp16_prompt_encoder_rocm.mlir -o stable_diffusion_xl_base_1_0_bs1_64_fp16_prompt_encoder_rocm_gfx1100.vmfb

See the error above.

What component(s) does this issue relate to?

No response

Version information

First encountered on source build of sdxl_quantized branch (6cc8afe) but reproduced on latest published wheels (20240703.943).

Additional context

No response

@monorimet monorimet added the bug 🐞 Something isn't working label Jul 3, 2024
@monorimet
Collaborator Author

assigned @raikonenfnu for now, let me know if this should be changed

@EllisLambda

I have figured out the problem: accumulating into <8xf16> is only available in wave64 mode; wave32 mode requires <16xf16>. The compiler selects wave32 mode by default for the ROCm backend, but the accumulator-matrix spec that MMAIntrinsic::WMMA_F16_16x16x16_F16 ties to vector.contract only selects the <8xf16> accumulation layout.
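The wave-size dependence described above can be sketched with a small illustrative calculation. This is not IREE code; the function name is mine, and it assumes the gfx11 convention that each lane's f16 accumulator registers use only their low or high halves (selected by opsel), which doubles the nominal vector length:

```python
def wmma_f16_acc_vector_length(wave_size: int) -> int:
    """Per-lane f16 accumulator vector length for a 16x16x16 WMMA tile,
    assuming opsel-style packing where only half of each lane's f16
    accumulator register is used, doubling the vector type."""
    total_elements = 16 * 16                    # 16x16 accumulator tile
    unique_per_lane = total_elements // wave_size
    return 2 * unique_per_lane                  # opsel packing doubles the type

print(wmma_f16_acc_vector_length(32))  # wave32 -> 16, i.e. <16xf16>
print(wmma_f16_acc_vector_length(64))  # wave64 -> 8,  i.e. <8xf16>
```

Under these assumptions, selecting the <8xf16> layout while compiling in wave32 mode hands the backend an intrinsic signature that only exists in wave64 mode, which matches the "Cannot select" failure.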

nirvedhmeshram added a commit that referenced this issue Aug 13, 2024
The existing layout for the intrinsic was for subgroup=64, but we are using subgroup=32, which led to this error (#18060). This PR switches to the correct layout for subgroup=32, and hence fixes
#18060 and
#17807
@nirvedhmeshram
Contributor

Should be fixed with #18206.
