
(rocm/wmma) GFX1100 - SDXL CLIP compilation fails trying to use a wmma intrinsic that accumulates in f16 #17807

Closed
monorimet opened this issue Jul 3, 2024 · 3 comments
Assignees
Labels
bug 🐞 Something isn't working

Comments

@monorimet
Collaborator

What happened?

iree.compiler.tools.binaries.CompilerToolError: Error invoking IREE compiler tool iree-compile.exe
Error code: 3
Diagnostics:
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16
Please report issues to https://github.com/iree-org/iree/issues and include the crash backtrace.
Stack dump:
0.      Running pass 'CallGraph Pass Manager' on module 'encode_prompts$async_dispatch_8'.
1.      Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@"encode_prompts$async_dispatch_8_batch_matmul_transpose_b_12x64x64x64_f16"'
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16
Exception Code: 0x80000003
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16


Invoked with:
 iree-compile.exe C:\V\iree-build\compiler\bindings\python\iree\compiler\tools\..\_mlir_libs\iree-compile.exe - --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=./shark_tmp/core-reproducer.mlir --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-global-opt-propagate-transposes=true --iree-opt-outer-dim-concat=true --iree-vm-target-truncate-unsupported-floats --iree-llvmgpu-enable-prefetch=true --iree-opt-data-tiling=false --iree-opt-const-eval=false --iree-opt-aggressively-propagate-transposes=true --iree-flow-enable-aggressive-fusion --iree-global-opt-enable-fuse-horizontal-contractions=true --iree-codegen-gpu-native-math-precision=true --iree-codegen-llvmgpu-use-vector-distribution=true --iree-codegen-llvmgpu-enable-transform-dialect-jit=false --iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-global-opt-raise-special-ops, util.func(iree-preprocessing-pad-to-intrinsics)) --iree-codegen-transform-dialect-library=./vmfbs\attention_and_matmul_spec_wmma.mlir

Need more information? Set IREE_SAVE_TEMPS=/some/dir in your environment to save all artifacts and reproducers.

attention_and_matmul_spec_wmma.mlir: https://sharkpublic.blob.core.windows.net/sharkpublic/specs/no_pad/attention_and_matmul_spec_wmma.mlir

clip MLIR:
https://sharkpublic.blob.core.windows.net/sharkpublic/ean/sdxl-turbine/debug/stable_diffusion_xl_base_1_0_bs1_64_fp16_prompt_encoder_rocm.mlir

Steps to reproduce your issue

Set up the latest iree-compiler from the IREE main branch (issue reproduced on version iree-compiler-20240703.943).

Download the artifacts (MLIR files) from the Azure links above.

Run the compile command:

iree-compile --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=./shark_tmp/core-reproducer.mlir --iree-hal-target-backends=rocm --iree-rocm-target-chip=gfx1100 --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-global-opt-propagate-transposes=true --iree-opt-outer-dim-concat=true --iree-vm-target-truncate-unsupported-floats --iree-llvmgpu-enable-prefetch=true --iree-opt-data-tiling=false --iree-opt-const-eval=false --iree-opt-aggressively-propagate-transposes=true --iree-flow-enable-aggressive-fusion --iree-global-opt-enable-fuse-horizontal-contractions=true --iree-codegen-gpu-native-math-precision=true --iree-codegen-llvmgpu-use-vector-distribution=true --iree-codegen-llvmgpu-enable-transform-dialect-jit=false --iree-preprocessing-pass-pipeline='builtin.module(iree-preprocessing-transpose-convolution-pipeline, iree-global-opt-raise-special-ops, util.func(iree-preprocessing-pad-to-intrinsics))' --iree-codegen-transform-dialect-library=./vmfbs/attention_and_matmul_spec_wmma.mlir stable_diffusion_xl_base_1_0_bs1_64_fp16_prompt_encoder_rocm.mlir -o stable_diffusion_xl_base_1_0_bs1_64_fp16_prompt_encoder_rocm_gfx1100.vmfb

See the error above.

What component(s) does this issue relate to?

No response

Version information

First encountered on source build of sdxl_quantized branch (6cc8afe) but reproduced on latest published wheels (20240703.943).

Additional context

No response

@monorimet monorimet added the bug 🐞 Something isn't working label Jul 3, 2024
@monorimet
Collaborator Author

assigned @raikonenfnu for now, let me know if this should be changed

@EllisLambda

I have figured out the problem: accumulating into <8xf16> is only available in wave64 mode; wave32 mode requires <16xf16>. The compiler selects wave32 mode by default for the ROCm backend, but the accumulator-matrix spec that MMAIntrinsic::WMMA_F16_16x16x16_F16 ties to vector.contract only selects the <8xf16> accumulation layout.
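The wave-size dependence described above can be sketched with a small illustrative calculation. This is not IREE code; the function name is mine, and it assumes the gfx11 convention that each lane's f16 accumulator registers use only their low or high halves (selected by opsel), which doubles the nominal vector length:

```python
def wmma_f16_acc_vector_length(wave_size: int) -> int:
    """Per-lane f16 accumulator vector length for a 16x16x16 WMMA tile,
    assuming opsel-style packing where only half of each lane's f16
    accumulator register is used, doubling the vector type."""
    total_elements = 16 * 16                    # 16x16 accumulator tile
    unique_per_lane = total_elements // wave_size
    return 2 * unique_per_lane                  # opsel packing doubles the type

print(wmma_f16_acc_vector_length(32))  # wave32 -> 16, i.e. <16xf16>
print(wmma_f16_acc_vector_length(64))  # wave64 -> 8,  i.e. <8xf16>
```

Under these assumptions, selecting the <8xf16> layout while compiling in wave32 mode hands the backend an intrinsic signature that only exists in wave64 mode, which matches the "Cannot select" failure.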

nirvedhmeshram added a commit that referenced this issue Aug 13, 2024
The existing layout for the intrinsic was for subgroup=64, but we are using subgroup=32, which led to this error (#18060). This PR switches to the correct layout for subgroup=32, and hence fixes
#18060 and
#17807
@nirvedhmeshram
Contributor

Should be fixed with #18206.
