
[LLVMGPU][ROCm] Isel error for %llvm.amdgcn.wmma.f16.16x16x16.f16 for GFX1100. #18060

Closed
EllisLambda opened this issue Jul 31, 2024 · 4 comments · Fixed by #18206
Assignees
nirvedhmeshram
Labels
bug 🐞 (Something isn't working), codegen/hip (ROCm code generation compiler backend)

Comments

@EllisLambda

What happened?

Using sdxl-scripts to compile sdxl-scripts/fp16-model/base_ir/stable_diffusion_xl_base_1_0_64_fp16_prompt_encoder.mlir. iree-compile is configured with the following flags:

--iree-hal-target-backends=rocm
--iree-input-type=torch
--iree-rocm-target-chip=gfx1100
--iree-rocm-bc-dir=/opt/rocm-6.1.2/lib/llvm/lib/clang/17/lib/amdgcn/bitcode
--iree-global-opt-propagate-transposes=true
--iree-opt-outer-dim-concat=true 
--iree-opt-const-eval=false
--iree-llvmgpu-enable-prefetch 
--iree-flow-enable-aggressive-fusion 
--iree-global-opt-enable-fuse-horizontal-contractions=true
--iree-opt-aggressively-propagate-transposes=true 
--iree-codegen-llvmgpu-use-vector-distribution=true
--iree-execution-model=async-external 
--iree-hal-dump-executable-configurations-to=configurations/clip
--iree-hal-dump-executable-sources-to=sources/clip 
--iree-hal-dump-executable-binaries-to=binaries/clip 
--iree-hal-dump-executable-benchmarks-to=benchmarks/clip

The following rocdl dialect operation can be observed in the --mlir-print-ir-before="iree-util-fuse-globals" dump output:

%303 = rocdl.wmma.f16.16x16x16.f16 %270, %302, %20, %0 : (vector<16xf16>, vector<16xf16>, vector<8xf16>, i1) -> vector<8xf16>

During HAL codegen, the LLVM backend crashes with:

LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16
Please report issues to https://github.com/iree-org/iree/issues and include the crash backtrace.
Stack dump:
0.      Running pass 'CallGraph Pass Manager' on module 'encode_prompts$async_dispatch_8'.
1.      Running pass 'AMDGPU DAG->DAG Pattern Instruction Selection' on function '@"encode_prompts$async_dispatch_8_batch_matmul_transpose_b_12x64x64x64_f16"'
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
LLVM ERROR: Cannot select: intrinsic %llvm.amdgcn.wmma.f16.16x16x16.f16

The llvm.amdgcn.wmma.f16.16x16x16.f16.v8f16.v16f16 intrinsic appears in the --print-isel-input output dump.
There is no issue when compiling with --iree-rocm-target-chip=gfx940.

Steps to reproduce your issue

Run compile-clip.sh with the flags listed above; a sketch of the equivalent direct invocation is shown below.
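
Since compile-clip.sh itself is not included in this report, the following is a minimal sketch of the equivalent direct invocation; the input path is the model file named above, and the output file name is illustrative:

# Sketch only: the ROCm bitcode path and dump directories should match your local setup.
iree-compile \
    sdxl-scripts/fp16-model/base_ir/stable_diffusion_xl_base_1_0_64_fp16_prompt_encoder.mlir \
    --iree-hal-target-backends=rocm \
    --iree-input-type=torch \
    --iree-rocm-target-chip=gfx1100 \
    --iree-rocm-bc-dir=/opt/rocm-6.1.2/lib/llvm/lib/clang/17/lib/amdgcn/bitcode \
    --iree-global-opt-propagate-transposes=true \
    --iree-opt-outer-dim-concat=true \
    --iree-opt-const-eval=false \
    --iree-llvmgpu-enable-prefetch \
    --iree-flow-enable-aggressive-fusion \
    --iree-global-opt-enable-fuse-horizontal-contractions=true \
    --iree-opt-aggressively-propagate-transposes=true \
    --iree-codegen-llvmgpu-use-vector-distribution=true \
    --iree-execution-model=async-external \
    --iree-hal-dump-executable-configurations-to=configurations/clip \
    --iree-hal-dump-executable-sources-to=sources/clip \
    --iree-hal-dump-executable-binaries-to=binaries/clip \
    --iree-hal-dump-executable-benchmarks-to=benchmarks/clip \
    -o prompt_encoder_gfx1100.vmfb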

What component(s) does this issue relate to?

MLIR, Compiler

Version information

IREE (https://iree.dev):
IREE compiler version (#17985)
LLVM version 19.0.0git
Optimized build

Additional context

No response

@EllisLambda added the bug 🐞 label on Jul 31, 2024
@kuhar added the codegen/hip label on Jul 31, 2024
@nirvedhmeshram self-assigned this on Aug 12, 2024
@nirvedhmeshram (Contributor) commented on Aug 12, 2024

Here is a reduced repro for anyone interested. Save the following as minimal_repro.mlir:

func.func @gemm_failure(%9 : tensor<12x64x64xf16>, %10 : tensor<12x64x64xf16>) -> tensor<12x64x64xf16>  {   
    %cst = arith.constant 0.000000e+00 : f16    
    %11 = tensor.empty() : tensor<12x64x64xf16>
    %12 = linalg.fill ins(%cst : f16) outs(%11 : tensor<12x64x64xf16>) -> tensor<12x64x64xf16>
    %13 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d2, d3)>, 
        affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} 
        ins(%9, %10 : tensor<12x64x64xf16>, tensor<12x64x64xf16>) outs(%12 : tensor<12x64x64xf16>) {
          ^bb0(%in: f16, %in_0: f16, %out: f16):
            %14 = arith.mulf %in, %in_0 : f16
            %15 = arith.addf %out, %14 : f16
            linalg.yield %15 : f16
          } -> tensor<12x64x64xf16>
    return %13 : tensor<12x64x64xf16>
}

Compilation fails with:

iree-compile minimal_repro.mlir \
    --iree-hal-target-backends=rocm \
    --iree-rocm-target-chip=gfx1100 \
    -o output.vmfb
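
To look at exactly what reaches the AMDGPU backend, one option (a sketch, assuming the --iree-hal-dump-executable-intermediates-to flag available in recent IREE builds) is to dump the per-dispatch LLVM bitcode alongside the failing compile:

# Sketch: dumps the dispatch-level LLVM bitcode so the failing intrinsic can be
# inspected directly; the directory name is illustrative.
iree-compile minimal_repro.mlir \
    --iree-hal-target-backends=rocm \
    --iree-rocm-target-chip=gfx1100 \
    --iree-hal-dump-executable-intermediates-to=intermediates/repro \
    -o output.vmfb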

@nirvedhmeshram (Contributor) commented on Aug 12, 2024

The IR we are generating seems reasonable to me, so I have opened an upstream LLVM issue: llvm/llvm-project#102981

@nirvedhmeshram (Contributor)

I have a lead: this intrinsic is only allowed when we pass -mattr=+wavefrontsize64, but I believe we are using wavefrontsize32 or not passing it correctly. I should be able to solve this tomorrow.
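
One way to check that hypothesis outside of IREE is a sketch like the following, assuming the LLVM IR of the failing dispatch has been saved to repro.ll (a hypothetical file name, e.g. taken from the intermediates dump above):

# Sketch: with +wavefrontsize64 the <8 x half> WMMA variant should select;
# with +wavefrontsize32 the same input should reproduce the "Cannot select" error.
llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -mattr=+wavefrontsize64 repro.ll -o /dev/null
llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -mattr=+wavefrontsize32 repro.ll -o /dev/null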

@EllisLambda (Author) commented on Aug 13, 2024

> I have a lead: this intrinsic is only allowed when we pass -mattr=+wavefrontsize64, but I believe we are using wavefrontsize32 or not passing it correctly. I should be able to solve this tomorrow.

Yeah, I mentioned it in #17807. I have also tried changing the preference to wavefront64, but then a shared memory out-of-bounds error occurs.
