Introduce MLIR transform dialect to BladeDISC #787

Open · 16 of 18 tasks

wyzero opened this issue Nov 24, 2022 · 2 comments

wyzero (Collaborator) commented Nov 24, 2022

We'll start to explore using the MLIR transform dialect to do codegen for (fused) compute-intensive patterns. The initial target is to support GEMM codegen on the ARM platform, to address the dynamic-shape problem of the Arm Compute Library (ACL).
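For reference, the payload IR this work targets (and which Step 2 of the plan below lowers to) is a GEMM expressed as linalg on tensors with fully dynamic shapes. A minimal hand-written sketch, illustrative only and not actual BladeDISC output:

```mlir
// A fully dynamic GEMM expressed as linalg on tensors; every dimension is `?`,
// which is exactly the dynamic-shape case that motivates this work.
func.func @gemm(%a: tensor<?x?xf32>, %b: tensor<?x?xf32>,
                %init: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %0 = linalg.matmul ins(%a, %b : tensor<?x?xf32>, tensor<?x?xf32>)
                     outs(%init : tensor<?x?xf32>) -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}
```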

The initial plan is:

  • Step 1, enhance the fusion decision pass. We’ll add a new fusion kind kTransform for the transform-based fusion pattern.
  • Step 2, lower the lmhlo fusion op to linalg on tensor.
  • Step 3, transform the linalg computation to loops using the transform dialect (a schedule sketch follows after this list).
  • Step 4, refine the transformed loops to make them suitable for the BladeDISC runtime.
  • Step 5, add a new pass to the disc pass pipeline to drive the above process.
  • Step 6, weight pre-packing support
    • add the disc_linalg.multi_level_pack op, used for packing.
    • add the transform.disc.cache_read transform op, relying on the disc_linalg.multi_level_pack op.
    • add folding support for disc_linalg.multi_level_pack.
    • lower disc_linalg.multi_level_pack to loops if it cannot be folded.
    • fuse the const weight op into the kTransform fusion pattern, lower it to linalg, and then schedule it.
  • Step 7, assign a default schedule for each kTransform pattern.
  • Step 8, inject schedule selection logic.
  • Step 9, initial model level testing: bert (albert).
  • Step 10, support nt, tn, tt format GEMM.
  • Step 11, support batch matmul
  • Step 12, support GEMM epilogue fusion.
  • Step 13, performance optimization
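As a rough illustration of Step 3 (not the actual BladeDISC schedule), a transform-dialect sequence that tiles the matmul above into loops could look like the sketch below. The op spellings follow the upstream structured transform ops of that period (transform.structured.match / transform.structured.tile), the tile sizes are placeholders, and the disc-specific ops such as transform.disc.cache_read are not shown:

```mlir
// Hedged sketch: drive tiling of a linalg.matmul through the transform dialect.
transform.sequence failures(propagate) {
^bb0(%module: !pdl.operation):
  // Locate the matmul payload op inside the kTransform fusion.
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %module
  // Tile M, N and K into scf.for loops; [6, 16, 1] is only an illustrative choice.
  %tiled, %loops:3 = transform.structured.tile %matmul [6, 16, 1]
  // Further steps (vectorization, pre-packing via disc_linalg.multi_level_pack,
  // bufferization) would be appended here.
  transform.yield
}
```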
Following is some preliminary data:

Tested on g6r, single thread; A, B, and C are fully dynamic (pre-packing is not possible in this case).

| m, n, k | DISC + transform (ms) | DISC + ACL (ms) |
| ------------- | ------------- | ------------- |
| 304, 256, 256 | 1.02 | 1.00 |
| 304, 512, 256 | 2.00 | 2.02 |
| 304, 1024, 256 | 4.10 | 4.00 |
| 304, 1024, 512 | 8.56 | 7.99 |
| 1024, 1024, 1024 | 60.0 | 52.8 |
| 34, 512, 256 | 0.301 | 0.293 |
| 74, 512, 256 | 0.561 | 0.544 |
| 174, 512, 256 | 1.19 | 1.207 |
| 34, 256, 256 | 0.135 | 0.158 |
| 74, 256, 256 | 0.272 | 0.281 |
| 174, 256, 256 | 0.592 | 0.589 |

wyzero commented Dec 29, 2022

End-to-end model tests on Bert Base (TF) and Albert (PyTorch), on g6r, using a single thread. Note that we currently have only one default schedule for all shapes, and this schedule is known to be less performant when n or k is large, so the initial numbers are expected to improve once schedule selection logic is supported.

Bert Base (TF)

| input | TF 2.8 (s) | DISC-ACL (s) | DISC-Transform (s) | speedup (DISC-Transform / DISC-ACL) |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| (1, 128) | 0.742 | 0.638 | 0.656 | 97.3% |
| (2, 128) | 1.41 | 1.24 | 1.27 | 97.6% |
| (4, 128) | 2.85 | 2.36 | 2.55 | 92.5% |
| (8, 128) | 5.84 | 4.68 | 5.07 | 92.3% |
| (16, 128) | 11.9 | 9.55 | 10.2 | 93.6% |

Albert (PyTorch)

| input | TorchScript (s) | OnnxRuntime (s) | DISC-ACL (s) | DISC-Transform (s) |
| ------------- | ------------- | ------------- | ------------- | ------------- |
| (2, 12) | 0.197 | 0.140 | 0.117 | 0.139 |

wyzero commented Mar 23, 2023

Sharing a doc:

https://bladedisc.oss-cn-hangzhou.aliyuncs.com/docs/transform-dialect-based-codegen-in-bladedisc.pdf
