Code Generation for Matrix Multiplication #653
Conversation
Very good work @resting-dove! A few small things to iron out, but nothing major. You did a really good and thorough job improving our matrix multiplication codegen.
mlir::LowerVectorToLLVMOptions lowerVectorToLLVMOptions;
lowerVectorToLLVMOptions.enableX86Vector(true);
What happens if this runs on an ARM system? You may want to check the target architecture and use lowerVectorToLLVMOptions.enableArmSVE() instead.
As far as I could tell, it didn't actually produce any x86-specific instructions, so I removed it.
bool vectorize{false};
bool tile{false};
bool useFixedTileSizes{false};
int register_size{4 * 4 * 64};
Which register size is this referring to?
It took me a while to really understand what Uday Bondhugula was referring to with this in the blog post. It seems to mean the SIMD vector size times the number of vector registers. I renamed the variable to make this clearer.
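One way to read that quantity can be sketched in plain C++. This is my own illustration, not DAPHNE code; the function names (`vectorElements`, `registerTileElements`) are hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: the "register size" used for tile-size selection is
// the SIMD vector width in elements times the number of vector registers,
// i.e. how many elements the register file can hold at once.
int64_t vectorElements(int64_t vecSizeBits, int64_t elementBits) {
    // Use at least one element, mirroring the --matmul-vec-size-bits option.
    int64_t elems = vecSizeBits / elementBits;
    return elems < 1 ? 1 : elems;
}

int64_t registerTileElements(int64_t numVecRegisters, int64_t vecSizeBits,
                             int64_t elementBits) {
    return numVecRegisters * vectorElements(vecSizeBits, elementBits);
}
```

For example, 16 registers of 256 bits holding 64-bit elements would give 16 * 4 = 64 elements as the register-level tile budget.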
Type elementType, int64_t vec_size) {
auto vec_Type = mlir::VectorType::get({vec_size}, elementType);
// TODO: We need an option to enable smaller vector sizes for the ends of each |
In case this TODO is out of scope for this PR, it may be worth creating an issue with more explanation and reasoning for the suggested change, and referencing it here as TODO(#xxx).
I will create a PR for this.
// TODO: Test why not
// Cannot tile non square matmul.
This would be nice to have, especially for your benchmarks. In case you run into too much trouble fixing this, handle it in the same way as the TODO above.
It currently seems to work for non-square matrices as well. I will do some more testing.
// Cannot lower integer matmul, because we cannot create Memrefs with
// integer type at the moment.
Integer memrefs should generally work, see src/compiler/lowering/EwOpsLowering.cpp. Handle this as already discussed in our meeting, i.e. similarly to the TODO above.
The original problem was threefold: 1. filling the output memref with correctly signed zero integer values, when so far we only used arith::ConstantOps, which are signless in MLIR; 2. similarly, arith::AddIOp only operates on signless types; 3. SumAllOp could not handle integer types for the same reasons.
In EwOps this is handled by casting from signed to signless where necessary. I took this approach and applied it to MatMulLowering and SumAllOpLowering.
// TODO: This assert only fails in debug mode?!
// assert(isValidLoopInterchangePermutation(twiceTiledNest, {2, 0, 1, 4, 7,
Same as the TODO above.
I think this check simply cannot deal with for loops that have more complex step lengths etc., so I removed it. The other checks should ensure we know where our loops are, and therefore that this is a valid interchange.
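For intuition on why interchange is valid here: the three matmul loops only carry a reduction on C[i][j], so any permutation of (i, j, k) computes the same result. A minimal standalone C++ sketch of that property (my own illustration, not the generated code):

```cpp
#include <array>
#include <cassert>

// Sketch (not DAPHNE code): the matmul loop nest carries only a reduction
// on c[i][j], so permuting the (i, j, k) loops does not change the result.
// Here the canonical i-j-k order is compared against k-i-j.
constexpr int N = 4;
using Mat = std::array<std::array<double, N>, N>;

Mat matmulIJK(const Mat &a, const Mat &b) {
    Mat c{};
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Mat matmulKIJ(const Mat &a, const Mat &b) {
    Mat c{};
    for (int k = 0; k < N; ++k)  // interchanged loop order
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}
```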
let dependentDialects = ["vector::VectorDialect", "mlir::LLVM::LLVMDialect", "mlir::AffineDialect",
                         "mlir::memref::MemRefDialect"];
let constructor = "mlir::daphne::createMatMulOpLoweringPass()";
let description = [{
Very good and thorough description!
Thanks!
Thanks again @resting-dove, took a while but the changes are now merged in. Great work improving and testing optimized code generation for matrix multiplication!
Work for the project in "Large Scale Data Engineering" at TU Berlin, WS 23/24. The project is supervised by @philipportner. The issue is #627.
The work mainly concerns an extension to the MatMulLoweringPass of Daphne. Several command line options are introduced:
- --mlir-codegen lowers to the naive structure with three loops.
- --matmul-tile attempts to find tile sizes such that the registers and the L1, L2, and L3 caches are reused more efficiently.
- --matmul-fixed-tile-sizes=4,4 can be used to specify up to 5 tile sizes to be used in a tiling scheme adapted from https://github.com/bondhugula/llvm-project/blob/hop/mlir/docs/HighPerfCodeGen.md.
- --matmul-unroll-factor=2 additionally unrolls the innermost loop in the tiling scheme by up to the specified factor.
- --matmul-unroll-jam-factor=4 additionally unroll-jams the originally innermost two loops of the tiled nest, if the tile sizes divide the loop size.
- --matmul-vec-size-bits=64 attempts to use vector instructions with the specified bitwidth, but at least one element. It is only applied if the resulting vector size divides the matrix size.
- --matmul-num-vec-registers=16 gives the number of vector registers available for the automatic tile size calculation.
Automated tests are added under test/api/cli/codegen.
Evaluations of the options are reported in https://github.com/resting-dove/lde_project.