Add a compiler pass for iterate(::CartesianIndices) #35074
Conversation
A few line comments for context
```
@@ -400,17 +406,29 @@ function renumber_ir_elements!(body::Vector{Any}, changemap::Vector{Int})
end

function renumber_ir_elements!(body::Vector{Any}, ssachangemap::Vector{Int}, labelchangemap::Vector{Int})
    delta = abs(labelchangemap[1])
```
Some of the changes here and below are cosmetic (like the `i` -> `j` renaming I made to make it easier to interpret `@show` while debugging), but I was a bit worried about the early-exit criterion in this function. Previously it assumed that the sum of all changes wasn't 0; that's valid if you only ever grow the number of lines, but since this might someday be used for deletions I thought it would be better to be a bit more careful.
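To make the worry concrete, here is a minimal sketch (the helper name and change-map convention are illustrative, not this PR's code) of why testing the sum is unsafe once deletions are allowed:

```julia
# Illustrative convention: each entry records how many statements were
# inserted (positive) or deleted (negative) at that position.
# With insertions only, `sum(changemap) == 0` implies "nothing to do";
# with mixed signs, entries can cancel while renumbering is still needed.
no_renumbering_needed(changemap::Vector{Int}) = all(iszero, changemap)

changemap = [2, -2]                 # one insertion, one deletion
sum(changemap) == 0                 # true: a sum-based test would exit early
no_renumbering_needed(changemap)    # false: labels/SSA values must still move
```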
```julia
println(src)
iterator, iteratortype, itervar, bodystart, bodystop, loopexit = extract_iterate(src, id)
@assert iteratortype <: Main.Base.CartesianIndices
# slotname = isa(iterator, Slot) ? Main.Base.string("#", Main.Base.String(src.slotnames[iterator.id])) : Main.Base.String(gensym())
```
Would be nice to be able to give things nice names but currently this causes bootstrap problems.
In general `Main.Base` might not be available during bootstrap?
This only gets called when it is; see the check in `expand_cartesian_loops!(src::CodeInfo)`.
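For illustration, the kind of availability check being referred to might look like the following hypothetical sketch (the real check is in `expand_cartesian_loops!` in this PR; this is not its actual code):

```julia
# Hypothetical sketch: bail out of the pass unless Main.Base (and the
# CartesianIndices type this pass rewrites) actually exist, so none of
# the Main.Base references below can be reached during bootstrap.
function expand_cartesian_loops!(src::Core.CodeInfo)  # just `CodeInfo` inside Core.Compiler
    isdefined(Main, :Base) || return src
    isdefined(Main.Base, :CartesianIndices) || return src
    # ... locate iterate(::CartesianIndices) loops and expand them ...
    return src
end
```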
base/compiler/loops.jl
```julia
src.codelocs = whole.codelocs
src.ssavaluetypes = whole.ssavaluetypes
return src
# newsrc = CodeInfo(whole.stmts, whole.codelocs, whole.ssavaluetypes,
```
I wasn't sure whether it would be better to do in-place modification, but currently it doesn't seem possible to construct a `CodeInfo` from Julia. Obviously fixable via `ccall`s, but I took the lazy way out so far.
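For reference, the `ccall` escape hatch alluded to would presumably look something like this sketch (`jl_new_code_info_uninit` is a real runtime entry point; the field assignments are illustrative, and `whole` is the variable from the snippet above):

```julia
# Sketch: allocate an empty CodeInfo via the runtime, then fill in its
# fields from Julia. This sidesteps the missing Julia-level constructor.
newsrc = ccall(:jl_new_code_info_uninit, Ref{Core.CodeInfo}, ())
newsrc.code          = whole.stmts          # statement vector
newsrc.codelocs      = whole.codelocs       # per-statement line-table indices
newsrc.ssavaluetypes = whole.ssavaluetypes  # inferred types (or statement count)
```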
base/compiler/utilities.jl
```
@@ -83,6 +83,20 @@ end
# MethodInstance/CodeInfo #
###########################

function CodeInfo(stmts::Vector{Any}, codelocs::Vector{Int32}, ssavaluetypes::Vector{Any},
```
Not yet useful.
Nice! Some raw thoughts while reading through this.
- My first reaction was that this should operate on the later IRCode/SSAIR, since that is nicer to work with, but this depends on `iterate` not being inlined yet (see the sketch after this list).
- This effectively elevates `CartesianIndices` to a compiler construct, as well as implementing the iteration protocol for them in the compiler. I feel like we should be honest about that and move `CartesianIndices` into the compiler, or even better take a good look at MLIR's affine maps, which can be used to express cartesian iteration and blocked iteration.
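On the first bullet: a hypothetical sketch of what "depends on `iterate` not being inlined yet" means in practice. The pass has to pattern-match an explicit call in the lowered statements, and inlining dissolves that call into index arithmetic (the helper below is illustrative, not this PR's code):

```julia
# Illustrative: in lowered, pre-inlining code the loop still contains an
# explicit call like `iterate(iter)` or `iterate(iter, state)` that a
# pass can recognize; after inlining, the call (and the pattern) is gone.
function is_iterate_call(stmt)
    stmt isa Expr && stmt.head === :call || return false
    f = stmt.args[1]
    return f isa GlobalRef && f.name === :iterate
end
```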
> For manipulating loops (and especially nested loops), it makes sense to transiently go back to a nested block structure to avoid lots of `insert!` and `deleteat!` shuffling.
I am strongly in favour of that; I think there are lessons to be learned from MLIR, and we are throwing information away too early. Having a structured IR to work with makes working with loops much easier.
My biggest concern here is that one of the biggest draws of Julia for me was always that things are implemented in Julia itself and that there are very few special constructs. A lot of stuff can just be implemented in "userspace", whereas here we introduce a compiler construct and state: no, users can't implement this :/ This is why I have been searching for solutions that are agnostic to the user-level input description and recover these patterns after the fact (admittedly a rather hard problem).
I'm with you much of the way. I'd rather #9080 got fixed as a by-product of some more general transformation. But that hasn't happened, and given that it's been 6 years perhaps it's time to bring out the big guns. Were it not for the emergence of LoopVectorization I would not have submitted this, as just using the …

Another data point: wrapping …

I wasn't familiar with MLIR; perhaps it would fare better, since the affine stuff seems useful in principle. With regards to doing this transformation on IR code, currently one transformation made by …
Yes, I believe much of our infrastructure around …

One challenge with working with MLIR is that if you throw away information you have to recover it, and while it is much easier to write language- and domain-specific optimisations there, recovering the structure is the issue that limits Polly's effectiveness.

I personally would rather see one IR format instead of two ;), but the tools to work with either need more work.
Could the hypothetical MLIR passes be written in Julia, or would they require C++?
If we use MLIR, presumably the IR generation would have to be done in C++. LoopVectorization seems to play in the same space as MLIR (@vchuravy sent me a link to this paper), and these two are not the only ones (the above-mentioned …).

One difference is that MLIR was written by a big Google team, whereas LoopVectorization was written in spare time (AFAIK) by Chris Elrod using Julia 😄. One can react to that in two ways: one rational response is "of course we want to leverage the huge investment made by all those genuinely talented people," and another rational response is "wow, if one amazingly talented person can do the work of a small army using Julia, perhaps we should do more of this in Julia." I think it's too early to tell which would be the more productive path, and I applaud efforts to try both. But I don't think we should be shy about blowing past MLIR if we can really do better writing this stuff in Julia.
LLVM.jl has shown that you can write Julia passes for a C++ infrastructure, but I am not advocating for a wholesale import of MLIR into Julia.

MLIR is building a large infrastructure to solve a bigger problem in compiler design, and is exploring interesting domains like affine loops and the representation of parallelism as it goes along. I think we haven't seen its full potential yet. Julia is a more productive language than C++, but for me LoopVectorization.jl is an attempt at solving a small subset of the problems the MLIR infrastructure does. On the other hand, https://arxiv.org/pdf/2003.00532.pdf is a demonstration of how that infrastructure can be used by one talented engineer to write a flexible and generic loop-scheduling infrastructure for matrix multiply.

Anyway, this has digressed far enough. The questions I would like to see answered for this PR are:
- Does this patch work (or can it be made to work) with other "nested loops" like …
Aren't they in direct tension? While I do hope …
@vchuravy This is yet another digression (sorry, @timholy!), but I'm interested in what you think about …
It's not a wholesale move into the compiler, in the sense that we need to leave the current code there for the interpreter, but agreed with the sentiment of your question.
I don't know if that's needed; letting it lower allows us to catch …
Can be made to work. Does not currently. It would not be a big change.
It's a fine approach. I am uncertain about lowering to … But as you say, this is (?) a bit orthogonal, because the real underlying point here is a tentative step towards a world where we can run a full cost-based analysis on different loop orderings, unrollings, SIMDification, etc. Maybe one could write that with …
I know almost nothing about the compiler internals, but that's my guess too. I think the discussion on …
This now works, and all tests pass locally. Two important changes were (1) when inferring the return type of one-dimensional … (see lines 391 to 402 in a6a2d26), where `R1` and `R2` are `CartesianIndices`.
I think this has gotten a certain amount of high-level discussion, but I'd be particularly grateful for a line-by-line code review. It might also be good to discuss this in the context of #35015, so I'm marking this for triage. (I'd be happy to join in if someone reminds me how.)
This replaces iteration over `CartesianIndices{N}` with a set of `N` nested one-dimensional loops. Fixes #9080. It also paves the way for loop-reordering and other transformations.
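As a source-level picture (the pass itself operates on lowered code, not surface syntax, and the function names here are illustrative), the transformation corresponds to rewriting the first form into the second, sketched for `N = 2`:

```julia
# Before: the generic iteration protocol on CartesianIndices.
function sum_before(A)
    s = zero(eltype(A))
    for I in CartesianIndices(A)
        s += A[I]
    end
    return s
end

# After (conceptually, for N = 2): one scalar loop per dimension, with
# the first (fastest-varying) axis innermost, rebuilding the
# CartesianIndex from the scalar induction variables.
function sum_after(A)
    s = zero(eltype(A))
    ax1, ax2 = axes(A)
    for i2 in ax2, i1 in ax1
        s += A[CartesianIndex(i1, i2)]
    end
    return s
end

A = rand(4, 4)
sum_before(A) == sum_after(A)  # true: identical traversal order
```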
It does SIMD now without …

```julia
julia> function sumcart_manual(A::AbstractMatrix)
           s = zero(eltype(A))
           ax1, ax2 = axes(A)
           for j in ax2, i in ax1
               @fastmath @inbounds s += A[i, j]
           end
           s
       end
sumcart_manual (generic function with 1 method)

julia> function sumcart_iter(A)
           s = zero(eltype(A))
           for I in CartesianIndices(A)
               @fastmath @inbounds s += A[I]
           end
           s
       end
sumcart_iter (generic function with 1 method)
julia> A = rand(4,4);
julia> @btime sumcart_iter($A)
13.953 ns (0 allocations: 0 bytes)
8.557935619365121
julia> @btime sumcart_manual($A)
13.040 ns (0 allocations: 0 bytes)
8.557935619365121
julia> A = rand(400,4);
julia> @btime sumcart_iter($A)
128.912 ns (0 allocations: 0 bytes)
794.9728539317515
julia> @btime sumcart_manual($A)
127.191 ns (0 allocations: 0 bytes)
794.9728539317515
```

```
# julia> @cn sumcart_iter(A)
.text
push r15
push r14
push r12
push rbx
mov r11, qword ptr [rdi + 24]
vpxor xmm0, xmm0, xmm0
test r11, r11
je L338
mov r9, qword ptr [rdi + 32]
test r9, r9
je L338
movabs r10, 9223372036854775806
mov rsi, qword ptr [rdi]
lea rax, [r11 - 1]
lea r8, [r9 - 1]
lea r14, [r10 + 1]
cmp rax, r10
cmovb r14, r11
lea rcx, [rsi + 192]
shl r11, 3
mov rdi, r14
and rdi, -32
lea rax, [r14 + 1]
add rsi, -8
vpxor xmm0, xmm0, xmm0
mov r12d, 1
jmp L146
nop
L112:
lea rdx, [r12 + 1]
lea rbx, [r10 + 1]
cmp r8, r10
cmovb rbx, r9
add rcx, r11
add rsi, r11
cmp r12, rbx
mov r12, rdx
je L338
L146:
cmp r14, 32
jae L176
mov edx, 1
jmp L320
nop word ptr cs:[rax + rax]
L176:
lea r15, [r10 - 30]
and r15, r14
mov rdx, r15
or rdx, 1
vmovq xmm0, xmm0 # xmm0 = xmm0[0],zero
vxorpd xmm1, xmm1, xmm1
xor ebx, ebx
vxorpd xmm2, xmm2, xmm2
vxorpd xmm3, xmm3, xmm3
L208:
vaddpd zmm0, zmm0, zmmword ptr [rcx + 8*rbx - 192]
vaddpd zmm1, zmm1, zmmword ptr [rcx + 8*rbx - 128]
vaddpd zmm2, zmm2, zmmword ptr [rcx + 8*rbx - 64]
vaddpd zmm3, zmm3, zmmword ptr [rcx + 8*rbx]
add rbx, 32
cmp rdi, rbx
jne L208
vaddpd zmm0, zmm1, zmm0
vaddpd zmm0, zmm2, zmm0
vaddpd zmm0, zmm3, zmm0
vextractf64x4 ymm1, zmm0, 1
vaddpd zmm0, zmm0, zmm1
vextractf128 xmm1, ymm0, 1
vaddpd xmm0, xmm0, xmm1
vpermilpd xmm1, xmm0, 1 # xmm1 = xmm0[1,0]
vaddsd xmm0, xmm0, xmm1
cmp r14, r15
je L112
nop word ptr cs:[rax + rax]
L320:
vaddsd xmm0, xmm0, qword ptr [rsi + 8*rdx]
inc rdx
cmp rax, rdx
jne L320
jmp L112
L338:
pop rbx
pop r12
pop r14
pop r15
vzeroupper
ret
nop dword ptr [rax]
```

Note that the SIMD version is slower for the 4x4 case.
This replaces iteration over `CartesianIndices{N}` with a set of `N` nested one-dimensional loops. Proximally, this fixes #9080; bafflingly, it is now faster than the manual iteration (EDIT: in the final version they are the same): … yields …

This makes little sense to me, since the point of the PR is to turn `sumcart_iter` into `sumcart_manual`, but it's been very consistent across runs. Mind you, I'm not exactly complaining.

The more long-term goal is to pave the way for loop-reordering and other transformations. @chriselrod has posted some absolutely mind-bending performance improvements with LoopVectorization, including some I really want in order to make ImageFiltering.jl competitive with OpenCV for operations like blurring. (They've manually written a bunch of hand-SIMDed, loop-unrolled functions. I'd really rather not make our code that ugly, but currently they are just smoking us performance-wise, and that makes me unhappy.) LoopVectorization has devdocs now, and if you're interested you should read them, but briefly it uses a two-stage process: (1) a DSL that operates during macroexpansion of `@avx` to encode the operations of the loop into a format that gets enough info into the type system so that (2) a generated function can then reconstruct enough about the loop using type info alone to set up a cost model for performing the operations, and then choose the loop ordering and vectorization semantics that give the lowest cost. It's very sophisticated, but I've wondered whether it might, in the long run, be easier to do as a compiler pass, where you wouldn't need to go through this two-stage process. So a secondary motivation here was to pave the way and see what it would feel like.

Which comes to the immediate priority: this doesn't yet work. It's awesome when I implement it as an external pass using #35015; that's where those benchmark numbers above come from. `@expandci` is in this gist, and it calls methods defined in this PR. It was also pretty straightforward to write this; the only tricky part was getting the block-jumps (the various `:goto` expressions) pointing to the correct places. However, moving it into Core.Compiler (where one no longer needs `@expandci`) gives me trouble getting through bootstrap; while compiling this method I get an error which starts like this: … The problematic line is here. It doesn't have trouble when I use this as an external pass, even when I use that pass on the same function that's giving trouble. Any ideas?
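To illustrate the block-jump bookkeeping mentioned above, here is a hypothetical sketch (not this PR's code) of the fixup needed when statements are inserted into lowered code:

```julia
# Hypothetical sketch: after inserting `n` statements at position `at`
# in a CodeInfo's statement vector, every jump destination at or beyond
# the insertion point must be shifted by `n`.
function shift_jump_targets!(code::Vector{Any}, at::Int, n::Int)
    for (i, stmt) in enumerate(code)
        if stmt isa Core.GotoNode && stmt.label >= at
            code[i] = Core.GotoNode(stmt.label + n)
        elseif stmt isa Expr && stmt.head === :gotoifnot && stmt.args[2] >= at
            # conditional jumps in lowered code of this era:
            # Expr(:gotoifnot, condition, destination)
            code[i] = Expr(:gotoifnot, stmt.args[1], stmt.args[2] + n)
        end
    end
    return code
end
```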