Generated LLVM code causes register spills #52933

aravindh-krishnamoorthy · 2024-01-16T20:21:25Z

This issue stems from PR #52438 and concerns the following function on my PC with Windows 11/WSL on Intel hardware (see below). But I suspect that the problem may also apply to other hardware types.

julia> versioninfo()
Julia Version 1.11.0-DEV.1038
Commit 290cfcee73* (2023-12-16 08:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores

julia> function f(x::Tuple, shift::Integer)
           @inline
           j = mod1(shift, length(x))
           ntuple(k -> Base.__safe_getindex(x, k-j+ifelse(k>j,0,length(x))), Val(length(x)))
       end
f (generic function with 1 method)

Consider the LLVM code generated for the following case where the parameter shift is a variable, i.e., shift without constant propagation:

julia> x = Tuple(collect(1:50))
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50)

julia> @code_llvm f(x, 1)

The generated LLVM code has the following structure:

; ifelse + ...
; index 1
%8 = icmp sgt i64 %7, 0
%9 = select i1 %8, i64 10, i64 0
%10 = sub i64 %9, %7
%memcpy_refined_src = getelementptr inbounds [10 x i64], [10 x i64]* %"x::Tuple", i64 0, i64 %10
;
; indices 2 through 49
;
; index 50
%.inv = icmp slt i64 %7, 50
%203 = select i1 %.inv, i64 0, i64 50
%204 = sub i64 49, %7
%205 = add i64 %204, %203
%memcpy_refined_src399 = getelementptr inbounds [50 x i64], [50 x i64]* %"x::Tuple", i64 0, i64 %205

; load
; index 1
%206 = load i64, i64* %memcpy_refined_src, align 8
;
; indices 2 through 49
;
; index 50
%255 = load i64, i64* %memcpy_refined_src399, align 8

; new tuple element / store
; index 1
%"new::Tuple.sroa.0.0..sroa_idx" = getelementptr inbounds [50 x i64], [50 x i64]* %sret_return, i64 0, i64 0
store i64 %206, i64* %"new::Tuple.sroa.0.0..sroa_idx", align 8
;
; indices 2 through 49
;
; index 50
%"new::Tuple.sroa.50.0..sroa_idx449" = getelementptr inbounds [50 x i64], [50 x i64]* %sret_return, i64 0, i64 49
store i64 %255, i64* %"new::Tuple.sroa.50.0..sroa_idx449", align 8

Note that in the ifelse, load, and new/store portions, the instructions for all indices are bunched together. Next, consider the generated native code:

julia> @code_native f(x, 1)

The same structure is also carried forward into assembly (not shown here) despite there being no memory aliases! This significantly increases the register pressure and leads to a large number of register spills and reloads (8 bytes each). The spills and reloads degrade the performance of the function. Furthermore, the effects get worse as the size of the tuple increases.

One can readily see that the register spills and reloads can be completely avoided by having the ifelse, load, and store portions index by index.

This topic was brought up on Julia Discourse where it was suggested that looking into new aliasing annotations for LLVM might be useful, see link for details.

Note 1: The case with constant propagation does not have this issue. The resulting code is excellent.

The LLVM and native code for this case can be obtained as follows:

julia> g(x) = f(x, 1)
g (generic function with 1 method)

julia> @code_llvm g(x)
[...]

julia> @code_native g(x)
[...]

Note 2 ¹: Special solutions are sometimes found for some cases (integer tuples) but not for others.

The LLVM and native code for these case can be obtained as follows:

julia> x = Tuple(collect(1:50))
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50)

julia> g(x,i) = f(x,i)
g (generic function with 1 method)

# LLVM code is Ok but deterministic.
julia> @code_llvm g(x,1)
[...]

julia> x = Tuple(repeat(['a', -1, 1.0], 20))
('a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0)

# LLVM code again has the issue mentioned above.
julia> @code_llvm g(x,1)
[...]

Note: unicode superscripts denote edits.

The text was updated successfully, but these errors were encountered:

gbaraldi · 2024-01-17T19:16:36Z

I'm taking a look at this, but this has a C reproducer https://godbolt.org/z/bGPT4WxPe and GCC does a lot better

aravindh-krishnamoorthy · 2024-01-17T22:01:48Z

I'm taking a look at this, but this has a C reproducer https://godbolt.org/z/bGPT4WxPe and GCC does a lot better

Thank you @gbaraldi for taking this up.

For the C code, I'd just make the single change below. With the change, gcc is still good, but not as good as with the original UB. I feel it only packs the results in XMM registers instead of the register spills and reloads... and might also not be scalable?

struct wtf {
-    int a[20];
+    int a[30];
};

Perhaps there's some hidden dependency that I'm not seeing? :(

gbaraldi · 2024-01-17T22:09:33Z

Yeah, that was a typo, could you try incfreasing it to 50 or 100? Though i'm not surprised if at that size it gets so bad. I opened llvm/llvm-project#78506 upstream

aravindh-krishnamoorthy · 2024-01-19T11:30:19Z

Yeah, that was a typo, could you try incfreasing it to 50 or 100? Though i'm not surprised if at that size it gets so bad. I opened llvm/llvm-project#78506 upstream

Thank for for raising the upstream issue, @gbaraldi. Actually, seems like it takes a lot to break GCC! This is the general process to check the assembly for any N:

Copy the awk script below to rs.awk
Generate the C file for 50, $ awk -v N=50 -f rs.awk rs.awk > rs50.c. For other sizes, please just change -v N=50
Generate the assembly, $ gcc -S -masm=intel -O3 -fverbose-asm rs50.c

rs.awk

$0 ~ /^##.*/ {gsub("^## ",""); print}
$0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=0; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}

## #include <stddef.h>
## struct wtf {
#~     int a[N];
## };
## struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i)
## {
##     struct wtf new;
#~     int idx? = (? + i) % N ;
#~     int val? = b->a[idx?] ;
#~     new.a[?] = val? ;
##     return new;
## }
##

gbaraldi · 2024-01-19T12:49:43Z

I get a

awk: rs.awk: line 2: regular expression compile failed (missing operand)
?
awk: rs.awk: line 2: regular expression compile failed (missing operand)
?

error

aravindh-krishnamoorthy · 2024-01-19T13:02:07Z

I get a

awk: rs.awk: line 2: regular expression compile failed (missing operand)
?
awk: rs.awk: line 2: regular expression compile failed (missing operand)
?

error

Sorry, I think you have a POSIX compatible awk but this one uses extensions from GNU awk. My version is as follows:

$ awk --version
GNU Awk 5.2.1, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.2.0, GNU MP 6.2.1)
Copyright (C) 1989, 1991-2022 Free Software Foundation.

And the POSIX version fails here too:

$ awk -P -v N=10 -f rs.awk rs.awk
awk: rs.awk:2: error: Invalid preceding regular expression: /?/

gbaraldi · 2024-01-19T13:03:32Z

my awk doesn't even have a -- version option 😆 . But gawk worked

aravindh-krishnamoorthy · 2024-01-19T14:47:58Z

my awk doesn't even have a -- version option 😆 . But gawk worked

Sorry, seems like I got the for loop wrong. The correct version is for (i=0; i<N; i++), which starts from $i=0.$

$0 ~ /^##.*/ {gsub("^## ",""); print}
- $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=1; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}
+ $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=0; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}

## #include <stddef.h>
## struct wtf {
#~     int a[N];
## };
## struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i)
## {
##     struct wtf new;
#~     int idx? = (? + i) % N ;
#~     int val? = b->a[idx?] ;
#~     new.a[?] = val? ;
##     return new;
## }
##

I've also corrected it above.

aravindh-krishnamoorthy mentioned this issue Jan 16, 2024

Implement circshift(::Tuple, ::Integer) #52438

Merged

10 tasks

nsajko added performance Must go faster compiler:llvm For issues that relate to LLVM labels Aug 12, 2024

nsajko mentioned this issue Aug 12, 2024

5x slowdown in 1.11.0-rc1 compared to 1.10.4 #55227

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generated LLVM code causes register spills #52933

Generated LLVM code causes register spills #52933

aravindh-krishnamoorthy commented Jan 16, 2024 •

edited

Loading

gbaraldi commented Jan 17, 2024 •

edited

Loading

aravindh-krishnamoorthy commented Jan 17, 2024 •

edited

Loading

gbaraldi commented Jan 17, 2024

aravindh-krishnamoorthy commented Jan 19, 2024 •

edited

Loading

gbaraldi commented Jan 19, 2024

aravindh-krishnamoorthy commented Jan 19, 2024

gbaraldi commented Jan 19, 2024 •

edited

Loading

aravindh-krishnamoorthy commented Jan 19, 2024

Generated LLVM code causes register spills #52933

Generated LLVM code causes register spills #52933

Comments

aravindh-krishnamoorthy commented Jan 16, 2024 • edited Loading

gbaraldi commented Jan 17, 2024 • edited Loading

aravindh-krishnamoorthy commented Jan 17, 2024 • edited Loading

gbaraldi commented Jan 17, 2024

aravindh-krishnamoorthy commented Jan 19, 2024 • edited Loading

gbaraldi commented Jan 19, 2024

aravindh-krishnamoorthy commented Jan 19, 2024

gbaraldi commented Jan 19, 2024 • edited Loading

aravindh-krishnamoorthy commented Jan 19, 2024

aravindh-krishnamoorthy commented Jan 16, 2024 •

edited

Loading

gbaraldi commented Jan 17, 2024 •

edited

Loading

aravindh-krishnamoorthy commented Jan 17, 2024 •

edited

Loading

aravindh-krishnamoorthy commented Jan 19, 2024 •

edited

Loading

gbaraldi commented Jan 19, 2024 •

edited

Loading