Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generated LLVM code causes register spills #52933

Open
aravindh-krishnamoorthy opened this issue Jan 16, 2024 · 8 comments
Open

Generated LLVM code causes register spills #52933

aravindh-krishnamoorthy opened this issue Jan 16, 2024 · 8 comments
Labels
compiler:llvm For issues that relate to LLVM performance Must go faster

Comments

@aravindh-krishnamoorthy
Copy link
Contributor

aravindh-krishnamoorthy commented Jan 16, 2024

This issue stems from PR #52438 and concerns the following function on my PC with Windows 11/WSL on Intel hardware (see below). But I suspect that the problem may also apply to other hardware types.

julia> versioninfo()
Julia Version 1.11.0-DEV.1038
Commit 290cfcee73* (2023-12-16 08:49 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 1 on 8 virtual cores
julia> function f(x::Tuple, shift::Integer)
           @inline
           j = mod1(shift, length(x))
           ntuple(k -> Base.__safe_getindex(x, k-j+ifelse(k>j,0,length(x))), Val(length(x)))
       end
f (generic function with 1 method)

Consider the LLVM code generated for the following case where the parameter shift is a variable, i.e., shift without constant propagation:

julia> x = Tuple(collect(1:50))
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50)

julia> @code_llvm f(x, 1)

The generated LLVM code has the following structure:

; ifelse + ...
; index 1
%8 = icmp sgt i64 %7, 0
%9 = select i1 %8, i64 10, i64 0
%10 = sub i64 %9, %7
%memcpy_refined_src = getelementptr inbounds [10 x i64], [10 x i64]* %"x::Tuple", i64 0, i64 %10
;
; indices 2 through 49
;
; index 50
%.inv = icmp slt i64 %7, 50
%203 = select i1 %.inv, i64 0, i64 50
%204 = sub i64 49, %7
%205 = add i64 %204, %203
%memcpy_refined_src399 = getelementptr inbounds [50 x i64], [50 x i64]* %"x::Tuple", i64 0, i64 %205

; load
; index 1
%206 = load i64, i64* %memcpy_refined_src, align 8
;
; indices 2 through 49
;
; index 50
%255 = load i64, i64* %memcpy_refined_src399, align 8

; new tuple element / store
; index 1
%"new::Tuple.sroa.0.0..sroa_idx" = getelementptr inbounds [50 x i64], [50 x i64]* %sret_return, i64 0, i64 0
store i64 %206, i64* %"new::Tuple.sroa.0.0..sroa_idx", align 8
;
; indices 2 through 49
;
; index 50
%"new::Tuple.sroa.50.0..sroa_idx449" = getelementptr inbounds [50 x i64], [50 x i64]* %sret_return, i64 0, i64 49
store i64 %255, i64* %"new::Tuple.sroa.50.0..sroa_idx449", align 8

Note that in the ifelse, load, and new/store portions, the instructions for all indices are bunched together. Next, consider the generated native code:

julia> @code_native f(x, 1)

The same structure is also carried forward into assembly (not shown here) despite there being no memory aliases! This significantly increases the register pressure and leads to a large number of register spills and reloads (8 bytes each). The spills and reloads degrade the performance of the function. Furthermore, the effects get worse as the size of the tuple increases.

  • One can readily see that the register spills and reloads can be completely avoided by having the ifelse, load, and store portions index by index.

This topic was brought up on Julia Discourse where it was suggested that looking into new aliasing annotations for LLVM might be useful, see link for details.

Note 1: The case with constant propagation does not have this issue. The resulting code is excellent.

The LLVM and native code for this case can be obtained as follows:

julia> g(x) = f(x, 1)
g (generic function with 1 method)

julia> @code_llvm g(x)
[...]

julia> @code_native g(x)
[...]

Note 2 ¹: Special solutions are sometimes found for some cases (integer tuples) but not for others.

The LLVM and native code for these case can be obtained as follows:

julia> x = Tuple(collect(1:50))
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50)

julia> g(x,i) = f(x,i)
g (generic function with 1 method)

# LLVM code is Ok but deterministic.
julia> @code_llvm g(x,1)
[...]

julia> x = Tuple(repeat(['a', -1, 1.0], 20))
('a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0, 'a', -1, 1.0)

# LLVM code again has the issue mentioned above.
julia> @code_llvm g(x,1)
[...]

Note: unicode superscripts denote edits.

@gbaraldi
Copy link
Member

gbaraldi commented Jan 17, 2024

I'm taking a look at this, but this has a C reproducer https://godbolt.org/z/bGPT4WxPe and GCC does a lot better

@aravindh-krishnamoorthy
Copy link
Contributor Author

aravindh-krishnamoorthy commented Jan 17, 2024

I'm taking a look at this, but this has a C reproducer https://godbolt.org/z/bGPT4WxPe and GCC does a lot better

Thank you @gbaraldi for taking this up.

For the C code, I'd just make the single change below. With the change, gcc is still good, but not as good as with the original UB. I feel it only packs the results in XMM registers instead of the register spills and reloads... and might also not be scalable?

struct wtf {
-    int a[20];
+    int a[30];
};

Perhaps there's some hidden dependency that I'm not seeing? :(

@gbaraldi
Copy link
Member

Yeah, that was a typo, could you try incfreasing it to 50 or 100? Though i'm not surprised if at that size it gets so bad. I opened llvm/llvm-project#78506 upstream

@aravindh-krishnamoorthy
Copy link
Contributor Author

aravindh-krishnamoorthy commented Jan 19, 2024

Yeah, that was a typo, could you try incfreasing it to 50 or 100? Though i'm not surprised if at that size it gets so bad. I opened llvm/llvm-project#78506 upstream

Thank for for raising the upstream issue, @gbaraldi. Actually, seems like it takes a lot to break GCC! This is the general process to check the assembly for any N:

  • Copy the awk script below to rs.awk
  • Generate the C file for 50, $ awk -v N=50 -f rs.awk rs.awk > rs50.c. For other sizes, please just change -v N=50
  • Generate the assembly, $ gcc -S -masm=intel -O3 -fverbose-asm rs50.c
rs.awk
$0 ~ /^##.*/ {gsub("^## ",""); print}
$0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=0; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}

## #include <stddef.h>
## struct wtf {
#~     int a[N];
## };
## struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i)
## {
##     struct wtf new;
#~     int idx? = (? + i) % N ;
#~     int val? = b->a[idx?] ;
#~     new.a[?] = val? ;
##     return new;
## }
## 

@gbaraldi
Copy link
Member

I get a

awk: rs.awk: line 2: regular expression compile failed (missing operand)
?
awk: rs.awk: line 2: regular expression compile failed (missing operand)
?

error

@aravindh-krishnamoorthy
Copy link
Contributor Author

I get a

awk: rs.awk: line 2: regular expression compile failed (missing operand)
?
awk: rs.awk: line 2: regular expression compile failed (missing operand)
?

error

Sorry, I think you have a POSIX compatible awk but this one uses extensions from GNU awk. My version is as follows:

$ awk --version
GNU Awk 5.2.1, API 3.2, PMA Avon 8-g1, (GNU MPFR 4.2.0, GNU MP 6.2.1)
Copyright (C) 1989, 1991-2022 Free Software Foundation.

And the POSIX version fails here too:

$ awk -P -v N=10 -f rs.awk rs.awk
awk: rs.awk:2: error: Invalid preceding regular expression: /?/

@gbaraldi
Copy link
Member

gbaraldi commented Jan 19, 2024

my awk doesn't even have a -- version option 😆 . But gawk worked

@aravindh-krishnamoorthy
Copy link
Contributor Author

my awk doesn't even have a -- version option 😆 . But gawk worked

Sorry, seems like I got the for loop wrong. The correct version is for (i=0; i<N; i++), which starts from $i=0.$

$0 ~ /^##.*/ {gsub("^## ",""); print}
- $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=1; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}
+ $0 ~ /^#~.*/ {gsub("^#~ ",""); gsub("N",N); if($0 ~ /?/) {for (i=0; i<N; i++) {a = $0; gsub("?",i,a); print a}} else {print}}

## #include <stddef.h>
## struct wtf {
#~     int a[N];
## };
## struct wtf __attribute__ ((noinline))  foo(struct wtf *b, int i)
## {
##     struct wtf new;
#~     int idx? = (? + i) % N ;
#~     int val? = b->a[idx?] ;
#~     new.a[?] = val? ;
##     return new;
## }
## 

I've also corrected it above.

@nsajko nsajko added performance Must go faster compiler:llvm For issues that relate to LLVM labels Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:llvm For issues that relate to LLVM performance Must go faster
Projects
None yet
Development

No branches or pull requests

3 participants