Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5x slowdown in 1.11.0-rc1 compared to 1.10.4 #55227

Open
matthias314 opened this issue Jul 24, 2024 · 11 comments
Open

5x slowdown in 1.11.0-rc1 compared to 1.10.4 #55227

matthias314 opened this issue Jul 24, 2024 · 11 comments
Labels
compiler:llvm For issues that relate to LLVM performance Must go faster regression 1.11 Regression in the 1.11 release

Comments

@matthias314
Copy link
Contributor

matthias314 commented Jul 24, 2024

While working on my package SmallCollections.jl, I've noticed a significant slowdown in 1.11.0-rc1:

using SmallCollections, Chairmarks

sh = shuffles(2, 2, 2, 2, 2, 2, 2)
# this is an iterator over all ordered partitions of 1:14
# into 7 2-element sets, plus a permutation sign

function f(sh)
    s = 0
    for (t, _) in sh
        # "t" is the ordered partition, "_" would be the permutation sign
        s += sum(length, t)   # the right-hand side is just 7*2 = 14
    end
    s
end

EDIT: sh = shuffles(2, 2, 2, 2, 2, 2) (6 arguments) is enough for a 5x slowdown.

julia> @b f($sh)
1.718 s (without a warmup)   # 1.10.4
9.277 s (without a warmup)   # 1.11.0-rc1

The functionality is not yet in the published version, nor in master.
You can use the latest published version (v0.3.0) of SmallCollections.jl.

Julia Version 1.11.0-rc1
Commit 3a35aec36d1 (2024-06-25 10:23 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)
@oscardssmith oscardssmith added regression 1.11 Regression in the 1.11 release performance Must go faster labels Jul 24, 2024
@oscardssmith
Copy link
Member

I'm seeing ~8 seconds on 1.10, so I'm not fully sure what's going on here.

@matthias314
Copy link
Contributor Author

What processor do you have? I should have pointed out that on x86-64 and i686 with BMI2 the code uses the PDEP instruction (via llvm.x86.bmi.pdep.64). Without BMI2 and for other architectures it is emulated in software. This instruction is used heavily, so results could be quite different.

(I've just checked: PDEP does get used with 1.11.0-rc1.)

@oscardssmith
Copy link
Member

I have a Zen 2 (3600) which iirc also has BMI2.

@oscardssmith
Copy link
Member

oh, Zen2 only has micro-coded PDEP that LLVM is likely avoiding since it is slow. So the issue seems to be whether or not LLVM is emitting PDEP instructions where it should.

@oscardssmith
Copy link
Member

Actually looking at the code shows that I'm getting pdeps emitted in both 1.10 and 1.11. The difference seems to be that 1.11 is spilling and reloading a ton of variables. e.g:

.LBB0_15:                               # %guard_pass1430
                                        #   in Loop: Header=BB0_2 Depth=1
	mov	r13, qword ptr [rbp - 792]      # 8-byte Reload
	mov	rax, qword ptr [rbp - 120]      # 8-byte Reload
	mov	rcx, qword ptr [rbp - 568]      # 8-byte Reload
	mov	rsi, qword ptr [rbp - 536]      # 8-byte Reload
	mov	r8, qword ptr [rbp - 560]       # 8-byte Reload
	mov	r11, qword ptr [rbp - 408]      # 8-byte Reload
	pdep	rdi, r13, rax
	mov	qword ptr [rbp - 656], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 544]      # 8-byte Reload
	mov	qword ptr [rbp - 176], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 472]      # 8-byte Reload
	mov	qword ptr [rbp - 648], r8       # 8-byte Spill
	mov	r8, qword ptr [rbp - 528]       # 8-byte Reload
	mov	qword ptr [rbp - 160], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 552]      # 8-byte Reload
	mov	qword ptr [rbp - 616], rcx      # 8-byte Spill
	mov	qword ptr [rbp - 88], rcx       # 8-byte Spill
	mov	qword ptr [rbp - 632], r8       # 8-byte Spill
	mov	qword ptr [rbp - 640], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 520]      # 8-byte Reload
	mov	qword ptr [rbp - 624], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 504]      # 8-byte Reload
	mov	qword ptr [rbp - 600], rsi      # 8-byte Spill
	mov	qword ptr [rbp - 184], rsi      # 8-byte Spill
	mov	r12, rdi
	xor	r12, rax
	mov	al, 1
	mov	dword ptr [rbp - 48], eax       # 4-byte Spill
	mov	rax, rcx
	mov	rcx, qword ptr [rbp - 496]      # 8-byte Reload
	mov	rax, qword ptr [rbp - 576]      # 8-byte Reload
	mov	qword ptr [rbp - 152], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 488]      # 8-byte Reload
	mov	qword ptr [rbp - 208], rax      # 8-byte Spill
	mov	rax, rcx
	mov	qword ptr [rbp - 608], rcx      # 8-byte Spill
	mov	qword ptr [rbp - 128], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 424]      # 8-byte Reload
	mov	rax, qword ptr [rbp - 480]      # 8-byte Reload
	mov	qword ptr [rbp - 376], rcx      # 8-byte Spill
	mov	rcx, rsi
	mov	rcx, qword ptr [rbp - 416]      # 8-byte Reload
	mov	rsi, qword ptr [rbp - 456]      # 8-byte Reload
	mov	qword ptr [rbp - 168], rax      # 8-byte Spill
	mov	rax, qword ptr [rbp - 400]      # 8-byte Reload
	mov	qword ptr [rbp - 592], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 512]      # 8-byte Reload
	mov	qword ptr [rbp - 200], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 440]      # 8-byte Reload
	mov	qword ptr [rbp - 584], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 448]      # 8-byte Reload
	mov	qword ptr [rbp - 192], rsi      # 8-byte Spill
	mov	qword ptr [rbp - 672], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 432]      # 8-byte Reload
	mov	qword ptr [rbp - 680], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 392]      # 8-byte Reload
	mov	qword ptr [rbp - 112], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 384]      # 8-byte Reload
	mov	qword ptr [rbp - 136], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 464]      # 8-byte Reload
	mov	qword ptr [rbp - 664], rcx      # 8-byte Spill

@topolarity or @gbaraldi any idea why the codegen would be this bad?

@matthias314
Copy link
Contributor Author

I don't know if this adds to the mystery or clears it up, but here is another function that I would expect to be equivalent to the f from above:

function g(sh)
    s = Ref(0)
    @inline foreach(sh) do (t, _)
        s[] += sum(length, t)
    end
    s[]
end

With this g, however, the difference disappears:

julia> @b g($sh)
1.709 s (without a warmup)   # 1.10.4
1.718 s (without a warmup)   # 1.11.0-rc1

@gbaraldi
Copy link
Member

gbaraldi commented Jul 24, 2024

Potentially llvm/llvm-project#78506 ? Or excessive unrolling?

@KristofferC
Copy link
Sponsor Member

KristofferC commented Jul 24, 2024

8e4221f is the first bad commit
Date: Wed Jan 10 03:47:02 2024 +0100
Bump Julia to LLVM 16 (#51720)

@matthias314
Copy link
Contributor Author

@gbaraldi In my example, iterate(sh) is implemented recursively over shuffles of smaller length. With 7 parameters, one therefore has many state variables:

julia> sizeof(iterate(sh)[2])
240

This sounds indeed like the "lots of temporary variables" mentioned in llvm/llvm-project#78506.

@JeffBezanson JeffBezanson added the compiler:llvm For issues that relate to LLVM label Jul 24, 2024
@matthias314
Copy link
Contributor Author

The problem is also present in master (LLVM 18), but to a smaller extent: There I see a 2.4x slowdown instead of 5.4x for 1.11.0-rc2.

Julia Version 1.12.0-DEV.989
Commit 40ecf69019 (2024-08-05 12:54 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

As mentioned in the updated OP, one can now use the latest published version (v0.3.0) of SmallCollections.jl to reproduce the issue.

@nsajko
Copy link
Contributor

nsajko commented Aug 12, 2024

possible dup: #52933

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:llvm For issues that relate to LLVM performance Must go faster regression 1.11 Regression in the 1.11 release
Projects
None yet
Development

No branches or pull requests

6 participants