5x slowdown in 1.11.0-rc1 compared to 1.10.4 #55227

matthias314 · 2024-07-24T00:09:09Z

While working on my package SmallCollections.jl, I've noticed a significant slowdown in 1.11.0-rc1:

using SmallCollections, Chairmarks

sh = shuffles(2, 2, 2, 2, 2, 2, 2)
# this is an iterator over all ordered partitions of 1:14
# into 7 2-element sets, plus a permutation sign

function f(sh)
    s = 0
    for (t, _) in sh
        # "t" is the ordered partition, "_" would be the permutation sign
        s += sum(length, t)   # the right-hand side is just 7*2 = 14
    end
    s
end

EDIT: sh = shuffles(2, 2, 2, 2, 2, 2) (6 arguments) is enough for a 5x slowdown.

julia> @b f($sh)
1.718 s (without a warmup)   # 1.10.4
9.277 s (without a warmup)   # 1.11.0-rc1

~~The functionality is not yet in the published version, nor in master.~~
You can use the latest published version (v0.3.0) of SmallCollections.jl.

Julia Version 1.11.0-rc1
Commit 3a35aec36d1 (2024-06-25 10:23 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)


oscardssmith · 2024-07-24T00:58:38Z

I'm seeing ~8 seconds on 1.10, so I'm not fully sure what's going on here.

matthias314 · 2024-07-24T01:30:21Z

What processor do you have? I should have pointed out that on x86-64 and i686 with BMI2 the code uses the PDEP instruction (via llvm.x86.bmi.pdep.64). Without BMI2 and for other architectures it is emulated in software. This instruction is used heavily, so results could be quite different.

(I've just checked: PDEP does get used with 1.11.0-rc1.)
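
For readers unfamiliar with PDEP (parallel bit deposit): it scatters the low-order bits of a source word into the bit positions selected by a mask. A minimal portable sketch of what a software fallback could look like is given below. The name `pdep_soft` is hypothetical, and this is not necessarily how SmallCollections.jl implements its emulation.

```julia
# Portable sketch of the BMI2 PDEP operation (hypothetical helper, not the package's code).
# Bit k of `src` is written to the position of the k-th lowest set bit of `mask`.
function pdep_soft(src::UInt64, mask::UInt64)
    result = zero(UInt64)
    bit = one(UInt64)                  # selects the next low-order bit of src
    while mask != 0
        lowest = mask & -mask          # isolate the lowest set bit of mask
        if (src & bit) != 0
            result |= lowest           # deposit the source bit at that position
        end
        mask &= mask - one(UInt64)     # clear the lowest set bit of mask
        bit <<= 1
    end
    return result
end
```

For example, `pdep_soft(UInt64(0b101), UInt64(0b11010))` yields `UInt64(0b10010)`: the three low bits of the source land on mask positions 1, 3 and 4. A loop like this takes one iteration per set mask bit, which is why results could differ a lot between hardware PDEP and an emulation.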

oscardssmith · 2024-07-24T01:35:34Z

I have a Zen 2 (3600) which iirc also has BMI2.

oscardssmith · 2024-07-24T01:41:59Z

Oh, Zen 2 only has micro-coded PDEP, which LLVM is likely avoiding because it is slow. So the question seems to be whether LLVM is emitting PDEP instructions where it should.

oscardssmith · 2024-07-24T01:44:47Z

Actually, looking at the generated code shows that PDEP instructions are emitted in both 1.10 and 1.11. The difference seems to be that 1.11 spills and reloads a ton of variables, e.g.:

.LBB0_15:                               # %guard_pass1430
                                        #   in Loop: Header=BB0_2 Depth=1
	mov	r13, qword ptr [rbp - 792]      # 8-byte Reload
	mov	rax, qword ptr [rbp - 120]      # 8-byte Reload
	mov	rcx, qword ptr [rbp - 568]      # 8-byte Reload
	mov	rsi, qword ptr [rbp - 536]      # 8-byte Reload
	mov	r8, qword ptr [rbp - 560]       # 8-byte Reload
	mov	r11, qword ptr [rbp - 408]      # 8-byte Reload
	pdep	rdi, r13, rax
	mov	qword ptr [rbp - 656], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 544]      # 8-byte Reload
	mov	qword ptr [rbp - 176], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 472]      # 8-byte Reload
	mov	qword ptr [rbp - 648], r8       # 8-byte Spill
	mov	r8, qword ptr [rbp - 528]       # 8-byte Reload
	mov	qword ptr [rbp - 160], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 552]      # 8-byte Reload
	mov	qword ptr [rbp - 616], rcx      # 8-byte Spill
	mov	qword ptr [rbp - 88], rcx       # 8-byte Spill
	mov	qword ptr [rbp - 632], r8       # 8-byte Spill
	mov	qword ptr [rbp - 640], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 520]      # 8-byte Reload
	mov	qword ptr [rbp - 624], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 504]      # 8-byte Reload
	mov	qword ptr [rbp - 600], rsi      # 8-byte Spill
	mov	qword ptr [rbp - 184], rsi      # 8-byte Spill
	mov	r12, rdi
	xor	r12, rax
	mov	al, 1
	mov	dword ptr [rbp - 48], eax       # 4-byte Spill
	mov	rax, rcx
	mov	rcx, qword ptr [rbp - 496]      # 8-byte Reload
	mov	rax, qword ptr [rbp - 576]      # 8-byte Reload
	mov	qword ptr [rbp - 152], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 488]      # 8-byte Reload
	mov	qword ptr [rbp - 208], rax      # 8-byte Spill
	mov	rax, rcx
	mov	qword ptr [rbp - 608], rcx      # 8-byte Spill
	mov	qword ptr [rbp - 128], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 424]      # 8-byte Reload
	mov	rax, qword ptr [rbp - 480]      # 8-byte Reload
	mov	qword ptr [rbp - 376], rcx      # 8-byte Spill
	mov	rcx, rsi
	mov	rcx, qword ptr [rbp - 416]      # 8-byte Reload
	mov	rsi, qword ptr [rbp - 456]      # 8-byte Reload
	mov	qword ptr [rbp - 168], rax      # 8-byte Spill
	mov	rax, qword ptr [rbp - 400]      # 8-byte Reload
	mov	qword ptr [rbp - 592], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 512]      # 8-byte Reload
	mov	qword ptr [rbp - 200], rsi      # 8-byte Spill
	mov	rsi, qword ptr [rbp - 440]      # 8-byte Reload
	mov	qword ptr [rbp - 584], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 448]      # 8-byte Reload
	mov	qword ptr [rbp - 192], rsi      # 8-byte Spill
	mov	qword ptr [rbp - 672], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 432]      # 8-byte Reload
	mov	qword ptr [rbp - 680], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 392]      # 8-byte Reload
	mov	qword ptr [rbp - 112], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 384]      # 8-byte Reload
	mov	qword ptr [rbp - 136], rcx      # 8-byte Spill
	mov	rcx, qword ptr [rbp - 464]      # 8-byte Reload
	mov	qword ptr [rbp - 664], rcx      # 8-byte Spill

@topolarity or @gbaraldi any idea why the codegen would be this bad?
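
One rough way to quantify this difference between Julia versions is to count the spill and reload annotations in the generated assembly. The sketch below uses `code_native` from InteractiveUtils. The helper name `spill_stats` is made up for illustration, and it assumes that LLVM's `Spill`/`Reload` comments appear in the output as they do in the listing above, which may depend on the Julia version and on the `debuginfo` setting.

```julia
using InteractiveUtils  # provides code_native

# Hypothetical helper: count PDEP instructions and spill/reload annotations
# in the native code generated for f called with the given argument types.
function spill_stats(f, argtypes)
    io = IOBuffer()
    code_native(io, f, argtypes; syntax = :intel, debuginfo = :none)
    asm = String(take!(io))
    return (pdep = count("pdep", asm),
            spills = count("Spill", asm),
            reloads = count("Reload", asm))
end

# With the function from the original report (requires SmallCollections):
# spill_stats(f, (typeof(sh),))
```

Running the same call under 1.10 and 1.11 and comparing the counts gives a quick, if crude, indicator of register pressure without reading the whole listing.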

matthias314 · 2024-07-24T01:50:34Z

I don't know if this adds to the mystery or clears it up, but here is another function that I would expect to be equivalent to the f from above:

function g(sh)
    s = Ref(0)
    @inline foreach(sh) do (t, _)
        s[] += sum(length, t)
    end
    s[]
end

With this g, however, the difference disappears:

julia> @b g($sh)
1.709 s (without a warmup)   # 1.10.4
1.718 s (without a warmup)   # 1.11.0-rc1

gbaraldi · 2024-07-24T02:17:39Z

Potentially llvm/llvm-project#78506? Or excessive unrolling?

KristofferC · 2024-07-24T09:48:49Z

8e4221f is the first bad commit
Date: Wed Jan 10 03:47:02 2024 +0100
Bump Julia to LLVM 16 (#51720)

matthias314 · 2024-07-24T11:06:50Z

@gbaraldi In my example, iterate(sh) is implemented recursively over shuffles of smaller length. With 7 parameters, one therefore has many state variables:

julia> sizeof(iterate(sh)[2])
240

This does indeed sound like the "lots of temporary variables" case mentioned in llvm/llvm-project#78506.
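
To put the 240 bytes in perspective: x86-64 has 16 general-purpose registers of 8 bytes each, so a state of that size cannot be held entirely in registers if most of it is live across loop iterations. The toy example below is not the SmallCollections.jl implementation. `toy_state` is a hypothetical stand-in that only illustrates how a recursively built iterator state grows with the number of parameters.

```julia
# Hypothetical stand-in for a recursively defined iterator state:
# each level adds two words on top of the state of the smaller problem.
toy_state(::Val{0}) = ()
toy_state(::Val{N}) where {N} = (UInt64(N), UInt64(N), toy_state(Val(N - 1))...)

for n in (2, 4, 7)
    st = toy_state(Val(n))
    words = sizeof(st) ÷ sizeof(UInt64)
    println(n, " levels: ", sizeof(st), " bytes = ", words, " words")
end

# The state reported in the thread: 240 bytes = 30 eight-byte words,
# compared with 16 general-purpose registers on x86-64.
println(240 ÷ 8, " words vs. 16 registers")
```

Whether all of these words are live at the same time depends on the actual iterator, so this is only a plausibility argument for why register allocation is under pressure here.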

matthias314 · 2024-08-10T19:30:21Z

The problem is also present in master (LLVM 18), but to a smaller extent: there I see a 2.4x slowdown, compared with 5.4x for 1.11.0-rc2.

Julia Version 1.12.0-DEV.989
Commit 40ecf69019 (2024-08-05 12:54 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i3-10110U CPU @ 2.10GHz
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

As mentioned in the updated OP, one can now use the latest published version (v0.3.0) of SmallCollections.jl to reproduce the issue.

nsajko · 2024-08-12T15:36:50Z

possible dup: #52933

oscardssmith added the regression 1.11 (Regression in the 1.11 release) and performance (Must go faster) labels Jul 24, 2024

KristofferC added the bisect wanted label Jul 24, 2024

KristofferC removed the bisect wanted label Jul 24, 2024

JeffBezanson added the compiler:llvm (For issues that relate to LLVM) label Jul 24, 2024
