Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression on the threading branch compared to the threads branch #10527

Closed
kpamnany opened this issue Mar 16, 2015 · 12 comments
Closed
Labels
multithreading Base.Threads and related functionality performance Must go faster regression Regression in behavior compared to a previous version

Comments

@kpamnany
Copy link
Contributor

The reason the old threads branch is still around and hasn't been replaced by the newer threading branch is that there is a performance regression. There are three forms for the @threads macro:

  1. @threads all foo()
  2. @threads all begin ... end
  3. @threads all for ... end

They all use the same threading call--the latter two generate an inner function to pass in--and should have the same performance. In the threads branch they do; in the threading branch, the latter two are ~3.5X slower than the first form.

Any of the three forms can be used in test/perf/threads/laplace3d.jl; choose the one to try at line 95.

@JeffBezanson: I think you had fixed this or something very similar to this in the threads branch back in November--something to do with type checks in inner functions?

@kpamnany kpamnany added the multithreading Base.Threads and related functionality label Mar 16, 2015
@ViralBShah ViralBShah changed the title Performance regression Performance regression on the threading branch compared to the threads branch Mar 16, 2015
@JeffBezanson JeffBezanson added performance Must go faster regression Regression in behavior compared to a previous version labels Mar 16, 2015
@JeffBezanson
Copy link
Sponsor Member

I have this one by the tail. It comes down to how the array assignment gets inlined. In the fast version:

        (top(arrayset))(u3::Array{Float32,3},#s36::Float32,#56#k_1::Int64,k_2::Int64,k_3::Int64)::Array{Float32,3}

and in the slow version:

        _var13 = u3::Array{Float32,3}
        _var12 = #s36::Float32
        _var11 = #56#k_1::Int64
        _var10 = k_2::Int64
        _var9 = k_3::Int64
        (top(arrayset))(_var13::Array{Float32,3},_var12::Float32,_var11::Int64,_var10::Int64,_var9::Int64)::Array{Float32,3}

Just extra (though unnecessary) temp vars. Surprisingly, this causes a cascade of horrible effects in code gen. u3 comes from the closure environment. The extra temp var moves it to a GC root, which I believe throws off alias analysis (the closed vars are tbaa_const). Then we/llvm fail to hoist array metadata access out of the loop, and consequently also fail to vectorize the loop.

cc @vtjnash @Keno

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 16, 2015

It looks like inlining may be concerned about the evaluation order. I've seen that be an issue with some intrinsics which want to swap their argument order.

Since u3 is not local, it's usage is not affect-free. (I don't think that's related here though)

@JeffBezanson
Copy link
Sponsor Member

But u3 is never assigned, so it should be affect-free.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 16, 2015

True. Plus it wouldn't explain why the Other vars are being extracted. More codegen passes now generate GenSym objects. Which still generate a symbol?

@JeffBezanson
Copy link
Sponsor Member

Somewhat related: #6713

@JeffBezanson
Copy link
Sponsor Member

Inlining still uses symbols instead of GenSyms, based on the islocal variable in inlineable. I don't think the logic around this variable is optimal. Whether the variable is assigned to seems to be what matters --- do I have that right?

@JeffBezanson
Copy link
Sponsor Member

I can fix this by making effect_free return true for more symbols. Surely the logic there now is way too conservative! Can a non-captured local variable ever change value while evaluating an argument list?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 17, 2015

Inlining still uses symbols instead of GenSyms, based on the islocal variable in inlineable. I don't think the logic around this variable is optimal. Whether the variable is assigned to seems to be what matters --- do I have that right?

yes. if the symbol is local, that means the argument is also assigned locally. it's only valid to use a GenSym when the variable is assigned exactly once.

Can a non-captured local variable ever change value while evaluating an argument list?

can you give an example of what you mean? i thought a local variable would already always return true

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 17, 2015

insanity case?

function f() 
  x = 1
  g() = (x+=1)
  h(a,b,c,d) = (a,b,c,d)
  return h(x, g(), g(), x)
end

but the inliner would bail on that just due to the presence of the captured variable (

julia/base/inference.jl

Lines 2374 to 2377 in b201312

if (vi[3]&1)!=0
# captures variables (TODO)
return NF
end
)

@JeffBezanson
Copy link
Sponsor Member

Yeah, that's why I said "non-captured".
effect_free says:

    if isa(e,SymbolNode)
        allow_volatile && return true
        if is_global(sv, (e::SymbolNode).name)
            return false
        end
    end

In other words, it only returns true for a local when allow_volatile is true. Quite sub-optimal.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 17, 2015

Ah, right. That's not falling through correctly and is being wayyyy too conservative

JeffBezanson added a commit that referenced this issue Mar 18, 2015
this avoids inserting temporary variable copies of local variables
unless they might be assigned by an inner function.
@JeffBezanson
Copy link
Sponsor Member

Ok I fixed this on master, and it will fix this issue when picked onto the threading branch.

JeffBezanson added a commit that referenced this issue Mar 18, 2015
this avoids inserting temporary variable copies of local variables
unless they might be assigned by an inner function.

Conflicts:
	base/inference.jl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality performance Must go faster regression Regression in behavior compared to a previous version
Projects
None yet
Development

No branches or pull requests

3 participants