Performance of Union{T1, T2} #22097

Closed
quinnj opened this issue May 27, 2017 · 7 comments
Labels
performance Must go faster

Comments

quinnj commented May 27, 2017

Discussion has been scattered across various repos, mostly around the best and/or most performant representation of missingness (null values), but I wanted to open the discussion here specifically to talk about efforts to improve both code generation and memory layout when Union{T1, T2} types are involved, and more specifically when T1 and T2 are both isbits (I'm not sure that's a hard requirement, but it at least seems important for the memory-layout piece).

I've put together a few examples that are (hopefully) representative of some core operations that we'd like to improve.
(Note: I've run these examples against Julia 0.5, 0.6, and @vtjnash's jn/union-bits-layout branch without significant differences.)

using BenchmarkTools

g(x) = x < 5 ? x : -1
A = [i for i = 1:10]
function get_then_set(A)
    @simd for i = 1:10
        @inbounds A[i] = g(i)
    end
    return A
end
@code_warntype g(1)
@code_warntype get_then_set(A)
@benchmark get_then_set(A) # 20ns

g2(x) = x < 5 ? x : nothing

A = Union{Int, Void}[i for i = 1:10]
function get_then_set2(A)
    @simd for i = 1:10
        @inbounds A[i] = g2(i)
    end
    return A
end
@code_warntype g2(1)
@code_warntype get_then_set2(A)
@code_llvm get_then_set2(A)
@benchmark get_then_set2(A) # 155ns


g4(x::Int) = 1
g4(x::Void) = 0

A = [i for i = 1:10]
function get_sum(A)
    s = 0
    for a in A
        s += g4(a)
    end
    return s
end
@code_warntype get_sum(A)
@code_llvm get_sum(A)
@benchmark get_sum(A) # 24ns

A = Union{Int, Void}[i for i = 1:10]
A[[3, 5, 7]] = nothing
function get_sum2(A)
    s = 0
    for a in A
        s += g4(a)
    end
    return s
end
@code_warntype get_sum2(A)
@code_llvm get_sum2(A)
@benchmark get_sum2(A) # 100ns

I've aimed for these examples to be somewhat representative of the actual code uses I anticipate (e.g. in CSV.jl, we're interested in calling a parse(...)::Union{T, Null} and subsequently storing the return value at an index of a Vector{Union{T, Null}}).
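To make that pattern concrete, here's a stripped-down sketch of the shape of code I mean (tryparse_or_null, fill_column!, and the local Null stand-in are made up for illustration; they're not CSV.jl's or Nulls.jl's actual API):

struct Null end            # local stand-in for a missing-value type
const null = Null()

# hypothetical parser: returns the parsed value, or `null` on failure
tryparse_or_null(::Type{Int}, s::String) = isempty(s) ? null : parse(Int, s)

function fill_column!(col::Vector{Union{Int, Null}}, rows::Vector{String})
    for (i, s) in enumerate(rows)
        col[i] = tryparse_or_null(Int, s)   # a Union{Int, Null} flows into setindex!
    end
    return col
end

col = Vector{Union{Int, Null}}(3)           # 0.6-style uninitialized constructor
fill_column!(col, ["1", "", "42"])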

Inspecting the generated code in both @code_warntype and @code_llvm, it seems to me that the main issues the Union case runs into are the following (a tiny snippet isolating both is sketched after this list):

  • Insertion of branch statements like unless (Core.isa)(a::Union{Int64,Void},Void)::Any goto 14 at every call point for a Union
  • Inability to inline calls within these so-called "Union branches"
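
For reference, here's a small snippet (reusing g2 from above; `use` is just a throwaway name and the exact output will vary by Julia version) that makes both issues visible in one place:

use(x) = (v = g2(x); v isa Void ? 0 : v + 1)  # caller of the Union-returning g2
@code_warntype use(1)   # look for the `unless (Core.isa)(...)` branches
@code_llvm use(1)       # check whether the call to g2 was inlined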

So my main questions revolve around how we can address these performance issues and, assuming we succeed in doing so, how I can improve my mental model as a user/dev in terms of how to handle code that may involve Unions.

Also happy to help any way I can: providing or refining examples, diving into /src (though I'm afraid I may be a bit less efficient than others there), or helping write tests. Cheers!

quinnj commented May 28, 2017

Someone correct me if I'm wrong, but I believe I'm actually reporting the same thing @johnmyleswhite did in #11699, although that issue was a tad more "nullable"-focused, whereas I believe the actual problem comes down to plain unions, nullable or not.

In my studies, I also came across #10805 and wondered whether it would help the Union codegen situation. I'll leave @yuyichao to comment, but it seems like the core idea there addresses the issue here.

quinnj commented May 30, 2017

Another update here: I've found a work-around for the inlining issue. It turns out that adding my own explicit isa check allows inlining to happen, i.e. changing this

function get_then_set2(A)
    @simd for i = 1:10
        @inbounds A[i] = g2(i)
    end
    return A
end

to this

function get_then_set2(A)
    @simd for i = 1:10
        val = g2(i)
        if val isa Void
            @inbounds A[i] = val
        else
            @inbounds A[i] = val
        end
    end
    return A
end

I do, however, seem to be running into a tricky boxed-value issue. I can't reproduce it using the examples I've given here, but I'll try to come up with an easy reproduction. Basically it involves inlining a function two levels deep that returns a Union{T, Null}; in the outermost function call, a box still gets allocated for the return value, even though I have the if val isa Null check directly after all the inlining. At the moment I don't know what else to try to avoid the boxing.
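
To sketch the shape I mean (the names and the local Null stand-in are hypothetical, and this exact snippet does not reproduce the allocation for me; it only shows the structure):

struct Null end; const null = Null()       # local stand-in for Nulls.null

inner(x) = x < 5 ? x : null                # innermost call, returns Union{Int, Null}
middle(x) = inner(x)                       # second level that just forwards the Union

function outer!(A)
    for i = 1:10
        val = middle(i)                    # inlined two levels deep
        if val isa Null
            @inbounds A[i] = null
        else
            @inbounds A[i] = val::Int
        end
    end
    return A
end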

quinnj commented May 30, 2017

If anyone feels brave enough, you should be able to reproduce the allocation I'm seeing by doing the following (this probably works on 0.6 or current master; I'm personally on the jn/union-bits-layout branch):

Pkg.checkout("CSV", "jq/gangy")
Pkg.checkout("DataStreams", "jq/gangy")
Pkg.checkout("DataFrames", "jq/gangy")
Pkg.checkout("WeakRefStrings", "jq/gangy")
Pkg.checkout("Nulls")

Then I usually run the following code to generate a dummy file

open("randoms_Int64.csv", "w") do f
        for j = 1:1_000_000
            write(f, string(1001))
            write(f, "\n")
        end
    end

Then I run this to see roughly one allocation per line:

@time CSV.read("/Users/jacobquinn/Downloads/randoms_$(T).csv"; rows=999999)
# can run the below to see no allocations if we treat the column as `Vector{Int}` instead of `Vector{Union{Int, Null}}`
# @time CSV.read("/Users/jacobquinn/Downloads/randoms_$(T).csv"; nullable=false, rows=999999)

And finally, the code I'm using to inspect the generated code:

@time source = CSV.Source("/Users/jacobquinn/Downloads/randoms_Int64.csv"; )
sink = Si = DataFrames.DataFrame
transforms = Dict{Int,Function}()
append = false
args = kwargs = ()
source_schema = DataStreams.Data.schema(source)
sink_schema, transforms2 = DataStreams.Data.transform(source_schema, transforms, true)
sinkstreamtype = DataStreams.Data.Field
sink = Si(sink_schema, sinkstreamtype, append, args...; kwargs...)
columns = []
filter = x->true
@code_warntype DataStreams.Data.stream!(source, sinkstreamtype, sink, source_schema, sink_schema, transforms2, filter, columns)

@JeffBezanson added the performance label May 31, 2017
quinnj commented Jun 1, 2017

An additional note on performance, which I believe is the specific target of jn/union-bits-layout: currently there is a huge hit when allocating Union arrays.

julia> @benchmark Vector{Int}(1000000)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     66.183 μs (88.69% GC)
  median time:      200.744 μs (59.07% GC)
  mean time:        2.601 ms (96.94% GC)
  maximum time:     17.754 ms (99.03% GC)
  --------------
  samples:          322
  evals/sample:     6

julia>

julia> @benchmark Vector{Union{Nulls.Null,Int}}(1000000)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     979.422 μs (0.00% GC)
  median time:      1.208 ms (0.00% GC)
  mean time:        2.051 ms (42.53% GC)
  maximum time:     5.859 ms (68.86% GC)
  --------------
  samples:          2433
  evals/sample:     1

quinnj referenced this issue Jun 4, 2017 in a commit:
unlike codegen, only bitstypes (!isptr) fields are permitted in the union
and the offset count starts from 0 instead of 1
but otherwise the tindex counter is compatible
quinnj commented Jun 4, 2017

Alrighty, an update from everyone's favorite Union pest: I took a journey down the dark & scary path of inspecting walls of LLVM code and our own codegen.cpp, and I think I've got a better handle on the specific performance problems I'm seeing.

To summarize:

  • The original inlining issue has a fairly easy work-around for now, and I'm optimistic that patches like this will resolve that fairly easily
  • Currently, field types and array element types of Union{A, B} introduce boxing: for the specific field, or for each element of the array, respectively. I just realized tonight that @vtjnash's work in the jn/union-bits-layout branch is targeted specifically at field layout optimization rather than array layout.
  • The Vector{Union{Nulls.Null,Int}}(1000000) array constructor performance problem is then directly related to the fact that it allocates a bunch of empty boxes rather than the unboxed array buffer you get from Vector{Int}(1000000).
  • Similarly, the "extra allocation" I've been seeing in the DataStreams code comes from the jl_builtin_arrayset builtin we emit/generate for setting an array element. I didn't dig far enough to tell whether the issue is with the array itself being a boxed-element array (probably) or with the element being set, but either way an extra box gets introduced, causing the spurious allocation (a small sketch below illustrates the per-element effect).
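
For anyone following along, here's a rough way to observe the per-element boxing on the arrayset path (fill_ints! is a throwaway name; numbers will vary by machine and build):

A = Vector{Union{Int, Void}}(1000)   # 0.6-style uninitialized constructor

function fill_ints!(A)
    for i in eachindex(A)
        @inbounds A[i] = i           # each store of an Int into the Union array
    end
    return A
end

fill_ints!(A)                        # compile first
@allocated fill_ints!(A)             # nonzero as long as Union elements are stored boxed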

timholy commented Jun 21, 2017

Really interesting stuff. FYI I get a slight performance improvement from

function get_then_set2(A)
    @simd for i = 1:10
        val = g2(i)
        if val isa Void
            val = val::Void
            @inbounds A[i] = val
        else
            val = val::Int
            @inbounds A[i] = val
        end
    end
    return A
end

I suspect that getting vectorization will be quite a challenge, but improving the efficiency even without it would be a great thing.

quinnj commented Aug 18, 2017

Closing in favor of more specific issues #23336 and #23338

quinnj closed this as completed Aug 18, 2017