Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Preserve concrete Union types in collect() with generators #23940

Closed
wants to merge 1 commit into from

Conversation

nalimilan
Copy link
Member

When return_type() gives a Union of concrete types, use it as the element
type of the array instead of the type of the first element and progressively
making it broader as needed (often ending with element type Any).

This means that no reallocation will be needed anymore if/when encoutering
an element with a new type. Note that the inferred element type may still be
broader than the actual contents of the array, for example with
(x for x in Union{Int, Void}[1]).

Using a Union element type also has the advantage of using the more efficient
layout for isbitsunion types and of allowing for more efficient generated code
when using the resulting array. This is particularly noticeable with array
comprehensions, which inherit the behavior of collect().


Cf. JuliaData/Missings.jl#6 (comment) and following comments.

This change is surprisingly not disruptive at all, probably because code is not supposed to rely on inference. I'm still uncertain about at least one issue: whether it makes sense to apply this strategy only to Unions of concrete types. The clearest performance gain should be for isbitsunion types (efficient memory layout for the resulting array), but it should also be noticeable for other Unions of concrete types (at least for code reusing the resulting array, since branches on the type will be used). Unions of abstract types shouldn't get a better performance, but maybe for simplicity/consistency we could treat them the same?

Simple benchmark on realistic data with missing values:

julia> using Nulls

julia> x = Union{Float64,Null}[rand() > .2 ? rand() : null for _ in 1:10_000];

# On master, second run
julia> @time map(s -> s == 0, x);
  0.125965 seconds (29.25 k allocations: 1.572 MiB)

# On this PR, second run
julia> @time map(s -> s == 0, x);
  0.048889 seconds (14.63 k allocations: 789.467 KiB)

Note that s == 0 returns null for null inputs (with Nulls master), so with this PR the output of map is a Vector{Union{Bool, Null}}, which means that its narrow typing will help performance down the line too. The remaining allocations will probably be reduced by general optimizations for Union.

Cc: @davidanthoff @quinnj

When return_type() gives a Union of concrete types, use it as the element
type of the array instead of the type of the first element and progressively
making it broader as needed (often ending with element type Any).

This means that no reallocation will be needed anymore if/when encoutering
an element with a new type. Note that the inferred element type may still be
broader than the actual contents of the array, for example with
(x for x in Union{Int, Void}[1]).

Using a Union element type also has the advantage of using the more efficient
layout for isbitsunion types and of allowing for more efficient generated code
when using the resulting array. This is particularly noticeable with array
comprehensions, which inherit the behavior of collect().
@nalimilan nalimilan added the potential benchmark Could make a good benchmark in BaseBenchmarks label Sep 30, 2017
@JeffBezanson
Copy link
Sponsor Member

I really don't think we can handle this exactly this way --- there are too many cases where inference might return an unexpected Union type, or might decide to widen the Union instead. I would rather do something like looking at generator.itr and seeing if its element type has Null. Literally anything, any crazy hack at all, is better than using the inferred type.

isconcrete(T)
end
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have Base.isbitsunion for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is slightly different, it also includes things like Union{String, Void}.

@quinnj
Copy link
Member

quinnj commented Oct 1, 2017

It might be enough to find some syntax to allow declaring the eltype of a Generator upon construction; I'm not sure if we could somehow use/detect something like (f(x)::Union{Int, Null} for x in itr) and set a new eltype type parameter for Generator during lowering, defaulting to Any if no eltype was given, which would provide the same eltype as it does now if no type assert/declaration is given. The reasoning here is:

  • Generators are already syntactically special, so this would just be one extra special-case in their lowering
  • It is true that you can already do collect(T, generator), but that doesn't cover cases where you're passing a Generator to some other function that accepts iterables, e.g. filter, reduce, etc.

Not sure if this is too disruptive for Generators in general, but it would certainly help solve the "eltype" problem they have. Happy to spec some code out and do a PR if people think it's viable; would need some help on parser changes though (if we indeed think catching a type assert is viable or if some other syntax is possible).

@nalimilan
Copy link
Member Author

@JeffBezanson The problem is, Null isn't going to be defined in Base (in the current state of our discussions)... :-/ But I don't understand what's the problem with an "unexpected Union": either it's a Union of concrete types, so inference isn't completely unreasonable and better use it, or it's a less specific type and we still fall back to Any. Anyway, it can't be worse than the current Any, can it?

Or maybe we could restrict this to Unions of two concrete types? Or to Unions of one concrete type and one singleton type?

@quinnj Why wouldn't you want to preserve an inferred Union{T, Null} element type in a generator? It would be too bad pessimize them as if their element type was Any when we can generate a very efficient code and memory layout by default instead.

@nalimilan
Copy link
Member Author

nalimilan commented Oct 1, 2017

Oh, and maybe my misunderstanding is related to the fact that currently we observe this:

# OK: element type is the most specific given array contents
julia> map(identity, Union{Int,Void}[1])
1-element Array{Int64,1}:
 1

# Not OK: element type is always Any
julia> map(identity, Union{Int,Void}[1, nothing])
2-element Array{Any,1}:
 1       
  nothing

That is, when the return type is not based on what the array actually contains, but is always Any (== typejoin(Int, Void), but that's not really the question here). This means that any attempt to use a more specific type when inference finds it exposes inference to the user -- which I agree is bad.

I guess the solution would be, first to always return an array with the most specific type (Union or not) able to hold the values that are actually present. And only then apply optimizations like this one which would not be visible to the user and which could be tweaked as we want, even after 1.0. Is that what you have in mind?

@JeffBezanson
Copy link
Sponsor Member

One thing we can easily do is make typejoin(Int,Void) return Union{Int,Void}.

@quinnj
Copy link
Member

quinnj commented Oct 2, 2017

One thing we can easily do is make typejoin(Int,Void) return Union{Int,Void}.

I imagine there will be certain restrictions on this though, like only leaftypes? Would this just require changing the last line of typejoin from return Any to return Union{a, b}? Happy to take a stab at this.

@nalimilan
Copy link
Member Author

Closing in favor of the more general #24332.

@nalimilan nalimilan closed this Dec 9, 2017
@nalimilan nalimilan deleted the nl/generator branch December 9, 2017 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
potential benchmark Could make a good benchmark in BaseBenchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants