Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Union{T,Null} inference in structs #6

Open
davidanthoff opened this issue May 14, 2017 · 54 comments
Open

Union{T,Null} inference in structs #6

davidanthoff opened this issue May 14, 2017 · 54 comments

Comments

@davidanthoff
Copy link

Say we have a struct like this:

struct Foo{T}
    v::T
end

And a function like this:

function bar(x)
    x = rand() > 0.5 : 0. : null
    return Foo(x)
end

Then the return type of this is currently inferred as Union{Foo{Float64}, Foo{Nulls.Null}}. This won't work for the whole Query.jl design, I would need the inferred return type here to be Foo{Union{Float64,Null}}.

But here is the twist: I'm not convinced that this would be the right thing to do in general, i.e. that changing the current julia behavior across the board to something that squares with the needs of Query.jl would be a good idea. Note that this is not an issue with the current approach taken in Query.jl where I use the DataValue type from the DataValues.jl package to represent missing values.

@nalimilan
Copy link
Member

Can you give a concrete example where this behavior is a problem?

@quinnj
Copy link
Member

quinnj commented May 15, 2017

So I've been experimenting w/ porting DataStreams/CSV/DataFrames over to use Nulls.jl over the weekend and actually ran into this a little. My issue was passing the parsed value from CSV to be stored in the DataFrame; at first, I was running into issues because I was allocating the DataFrame vectors as Vector{Int}, Vector{Float64}, Vector{Date}, etc. but then trying to pass a null and there were type errors ("can't assign ::Null to ::Date", etc.). The key here is to declare the field types to be Union{T, Null}, so in your case

using Nulls

struct Foo3{T}
    v::T
end

function bar3(x)
    x = rand() > 0.5 ? 0. : null
    return Foo3{?(Float64)}(x)
end

In this case now, @code_warntype bar3(1) correctly has Foo3{Union{Float64, Nulls.Null}} as the return type, and both floats and nulls can be passed as the inner value w/o issue.

So all that to say that I recognize the issue you're talking about here and ran into it myself, but the work-around isn't actually a problem once you stop to think about it. It just takes a little more pre-declaring container types instead of purely relying on inferring the type parameters.

@davidanthoff
Copy link
Author

For a query like this:

@from i in df begin
    @select {i.a, i.b}
end

we will end up with an anonymous function like this somewhere deep inside Query.jl: i -> Foo(i.a, i.b). The type of Foo is

struct Foo{T1,T2}
    a::T1
    b::T2
end

This anonymous function will be called with instances of another type, say Bar, that represents one row from the source:

struct Bar{T1,T2}
    a::T1
    b::T2
end

The specific concrete type that would be passed into this anonymous function might be Foo{Union{Int,Null},Union{String,Null}}.

@quinnj suggestion to explicitly declare the fields to be Union{T,Null} would work from a technical point of view, but it would essentially destroy usability, because folks now would have to write this query as:

@from i in df begin
    @select @NT(a::?(Int) = i.a, b::?(String) = i.b)
end

@nalimilan
Copy link
Member

@quinnj suggestion to explicitly declare the fields to be Union{T,Null} would work from a technical point of view, but it would essentially destroy usability, because folks now would have to write this query as:

Couldn't the query macros do that work automatically?

@davidanthoff
Copy link
Author

@quinnj suggestion to explicitly declare the fields to be Union{T,Null} would work from a technical point of view, but it would essentially destroy usability, because folks now would have to write this query as:

Couldn't the query macros do that work automatically?

No, the macro doesn't even know that the is will be Bars. You can for example be in a situation where the is will be Foo{Int,String,}, and in that case the anonymous function should return Bar{Int,String} instances, so if the macro rewrote the anonymous function it would break that case.

@davidanthoff
Copy link
Author

I think this problem is actually just an instance of two issues that were raised and discussed in the original julep (but never resolved, as far as I can tell).

The core of the problem is this: if you have a higher-order function and pass a procedural parameter that returns a composite type and it operates over one or more collections with element type Union{T,Null}, things get very messy. This affects a large number of use cases beyond Query.jl: dictionary generators, list comprehensions, map and broadcast (and potentially more, I haven't gone through this in a systematic way).

The two issues from the julep (and the discussion in the PR) that are relevant here are 1) John's rational for preferring Union{Some{T},Null} over Union{T,Null} and 2) @andyferris's arguments for using Union{T,Null{T}}. Neither approach would solve the problem, but each points to one of the two problems I'm discussing in this issue.

John's reasoning for preferring Union{Some{T},Null} over Union{T,Null} was that otherwise an f is bound to always return exactly the same thing whether it is called with a T or in a Union{T,Null} setting. But when f returns a composite type, we might actually want to create a composite type that has a Union field if it is called in a context where a missing value can be passed. That is fundamentally not possible with the Union{T,Null} approach, but it bites us with these higher-order functions: we would like them to typically create say an array or a Dict that has elements of the composite type that have Union{T,Null} fields, but instead it will create arrays where the element types are unions of different composite types. Note that this really starts to break once you look at composite types with many fields, you would essentially end up with an array who's element type is a union of 2^number_of_fields different composite types.

@andyferris's reason for preferring Union{T,Null{T}} was that there are cases where you need to return something from a function that is different depending on the type of the missing value. This problem also shows up in the problem of this issue: we need to be able to have an f that returns a composite type that has a Union{T,Null} field. But if we are just called with a Null, without any info about T, we can't do that, we can only return a composite type where the field is of type Null. So higher order functions like map will again return a stream of elements that all have different types, and we would need an array that holds unions of these composite types, instead of an array that holds composite types with union fields. Note also that the main problem with parameterizing the type of a null is the counterfactual return type problem. But if we go with a white listing approach, I think that problem essentially goes away.

The only way to solve both problems with the Union approach that I can see right now would be to use Union{Some{T}, Null{T}}. It would still require the kind of rewrite of a say @select statement in a query that @nalimilan suggested above, but I think with that type this might work (but mainly I haven't thought this one through, so who knows). It won't work easily for any of the other cases I mentioned above (dictionary generators, list comprehensions etc.).

Here is a simple examples that illustrate this problem:

using Nulls

A = Union{Int,Null}[1,2,null,4,null]
B = Union{Int,Null}[1,null,3,4,null]

C = map(Pair, A, B)

Note how this array will have elements with four different concrete types: Pair{Int,Int}, Pair{Int,Null}, Pair{Null,Int} and Pair{Null,Null}. What we don't get is Pair{Union{T,Null},Union{T,Null}}. For types with more fields this can go from 2^2 cases to 2^number_of_fields.

So I think this issue is essentially about two problems with Union{T,Null} that were identified in the julep and its discussion, and at least one of them seemed to be the reason why John actually recommended to not use Union{T,Null}. Was there any other discussion somewhere about these two problems where this was resolved?

@quinnj
Copy link
Member

quinnj commented May 17, 2017

Yes, this is still the same original problem here, but I'm also not sure why the same solution doesn't apply

julia> C = map((x,y)->Pair{?(Int),?(Int)}(x, y), A, B)
5-element Array{Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}},1}:
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(1, 1)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(2, null)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, 3)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(4, 4)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, null)

julia>  Pair{?Int, ?Int}[Pair(x, y) for (x, y) in zip(A, B)]
5-element Array{Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}},1}:
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(1, 1)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(2, null)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, 3)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(4, 4)
 Pair{Union{Int64, Nulls.Null},Union{Int64, Nulls.Null}}(null, null)

i.e. the container just has to be declared/tagged w/ the right ?(T) type parameters upon construction. In my experience porting DataStreams over so far, this was a trivial update and results in no other issues; when streaming, we just ensure the Sink allocates its internal storage as Vector{?T} (if it even needs to allocate at all).

Going back over the John/Andy discussions, I'm inclined to say that it's probably a more theoretical problem than practical; especially since I believe most users won't even run into this issue in their own code. They'll more than likely be calling higher-level libraries like DataStreams/Query which can handle the typing under the hood. Even if they're extracting DataTable/Frame columns to do their own operations, they're already set since the columns themselves will already be correctly typed with Vector{?Int}

@andyferris
Copy link
Member

andyferris commented May 18, 2017

As a prelude: Like all great works, I feel that reading John's julep was a process of gradually transcending many layers of deeper conceptual understanding :)

That said, after much reflection, I actually think that this application (nullables, nulls, etc) may have been the first to expose a bit of an inconsistency (*) in Julia's type system. For example, "covariant" objects like direct bindings follow the semantics of a dynamic language, while the elements of structs, refs, arrays, etc behave a lot like the types in a static type system. Unfortunately, once you reach any level of complexity, you end up taking covariant objects and throwing them into containers, passing that through a function, which maps these elementwise through another function, etc... and you end up with the problems discussed in this issue.

So, I think that none of

  • Union{T, Null}
  • Union{Some{T}, Null}
  • Union{T, Null{T}}
  • Union{Some{T}, Null{T}}
  • Nullable{T}

are perfectly suitable for all situations - more precisely, I think you can construct objectionable situations for any of the five types above. I think what happened in the Julep is that different people found different objectionable situations for several of the above.

Given that, my suggestion is to choose whatever is easiest for the typical user, and then do your best to patch up the resulting holes that appear - case in point being the OP. At this stage, I'd say that Union{T, Null} seems to take the prize for the "easiest to use" for simple applications. OTOH, I do hope that whatever happens can be made compatible with all the wonderful work done by @davidanthoff (and others), and if a consensus is reached I'm sure we'll all pitch in to help each other.

(*) - inconsistency is definitely too strong of a word (apologies to Jeff B, if you're reading this), but I couldn't think of another one. (Warning, going off-topic). I'm not saying that it's fundamentally broken or anything, but we definitely have to deal with covariant types and their invariant type parameters. In a static language, you would just use Nullable{T}, but hopefully with better syntax (like Swift). In most dynamic languages, you would probably choose use a Null object (like R). If Julia were going to take the dynamic type thing to it's limit, I'm guessing structs with covariant elements would be possible, and in the extreme case even refs and arrays could have semantically covariant eltypes with inference's job to allow them to be unboxed when provably safe... (the advantage of all that being that we wouldn't have to ever invoke inference to figure out the element types of a container (I think this was one of John's main goals with the Julep?), or play the fun game done by untyped comprehensions which I worry might propagate everywhere else in array-land, although that partly has to do with promote not just inference).

@davidanthoff
Copy link
Author

Yes, declaring the return types in all these cases works. But I think that would prevent us from enabling a large class of very elegant data manipulation solutions. Here is one I just enabled on master in IterableTables.jl, and then I'll try to explain the difficulty in solving this in the Query.jl context.

With the latest master version of IterableTables.jl, you can write simple queries that project and filter using standard julia generator syntax:

using DataFrames, IterableTables, NamedTuples

df = DataFrame(a=[1,2,3], b=[4,5,6])

df2 = DataFrame(@NT(c=i.a, b=i.b) for i in getiterator(df) if get(i.a)>2)

This will generate a new DataFrame with columns c and d and only the third row from the original (and of course you could also use any of the other table types supported in IterableTables.jl as the source and sink in this, and mix and match).

This is still somewhat verbose, but I think there is a very good chance that we could get this down to the following in maybe even the julia 1.0 time frame:

df2 = DataFrame({c=i.a, i.b} for i in df if i.a>2)

This would require a few things: First, a named tuple syntax in base. Doesn't have to be {} as I assumed here, but something like that. My sense is that this is likely for julia 1.0 and almost certain in the next update after julia 1.0. Second, the simpler filtering clause just requires that I'll switch IterableTables.jl over to DataValue, which I'm going to do in the next few days. The simpler filter would also work with Union{T,Null}. Third, getting rid of the getiterator call. This probably can easily be achieved for types like TypedTable or IndexedTable that have all their column name and type information encoded in their type, and less easy for something like DataFrame. I have a couple of other ideas for this, but that is probably the most difficult part. In any case, in the worst case, we would just still have the getiterator call there.

In my mind this is syntax that would potentially be used a lot by end users for very simple data manipulation of table data. Now, if we wanted this to work with the Union{T,Null} type, we would have to type annotate the fields of the named tuple that is generated, similar to what @quinnj recommended above, so the same query now would look like this:

df2 = DataFrame({c::?(Int)=i.a, b::?(Int)=i.b} for i in df if i.a>2)

I think that would make this whole idea probably moot, folks don't want to specify every column type in this kind of simple query.

Alright, running out of time, I'll write up why this is not trivially handled in something like Query.jl, or at least why I don't see a way to do this right now. Would obviously be great if one of you could think of a solution to that, once I've described the problem.

Oh, one last thing: I'm not surprised that this all works great in DataStreams.jl. I'm 99% sure I would not run into the slightest problem with Union{T,Null} if I was just looking at IterableTables.jl. But Query.jl is a whole other beast...

@nalimilan
Copy link
Member

nalimilan commented May 20, 2017

Thanks for the details, that helps understanding the issue. So the concrete problem is very similar to this:

julia> using NamedTuples, Nulls
julia> itr = (@NT(c=i.a, b=i.b) for i in (@NT(a=1, b=2), @NT(a=2, b=null)));
julia> collect(itr)
2-element Array{NamedTuples._NT_c_b{Int64,T2} where T2,1}:
 (c = 1, b = 2)   
 (c = 2, b = null)

Or even this:

julia> itr2 = (((i[1], i[2]) for i in ((1, 2), (2, null))));
julia> collect(itr2)
2-element Array{Tuple{Int64,Any},1}:
 (1, 2)   
 (2, null)

The element type is chosen by collect via Base._default_eltype, which in turns relies on inference:

julia> Core.Inference.return_type(first, Tuple{typeof(itr)})
Union{NamedTuples._NT_c_b{Int64,Int64}, NamedTuples._NT_c_b{Int64,Nulls.Null}}

julia> Core.Inference.return_type(first, Tuple{typeof(itr2)})
Tuple{Int64,Any}

Fortunately, even though collect chooses to simplify the element type of the returned array to Array{NamedTuples._NT_c_b{Int64,T2} where T2,1} (via typeintersect), no information is lost about the element type of itr. By adjusting the definition of getiterator to:

function getiterator(source::Base.Generator)
    T = Core.Inference.return_type(first, Tuple{typeof(source)})
    S = typeof(source)
    return IterableTables.GeneratorIterator{T,S}(source)
end

I get this:

julia> eltype(getiterator(itr))
Union{NamedTuples._NT_c_b{Int64,Int64}, NamedTuples._NT_c_b{Int64,Nulls.Null}}

This means you should be able to perform whatever type computation you need inside _DataFrame. When a Union is encountered, you just need to iterate over its parts, extract the NamedTuple types inside, and Union the types corresponding to the same field in each NamedTuple. Not terribly pretty, but it should work.

BTW, I haven't been able to replicate the example from your last post on Julia 0.6, even with IterableTables from git master. The df2 = ... step gives ERROR: type UnionAll has no field types.

@nalimilan
Copy link
Member

Also, regarding the Pair examples given above, I've noticed this:

julia> [a=>b for (a, b) in zip([1, 2], [2, null])]
2-element Array{Pair{Int64,B} where B,1}:
 1=>2   
 2=>null

Notice that the return type is not Array{Pair{Int, Int}, Pair{Int, Null}}, so we're already applying some transformations to Union types. So I don't see why B where B above couldn't be kept as Union{Int, Null}. This would make a lot of sense if unions are made more efficient.

@ararslan
Copy link
Member

ararslan commented May 20, 2017

I don't see why B where B above couldn't be kept as Union{Int, Null}. This would make a lot of sense if unions are made more efficient.

Perhaps, though it seems to me that the advantage of going as general as possible off the bat for the element type of a mutable object is that the element type doesn't have to be recomputed if a new element is added. I imagine that could be detrimental to performance. (Edit: This makes no sense, disregard)

@nalimilan
Copy link
Member

the element type doesn't have to be recomputed if a new element is added. I imagine that could be detrimental to performance.

What do you mean? The type of an array won't change when adding an element, so you can't even add an entry of a different type. This is indeed one of the drawbacks of precise typing, but we accept this for concrete types because it is essential for performance (and safety/predictability BTW).

@ararslan
Copy link
Member

What I mean is that if you have Array{Pair{Int64, B} where B, 1}, B <: Any. If you constrain B to be e.g. Union{Int64, Null} then try to push!(x, 1=>1.2), you won't be able to. I (completely naively) assume that Julia defensively makes the non-constant type as wide as possible, since it doesn't know whether you want to add more randomly typed garbage in there. So constraining it to be a Union seems perhaps too narrow. I dunno though. (Also my performance argument above doesn't make any sense.)

@nalimilan
Copy link
Member

Yes, but how is that different from e.g. Array{Pair{Int, Int}}? We never complain when our arrays are concretely typed, so why would we care when they are "semi-concretely" (i.e. Union) typed?

@ararslan
Copy link
Member

That's a good point. Don't mind me, then...

@davidanthoff
Copy link
Author

@nalimilan There are really three questions, I believe: 1) can one recover enough type information to construct the right data structure to materialize a query, 2) will that data structure be fast and efficient and 3) can one write code that materializes things that is fast.

In your examples we end up with the following types for elements: a) Tuple{Int64,Any}, b) Union{_NT_c_b{Int64,Int64}, _NT_c_b{Int64,Nulls.Null}} and c) _NT_c_b{Int64,T2} where T2.

I believe b) is something we can only get if there are very few fields in the target struct that can take a missing value. Otherwise you end up with 2^n elements in the union. I think type inference will actually infer things as UnionAll types (as in c)) once the number of fields increases. So, for now I'll just limit the discussion to a) and c).

In the a) case we lost the required type information, i.e. we no longer can tell that the second tuple element can only be an Int or a missing value. So that looks like another example where things go wrong.

c) is interesting. In this case we have to deal with UnionAll types. In your example there is only one type variable, but if there were more fields that can take nulls, this might look like _NT_bla_bla{T1,T2,T3} where T1 where T2 where T3 etc.

Now, I believe you are right that with the type information in c) we could construct say a DataFrame that has the right column types etc, i.e. the type information we need is there and not lost. But if we think about materializing this say into an array, the question becomes whether the changes to julia base will also make arrays of UnionAll types as fast and efficient as arrays of small Union types. As far as I can tell that is not the plan, right? It also seems really almost impossible to me. An array of say type Array{_NT_a_b{T1,T2} where T1 where T2,1} would require a memory layout that can store e.g. _NT_a_b{Int, AnyType} as well as _NT_a_b{Int,Null}. I don't see how this could be achieved in any compact sense without resorting to pointers in the array that point towards the elements stored in the array, which would clearly destroy any performance.

Now, just for the case of materializing something into a DataFrame, the storage problem is not so bad, after all the named tuple would not be stored in any case, but disassembled again into a struct of array storage format. But that still leaves the question whether one can write fast code that materializes things. This is essentially the question what would happen to this loop (from _filldf)

for i in source
    # Store tuple elements into DataFrame columns
end

Type inference would infer the type of i as a UnionAll type. So if there is a plan to make UnionAll types fast in the same way that small Union types are to be made fast, things would work. But again, that is not the plan, right? And again, as long as these UnionAll types are inferred with a generic where T1 clause without any restrictions on T, I don't see how this could be done.

Now, if type inference was able to generally infer these UnionAll types as say _NT_{T1,T2} where T1<:Union{Int,Null} where T2<:Union{Int,Null} and then those kinds of UnionAll types were to get the same performance treatment that is planned for small Union types, things might work. But that would require two (major?) additional changes to julia base, namely for inference to be able to do this and then for code generation to make this case fast.

@shashi
Copy link

shashi commented May 30, 2017

Note that this is not an issue with the current approach taken in Query.jl where I use the DataValue type from the DataValues.jl package to represent missing values.

What does it do differently?

@davidanthoff
Copy link
Author

f(a::Nullable{Int}) = Pair(a,a)

always returns a Pair{Nullable{Int},Nullable{Int}.

f(a::Union{Int,Null} = Pair(a,a)

will return either a Pair{Int,Int} or a Pair{Null,Null}, but not Pair{Union{Int,Null},Union{Int,Null}}.

DataValue behaves like Nullable in this example, and that kind of behavior is what I need in Query.jl.

@quinnj
Copy link
Member

quinnj commented May 30, 2017

Unless you define f to be

f(a::Union{Int, Null}) = Pair{Union{Int, Null}, Union{Int, Null}}(a, a)

which has already been pointed out.

@davidanthoff
Copy link
Author

As I wrote before, that doesn't work with the design for Query. I can't ask users to make these type annotations in every query, and the macro doesn't know anything about types, so it can't automatically rewrite this either. There is no other stage where one could rewrite this, so I don't see how this suggestion can be made to work.

@shashi
Copy link

shashi commented Jun 1, 2017

that kind of behavior is what I need in Query.jl.

Why is this? I have some guesses but I can't tell for sure.

@nalimilan
Copy link
Member

Will NamedTuple in Base (JuliaLang/julia#22194) help with this issue?

@davidanthoff
Copy link
Author

@shashi Query.jl is essentially a Julia implementation of LINQ, so for a general overview of the design any LINQ description should work.

Specifically, in many cases a query will often end up being translated into a chain of iterations, and a bunch of those chained iterators are map-like, i.e.they apply a user provided anonymous function to each element in the iteration (this is true for @select statements, but also for e.g. @group statements). When you collect a query, there is going to be a loop that iterates over these chained iterators. So it is really important that the calls to next of these chained iterators are fast. But if next returns a UnionAll type I don't see how the planned optimizations for small Unions would apply. Given that tables are represented as iterators of named tuples in Query, that would essentially be the most common situation.

Are you coming to juliacon this year? I'll dive into some detail of this in my talk.

@davidanthoff
Copy link
Author

@nalimilan I don't think that will change anything. I'm really looking forward to NamedTuple in base, it will enable a whole bunch of other cool things in Query.

@shashi
Copy link

shashi commented Jun 4, 2017

But if next returns a UnionAll type I don't see how the planned optimizations for small Unions would apply.

Do you mean Union{@NT(x){T}, @NT(x){Null}) when you say UnionAll (I'm not sure if that Union is a UnionAll)?

maybe @vtjnash knows if that would work. I think this sort of thing should be taken care of the compiler either way...

I'll dive into some detail of this in my talk.

I'll make sure to attend!

@vtjnash
Copy link

vtjnash commented Jun 5, 2017

always returns a Pair{Nullable{Int},Nullable{Int}

The basic problem here is that this example seems to be too naive: it explicitly asserted that the inputs had to be Nullable{Int}, even though nothing in the program input actually implied (or could supply) that. Other valid expected inputs should include Nullable{Any} and Nullable{Union{}} (non-exhaustive).

Thus to be pedantic and accurate about what the language / compiler can do, the result of this function should really be represented as Union{Pair{Nullable{Int}, Nullable{Int}}, Pair{Nullable{Any}, Nullable{Any}}, Pair{Nullable{Union{}}, Nullable{Union{}}}, ...}. Which after simplifying is the abstract type: Pair{Nullable{T}, Nullable{T}} where T (aka, UnionAll(T, Nullable{T}), which is where the name comes from 🙂)

With the alternate design, the possible outputs are simply Union{Pair{Int, Int}, Pair{Null, Null}} and the set is bounded to these two options. That makes it much easier for the compiler to automatically optimize this code based upon it's actual usage, rather than the library authors type-assertions.

Aside: I'm not sure the latter is necessarily meaningful, so I would expect the intended result is instead Union{Pair{Int, Int}, Null} (e.g. isnull(x) ? x : f(x))

Making the Union representation faster and more ergonomic than the existing Nullable doesn't preclude one from using it in a Nullable-like fashion – this in fact is essentially what I would expect. The goal is just that implementation can become much simpler as the compiler can do the optimization work:

struct Maybe{T} # née Nullable{T}
  value::Union{T, Null}
end

@davidanthoff
Copy link
Author

Do you mean Union{@NT(x){T}, @NT(x){Null}) when you say UnionAll (I'm not sure if that Union is a UnionAll)?

@shashi I was a bit sloppy there. If the function creates a named tuple with just one field things get inferred as Union{@NT(a::T),@NT(a::Null)}. With two fields as Union{@NT(a::T,b::T),@NT(a::T,b::Null),@NT(a::Null,b::T),@NT(a::Null,b::Null)}. So these unions grow exponentially in the number of fields. To still deal with this, type inference seems to switch over to presenting them as @NT(a::T1,b::T2,c::T3) where T1,T2,T3, i.e. a UnionAll.

The basic problem here is that this example seems to be too naive: it explicitly asserted that the inputs had to be Nullable{Int}, even though nothing in the program input actually implied (or could supply) that. Other valid expected inputs should include Nullable{Any} and Nullable{Union{}} (non-exhaustive).

@vtjnash Well, for Query.jl I can guarantee that the inputs are always properly typed as say Nullable{Int}. Here is a less contrived example. Say we have a source DataFrame with columns a, b and c. a is a DataVector{Int}, b is a DataVector{String} and c is a DataVector{Float64}. Then assume we have the following query:

@from i in df begin
@select {x=i.a, y=i.b, z=i.c}
@collect DataFrame
end

This will construct a chain of two iterators. The first iterator returns a named tuple for each row in the source DataFrame. The eltype of that iterator is @NT(a::Nullable{Int}, b::Nullable{String}, c::Nullable{Float64}). The second iterator is essentially a map like this map(i->@NT(x=i.a,y=i.b,z=i.c),first_iterator). After the @NT macro did its magic, this looks like this: map(i->_NT_x_y_z(i.a, i.b, i.c), first_iterator). _NT_x_y_z is a struct with three fields x, y and z.

The next function for this second iterator (the one created by map) is guaranteed to always return an element of type _NT_x_y_z{Nullable{Int}, Nullable{String}, Nullable{Float64}}, which is great because the loop that collects things in the end and iterates over these chained iterators is type stable because the call to next of the second iterator is type stable.

So how would this look with Union{T,Null}? The eltype of the first iterator could well be @NT(a::Union{Int,Null}, b::Union{String,Null}, c::Union{Float64,Null}). But the next function of the second iterator, i.e. the map, would be inferred as returning something like _NT_x_y_z{T1, T2, T3} where T1,T2,T3. A collect function that called this next function in a loop would not be fast because UnionAlls won't get the optimizations that you are planning for Union types, right?

With the alternate design, the possible outputs are simply Union{Pair{Int, Int}, Pair{Null, Null}} and the set is bounded to these two options. That makes it much easier for the compiler to automatically optimize this code based upon it's actual usage, rather than the library authors type-assertions.

Aside: I'm not sure the latter is necessarily meaningful, so I would expect the intended result is instead Union{Pair{Int, Int}, Null} (e.g. isnull(x) ? x : f(x))

I don't understand that, but maybe my exposition of the issue was just not very clear. Hopefully the explanation in this post is more comprehensive and closer to the problem I have in Query.jl with Union{T,Null}.

@davidanthoff
Copy link
Author

I just reread @vtjnash's comment and somehow I had missed/not really noticed the last paragraph:

Making the Union representation faster and more ergonomic than the existing Nullable doesn't preclude one from using it in a Nullable-like fashion – this in fact is essentially what I would expect. The goal is just that implementation can become much simpler as the compiler can do the optimization work:

struct Maybe{T} # née Nullable{T}
    value::Union{T, Null}
end

That kind of design of course would solve the issue for Query.jl, essentially this amounts to sticking with a container based approach, but swapping out the internals of how that container is structured. If there are performance benefits of that over the old Nullable approach it would of course be excellent. From Query.jl's point of view this would probably be a very small change, i.e. I would just have to change the DataValue implementation over to this.

In general, what do folks think about the problem in this issue at this point? I've played with various other attempts to get around this in Query.jl over the last weeks, and I just don't see a solution that works. I did see that various packages are going full steam ahead in switching over to Nulls.jl (DataStreams, CategoricalArrays, DataTables?), but unless we find a solution to this I fear this will just lead to the next fragmentation of the data ecosystem, i.e. Query.jl and friends won't be able to follow suite and will use DataValues.jl, and then the other half of the package world will use Nulls.jl... Is that the plan right now?

@nalimilan
Copy link
Member

@vtjnash This seems relevant for JuliaLang/julia#22682.

@JeffBezanson
Copy link

I'm a bit stuck here on not understanding what Foo and Bar are in e.g. #6 (comment) . Are they user-defined types, or internal types in Query.jl? What do they mean?

One possibility is to use something like type promotion, where Foo{T} and Foo{Null} can be promoted to Foo{Union{T,Null}}.

Another possibility is to use some interface that tells you the declared type of a field. For example

i -> Foo{fieldtype(i, :a), fieldtype(i, :b)}(i.a, i.b)

fieldtype happens to already exist, but this could be a new overloadable function instead, with default definition

fieldtype(x, f) = typeof(getfield(x, f))

@davidanthoff
Copy link
Author

Here is a simpler setup that demonstrates the issue, using the named tuple syntax from the PR in base:

using Nulls

A = [1,2,null,4]
B = [1.,2.,3.,null]

result = map((i,j)->(a=i,b=j),A,B)

The question is a) can result have an efficient and compact memory layout and b) how is map implemented?

If I understand co-variance properly, then the idea that came up during juliacon (make named tuples co-variant) would actually imply "no" for question a), right? In that case the element type of the result array would be NamedTuple{(:a,:b),Tuple{Union{Int,Null},Union{Float64,Null}}}, which is not a concrete type, which would imply that the storage for results would not be nice and compact. Plus, I take it that covariant named tuples couldn't participate in the union-splitting optimization, so that whole co-variance idea probably is not a good one, right?

If named tuples are not co-variant, then NamedTuple{(:a,:b),Tuple{Union{Int,Null},Union{Float64,Null}}} would actually be a concrete type, and I assume arrays of it could have a nice compact memory layout, right? So in that case the issue is how does map decide what the element type of the result should be.

One option is @JeffBezanson's suggestion to change the anonymous function so that it always returns instances of NamedTuple{(:a,:b),Tuple{Union{Int,Null},Union{Float64,Null}}}. This is similar to the suggestions that @quinnj made previously above. The problem with that is that you need to add a lot of syntax in the anonymous function, and that users of Query.jl would have to add that syntax, which in my mind is a non-starter. This would be equivalent to SQL requiring you to specify the type of every column in the result set in every SELECT statement.

The promotion idea is interesting, but I can't think of a way of using that which wouldn't tie things even more to inference, whereas I'm trying to actually move away from the heavy reliance on inference. But on the flip-side, I haven't thought about this option long (well, since yesterday), so maybe there is some way to make this work. Here is the issue: if I was trying to implement a map that does not rely on inference, I would essentially look at the type of the first result element, allocate an array of that type and then add elements, until I encounter an element of a different type. At that point I could use some kind of promotion mechanism to find a more generic element type, copy all the results I have accumulated so far into a new array of that new type, and continue. But in any typical tabular data scenario, this would imply a lot of reallocation, as far as I can tell. My example here has two columns, but as soon as we get more columns, this reallocation can happen many more times. I'm also not sure how this would work out with things like nested named tuples, e.g. say (i.a, c=(i.b, i.d)).

It is also just worth pointing out again how much easier everything is for the Query.jl situation with DataValue{T}. All of these problems essentially just go away, the information of whether one can or can not have missingness just flows smoothly through the whole query pipeline via the element types. And there is no real downside: the whole DataValue lifting story really seems to work well in Query.jl. This design has been in place for almost a year now, and I think over the whole period I got three issues/questions about lifting/missing data etc., and all three of them just required adding one more whitelisted method for something, and then the issue was essentially solved.

@JeffBezanson
Copy link

Thanks, that's a useful example.

look at the type of the first result element, allocate an array of that type and then add elements, until I encounter an element of a different type. At that point I could use some kind of promotion mechanism to find a more generic element type, copy all the results I have accumulated so far into a new array of that new type, and continue

As you probably know, that's exactly how map works. We might be able to improve the situation with nulls by adding a bit more to this logic. For instance, we could use the fact that the input array is nullable:

function map(f, a)
    first = f(a[1])
    T = typeof(first)
    if Null <: eltype(a)
        T = Union{T, Null}
    end
    output = Vector{T}(...)
    ...

That would help reduce the number of re-allocations, though it wouldn't eliminate them. However, if we (improperly) exploit inference (the way Nullable and DataValue do) we could find out the (nullable) result type up front and avoid all re-allocations.

@quinnj
Copy link
Member

quinnj commented Sep 28, 2017

@JeffBezanson, what exactly are you referring to when you say "if we (improperly) exploit inference"; are you talking about the liberal use of Base._return_type to determine the eltype for DataValue?

@vtjnash
Copy link

vtjnash commented Sep 28, 2017

I assume arrays of it could have a nice compact memory layout

No, Arrays of NamedTuples will not have a compact memory layout (they're rotated wrong in memory for that, although by contrast that means they would be laid out in cache order). For that, you would need a RowView type (a la DataFrame's current design).

@nalimilan
Copy link
Member

Indeed, the fact that map can reallocate in some cases to avoid relying on inference does not sound like a problem. What matter is that inference allows avoiding the slow path in standard cases, i.e. when the function is type stable. It seems reasonable to have map to start assuming that the eltype of the result is Union{T, Null} (rather than just T or Null) so that no reallocation is needed. Due to the memory layout of Array{Union{T, Null}}, it is very efficient to call convert(Array{T}, x::Array{Union{T, Null}}) at the end if it turns out there were no null values in the actual data.

The problem is that currently eltype(i for i in 1:3) === Any, so that AFAICT map (via collect(::Generator)) always starts allocating an array with an eltype matching the first element, and reallocates if an element with a different type is found. Would it be possible to improve eltype on generators?

An issue with the proposal in @JeffBezanson's last comment is that if Null does not live in Base, no special-casing would be possible. Anyway a more general solution would be much better.

@andyferris
Copy link
Member

It seems reasonable to have map to start assuming that the eltype of the result is Union{T, Null}

Instinctively, I think this might be a reasonable choice for data packages, but I'm wondering if the default versions methods in Base should give much consideration to Null.

Would it be possible to improve eltype on generators?

This seems like a good idea to me, For collect(::Generator), a hybrid approach that picks up on Union coming from inference might be optimal (it would observably behave as now but make use of the layout of Array{<:Union} and partial information from inference, to be fast in "most" cases).

@JeffBezanson
Copy link

Would it be possible to improve eltype on generators?

eltype on generators is exactly the same problem as getting the return type of any function. But to the extent we use return_type, yes, we could use that as an additional hint that the result might need to be nullable.

@nalimilan
Copy link
Member

I've tried implementing this strategy at JuliaLang/julia#23940. Comments welcome.

@davidanthoff
Copy link
Author

I assume arrays of it could have a nice compact memory layout

No, Arrays of NamedTuples will not have a compact memory layout (they're rotated wrong in memory for that, although by contrast that means they would be laid out in cache order). For that, you would need a RowView type (a la DataFrame's current design).

I have three memory layouts in mind: a) you can end up with a situation where you have an array of pointers to the actual values (I guess this is called boxed values?), b) you can have a dense memory layout of structs, i.e. no pointers, just the content of one element after another in a consecutive memory region, and c) you can store an array of structs as a struct of vectors. My understanding was that an Vector{NamedTuple} by default would end up with b) if all the fields are bitstypes? Main thing to me seems that it doesn't end up as a).

@davidanthoff
Copy link
Author

@JeffBezanson I guess I also left out two complications in my simple example... Always tricky to find the right balance of a simple example and one that does justice to the complexity of the real problem. Here they are:

The loops that materialize queries are largely not part of Query itself, they really should be part of the containers that can materialize a query. So that can essentially be any container that accepts an iterator. Things like query |> Set or just any query |> MyCustomCollectionType just work as long as these types have a constructor that accepts an iterator. So all of these collections now would need to have the kind of sophisticated code that can handle these streams of slightly changing types, not just one central implementation, like the implementation in map. Essentially we would go from a situation where every custom collection type would need to go through extra hoops to be able to consume a query, whereas right now they largely should just work without even knowing about Query.

The second complication is that my example was misleading in that it started out with a map over two vectors. That actually never happens in Query. Rather, every query operator is actually operating on just one iterator, and with tables that iterator will return named tuple elements. So a better example would have been this (using a slightly modified named tuple syntax from base):

using Nulls

source = [
    (a::Union{Int,Null}=1, b::Union{Float64,Null}=4.),
    (a::Union{Int,Null}=2, b::Union{Float64,Null}=null),
    (a::Union{Int,Null}=null, b::Union{Float64,Null}=3.),
    (a::Union{Int,Null}=null, b::Union{Float64,Null}=null)
]

result = map(i->(a=i.a,b=i.b),source)

With this, the anonymous function that you pass to map is type instable, no matter what clever trick we would be using in the map implementation: the anonymous function always gets an argument of type NamedTuple{a::Union{Int,Null},b::Union{Float64,Null}} (using my suggestion for a named tuple type syntax), but it would return a different type for each element.

@davidanthoff
Copy link
Author

I think at the end of the day there are a bunch of assumptions that drove folks towards the Union{T,Null} idea that simply don't hold for more complicated use-cases like Query:

  1. I think a lot of people thought that it would be desirable that if you have a situation where a value could be missing, but it isn't missing, to just use the type T for that. But in a situation like Query things are simply much, much easier if the type of that value indicates that the value could have been missing. In general you probably want the type to behave as if it was a T, but just using T straight out makes things more complicated because you lose all indication that the value could have been missing.
  2. I think the second assumption was that one would never need a typed missing value. I completely buy that this makes a lot of things simpler, but for Query that assumption doesn't hold: things work much more smoothly if one can also have a typed null (and with DataValue one has both, a typed and a untyped missing value).
  3. I think the final assumption was that one couldn't provide a good user experience with container types like Nullable, and that was probably mostly driven by the pretty vocal user feedback on how unusable DataTable was. I think that one really hasn't happened with Query and DataValue. Now, it is tricky to get reliable data on this, so take the following with a grain of salt: Query at this point has a little more than twice the number of stars than DataTables on github, so I take that as some indication that its user base is larger than the DataTable user base was. The DataTables user feedback was pretty clear and widespread: folks didn't think that approach worked. I have literally not received a single piece of feedback on Query that the DataValue approach is too cumbersome over the last year (I did get feedback that folks don't like the comparison behavior, but that is a separate issue). In practice it seems that DataValue{T} behaves closely enough to a T that most folks don't ever have to unpack things, certainly that has been my experience in using it.

The following is not a rigorous argument, but my gut feeling on this still is the following: LINQ was designed by a group of folks that had been thinking for a long time about composable software systems (Erik Meijer, Brian Beckman etc.). They essentially based everything around ideas from functional programming languages, and in particular monads. What they ended up with is a design that is truly a marvel at composability: you have all these really small, self-contained things (the IEnumerable monad, the Nullable monad etc.), that know nothing about each other and they all just happen to magically work wonderfully together to form larger, more complicated systems. I think I managed to bring most of that over to Query: the design right now for example is super, super close to one where the query side of things doesn't even know that things like missing values can exist, but it just happens to compose really well with DataValue, which also doesn't know anything about Query. That whole model seems to not carry over to Union{T,Null}. I'm completely convinced that Union{T,Null} is a great choice for a large number of simple use-cases, but it just doesn't seem a good fit for something like Query.

I'd be really interested to hear @andyferris thoughts on this, in particular as it relates to the https://github.com/JuliaData/SplitApplyCombine.jl package. I would expect that you'll run into exactly the same issues, given how similar the designs of Query.jl and your package are?

@andyferris
Copy link
Member

andyferris commented Oct 25, 2017

For SplitApplyCombine I'm flat out asserting that the null data problem is an orthogonal design issue (*). The package deals with iterable/indexable containers and transformations thereof, and the element/scalar-level functions are provided by the user. I agree 100% with the premise that these things must compose seamlessly, and I think several others have already expressed a similar opinion.

I can appreciate that having typed null values (indicating e.g. a value of type Bool is missing here, etc) is useful, and also having a simple null that encompasses any missingness is simple and convenient (and also that using a container like Nullable is "safe"). Julia has a natural tension between dynamic and static typing (basically, because of covariance and invariance) and @davidanthoff makes a valid point that it's not always clear that the "let's just pretend the output eltype is covariant" implementations of things like map are exactly what's needed in all cases. In a sense, such methods are just sweeping the fact that these type parameters are actually invariant under the carpet, and there will always be a corner case or some user-defined container which exposes the invariance, and when that happens the experience may be jarring. Continuing that tangent - I feel the key is not to dispatch or make decisions based on such a type parameter, and let the elements themselves do most of the talking (so that whether the result of eltype is precise or not is just an optimization, everywhere). Please don't take away that I don't like covariance - I'd just rather it were implemented as a language-level semantic (tricky to get right for refs and arrays, but I believe possible). But for now, I tend to go with the rule that "it doesn't matter whether user-defined containers choose to invoke inference for map, etc or to use a covariant-based approach like Base, so long as functions of containers only ever use the type parameters and eltype as a more-or-less-unobservable optimization".

I think my points above do add some constraints on "valid" ways of implementing missing values, but there is still some design space left to explore. Addressing the OP directly, we primarily circumvent the tension between covariance and invariance for containers because they are instantiated with similar or whatever. One yucky approach is to use a factory (I think that's the word) to build a Foo{T} (e.g. speculatively something like similar_type(foo, promote_op(f, x))(f(x))). And while Union{Foo{T}, Foo{Null}} and Foo{Union{T, Null}} may be more-or-less interchangeable, this isn't true when you have a large number of type parameters, a long tuple, or whatever. My understanding is that mapping an array via a function that returns Union{T,Null} could end up with inference considering Union{Array{Null}, Array{T}, Array{Union{T,Null}}}, while Array{Union{T, Null}} is probably preferable in 99.9% of cases. Do we have a non-hacky plan for dealing with that?

Regarding LINQ, I wonder how much of the design is influenced by statically typed languages that use vtables to do a bit of dynamic stuff, and what parts should be different in a fully dynamic language. (That's an honest question).

TLDR; Sorry, I'm not sure I provided any actual answers here! I think with some effort we could make any one of several approaches work OK (but like I hinted doing it really well might have wider implications for things as far ranging as how map is done, how to make constructing structs null-friendly... or even whether to make the entire language covariant...).

(*) I did play with what the SplitApplyCombine functions do to Nullable as a container with 0 or 1 elements, and I note that perhaps some nice functions and syntax could make the Nullable situation more palatable (like Swift) but it's my suspicion that many "data science" type people would be happy to think of a missing value as a scalar not a container.

@JeffBezanson
Copy link

I do think Nullable or DataValue is a viable option, but it seems to have the opposite set of problems that Union{T,Null} has.

I agree the annoyance of unwrapping might not be a showstopper. At one point I conceded that some small set of operations (+, -, etc.) could be implemented on Nullable, which we might as well allow since the same thing would be done for Null.

But then there are problems:

  • The type parameter in Nullable and DataValue leads to much more reliance on type inference than the Union approach.
  • DataValueArray and NullableArray are unfortunate. Using a different element type shouldn't require using a different container type.

So all of these collections now would need to have the kind of sophisticated code that can handle these streams of slightly changing types, not just one central implementation

I agree this is an issue. We should find a way to abstract this pattern if possible.

@zenna
Copy link

zenna commented Nov 4, 2017

Having not worked much with Null or missing values in Julia per se, nor read the julep, perhaps the following should be taken with a grain of salt. That said:

I think the issue here is a mismatch of terms. A type-error, or kind-error if you will.

This problem pops up in various forms: in Interval packages, what should
[0,1] == [0,1] return? In Symbolic DSLs such as TensorFlow.jl, where should Tensor{Float64} fit in the type hierarchy? I have seen various similar issues in packages I've worked on, such as AbstractDomains.jl https://github.com/zenna/AbstractDomains.jl

Each package makes its choices, but one way or another it's trying to fit a square peg into a round hole.

A missing value and a value are not the sme kind of thing. A concrete value represents an element of a set. A missing value, due to some form of uncertainty, represents a set of elements. A missing value is more like a type than a value. Sticking them together in a Union will inevitably lead to tradeoffs and contradictions.

Similarly, an Interval{Float64} is not a Float64, it is an abstraction of, i.e. a set of Float64s. It is more like the type Float64. With a more expressive type system it could literally be a type:

x :: {ν::Float | a < ν < b}

I have no doubt these discussions will eventually lead to a solution or number of solutions with best tradeoffs. In that sense I am echoing @andyferris original comment, and also suggesting what I think may be the root cause. I think a deeper resolution to this problem would essentially require Julia to make computing with concrete representations of sets of values (abstract domains) a first class principle. Probably this isn't a common enough problem to warrant that. But it'd be neat.

@zenna
Copy link

zenna commented Nov 4, 2017

One more thing:

What Nulls.jl is doing now with 1+null = null is an abstract interpretation with a very coarse abstract domain. With Bool you are able to retain a little more precision in some cases.

@davidanthoff
Copy link
Author

@andyferris

For SplitApplyCombine I'm flat out asserting that the null data problem is an orthogonal design issue.

Agreed a 100%, that is the philosophy I followed with Query from the beginning. There was one place where I had messed that up, but I was just able to remove that today, so now QueryOperators.jl doesn't depend on DataValues.jl anymore, it just happens to work well together with it.

I'd just be interested whether you share my view that you will run into the same problems that I have in Query.jl when you try to use a named tuple with fields of type Union{T, Missing} to represent a row in your design of SplitApplyCombine.jl.

@JeffBezanson

The type parameter in Nullable and DataValue leads to much more reliance on type inference than the Union approach.

Could you elaborate a bit more, or have an example in mind, what you mean by that? I think you aren't referring to the reliance of Query.jl on type inference, but something else, right? I have a pretty solid plan at this point to get rid of the reliance of Query.jl on type inference (or at least only use it as a performance optimization) that would work well with DataValue but not with Union{T,Missing}.

DataValueArray and NullableArray are unfortunate. Using a different element type shouldn't require using a different container type.

Yeah, that is still tricky. I should say though that I think that DataValueArray is much more usable than NullableArray ever was. I think the only issue really is construction, but once you have a DataValueArray it is really very usable at this point and I'm not really aware of any issues. I think the construction issue could be solved relatively easy. Heck, if this line was used instead of the following one, I could just add a getindex(::Type{T}, vals...) where {T<:DataValue} method and call it a day. And yes, I am aware that I'm probably missing some other complications, but I haven't seen anything yet that would convince me that we couldn't sort these things out.

So all of these collections now would need to have the kind of sophisticated code that can handle these streams of slightly changing types, not just one central implementation

I agree this is an issue. We should find a way to abstract this pattern if possible.

In my mind this is an annoying problem, but I agree, it is probably not impossible to find some solution for it. At least it is doable, all these collections could after all just implement that complicated logic. Not elegant, and I'm not sure how realistic it is in practice, but it doesn't look entirely impossible to me.

The much bigger concern in my mind is the type instability of the anonymous function in something like map from a vector of named tuples that has Union{T,Missing} fields, when that anonymous function creates some new named tuple, i.e. just a normal projection. I just don't see any general workaround for that. @quinnj wrote somewhere that he has another idea, so I'm very much looking forward to that, but at this point I've thought about this for six months and I'm pretty convinced that there is just a serious mismatch between the Union{T,Missing} idea and the design in Query.jl (and possibly SplitApplyCombine.jl). And let me just stress that I still really hope that I'm wrong on this one because the last thing we need is a data ecosystem that uses two different ways of representing missing values. But as far as I can tell that is where we are heading right now...

@JeffBezanson
Copy link

once you have a DataValueArray it is really very usable at this point and I'm not really aware of any issues

It works pretty well, but I ran into the issue recently that similar(a::DataValueArray, Any) does not return an array that can hold any value, which prevents Base map and collect from working.

@JeffBezanson
Copy link

On the type inference issue, I'm referring to the need to determine the parameter in DataValue{T} when values are missing. How do you plan to do it?

@JeffBezanson
Copy link

@zenna I agree about the connections to domain theory. It would indeed be interesting to explore e.g. making Int + 1 do an abstract interpretation. An unknown value would just be represented by a type, Any if nothing is known about it. I assume this is what you're thinking? Do you have any thoughts on how David's example:

using Nulls

A = [1,2,null,4]
B = [1.,2.,3.,null]

result = map((i,j)->(a=i,b=j),A,B)

would play out in that world?

@davidanthoff I would seriously consider making the syntax (a = i.a,) use i's element type for a in the new named tuple.

Also, if you're consuming an iterator don't you have to worry about varying types anyway? For instance the user could give you DataValue{Int} and DataValue{Union{}} on different iterations, or just Int and String.

@davidanthoff
Copy link
Author

It works pretty well, but I ran into the issue recently that similar(a::DataValueArray, Any) does not return an array that can hold any value, which prevents Base map and collect from working.

Just for those that aren't following things over at DataValues.jl: this has been fixed now, that was just a bug.

On the type inference issue, I'm referring to the need to determine the parameter in DataValue{T} when values are missing. How do you plan to do it?

We don't have any automatic lifting in any case, so it is just something one needs to manually take care of when defining the lifted version of a function. Some of the manual definitions right now rely on type inference (that code was mostly just copied from the Nullable implementations in base), and that seems to work fine because those are all very simple functions. I'm pretty sure one could actually get rid of that, it would just be a bit more work (famous last words, I hope not). In my mind this counterfactual return type problem is really serious for any attempt to automatically lift, but if the lifted methods are coded by hand in any case, it seems to mostly work.

DataValues also has const NA=DataValue{Union{}}, so one can always also just return that. I haven't fully explored how that interacts with typed missing values. In some cases one might just end up with a return type of Union{T,DataValue{Union{}}, which seems somewhat similar to the current Union{T,Missing} story. But again, I haven't really thought this through. It would of course be very nice if one could actually have something that supports both typed and untyped missing values (a la Union{T,Missing}) simultaneously. But that might be wishful thinking...

I would seriously consider making the syntax (a = i.a,) use i's element type for a in the new named tuple.

Would that also work for say ( a = log(i.a),)? And for pairs a la i.a => log(i.b)? And essentially for any parameterized constructor?

Also, if you're consuming an iterator don't you have to worry about varying types anyway? For instance the user could give you DataValue{Int} and DataValue{Union{}} on different iterations, or just Int` and String``

Yes, I think Query.jl needs to not error in these situations. But I think if someone ends up in that situation, it is ok if things are slow. But the more common use cases really need to be fast, and a standard projection call that selects a couple of columns from a table is certainly a super common case, so that needs to be very performant. In some sense the usual rules apply here: if you write a type-stable function, you'll get speed. But it would be nice if the most typical use case actually is just type stable.

@JeffBezanson
Copy link

But it would be nice if the most typical use case actually is just type stable.

Well, we have been adding code to the compiler to make Union types perform well despite the lack of type stability. There may be more to do, but getting these cases reasonably fast should be possible. That plus hacks like the one I mentioned in #6 (comment) might be enough. I know implementing the map pattern is a bit annoying, but it seems better to me than dealing with counterfactual types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants