Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spreadmissings #122

Open
wants to merge 27 commits into
base: master
Choose a base branch
from
Open

Conversation

pdeffebach
Copy link
Contributor

Adds behavior described in #121

Implementation is easy enough to follow

function (f::SpreadMissings{F})(xs...) where {F}
    @assert all(x -> x isa AbstractVector, xs)

    x1 = first(xs)
    @assert all(x -> length(x) == length(x1), xs)

    if any(x -> x isa AbstractVector{>:Missing}, xs)
        nonmissinginds = collect(eachindex(skipmissings(xs...)))
        # TODO: narrow the type of the `view` to exclude `Missing`
        res = f.f((view(x, nonmissinginds) for x in xs)...)

        out = Vector{Union{Missing, eltype(res)}}(missing, length(x1))
        out[nonmissinginds] .= res

        return out
    else
        return f.f(xs...)
    end
end

@pdeffebach
Copy link
Contributor Author

I just did some re-factoring. New implementation is

function (f::SpreadMissings{F})(xs...) where {F}
    if any(x -> x isa AbstractVector{>:Missing}, xs)
        vecs = Base.filter(x -> x isa AbstractVector, xs)
        s = skipmissings(vecs...)
        vecs_counter = 1
        newargs = ntuple(length(xs)) do i
            if xs[i] isa AbstractVector
                t = s[vecs_counter]
                vecs_counter += 1
            else
                t = xs[i]
            end
            t
        end

        res = f.f(newargs...)

        if res isa AbstractVector
            out = Vector{Union{Missing, eltype(res)}}(missing, length(first(vecs)))
            out[collect(eachindex(first(s)))] .= res
        else
            out = Vector{Union{Missing, typeof(res)}}(missing, length(first(vecs)))
            out[collect(eachindex(first(s)))] .= Ref(res)
        end

        return out
    else
        return f.f(xs...)
    end
end

Miraculously, this infers function infers.

@pdeffebach
Copy link
Contributor Author

Okay after playing around with this here's what I think an API should be. Consider the call

f(v1::Vector, v2::Vector, s::Scalar)

Recall that skipmissings(x, y) returns a tuple of iterators sx, sy that iterate through x and y where both indices are not missing. We need

  1. skipfun(f) such that skipfun(f)(v1, v2, s) is
sv1, sv2 = skipmissings(v1, v2)
f(sv1, sv2, x)

It automatically finds the vector values and calls skipmissings on those. Maybe we can dig into the broadcasting infrastructure to make this work more intelligently?

  1. spreadmissings, defined in this PR. spreadmisings(f)(v1, v2, s) is equivalent to
sv1, sv2 = skipmissings(v1, v2)
res = f(sv1, sv2, x)
if res isa AbstractVector
    out = Vector{Union{Missing, eltype(res)}}(missing, length(v1))
    out[collect(eachindex(sv1))] .= res
else
    out = Vector{Union{Missing, typeof(res)}}(missing, length(v1))
    out[collect(eachindex(sv1))] .= Ref(res)
end
out

The problem with spreadmissings is that it is restricted to Vectors and it's hard to figure out all the types that really need to be "spread". Consider, for example

function maketable(v1, v2, s)
    # stuff which errors on missing values
    (n1 = v1 .+ v2 .* s, n2 = v1 .- v2 .* s)
end

If we use spreadmissings logic, we know that this should be a named tuple of vectors each of length v1. Unfortunately we can't just special case a named tuple of vectors to be the kind of thing that "spreads". I mean, maybe we can, but it's easy to see how that kind of logic can get more and more complicated.

@pdeffebach
Copy link
Contributor Author

We also need

  1. passmissingfalse which returns false instead of missing, for use in filtering and && etc.
  2. spreadmissingfalse which fills in false instead of missing in the missing indices.

@matthieugomez
Copy link

matthieugomez commented Nov 7, 2020

Maybe it would be more elegant to restrict it to functions for which all arguments are AbstractVector, i.e. @assert all(x -> x isa AbstractVector, xs).
AFAIU, this does not hurt usability in DataFrames since users need to create anonymous function with columns anyway. Same thing for DataFramesMeta.

@pdeffebach
Copy link
Contributor Author

Thats what I had initially. Then I thought about broadcasting and how it automatically figures out what should be broadcasted. So I updated this PR to emulate that behavior a little.

@matthieugomez
Copy link

matthieugomez commented Nov 7, 2020

Another thing is that spreadmissing always returns a Vector, but some functions may return AbstractVectors that are not Vectors (e.g. . cut returns a CategoricalVector)

So it may be better to use in spreadmissing a function like extend_missings which extends an AbstractVector to accept missings, i.e.

function extend_missings(v::AbstractVector{T}, esample::BitVector)
   out = Vector{Union{Missing, T}(missing, length(esample))
   out[esample] = v
   return out
end

This would allow external packages to define specialized methods of extend_missings for their particular type.

@pdeffebach
Copy link
Contributor Author

That's a good point. Maybe I will try to formalize all of this behavior into a separate package first and see how it goes.

@matthieugomez
Copy link

matthieugomez commented Nov 7, 2020

It'd be great to have such a function! It's just hard to find the right abstraction, so it's good to experiment.

@nalimilan
Copy link
Member

Requiring packages to extend extend_missings won't really work as they would all need to add a dependency on Missings, which most won't want to do. Why not just call allowmissing(res), which uses similar and should therefore return the right array type? for the case where a scalar is returned, similar(vecs[1], Union{Missing, typeof(res)}) shouldn't be too bad (at least it will return a CategroicalArray if res isa CategoricalValue).

src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
@pdeffebach
Copy link
Contributor Author

which uses similar and should therefore return the right array type? for the case where a scalar is returned, similar(vecs[1], Union{Missing, typeof(res)}) shouldn't be too bad (at least it will return a CategroicalArray if res isa CategoricalValue).

I changed it, but I don't think the implementation is quite correct. In particular, I'm worried about the scalar case. I'm not sure what the importance of the first argument in similar(x, eltype, length) is. What's the purpose of vecs[1] in similar(vec[1], Union{typeof(res), Missing})?

@pdeffebach
Copy link
Contributor Author

I think it would be totally awesome if @nalimilan or @quinnj took over this PR. It's over my head to deal with all the performance considerations and necessary benchmarking.

If possible maybe we can merge it as an undocumented feature and see if it works?

src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated
Comment on lines 244 to 246
else
return f.f(xs...; kwargs...)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about throwing an error in this case for now? It doesn't seem very useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I disagree. I want people to be able to code defensively with this tool. throwing an error would defeat the purpose a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Something that occurred to me though: should we do something special if one of the non-vector arguments is missing (like returning missing)? Should we reserve this behavior for the future in case we want spreadmissings to be a more general version of passmissing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. We definitely can't return missing, we would have to return a vector of missings, which seems very restrictive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what "restrictive" means here. That would be similar to e.g. broadcast(+, missing, [1, 2, 3]).

Whether that makes sense depends on what f does with its inputs, but I can't find examples of functions to which one would pass one or more vectors plus a missing argument and would not expect the resulting vector to be full of missings. So reserving the situation where one argument is missing could be a good idea in case we want to handle it later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think? At least throwing an error if any of the inputs is missing for now doesn't seem too problematic?

src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member

bkamins commented Jun 28, 2021

@nalimilan - please ping me when the first round of comments is resolved. Then I will review. Thank you!

src/Missings.jl Outdated Show resolved Hide resolved
src/Missings.jl Outdated Show resolved Hide resolved
@pdeffebach
Copy link
Contributor Author

Okay after the merging of @rtransform etc. in DataFramesMeta, this will be my next focus.

just to confirm, do we want this new feature to live in Missings.jl for now? Or do we want me to move all this stuff to DataFramesMeta.

@bkamins
Copy link
Member

bkamins commented Jul 25, 2021

Can you please summarize briefly the consensus how things should work? Thank you!

@CameronBieganek
Copy link

CameronBieganek commented Dec 1, 2023

(Quoting myself.)

There might be a natural way to design the API so that users can collect if they need to.

Of course, skipmissings is already fairly natural in this regard. For example, if you want to use cor, you can do this:

cor(collect.(skipmissings(x, y))...)

But I don't think you've responded to the scenario with DataFrames? A key benefit of the current approach is that you don't lose length information, the way you would with skipmissings.

Well, to be honest, I'm more interested in reduction functions. I feel like the need for a uniform missing-skipping syntax for reduction functions is more pressing than the need to handle functions like center and zscore.

But I did address the "DataFrames scenario" with non-reduction functions. I proposed either writing them by hand, like this,

transform(df, :x => (x -> x .- mean(skipmissing(x))) => :x_c)

or adding keyword arguments to zscore, etc. I also said "But perhaps there's a more elegant way to do this. I will continue to think about it."

I think trying to combine the behavior for reduction functions and non-reduction functions into one function produces a complicated, hard to understand function.

@jariji
Copy link

jariji commented Dec 1, 2023

Making cor accept SkipMissings would make

cor(collect.(skipmissings(x, y))...)

look a little nicer without the collect.

@pdeffebach
Copy link
Contributor Author

Making cor accept SkipMissings would make

cor(collect.(skipmissings(x, y))...)

Base devs rejected that proposal. see here. Looking back, I agree with the logic a bit more. Vectors are a natural way to enforce indices matching, whereas iterators will never be as clear.

@CameronBieganek
Copy link

CameronBieganek commented Dec 1, 2023

Yeah, it's debatable. But it seems like they might be open to a new method like cor(itr), where itr is an iterator of iterators. (Where the inner iterators are observations.)

@CameronBieganek
Copy link

CameronBieganek commented Dec 1, 2023

Copying over from Discourse... It's a little goofy, but we could signal to skipm that an argument needs to be collected. Something like this:

Setup:

struct NeedsArray{T}
    x::T
end

const  = NeedsArray

What it would look like:

skipm(cor)(x, y)

@CameronBieganek
Copy link

Wow, the cor(x::AbstractVector) = 1 method blows up any chance of having a cor(itr) method, where itr iterates observations. 😭

In view of that unfortunate situation, perhaps the Base devs could be convinced to allow cor(itr1, itr2).

@nalimilan
Copy link
Member

@CameronBieganek I think you should develop your use cases and/or make a detailed proposal somewhere else. You have hijacked a pull request about a well-defined function which was in the final review phase. This is not a design issue intended at experimenting with various ideas. The result is a confusing discussion which hasn't allowed me to grasp what are your main requirements.

@CameronBieganek
Copy link

Apologies for hijacking a pull request. I have concerns about the design and usability of spreadmissings, which led me to suggest alternatives.

@pdeffebach
Copy link
Contributor Author

pdeffebach commented Dec 9, 2023

@nalimilan I have fixed all the issues you have highlighted. Thank you for your feedback!

Three things to keep in mind

  1. spreadmissings currently hurts type stability.
julia> using Statistics

julia> using Statistics, Missings

julia> function foo(x)
           spreadmissings(mean)(x)
       end;

julia> @code_warntype foo([1, missing])
MethodInstance for foo(::Vector{Union{Missing, Int64}})
  from foo(x) @ Main REPL[27]:1
Arguments
  #self#::Core.Const(foo)
  x::Vector{Union{Missing, Int64}}
Body::Any
1 ─ %1 = Main.spreadmissings(Main.mean)::Core.Const(Missings.SpreadMissings{typeof(mean)}(Statistics.mean, :default))
│   %2 = (%1)(x)::Any
└──      return %2

I've determined this is not due to the fact that spread is a Symbol (as opposed to specialized structs) and not due to the fact that sometimes we return a vector without missing, in the case of no missing values in the inputs.

How much should I worry about this? Is there anything we can do? Maybe I could re-write the function so it dispatches on SpreadMIssings{F, spread}?

EDIT:

I' played around with this and it looks like creating specialized structs does fix the type stability issues. However I haven't been able to have both the Symbol keyword argument for the "spread" behavior and type stability. I have pushed an implementation that I thought might work but doesn't.

SpreadMissings(f, spread::Val{:default}) = SpreadMissings(f, SpreadDefault())
SpreadMissings(f, spread::Val{:nonmissing}) = SpreadMissings(f, SpreadNonMissing())
SpreadMissings(f, spread::Val{:none}) = SpreadMissings(f, SpreadNone())

function spreadmissings(f; spread::Symbol = :default)
    SpreadMissings(f, Val(spread))
#   throw(ArgumentError("`spread` must be one of `:default`, `:nonmissing`, or `:none`"))
end

Is there something we can do to make this work? I don't really want to make users write SpreadDefault or even define a variable spreaddefault. Those both seem kind of cumbersome.

EDIT 2: Maybe it's impossible to get type stability under any conditions. I'm not 100% sure.

  1. I have added an error if missing is a positional or keyword argument. We can relax this in the future and propagate missing but I'm a little nervous about adding that behavior right now.
  2. Should we have a spread = :all option? So that if there is a scalar output, it gets spread across the whole vector? That way it emulates skipmissing more. See my example above about @transform :m = mean(skipmissing(:x)) vs @transform @spreadmissings :m = mean(:x)

@nalimilan
Copy link
Member

Have you tried adding @inlined to spreadmissings(f; spread::Symbol = :default) (and maybe @noinline to others)? At least that approach works for levels in DataAPI. You could also try using if spread === :default; SpreadMissings(f, SpreadDefault()) and so on inside that function.

spread=:all could be useful indeed. I just hope things don't get too complex to grasp.

@jariji
Copy link

jariji commented Dec 10, 2023

I just hope things don't get too complex to grasp.

I'm well past that point. It would be easier for me to follow if each of these methods had its own function name rather than dispatching on keyword arguments.

@pdeffebach
Copy link
Contributor Author

@nalimilan this is ready for a review!

The inference failures were due to a lack of function barriers. Adding spread to the type parameter of SpreadMissings and using dispatch to control behavior rather than if...else works well. Type hints in the smaller functions also appear helpful.

I've filed an issue upstream for a lack of type inference with mean. See here. The SubArray hack does not go as far as I would like, and still produces a lot of Union{Missing, Float64} inference. But that's okay since it's a small union.

I also added spread = :all

  • For scalar outputs, spreads the output along the full length of the input vector
  • For vector outputs, throws an error. This is because the inner function knows nothing about the full length of the input vectors to the outer function. Any scenario where the output vector matches the length of the input vectors is coincidence and the indices wouldn't correspond.

It's worth emphasizing again that if none of the inputs are vectors, then behavior is entirely unaffected. So no error is thrown if the inputs are not vectors and the output is a vector.

This behavior is indeed complicated, but it's all in the docstring.

@pdeffebach
Copy link
Contributor Author

@nalimilan Bumping this. Remember that we ultimately want to try this out in DataFramesMeta.jl first. I will port the PR over there eventually but am editing it here.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. Here are more comments.

src/Missings.jl Outdated Show resolved Hide resolved
Comment on lines +255 to +259
function spread_missing(
res::AbstractVector{T},
vecs::Tuple,
nonmissinginds::AbstractVector{<:Integer},
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function spread_missing(
res::AbstractVector{T},
vecs::Tuple,
nonmissinginds::AbstractVector{<:Integer},
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T}
function spread_missing(res::AbstractVector{T},
vecs::Tuple,
nonmissinginds::AbstractVector{<:Integer},
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T}

Same below (for consistency with current style of the package).

nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T}

if length(res) != length(nonmissinginds)
s = "When spreading a vector result with `spread=$(S)`, " *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S doesn't exist here. Apparently also needs testing.

maybespread_missing(f, newargs, new_kwargs, vecs, nonmissinginds, nonmissingmask)
# There is at least one vector, but none of the vectors can contain missing
elseif any(x -> x isa AbstractVector, xs)
vecs = Base.filter(x -> x isa AbstractVector, xs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still applies.

Comment on lines +516 to +518
julia> xmiss = [10, 20, 30, missing];

julia> ymiss = [missing, 500, 400, 300];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reuse the previous objects?

Comment on lines +221 to +223
if !(spread isa AbstractSpread)
throw(ArgumentError("spread must be either SpreadDefault(), SpreadNonMissing(), or SpreadNone()"))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot happen.

Suggested change
if !(spread isa AbstractSpread)
throw(ArgumentError("spread must be either SpreadDefault(), SpreadNonMissing(), or SpreadNone()"))
end

2. `fillmean_skip` returns a vector which does not allow for `missing`, while
`spreadmissings(fillmean)` does.

Use the keyword `spread = :all` to emulate the `skipmissing` behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show an example?

@@ -0,0 +1,227 @@
using Test, SparseArrays, Missings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this file?

function right_vec(args...; kwargs...)
kwargs_vals = values(values(kwargs))
xs = tuple(args..., kwargs_vals...)
vecs = Base.filter(x -> x isa AbstractVector, xs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vecs = Base.filter(x -> x isa AbstractVector, xs)
vecs = filter(x -> x isa AbstractVector, xs)

t = spreadmissings(categorical)(x)
@test t ≈ categorical([1, 2, 3, missing])
@test typeof(t) == typeof(categorical([1, 2, 3, missing]))
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests that inference works?

Co-authored-by: Milan Bouchet-Valat <[email protected]>
P = typeof(a) # Type of parent array
I = Tuple{typeof(nonmissinginds)} # Type of the non-missing indices
L = Base.IndexStyle(a) === IndexLinear # If the type supports fast linear indexing
SubArray{T, N, P, I, L}(a, (nonmissinginds,), 0, 1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for SubArray makes no mention of contructors, and in fact says "Construct SubArrays using the view function." So this direct usage of a SubArray constructor is usage of undocumented Base internals that could change in any minor or patch release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's probably true.

I think it will be worthwhile to make our own array type that does this better... hopefully it's not too hard.

Copy link

@aplavin aplavin Feb 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be relevant – my SentinelViews.jl package does kinda the opposite. It defines a view type that propagates the sentinel value from indices to the results, sentinelview([10, 20, 30], [1, nothing, 3]) == [10, nothing, 30].
It was reasonably straightforward to implement, as should be the "typesubtract view" needed here. With the obvious downside that ::SubArray function dispatches won't be triggered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants