-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spreadmissings
#122
base: master
Are you sure you want to change the base?
Add spreadmissings
#122
Conversation
I just did some re-factoring. New implementation is
Miraculously, this infers function infers. |
Okay after playing around with this here's what I think an API should be. Consider the call
Recall that
It automatically finds the vector values and calls
The problem with
If we use |
We also need
|
Maybe it would be more elegant to restrict it to functions for which all arguments are AbstractVector, i.e. |
Thats what I had initially. Then I thought about broadcasting and how it automatically figures out what should be broadcasted. So I updated this PR to emulate that behavior a little. |
Another thing is that So it may be better to use in function extend_missings(v::AbstractVector{T}, esample::BitVector)
out = Vector{Union{Missing, T}(missing, length(esample))
out[esample] = v
return out
end This would allow external packages to define specialized methods of |
That's a good point. Maybe I will try to formalize all of this behavior into a separate package first and see how it goes. |
It'd be great to have such a function! It's just hard to find the right abstraction, so it's good to experiment. |
Requiring packages to extend |
I changed it, but I don't think the implementation is quite correct. In particular, I'm worried about the scalar case. I'm not sure what the importance of the first argument in |
I think it would be totally awesome if @nalimilan or @quinnj took over this PR. It's over my head to deal with all the performance considerations and necessary benchmarking. If possible maybe we can merge it as an undocumented feature and see if it works? |
src/Missings.jl
Outdated
else | ||
return f.f(xs...; kwargs...) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about throwing an error in this case for now? It doesn't seem very useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I disagree. I want people to be able to code defensively with this tool. throwing an error would defeat the purpose a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Something that occurred to me though: should we do something special if one of the non-vector arguments is missing
(like returning missing
)? Should we reserve this behavior for the future in case we want spreadmissings
to be a more general version of passmissing
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. We definitely can't return missing
, we would have to return a vector of missing
s, which seems very restrictive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what "restrictive" means here. That would be similar to e.g. broadcast(+, missing, [1, 2, 3])
.
Whether that makes sense depends on what f
does with its inputs, but I can't find examples of functions to which one would pass one or more vectors plus a missing
argument and would not expect the resulting vector to be full of missing
s. So reserving the situation where one argument is missing
could be a good idea in case we want to handle it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? At least throwing an error if any of the inputs is missing
for now doesn't seem too problematic?
@nalimilan - please ping me when the first round of comments is resolved. Then I will review. Thank you! |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Okay after the merging of just to confirm, do we want this new feature to live in Missings.jl for now? Or do we want me to move all this stuff to DataFramesMeta. |
Can you please summarize briefly the consensus how things should work? Thank you! |
(Quoting myself.)
Of course, cor(collect.(skipmissings(x, y))...)
Well, to be honest, I'm more interested in reduction functions. I feel like the need for a uniform missing-skipping syntax for reduction functions is more pressing than the need to handle functions like But I did address the "DataFrames scenario" with non-reduction functions. I proposed either writing them by hand, like this, transform(df, :x => (x -> x .- mean(skipmissing(x))) => :x_c) or adding keyword arguments to I think trying to combine the behavior for reduction functions and non-reduction functions into one function produces a complicated, hard to understand function. |
Making cor(collect.(skipmissings(x, y))...) look a little nicer without the |
Base devs rejected that proposal. see here. Looking back, I agree with the logic a bit more. Vectors are a natural way to enforce indices matching, whereas iterators will never be as clear. |
Yeah, it's debatable. But it seems like they might be open to a new method like |
Copying over from Discourse... It's a little goofy, but we could signal to Setup: struct NeedsArray{T}
x::T
end
const ⋆ = NeedsArray What it would look like: skipm(cor)(⋆x, ⋆y) |
Wow, the In view of that unfortunate situation, perhaps the Base devs could be convinced to allow |
@CameronBieganek I think you should develop your use cases and/or make a detailed proposal somewhere else. You have hijacked a pull request about a well-defined function which was in the final review phase. This is not a design issue intended at experimenting with various ideas. The result is a confusing discussion which hasn't allowed me to grasp what are your main requirements. |
Apologies for hijacking a pull request. I have concerns about the design and usability of |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Co-authored-by: Milan Bouchet-Valat <[email protected]>
@nalimilan I have fixed all the issues you have highlighted. Thank you for your feedback! Three things to keep in mind
I've determined this is not due to the fact that How much should I worry about this? Is there anything we can do? Maybe I could re-write the function so it dispatches on EDIT: I' played around with this and it looks like creating specialized structs does fix the type stability issues. However I haven't been able to have both the
Is there something we can do to make this work? I don't really want to make users write EDIT 2: Maybe it's impossible to get type stability under any conditions. I'm not 100% sure.
|
Have you tried adding
|
I'm well past that point. It would be easier for me to follow if each of these methods had its own function name rather than dispatching on keyword arguments. |
@nalimilan this is ready for a review! The inference failures were due to a lack of function barriers. Adding I've filed an issue upstream for a lack of type inference with I also added
It's worth emphasizing again that if none of the inputs are vectors, then behavior is entirely unaffected. So no error is thrown if the inputs are not vectors and the output is a vector. This behavior is indeed complicated, but it's all in the docstring. |
@nalimilan Bumping this. Remember that we ultimately want to try this out in DataFramesMeta.jl first. I will port the PR over there eventually but am editing it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Here are more comments.
function spread_missing( | ||
res::AbstractVector{T}, | ||
vecs::Tuple, | ||
nonmissinginds::AbstractVector{<:Integer}, | ||
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function spread_missing( | |
res::AbstractVector{T}, | |
vecs::Tuple, | |
nonmissinginds::AbstractVector{<:Integer}, | |
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} | |
function spread_missing(res::AbstractVector{T}, | |
vecs::Tuple, | |
nonmissinginds::AbstractVector{<:Integer}, | |
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} |
Same below (for consistency with current style of the package).
nonmissingmask::AbstractVector{<:Bool})::AbstractVector{Union{Missing, T}} where {T} | ||
|
||
if length(res) != length(nonmissinginds) | ||
s = "When spreading a vector result with `spread=$(S)`, " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
S
doesn't exist here. Apparently also needs testing.
maybespread_missing(f, newargs, new_kwargs, vecs, nonmissinginds, nonmissingmask) | ||
# There is at least one vector, but none of the vectors can contain missing | ||
elseif any(x -> x isa AbstractVector, xs) | ||
vecs = Base.filter(x -> x isa AbstractVector, xs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still applies.
julia> xmiss = [10, 20, 30, missing]; | ||
|
||
julia> ymiss = [missing, 500, 400, 300]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not reuse the previous objects?
if !(spread isa AbstractSpread) | ||
throw(ArgumentError("spread must be either SpreadDefault(), SpreadNonMissing(), or SpreadNone()")) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot happen.
if !(spread isa AbstractSpread) | |
throw(ArgumentError("spread must be either SpreadDefault(), SpreadNonMissing(), or SpreadNone()")) | |
end |
2. `fillmean_skip` returns a vector which does not allow for `missing`, while | ||
`spreadmissings(fillmean)` does. | ||
|
||
Use the keyword `spread = :all` to emulate the `skipmissing` behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Show an example?
@@ -0,0 +1,227 @@ | |||
using Test, SparseArrays, Missings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this file?
function right_vec(args...; kwargs...) | ||
kwargs_vals = values(values(kwargs)) | ||
xs = tuple(args..., kwargs_vals...) | ||
vecs = Base.filter(x -> x isa AbstractVector, xs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vecs = Base.filter(x -> x isa AbstractVector, xs) | |
vecs = filter(x -> x isa AbstractVector, xs) |
t = spreadmissings(categorical)(x) | ||
@test t ≈ categorical([1, 2, 3, missing]) | ||
@test typeof(t) == typeof(categorical([1, 2, 3, missing])) | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests that inference works?
Co-authored-by: Milan Bouchet-Valat <[email protected]>
P = typeof(a) # Type of parent array | ||
I = Tuple{typeof(nonmissinginds)} # Type of the non-missing indices | ||
L = Base.IndexStyle(a) === IndexLinear # If the type supports fast linear indexing | ||
SubArray{T, N, P, I, L}(a, (nonmissinginds,), 0, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring for SubArray
makes no mention of contructors, and in fact says "Construct SubArray
s using the view
function." So this direct usage of a SubArray
constructor is usage of undocumented Base internals that could change in any minor or patch release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's probably true.
I think it will be worthwhile to make our own array type that does this better... hopefully it's not too hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be relevant – my SentinelViews.jl package does kinda the opposite. It defines a view type that propagates the sentinel value from indices to the results, sentinelview([10, 20, 30], [1, nothing, 3]) == [10, nothing, 30]
.
It was reasonably straightforward to implement, as should be the "typesubtract view" needed here. With the obvious downside that ::SubArray
function dispatches won't be triggered.
Adds behavior described in #121
Implementation is easy enough to follow