-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spreadmissings
, the backend for column-wise @passmissing
#276
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -283,6 +283,101 @@ macro byrow(args...) | |
end | ||
|
||
|
||
struct SpreadMissings{F} <: Function | ||
f::F | ||
end | ||
|
||
function (f::SpreadMissings{F})(vecs::AbstractVector...) where {F} | ||
if any(x -> x isa AbstractVector{>:Missing}, vecs) | ||
|
||
firstindices = eachindex(first(vecs)) | ||
if !all(x -> eachindex(x) == firstindices, vecs) | ||
err_str = "Indices of arguments do not match. Indices are " * | ||
join(string.eachindex.(vecs), ", ", " and ") * "." | ||
|
||
throw(ArgumentError(err_str)) | ||
end | ||
Comment on lines
+293
to
+299
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Assuming this is an internal function I think we do not need it. DataFrames.jl enforces 1-based indexing. |
||
|
||
nonmissingmask = fill(true, length(vecs[1])) | ||
for v in vecs | ||
nonmissingmask .&= .!ismissing.(v) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is inefficient. See https://github.com/JuliaData/DataFrames.jl/blob/main/src/abstractdataframe/abstractdataframe.jl#L758 for an efficient implementation. |
||
end | ||
|
||
nonmissinginds = findall(nonmissingmask) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
vecs_counter = 1 | ||
newargs = ntuple(i -> view(vecs[i], nonmissinginds), length(vecs)) | ||
|
||
res = f.f(newargs...) | ||
|
||
if res isa AbstractVector && length(res) == length(nonmissinginds) | ||
out = similar(res, Union{eltype(res), Missing}, length(vecs[1])) | ||
fill!(out, missing) | ||
out[nonmissinginds] .= res | ||
|
||
return out | ||
else | ||
return res | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure about this design. The "appropriate length" part seems fragile. I understand the intention, but it seems incorrect somehow. I feel that the best design would be to have two separate functions:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, another complication is how the spreading works. DataFrames.jl has some fairly complicated rules regarding spreading scalar values across columns. Let's say you do
this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My intention is to resolve such things with JuliaData/DataFrames.jl#2794. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed if the result is an
Well it would also be quite common for users to wish to fill all rows with the mean, including those with missing values. That happens for example if you want to create a group-level variable in a multi-level model. |
||
end | ||
else | ||
return f.f(vecs...) | ||
end | ||
end | ||
|
||
""" | ||
spreadmissings(f) | ||
|
||
Given a function `f`, wraps `f` so that a `view` selecting non-missing | ||
values of `AbstractVector` argument is applied before calling `f`. After `f` | ||
is called, if the result is a `AbstractVector` of the appropriate length, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. define "appropriate length" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also comment exactly what happens otherwise. |
||
is is "spread" across the length of the original inputs, imputing `missing` where | ||
indices of inputs contain `missing` values. | ||
|
||
Consider `spreadmissing(f)(x1, x2)`, where `x1` and `x2` are defined as | ||
|
||
``` | ||
x = [1, missing, 3, 4] | ||
y = [5, 6, 7, missing] | ||
``` | ||
|
||
and the function | ||
|
||
``` | ||
no_missing(x::Int, y::Int) = x + y | ||
f(x::AbstractVector, y::AbstractVector) = no_missing.(last(x), y) | ||
``` | ||
|
||
Then the indices for which both `x` and `y` are non-missing are `[1, 2]`. | ||
`spreadmissings(f)` then calls `f(view(x, [1, 3]), view(y, [1, 3]))`. This will return | ||
`[8, 10]`. | ||
|
||
Finally, `spreadmissings(f)` will "spread" the vector `[7, 8]` across the non-missing | ||
indices of the inputs, filling in `missing`, otherwise. As a result we have | ||
|
||
``` | ||
julia> DataFramesMeta.spreadmissings(f)(x, y) | ||
4-element Vector{Union{Missing, Int64}}: | ||
8 | ||
missing | ||
10 | ||
missing | ||
``` | ||
|
||
If the result of `f(view()...)` is not a vector, the it is returned as-is. | ||
|
||
### Examples | ||
|
||
``` | ||
julia> x = [1, 2, 3, 4, missing, 6]; y = [11, 5, missing, 8, 15, 12]; | ||
|
||
julia> DataFramesMeta.spreadmissings(cor)(x, y) | ||
0.3803061508332227 | ||
``` | ||
|
||
|
||
""" | ||
spreadmissings(f) = SpreadMissings{Core.Typeof(f)}(f) | ||
|
||
############################################################################## | ||
## | ||
## @with | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the rationale behind
SpreadMissings
name? It would be more natural for me to call it something like "complete cases" (or justSkipMissings
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was originally named for the "spreading" that happens when the shorter vector with the non-missing indices is spread out to match the length of the inputs. But I agree that "complete cases" is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the name
SkipMissings
is that this type is quite different fromSkipMissing
due to the "spreading" behavior. Actually, this function can be defined as a vectorizedpassmissing
when the wrapped function returns a vector; but when the function returns a scalar it leaves it as a scalar. This behavior is what seems to make the most sense for users: an alternative would be to "spread" scalars so thatmissing
is used in positions which aremissing
in the input, but that doesn't sound super useful.So it's hard to find a good name due to this hybrid behavior. I'd almost call it
ArrayPassMissing
. Anyway for now it's internal.But what matters is the API that DataFramesMeta (and probably DataFrames at some point) exposes to users. The approach adopted here is to make
@passmissing
the go-to solution to handle missing values, which would work formean(:x)
,cor(:x, :y)
,scale(:x)
, etc. It's appealing as most users wouldn't have to care about the subtle differences betweenpassmissing
andskipmissing
and their potential variants (mean(skipmissing(:x))
would be an alternative to@passmissing mean(:x)
which makes a single pass over the data, but most users probably don't care a lot).Another approach would be to require users to choose what they want to do, but it would be much more annoying to use:
@skipmissing
would useskipmissing∘f
for single arguments, or a new variant ofskipmissing
for multiple arguments which passes a view of their non-missing entries; both would leave the returned value as-is@passmissing
would usepassmissing(f)
forAbstractArray
inputs, orspreadmissing(f)
forAbstractArray
inputs, and would require the output to be anAbstractArray
with its length equal to the number of non-missing entries in the input (or possibly broadcast a scalar result to non-missing positions)