Standardizing working with multiple columns #2016

pdeffebach · 2019-11-15T19:55:38Z

I propose adding utility functions for more flexibility in working with multiple column arguments in DataFrames.

The motivation for this stems from the Stata functionality for keep. In Stata, one can write

keep patient_* id* household_1-household_50 this that

What's more, we can write

local vars_to_keep `patient_* id* household_1-household_50 this that `
vars_to_keep `vars_to_keep` one_more_thing
keep `vars_to_keep`

We are close to being able to do this in DataFrames, but not there yet. Currently we can write

vars_to_keep = [r"patient_*", r"id*", Between(:household_1, :household_50), :this, :that]
vars_to_keep = push!(vars_to_keep, :one_more_thing)
select(df, vars_to_keep...)

This is pretty good for the user. However it makes it hard to write code. For instance, what if I want to drop those variables instead of keep them? I think that would be hard to do with the current system while being very easy to do in Stata.

Our current system is also inconsistent with splatting arguments. Take the function dropmissing for example.

function dropmissing(df::AbstractDataFrame,
                     cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon}=:;
                     disallowmissing::Bool=true)
    newdf = df[completecases(df, cols), :]
    disallowmissing && disallowmissing!(newdf, cols)
    newdf
end

This function only allows cols to be one of Not, Between, etc. rather than a vector. And it doesn't splat, meaning you couldn't do dropmissing(df, Between(:a, :d), Between(:f, :m).

It would require a lot of work to support these kinds of function calls on a case-by case basis. Rather, I propose we standardize this kind of argument with a function of the form

"""
    clean_inputs(df, input, inputvec = Symbol[])

Always returns a `Vector` of `Symbol`s from an array of `Symbol`, `Integer`, 
`Regex`, `Between`, and `:`. 
"""
function clean_inputs_vec(df, inputs::AbstractVector, inputvec = Symbol[])
    for input in inputs
        clean_inputs!(df, input, inputvec)
    end

    return unique(inputvec)
end

function clean_inputs!(df, input::Symbol, inputvec::Vector{Symbol})
    push!(inputvec, input)
end

function clean_inputs!(df, input::Integer, inputvec::Vector{Symbol})
    push!(inputvec, names(df)[input])
end

function clean_inputs!(df, input::Regex, inputvec::Vector{Symbol})
    matches = filter(name -> occursin(input, String(name)), names(df))
    append!(inputvec, matches)
end

function clean_inputs!(df, input::Between, inputvec::Vector{Symbol})
    append!(inputvec, names(df)[input.first:input.last])
end

function clean_inputs!(df, ::Colon, inputvec::Vector{Symbol})
    append!(inputvec, names(df))
end

Additionally, given the growing consensus about the Pairs interface design, I propse a similar form of sanitation for inputs of the form

foo(df, [:a => f1, :b => f2]; c = d => f3)

Working with these inputs was a hard thing to do in #1620, though I regret abandoning the PR. Now that we allow regex in combine, this makes working with Vectors of inputs harder. It would be nice to have something which

Expanded all Vector-like arguments (Between, Regex, etc.) to a single named tuple and
Automatically merged Pair arguments and the keyword arguments into a single named tuple

Given the likely addition of a transform fun as well as the already-existing insertcols! function, I think this would be a valuable tool-set to build and make it easier to standardize the interface.

The text was updated successfully, but these errors were encountered:

bkamins · 2019-11-15T22:48:35Z

Thank you for writing this down. I have one comment though before we move forward.

This is pretty good for the user. However it makes it hard to write code. For instance, what if I want to drop those variables instead of keep them?

Just write select(df, Not(All(vars_to_keep...))).

And in general All is meant to to provide combining functionality.

There is one small difference between what you propose and All. Your code will error on duplicates while All deduplicates.

In summary - the question is - could you please clarify what does All miss that you need?

pdeffebach · 2019-11-15T23:12:58Z

Thank you for your response. I think a splatted All() does provide the functionality I am looking for.

Nonetheless, I wonder if for any function that currently takes in a Vector of Symbols, we should allow it to accept any iterable of Union{Symbol, Integer, All, Between, Not, Regex}.

Whatever choice we make, we should clarify this as a contract to the user and comb through the code base to add All(...) functions where for ease of maintenance. We could even have an AllSymbol(...) for functions where we need the symbols or AllInteger(...) for functions where we need only the locations of the columns.

nalimilan · 2019-11-18T21:13:08Z

Yes, all functions that allow specifying columns should probably accept any of the selector types.

I wonder whether we really need people to write All(cols...) BTW. Maybe Not([cols...]) should also work. Any reason it shouldn't?

pdeffebach · 2019-11-19T17:52:29Z

My proposition is that we should allow the user to input a Vector of all selector types then immediately call All(input...) internally.

bkamins · 2019-11-19T22:24:40Z

We could make Not(cols...) to be equal to Not(All(cols...)) - this should not be a problem and is not breaking.

The only reservation against there proposals is that [:x, :x] is an error, while All(:x, :x) is OK, as All deduplicates. So this change would be breaking.

Let us decide what we want and I can implement the changes.

nalimilan · 2019-11-20T08:20:41Z

That raises an interesting question: what should e.g. a [:x5, r"x"] do? Throw an error? Or put :x5 first and other :x.* columns after it? The latter would certainly be more useful. Maybe we can make a difference between symbols and integers, which can only represent a single column, and other selectors: for the former it makes sense to throw an error when there are duplicates since we cannot select a column twice; for the latter, dropping duplicates and keep the first appearance of a column sounds better.

pdeffebach · 2019-11-20T13:04:29Z

Stata de-duplicates in things like these, so I support us de-duplicating and maintaining first appearance.

bkamins · 2019-11-20T18:58:11Z

I am OK with the general idea of de-duplicating.

Avoid de-duplicating things like [:x1, :x1] is also intuitively OK, but it goes against All API which allows All(:x1, :x1). We could change this, but this change would be breaking and inconsistent with JuliaDB.jl.

Apart from this small glitch that we should settle down there are two things to do:

first writing down the precise specification of target functionality
making a PR implementing and documenting it

Here the question is do you want me to do both (I am OK to do this) or someone is willing to make a PR with this change.

nalimilan · 2019-11-22T21:18:30Z

My suggestion was to keep All and indexing with vectors of symbols/integers as they are, but allow vectors mixing different kinds of indices and deduplicate them. IOW, only throw an error if symbols or integers appear several times.

bkamins · 2019-11-22T22:23:27Z

It is crucial that we are precise here (as what you propose means that we cannot simply call All(v...) on vector v).

So I understand the rule you propose is (this is not something we have currently and it is actually much more flexible - I am not sure if we really want to allow it):

allow any AbstractVector as column selector (currently we are strict and allow mono-Symbol, mono-integer and mono-bool vectors only)
in this AbstractVector we allow mixing of any styles (Symbol, Integer, All, Between, Not, Colon) - I guess except bool indexing which should be reserved
we process top-level of the vector and find all Symbol or integer entries and make sure they are not duplicated; if they are duplicated throw an error other wise call All(v...) on this vector;
in particular we have to remember that these rules would get applied recursively (by design) if we get such vectors nested in vectors;

Now the question is - how often such mixing would be needed and is it worth to add this complexity as currently we have All that handles exactly such cases.

Just to be clear - I am not strongly against it. The only thing is that I feel that this is not a super common pattern and it duplicates the functionality we already have that All provides with additional complexity of checking for duplicates added. But I think I am biased here (as I spend hours working on it so I probably have a skew 😄) and people maybe would do it often and strongly prefer a vector form instead of All. Therefore for me it is important to hear the voice of the user.

Of course - we still should allow Not(v...) to be parsed as Not(All(v...))) - this is something that I think is natural and not problematic to add.

bkamins · 2019-11-22T22:24:41Z

Just to give an example:

[[[[1]]]]

would be a valid form and would be equivalent to selecting the first column just with [1].

pdeffebach · 2019-11-22T22:47:45Z

Thank you for your detailed comment. Your summary is exactly the behavior I was imagining. You are correct that this logic does result in some odd behavior, like [[[[1]]]]. Perhaps recursiveness should be re-thought.

This week is tough for me (Thanksgiving in US and classes ramping up). But I would like to find time to, as you say, make this as precise as possible into a PR that just explores documentation.

I agree that it would be nice to have more feedback from users. My expectation is that there are many users who miss Stata's seamless handling of multiple types of column selection.

However in addition, this Issue is partly motivated by ease of development as well. If we had a more binding (and consistent) contract for what inputs could be considered, I believe it would make argument handling more streamlined (similar for :x => fun type functions.). So even in the absence of an overwhelming user need, I still think this might be a valuable thing to do.

Perhaps the week after next I will be able to write a PR.

bkamins · 2019-11-22T23:14:22Z

I think it is less effort to just to write down the specification you propose here before making a PR when you have time for this.

Just as an additional comment Not(x...) is not possible as:

julia> Not(1,1,1)
InvertedIndex{InvertedIndices.TupleVector{Tuple{Int64,Int64,Int64}}}([1, 1, 1])

So it has to be Not([1,2,3]) (if we have a specification) or Not(All(1,2,3)) (this is already working).

So in summary the discussion is how heterogenous AbstractVector should be handled as we cannot change Not.

nalimilan · 2019-11-23T10:03:26Z

I wouldn't say it's high priority, but that behavior would make sense. Processing inputs recursively may not be necessary though.

bkamins · 2019-11-23T10:59:10Z

Fortunately whole this functionality is non-breaking so it can be added at any time.

Processing inputs recursively may not be necessary though.

It would be a side effect. So I just clarify that this is what will happen when we implement this.

pdeffebach · 2019-12-19T16:04:43Z

@bkamins Perhaps when discussion is concluded on #1975 about how the function should work, we should revive this proposal to have a fully generic way of handling arguments that can be used with by, combine, select, describe etc.

bkamins · 2019-12-19T20:20:58Z

This is what I assume. I want to have one set of rules (like we did for indexing) and then make sure we consistently implement it.

pdeffebach · 2020-02-29T15:00:48Z

Curious about this, to what extent is something transform(df, [:x1, r"^a"] .=> normalize) in #2080 ? Is there anything I can do to help implement this kind of feature?

bkamins · 2020-02-29T19:03:56Z

This feature is almost there (just need to finish select PR which is 95% ready and then make a small transform PR and a small PR that will allow broadcasting for All), what you have written will be:

transform(df, (All(:x1, r"^a") .=> normalize)...)

(there is some small amount of syntax noise in comparison to what you have written, but at the benefit that we can "strictly" handle dispatch)

Then "for free" we will get broadcasting like this in split-apply-combine when select/transform are implemented.

pdeffebach · 2020-02-29T19:19:50Z

Wonderful! This is great to hear.

nalimilan · 2020-03-03T17:08:26Z

We should probably allow transform(df, All(:x1, r"^a") .=> normalize) without splatting at some point. That will require a few dispatch tricks, as in #1620.

bkamins · 2020-03-03T21:44:58Z

We can allow for this in the future. I try to make the design of normalize_columns so that it would allow for such things. But it can be added later as #2080 PR already adds 1.5k lines of code.

bkamins · 2020-04-04T07:17:35Z

@pdeffebach - all what you wanted seems to be done in master. Do you need anything else to be added or we can close this?

bkamins · 2020-05-05T15:13:10Z

@pdeffebach - can this be closed?

pdeffebach · 2020-05-05T18:30:48Z

Let's keep this up until Between(:a, :x) .=> + works.

bkamins · 2020-05-05T18:38:34Z

OK. We have tracked it in #2171 so I leave up to you to keep it open or close it.

pdeffebach mentioned this issue Dec 1, 2019

Add more splatting and appending of arguments in by #1620

Closed

Drvi mentioned this issue Dec 22, 2019

Selections.jl + DataFrames.jl #1936

Closed

pdeffebach closed this as completed May 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standardizing working with multiple columns #2016

Standardizing working with multiple columns #2016

pdeffebach commented Nov 15, 2019

bkamins commented Nov 15, 2019

pdeffebach commented Nov 15, 2019

nalimilan commented Nov 18, 2019

pdeffebach commented Nov 19, 2019

bkamins commented Nov 19, 2019

nalimilan commented Nov 20, 2019

pdeffebach commented Nov 20, 2019

bkamins commented Nov 20, 2019

nalimilan commented Nov 22, 2019

bkamins commented Nov 22, 2019

bkamins commented Nov 22, 2019

pdeffebach commented Nov 22, 2019

bkamins commented Nov 22, 2019

nalimilan commented Nov 23, 2019

bkamins commented Nov 23, 2019

pdeffebach commented Dec 19, 2019

bkamins commented Dec 19, 2019

pdeffebach commented Feb 29, 2020

bkamins commented Feb 29, 2020 •

edited

Loading

pdeffebach commented Feb 29, 2020

nalimilan commented Mar 3, 2020

bkamins commented Mar 3, 2020

bkamins commented Apr 4, 2020

bkamins commented May 5, 2020

pdeffebach commented May 5, 2020

bkamins commented May 5, 2020

Standardizing working with multiple columns #2016

Standardizing working with multiple columns #2016

Comments

pdeffebach commented Nov 15, 2019

bkamins commented Nov 15, 2019

pdeffebach commented Nov 15, 2019

nalimilan commented Nov 18, 2019

pdeffebach commented Nov 19, 2019

bkamins commented Nov 19, 2019

nalimilan commented Nov 20, 2019

pdeffebach commented Nov 20, 2019

bkamins commented Nov 20, 2019

nalimilan commented Nov 22, 2019

bkamins commented Nov 22, 2019

bkamins commented Nov 22, 2019

pdeffebach commented Nov 22, 2019

bkamins commented Nov 22, 2019

nalimilan commented Nov 23, 2019

bkamins commented Nov 23, 2019

pdeffebach commented Dec 19, 2019

bkamins commented Dec 19, 2019

pdeffebach commented Feb 29, 2020

bkamins commented Feb 29, 2020 • edited Loading

pdeffebach commented Feb 29, 2020

nalimilan commented Mar 3, 2020

bkamins commented Mar 3, 2020

bkamins commented Apr 4, 2020

bkamins commented May 5, 2020

pdeffebach commented May 5, 2020

bkamins commented May 5, 2020

bkamins commented Feb 29, 2020 •

edited

Loading