Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardizing working with multiple columns #2016

Closed
pdeffebach opened this issue Nov 15, 2019 · 26 comments
Closed

Standardizing working with multiple columns #2016

pdeffebach opened this issue Nov 15, 2019 · 26 comments

Comments

@pdeffebach
Copy link
Contributor

I propose adding utility functions for more flexibility in working with multiple column arguments in DataFrames.

The motivation for this stems from the Stata functionality for keep. In Stata, one can write

keep patient_* id* household_1-household_50 this that 

What's more, we can write

local vars_to_keep `patient_* id* household_1-household_50 this that `
vars_to_keep `vars_to_keep` one_more_thing
keep `vars_to_keep`

We are close to being able to do this in DataFrames, but not there yet. Currently we can write

vars_to_keep = [r"patient_*", r"id*", Between(:household_1, :household_50), :this, :that]
vars_to_keep = push!(vars_to_keep, :one_more_thing)
select(df, vars_to_keep...)

This is pretty good for the user. However it makes it hard to write code. For instance, what if I want to drop those variables instead of keep them? I think that would be hard to do with the current system while being very easy to do in Stata.

Our current system is also inconsistent with splatting arguments. Take the function dropmissing for example.

function dropmissing(df::AbstractDataFrame,
                     cols::Union{ColumnIndex, AbstractVector, Regex, Not, Between, All, Colon}=:;
                     disallowmissing::Bool=true)
    newdf = df[completecases(df, cols), :]
    disallowmissing && disallowmissing!(newdf, cols)
    newdf
end

This function only allows cols to be one of Not, Between, etc. rather than a vector. And it doesn't splat, meaning you couldn't do dropmissing(df, Between(:a, :d), Between(:f, :m).

It would require a lot of work to support these kinds of function calls on a case-by case basis. Rather, I propose we standardize this kind of argument with a function of the form

"""
    clean_inputs(df, input, inputvec = Symbol[])

Always returns a `Vector` of `Symbol`s from an array of `Symbol`, `Integer`, 
`Regex`, `Between`, and `:`. 
"""
function clean_inputs_vec(df, inputs::AbstractVector, inputvec = Symbol[])
    for input in inputs
        clean_inputs!(df, input, inputvec)
    end

    return unique(inputvec)
end

function clean_inputs!(df, input::Symbol, inputvec::Vector{Symbol})
    push!(inputvec, input)
end

function clean_inputs!(df, input::Integer, inputvec::Vector{Symbol})
    push!(inputvec, names(df)[input])
end

function clean_inputs!(df, input::Regex, inputvec::Vector{Symbol})
    matches = filter(name -> occursin(input, String(name)), names(df))
    append!(inputvec, matches)
end

function clean_inputs!(df, input::Between, inputvec::Vector{Symbol})
    append!(inputvec, names(df)[input.first:input.last])
end

function clean_inputs!(df, ::Colon, inputvec::Vector{Symbol})
    append!(inputvec, names(df))
end

Additionally, given the growing consensus about the Pairs interface design, I propse a similar form of sanitation for inputs of the form

foo(df, [:a => f1, :b => f2]; c = d => f3)

Working with these inputs was a hard thing to do in #1620, though I regret abandoning the PR. Now that we allow regex in combine, this makes working with Vectors of inputs harder. It would be nice to have something which

  1. Expanded all Vector-like arguments (Between, Regex, etc.) to a single named tuple and
  2. Automatically merged Pair arguments and the keyword arguments into a single named tuple

Given the likely addition of a transform fun as well as the already-existing insertcols! function, I think this would be a valuable tool-set to build and make it easier to standardize the interface.

@bkamins
Copy link
Member

bkamins commented Nov 15, 2019

Thank you for writing this down. I have one comment though before we move forward.

This is pretty good for the user. However it makes it hard to write code. For instance, what if I want to drop those variables instead of keep them?

Just write select(df, Not(All(vars_to_keep...))).

And in general All is meant to to provide combining functionality.

There is one small difference between what you propose and All. Your code will error on duplicates while All deduplicates.

In summary - the question is - could you please clarify what does All miss that you need?

@pdeffebach
Copy link
Contributor Author

Thank you for your response. I think a splatted All() does provide the functionality I am looking for.

Nonetheless, I wonder if for any function that currently takes in a Vector of Symbols, we should allow it to accept any iterable of Union{Symbol, Integer, All, Between, Not, Regex}.

Whatever choice we make, we should clarify this as a contract to the user and comb through the code base to add All(...) functions where for ease of maintenance. We could even have an AllSymbol(...) for functions where we need the symbols or AllInteger(...) for functions where we need only the locations of the columns.

@nalimilan
Copy link
Member

Yes, all functions that allow specifying columns should probably accept any of the selector types.

I wonder whether we really need people to write All(cols...) BTW. Maybe Not([cols...]) should also work. Any reason it shouldn't?

@pdeffebach
Copy link
Contributor Author

My proposition is that we should allow the user to input a Vector of all selector types then immediately call All(input...) internally.

@bkamins
Copy link
Member

bkamins commented Nov 19, 2019

We could make Not(cols...) to be equal to Not(All(cols...)) - this should not be a problem and is not breaking.

The only reservation against there proposals is that [:x, :x] is an error, while All(:x, :x) is OK, as All deduplicates. So this change would be breaking.

Let us decide what we want and I can implement the changes.

@nalimilan
Copy link
Member

That raises an interesting question: what should e.g. a [:x5, r"x"] do? Throw an error? Or put :x5 first and other :x.* columns after it? The latter would certainly be more useful. Maybe we can make a difference between symbols and integers, which can only represent a single column, and other selectors: for the former it makes sense to throw an error when there are duplicates since we cannot select a column twice; for the latter, dropping duplicates and keep the first appearance of a column sounds better.

@pdeffebach
Copy link
Contributor Author

Stata de-duplicates in things like these, so I support us de-duplicating and maintaining first appearance.

@bkamins
Copy link
Member

bkamins commented Nov 20, 2019

I am OK with the general idea of de-duplicating.

Avoid de-duplicating things like [:x1, :x1] is also intuitively OK, but it goes against All API which allows All(:x1, :x1). We could change this, but this change would be breaking and inconsistent with JuliaDB.jl.

Apart from this small glitch that we should settle down there are two things to do:

  • first writing down the precise specification of target functionality
  • making a PR implementing and documenting it

Here the question is do you want me to do both (I am OK to do this) or someone is willing to make a PR with this change.

@nalimilan
Copy link
Member

My suggestion was to keep All and indexing with vectors of symbols/integers as they are, but allow vectors mixing different kinds of indices and deduplicate them. IOW, only throw an error if symbols or integers appear several times.

@bkamins
Copy link
Member

bkamins commented Nov 22, 2019

It is crucial that we are precise here (as what you propose means that we cannot simply call All(v...) on vector v).

So I understand the rule you propose is (this is not something we have currently and it is actually much more flexible - I am not sure if we really want to allow it):

  • allow any AbstractVector as column selector (currently we are strict and allow mono-Symbol, mono-integer and mono-bool vectors only)
  • in this AbstractVector we allow mixing of any styles (Symbol, Integer, All, Between, Not, Colon) - I guess except bool indexing which should be reserved
  • we process top-level of the vector and find all Symbol or integer entries and make sure they are not duplicated; if they are duplicated throw an error other wise call All(v...) on this vector;
  • in particular we have to remember that these rules would get applied recursively (by design) if we get such vectors nested in vectors;

Now the question is - how often such mixing would be needed and is it worth to add this complexity as currently we have All that handles exactly such cases.

Just to be clear - I am not strongly against it. The only thing is that I feel that this is not a super common pattern and it duplicates the functionality we already have that All provides with additional complexity of checking for duplicates added. But I think I am biased here (as I spend hours working on it so I probably have a skew 😄) and people maybe would do it often and strongly prefer a vector form instead of All. Therefore for me it is important to hear the voice of the user.

Of course - we still should allow Not(v...) to be parsed as Not(All(v...))) - this is something that I think is natural and not problematic to add.

@bkamins
Copy link
Member

bkamins commented Nov 22, 2019

Just to give an example:

[[[[1]]]]

would be a valid form and would be equivalent to selecting the first column just with [1].

@pdeffebach
Copy link
Contributor Author

Thank you for your detailed comment. Your summary is exactly the behavior I was imagining. You are correct that this logic does result in some odd behavior, like [[[[1]]]]. Perhaps recursiveness should be re-thought.

This week is tough for me (Thanksgiving in US and classes ramping up). But I would like to find time to, as you say, make this as precise as possible into a PR that just explores documentation.

I agree that it would be nice to have more feedback from users. My expectation is that there are many users who miss Stata's seamless handling of multiple types of column selection.

However in addition, this Issue is partly motivated by ease of development as well. If we had a more binding (and consistent) contract for what inputs could be considered, I believe it would make argument handling more streamlined (similar for :x => fun type functions.). So even in the absence of an overwhelming user need, I still think this might be a valuable thing to do.

Perhaps the week after next I will be able to write a PR.

@bkamins
Copy link
Member

bkamins commented Nov 22, 2019

I think it is less effort to just to write down the specification you propose here before making a PR when you have time for this.

Just as an additional comment Not(x...) is not possible as:

julia> Not(1,1,1)
InvertedIndex{InvertedIndices.TupleVector{Tuple{Int64,Int64,Int64}}}([1, 1, 1])

So it has to be Not([1,2,3]) (if we have a specification) or Not(All(1,2,3)) (this is already working).

So in summary the discussion is how heterogenous AbstractVector should be handled as we cannot change Not.

@nalimilan
Copy link
Member

I wouldn't say it's high priority, but that behavior would make sense. Processing inputs recursively may not be necessary though.

@bkamins
Copy link
Member

bkamins commented Nov 23, 2019

Fortunately whole this functionality is non-breaking so it can be added at any time.

Processing inputs recursively may not be necessary though.

It would be a side effect. So I just clarify that this is what will happen when we implement this.

@pdeffebach
Copy link
Contributor Author

@bkamins Perhaps when discussion is concluded on #1975 about how the function should work, we should revive this proposal to have a fully generic way of handling arguments that can be used with by, combine, select, describe etc.

@bkamins
Copy link
Member

bkamins commented Dec 19, 2019

This is what I assume. I want to have one set of rules (like we did for indexing) and then make sure we consistently implement it.

@pdeffebach
Copy link
Contributor Author

Curious about this, to what extent is something transform(df, [:x1, r"^a"] .=> normalize) in #2080 ? Is there anything I can do to help implement this kind of feature?

@bkamins
Copy link
Member

bkamins commented Feb 29, 2020

This feature is almost there (just need to finish select PR which is 95% ready and then make a small transform PR and a small PR that will allow broadcasting for All), what you have written will be:

transform(df, (All(:x1, r"^a") .=> normalize)...)

(there is some small amount of syntax noise in comparison to what you have written, but at the benefit that we can "strictly" handle dispatch)

Then "for free" we will get broadcasting like this in split-apply-combine when select/transform are implemented.

@pdeffebach
Copy link
Contributor Author

Wonderful! This is great to hear.

@nalimilan
Copy link
Member

We should probably allow transform(df, All(:x1, r"^a") .=> normalize) without splatting at some point. That will require a few dispatch tricks, as in #1620.

@bkamins
Copy link
Member

bkamins commented Mar 3, 2020

We can allow for this in the future. I try to make the design of normalize_columns so that it would allow for such things. But it can be added later as #2080 PR already adds 1.5k lines of code.

@bkamins
Copy link
Member

bkamins commented Apr 4, 2020

@pdeffebach - all what you wanted seems to be done in master. Do you need anything else to be added or we can close this?

@bkamins
Copy link
Member

bkamins commented May 5, 2020

@pdeffebach - can this be closed?

@pdeffebach
Copy link
Contributor Author

Let's keep this up until Between(:a, :x) .=> + works.

@bkamins
Copy link
Member

bkamins commented May 5, 2020

OK. We have tracked it in #2171 so I leave up to you to keep it open or close it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants