Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cols([:x1, :x2]) should be a vector vectors, not a DataFrame #158

Closed
pdeffebach opened this issue Jul 23, 2020 · 10 comments
Closed

cols([:x1, :x2]) should be a vector vectors, not a DataFrame #158

pdeffebach opened this issue Jul 23, 2020 · 10 comments
Milestone

Comments

@pdeffebach
Copy link
Collaborator

The calls

julia> @transform(df, t = sum([:x1, :x2]));

julia> @transform(df, t = sum(cols([:x1, :x2])));

Should return the same thing. However the second one throws an error because things inside cols become df[!, [:x1, :x2]].

I think this makes the mental model for cols more complicated than it needs to be, and isn't documented behavior. This makes me think I should add it to the "limits of" sections of tests in #155. The question is, should this be defined as undefined behavior?

cc @mkborregaard and @tk3369 and other regular users, I want to get feedback from people using DataFramesMeta on this.

@bkamins
Copy link
Member

bkamins commented Jul 24, 2020

The documentation says:

the column is referenced by the variable expr rather than a symbol

which means that cols(expr) should add a run-time check that expr evaluates to a DataFrames.ColumnIndex before passing it to getindex. So I would say that the fact that @transform(df, t = sum(cols([:x1, :x2]))) currently works is a semi-bug.

There is another question what should cols([:x1, :x2]) evaluate to if we would decide to allow it. I think the current behavior is the only consistent one and the documentation could be changed to:

cols(expr) gets transformed to a getindex call where expr is used as a column selector untouched

and your second example should rather be @transform(df, t = sum([cols(:x1), cols(:x2)]));.

@pdeffebach
Copy link
Collaborator Author

pdeffebach commented Jul 24, 2020

I think the current behavior is the only consistent one and the documentation could be changed to:

Here is the scenario.

You are writing code in global scope and where you have to take the sum across columns.

t = @pipe df |>
    select(:id, [:v1, :v2, :v3]) |>
    @transform(_, newvar = sum([:v1, :v2, :v3])

Now you realize that you aren't always working with :v1, :v2, :v3. You need to put your current pipe into a function.

With the current behavior, you can't just do a one-to-one translation but just adding cols where you need it to go.

function clean(df, idvar, vars)
    t = @pipe df |>
        select(idvar, vars)
        @transform(_, newvar = sum(cols(vars))
end

This is frustrating. It means to get a one-to-one translation from the global scope code to the function code, I need to add an eachcol.

I think that whatever container of Symbols what is inside cols evaluates to, it should be the same as if that literal were placed in the DataFramesMeta call.

This is roughly how stata works. Stata is very lisp-like.

If you define the local

local t v1 v2 v3

wherever you see that local

`t'

in the code, you know that it will always behave exactly as though the literal v1 v2 v3 were placed there instead.

cc @matthieugomez for the stata analogy

@bkamins
Copy link
Member

bkamins commented Jul 24, 2020

I understand the issue with convenience. The problem is with consistency. Consider the code:

julia> df = DataFrame(rand(2,3))
2×3 DataFrame
│ Row │ x1       │ x2       │ x3       │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.970156 │ 0.92646  │ 0.598718 │
│ 2   │ 0.835014 │ 0.360059 │ 0.569086 │

julia> t = @pipe df |>
           @transform(_, newvar = axes(cols([:x1, :x2, :x3]), 1))
2×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ newvar │
│     │ Float64  │ Float64  │ Float64  │ Int64  │
├─────┼──────────┼──────────┼──────────┼────────┤
│ 1   │ 0.970156 │ 0.92646  │ 0.598718 │ 1      │
│ 2   │ 0.835014 │ 0.360059 │ 0.569086 │ 2      │

(this is how it works now)

How would you want to express this under your proposal? Or such things should be not supported?

@pdeffebach
Copy link
Collaborator Author

It would not be supported. Since you can't do that with Symbol literals.

I think the main problem is that it would be very very hard to specialize behavior based off of what's inside the cols command. For instance, say for whatever reason the user wanted a Tuple of vectors.

cols((:x1, :x2, :x3))

I don't really see a way of getting this to work without a whole extra replace_syms! step. Perhaps this would be easy since we aren't doing things like adding Symbols inside cols.

If we had to force a certain behavior on the user, I personally would rather have collect(eachcol(df[:, vars])) be the output rather than df[:, vars]. But the first best choice is something like the above.

@pdeffebach
Copy link
Collaborator Author

For example, here's something that might be feasible

v = [:x1, :x2]

replace_syms!(e::Expr, membernames) =
    if onearg(e, :^)
        e.args[2]
    elseif onearg(e, :_I_)
        @warn "_I_() for escaping variables is deprecated, use cols() instead"
        addkey!(membernames, :($(e.args[2])))
    elseif onearg(e, :cols)
        t = @eval Main v # TODO: figure out how to evaluate this using `e.args[2]`
        # addkey!(membernames, :($(e.args[2])))
        mapexpr(x -> replace_syms!(x, membernames), Meta.parse(repr(t)))
    elseif e.head == :quote
        addkey!(membernames, Meta.quot(e.args[1]) )
    elseif e.head == :.
        replace_dotted!(e, membernames)
    else
        mapexpr(x -> replace_syms!(x, membernames), e)
    end

julia> v = [:x1, :x2]
2-element Array{Symbol,1}:
 :x1
 :x2

julia> @transform(df, t = fill(typeof(cols(v)), nrow(df)))
2×3 DataFrame
│ Row │ x1        │ x2       │ t                         │
│     │ Float64   │ Float64  │ DataType                  │
├─────┼───────────┼──────────┼───────────────────────────┤
│ 1   │ 0.155328  │ 0.581641 │ Array{Array{Float64,1},1} │
│ 2   │ 0.0800355 │ 0.88714  │ Array{Array{Float64,1},1} │

julia> v = (:x1, :x2)
(:x1, :x2)

julia> @transform(df, t = fill(typeof(cols(v)), nrow(df)))
2×3 DataFrame
│ Row │ x1        │ x2       │ t                                        │
│     │ Float64   │ Float64  │ DataType                                 │
├─────┼───────────┼──────────┼──────────────────────────────────────────┤
│ 1   │ 0.155328  │ 0.581641 │ Tuple{Array{Float64,1},Array{Float64,1}} │
│ 2   │ 0.0800355 │ 0.88714  │ Tuple{Array{Float64,1},Array{Float64,1}} │

@bkamins
Copy link
Member

bkamins commented Jul 24, 2020

In this case I would list a list of values accepted in cols and specify the behavior for each of these types and throw an error if a value of non-supported type is passed.

@pdeffebach
Copy link
Collaborator Author

In this case I would list a list of values accepted in cols and specify the behavior for each of these types and throw an error if a value of non-supported type is passed.

Yes, i think this is easiest. I think my idea of evaluating cols and then replacing the symbols again is going overboard.

Actually, I just tried it out with my new DataFrames.transform-based macro @transform2.

It "just works" exactly as I described above, with the exception that

DataFrames.transform(df, [[:x1, :x2]] => sum)

isn't supported. I will continue to thing on this.

@pdeffebach
Copy link
Collaborator Author

Marking as 1.X since as fas as I understand, AsVector is something that might be added to DataFrames.

Allowing for AsTable is a higher priority, since thats currently allowed in DataFrames.

@bkamins
Copy link
Member

bkamins commented Mar 7, 2021

AsVector will be added, but I guess after 1.0 release (@nalimilan prefers to keep the number of changes in 1.0 release minimal and I understand the reasoning as otherwise we would postpone it too much)

@pdeffebach
Copy link
Collaborator Author

Closed in favor of #229

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants