Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expression functors (in particular: over) #3377

Open
adienes opened this issue Sep 6, 2023 · 14 comments
Open

expression functors (in particular: over) #3377

adienes opened this issue Sep 6, 2023 · 14 comments
Labels
Milestone

Comments

@adienes
Copy link
Contributor

adienes commented Sep 6, 2023

pasted from slack:

consider the following type of expr

select(groupby(df, :x4), :x1 => first => :first_in_group)

this is fine in isolation, but is a little unwieldy when embedded in a select list of multiple transformations at once. technically, the below works but is obviously not idiomatic...

select(df,
    :x1 => ByRow(sqrt) => :sqrt1,
    :x2 => ByRow(log) => :log1,
    AsTable([:x4, :x1]) => (
        j -> select(groupby(DataFrame(j), :x4),
            :x1 => first => :first_in_group
        ).first_in_group
    ) => :first_in_group
)

polars has .over which is essentially an expr functor (expr, col) --> expr , so something like this could be expressed as

df.select(
    x1=np.sqrt(col("x1")),
    x2=np.log(col("x2")),
    first_in_group=col("x1").first().over("x4")
)

I'm wondering if such a feature / expr transformation like over (which I use frequently) could be implemented in DataFrames ? (or if it exists and I haven't found it yet)

reply by bkamins:

The key question is if we could add it consistently (i.e. supporting both data frame and grouped data frame as an input, and supporting both select and combine kind of operation)

@bkamins bkamins added the feature label Sep 6, 2023
@bkamins bkamins added this to the 1.7 milestone Sep 6, 2023
@bkamins
Copy link
Member

bkamins commented Sep 6, 2023

Yes - my point was about what is your proposal for these cases. In particular:

  • how do you expect the feature to work with GroupedDataFrame input;
  • how do you expect the feature to work with combine (where operations could return varying number of rows, e.g. you do over that returns 10 rows, but other operations passed to combine return e.g. 7 rows that cannot be unambiguously matched to the 10 rows retuned by over).

(maybe you are not clear what to do - it is OK then, but we must clearly see how to combine the proposal with the whole ecosystem)

@adienes
Copy link
Contributor Author

adienes commented Sep 6, 2023

I have not thought out all the edge cases yet, but I will try to get a list of examples for these scenarios (and then bikeshed the actual api)

at least in the basic case of GroupedDataFrame I think this can be straightforwardly handled by applying over the subgroups within each group, that is

* select(df, Over(expr, [:A, :B]))
* select(groupby(df, :A), Over(expr, :B))
* select(groupby(df, [:A, :B]), expr)
* select(df, Over(Over(expr, :B), :A)) # maybe? not sure about this one

should all be equal

@bkamins
Copy link
Member

bkamins commented Sep 6, 2023

Note that groupby allows for passing sorting order of groups and even allows the groups to be reordered (which is a hard edge case to think about). But I think it can be "worked out". The combin case is a harder issue.

@adienes
Copy link
Contributor Author

adienes commented Sep 6, 2023

actually, sorry, could you help me create such an edge case that may be ambiguous? maybe this is a naive answer, but I am thinking that in the same way for select, then combine(groupby(df, [:A, :B]), expr) == combine(groupby(df, :A), Over(expr, :B)) == combine(df, Over(expr, [:A, :B]))

there is already an error when the column lengths from combine don't match, like

julia> combine(df, :x1 => j -> [1,2], :x2 => j -> [1,2,3])
ERROR: ArgumentError: New columns must have the same length as old columns

as an alternate API showerthought, since |> has higher precedence than =>, maybe this could apply to the column selectors and be written like

select(df, :x1 |> Over([:A, :B]) => first => :first_in_group)

etc

@bkamins
Copy link
Member

bkamins commented Sep 6, 2023

Here is a problematic example:

julia> df = DataFrame(id=[1,2,1,2,3], x=1:5)
5×2 DataFrame
 Row │ id     x
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     1      3
   4 │     2      4
   5 │     3      5

julia> combine(df, :x => collect∘extrema => :free_col, Over(:x => sum, :id))

and now it is unclear what should be done.
In separation the expressions produce:

julia> combine(df, :x => collect∘extrema => :free_col)
2×1 DataFrame
 Row │ free_col
     │ Int64
─────┼──────────
   1 │        1
   2 │        5

julia> combine(groupby(df, :id), :x => sum)
3×2 DataFrame
 Row │ id     x_sum
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      6
   3 │     3      5

but how to link them?

Note that if :id had 2 groups we would have the same number of rows in both results, but it would still not be clear how the rows should be combined.

Note that things would yet be more complicated if the combine(groupby(df, :id), :x => sum) returned multiple rows per group (and potentially different number of rows per group).

Your examples work and are clear what to do because you work with:

  • select which keeps all rows and keeps their order always;
  • from Over you return 1 row per group (so it can be broadcasted without a problem);

And when we combine these two conditions indeed things are clear how they should work.

@adienes
Copy link
Contributor Author

adienes commented Sep 6, 2023

combine(df, :x => collect∘extrema => :free_col, Over(:x => sum, :id))
and now it is unclear what should be done.

so, I would say this should error for the same reason that this does.

combine(df, :x => collect∘extrema => :free_col, :x => (j -> [sum(j), std(j), mean(j)]))

if :id has 2 groups then I would hcat the resulting x_sum with free_col; that is the group id should be mostly ignored except in determining how to apply the expression, other than that it is "just" a function returning a column that can be applied like any other. if it returns multiple rows per group and/or different number of rows per group, that is ok, as long as the total number of rows returned is equal to the number in the other combine operations.

one consequence, if I understand myself correctly lol, would be

combine(groupby(df, :x), Over(expr, :x))

is a no-op, equivalent to combine(groupby(df, :x), expr))

@adienes
Copy link
Contributor Author

adienes commented Sep 7, 2023

I think in terms of behavior, this does more or less everything I'm imagining (although obviously the implementation is inefficient and the API is pretty kludgy)

function Over(expr, col)
    return df -> select(select(groupby(df, col), expr), Not(col))
end

@bkamins
Copy link
Member

bkamins commented Sep 7, 2023

You opened the discussion with the polars over refernce, so I think it is good to have a look there: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.over.html

Note that the design of over is more complex and it provides three mapping strategies (that are polars specific)

I need to think about it more. The mental problem I have is that while I see the need for Over the challenge is to design it in a way that would be composable with the rest of the ecosystem.

@nalimilan
Copy link
Member

Maybe this shouldn't be allowed with combine? That operation is more tricky than select because the meaning of rows isn't defined as clearly. And anyway it seems less useful than with select.

In terms of syntax, the most consistent extension of the current API I can think of would be select(df, :x1 => GroupBy(:x4, first) => :first_in_group).

@adienes
Copy link
Contributor Author

adienes commented Sep 10, 2023

[Over with] combine is more tricky than select and seems less useful

totally agreed. in fact, part of the whole reason to want Over is explicitly to map the values back to the original rows --- thinking through semantics with combine was only an attempt to make it consistent across the "whole" ecosystem as suggested by @bkamins , but if it were instead decided to support only select and transform I think that would be reasonable. In fact there is already kind of precedent, in that groupby(::GroupedDataFrame, ::Symbol) already errors

.

Another thing to consider with the API is composition of potentially multiple expr functors (hopefully not too ambitious), with another prominent example being Filter. going back to the polars comparison, they allow expressions like this:

df = pl.DataFrame()
df = df.with_columns(
    value=pl.Series(np.random.rand(10)),
    ready=pl.Series([bool(x) for x in [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]]),
    group=pl.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4]),
)
df.with_columns(
    first_ready_in_group=col("value").filter(col("ready")).first().over(col("group"))
)

and btw, the reason I am calling these "functors" is because in fact they really can take arbitrary expressions, like so

filter_expr = col("ready").xor((col("value") * 1000).round().mod(7).cast(bool))
over_expr = pl.when(col("group") == 1).then(1).otherwise(2)

df.with_columns(
    first_ready_in_group=col("value").filter(filter_expr).first().over(over_expr)
)

@bkamins
Copy link
Member

bkamins commented Sep 10, 2023

As for functors - I think it will be too hard to add them. Our API is too different. Such things were designed to be chained.


As for the syntax:

select(df, :x1 => GroupBy(:x4, first) => :first_in_group)

would be re-written as:

select(df, df -> select(groupby(df, :x), :x1 => first => :first_in_group, keepkeys=false))

so maybe it would be clearer to use the following syntax:

select(df, GroupBy(:x4, :x1 => first  => :first_in_group))

(as then it would be fully clear that GroupBy is just a replacement for inner groupby with keepkeys=false followed by select)

@adienes
Copy link
Contributor Author

adienes commented Sep 11, 2023

I still think my preferred syntax, if possible, would be with a pipe, both to avoid an extra layer of function nesting and to avoid the classic problem of having to jump the cursor back to add the Over/GroupBy (as I usually start by writing down the input columns). it could look something like

select(df, :x1 |> Over(:x4) => first => :first_in_group

possibly as a curried version of

select(df, Over(:x1, :x4) => first => :first_in_group

read as "transform x1 over x4" and this could take column selectors

select(df, :x1 |> Over([:id_A, :id_B, id_C]) => first => :first_in_product_group)

I also prefer this over @nalimilan 's proposed syntax since I feel like it is somewhat of a rule that all the columns in the transformation must appear in the first of the Pair, so seeing :x1 => Groupby(:x4, expr) is at first surprising where :x4 comes from since I did not pass [:x1, :x4] =>

an attempt to enumerate the questions/decisions to be made:

  1. how much broadcasting magic is acceptable? I think when either (each group returns a scalar) or (each group returns a vector with length of the group) it is quite clear what to do. but what if in some edge case, some groups return a scalar and others return a vector --- should they be allowed to broadcast back independently?
  2. GroupBy vs Over - I am mostly ambivalent but do feel like Over reads nicely and is consistent with polars. My reasoning for the argument order is to follow the convention that foo(x, y) often means x foo y like the function occursin. also even in regular groupby, the grouping column is the second argument
  3. Where should the call go / what argument order? on an expr A => B => C, we have three proposals for GroupBy(:id, A => B => C), A => GroupBy(:id, B) => C, and GroupBy(A, :id) => B => C
  4. should combine or GroupedDataFrame be supported, and to what extent

.
regarding the polars mapping_strategy, I think we can try to emulate only group_to_rows and forget the other two. mapping_strategy="join" could be obtained already with this Over proposal simply via a wrapping the resultant vectors in a Ref, and mapping_strategy="explode" is just strange...

.

As for functors - I think it will be too hard to add them. Our API is too different. Such things were designed to be chained.

ok, fair enough (especially for the general case). but wouldn't this be nice for some more "simple" compositions of Over with Filter ! (or maybe named When to keep the adjective theme going)

select(df,
    :value |> Filter(:valid, :recent; operator = ∩) |> Over(:id)
    => timeof ∘ first
    => :group_ready_time,
)

@bkamins
Copy link
Member

bkamins commented Sep 11, 2023

Let us focus on the syntax. I will answer the rest of the issues once this is settled (as this will be a consequence).

Using :x1 |> Over(:x4) is just equivalent to Over(:x4)(:x1), but maybe it is OK for you?

So I understand the proposal is to have:

Over(GROUPING_COLS)(A) => B => C

which is equivalent to:

A |> Over(GROUPING_COLS) => B => C

to be rewritten as:

df -> select(groupby(df, GROUPING_COLS), A => B => C, keepkeys=false)

@adienes
Copy link
Contributor Author

adienes commented Sep 11, 2023

yes, that looks great IMO. Over(A, GROUPING_COLS) => B => C. also seems reasonable if you want to avoid the currying

@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants