Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group Indices function #1704

Closed
matthieugomez opened this issue Jan 30, 2019 · 21 comments
Closed

Group Indices function #1704

matthieugomez opened this issue Jan 30, 2019 · 21 comments
Labels
breaking The proposed change is breaking.
Milestone

Comments

@matthieugomez
Copy link
Contributor

matthieugomez commented Jan 30, 2019

It would be great to have a function that returns a vector of unique identifier for each group of a GroupedDataframe (as a PooledDataArray or as a CategoricalVector). Similar functions exist in other packages: group_indices in dplyr, .GRP in data.table, group in Stata. In particular, it would make it easier for external packages to benefit from the all recent work done in group_rows.

@bkamins
Copy link
Member

bkamins commented Jan 30, 2019

You can get it like this:

getindex.(gd, 1, (gd.cols,))

or like this

copy.(getindex.(gd, 1, (gd.cols,)))

But I agree that it would be good to define a function that does this in the package (recently we have discussed this with @nalimilan).

What name for such a function would you see and should it return DataFrameRows or rather NamedTuples?

@nalimilan
Copy link
Member

@matthieugomez Can you detail what kind of output you would expect? A single vector giving groups? Then it's present as integer indices in gd.groups. Or do you need labels too?

@bkamins
Copy link
Member

bkamins commented Jan 30, 2019

Yes - now I see that dplyr function group_indices is something we have as groups field.
Maybe it would be good to expose it also (probably wrapping in ReadOnlyArray).

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Jan 30, 2019

Oh yes you're right, an integer vector would be enough. The only thing is that groups with missing values should return a certain integer (0 or missing or some other value, as long as it is documented).

@nalimilan
Copy link
Member

OK. We already use 0 for missing values when skipmissing=true. I guess we could add a group_indices or groupindices function (unless we can override an already existing and more standard function for that).

@matthieugomez
Copy link
Contributor Author

Thinking about it, maybe it would be safer to return a missing value for groups where one variable is missing when skipmissing = true.

@nalimilan
Copy link
Member

We could do that, but that would require allocating a new vector. What do other implementations do?

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Jan 30, 2019

By default, Stata returns missing values in case one variable has a missing value. With the option missing, missing values are considered like distinct values (so, for the case of two variables missing, 1 is one integer, 1, missing is another integer).
It seems like dplyr and datatable always consider missing values as distinct values.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Jan 30, 2019

By the way, I actually think it is a bit weird that groups with a missing value are included by default in groupby by etc no? Would not it be less error-prone to remove them by default? I think Stata and Panda remove them, but R dplyr and datatable keep them.

@bkamins
Copy link
Member

bkamins commented Jan 31, 2019

It think it is very good that you raise these issues before DataFrames.jl 1.0, as now we have time to change the API if we want.

From my side I am happy with the current functionality, that is:

  • by default we include missing (actually I fint it better, because you do not omit something by accident)
  • we treat something as missing when skipmissing=true if at least one value is missing, in this case all "missing" observations have the same code - 0 - as opposed to the case above when they would get different codes i.e. in case 1, missing and missing, 1
  • we treat a missing as a distinct value when group-by is by multiple columns, e.g.
julia> df = DataFrame(a=[1, missing, 3, missing], b=[missing, 2, 3, missing], c = [1, 2, 3, 4])
4×3 DataFrame
│ Row │ a       │ b       │ c     │
│     │ Int64⍰  │ Int64⍰  │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ missing │ 1     │
│ 2   │ missing │ 2       │ 2     │
│ 3   │ 3       │ 3       │ 3     │
│ 4   │ missing │ missing │ 4     │

julia> gd = collect(groupby(df, [1,2]))
4-element Array{Any,1}:
 1×3 SubDataFrame
│ Row │ a      │ b       │ c     │
│     │ Int64⍰ │ Int64⍰  │ Int64 │
├─────┼────────┼─────────┼───────┤
│ 1   │ 1      │ missing │ 1     │
 1×3 SubDataFrame
│ Row │ a       │ b      │ c     │
│     │ Int64⍰  │ Int64⍰ │ Int64 │
├─────┼─────────┼────────┼───────┤
│ 1   │ missing │ 2      │ 2     │
 1×3 SubDataFrame
│ Row │ a      │ b      │ c     │
│     │ Int64⍰ │ Int64⍰ │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ 3      │ 3      │ 3     │
 1×3 SubDataFrame
│ Row │ a       │ b       │ c     │
│     │ Int64⍰  │ Int64⍰  │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ missing │ missing │ 4     │

All these options are "safe" IMHO. What would you change here?

However, what I would add is (here a feedback from you would be valuable if you also find them useful):

  • a function exposing groups variable (as accessing internal undocumented field is not a proper way to go)
  • a function allowing to get the values of grouping variables in each group (the thing I initially thought you wanted)

@nalimilan
Copy link
Member

Yes, in Julia missing values are never skipped by default, so doing that with groupby is consistent (and safer).

a function exposing groups variable (as accessing internal undocumented field is not a proper way to go)
+1

a function allowing to get the values of grouping variables in each group (the thing I initially thought you wanted)

Is there prior art for this? I'm not sure how to call it nor what it should return (a vector of tuples? a data frame?).

@bkamins
Copy link
Member

bkamins commented Jan 31, 2019

Regarding the first function - can we call it groups? If we did not want to wrap it we could have:

groups(gd) = gd.groups

Then in dplyr you can get information about grouping variables, we could have something like:

groupvars(gd) = gd.cols

Regarding the second - initially I thought about returning a vector of DataFrameRows or NamedTuples as proposed above:

getindex.(gd, 1, (gd.cols,))

or

copy.(getindex.(gd, 1, (gd.cols,)))

I could not find a similar function in other packages, but now when I am thinking about it actually combine(gd) could return a DataFrame holding only grouping variables - one row per group. This would be something like combine(gd, x->nothing) but without creating this extra column with nothing values.

@matthieugomez
Copy link
Contributor Author

matthieugomez commented Jan 31, 2019

yes, exactly, I would like a function that exposes the groups variable.
About the skipmissing stuff, I just checked and all other softwares (except panda I think) do like you do by default, so I guess this is the right default.

The only remaining question is whether the function should return missing values when skipmissing = true. I think it would be better to do so but I'm fine either way.

@nalimilan
Copy link
Member

groups and groupvars sounds fine, sticking to the dplyr terminology is a good idea when we don't have a reason to diverge (though for some obscure reason groups also returns the names of variables). groups should probably replace 0 with missing when skipmissing=true (otherwise 0 isn't used).

Using combine(df) like you suggest is an interesting idea. IIUC that would be equivalent to combine(df -> df[1, gd.cols], gd), right? However it's hard to decide what's the most logical behavior for combine(df). Having it equivalent to DataFrame(combine) (like ungroup in dplyr) would also be appealing: the sequence df |> groupby(:key) |> map(:x => mean) |> combine() is natural ("split-apply-combine"), even if using map is less efficient than combine(df, :x => mean).

@bkamins
Copy link
Member

bkamins commented Jan 31, 2019

The issue is that combine(df -> df[1, gd.cols], gd) would not be right as it duplicates the columns. The ways I have proposed above do what is intended. Using combine or map it is not possible as they always expect to add some columns.

groups should probably replace 0 with missing when skipmissing=true (otherwise 0 isn't used).

I agree that having missing is more consistent logically, but having 0 will be faster.

it's hard to decide what's the most logical behavior for combine(df)

For me the logical is what I propose (i.e. we computed no columns so only grouping columns are retained as we always retain them - note that e.g. this is the same what you do if you have zero rows of data). But I agree that this is not necessarily most useful feature, so I will not stick to it very hard if there are other opinions.

@matthieugomez
Copy link
Contributor Author

just to clarify, group in dplyr returns the name of the groups,group_indices is the function returning an integer vector.

@bkamins
Copy link
Member

bkamins commented Jan 31, 2019

Yes, but do we want to keep these names. group_indices is intuitive and we could keep it (probably groupindices), but groups is unclear, so I have proposed groupvars.

@nalimilan
Copy link
Member

OK, groupindices and groupvars sounds good (actually in dply groups and group_vars do almost the same thing, but have a different return type).

Regarding the treatment of missing values, we can still have an argument to use 0, but by default it would be more consistent to use missing. The performance impact should be negligible compared to the cost of grouping anyway.

The issue is that combine(df -> df[1, gd.cols], gd) would not be right as it duplicates the columns. The ways I have proposed above do what is intended. Using combine or map it is not possible as they always expect to add some columns.

Ah, right. We really need to fix this.

For me the logical is what I propose (i.e. we computed no columns so only grouping columns are retained as we always retain them - note that e.g. this is the same what you do if you have zero rows of data). But I agree that this is not necessarily most useful feature, so I will not stick to it very hard if there are other opinions.

I agree about the columns, but the question is what rows to return. Returning one row per group assumes a summary function is the most natural operation, but combine supports any kind of operation, so we could also return as many rows as in the input data frame (or 2 rows per group, or anything).

@pdeffebach
Copy link
Contributor

relevant discussion also at #1693.

@bkamins
Copy link
Member

bkamins commented Feb 1, 2019

groupindices and groupvars

OK - I will open a PR adding them. For groupindices I will transform 0 to missing internally and create a new vector.

We really need to fix this.

I will not touch it (and leave you to decide how to implement it when you do other improvements to grouping code)

but the question is what rows to return

good point. So for now we can simply keep it deprecated and decide later.

relevant discussion also at #1693.

Agreed - that is why I propose to leave it for later to decide by @nalimilan. What is simple to decide to add groupindices and groupvars functions and that is why I propose to push a PR adding them (I shall open this PR today).

@bkamins bkamins added the breaking The proposed change is breaking. label Feb 12, 2020
@bkamins bkamins added this to the 1.0 milestone Feb 12, 2020
@bkamins
Copy link
Member

bkamins commented Feb 12, 2020

It needs to be sorted out if all issues raised here are resolved (some of them are, I am not sure if all), so I mark it for 1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking.
Projects
None yet
Development

No branches or pull requests

4 participants