Group Indices function #1704

matthieugomez · 2019-01-30T14:50:04Z

It would be great to have a function that returns a vector of unique identifier for each group of a GroupedDataframe (as a PooledDataArray or as a CategoricalVector). Similar functions exist in other packages: group_indices in dplyr, .GRP in data.table, group in Stata. In particular, it would make it easier for external packages to benefit from the all recent work done in group_rows.

The text was updated successfully, but these errors were encountered:

bkamins · 2019-01-30T15:14:01Z

You can get it like this:

getindex.(gd, 1, (gd.cols,))

or like this

copy.(getindex.(gd, 1, (gd.cols,)))

But I agree that it would be good to define a function that does this in the package (recently we have discussed this with @nalimilan).

What name for such a function would you see and should it return DataFrameRows or rather NamedTuples?

nalimilan · 2019-01-30T15:18:51Z

@matthieugomez Can you detail what kind of output you would expect? A single vector giving groups? Then it's present as integer indices in gd.groups. Or do you need labels too?

bkamins · 2019-01-30T15:21:26Z

Yes - now I see that dplyr function group_indices is something we have as groups field.
Maybe it would be good to expose it also (probably wrapping in ReadOnlyArray).

matthieugomez · 2019-01-30T15:25:02Z

Oh yes you're right, an integer vector would be enough. The only thing is that groups with missing values should return a certain integer (0 or missing or some other value, as long as it is documented).

nalimilan · 2019-01-30T15:29:48Z

OK. We already use 0 for missing values when skipmissing=true. I guess we could add a group_indices or groupindices function (unless we can override an already existing and more standard function for that).

matthieugomez · 2019-01-30T15:33:03Z

Thinking about it, maybe it would be safer to return a missing value for groups where one variable is missing when skipmissing = true.

nalimilan · 2019-01-30T15:56:14Z

We could do that, but that would require allocating a new vector. What do other implementations do?

matthieugomez · 2019-01-30T21:50:03Z

By default, Stata returns missing values in case one variable has a missing value. With the option missing, missing values are considered like distinct values (so, for the case of two variables missing, 1 is one integer, 1, missing is another integer).
It seems like dplyr and datatable always consider missing values as distinct values.

matthieugomez · 2019-01-30T22:22:54Z

By the way, I actually think it is a bit weird that groups with a missing value are included by default in groupby by etc no? Would not it be less error-prone to remove them by default? I think Stata and Panda remove them, but R dplyr and datatable keep them.

bkamins · 2019-01-31T08:02:30Z

It think it is very good that you raise these issues before DataFrames.jl 1.0, as now we have time to change the API if we want.

From my side I am happy with the current functionality, that is:

by default we include missing (actually I fint it better, because you do not omit something by accident)
we treat something as missing when skipmissing=true if at least one value is missing, in this case all "missing" observations have the same code - 0 - as opposed to the case above when they would get different codes i.e. in case 1, missing and missing, 1
we treat a missing as a distinct value when group-by is by multiple columns, e.g.

julia> df = DataFrame(a=[1, missing, 3, missing], b=[missing, 2, 3, missing], c = [1, 2, 3, 4])
4×3 DataFrame
│ Row │ a       │ b       │ c     │
│     │ Int64⍰  │ Int64⍰  │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ 1       │ missing │ 1     │
│ 2   │ missing │ 2       │ 2     │
│ 3   │ 3       │ 3       │ 3     │
│ 4   │ missing │ missing │ 4     │

julia> gd = collect(groupby(df, [1,2]))
4-element Array{Any,1}:
 1×3 SubDataFrame
│ Row │ a      │ b       │ c     │
│     │ Int64⍰ │ Int64⍰  │ Int64 │
├─────┼────────┼─────────┼───────┤
│ 1   │ 1      │ missing │ 1     │
 1×3 SubDataFrame
│ Row │ a       │ b      │ c     │
│     │ Int64⍰  │ Int64⍰ │ Int64 │
├─────┼─────────┼────────┼───────┤
│ 1   │ missing │ 2      │ 2     │
 1×3 SubDataFrame
│ Row │ a      │ b      │ c     │
│     │ Int64⍰ │ Int64⍰ │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ 3      │ 3      │ 3     │
 1×3 SubDataFrame
│ Row │ a       │ b       │ c     │
│     │ Int64⍰  │ Int64⍰  │ Int64 │
├─────┼─────────┼─────────┼───────┤
│ 1   │ missing │ missing │ 4     │

All these options are "safe" IMHO. What would you change here?

However, what I would add is (here a feedback from you would be valuable if you also find them useful):

a function exposing groups variable (as accessing internal undocumented field is not a proper way to go)
a function allowing to get the values of grouping variables in each group (the thing I initially thought you wanted)

nalimilan · 2019-01-31T08:15:32Z

Yes, in Julia missing values are never skipped by default, so doing that with groupby is consistent (and safer).

a function exposing groups variable (as accessing internal undocumented field is not a proper way to go)
+1

a function allowing to get the values of grouping variables in each group (the thing I initially thought you wanted)

Is there prior art for this? I'm not sure how to call it nor what it should return (a vector of tuples? a data frame?).

bkamins · 2019-01-31T09:51:23Z

Regarding the first function - can we call it groups? If we did not want to wrap it we could have:

groups(gd) = gd.groups

Then in dplyr you can get information about grouping variables, we could have something like:

groupvars(gd) = gd.cols

Regarding the second - initially I thought about returning a vector of DataFrameRows or NamedTuples as proposed above:

getindex.(gd, 1, (gd.cols,))

or

copy.(getindex.(gd, 1, (gd.cols,)))

I could not find a similar function in other packages, but now when I am thinking about it actually combine(gd) could return a DataFrame holding only grouping variables - one row per group. This would be something like combine(gd, x->nothing) but without creating this extra column with nothing values.

matthieugomez · 2019-01-31T17:11:51Z

yes, exactly, I would like a function that exposes the groups variable.
About the skipmissing stuff, I just checked and all other softwares (except panda I think) do like you do by default, so I guess this is the right default.

The only remaining question is whether the function should return missing values when skipmissing = true. I think it would be better to do so but I'm fine either way.

nalimilan · 2019-01-31T17:29:46Z

groups and groupvars sounds fine, sticking to the dplyr terminology is a good idea when we don't have a reason to diverge (though for some obscure reason groups also returns the names of variables). groups should probably replace 0 with missing when skipmissing=true (otherwise 0 isn't used).

Using combine(df) like you suggest is an interesting idea. IIUC that would be equivalent to combine(df -> df[1, gd.cols], gd), right? However it's hard to decide what's the most logical behavior for combine(df). Having it equivalent to DataFrame(combine) (like ungroup in dplyr) would also be appealing: the sequence df |> groupby(:key) |> map(:x => mean) |> combine() is natural ("split-apply-combine"), even if using map is less efficient than combine(df, :x => mean).

bkamins · 2019-01-31T17:52:45Z

The issue is that combine(df -> df[1, gd.cols], gd) would not be right as it duplicates the columns. The ways I have proposed above do what is intended. Using combine or map it is not possible as they always expect to add some columns.

groups should probably replace 0 with missing when skipmissing=true (otherwise 0 isn't used).

I agree that having missing is more consistent logically, but having 0 will be faster.

it's hard to decide what's the most logical behavior for combine(df)

For me the logical is what I propose (i.e. we computed no columns so only grouping columns are retained as we always retain them - note that e.g. this is the same what you do if you have zero rows of data). But I agree that this is not necessarily most useful feature, so I will not stick to it very hard if there are other opinions.

matthieugomez · 2019-01-31T18:22:39Z

just to clarify, group in dplyr returns the name of the groups,group_indices is the function returning an integer vector.

bkamins · 2019-01-31T18:23:52Z

Yes, but do we want to keep these names. group_indices is intuitive and we could keep it (probably groupindices), but groups is unclear, so I have proposed groupvars.

nalimilan · 2019-02-01T10:48:14Z

OK, groupindices and groupvars sounds good (actually in dply groups and group_vars do almost the same thing, but have a different return type).

Regarding the treatment of missing values, we can still have an argument to use 0, but by default it would be more consistent to use missing. The performance impact should be negligible compared to the cost of grouping anyway.

The issue is that combine(df -> df[1, gd.cols], gd) would not be right as it duplicates the columns. The ways I have proposed above do what is intended. Using combine or map it is not possible as they always expect to add some columns.

Ah, right. We really need to fix this.

For me the logical is what I propose (i.e. we computed no columns so only grouping columns are retained as we always retain them - note that e.g. this is the same what you do if you have zero rows of data). But I agree that this is not necessarily most useful feature, so I will not stick to it very hard if there are other opinions.

I agree about the columns, but the question is what rows to return. Returning one row per group assumes a summary function is the most natural operation, but combine supports any kind of operation, so we could also return as many rows as in the input data frame (or 2 rows per group, or anything).

pdeffebach · 2019-02-01T13:06:21Z

relevant discussion also at #1693.

bkamins · 2019-02-01T13:08:29Z

groupindices and groupvars

OK - I will open a PR adding them. For groupindices I will transform 0 to missing internally and create a new vector.

We really need to fix this.

I will not touch it (and leave you to decide how to implement it when you do other improvements to grouping code)

but the question is what rows to return

good point. So for now we can simply keep it deprecated and decide later.

relevant discussion also at #1693.

Agreed - that is why I propose to leave it for later to decide by @nalimilan. What is simple to decide to add groupindices and groupvars functions and that is why I propose to push a PR adding them (I shall open this PR today).

bkamins · 2020-02-12T12:22:33Z

It needs to be sorted out if all issues raised here are resolved (some of them are, I am not sure if all), so I mark it for 1.0.

bkamins mentioned this issue Feb 1, 2019

Fix getindex and add groupvars and groupindices #1709

Merged

bkamins added the breaking The proposed change is breaking. label Feb 12, 2020

bkamins added this to the 1.0 milestone Feb 12, 2020

bkamins closed this as completed Feb 21, 2020

nalimilan mentioned this issue Mar 26, 2020

[BREAKING] add groupcols and valuecols functions #2167

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Group Indices function #1704

Group Indices function #1704

matthieugomez commented Jan 30, 2019 •

edited

Loading

bkamins commented Jan 30, 2019

nalimilan commented Jan 30, 2019

bkamins commented Jan 30, 2019

matthieugomez commented Jan 30, 2019 •

edited

Loading

nalimilan commented Jan 30, 2019

matthieugomez commented Jan 30, 2019

nalimilan commented Jan 30, 2019

matthieugomez commented Jan 30, 2019 •

edited

Loading

matthieugomez commented Jan 30, 2019 •

edited

Loading

bkamins commented Jan 31, 2019

nalimilan commented Jan 31, 2019

bkamins commented Jan 31, 2019

matthieugomez commented Jan 31, 2019 •

edited

Loading

nalimilan commented Jan 31, 2019

bkamins commented Jan 31, 2019

matthieugomez commented Jan 31, 2019

bkamins commented Jan 31, 2019

nalimilan commented Feb 1, 2019

pdeffebach commented Feb 1, 2019

bkamins commented Feb 1, 2019

bkamins commented Feb 12, 2020

Group Indices function #1704

Group Indices function #1704

Comments

matthieugomez commented Jan 30, 2019 • edited Loading

bkamins commented Jan 30, 2019

nalimilan commented Jan 30, 2019

bkamins commented Jan 30, 2019

matthieugomez commented Jan 30, 2019 • edited Loading

nalimilan commented Jan 30, 2019

matthieugomez commented Jan 30, 2019

nalimilan commented Jan 30, 2019

matthieugomez commented Jan 30, 2019 • edited Loading

matthieugomez commented Jan 30, 2019 • edited Loading

bkamins commented Jan 31, 2019

nalimilan commented Jan 31, 2019

bkamins commented Jan 31, 2019

matthieugomez commented Jan 31, 2019 • edited Loading

nalimilan commented Jan 31, 2019

bkamins commented Jan 31, 2019

matthieugomez commented Jan 31, 2019

bkamins commented Jan 31, 2019

nalimilan commented Feb 1, 2019

pdeffebach commented Feb 1, 2019

bkamins commented Feb 1, 2019

bkamins commented Feb 12, 2020

matthieugomez commented Jan 30, 2019 •

edited

Loading

matthieugomez commented Jan 30, 2019 •

edited

Loading

matthieugomez commented Jan 30, 2019 •

edited

Loading

matthieugomez commented Jan 30, 2019 •

edited

Loading

matthieugomez commented Jan 31, 2019 •

edited

Loading