Group Indices function #1704
Comments
You can get it like this:
or like this
But I agree that it would be good to define a function that does this in the package (recently we have discussed this with @nalimilan). What name for such a function would you see and should it return |
@matthieugomez Can you detail what kind of output you would expect? A single vector giving groups? Then it's present as integer indices in |
Yes - now I see that |
Oh yes you're right, an integer vector would be enough. The only thing is that groups with missing values should return a certain integer (0 or missing or some other value, as long as it is documented). |
OK. We already use |
Thinking about it, maybe it would be safer to return a missing value for groups where one variable is missing when |
We could do that, but that would require allocating a new vector. What do other implementations do? |
By default, Stata returns missing values in case one variable has a missing value. With the option missing, missing values are considered like distinct values (so, for the case of two variables |
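To make the two behaviours described above concrete, here is a minimal, language-neutral sketch in plain Python. It is not the DataFrames.jl API: the function name `group_indices`, the `missing` keyword, the use of `None` as the missing marker, and numbering groups by first appearance are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Row = Tuple[Optional[Hashable], ...]


def group_indices(rows: Sequence[Row], missing: bool = False) -> List[Optional[int]]:
    """Return one group id per row (hypothetical helper, not a library function).

    Ids are 1-based and assigned in order of first appearance.
    `None` plays the role of a missing value.
    """
    ids: Dict[Row, int] = {}
    out: List[Optional[int]] = []
    for row in rows:
        # Default behaviour: a row with any missing key gets a missing group id.
        if not missing and any(value is None for value in row):
            out.append(None)
            continue
        # With missing=True, None is treated like any other distinct key value.
        if row not in ids:
            ids[row] = len(ids) + 1
        out.append(ids[row])
    return out


rows = [("a", 1), ("b", None), ("a", 1), (None, None), ("b", None)]
default_ids = group_indices(rows)                     # missing keys -> None
with_missing_ids = group_indices(rows, missing=True)  # missing keys form groups
```

With the default, every row that has a missing key gets `None`; with `missing=True`, `("b", None)` and `(None, None)` become two separate groups.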
By the way, I actually think it is a bit weird that groups with a missing value are included by default in |
I think it is very good that you raise these issues before DataFrames.jl 1.0, as now we have time to change the API if we want. From my side I am happy with the current functionality, that is:
All these options are "safe" IMHO. What would you change here? However, what I would add is (here a feedback from you would be valuable if you also find them useful):
|
Yes, in Julia missing values are never skipped by default, so doing that with
Is there prior art for this? I'm not sure how to call it nor what it should return (a vector of tuples? a data frame?). |
Regarding the first function - can we call it
Then in dplyr you can get information about grouping variables, we could have something like:
Regarding the second - initially I thought about returning a vector of
or
I could not find a similar function in other packages, but now when I am thinking about it actually |
yes, exactly, I would like a function that exposes the groups variable. The only remaining question is whether the function should return missing values when |
Using |
The issue is that
I agree that having
For me the logical choice is what I propose (i.e. we computed no columns, so only the grouping columns are retained, as we always retain them; note that this is the same as what you get if you have zero rows of data). But I agree that this is not necessarily the most useful feature, so I will not insist on it if there are other opinions. |
just to clarify, |
Yes, but do we want to keep these names? |
OK. Regarding the treatment of missing values, we can still have an argument to use
Ah, right. We really need to fix this.
I agree about the columns, but the question is what rows to return. Returning one row per group assumes a summary function is the most natural operation, but |
relevant discussion also at #1693. |
OK - I will open a PR adding them. For
I will not touch it (and leave you to decide how to implement it when you do other improvements to grouping code)
good point. So for now we can simply keep it deprecated and decide later.
Agreed - that is why I propose to leave it for later to decide by @nalimilan. What is simple to decide to add |
It needs to be sorted out if all issues raised here are resolved (some of them are, I am not sure if all), so I mark it for 1.0. |
It would be great to have a function that returns a vector of unique identifiers for each group of a GroupedDataFrame (as a `PooledDataArray` or as a `CategoricalVector`). Similar functions exist in other packages: `group_indices` in `dplyr`, `.GRP` in `data.table`, `group` in Stata. In particular, it would make it easier for external packages to benefit from all the recent work done in `group_rows`.