proper nest/unnest functions #3672

randomgambit · 2019-06-30T18:11:15Z

Hello,

I am trying to get the equivalent of nest() / unnest() in data.table and I wonder if a new function would make more sense. Consider this little example


tibble(group = c(1,1,1,2,2,2),
      val = c('hello', 'world','hello', 'world','hello', 'world'),
      col = c(1,2,3,4,5,6)) %>% 
  group_by(group) %>% nest() %>% 
  mutate(newval = group + 1)

# A tibble: 2 x 3
  group data             newval
  <dbl> <list>            <dbl>
1     1 <tibble [3 x 2]>      2
2     2 <tibble [3 x 2]>      3

I am not sure the same can be done with data.table. Look at the mysterious output I get after this. Am I missing something?


mydf[, list(listcol=list(data.table(val, col))), by=group][
  , newval := group + 1]
   group    val newval
1:     1 <list>      3
2:     1 <list>      3
3:     1 <list>      3
4:     2 <list>      3
5:     2 <list>      3
6:     2 <list>      3

The text was updated successfully, but these errors were encountered:

franknarf1 · 2019-06-30T18:36:45Z

Re the mysterious output, yeah that does look odd (having newval := group + 1 and yet that output, and also listcol not being a column name...). Could you provide a reproducible example, preferably not using packages that are not required? Some more guidance is here: https://github.com/Rdatatable/data.table/wiki/Support

As far as I know, nest and unnest are feasible in data.table already, though there is a proposal to add a fast unnest: #2146

(Wrapping in list as you have done is nesting. For unnest, there are several options depending on what specifically you want to do. Rep and unlist work for many cases: https://stackoverflow.com/a/38796822)

randomgambit · 2019-06-30T18:42:43Z

sure you can simply use

mydf <- data.table(group = c(1,1,1,2,2,2),
           val = c('hello', 'world','hello', 'world','hello', 'world'),
           col = c(1,2,3,4,5,6))

and this seems to work


> mydf[, list(listcol=list(data.table(val, col))), by=group]
   group      listcol
1:     1 <data.table>
2:     2 <data.table>

but the rest seems to break down the list-column format. Ideally, I would like to use purrr::map() on the nest data.frame. Yes, using the best for both worlds :)

randomgambit · 2019-06-30T18:43:37Z

wow but the nest() ticket has been open for more than 2 years!

moodymudskipper · 2019-07-05T12:55:51Z

@randomgambit your problem is that you used assignment by reference on top of an aggregating call, it seems like data.table doesn't like it too much, this works :

library(data.table)
mydf <- data.table(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 
  
mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group]
mydf [, newval := group + 1]
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

This also works, with the strange effect that mydf needs to called twice to be printed (bug ?) :

mydf <- data.table(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

mydf <- mydf[, .(listcol=list(data.table(val, col))), by=group][, newval := group + 1]
mydf
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

We could create a function .nest as a shortcut :

.nest <- function(...) list(data.table(...))

mydf <- mydf[, .(listcol= .nest(val, col)), by=group]
mydf [, newval := group + 1]
mydf
#>    group      listcol newval
#> 1:     1 <data.table>      2
#> 2:     2 <data.table>      3

As this function doesn't make sense outside of [.data.table it could be defined in its scope only.

Could you be more precise about what you would like to do with purrr::map but feel you can't ?

As for unnest it seems that we can get help from tidyr without any conversion to tibble. It might be possible to get the best of both worlds after all!

library(tidyr)
unnest(mydf)
#>    group newval   val col
#> 1:     1      2 hello   1
#> 2:     1      2 world   2
#> 3:     1      2 hello   3
#> 4:     2      3 world   4
#> 5:     2      3 hello   5
#> 6:     2      3 world   6

# or
mydf[,unnest(.SD)]
#>    group newval   val col
#> 1:     1      2 hello   1
#> 2:     1      2 world   2
#> 3:     1      2 hello   3
#> 4:     2      3 world   4
#> 5:     2      3 hello   5
#> 6:     2      3 world   6

jaapwalhout · 2019-07-05T13:43:10Z

This also works, with the strange effect that mydf needs to called twice to be printed (bug ?)

This is expected behavior since at least version 1.9.6; see here for reference.

randomgambit · 2019-07-05T16:15:50Z

Ah that is pretty great. As for

Could you be more precise about what you would like to do with purrr::map but feel you can't ?

Here is in my opinion what makes tibble superior to DT (for now :)) for data analysis. Look at this wonderful output:

mydf <- tibble(group = c(1,1,1,2,2,2),
                   val = c('hello', 'world','hello', 'world','hello', 'world'),
                   col = c(1,2,3,4,5,6)) 

myreg <- function(df){
  lm(col ~ I(val), data = df)
}

mycoef <- function(obj){
  obj %>% broom::tidy()
}

mydf %>% 
  group_by(group) %>% 
  nest() %>% 
  mutate(myreg = map(data, ~myreg(.x)),
         myoutput = map(myreg, ~mycoef(.x)))

# A tibble: 2 x 4
  group data             myreg    myoutput        
  <dbl> <list>           <list>   <list>          
1     1 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>
2     2 <tibble [3 x 2]> <S3: lm> <tibble [2 x 5]>

Here, by just looking at the R console I can see that myreg is a lm object, and that the regression by group went fine because myoutput contains non-empty tibbles.

I would love being able to encapsulate DTs or tibbles in a DT and get the same functionality. But this requires proper unnest()/nest() functions (and proper printing functions ... see my other github issue #3671).

Here, it is important to be able to unnest myoutput while preserving the other variables (and possibly the other list-column myreg). I am not sure this works right now in DT.

Thanks!

danielarantes · 2019-07-05T17:20:28Z

The only real difference I see in the data.table approach (below what I would do...) is the print doesn't show the size of the objects in the list column.
To unnest you can use tidyr as pointed out earlier.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, coalesce, first, last

mydf <- tibble(group = c(1,1,1,2,2,2),
               val = c('hello', 'world','hello', 'world','hello', 'world'),
               col = c(1,2,3,4,5,6)) 

myreg <- function(df){
  lm(col ~ I(val), data = df)
}

mycoef <- function(obj){
  obj %>% broom::tidy()
}

mydt <- setDT(mydf)

mydt[, .(data = list(.SD)), group]  %>% 
  .[, {
    myreg = lapply(data, FUN = function(x) myreg(x))
    myoutput = lapply(myreg, FUN = function(x) mycoef(x))
    list(data = data, myreg = myreg, myoutput = myoutput)
  }, group]
#>    group         data myreg myoutput
#> 1:     1 <data.table>  <lm> <tbl_df>
#> 2:     2 <data.table>  <lm> <tbl_df>

# OR this other way with more intermediary steps and unnesting at the end

mydt[, .(data = list(.SD)), group] %>% 
  .[, .(data, myreg = lapply(data, FUN = function(x) myreg(x))), group] %>% 
  .[, .(data, myreg, myoutput = lapply(myreg, FUN = function(x) mycoef(x))), group] %>% 
  unnest(myoutput)
#>    group        term     estimate std.error     statistic   p.value
#> 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
#> 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
#> 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
#> 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

^{Created on 2019-07-05 by the reprex package (v0.3.0)}

moodymudskipper · 2019-07-05T17:25:16Z

This is what I thought, we're so used at considering data.table and tidyverse as competing paradigms that we don't see how much they can work hand in hand, and would probably even more if more effort was invested in that direction :

.nest <- function(...) list(data.table(...))

mydt <- as.data.table(mydf)

mydt <- mydt[, .(data = .nest(.SD)),  by = group]

# the following works with lapply but I wanted to show that map works just fine,
# including the formula notation and the potential quasiquotation stuff.

mydt[, myreg := map(data, myreg)][
  , myoutput := map(myreg, mycoef)]

mydt
# group         data myreg myoutput
# 1:     1 <data.table>  <lm> <tbl_df>
# 2:     2 <data.table>  <lm> <tbl_df>

I agree that it prints less nice. And I find the need to switch between assignment by copy and reference annoying, but to be fair it's quite nice and readable.

I tried to use tibble:::print.tbl() on mydt and it failed, but I think it's something the tidyverse team could sort out, it might even be a tidyverse bug in the sense that data.tables are data.frames and it's supposed to support data.frames.

Now to unnest a specific column :

# this removes other list columns but it does the same on your mydf tibble
mydt[,unnest(.SD, myoutput)]
# group        term     estimate std.error     statistic   p.value
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000

# to keep all we can use a join
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]
#    group        term     estimate std.error     statistic   p.value         data myreg myoutput
# 1:     1 (Intercept)  2.00000e+00  1.000000  2.000000e+00 0.2951672 <data.table>  <lm> <tbl_df>
# 2:     1 I(val)world -5.43896e-16  1.732051 -3.140185e-16 1.0000000 <data.table>  <lm> <tbl_df>
# 3:     2 (Intercept)  5.00000e+00  1.414214  3.535534e+00 0.1754797 <data.table>  <lm> <tbl_df>
# 4:     2 I(val)world  8.15844e-16  1.732051  4.710277e-16 1.0000000 <data.table>  <lm> <tbl_df>

moodymudskipper · 2019-07-05T17:30:14Z

ah Daniel I hadn't seen your answer, well that gives more options!

randomgambit · 2019-07-05T17:34:23Z

Very interesting. I am all if for a better integration between DT and tidyverse.
@moodymudskipper when you say:

And I find the need to switch between assignment by copy and reference annoying

Is this related to #3682?

Also, perhaps the best solution is to make tibble:::print.tbl() accept DTs. The more I think about it, the more this looks like a genius option. I can file an issue on the tibble repo if you like.

moodymudskipper · 2019-07-05T17:37:44Z

Actually I was wrong to describe it as a bug, it's an unexported method so it's reasonable that it works on tibbles only, and actually the following works so no need to file anything (except it could support data.table as a special class, as it doesn't now) :

as_tibble(mydt)
# A tibble: 2 x 4
  group data             myreg  myoutput        
  <dbl> <list>           <list> <list>          
1     1 <df[,2] [3 x 2]> <lm>   <tibble [2 x 5]>
2     2 <df[,2] [3 x 2]> <lm>   <tibble [2 x 5]>

moodymudskipper · 2019-07-05T17:49:02Z

The trick to use mydt[TRUE] to make sure we're copying is really nice, and for quick exploration mydt[T] is just 3 more keys.

randomgambit · 2019-07-05T17:58:56Z

got it.

Also this syntax is quite interesting
mydt[,unnest(.SD), .SDcols = c("group","myoutput")][mydt, on="group"]

My understanding is that this only works because mydt is copied on the first place. So if I were to use := inside the first call then I should necessarily use the mydf[TRUE] insane hack :)

jangorecki · 2019-07-05T19:03:33Z

Calling it hack is not appropriate, each data table query that doesn't update by reference will copy. The hack about TRUE is that it does shallow copy, not really relevant here. Making copies on [ is a regular behaviour.

MichaelChirico · 2019-07-06T06:34:25Z

Thanks Jan, I was thinking 1:.N and NULL were the same behavior... so actually DT[TRUE] and DT[1:.N]/DT[NULL] have different effects (the latter being full copies)

jangorecki · 2019-07-30T20:13:29Z

I think we can close this issue as duplicate of #2146. We don't need fast nest function because AFAIU it is a matter of wrapping in a list. Any comments?

moodymudskipper mentioned this issue Jul 5, 2019

Two features for data.table one liners #3681

Closed

DavidArenburg added the feature request label Jul 8, 2019

jangorecki changed the title ~~feature request: proper nest/unnest functions~~ proper nest/unnest functions Jul 30, 2019

mgirlich mentioned this issue Jul 2, 2021

Add unnest() tidyverse/dtplyr#266

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proper nest/unnest functions #3672

proper nest/unnest functions #3672

randomgambit commented Jun 30, 2019

franknarf1 commented Jun 30, 2019 •

edited

Loading

randomgambit commented Jun 30, 2019

randomgambit commented Jun 30, 2019

moodymudskipper commented Jul 5, 2019 •

edited

Loading

jaapwalhout commented Jul 5, 2019 •

edited

Loading

randomgambit commented Jul 5, 2019 •

edited

Loading

danielarantes commented Jul 5, 2019

moodymudskipper commented Jul 5, 2019

moodymudskipper commented Jul 5, 2019

randomgambit commented Jul 5, 2019

moodymudskipper commented Jul 5, 2019 •

edited

Loading

moodymudskipper commented Jul 5, 2019

randomgambit commented Jul 5, 2019

jangorecki commented Jul 5, 2019

MichaelChirico commented Jul 6, 2019

jangorecki commented Jul 30, 2019

proper nest/unnest functions #3672

proper nest/unnest functions #3672

Comments

randomgambit commented Jun 30, 2019

franknarf1 commented Jun 30, 2019 • edited Loading

randomgambit commented Jun 30, 2019

randomgambit commented Jun 30, 2019

moodymudskipper commented Jul 5, 2019 • edited Loading

jaapwalhout commented Jul 5, 2019 • edited Loading

randomgambit commented Jul 5, 2019 • edited Loading

danielarantes commented Jul 5, 2019

moodymudskipper commented Jul 5, 2019

moodymudskipper commented Jul 5, 2019

randomgambit commented Jul 5, 2019

moodymudskipper commented Jul 5, 2019 • edited Loading

moodymudskipper commented Jul 5, 2019

randomgambit commented Jul 5, 2019

jangorecki commented Jul 5, 2019

MichaelChirico commented Jul 6, 2019

jangorecki commented Jul 30, 2019

franknarf1 commented Jun 30, 2019 •

edited

Loading

moodymudskipper commented Jul 5, 2019 •

edited

Loading

jaapwalhout commented Jul 5, 2019 •

edited

Loading

randomgambit commented Jul 5, 2019 •

edited

Loading

moodymudskipper commented Jul 5, 2019 •

edited

Loading