All non-atomic types are now converted to lists in rbindlist #4196

sritchie73 · 2020-01-24T12:11:47Z

Closes #546 and fixes #3811

Example:

> A = data.table(c1 = 1, c2 = 'asd', c3 = expression(as.character(Sys.time())))
> B = data.table(c1 = 3, c2 = 'qwe', c3 = expression(as.character(Sys.time()+5)))
> rbind(A,B)
      c1     c2              c3
   <num> <char>          <list>
1:     1    asd <expression[1]>
2:     3    qwe <expression[1]>
> rbind(A,B)$c3
[[1]]
expression(as.character(Sys.time()))

[[2]]
expression(as.character(Sys.time() + 5))

A consequence of this PR is that where a data.table has > 1 row and an expression column, each expression is separated out into its own list element:

> A = data.table(c1 = 1, c2 = 'asd', c3 = expression(as.character(Sys.time())))
> B = data.table(c1 = 3:5, c2 = c('qwe', "foo", "bar"), 
+                c3 = expression(1+1, print("test"), as.character(Sys.time()+5)))
> rbind(A,B)
      c1     c2              c3
   <num> <char>          <list>
1:     1    asd <expression[1]>
2:     3    qwe <expression[1]>
3:     4    foo <expression[1]>
4:     5    bar <expression[1]>
> rbind(A,B)$c3
[[1]]
expression(as.character(Sys.time()))

[[2]]
expression(1 + 1)

[[3]]
expression(print("test"))

[[4]]
expression(as.character(Sys.time() + 5))

I don't know if this is desired behaviour, or whether we should default back to base R's coercion rules (i.e. returning an expression vector) - presumably if the user has constructed an expression vector column of length > 1 they would desire an expression vector back? E.g. #4040 is requesting expression columns to be preserved.

Any non-allowed column not listed in the typeorder in src/init.c are now appropriately detected in src/rbindlist.c. In these cases, the column type becomes a list, with each element wrapped in a list.

src/rbindlist.c

codecov · 2020-01-24T12:21:03Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (c005296) 99.61% compared to head (00c6d85) 99.59%.
Report is 714 commits behind head on master.

Files	Patch %	Lines
src/utils.c	95.23%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4196      +/-   ##
==========================================
- Coverage   99.61%   99.59%   -0.02%     
==========================================
  Files          72       72              
  Lines       13871    13938      +67     
==========================================
+ Hits        13817    13882      +65     
- Misses         54       56       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/rbindlist.c

sritchie73 · 2020-01-24T17:39:46Z

src/rbindlist.c

@@ -514,9 +514,33 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg)
        if (w==-1 || !length(thisCol=VECTOR_ELT(li, w))) {  // !length for zeroCol warning above; #1871
          writeNA(target, ansloc, thisnrow);  // writeNA is integer64 aware and writes INT64_MIN
        } else {
-          if ((TYPEOF(target)==VECSXP || TYPEOF(target)==EXPRSXP) && TYPEOF(thisCol)!=TYPEOF(target)) {
+          if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) {


Assuming the desired behaviour here is for a pairlist to be converted to a list, rather than being converted to a list where each element is a 1-element pairlist.

example of both would help

I think this makes sense - my understanding is a pairlist is a linked-list, while a regular list is essentially a vector of pointers to the list elements. If we treat a LISTSXP the same way as an EXPRSXP, then you end up with a list, where each element is a pointer to each individual node in the linked list. You end up lose any benefits of having the linked list, because each list element is just a pointer to a 1-node linked list.

E.g the result is something like:

# Linked list, A -> B -> C ll = pairlist(A=1, B=pairlist(1, 2:3), C='foo') # Current rbindlist behaviour turns top level pairlist into list (losing the names) cb = list(1, pairlist(1, 2:3), 'foo') # Alternate rbindlist behaviour if we treat a pairlist in the same way as expression: ab = list(pairlist(A=1), pairlist(B=pairlist(1, 2:3)), pairlist(C='foo'))

sritchie73 · 2020-01-24T17:42:50Z

src/rbindlist.c

+            SEXP thisColList = PROTECT(allocVector(VECSXP, length(thisCol))); nprotect++;
+            for(int r=0; r<length(thisCol); ++r) {
+              SET_VECTOR_ELT(thisColList, r, thisCol);
+            }


Demonstration of the recycling issue and why we need this for loop:

> A = data.table(a=0L, B=quote(1+1)) > A a B <int> <call> 1: 0 1 + 1 2: 0 1 + 1 3: 0 1 + 1

The for loop means we get the right result with rbind:

> rbind(A) a B <int> <list> 1: 0 <call[3]> 2: 0 <call[3]> 3: 0 <call[3]>

Otherwise we would have

> rbind(A) a B <int> <list> 1: 0 <call[3]> 2: 0 3: 0

It would be good to fix this in data.table()/as.data.table() so that these non-vector types are wrapped in list at the time of data.table construction

agree, very nice demo

sritchie73 · 2020-01-24T17:46:42Z

I've also come across this unexpected behaviour:

> A = data.table(c1=0, c2=list(1:3))
> B = data.table(c1=1)
> rbind(A,B,fill=TRUE)
      c1     c2
   <num> <list>
1:     0  1,2,3
2:     1

When fill=TRUE and the missing column is list type, then the corresponding entries are filled with NULL elements. Should these be filled with NA_logical_ instead?.

jangorecki · 2020-01-25T02:32:33Z

src/rbindlist.c

-            // Exotic non-atomic types need each element to be wrapped in a list, e.g. expression vectors #546
+          if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) {
+            // do an as.list() on the atomic column; #3528
+            // pairlists (LISTSXP) can also be coerced to lists using coerceVector


does it coerce recursively unwrapping nested parilists into single list of many elements?

Only the top level pairlist is coerced to a list:

> A = data.table(c1 = 1:3, c2 = pairlist(A=1, B=list(1, 2:3), C=pairlist(C1=1, C2=3:4))) > B = data.table(c1 = 2, c2 = 'asd') > rbind(A,B) c1 c2 <num> <list> 1: 1 1 2: 2 <list[2]> 3: 3 <pairlist[2]> 4: 2 asd

jangorecki · 2020-01-25T02:37:23Z

Filling with NULL in such a case looks seems fine for me.

sritchie73 · 2020-01-25T11:31:04Z

It departs from the expected behaviour of fill=TRUE for all other column types though.

sritchie73 · 2020-01-27T12:04:54Z

TODO: list wrapper of atomic vectors is not class aware.

I.e. plan to use in Casmatrix in #4144

sritchie73 · 2020-01-27T17:18:27Z

Wrapping atomic vectors in list now preserves class. The integer64 case errors because memrecycle copies the integer64 class to the list

TODO: fix or handle

sritchie73 · 2020-01-27T23:24:19Z

Wrapping atomic vectors in list now correctly preserves class.

src/rbindlist.c

jangorecki · 2020-04-05T19:25:55Z

does this PR needs update after #546 (comment) ? @sritchie73

sritchie73 · 2020-05-18T10:22:51Z

@jangorecki I'm not sure: I think the issue would be that if you do not wrap each expression in a list, then the length of the expression column will not match the length of the other non-expression columns.

jangorecki · 2020-05-18T10:26:09Z

unless length matches, that would happen when each row is meant to store a language object, rather than expression.

jangorecki · 2020-05-18T10:31:48Z

I was going to provide simple example but I noticed that expression looks to be recycled, making the results not what I really wanted

l=list(x=1:2, y=2:3, expr=expression(1+2,2+3))
sapply(l, length)
#   x    y expr 
#   2    2    2 
as.data.table(l)
#       x     y                     expr
#   <int> <int>                   <expr>
#1:     1     2 expression(1 + 2, 2 + 3)
#2:     2     3 expression(1 + 2, 2 + 3)

but after closer look, it seems that the issue is in print data.table, because expr column is not recycled

as.data.table(l)$expr
expression(1 + 2, 2 + 3)

@MichaelChirico any idea if there is an open issue for printing expression column behavior?

jangorecki · 2020-05-18T10:37:21Z

If expression would always store the language objects then using it a column would probably make more sense.
But the following behavior makes me think it is just easier to convert that to list

lapply(expression(1+2,3), class)
#[[1]]
#[1] "call"
#
#[[2]]
#[1] "numeric"

sritchie73 · 2020-05-18T11:37:40Z

That's essentially the conclusion I came to for any object that's not considered internally to be a vector by R (i.e. !isVector() == true in the R's C interface) because they have a length property that does not correspond to the number of elements. This causes problems when trying to print, and also errors when these types are in the first column of the data.table (because the first column is used to determine .N)

jangorecki · 2020-05-18T12:07:50Z

Agree, could you please ensure PR is ready to review? otherwise better to mark it as WIP.

sritchie73 · 2020-05-18T13:06:55Z

I've just reviewed the unit tests and run them; the current behaviour of this PR is to wrap each expression element in a list, even if the input expression column is a vector of length > 1:

A = data.table(c1 = 1:2, c2 = c('asd', 'xzy'), c3 = expression(as.character(Sys.time()), 1+2))
B = data.table(c1 = 3:5, c2 = c('qwe', "foo", "bar"), 
               c3 = expression(1+1, print("test"), as.character(Sys.time()+5)))
dt = rbind(A,B)
dt[, c3]
[[1]]
expression(as.character(Sys.time()))

[[2]]
expression(1 + 2)

[[3]]
expression(1 + 1)

[[4]]
expression(print("test"))

[[5]]
expression(as.character(Sys.time() + 5))

I've added a unit test to cover this case.

MichaelChirico · 2020-05-19T10:16:26Z

any idea if there is an open issue for printing expression column behavior?

Sounds somewhat familiar (it's related to #3011) but I didn't see anything in debugonce.

What's happening is format(l$expr) returns length-one, then cbind recycles this.

This will be better addressed by #3414 -- in fact I already see a format_col.expression method there.

jangorecki · 2020-05-19T10:55:28Z

@sritchie73 one line seems to be not covered yet: https://codecov.io/gh/Rdatatable/data.table/compare/c0052964694a4c618ab182aa474f924d40576d94...00c6d859b158737f130d3f5a1da416693358e14d/diff

sritchie73 · 2020-05-19T11:08:35Z

I'll need to do some digging later. I don't see how that line isn't covered given all the other lines in that for loop are covered...

sritchie73 · 2020-05-23T16:56:08Z

I injected a few Rprintf statements and can confirm that the tests are running code within that loop before and after the line that's supposedly not covered, so I'm not sure what to make of this:

Running this test:

# Test non-atomic columns in rbindlist, e.g. expressions, #546
A = data.table(c1 = 1, c2 = 'asd', c3 = expression(as.character(Sys.time())))
B = data.table(c1 = 3, c2 = 'qwe', c3 = expression(as.character(Sys.time()+5)))
DT = data.table(c1 = c(1, 3), c2 = c("asd", "qwe"), 
                c3 = list(expression(as.character(Sys.time())), expression(as.character(Sys.time() + 5))))
test(2129.1, rbind(A,B), DT)

I get both messages below output to the R console:

      Rprintf("Inside loop\n");
      // Allocate length-1 vector of appropriate type and set it in the right place in the VECSXP container
      SEXP thisElement = PROTECT(allocVector(TYPEOF(x), 1)); (*nprotect)++;
      SET_VECTOR_ELT(coerced, i, thisElement); UNPROTECT(1); (*nprotect)--; // line supposedly not covered 
      Rprintf("After SET_VECTOR_ELT\n");

MichaelChirico · 2024-01-06T13:43:10Z

@sritchie73 just checking if this is still needed? If so we can merge to master and resume the review process (nearly 4 years later 😅).

Also if you'd like someone else to take over the PR, do feel free.

cc @ben-schwen since I know you've been working a lot on rbindlist lately.

jangorecki · 2024-01-06T14:25:05Z

It may be better to merge other rbindlist PR(s) (Ben had something recently from where we took only regression fix), and once the other one is merged master, then merge master to this one.

MichaelChirico · 2024-09-05T04:20:44Z

#5453 is the other pending rbindlist() PR. Let's revisit this after that's addressed.

sritchie73 added 3 commits January 24, 2020 11:55

Non-atomic types listified in rbindlist

0e67524

Any non-allowed column not listed in the typeorder in src/init.c are now appropriately detected in src/rbindlist.c. In these cases, the column type becomes a list, with each element wrapped in a list.

Fixed test, and added more extensive tests

5268888

Updated NEWS item

79e6ed2

sritchie73 commented Jan 24, 2020

View reviewed changes

src/rbindlist.c Outdated Show resolved Hide resolved

sritchie73 commented Jan 24, 2020

View reviewed changes

src/rbindlist.c Outdated Show resolved Hide resolved

sritchie73 added 3 commits January 24, 2020 12:22

Added comment to future proof typeorder

b647ebe

Missing PROTECT added.

d44129f

All exotic types sould now be listified

1908ecc

sritchie73 added the WIP label Jan 24, 2020

sritchie73 commented Jan 24, 2020

View reviewed changes

sritchie73 changed the title ~~Expression columns now supported by rbindlist~~ All exotic types are now converted to lists in rbindlist Jan 24, 2020

Added missing comment to UNPROTECT

3e34323

sritchie73 changed the title ~~All exotic types are now converted to lists in rbindlist~~ All non-atomic types are now converted to lists in rbindlist Jan 25, 2020

jangorecki reviewed Jan 25, 2020

View reviewed changes

Added tests

a7ae6db

sritchie73 removed the WIP label Jan 25, 2020

This was referenced Jan 25, 2020

Should rbindlist(..., fill=TRUE) return NA_logical_ in list columns? #4198

Closed

Fast data.table to matrix conversion in C [Depends: #4196] #4144

Draft

sritchie73 added the WIP label Jan 27, 2020

sritchie73 added 5 commits January 27, 2020 12:56

Abstracted list coercion into separate function

2bae80c

I.e. plan to use in Casmatrix in #4144

bugfix abstracted function

7166509

Add test and allow integer64 when maxType = VECSXP

7063d5c

Class propogates to list elements

5f6543a

No error on class mismatch when maxType==VECSXP

070508c

sritchie73 removed the WIP label Jan 27, 2020

sritchie73 added 3 commits January 27, 2020 23:44

& should be &&

10baf70

Reverted no longer relevant comment change

e2f3b65

Generalised vector case, uses memrecycle

e0d2f46

sritchie73 commented Jan 28, 2020

View reviewed changes

src/rbindlist.c Show resolved Hide resolved

sritchie73 mentioned this pull request Jan 28, 2020

Fixed typeorder coercion rules for RAWSXP and LGLSXP to match base R #4195

Closed

Better PROTECT stack management

60dddd1

tlapak mentioned this pull request Mar 6, 2020

Triggered internal error with an expression in j while grouping #3708

Open

Unit test for expression column with length > 1

9e94df6

bugfix test number increment

00c6d85

MichaelChirico mentioned this pull request May 19, 2020

format_col/format_list_item printing generics for customization #3414

Merged

MichaelChirico mentioned this pull request Jan 6, 2024

Improved handling of list columns with NULL entries #4250

Merged

jangorecki added the rbindlist label Jan 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

All non-atomic types are now converted to lists in rbindlist #4196

All non-atomic types are now converted to lists in rbindlist #4196

sritchie73 commented Jan 24, 2020 •

edited

Loading

codecov bot commented Jan 24, 2020 •

edited

Loading

sritchie73 Jan 24, 2020

jangorecki Jan 25, 2020

sritchie73 Jan 25, 2020

sritchie73 Jan 24, 2020

jangorecki Jan 25, 2020

sritchie73 commented Jan 24, 2020

jangorecki Jan 25, 2020

sritchie73 Jan 25, 2020

jangorecki commented Jan 25, 2020

sritchie73 commented Jan 25, 2020

sritchie73 commented Jan 27, 2020

sritchie73 commented Jan 27, 2020

sritchie73 commented Jan 27, 2020

jangorecki commented Apr 5, 2020 •

edited

Loading

sritchie73 commented May 18, 2020

jangorecki commented May 18, 2020 •

edited

Loading

jangorecki commented May 18, 2020 •

edited

Loading

jangorecki commented May 18, 2020

sritchie73 commented May 18, 2020

jangorecki commented May 18, 2020

sritchie73 commented May 18, 2020

MichaelChirico commented May 19, 2020 •

edited

Loading

jangorecki commented May 19, 2020

sritchie73 commented May 19, 2020

sritchie73 commented May 23, 2020

MichaelChirico commented Jan 6, 2024

jangorecki commented Jan 6, 2024

MichaelChirico commented Sep 5, 2024

All non-atomic types are now converted to lists in rbindlist #4196

Are you sure you want to change the base?

All non-atomic types are now converted to lists in rbindlist #4196

Conversation

sritchie73 commented Jan 24, 2020 • edited Loading

codecov bot commented Jan 24, 2020 • edited Loading

Codecov Report

sritchie73 Jan 24, 2020

Choose a reason for hiding this comment

jangorecki Jan 25, 2020

Choose a reason for hiding this comment

sritchie73 Jan 25, 2020

Choose a reason for hiding this comment

sritchie73 Jan 24, 2020

Choose a reason for hiding this comment

jangorecki Jan 25, 2020

Choose a reason for hiding this comment

sritchie73 commented Jan 24, 2020

jangorecki Jan 25, 2020

Choose a reason for hiding this comment

sritchie73 Jan 25, 2020

Choose a reason for hiding this comment

jangorecki commented Jan 25, 2020

sritchie73 commented Jan 25, 2020

sritchie73 commented Jan 27, 2020

sritchie73 commented Jan 27, 2020

sritchie73 commented Jan 27, 2020

jangorecki commented Apr 5, 2020 • edited Loading

sritchie73 commented May 18, 2020

jangorecki commented May 18, 2020 • edited Loading

jangorecki commented May 18, 2020 • edited Loading

jangorecki commented May 18, 2020

sritchie73 commented May 18, 2020

jangorecki commented May 18, 2020

sritchie73 commented May 18, 2020

MichaelChirico commented May 19, 2020 • edited Loading

jangorecki commented May 19, 2020

sritchie73 commented May 19, 2020

sritchie73 commented May 23, 2020

MichaelChirico commented Jan 6, 2024

jangorecki commented Jan 6, 2024

MichaelChirico commented Sep 5, 2024

sritchie73 commented Jan 24, 2020 •

edited

Loading

codecov bot commented Jan 24, 2020 •

edited

Loading

jangorecki commented Apr 5, 2020 •

edited

Loading

jangorecki commented May 18, 2020 •

edited

Loading

jangorecki commented May 18, 2020 •

edited

Loading

MichaelChirico commented May 19, 2020 •

edited

Loading