You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am writing a simple function to return the first row of a dataset using data.table. I don't want to assume all inputs will be a data.table so was conducting some testing with tibbles and came across the following bug.
When using as_tibble() (from ‘3.2.1’) to define a tibble I assume it must keep some reference to the original data.frame. The reprex below shows some strange reordering that happens when the original data.frame is modified and then a function is called on the tibble created from as_tibble().
I am trying to understand why this might occur. I would have thought that tbl <- as_tibble(df) would have assigned an item to a new memory location and no longer referenced df. If we change the order of execution to run `select_first(tbl, keyby = "c") before any others it returns the desired output.
library(data.table)
library(tibble)
select_first<-function(dt, by, keyby) {
setDT(dt)
if (!missing(keyby)) {
data.table::setkeyv(dt, keyby)
}
if (missing(by)) {
return(dt[1])
}
dt[dt[, .I[1], by]$V1]
}
df<-data.frame(
a=1:5,
b=LETTERS[1:5],
c=5:1,
d= c("a", "a", "b", "b", "b")
)
df# a b c d# 1 1 A 5 a# 2 2 B 4 a# 3 3 C 3 b# 4 4 D 2 b# 5 5 E 1 btbl<-tibble::as_tibble(df)
tbl# # A tibble: 5 × 4# a b c d# <int> <chr> <int> <chr># 1 1 A 5 a# 2 2 B 4 a# 3 3 C 3 b# 4 4 D 2 b# 5 5 E 1 bdt<-data.table::as.data.table(df)
dt# a b c d# <int> <char> <int> <char># 1: 1 A 5 a# 2: 2 B 4 a# 3: 3 C 3 b# 4: 4 D 2 b# 5: 5 E 1 bdf# format remains the same as when first created# a b c d# 1 1 A 5 a# 2 2 B 4 a# 3 3 C 3 b# 4 4 D 2 b# 5 5 E 1 b
select_first(df, keyby="c") # result as expected# Key: <c># a b c d# <int> <char> <int> <char># 1: 5 E 1 btbl# order of columns `b` and `d` have changed# A tibble: 5 × 4# a b c d# <int> <chr> <int> <chr># 1 1 E 5 b# 2 2 D 4 b# 3 3 C 3 b# 4 4 B 2 a# 5 5 A 1 a
select_first(tbl, keyby="c") # results are not as expected# Key: <c># a b c d# <int> <char> <int> <char># 1: 5 A 1 adt# format remains the same as when first created# a b c d# <int> <char> <int> <char># 1: 1 A 5 a# 2: 2 B 4 a# 3: 3 C 3 b# 4: 4 D 2 b# 5: 5 E 1 b
select_first(dt, keyby="c") # result as expected# Key: <c># a b c d# <int> <char> <int> <char># 1: 5 E 1 b
The text was updated successfully, but these errors were encountered:
@danwwilson - this is not to do with tibble but with the fact that you should avoid using setDT programmatically (i.e use as.data.table() instead) or exercise great care in non-interactive settings. To illustrate:
as_bob<-function(x) {
x<- as.data.frame(x)
class(x) <- c("bob", class(x))
x
}
df<-data.frame(
a=1:5,
b=LETTERS[1:5],
c=5:1,
d= c("a", "a", "b", "b", "b")
)
(bob<- as_bob(df))
#> a b c d#> 1 1 A 5 a#> 2 2 B 4 a#> 3 3 C 3 b#> 4 4 D 2 b#> 5 5 E 1 b
select_first(df, keyby="c")
#> Key: <c>#> a b c d#> <int> <char> <int> <char>#> 1: 5 E 1 bbob#> a b c d#> 1 1 E 5 b#> 2 2 D 4 b#> 3 3 C 3 b#> 4 4 B 2 a#> 5 5 A 1 a
The onus would be on data.table to change the behaviour of setDT() (should they wish to) but that would likely remove it's efficient nature. You could make your function force a copy() but in that case you might as well call as.data.table() (at this point the only real difference in overhead would be the method dispatch).
I am writing a simple function to return the first row of a dataset using data.table. I don't want to assume all inputs will be a data.table so was conducting some testing with tibbles and came across the following bug.
When using
as_tibble()
(from ‘3.2.1’) to define a tibble I assume it must keep some reference to the original data.frame. The reprex below shows some strange reordering that happens when the original data.frame is modified and then a function is called on the tibble created fromas_tibble()
.I am trying to understand why this might occur. I would have thought that
tbl <- as_tibble(df)
would have assigned an item to a new memory location and no longer referenceddf
. If we change the order of execution to run `select_first(tbl, keyby = "c") before any others it returns the desired output.The text was updated successfully, but these errors were encountered: