Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with memory allocations when using as_tibble() #1605

Open
danwwilson opened this issue Oct 31, 2024 · 1 comment
Open

Issue with memory allocations when using as_tibble() #1605

danwwilson opened this issue Oct 31, 2024 · 1 comment

Comments

@danwwilson
Copy link

danwwilson commented Oct 31, 2024

I am writing a simple function to return the first row of a dataset using data.table. I don't want to assume all inputs will be a data.table so was conducting some testing with tibbles and came across the following bug.

When using as_tibble() (from ‘3.2.1’) to define a tibble I assume it must keep some reference to the original data.frame. The reprex below shows some strange reordering that happens when the original data.frame is modified and then a function is called on the tibble created from as_tibble().

I am trying to understand why this might occur. I would have thought that tbl <- as_tibble(df) would have assigned an item to a new memory location and no longer referenced df. If we change the order of execution to run `select_first(tbl, keyby = "c") before any others it returns the desired output.

library(data.table)
library(tibble)

select_first <- function(dt, by, keyby) {
  setDT(dt)

  if (!missing(keyby)) {
    data.table::setkeyv(dt, keyby)
  }

  if (missing(by)) {
    return(dt[1])
  }

  dt[dt[, .I[1], by]$V1]
}

df <- data.frame(
  a = 1:5,
  b = LETTERS[1:5],
  c = 5:1,
  d = c("a", "a", "b", "b", "b")
)
df
#   a b c d
# 1 1 A 5 a
# 2 2 B 4 a
# 3 3 C 3 b
# 4 4 D 2 b
# 5 5 E 1 b

tbl <- tibble::as_tibble(df)
tbl
# # A tibble: 5 × 4
#       a b         c d
#   <int> <chr> <int> <chr>
# 1     1 A         5 a
# 2     2 B         4 a
# 3     3 C         3 b
# 4     4 D         2 b
# 5     5 E         1 b

dt <- data.table::as.data.table(df)
dt
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     1      A     5      a
# 2:     2      B     4      a
# 3:     3      C     3      b
# 4:     4      D     2      b
# 5:     5      E     1      b

df # format remains the same as when first created
#   a b c d
# 1 1 A 5 a
# 2 2 B 4 a
# 3 3 C 3 b
# 4 4 D 2 b
# 5 5 E 1 b

select_first(df, keyby = "c") # result as expected
# Key: <c>
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     5      E     1      b

tbl # order of columns `b` and `d` have changed
# A tibble: 5 × 4
#       a b         c d
#   <int> <chr> <int> <chr>
# 1     1 E         5 b
# 2     2 D         4 b
# 3     3 C         3 b
# 4     4 B         2 a
# 5     5 A         1 a

select_first(tbl, keyby = "c") # results are not as expected
# Key: <c>
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     5      A     1      a

dt # format remains the same as when first created
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     1      A     5      a
# 2:     2      B     4      a
# 3:     3      C     3      b
# 4:     4      D     2      b
# 5:     5      E     1      b

select_first(dt, keyby = "c") # result as expected
# Key: <c>
#        a      b     c      d
#    <int> <char> <int> <char>
# 1:     5      E     1      b
@TimTaylor
Copy link
Contributor

TimTaylor commented Nov 12, 2024

@danwwilson - this is not to do with tibble but with the fact that you should avoid using setDT programmatically (i.e use as.data.table() instead) or exercise great care in non-interactive settings. To illustrate:

as_bob <- function(x) {
    x <- as.data.frame(x)
    class(x) <- c("bob", class(x))
    x
}

df <- data.frame(
    a = 1:5,
    b = LETTERS[1:5],
    c = 5:1,
    d = c("a", "a", "b", "b", "b")
)

(bob <- as_bob(df))
#>   a b c d
#> 1 1 A 5 a
#> 2 2 B 4 a
#> 3 3 C 3 b
#> 4 4 D 2 b
#> 5 5 E 1 b
select_first(df, keyby = "c")
#> Key: <c>
#>        a      b     c      d
#>    <int> <char> <int> <char>
#> 1:     5      E     1      b
bob
#>   a b c d
#> 1 1 E 5 b
#> 2 2 D 4 b
#> 3 3 C 3 b
#> 4 4 B 2 a
#> 5 5 A 1 a

The onus would be on data.table to change the behaviour of setDT() (should they wish to) but that would likely remove it's efficient nature. You could make your function force a copy() but in that case you might as well call as.data.table() (at this point the only real difference in overhead would be the method dispatch).

Hope that makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants