
collect more statistics about the data #2879

Open
10 of 12 tasks
jangorecki opened this issue May 15, 2018 · 9 comments · May be fixed by #4478
Assignees
Labels
enhancement internals top request One of our most-requested issues

Comments

@jangorecki
Member

jangorecki commented May 15, 2018

data.table could collect more statistics about the data while processing it. These would enable potential optimizations, not limited to internal data.table code: users could also use them to speed up their own code and design more data-driven functions.
List of measures to collect:

  • is sorted: haskey(x)
  • has index: !is.null(idx<-attr(attr(x, "index"), idx_name))
  • has NA / anyNA
  • has NaN
  • number of groups (uniqueN): length(attr(idx, "starts"))
  • size of biggest group: attr(idx, "maxgrpn")
  • is unique (uniqueN == .N): attr(idx, "maxgrpn")==1L
  • range (min, max): x[c(idx[1L], idx[length(idx)])]
  • all NA: has_na && length(attr(idx, "starts"))==1L
  • is ascii

optionally, as I don't see obvious optimizations coming from those:

  • NA count
  • sd, var
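Several of the measures listed above can already be read off a stored secondary index. A rough sketch of a helper that does only that (the helper name `index_stats` is mine, not part of the proposal; whether the "starts"/"maxgrpn" attributes are actually retained on a stored index depends on the data.table version):

```r
library(data.table)

# Hypothetical helper: read statistics already derivable from an existing
# secondary index on column `col`, without computing anything new.
index_stats <- function(x, col) {
  idx <- attr(attr(x, "index", exact = TRUE), paste0("__", col), exact = TRUE)
  if (is.null(idx)) return(NULL)               # no index stored for this column
  starts  <- attr(idx, "starts",  exact = TRUE)
  maxgrpn <- attr(idx, "maxgrpn", exact = TRUE)
  if (is.null(starts)) return(NULL)            # index stored without group info
  list(
    unique.n  = length(starts),                # number of groups (uniqueN)
    maxgrp.n  = maxgrpn,                       # size of the biggest group
    is.unique = identical(maxgrpn, 1L)         # uniqueN == .N
  )
}

DT <- data.table(v = c(3L, 1L, 3L, 2L))
setindex(DT, v)
index_stats(DT, "v")
```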
@franknarf1
Contributor

I'd also be interested in collecting statistics related to a join (e.g., does each row of i have exactly one matching row in x in an x[i] join?), though I don't know whether that's relevant to optimization, nor where such statistics would be collected (since they pertain to multiple tables, embedding them in one table doesn't make sense). Anyway, just a thought, mentioned earlier in #2022

@jangorecki
Member Author

jangorecki commented May 15, 2018

@franknarf1 not exactly what you've asked, but related:
is unique on a field would guarantee that joining to that table on this field won't produce multiple matches.
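For the x[i] case this guarantee can already be checked by hand today; a small illustration (not part of the proposal):

```r
library(data.table)

x <- data.table(id = c(1L, 2L, 3L), vx = c("a", "b", "c"))
i <- data.table(id = c(2L, 2L, 3L))

# If `id` is unique in x, each row of i matches at most one row of x,
# so x[i, on="id"] has exactly nrow(i) rows -- no row explosion.
stopifnot(uniqueN(x$id) == nrow(x))
res <- x[i, on = "id"]
stopifnot(nrow(res) == nrow(i))
```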

@st-pasha
Contributor

na_count makes it unnecessary to keep both has_nas and all_nas. It has additional benefits when computing stats for boolean columns (sum + na_count are sufficient to derive all the other statistics). Also, if you ever want to store a data.table in the feather format, you'll need to know the count of NAs for each column.

In addition to the stats you mentioned, one of the most important is is_ascii for string columns. Having all-ASCII data allows using a fast sort method, whereas generic Unicode needs a slower comparison sort.
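The claim that sum + na_count suffice for a logical column can be illustrated directly; the function name `logical_stats` is mine, for illustration only:

```r
# For a logical column, (sum of TRUEs, NA count) determines the rest.
logical_stats <- function(v) {
  stopifnot(is.logical(v))
  n        <- length(v)
  na.count <- sum(is.na(v))
  true.n   <- sum(v, na.rm = TRUE)
  list(
    na.count = na.count,
    has.na   = na.count > 0L,
    all.na   = na.count == n,
    true.n   = true.n,
    false.n  = n - na.count - true.n,
    # min is TRUE only if every non-NA value is TRUE; max is TRUE if any TRUE
    min      = if (na.count < n) true.n == n - na.count else NA,
    max      = if (na.count < n) true.n > 0L else NA
  )
}

logical_stats(c(TRUE, FALSE, NA, TRUE))
```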

@jangorecki
Member Author

jangorecki commented May 16, 2018

We need to plug collection of those statistics into existing functions, so that explicitly calling analyze won't be needed to get the benefits. Posting as a reference:

library(data.table)

analyze <- function(x, ...) {
  stopifnot(is.data.table(x))
  d <- dim(x)
  nr <- d[1L]
  nc <- d[2L]
  cols <- names(x)
  # template row of statistics, replicated once per column
  ans <- list(is.sorted=NA, is.indexed=NA, is.unique=NA, is.ascii=NA,
              unique.n=NA_integer_, maxgrp.n=NA_integer_, na.count=NA_integer_,
              has.na=NA, all.na=NA, has.nan=NA,
              min=NA_real_, max=NA_real_)
  ans <- rbindlist(sapply(cols, function(x) ans, simplify=FALSE), idcol="column")
  # compute orderings (with group info) for every column
  o <- sapply(simplify=FALSE, cols, data.table:::forderv, x=x, sort=TRUE, retGrp=TRUE)
  if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer())
  idx <- attr(x, "index", exact=TRUE)
  is.ascii <- function(x) {
    if (!is.character(x)) return(NA)
    asc <- iconv(x, "latin1", "ASCII")
    !any(is.na(asc) | asc != x)
  }
  for (i in seq_len(nc)) {
    # store each column's ordering as a secondary index
    setattr(idx, paste0("__", cols[i]), c(o[[i]]))
    ans[i, `:=`(
             is.sorted = !length(o[[i]]), # forderv returns integer(0) when already sorted
             is.indexed = TRUE,
             is.ascii = is.ascii(x[[i]]),
             unique.n = length(attr(o[[i]], "starts", exact=TRUE)),
             maxgrp.n = attr(o[[i]], "maxgrpn", exact=TRUE),
             na.count = sum(is.na(x[[i]]))
           )][i, `:=`(is.unique = unique.n==nr,
                      has.na = na.count>0L,
                      all.na = na.count==nr,
                      has.nan = NA,
                      min = NA_real_,
                      max = NA_real_)]
  }
  setattr(x, "stats", ans)
}

stats <- function(x) attr(x, "stats", exact=TRUE)

set.seed(108)
x <- data.table(v1=sample(8L, 10L, TRUE), v2=as.factor(sample(letters, 5L)),
                v3=rnorm(10L), v4=sample(c("\xfcasd", sample(letters, 4L))),
                v5=sample(c(rnorm(5), rep(NA, 5))), v6=NA)
x
#       v1     v2         v3      v4         v5     v6
#    <int> <fctr>      <num>  <char>      <num> <lgcl>
# 1:     4      x -0.1389979       o         NA     NA
# 2:     4      j -0.4059470       g  0.4747667     NA
# 3:     3      r -1.6771308 \374asd  0.6404341     NA
# 4:     6      s  0.4459993       y  0.5706510     NA
# 5:     4      n -0.6954863       i         NA     NA
# 6:     1      x  0.6769990       o -0.6944517     NA
# 7:     5      j  0.9524670       g         NA     NA
# 8:     2      r -2.2123936 \374asd         NA     NA
# 9:     5      s  0.9949963       y -0.7555400     NA
#10:     4      n -0.1515556       i         NA     NA
analyze(x)
stats(x)
#   column is.sorted is.indexed is.unique is.ascii unique.n maxgrp.n na.count
#   <char>    <lgcl>     <lgcl>    <lgcl>   <lgcl>    <int>    <int>    <int>
#1:     v1     FALSE       TRUE     FALSE       NA        6        4        0
#2:     v2     FALSE       TRUE     FALSE       NA        5        2        0
#3:     v3     FALSE       TRUE      TRUE       NA       10        1        0
#4:     v4     FALSE       TRUE     FALSE    FALSE        5        2        0
#5:     v5     FALSE       TRUE     FALSE       NA        6        5        5
#6:     v6      TRUE       TRUE     FALSE       NA        1       10       10
#   has.na all.na has.nan   min   max
#   <lgcl> <lgcl>  <lgcl> <num> <num>
#1:  FALSE  FALSE      NA    NA    NA
#2:  FALSE  FALSE      NA    NA    NA
#3:  FALSE  FALSE      NA    NA    NA
#4:  FALSE  FALSE      NA    NA    NA
#5:   TRUE  FALSE      NA    NA    NA
#6:   TRUE   TRUE      NA    NA    NA

@jangorecki
Member Author

jangorecki commented Jul 3, 2018

The question is whether we should use plain R attributes to keep the statistics, or the new R C-level (ALTREP) attributes, which are probably still quite experimental for now because they were introduced in R 3.5.0.

x = rnorm(1e2)
y = sort(x)
.Internal(inspect(x))
#@55b95405e2e0 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1.07949,0.227716,1.68177,-0.300367,0.332526,...
.Internal(inspect(y))
#@55b954cf0ad8 14 REALSXP g0c0 [NAM(3)]  wrapper [srt=1,no_na=1]
#  @55b95405da50 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) -3.17337,-2.3548,-1.94658,-1.84502,-1.72388,...

Notice [srt=1,no_na=1] on variable y, which translates to "sorted" and "no NAs".
If we use the R way, we can speed up objects that were processed with base R as well as with data.table; otherwise we can only use statistics created by data.table. Ideally base R would export an API to set and get those attributes.

2018-09-14:
The answer is: we should store attributes at C level, independent of the R C API. Then there will be no surprise that the (probably not yet existing) API changed and we need to adapt to the changes. We can also use them wherever we want, including parallel regions. @mattdowle do you agree?

@HughParsonage
Member

The answer is: we should store attributes at C level, independent of the R C API. Then there will be no surprise that the (probably not yet existing) API changed and we need to adapt to the changes. We can also use them wherever we want, including parallel regions. @mattdowle do you agree?

One downside to this -- unless I've misunderstood -- is that it doesn't allow the user to access these attributes. Many of them would improve the performance of user functions.

@MichaelChirico
Member

@HughParsonage I guess the idea would be to provide accessor functions to the attributes in C

@jangorecki
Member Author

@HughParsonage access can easily be provided via accessor functions as Michael commented, but I am not sure it is a good idea. If a user marked a column as non-duplicate or non-NA, then processing of that column by our algorithms could silently return a wrong answer. It would be best to just make functions like unique, is.na, forder, etc. write down those statistics.
Keeping stats as a C struct would require a lot of code to be altered, so that wherever we now pass an int we would pass an our_int that carries the int together with its stats.

@HughParsonage
Member

Sorry, I'm not suggesting the user has 'write' access, so to speak, to these attributes -- only 'read' access. That would definitely be useful and worth the cost.

But I think write access would be useful too. Ultimately the user has to be a bit careful. After all, it's currently possible to mark an unsorted data.table as sorted. We accept that this is worth the cost.
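For reference, the existing escape hatch alluded to above: setattr can set the "sorted" attribute directly, bypassing the sort check that setkey performs, so a wrong mark would silently corrupt binary searches. A quick sketch:

```r
library(data.table)

# setkey() physically sorts the data and then marks the key -- always safe.
DT <- data.table(v = c(3L, 1L, 2L))
setkey(DT, v)

# setattr() can mark a column as sorted without any validation;
# on unsorted data this would make subsequent binary searches silently wrong.
DT2 <- data.table(v = c(3L, 1L, 2L))
setattr(DT2, "sorted", "v")   # no sort check performed
key(DT2)                      # reports "v" even though DT2$v is not sorted
```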
