
collect more statistics about the data #2879

Open
10 of 12 tasks
jangorecki opened this issue May 15, 2018 · 9 comments · May be fixed by #4478
Assignees
Labels
enhancement internals top request One of our most-requested issues

Comments

@jangorecki
Member

jangorecki commented May 15, 2018

data.table could collect more statistics about the data while processing it. These would enable potential optimizations, not limited to internal data.table code: users could also use them to speed up their own code and design more data-driven functions.
List of measures to collect:

  • is sorted: haskey(x)
  • has index: !is.null(idx<-attr(attr(x, "index"), idx_name))
  • has NA / anyNA
  • has NaN
  • number of groups (uniqueN): length(attr(idx, "starts"))
  • size of biggest group: attr(idx, "maxgrpn")
  • is unique (uniqueN == .N): attr(idx, "maxgrpn")==1L
  • range (min, max): x[c(idx[1L], idx[length(idx)])]
  • all NA: has_na && length(attr(idx, "starts"))==1L
  • is ascii

optionally, as I don't see obvious optimizations coming from those:

  • NA count
  • sd, var
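Several of the measures listed above can already be read off a stored secondary index. A rough sketch of a helper that does only that (the helper name `index_stats` is mine, not part of the proposal; whether the "starts"/"maxgrpn" attributes are actually retained on a stored index depends on the data.table version):

```r
library(data.table)

# Hypothetical helper: read statistics already derivable from an existing
# secondary index on column `col`, without computing anything new.
index_stats <- function(x, col) {
  idx <- attr(attr(x, "index", exact = TRUE), paste0("__", col), exact = TRUE)
  if (is.null(idx)) return(NULL)               # no index stored for this column
  starts  <- attr(idx, "starts",  exact = TRUE)
  maxgrpn <- attr(idx, "maxgrpn", exact = TRUE)
  if (is.null(starts)) return(NULL)            # index stored without group info
  list(
    unique.n  = length(starts),                # number of groups (uniqueN)
    maxgrp.n  = maxgrpn,                       # size of the biggest group
    is.unique = identical(maxgrpn, 1L)         # uniqueN == .N
  )
}

DT <- data.table(v = c(3L, 1L, 3L, 2L))
setindex(DT, v)
index_stats(DT, "v")
```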
@franknarf1
Contributor

I'd also be interested in collecting statistics related to a join (e.g., does each row of i have exactly one matching row in x in an x[i] join?), though I don't know whether that's relevant to optimization, nor where such statistics would be collected (since they pertain to multiple tables, embedding them in one table doesn't make sense). Anyway, just a thought, mentioned earlier in #2022

@jangorecki
Member Author

jangorecki commented May 15, 2018

@franknarf1 not exactly what you've asked, but related:
is unique on a field would guarantee that joining to that table on this field won't produce multiple matches.
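For the x[i] case this guarantee can already be checked by hand today; a small illustration (not part of the proposal):

```r
library(data.table)

x <- data.table(id = c(1L, 2L, 3L), vx = c("a", "b", "c"))
i <- data.table(id = c(2L, 2L, 3L))

# If `id` is unique in x, each row of i matches at most one row of x,
# so x[i, on="id"] has exactly nrow(i) rows -- no row explosion.
stopifnot(uniqueN(x$id) == nrow(x))
res <- x[i, on = "id"]
stopifnot(nrow(res) == nrow(i))
```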

@st-pasha
Contributor

na_count makes it unnecessary to keep both has_nas and all_nas. It has additional benefits when computing stats for boolean columns (sum + na_count are sufficient to derive all the other statistics). Also, if you ever want to store a data.table in the feather format, you'll need to know the count of NAs for each column.

In addition to the stats you mentioned, one of the most important is is_ascii for string columns. Having all-ASCII data allows using a fast sort method, whereas generic Unicode needs a slower comparison sort.
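The claim that sum + na_count suffice for a logical column can be illustrated directly; the function name `logical_stats` is mine, for illustration only:

```r
# For a logical column, (sum of TRUEs, NA count) determines the rest.
logical_stats <- function(v) {
  stopifnot(is.logical(v))
  n        <- length(v)
  na.count <- sum(is.na(v))
  true.n   <- sum(v, na.rm = TRUE)
  list(
    na.count = na.count,
    has.na   = na.count > 0L,
    all.na   = na.count == n,
    true.n   = true.n,
    false.n  = n - na.count - true.n,
    # min is TRUE only if every non-NA value is TRUE; max is TRUE if any TRUE
    min      = if (na.count < n) true.n == n - na.count else NA,
    max      = if (na.count < n) true.n > 0L else NA
  )
}

logical_stats(c(TRUE, FALSE, NA, TRUE))
```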

@jangorecki
Member Author

jangorecki commented May 16, 2018

We need to plug collection of those statistics into existing functions, so that explicitly calling analyze won't be needed to get the benefits. Posting as a reference:

library(data.table)

analyze <- function(x, ...) {
  stopifnot(is.data.table(x))
  d <- dim(x)
  nr <- d[1L]
  nc <- d[2L]
  cols <- names(x)
  # template row of statistics, replicated once per column
  ans <- list(is.sorted=NA, is.indexed=NA, is.unique=NA, is.ascii=NA,
              unique.n=NA_integer_, maxgrp.n=NA_integer_, na.count=NA_integer_,
              has.na=NA, all.na=NA, has.nan=NA,
              min=NA_real_, max=NA_real_)
  ans <- rbindlist(sapply(cols, function(x) ans, simplify=FALSE), idcol="column")
  # compute orderings (with group info) for every column
  o <- sapply(simplify=FALSE, cols, data.table:::forderv, x=x, sort=TRUE, retGrp=TRUE)
  if (is.null(attr(x, "index", exact=TRUE))) setattr(x, "index", integer())
  idx <- attr(x, "index", exact=TRUE)
  is.ascii <- function(x) {
    if (!is.character(x)) return(NA)
    asc <- iconv(x, "latin1", "ASCII")
    !any(is.na(asc) | asc != x)
  }
  for (i in seq_len(nc)) {
    # store each column's ordering as a secondary index
    setattr(idx, paste0("__", cols[i]), c(o[[i]]))
    ans[i, `:=`(
             is.sorted = !length(o[[i]]), # forderv returns integer(0) when already sorted
             is.indexed = TRUE,
             is.ascii = is.ascii(x[[i]]),
             unique.n = length(attr(o[[i]], "starts", exact=TRUE)),
             maxgrp.n = attr(o[[i]], "maxgrpn", exact=TRUE),
             na.count = sum(is.na(x[[i]]))
           )][i, `:=`(is.unique = unique.n==nr,
                      has.na = na.count>0L,
                      all.na = na.count==nr,
                      has.nan = NA,
                      min = NA_real_,
                      max = NA_real_)]
  }
  setattr(x, "stats", ans)
}

stats <- function(x) attr(x, "stats", exact=TRUE)

set.seed(108)
x <- data.table(v1=sample(8L, 10L, TRUE), v2=as.factor(sample(letters, 5L)),
                v3=rnorm(10L), v4=sample(c("\xfcasd", sample(letters, 4L))),
                v5=sample(c(rnorm(5), rep(NA, 5))), v6=NA)
x
#       v1     v2         v3      v4         v5     v6
#    <int> <fctr>      <num>  <char>      <num> <lgcl>
# 1:     4      x -0.1389979       o         NA     NA
# 2:     4      j -0.4059470       g  0.4747667     NA
# 3:     3      r -1.6771308 \374asd  0.6404341     NA
# 4:     6      s  0.4459993       y  0.5706510     NA
# 5:     4      n -0.6954863       i         NA     NA
# 6:     1      x  0.6769990       o -0.6944517     NA
# 7:     5      j  0.9524670       g         NA     NA
# 8:     2      r -2.2123936 \374asd         NA     NA
# 9:     5      s  0.9949963       y -0.7555400     NA
#10:     4      n -0.1515556       i         NA     NA
analyze(x)
stats(x)
#   column is.sorted is.indexed is.unique is.ascii unique.n maxgrp.n na.count
#   <char>    <lgcl>     <lgcl>    <lgcl>   <lgcl>    <int>    <int>    <int>
#1:     v1     FALSE       TRUE     FALSE       NA        6        4        0
#2:     v2     FALSE       TRUE     FALSE       NA        5        2        0
#3:     v3     FALSE       TRUE      TRUE       NA       10        1        0
#4:     v4     FALSE       TRUE     FALSE    FALSE        5        2        0
#5:     v5     FALSE       TRUE     FALSE       NA        6        5        5
#6:     v6      TRUE       TRUE     FALSE       NA        1       10       10
#   has.na all.na has.nan   min   max
#   <lgcl> <lgcl>  <lgcl> <num> <num>
#1:  FALSE  FALSE      NA    NA    NA
#2:  FALSE  FALSE      NA    NA    NA
#3:  FALSE  FALSE      NA    NA    NA
#4:  FALSE  FALSE      NA    NA    NA
#5:   TRUE  FALSE      NA    NA    NA
#6:   TRUE   TRUE      NA    NA    NA

@jangorecki
Member Author

jangorecki commented Jul 3, 2018

The question is whether we should use plain R attributes to keep the statistics, or the new R C-level (ALTREP) attributes, which are probably still quite experimental for now because they were introduced in R 3.5.0.

x = rnorm(1e2)
y = sort(x)
.Internal(inspect(x))
#@55b95405e2e0 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) 1.07949,0.227716,1.68177,-0.300367,0.332526,...
.Internal(inspect(y))
#@55b954cf0ad8 14 REALSXP g0c0 [NAM(3)]  wrapper [srt=1,no_na=1]
#  @55b95405da50 14 REALSXP g0c7 [NAM(3)] (len=100, tl=0) -3.17337,-2.3548,-1.94658,-1.84502,-1.72388,...

Notice [srt=1,no_na=1] on variable y, which translates to "sorted" and "no NAs".
If we use the R way, we can speed up objects that were processed with base R as well as with data.table; otherwise we can only use statistics created by data.table. Ideally base R would export an API to set and get those attributes.

2018-09-14:
The answer is: we should store attributes at C level, independent of the R C API. Then there will be no surprise that the (probably not yet existing) API changed and we need to adapt to the changes. We can also use them wherever we want, including parallel regions. @mattdowle do you agree?

@HughParsonage
Member

The answer is: we should store attributes at C level, independent of the R C API. Then there will be no surprise that the (probably not yet existing) API changed and we need to adapt to the changes. We can also use them wherever we want, including parallel regions. @mattdowle do you agree?

One downside to this -- unless I've misunderstood -- is that it doesn't allow the user to access these attributes. Many of them would improve the performance of user functions.

@MichaelChirico
Member

@HughParsonage I guess the idea would be to provide accessor functions to the attributes in C

@jangorecki
Member Author

@HughParsonage access can easily be provided via accessor functions as Michael commented, but I am not sure it is a good idea. If a user marked a column as non-duplicate or non-NA, then processing of that column by our algorithms could silently return a wrong answer. It would be best to just make functions like unique, is.na, forder, etc. write down those statistics.
Keeping stats as a C struct would require a lot of code to be altered, so that wherever we now pass an int we would pass an our_int that carries the int together with its stats.

@HughParsonage
Member

Sorry, I'm not suggesting the user has 'write' access, so to speak, to these attributes -- only 'read' access. That would definitely be useful and worth the cost.

But I think write access would be useful too. Ultimately the user has to be a bit careful. After all, it's currently possible to mark an unsorted data.table as sorted. We accept that this is worth the cost.
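For reference, the existing escape hatch alluded to above: setattr can set the "sorted" attribute directly, bypassing the sort check that setkey performs, so a wrong mark would silently corrupt binary searches. A quick sketch:

```r
library(data.table)

# setkey() physically sorts the data and then marks the key -- always safe.
DT <- data.table(v = c(3L, 1L, 2L))
setkey(DT, v)

# setattr() can mark a column as sorted without any validation;
# on unsorted data this would make subsequent binary searches silently wrong.
DT2 <- data.table(v = c(3L, 1L, 2L))
setattr(DT2, "sorted", "v")   # no sort check performed
key(DT2)                      # reports "v" even though DT2$v is not sorted
```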
