Faster uniqueN(v) for logical v #2648

Merged · 9 commits · Mar 2, 2018

Conversation

@HughParsonage (Member)

For uniqueN(v) when v is a logical atomic vector, we can use anyNA, any, and all for a faster answer, especially when anyNA(v) is TRUE. On a 30-million-row data.table with 50 logical columns, vapply(DT, uniqueN, integer(1L)) saw a 20 s speedup (from 50 s to 30 s).

The only downside is that is.logical has to be run, though that cost is trivial (it could not be measured by microbenchmark).

Additional benchmarks (I've included what I consider to be worst-case scenarios for the new version, though others may be able to find worse instances).

[benchmark chart image]

# New = unique_N
unique_N <- function(x, by = if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE) { # na.rm, #1455
  if (missing(by) && is.data.table(x) && isTRUE(getOption("datatable.old.unique.by.key"))) {
    by = key(x)
    warning(warning_oldUniqueByKey)
  }
  if (is.null(x)) return(0L)
  if (!is.atomic(x) && !is.data.frame(x))
    stop("x must be an atomic vector or data.frames/data.tables")
  if (is.atomic(x)) {
    if (is.logical(x)) {
      # NAs + TRUE + FALSE
      if (na.rm) {
        return(any(x, na.rm = TRUE) + !all(x, na.rm = TRUE))
      } else {
        return(anyNA(x) + any(x, na.rm = TRUE) + !all(x, na.rm = TRUE))
      }
    } else {
      x = as_list(x)
    }
  }
  if (is.null(by)) by = seq_along(x)
  o = forderv(x, by=by, retGrp=TRUE, na.last=if (!na.rm) FALSE else NA)
  starts = attr(o, 'starts')
  if (!na.rm) {
    length(starts)
  } else {
    # TODO: internal efficient sum
    # fix for #1771, account for already sorted input
    sum( (if (length(o)) o[starts] else starts) != 0L)
  }
}

N = 1e7
rand_2 <- sample(c(TRUE, FALSE), size = N, replace = TRUE)
rand_3 <- sample(c(TRUE, FALSE, NA), size = N, replace = TRUE)

worst_TRUE <- rep_len(TRUE, N)
worst_FALSE <- rep_len(TRUE, N)
worst_NA <- c(rep_len(TRUE, N), NA)
stopifnot(
  identical(uniqueN(rand_2), unique_N(rand_2)),
  identical(uniqueN(rand_3), unique_N(rand_3)),
  identical(uniqueN(worst_TRUE), unique_N(worst_TRUE)),
  identical(uniqueN(worst_FALSE), unique_N(worst_FALSE)),
  identical(uniqueN(worst_NA), unique_N(worst_NA))
)

library(data.table)
library(magrittr)   # provides %>% used in the pipelines below
library(microbenchmark)
library(ggplot2)
mi <- 
  microbenchmark(
    uniqueN(rand_2), unique_N(rand_2),
    uniqueN(rand_3), unique_N(rand_3),
    uniqueN(worst_TRUE), unique_N(worst_TRUE),
    uniqueN(worst_FALSE), unique_N(worst_FALSE),
    uniqueN(worst_NA), unique_N(worst_NA)
  )

options("scipen" = 99)
mi %>%
  as.data.frame %>%
  as.data.table %>%
  .[, x := sub("^unique_?N.(.*).$", "\\1", expr)] %>%
  .[, new := grepl("unique_N", expr)] %>%
  # Drop the maximum (outlier)
  setorder(time) %>%
  .[, .(time = head(time, 99)), keyby = .(x, new)] %>%
  ggplot(aes(x = x, fill = new, y = time)) + 
  geom_boxplot() +
  scale_y_log10("Time / ms", breaks = c(10^(4:9), 0.5 * 10^(4:9)),
                labels = function(x) x / 1e6) + 
  coord_flip() +
  facet_wrap(~x, ncol = 1, scales = "free_y")

[benchmark boxplot image]

stopifnot(
  identical(uniqueN(rand_2, na.rm = TRUE),unique_N(rand_2, na.rm = TRUE)),
  identical(uniqueN(rand_3, na.rm = TRUE),unique_N(rand_3, na.rm = TRUE)),
  identical(uniqueN(worst_TRUE, na.rm = TRUE),unique_N(worst_TRUE, na.rm = TRUE)),
  identical(uniqueN(worst_FALSE, na.rm = TRUE),unique_N(worst_FALSE, na.rm = TRUE)),
  identical(uniqueN(worst_NA, na.rm = TRUE),unique_N(worst_NA, na.rm = TRUE))
)


mi <- 
  microbenchmark(
    uniqueN(rand_2, na.rm = TRUE), unique_N(rand_2, na.rm = TRUE),
    uniqueN(rand_3, na.rm = TRUE), unique_N(rand_3, na.rm = TRUE),
    uniqueN(worst_TRUE, na.rm = TRUE), unique_N(worst_TRUE, na.rm = TRUE),
    uniqueN(worst_FALSE, na.rm = TRUE), unique_N(worst_FALSE, na.rm = TRUE),
    uniqueN(worst_NA, na.rm = TRUE), unique_N(worst_NA, na.rm = TRUE)
  )

mi %>%
  as.data.frame %>%
  as.data.table %>%
  .[, x := sub("^unique_?N.(.*).$", "\\1", expr)] %>%
  .[, new := grepl("unique_N", expr)] %>%
  # Drop the maximum (outlier)
  setorder(time) %>%
  .[, .(time = head(time, 99)), keyby = .(x, new)] %>%
  ggplot(aes(x = x, fill = new, y = time)) + 
  geom_boxplot() +
  scale_y_log10("Time / ms", breaks = c(10^(4:9), 0.5 * 10^(4:9)),
                labels = function(x) x / 1e6) + 
  coord_flip() +
  facet_wrap(~x, ncol = 1, scales = "free_y")

@codecov-io commented Mar 1, 2018

Codecov Report

Merging #2648 into master will increase coverage by 0.15%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2648      +/-   ##
==========================================
+ Coverage   93.17%   93.32%   +0.15%     
==========================================
  Files          61       61              
  Lines       12169    12427     +258     
==========================================
+ Hits        11338    11598     +260     
+ Misses        831      829       -2
Impacted Files Coverage Δ
src/init.c 93.22% <ø> (ø) ⬆️
R/duplicated.R 89.24% <100%> (+0.35%) ⬆️
src/uniqlist.c 95.06% <100%> (+0.65%) ⬆️
src/fread.c 98.42% <0%> (+0.46%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d20576...dce5603. Read the comment docs.

@MichaelChirico (Member)

This is nice. I feel like a more custom-tailored version could do even better (it has to run both any and all, when a single pass over the vector should suffice) -- that would have to be done at the C level, of course.

A question about your examples: it looks like worst_TRUE and worst_FALSE are identical?

The worst case for what I have in mind would be something like

c(rep(TRUE, 1e7), FALSE)

This is hard for all(): the result is FALSE, but it can't tell until it reaches the last element.

@HughParsonage (Member, Author)

Thanks, Michael. Yes, you're right that it would be better to avoid 2 or 3 passes over the vector. I don't have the skill to write a C function that could outperform base R's. Maybe I should learn how to, but I think it's best not to practise on you guys!

That being said, if the unique values aren't too clustered (as they are in the worst case), any and !all only need to reach a TRUE and a FALSE to return early, so only a fraction of such vectors needs to be scanned. The issue is NA, since na.rm=FALSE by default. Comparing the top two panels, you can see that when the vector has no NAs that default means the function takes a long time to finish. But this seems unavoidable. One could advise the user to use na.rm = TRUE for maximum performance if they know there are no NAs, but that's basically asking the user to do the work for it!

Re: worst case, all(rep_len(TRUE, 1e7)) basically covers it, no? Both require full passes over the vector.

@MichaelChirico (Member)

Yes indeed, and I think the PR should be approved. Just making a note that further improvement can happen "down the line".

@mattdowle (Member)

Great idea! I couldn't resist doing it in C. Single pass, and it returns early. Its logic can probably be simplified; more eyes to check it would be welcome, please. Maybe this can be used as a template example of how to add a fairly simple C function to the package.
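[Editorial note] To make the single-pass, early-exit idea concrete for readers, here is a minimal C sketch. It is only an illustration, not the code reviewed below in src/uniqlist.c: the function name uniqueN_logical_sketch is made up, and the .Call registration and input type checks that a real package function needs are omitted.

#include <stdbool.h>
#include <R.h>
#include <Rinternals.h>

// Sketch: count the distinct values of a logical vector (TRUE, FALSE, NA) in
// one pass, breaking out as soon as nothing further can change the answer.
SEXP uniqueN_logical_sketch(SEXP x, SEXP narmArg) {
  const int *xd = LOGICAL(x);            // logical data is stored as int
  const R_xlen_t n = XLENGTH(x);
  const bool narm = (asLogical(narmArg) == 1);
  bool seenT = false, seenF = false, seenNA = false;
  for (R_xlen_t i = 0; i < n; i++) {
    const int v = xd[i];
    if (v == NA_LOGICAL) seenNA = true;
    else if (v)          seenT  = true;
    else                 seenF  = true;
    // Early exit: TRUE and FALSE both found, and NA either found or irrelevant.
    if (seenT && seenF && (narm || seenNA)) break;
  }
  return ScalarInteger(seenT + seenF + (narm ? 0 : seenNA));
}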

@mattdowle mattdowle added this to the v1.10.6 milestone Mar 1, 2018
src/uniqlist.c Outdated
while (i<n && INTEGER(x)[i]==first) i++;
if (i==n)
return( ScalarInteger( first==NA_INTEGER && narm ? 0 : 1) ); // all one value
Rboolean second = INTEGER(x)[i];
@MichaelChirico (Member) commented Mar 1, 2018

@mattdowle if na.rm, we can short-circuit here if both first and second are not NA_INTEGER

Reply (Member):
it already does that, iiuc, on line 253 with the comment // TRUE and FALSE found before any NA, but na.rm=TRUE so we're done

@MichaelChirico (Member)

@mattdowle indeed, the code is about as straightforward as C gets. It's good to see an isolated example like this; it makes clear all the steps needed not just to write the code but also to register it for compilation and use from R.

@HughParsonage (Member, Author)

Strange. I'm getting slower speeds for rand_2 (about 7% longer times).

@mattdowle (Member) commented Mar 1, 2018

This is what I just got on the first attempt: very small timings and the first one is 14ms vs the subsequent at 4ms.

> system.time(uniqueN(rand_2))
   user  system elapsed 
  0.014   0.000   0.014 
> system.time(uniqueN(rand_2))
   user  system elapsed 
  0.004   0.000   0.004 
> system.time(uniqueN(rand_2))
   user  system elapsed 
  0.003   0.000   0.003 
> system.time(uniqueN(rand_2))
   user  system elapsed 
  0.004   0.000   0.004 
> system.time(uniqueN(rand_2))
   user  system elapsed 
  0.004   0.000   0.004 

Could something like that have happened (one relative outlier)? What N are you using to see 7% slowdown?

IIUC, rand_2 is a vector of all TRUE and FALSE with no NA. But na.rm=FALSE, so it has to do a full vector scan to ensure there are no NA before returning 2. It should find the first TRUE and FALSE instantly and then spend all its time on line 259 : while(i<n && INTEGER(x)[i]!=NA_INTEGER) i++; which looks optimal to me. Let's check that 7% slowdown figure next and then I should be able to reproduce.

@mattdowle (Member)

Can't think of any other refinements. Will wait for @HughParsonage to confirm the 7% figure before merging.

@HughParsonage (Member, Author) commented Mar 1, 2018

I'm using N = 1e8, but I see the same pattern (unique_N consistently faster) at other vector lengths too:

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
 unique_N(rand_2) 40.03954 41.35899 44.91704 43.47084 46.24647 67.51462   100  a 
  uniqueN(rand_2) 42.39684 44.26684 47.81689 46.35776 47.71230 78.01211   100   b

It might be something to do with the .Primitive functions getting off the line faster (lower call overhead), but that doesn't explain why the gap (in absolute terms) grows at larger N:

N = 2e8

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
 unique_N(rand_2) 80.03691 83.41008 87.53279 84.98237 88.94148 130.6920   100  a 
  uniqueN(rand_2) 83.34744 89.22579 93.28837 90.84552 93.72958 146.7498   100   b

And yes, rand_2 requires a full scan of the vector, so there won't be much improvement over the base R solution in any case. (That said, it is probably the typical use case.)

@mattdowle (Member) commented Mar 1, 2018

Interesting! I tried to reproduce, but I see a much smaller difference. When you test the dev version of data.table, are you compiling with -O3? [All just for fun, since the difference is so small.]

> N
[1] 1e+08
> microbenchmark(uniqueN(rand_2), unique_N(rand_2))
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
  uniqueN(rand_2) 36.58546 37.13947 37.44734 37.31662 37.55191 42.63395   100
 unique_N(rand_2) 36.31593 36.99582 37.36691 37.15949 37.41272 46.47395   100
> N = 2e8
> rand_2 <- sample(c(TRUE, FALSE), size = N, replace = TRUE)
> microbenchmark(uniqueN(rand_2), unique_N(rand_2))
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
  uniqueN(rand_2) 75.27115 75.85308 76.18890 76.08506 76.40229 78.30454   100
 unique_N(rand_2) 74.88188 75.44899 75.87109 75.80231 76.17355 77.68042   100

So I see a 0.5ms difference (<1%) here, where you're seeing 3ms - 14ms (4% - 12%).

@HughParsonage (Member, Author)

Ah, no: just -O2. I can never remember how to set the -O3 flag...

@mattdowle (Member) commented Mar 1, 2018

Worth trying -O3 then for fun. It could be branch prediction, too. Looking at base R, it uses a for loop and an if rather than the while() I've got :

for (i = 0; i < n; i++)
  if (xI[i] == NA_LOGICAL) return TRUE;

That gives the compiler/CPU pipeline a better chance to run ahead. We don't care where it finds the NA_LOGICAL, just whether it does or not, so that might be helping base R.
I agree with you that it's strange the difference grows slightly as N increases. If it were call overhead, it wouldn't grow.

@HughParsonage (Member, Author)

Hmm, a very noob problem, but after setting the optimization flag to -O3 I get an error during the build:

'for' loop initial declarations are only allowed in C99 or C11 mode

(Don't worry if this will take too long.)

@mattdowle (Member) commented Mar 1, 2018

NP. Try passing -std=c99 to the compiler.
See https://github.com/Rdatatable/data.table/blob/master/cc.R#L58 for what I do.
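[Editorial note] For anyone hitting the same build error, a common way to set these flags is a user-level Makevars file that R CMD INSTALL picks up. This is a sketch under assumptions: the path is typically ~/.R/Makevars on Linux/macOS and ~/.R/Makevars.win on Windows, and the exact flag set shown here is just an illustration, not the project's recommendation.

# ~/.R/Makevars (or ~/.R/Makevars.win on Windows)
# -std=c99 allows 'for' loop initial declarations; -O3 enables aggressive optimisation
CFLAGS = -O3 -std=c99 -pipe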

@HughParsonage (Member, Author) commented Mar 1, 2018

Ah so close

c:/Rtools/mingw_64/bin/gcc -m64 -I"C:/PROGRA~1/MICROS~2/ROPEN~1/R-34~1.3/include" -DNDEBUG     -I"C:/swarm/workspace/External-R-3.4.3/vendor/extsoft/include"  -fopenmp   -std=c99 -O3 -pipe -c fread.c -o fread.o

fread.c:14:33: fatal error: sys/mman.h: No such file or directory
   #include <sys/mman.h>  // mmap
                                 ^
compilation terminated.

@mattdowle (Member)

Try: haskell/unix#49

@HughParsonage (Member, Author)

Oh were it that simple on Windows...

I tried the binary version via AppVeyor but I get the same relative timings. I'm guessing base R must be built with more aggressive optimizations. Maybe the binaries distributed by CRAN will close the gap? I can try a bit harder to get -O3 working, but it will probably take me a while.

@mattdowle (Member) commented Mar 2, 2018

OK, please give that a go. Installing the AppVeyor binary is a good call. Installing the package (rather than using cc() as I do in dev) can help a bit, especially if we're only talking 0.5 ms (for me, anyway; more for you). This attempt is to see if it solves the 3-14 ms slowdown that you see.
With R CMD INSTALL and -O3, I see :

> N = 2e8
> rand_2 <- sample(c(TRUE, FALSE), size = N, replace = TRUE)
> microbenchmark(uniqueN(rand_2), unique_N(rand_2))
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
  uniqueN(rand_2) 73.68120 74.29620 74.88501 74.62269 75.16583 80.88640   100
 unique_N(rand_2) 73.40919 74.25058 74.90208 74.60571 75.02107 91.54036   100

Also make sure that nothing else is running on your machine; Chrome is especially heavy. On Windows I don't recall whether system.time() distinguishes user + system from elapsed, and I don't know whether microbenchmark uses user+system as it should, or elapsed. Elapsed (wall-clock) time is affected by other processes on the machine, especially at this tiny scale! So I close Chrome, Slack, Pandora etc., or run on a server. The high max values you see (130 ms) vs a mean of around 90 ms possibly indicate other-process effects like that. It might also be good to always set control=list(order="block") in microbenchmark(). By default it randomizes the order in which the expressions are timed, which can thrash the L1 instruction cache.

@HughParsonage (Member, Author)

Nope:

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
 unique_N(rand_2) 75.32244 76.56615 79.85340 76.89338 79.95288 103.1331   100  a 
  uniqueN(rand_2) 79.60397 81.34507 83.31572 81.55333 83.08195 113.1897   100   b

@mattdowle (Member)

Can you try : microbenchmark(uniqueN(rand_2), unique_N(rand_2), control=list(order="block"))

@HughParsonage (Member, Author)

> microbenchmark(uniqueN(rand_2), unique_N(rand_2), control=list(order="block"))
Unit: milliseconds
             expr      min      lq     mean   median       uq      max neval cld
  uniqueN(rand_2) 81.38844 81.6229 83.15188 81.80060 82.10328 154.4352   100   b
 unique_N(rand_2) 76.48197 76.6827 77.83888 76.81929 77.18733 102.8325   100  a 

@mattdowle (Member)

How odd. I reproduced on Windows too.

> microbenchmark(uniqueN(rand_2), unique_N(rand_2), control=list(order="block"))
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
  uniqueN(rand_2) 51.16815 51.94226 53.48157 52.37371 53.14368 87.33350   100
 unique_N(rand_2) 49.37339 49.95491 50.82088 50.18038 50.52716 82.32456   100

@mattdowle (Member)

The AppVeyor logs show it is using -O2. I tried with -O2 locally, but I see the same timing for each, unlike on Windows. I'm at a loss, then.
What do you think? Are the improvements on the other tests worth going ahead with the C version, or should we revert to the R primitives?

@HughParsonage (Member, Author)

My instinct is to go with the base-R version, just to keep the amount of compiled code to a minimum. We should keep the test suite, of course. But the bulk of the gain over 1.10.4-3 has been achieved either way.

happy to improve my other PR :-) btw

@HughParsonage (Member, Author)

That last commit resulted in a smaller gap:

Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval cld
 unique_N(rand_2) 78.49323 81.89998 87.16853 83.60540 92.77697 110.4502   100   a
  uniqueN(rand_2) 77.91045 81.87996 88.45121 84.70032 93.00993 145.4622   100   a

@mattdowle (Member)

That seems better on my Windows VM :

> microbenchmark(uniqueN(rand_2), unique_N(rand_2), control=list(order="block"))
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
  uniqueN(rand_2) 48.78577 49.35803 50.02199 49.76901 50.29283 60.40149   100
 unique_N(rand_2) 48.68613 49.42341 50.61808 49.82927 50.42004 58.82965   100

@mattdowle (Member)

OK, great. That makes sense then. The int (32-bit) type on a 64-bit CPU has weird effects; it was the last thing that was different from base R. Interesting that it helps a bit more on Windows. Good to discover.

With 5e8 on Windows VM :

> N = 5e8
> rand_2 <- sample(c(TRUE, FALSE), size = N, replace = TRUE)
> microbenchmark(uniqueN(rand_2), unique_N(rand_2), control=list(order="block"))
Unit: milliseconds
             expr      min       lq     mean   median       uq      max neval
  uniqueN(rand_2) 241.2369 247.9918 253.8572 249.4949 253.8268 292.5559   100
 unique_N(rand_2) 241.3590 248.2655 254.1266 249.9260 254.8916 287.5942   100

@MichaelChirico (Member)

What's the comparison across all of the tests now? If the base-R version is really performing that well, we may as well keep it, no?

@mattdowle (Member) commented Mar 2, 2018

Where does that leave us across all tests?
@MichaelChirico But this is the worst case of a full vector scan. I thought there were some cases where the R version needed several passes which the C version saves.

@HughParsonage (Member, Author)

Matt's version is much faster for all but the rand_2 case:

[benchmark comparison chart image]

@HughParsonage (Member, Author)

(And probably faster for the rand_2 case too.)

@MichaelChirico (Member)

LGTM

@mattdowle (Member) commented Mar 2, 2018

This case shows the double pass I had in mind. (Which I now realize is covered by the chart!)

> N = 1e8
> end = c(rep(TRUE,N),FALSE,NA)
> uniqueN(end)
[1] 3
> unique_N(end)
[1] 3
> microbenchmark(uniqueN(end), unique_N(end), control=list(order="block"))
Unit: milliseconds
          expr      min        lq      mean    median        uq       max neval
  uniqueN(end) 49.81174  50.12209  51.06473  50.33556  50.69376  87.14682   100
 unique_N(end) 99.37851 100.19555 101.59117 100.76662 101.57558 120.54343   100

> end = c(rep(NA,N),TRUE,FALSE)
> uniqueN(end)
[1] 3
> unique_N(end)
[1] 3
> microbenchmark(uniqueN(end), unique_N(end), control=list(order="block"))
Unit: milliseconds
          expr       min        lq      mean    median        uq       max neval
  uniqueN(end)  48.80704  49.28497  49.81357  49.62486  50.15754  56.29367   100
 unique_N(end) 109.08013 110.32095 112.15381 110.96922 111.82071 131.69756   100

@mattdowle (Member)

I did start to regret biting on this one. But the unexpected R_xlen_t / int discovery could be useful in other places too. Good to have some fun sometimes.
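[Editorial note] To illustrate the R_xlen_t / int point for readers, here is a hypothetical, simplified contrast, not the actual commit in this PR: the function names are made up, and which variant any given compiler prefers will depend on the platform and flags.

#include <R.h>
#include <Rinternals.h>

// Variant A: 32-bit loop counter, with the data accessor called inside the hot loop.
static int scan_to_na_int(SEXP x) {
  int n = LENGTH(x), i = 0;
  while (i < n && LOGICAL(x)[i] != NA_LOGICAL) i++;
  return i;
}

// Variant B: native-width (R_xlen_t) counter and a data pointer hoisted out of the
// loop -- the kind of difference discussed above for 64-bit builds.
static R_xlen_t scan_to_na_xlen(SEXP x) {
  const int *xd = LOGICAL(x);
  R_xlen_t n = XLENGTH(x), i = 0;
  while (i < n && xd[i] != NA_LOGICAL) i++;
  return i;
}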

@mattdowle (Member) commented Mar 2, 2018

But what I did here in C is negligible compared to Hugh's original, which for the vast majority of real-world cases is an effectively infinite improvement and unbeatable. I've put a simplified version of this into NEWS to convey how big an improvement it is. [A few other news items could do with similar timings being added, just to convey the order of the improvement, very briefly.]

N = 1e9
x = c(TRUE,FALSE,NA,rep(TRUE,N))
                             v1.10.4    Hugh(R)   Matt(C)
uniqueN(x) == 3                 5.4s     0.000s    0.000s
x = c(TRUE,rep(FALSE,N), NA)
uniqueN(x,na.rm=TRUE) == 2      5.4s     0.000s    0.000s
x = c(rep(TRUE,N),FALSE,NA)
uniqueN(x) == 3                 6.7s      1.14s     0.38s    # rare edge case

@mattdowle mattdowle merged commit a664ea4 into master Mar 2, 2018
@mattdowle mattdowle deleted the uniqueN-faster-logicals branch March 2, 2018 07:03