Title: Comparing namedCapture with other R packages for regular expressions
Abstract: Regular expressions are powerful tools for manipulating non-tabular textual data. For many tasks (visualization, machine learning, etc), tables of numbers must be extracted from such data before processing by other R functions. We present the R package namedCapture, which facilitates such tasks by providing a new user-friendly syntax for defining regular expressions in R code. We begin by describing the history of regular expressions and their usage in R. We then describe the new features of the namedCapture package, and provide detailed comparisons with related R packages (rex, stringr, stringi, tidyr, rematch2, re2r).
Article accepted at R Journal. Errata which have been corrected in this local copy:
- The publised version states “Like TRE, the RE2 library guarantees linear time complexity” which is misleading. Actually TRE does NOT guarantee linear time complexity if the pattern has backreferences. In fact, this is true of any regex engine; because RE2 does not support backreferences, it can guarantee time complexity which is linear in the subject and pattern size.
Main source file to edit is hocking.Rnw and Makefile takes care of creating submission.zip
R-core member Tomas Kalibera has updated R so that it will use linear time algorithms for both substring and gregexpr, starting with R-3.6.0.
- substring https://github.com/wch/r-source/commit/4931902af83938a21f4d3f8fa0fce81ad062f2c1
- gregexpr https://github.com/wch/r-source/commit/98d247645ba95270bd14790284500a699caebf52
- Some minor fixes, move Figure with C libraries to beginning.
- main file to edit is now hocking.Rnw and Makefile takes care of creating submission.zip
Submitted. New source file hocking.tex generated manually from HOCKING-namedCapture.Rnw in order to work with R journal submission requirements, via RJwrapper.tex
figure-trackDb-pkgs.R produces the following figure, which
shows that applying the gregexpr_perl
patch recovers the linear time
complexity for the gregexpr function, but the other functions/packages
(rematch2, namedCapture, rex) which use it to extract captured
sub-patterns are still at quadratic time complexity:
The culprit is the substring
function in R, which has time
complexity that is quadratic as the number of substrings and the
subject text size increases. The code figure-substring-bug.R
produces the following figure which shows the quadratic time
complexity of the base R substring function, and the linear time
complexity of the stringi::stri_sub function:
I suspect the issue is again related to repeated calls to strlen in C.
linear-time-gregexpr-perl.patch moves strlen before the while loop, and reduces the time complexity from quadratic (before the patch)
to linear (after applying the patch to R-3.5.1)
Code to reproduce the figure above: figure-trackDb-Ronly.R
Found https://stackoverflow.com/questions/31216299/r-faster-gregexpr-for-very-large-strings and http://r.789695.n4.nabble.com/strsplit-perl-TRUE-gregexpr-perl-TRUE-very-slow-for-long-strings-td4727902.html maybe give PCRE_NO_UTF8_CHECK flag? https://stat.ethz.ch/pipermail/r-help/2008-October/178451.html
Tried R-2.14 – first version with PCRE named capture support – still slow.
TODO try converting pcre_demo.c to R pkg.
Investigation about R-PCRE slowness.
- Does not seem to be an issue with PCRE itself – I compiled command line programs that use PCRE1 and PCRE2 (without R) which have asymptotic timings similar to R-TRE.
- One difference is that pcre_demo.c uses non-zero options, whereas grep.c in R source code always uses options=0. However this does not seem to be the issue because options are only used when a zero-length match is encountered, which does not happen for my data.
- Another difference is that gregexpr_perl in grep.c in R source code uses a buffer of 1024 matches to begin with, and then doubles it if it is not big enough, whereas the command line pcre_demo.c program uses no such buffers and memory reallocation. However that does not seem to be the issue because gregexpr_Regexc (TRE) in grep.c in R source code does the same thing, and is not slow.
- A difference between PCRE and TRE in R is that PCRE calls extract_match_and_groups whereas TRE does not. However that does not seem to be the issue, because groups are extracted in the command line program.
- A difference is that pcre_demo does not study whereas R does. That does not seem to be the issue (changed to NULL in R grep.c but still slow).
- Is pcre_exec being called with different starts? No…
- It doesn’t Have to do with the if(capture_count) block because it is slow even when there are no capture groups.
- What is the difference bewteen regexpr_perl and gregexpr_perl? There does not seem to be any speed issue with regexpr_perl.
- TODO run a profiler to find out where the slow part is.
Observed that PCRE is much slower than other regex engines for finding all matches in a long string of text. Is this a bug?
- Example subject: trackDb-31622.txt
- R timing script: trackDb.gregexpr.R
- Python timing script: trackDb.py
Using both R and Python interfaces, I observed that matching is much slower for PCRE:
- PCRE from R = median 0.57 seconds.
- TRE from R = median 0.03 seconds.
> stats.dt[subject.size==max(subject.size)]
expr subject.size median q25 q75
1: perl=TRUE 31622.78 0.5658699 0.54575069 0.57090807
2: perl=FALSE 31622.78 0.0334391 0.03343219 0.03345432
>
Asymptotic analysis:
In Python:
tdhock@recycled:~/projects/namedCapture-article(master*)$ python trackDb.py trackDb.txt get_list('re2') 3302 function calls in 0.034 seconds ... get_list('re') 3549 function calls (3541 primitive calls) in 0.019 seconds ... get_list('pcre') 9789 function calls in 3.809 seconds ... tdhock@recycled:~/projects/namedCapture-article(master*)$