Scope for %plike% #3702

KyleHaynes · 2019-07-13T02:53:20Z

Just reading the new dev notes and noticed #3333. I was going to actually feature request %likep% (would make sense to conform to %plike%) the other day, but decided against it (thought maybe the consensus was that less convenience wrappers were more ideal for data.table. Any particular reason why data.table can't incorporate another, leveraging the perl = TRUE argument?

Often you get considerable speed improvements, and a bunch of other features / behaviors

# Following packages required .
# install.packages(c("stringi", "microbenchmark")

# load data.table.
library(data.table)

# Create a data.table of 100,000 random strings (20 chars in length).
DT = data.table(x = stringi::stri_rand_strings(100000, 20))

# Define a trivial regex pattern.
regex_pattern = "car|blah|far|nah"

# Create an alternative to %like% that sets `perl = TRUE`.
`%likep%` = function (vector, pattern) {
    if (is.factor(vector)) {
        as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE)
    }
    else {
        grepl(pattern, vector, perl = TRUE)
    }
}

# Microbenchmark the results to demonstrate speed improvements.
microbenchmark::microbenchmark(like = {(DT[x %like% regex_pattern])}, likep = (DT[x %likep% regex_pattern]))
# Unit: milliseconds
#   expr     min       lq     mean   median       uq      max neval
#   like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292   100
#  likep 16.0932 16.64750 17.81476 16.95985 17.82195  34.1415   100

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2019-07-13T10:32:08Z

maybe we could just set Perl=TRUE by default to current %like%? off the top of my head, that shouldn't break anything?

…

On Sat, Jul 13, 2019, 10:53 AM Kyle Haynes ***@***.***> wrote: Just reading the new dev notes and noticed #3333 <#3333>. I was going to actually feature request %likep% (would make sense to conform to %plike%) the other day, but decided against it (thought maybe the consensus was that less convenience wrappers were more ideal for data.table. Any particular reason why data.table can't incorporate another, leveraging the perl = TRUE argument? Often you get considerable speed improvements, and a bunch of other features / behaviors <https://stackoverflow.com/questions/47240375/regular-expressions-in-base-r-perl-true-vs-the-default-pcre-vs-tre> # Following packages required .# install.packages(c("stringi", "microbenchmark") # load data.table. library(data.table) # Create a data.table of 100,000 random strings (20 chars in length).DT = data.table(x = stringi::stri_rand_strings(100000, 20)) # Define a trivial regex pattern.regex_pattern = "car|blah|far|nah" # Create an alternative to %like% that sets `perl = TRUE`. `%likep%` = function (vector, pattern) { if (is.factor(vector)) { as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE) } else { grepl(pattern, vector, perl = T) } } # Microbenchmark the results to demonstrate speed improvements.microbenchmark::microbenchmark(like = {(DT[x %like% regex_pattern])}, likep = (DT[x %likep% regex_pattern]))# Unit: milliseconds# expr min lq mean median uq max neval# like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292 100# likep 16.0932 16.64750 17.81476 16.95985 17.82195 34.1415 100 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3702?email_source=notifications&email_token=AB2BA5OBGQC2EUF4P2W7XA3P7E7SLA5CNFSM4ICVLFI2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G6752RQ>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AB2BA5MB3YQFK22HOOV76CLP7E7SLANCNFSM4ICVLFIQ> .

KyleHaynes · 2019-07-13T20:15:10Z

It would throw a warning in %flike% as it's conflicting to declare both as TRUE.

My only justification for another function is to ...

Retain the default behavior of %like%, so backwards compatibility
It feels right that %like% should inherit base-R behavior.

The above points are obviously my own opinions, not sure if they completely reflect that of the data.table authors / users..?

MichaelChirico · 2019-07-13T20:51:05Z

I'm not sure there's any backwards compatibility issue (in my experience Perl regex in R are a superset of non-perl?) if so, pcre are (a) more flexible and as you say (b) more efficient, don't see why not to diverge from base (we have like already, after all) like() would retain Perl option and %flike% would be made to have the underlying like() call with Perl=false

…

On Sun, Jul 14, 2019, 4:15 AM Kyle Haynes ***@***.***> wrote: It would throw a warning in %flike% as it's conflicting to declare both as TRUE. My only justification for another function is to ... 1. Retain the default behavior of %like%, so backwards compatibility 2. It *feels* right that %like% should inherit base-R behavior. The above points are obviously my own opinions, not sure if they completely reflect that of the data.table authors / users..? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#3702?email_source=notifications&email_token=AB2BA5LJOGVFB3X64BI6I3TP7IZVJA5CNFSM4ICVLFI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ3YUVI#issuecomment-511150677>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AB2BA5LDDNWBBFDKLRK2JXLP7IZVJANCNFSM4ICVLFIQ> .

KyleHaynes · 2019-07-13T22:30:51Z

I'm not sure there's any backwards compatibility issue (in my experience
Perl regex in R are a superset of non-perl?)

Not entirely. The Perl variant often appears to extend on TRE (mainly with lookarounds), but, they do have slight different behaviors:

# TRE matches . to any literal character, including line break:
grepl(".", "\n", perl = F)
[1] TRUE

# Perl does not:
grepl(".", "\n", perl = T)
[1] FALSE

TRE also offers stuff that Perl doesn't. E.g.:

grepl("[:alpha:]","a", perl = F)
[1] TRUE

grepl("[:alpha:]","a", perl = T)
#Error in grepl("[:alpha:]", "1", perl = T) : 
# invalid regular expression '[:alpha:]'
# In addition: Warning message:
# In grepl("[:alpha:]", "1", perl = T) : PCRE pattern compilation error
#       'POSIX named classes are supported only within a class'
#        at '[:alpha:]'

if so, pcre are (a) more flexible and as you say (b) more efficient, don't
see why not to diverge from base (we have like already, after all)

I might be being daft but %like% doesn't diverge, like it's a convenience wrapper obviously, bu it currently inherits all default arguments?

like() would retain Perl option and %flike% would be made to have the
underlying like() call with Perl=false

Yeah, that's certainly possible.

I'm happy with whatever direction you guys wanna go in, I just would be super stoked if perl = TRUE was incorporated either in %like% or as another convenience wrapper.

MichaelChirico · 2019-07-13T22:41:06Z

bu it currently inherits all default arguments?

yes, what I'm getting at is that like() is ~95% a synonym for grepl(), one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility)

thanks for spelling out the pre/tre differences so clearly; I'm guessing that in fact we'd be breaking some code if we changed the default. the examples you gave, is there a workaround to accomplish the same RE in perl? if so, and we're really insistent on using perl, we could do a deprecation cycle.

I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding

KyleHaynes · 2019-07-13T23:39:40Z

yes, what I'm getting at is that like() is ~95% a synonym for grepl(), one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility)

Ahh right. That makes sense. If the direction was to be insistent on using perl in %like%, would a workaround be to declare the arguments in the convenience function and have doco to indicate how easy it is to change these (if required), or is that too wishy-washy?

# %like%, new behavior as `perl = TRUE`:
`%like%` = function (vector, pattern, perl = TRUE) {
    if (is.factor(vector)) {
        as.integer(vector) %in% grep(pattern, levels(vector), perl = perl)
    }
    else {
        grepl(pattern, vector, perl = perl)
    }
}

# Some old dataset.
iris = as.data.table(copy(iris))
iris[, x := "\n"]

# Check that . matches any character, like it used to ...
iris[x %like% "."]
# Ohh no! It doesn't. As a user of R that's aware default arguments can change overtime, I'd immediately look at help ?'%like%' to which it would indicate if you want to revert to previous behavior then do the following...

# Change default back to FALSE.
formals(`%like%`)$perl = FALSE

# Re-run.
iris[x %like% "."]
# Phew, back to previous behavior.

I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding

Totally agree, less is more. If you consider the above a potential solution, you could pass all arguments and allow the user to dictate if required. To extend on this I already pondered ignore.case = T & perl = T (as I often use it with grepl), but having another convenience function around that would probably result in it being less of a convenience and more of a burden of remembering what it was called.

smingerson · 2019-07-14T12:32:51Z

For the examples above at least, you can make TRE and PCRE match with small modifications. If only these and other minor differences existed, you could introduce data.table options which prefixed the pattern with the appropriate modifier.

Adding some of these to the ?like manual page might be a good idea. ?grepl and ?regex have some of the densest, arcane help pages -- with excellent reason.

> # alternates "single-line" modifier from the one used at compile time.
> # Use (?m) instead of (?s) for multi-line (Though this is the default for R.)
> # Can be combined, so (?im) sets caseless matching.
> grepl("(?s).", "\n", perl = TRUE)
[1] TRUE
> grepl(".", "\n", perl = FALSE)
[1] TRUE
> # Don't match new-line under TRE
> grepl("(?n).", "\n", perl = FALSE)
[1] FALSE
> 
> # Enclose character classes within an extra set of brackets.
> # Double brackets also works under TRE.
> grepl("[[:alpha:]]","a", perl = TRUE)
[1] TRUE
> grepl("[:alpha:]","a", perl = FALSE)
[1] TRUE

Not mentioned here, but in #3552 was caseless and fixed string matching. You can indicate caseless using (?i).

> grepl("this does not match", 'THIS DOES NOT MATCH')
[1] FALSE
> grepl('(?i)this does match', 'THIS DOES MATCH')
[1] TRUE

I'm not aware of a flag to indicate fixed matching, though wrapping regex ... in \\Q...\\E will indicate to the engine it contains no meta characters (except PERL uses $ and @ for interpolation inside this, see ?regex in R). I don't think this is the same as passing fixed = TRUE, because there is virtually no speed difference.

KyleHaynes · 2019-07-15T07:38:30Z

@smingerson thanks for you input, that's super neat! I'm totally happy to let @MichaelChirico indicate the best way forward, as I have no real strong opinions and am obviously not a key contributor to (my fav R package) data.table.

Adding some of these to the ?like manual page might be a good idea. ?grepl and ?regex have some of the densest, arcane help pages -- with excellent reason.

I do agree with both of your points as well.

smingerson · 2019-07-16T00:58:45Z

One major feature PCRE has that TRE does not (as far as I can tell), is recursion. I've never had occasion to use this, but it looks like one major difference.

# Example regex from https://www.regular-expressions.info/recurse.html
to_match <- "(((why yes))))"
grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = TRUE)
#> [1] TRUE
# Does not match because parentheses are unbalanced
# and we require them to be by the addition of [^)] at the end.
grepl("\\((?>[^()]|(?R))*\\)[^)]", to_match, perl = TRUE)
#> [1] FALSE

grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE)
#> Error in grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE): invalid regular expression '\((?>[^()]|(?R))*\)', reason 'Invalid regexp'

TRE has a feature PCRE lacks in approximate regex matching, but since that is not accessed through grepl() I don't know whether that point has any value.

jangorecki · 2019-11-30T15:34:50Z

I think it is much safer to leave %like% as it is now, and just add %plike%.

KyleHaynes · 2019-12-18T00:27:33Z

@MichaelChirico - if you agree, I'm more than happy to create a pull request.

jangorecki · 2019-12-18T11:15:07Z

@KyleHaynes you are welcome, you are already in contributors team so please push to Rdatatable/data.table repo directly

KyleHaynes mentioned this issue Dec 18, 2019

Adding plike #4129

Merged

mattdowle added this to the 1.14.1 milestone Aug 4, 2021

mattdowle added the feature request label Aug 4, 2021

mattdowle closed this as completed in #4129 Aug 4, 2021

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scope for %plike% #3702

Scope for %plike% #3702

KyleHaynes commented Jul 13, 2019 •

edited

Loading

MichaelChirico commented Jul 13, 2019 via email

KyleHaynes commented Jul 13, 2019

MichaelChirico commented Jul 13, 2019 via email

KyleHaynes commented Jul 13, 2019

MichaelChirico commented Jul 13, 2019

KyleHaynes commented Jul 13, 2019 •

edited

Loading

smingerson commented Jul 14, 2019 •

edited

Loading

KyleHaynes commented Jul 15, 2019

smingerson commented Jul 16, 2019 •

edited

Loading

jangorecki commented Nov 30, 2019

KyleHaynes commented Dec 18, 2019

jangorecki commented Dec 18, 2019 •

edited

Loading

Scope for %plike% #3702

Scope for %plike% #3702

Comments

KyleHaynes commented Jul 13, 2019 • edited Loading

MichaelChirico commented Jul 13, 2019 via email

KyleHaynes commented Jul 13, 2019

MichaelChirico commented Jul 13, 2019 via email

KyleHaynes commented Jul 13, 2019

MichaelChirico commented Jul 13, 2019

KyleHaynes commented Jul 13, 2019 • edited Loading

smingerson commented Jul 14, 2019 • edited Loading

KyleHaynes commented Jul 15, 2019

smingerson commented Jul 16, 2019 • edited Loading

jangorecki commented Nov 30, 2019

KyleHaynes commented Dec 18, 2019

jangorecki commented Dec 18, 2019 • edited Loading

KyleHaynes commented Jul 13, 2019 •

edited

Loading

KyleHaynes commented Jul 13, 2019 •

edited

Loading

smingerson commented Jul 14, 2019 •

edited

Loading

smingerson commented Jul 16, 2019 •

edited

Loading

jangorecki commented Dec 18, 2019 •

edited

Loading