Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scope for %plike% #3702

Closed
KyleHaynes opened this issue Jul 13, 2019 · 12 comments · Fixed by #4129
Closed

Scope for %plike% #3702

KyleHaynes opened this issue Jul 13, 2019 · 12 comments · Fixed by #4129

Comments

@KyleHaynes
Copy link
Contributor

KyleHaynes commented Jul 13, 2019

Just reading the new dev notes and noticed #3333. I was going to actually feature request %likep% (would make sense to conform to %plike%) the other day, but decided against it (thought maybe the consensus was that less convenience wrappers were more ideal for data.table. Any particular reason why data.table can't incorporate another, leveraging the perl = TRUE argument?

Often you get considerable speed improvements, and a bunch of other features / behaviors

# Following packages required .
# install.packages(c("stringi", "microbenchmark")

# load data.table.
library(data.table)

# Create a data.table of 100,000 random strings (20 chars in length).
DT = data.table(x = stringi::stri_rand_strings(100000, 20))

# Define a trivial regex pattern.
regex_pattern = "car|blah|far|nah"

# Create an alternative to %like% that sets `perl = TRUE`.
`%likep%` = function (vector, pattern) {
    if (is.factor(vector)) {
        as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE)
    }
    else {
        grepl(pattern, vector, perl = TRUE)
    }
}

# Microbenchmark the results to demonstrate speed improvements.
microbenchmark::microbenchmark(like = {(DT[x %like% regex_pattern])}, likep = (DT[x %likep% regex_pattern]))
# Unit: milliseconds
#   expr     min       lq     mean   median       uq      max neval
#   like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292   100
#  likep 16.0932 16.64750 17.81476 16.95985 17.82195  34.1415   100
@MichaelChirico
Copy link
Member

MichaelChirico commented Jul 13, 2019 via email

@KyleHaynes
Copy link
Contributor Author

It would throw a warning in %flike% as it's conflicting to declare both as TRUE.

My only justification for another function is to ...

  1. Retain the default behavior of %like%, so backwards compatibility
  2. It feels right that %like% should inherit base-R behavior.

The above points are obviously my own opinions, not sure if they completely reflect that of the data.table authors / users..?

@MichaelChirico
Copy link
Member

MichaelChirico commented Jul 13, 2019 via email

@KyleHaynes
Copy link
Contributor Author

I'm not sure there's any backwards compatibility issue (in my experience
Perl regex in R are a superset of non-perl?)

Not entirely. The Perl variant often appears to extend on TRE (mainly with lookarounds), but, they do have slight different behaviors:

# TRE matches . to any literal character, including line break:
grepl(".", "\n", perl = F)
[1] TRUE

# Perl does not:
grepl(".", "\n", perl = T)
[1] FALSE

TRE also offers stuff that Perl doesn't. E.g.:

grepl("[:alpha:]","a", perl = F)
[1] TRUE

grepl("[:alpha:]","a", perl = T)
#Error in grepl("[:alpha:]", "1", perl = T) : 
# invalid regular expression '[:alpha:]'
# In addition: Warning message:
# In grepl("[:alpha:]", "1", perl = T) : PCRE pattern compilation error
#       'POSIX named classes are supported only within a class'
#        at '[:alpha:]'

if so, pcre are (a) more flexible and as you say (b) more efficient, don't
see why not to diverge from base (we have like already, after all)

I might be being daft but %like% doesn't diverge, like it's a convenience wrapper obviously, bu it currently inherits all default arguments?

like() would retain Perl option and %flike% would be made to have the
underlying like() call with Perl=false

Yeah, that's certainly possible.

I'm happy with whatever direction you guys wanna go in, I just would be super stoked if perl = TRUE was incorporated either in %like% or as another convenience wrapper.

@MichaelChirico
Copy link
Member

bu it currently inherits all default arguments?

yes, what I'm getting at is that like() is ~95% a synonym for grepl(), one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility)

thanks for spelling out the pre/tre differences so clearly; I'm guessing that in fact we'd be breaking some code if we changed the default. the examples you gave, is there a workaround to accomplish the same RE in perl? if so, and we're really insistent on using perl, we could do a deprecation cycle.

I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding

@KyleHaynes
Copy link
Contributor Author

KyleHaynes commented Jul 13, 2019

yes, what I'm getting at is that like() is ~95% a synonym for grepl(), one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility)

Ahh right. That makes sense. If the direction was to be insistent on using perl in %like%, would a workaround be to declare the arguments in the convenience function and have doco to indicate how easy it is to change these (if required), or is that too wishy-washy?

# %like%, new behavior as `perl = TRUE`:
`%like%` = function (vector, pattern, perl = TRUE) {
    if (is.factor(vector)) {
        as.integer(vector) %in% grep(pattern, levels(vector), perl = perl)
    }
    else {
        grepl(pattern, vector, perl = perl)
    }
}

# Some old dataset.
iris = as.data.table(copy(iris))
iris[, x := "\n"]

# Check that . matches any character, like it used to ...
iris[x %like% "."]
# Ohh no! It doesn't. As a user of R that's aware default arguments can change overtime, I'd immediately look at help ?'%like%' to which it would indicate if you want to revert to previous behavior then do the following...

# Change default back to FALSE.
formals(`%like%`)$perl = FALSE

# Re-run.
iris[x %like% "."]
# Phew, back to previous behavior.

I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding

Totally agree, less is more. If you consider the above a potential solution, you could pass all arguments and allow the user to dictate if required. To extend on this I already pondered ignore.case = T & perl = T (as I often use it with grepl), but having another convenience function around that would probably result in it being less of a convenience and more of a burden of remembering what it was called.

@smingerson
Copy link

smingerson commented Jul 14, 2019

For the examples above at least, you can make TRE and PCRE match with small modifications. If only these and other minor differences existed, you could introduce data.table options which prefixed the pattern with the appropriate modifier.

Adding some of these to the ?like manual page might be a good idea. ?grepl and ?regex have some of the densest, arcane help pages -- with excellent reason.

> # alternates "single-line" modifier from the one used at compile time.
> # Use (?m) instead of (?s) for multi-line (Though this is the default for R.)
> # Can be combined, so (?im) sets caseless matching.
> grepl("(?s).", "\n", perl = TRUE)
[1] TRUE
> grepl(".", "\n", perl = FALSE)
[1] TRUE
> # Don't match new-line under TRE
> grepl("(?n).", "\n", perl = FALSE)
[1] FALSE
> 
> # Enclose character classes within an extra set of brackets.
> # Double brackets also works under TRE.
> grepl("[[:alpha:]]","a", perl = TRUE)
[1] TRUE
> grepl("[:alpha:]","a", perl = FALSE)
[1] TRUE

Not mentioned here, but in #3552 was caseless and fixed string matching. You can indicate caseless using (?i).

> grepl("this does not match", 'THIS DOES NOT MATCH')
[1] FALSE
> grepl('(?i)this does match', 'THIS DOES MATCH')
[1] TRUE

I'm not aware of a flag to indicate fixed matching, though wrapping regex ... in \\Q...\\E will indicate to the engine it contains no meta characters (except PERL uses $ and @ for interpolation inside this, see ?regex in R). I don't think this is the same as passing fixed = TRUE, because there is virtually no speed difference.

@KyleHaynes
Copy link
Contributor Author

@smingerson thanks for you input, that's super neat! I'm totally happy to let @MichaelChirico indicate the best way forward, as I have no real strong opinions and am obviously not a key contributor to (my fav R package) data.table.

Adding some of these to the ?like manual page might be a good idea. ?grepl and ?regex have some of the densest, arcane help pages -- with excellent reason.

I do agree with both of your points as well.

@smingerson
Copy link

smingerson commented Jul 16, 2019

One major feature PCRE has that TRE does not (as far as I can tell), is recursion. I've never had occasion to use this, but it looks like one major difference.

# Example regex from https://www.regular-expressions.info/recurse.html
to_match <- "(((why yes))))"
grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = TRUE)
#> [1] TRUE
# Does not match because parentheses are unbalanced
# and we require them to be by the addition of [^)] at the end.
grepl("\\((?>[^()]|(?R))*\\)[^)]", to_match, perl = TRUE)
#> [1] FALSE

grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE)
#> Error in grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE): invalid regular expression '\((?>[^()]|(?R))*\)', reason 'Invalid regexp'

TRE has a feature PCRE lacks in approximate regex matching, but since that is not accessed through grepl() I don't know whether that point has any value.

@jangorecki
Copy link
Member

I think it is much safer to leave %like% as it is now, and just add %plike%.

@KyleHaynes
Copy link
Contributor Author

@MichaelChirico - if you agree, I'm more than happy to create a pull request.

@jangorecki
Copy link
Member

jangorecki commented Dec 18, 2019

@KyleHaynes you are welcome, you are already in contributors team so please push to Rdatatable/data.table repo directly

@mattdowle mattdowle added this to the 1.14.1 milestone Aug 4, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants