Scope for %plike% #3702
Comments
Maybe we could just set perl = TRUE by default in the current %like%? Off the top
of my head, that shouldn't break anything?
…On Sat, Jul 13, 2019, 10:53 AM Kyle Haynes ***@***.***> wrote:
Just reading the new dev notes and noticed #3333. I was actually going to
feature request %likep% (it would make sense to conform to %plike%) the
other day, but decided against it (I thought maybe the consensus was that
fewer convenience wrappers were better for data.table). Is there any
particular reason why data.table can't incorporate another, leveraging the
perl = TRUE argument?
Often you get considerable speed improvements, and a bunch of other
features / behaviors
<https://stackoverflow.com/questions/47240375/regular-expressions-in-base-r-perl-true-vs-the-default-pcre-vs-tre>
# The following packages are required.
# install.packages(c("stringi", "microbenchmark"))
# Load data.table.
library(data.table)
# Create a data.table of 100,000 random strings (20 chars in length).
DT = data.table(x = stringi::stri_rand_strings(100000, 20))
# Define a trivial regex pattern.
regex_pattern = "car|blah|far|nah"
# Create an alternative to %like% that sets `perl = TRUE`.
`%likep%` = function (vector, pattern) {
  if (is.factor(vector)) {
    # Match against the levels once, then map the hits back to the values.
    as.integer(vector) %in% grep(pattern, levels(vector), perl = TRUE)
  }
  else {
    grepl(pattern, vector, perl = TRUE)
  }
}
# Microbenchmark the results to demonstrate speed improvements.
microbenchmark::microbenchmark(like = DT[x %like% regex_pattern], likep = DT[x %likep% regex_pattern])
# Unit: milliseconds
#   expr     min       lq     mean   median       uq      max neval
#   like 84.1235 86.56265 91.51547 87.74410 91.16710 159.6292   100
#  likep 16.0932 16.64750 17.81476 16.95985 17.82195  34.1415   100
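It would throw a warning in %flike% as it's conflicting to declare both as TRUE.
My only justification for another function is to ...
1. Retain the default behavior of %like%, so backwards compatibility
2. It *feels* right that %like% should inherit base-R behavior.
The above points are obviously my own opinions, not sure if they completely reflect that of the data.table authors / users..? |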
I'm not sure there's any backwards compatibility issue (in my experience,
Perl regexes in R are a superset of the non-Perl ones?).
If so, PCRE are (a) more flexible and, as you say, (b) more efficient; I don't
see why not to diverge from base (we have like() already, after all).
like() would retain the perl option, and %flike% would be made to have the
underlying like() call with perl = FALSE.
…On Sun, Jul 14, 2019, 4:15 AM Kyle Haynes ***@***.***> wrote:
It would throw a warning in %flike% as it's conflicting to declare both
as TRUE.
My only justification for another function is to ...
1. Retain the default behavior of %like%, so backwards compatibility
2. It *feels* right that %like% should inherit base-R behavior.
The above points are obviously my own opinions, not sure if they
completely reflect that of the data.table authors / users..?
Not entirely. The Perl variant often appears to extend TRE (mainly with lookarounds), but they do have slightly different behaviors:
# TRE matches . to any literal character, including line break:
grepl(".", "\n", perl = F)
[1] TRUE
# Perl does not:
grepl(".", "\n", perl = T)
[1] FALSE
TRE also offers stuff that Perl doesn't. E.g.:
grepl("[:alpha:]","a", perl = F)
[1] TRUE
grepl("[:alpha:]","a", perl = T)
#Error in grepl("[:alpha:]", "1", perl = T) :
# invalid regular expression '[:alpha:]'
# In addition: Warning message:
# In grepl("[:alpha:]", "1", perl = T) : PCRE pattern compilation error
# 'POSIX named classes are supported only within a class'
# at '[:alpha:]'
I might be being daft but
Yeah, that's certainly possible. I'm happy with whatever direction you guys wanna go in, I just would be super stoked if |
Yes, what I'm getting at is that like() is ~95% a synonym for grepl(); one of the advantages of having split from that is we can choose our own defaults (with an eye to backwards compatibility).
Thanks for spelling out the PCRE/TRE differences so clearly; I'm guessing that in fact we'd be breaking some code if we changed the default.
For the examples you gave, is there a workaround to accomplish the same RE in Perl? If so, and we're really insistent on using Perl, we could do a deprecation cycle.
I'm not opposed to %plike%, just trying to make sure there's some orthogonal value-add as discipline to prevent the %[a-z]*like% family from exploding. |
Ahh right. That makes sense. If the direction was to be insistent on using perl in %like%, would a workaround be to declare the arguments in the convenience function and have doco to indicate how easy it is to change these (if required), or is that too wishy-washy?
# %like%, new behavior as `perl = TRUE`:
`%like%` = function (vector, pattern, perl = TRUE) {
  if (is.factor(vector)) {
    as.integer(vector) %in% grep(pattern, levels(vector), perl = perl)
  }
  else {
    grepl(pattern, vector, perl = perl)
  }
}
# Some old dataset.
iris = as.data.table(copy(iris))
iris[, x := "\n"]
# Check that . matches any character, like it used to ...
iris[x %like% "."]
# Ohh no! It doesn't. As a user of R that's aware default arguments can
# change over time, I'd immediately look at help ?'%like%', which would
# indicate that if you want to revert to the previous behavior, do the following...
# Change default back to FALSE.
formals(`%like%`)$perl = FALSE
# Re-run.
iris[x %like% "."]
# Phew, back to previous behavior.
Totally agree, less is more. If you consider the above a potential solution, you could pass all arguments and allow the user to dictate if required. To extend on this, I already pondered ignore.case = T & perl = T (as I often use it with grepl), but having another convenience function around that would probably result in it being less of a convenience and more of a burden of remembering what it was called. |
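To make the "pass all arguments" idea above concrete, here is a minimal base-R sketch. The name like_sketch is hypothetical and is not data.table's API; it only illustrates forwarding ignore.case and perl to grep()/grepl() while keeping the factor shortcut from the snippets in this thread.

```r
# Hypothetical helper (not part of data.table): forwards the regex options
# to base R instead of hard-coding them in a %...like% operator.
like_sketch <- function(vector, pattern, ignore.case = FALSE, perl = FALSE) {
  if (is.factor(vector)) {
    # Match once per level, then map the matching level indices back
    # onto the underlying integer codes.
    hits <- grep(pattern, levels(vector), ignore.case = ignore.case, perl = perl)
    as.integer(vector) %in% hits
  } else {
    grepl(pattern, vector, ignore.case = ignore.case, perl = perl)
  }
}

x <- c("Setosa", "versicolor", NA)
res_default <- like_sketch(x, "setosa")                      # case-sensitive
res_ci      <- like_sketch(x, "setosa", ignore.case = TRUE)  # caseless
res_factor  <- like_sketch(factor(x), "^s", ignore.case = TRUE, perl = TRUE)
```

With a function form like this, the infix operators could stay few in number, and anyone needing a non-default combination would call the function directly rather than remembering another operator name.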
For the examples above at least, you can make TRE and PCRE match with small modifications. If only these and other minor differences existed, you could introduce data.table options which prefixed the pattern with the appropriate modifier. Adding some of these to the ?like manual page might be a good idea.
Not mentioned here, but in #3552 was caseless and fixed string matching. You can indicate caseless using
I'm not aware of a flag to indicate fixed matching, though wrapping regex |
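As an illustration of the "small modifications" idea, the base-R sketch below revisits the two TRE/PCRE differences raised earlier in the thread and adds caseless and literal matching. The (?s) and (?i) inline flags and the \Q...\E quoting are standard PCRE features; whether they are exactly the ones the comment above had in mind is an assumption.

```r
# Base R only; no data.table needed.

# 1. Dot vs. newline: PCRE's (?s) ("dotall") flag lets `.` match "\n",
#    which is what TRE does by default in grepl().
dot_tre  <- grepl(".", "\n", perl = FALSE)
dot_pcre <- grepl("(?s).", "\n", perl = TRUE)

# 2. POSIX classes: written inside a bracket expression, [[:alpha:]]
#    is accepted by both engines.
alpha_tre  <- grepl("[[:alpha:]]", "a", perl = FALSE)
alpha_pcre <- grepl("[[:alpha:]]", "a", perl = TRUE)
digit_pcre <- grepl("[[:alpha:]]", "1", perl = TRUE)

# 3. Caseless matching in PCRE via the (?i) inline flag.
ci_pcre <- grepl("(?i)setosa", "SETOSA", perl = TRUE)

# 4. Literal matching in PCRE by quoting the pattern with \Q...\E,
#    so the `.` in "a.b" is not treated as a wildcard.
lit_hit  <- grepl("\\Qa.b\\E", "a.b", perl = TRUE)
lit_miss <- grepl("\\Qa.b\\E", "axb", perl = TRUE)
```

Each of these is a pattern-level change, so it would work through an operator such as %like% without adding new arguments.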
@smingerson thanks for your input, that's super neat! I'm totally happy to let @MichaelChirico indicate the best way forward, as I have no real strong opinions and am obviously not a key contributor to (my fav R package)
I do agree with both of your points as well. |
One major feature PCRE has that TRE does not (as far as I can tell), is recursion. I've never had occasion to use this, but it looks like one major difference.
# Example regex from https://www.regular-expressions.info/recurse.html
to_match <- "(((why yes))))"
grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = TRUE)
#> [1] TRUE
# Does not match because parentheses are unbalanced
# and we require them to be by the addition of [^)] at the end.
grepl("\\((?>[^()]|(?R))*\\)[^)]", to_match, perl = TRUE)
#> [1] FALSE
grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE)
#> Error in grepl("\\((?>[^()]|(?R))*\\)", to_match, perl = FALSE): invalid regular expression '\((?>[^()]|(?R))*\)', reason 'Invalid regexp'
TRE has a feature PCRE lacks in approximate regex matching, but since that is not accessed through |
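For reference, approximate matching in base R is exposed through agrep()/agrepl() rather than grepl(). A minimal sketch, using the pattern and string from the ?agrep help page examples:

```r
# grepl() needs the literal substring "lasy", which is not present.
exact_match  <- grepl("lasy", "1 lazy 2")

# agrepl() allows a small edit distance (default max.distance = 0.1),
# so "lazy" counts as an approximate match for "lasy".
approx_match <- agrepl("lasy", "1 lazy 2")
```

Because this goes through a separate function, changing the perl default of a grepl()-based %like% would not affect it.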
I think it is much safer to leave |
@MichaelChirico - if you agree, I'm more than happy to create a pull request. |
@KyleHaynes you are welcome, you are already in contributors team so please push to Rdatatable/data.table repo directly |