Match regex metacharacters #297

Open
torbiak opened this issue Jan 6, 2020 · 17 comments · Fixed by #625

Comments

@torbiak
Contributor

torbiak commented Jan 6, 2020

Very similar to #159: I can't match a square bracket, asterisk, etc., regardless of how many backslashes I use, because mlr_alloc_double_backslash is doubling them. Adding exceptions for square brackets, like the one already there for period, lets me match them as expected, but it seems we'd need either an exception for every ERE metacharacter or a different approach to escaping backslashes.

$ echo 'a=[' | c/mlr put '$a = gsub($a, "\[", "left_square")'
mlr: could not compile regex "\[" : Invalid regular expression
$ git stash pop
$ git diff
diff --git a/c/lib/mlrutil.c b/c/lib/mlrutil.c
index a532a25e..dfd86e18 100644
--- a/c/lib/mlrutil.c
+++ b/c/lib/mlrutil.c
@@ -502,7 +502,7 @@ char* mlr_alloc_double_backslash(char* input) {
        char* output = mlr_malloc_or_die(input_length + num_backslashes + 1);
        for (p = input, q = output; *p; p++) {
                if (*p == '\\') {
-                       if (p[1] != '.') {
+                       if (p[1] != '.' && p[1] != '[' && p[1] != ']') {
                                *(q++) = *p;
                        }
                        *(q++) = *p;
$ make
$ echo 'a=[' | c/mlr put '$a = gsub($a, "\[", "left_square")'
a=left_square
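To make the failure mode concrete, here is a minimal Python sketch of the same doubling loop. The function name `double_backslashes` is hypothetical, and Python's `re` module is not the POSIX engine Miller calls, so this is only an analogy: it shows that once `\[` becomes `\\[`, the pattern reads as "literal backslash, then an unterminated bracket expression" and no longer compiles.

```python
import re


def double_backslashes(pattern: str) -> str:
    """Hypothetical port of the doubling loop shown in the diff above:
    each backslash is doubled unless the next character is a period."""
    out = []
    for i, ch in enumerate(pattern):
        if ch == "\\":
            nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
            if nxt != ".":
                out.append(ch)  # the extra backslash
        out.append(ch)
    return "".join(out)


def compiles(pattern: str) -> bool:
    """Report whether Python's re (an assumption, not POSIX regcomp) accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


doubled_bracket = double_backslashes(r"\[")   # backslash, backslash, [
doubled_period = double_backslashes(r"\.")    # unchanged because of the exception

print(repr(doubled_bracket), compiles(doubled_bracket))
print(repr(doubled_period), compiles(doubled_period))
```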
@aborruso
Contributor

aborruso commented Jan 6, 2020

Try using

echo 'a=[' | mlr put '$a = gsub($a, "[[]", "left_square")'

In the FAQ http://johnkerl.org/miller/doc/faq.html#How_to_escape_'?'_in_regexes?
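The bracket-expression trick works because the pattern contains no backslash at all, so there is nothing for the doubling step to alter. A small Python sketch of the same idea with other metacharacters (Python's `re` stands in for the POSIX engine here, which is an assumption; the sample text is made up):

```python
import re

text = "a=[.o*o?]"

# A single metacharacter inside [...] is taken literally, no backslash needed.
star = re.sub("[*]", "STAR", text)
dot = re.sub("[.]", "DOT", text)
qmark = re.sub("[?]", "QMARK", text)

print(star)
print(dot)
print(qmark)
```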

@torbiak
Contributor Author

torbiak commented Jan 7, 2020

@aborruso Thanks for the alternative syntax.

Here's my journey:

  • I'm trying to match square brackets and escaping them with backslashes is typically how I do it; this doesn't work in miller, though.
  • This makes me wonder what regex variant miller supports. The main reference says "Miller lets you use regular expressions (of type POSIX.2)..."
  • The regex(7) Linux manpage says an atom can be a '\' followed by one of the characters "^.[$()|*+?{\" (matching that character taken as an ordinary character). POSIX says, referring to special characters: when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. Both apparently contradict the behaviour I'm seeing with miller.
  • So I dug into the code and saw that, before a regex is given to regcomp, backslashes are doubled except before periods. I now understand why escaped square brackets don't work, but I don't understand why there's an exception only for periods, or fully why backslashes are being doubled.
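As a quick check of the rule quoted from regex(7), the sketch below escapes each character in that list with a backslash and confirms it matches itself. It uses Python's `re` rather than POSIX `regcomp`, which is an assumption; the two engines agree on this particular rule, but they differ elsewhere.

```python
import re

# The characters listed in regex(7) as escapable with a backslash.
METACHARS = "^.[$()|*+?{\\"


def escaped_matches_itself(ch: str) -> bool:
    """True if backslash + ch, used as a pattern, matches exactly ch."""
    return re.fullmatch("\\" + ch, ch) is not None


results = {ch: escaped_matches_itself(ch) for ch in METACHARS}
print(results)
```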

@johnkerl
Owner

johnkerl commented Jan 7, 2020

Hi @torbiak -- I'll dig. "What regex variant" -- it's just libregex on whatever platform it's compiled for. Regarding the periods -- I'll look at this. Thanks!!

@torbiak
Contributor Author

torbiak commented Jan 15, 2020

I now see there's a comment in mlrregex.c that clearly explains why regexes are being double-backslashed: it makes it easier to match backslashes. However, it also makes it harder to match all the other metacharacters, and impossible to match them literally using backslashes without adding exceptions to mlr_alloc_double_backslash. Tools like Perl, Awk, and Python provide an alternative syntax that skips the normal escape sequences. This avoids processing certain backslash escapes twice and uses a slightly different set of escapes that are more suitable for regular expressions, which makes things like backreferences easier to use.
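For reference, this is what that alternative syntax looks like in Python. The sketch compares an ordinary string literal with a raw one and shows why backreferences are awkward without raw strings; the sample input `key=value` is made up for illustration.

```python
import re

cooked = "\\*"   # ordinary literal: two backslashes in source for one in the string
raw = r"\*"      # raw literal: the backslash is kept as written
same_pattern = cooked == raw

# In a raw string, \1 and \2 reach the regex engine as backreferences.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")

# In an ordinary literal, "\1" is processed first and becomes the control character U+0001.
control_char = "\1"

print(same_pattern, swapped, repr(control_char))
```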

@hftf

hftf commented May 4, 2020

Is it correct that regex metacharacters, such as the "word boundary" anchor \b, are currently impossible to match? Apologies if this was already covered, but I did not see it in the documentation, and after landing here, I am still trying to make sense of the program flow discussed above. Perhaps it should be mentioned in the documentation until there is a solution (re-implementation of double backslash escaping) or workaround (such as using string substitution functions).

@aborruso
Contributor

aborruso commented May 4, 2020

Hi @hftf in the documentation you have: Miller lets you use regular expressions of type POSIX.2.

I think \b is not supported. I think you can use this syntax: https://www.systutorials.com/docs/linux/man/7-regex/

@hftf

hftf commented May 4, 2020

@aborruso Learned something new, thanks! I wasn’t familiar with POSIX regular expression behavior specifically, and I had just assumed that widespread metacharacters like \b or \w would be supported.

(Edit: There might be a workaround, explained here: https://stackoverflow.com/questions/9792702/does-bash-support-word-boundary-regular-expressions)
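One backslash-free way to approximate a word boundary is to require start-of-string or a non-word character before the word, and a non-word character or end-of-string after it. The sketch below is my own illustration, not taken from the linked answer, and whether Miller accepts the same pattern was not tested here. It assumes the word contains only letters, digits, or underscores, and note that unlike `\b` the pattern consumes the neighbouring characters.

```python
import re


def whole_word_pattern(word: str) -> str:
    """Build a backslash-free pattern; assumes word has only [A-Za-z0-9_] characters."""
    return "(^|[^A-Za-z0-9_])" + word + "([^A-Za-z0-9_]|$)"


def has_whole_word(word: str, text: str) -> bool:
    return re.search(whole_word_pattern(word), text) is not None


print(has_whole_word("cat", "the cat sat"))
print(has_whole_word("cat", "concatenate"))
```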

@johnkerl added the go-port label (Things which will be addressed in the Go port AKA Miller 6) on Aug 11, 2021
@johnkerl
Owner

johnkerl commented Aug 11, 2021

Hi @torbiak -- after a long delay I'm looking at this for the Go port.

Currently:

  • "\t" is converted to TAB etc for all string literals; also \\ -> \
  • Then in mlr_alloc_double_backslash we undo that via \ -> \\ ... with unpleasant side effects as you carefully noted above.

I'm not positive yet but I think the solution might be having that first conversion (\\ -> \ etc) be skipped entirely only for the regex arguments to sub, gsub, regextract, regextract_or_else, =~, and !=~.

This means a context-specific "leave alone" for regex strings so there will be nothing for the Go equivalent of mlr_alloc_double_backslash to undo.

(Note that in Python -- as you pointed out -- there is the r"..." syntax -- I'm proposing an 'implicit r' which would kick in for string literals in regex position in the above six callsites.)
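A rough Python sketch of the proposal, under stated assumptions: `unbackslash` and `literal_value` are hypothetical names, only three escapes are modelled, and unknown escapes are assumed to pass through unchanged. The point is just that a literal in regex position skips the first conversion, so nothing needs undoing later.

```python
def unbackslash(literal: str) -> str:
    """Stage 1 (hypothetical): convert \\t, \\n and \\\\ as written in source
    into TAB, newline and a single backslash. Unknown escapes are assumed
    to be left untouched."""
    known = {"t": "\t", "n": "\n", "\\": "\\"}
    out = []
    i = 0
    while i < len(literal):
        ch = literal[i]
        if ch == "\\" and i + 1 < len(literal) and literal[i + 1] in known:
            out.append(known[literal[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def literal_value(literal: str, in_regex_position: bool) -> str:
    """'Implicit r': literals in regex position are passed through verbatim."""
    return literal if in_regex_position else unbackslash(literal)


print(repr(literal_value(r"a\tb", in_regex_position=False)))
print(repr(literal_value(r"\[", in_regex_position=True)))
```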

@johnkerl
Owner

johnkerl commented Aug 12, 2021

@torbiak

echo 'x=[.o*o.]' | mlr put '$y=gsub($x, "\[", "LEFT")'
x=[.o*o.],y=LEFT.o*o.]

echo 'x=[.o*o.]' | mlr put '$y=gsub($x, ".",  "ALL")'
x=[.o*o.],y=ALLALLALLALLALLALLALL

echo 'x=[.o*o.]' | mlr put '$y=gsub($x, "\.", "DOT")'
x=[.o*o.],y=[DOTo*oDOT]

echo 'x=[.o*o.]' | mlr put '$y=gsub($x, "\*", "STAR")'
x=[.o*o.],y=[.oSTARo.]

echo 'x=[.o*o.]' | mlr put '$y=sub($x,  "\]", "RIGHT")'
x=[.o*o.],y=[.o*o.RIGHT
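For comparison, the same five substitutions written with Python raw-string patterns give the same results. This is only a cross-check against a different engine (Python's `re`, not the Go regex library Miller 6 uses); `count=1` stands in for `sub` as opposed to `gsub`.

```python
import re

x = "[.o*o.]"

left = re.sub(r"\[", "LEFT", x)
everything = re.sub(".", "ALL", x)
dots = re.sub(r"\.", "DOT", x)
star = re.sub(r"\*", "STAR", x)
right = re.sub(r"\]", "RIGHT", x, count=1)  # first match only, like sub

for result in (left, everything, dots, star, right):
    print(result)
```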

This will be in the Go port -- next PR is in prep.

THANK YOU for your thorough research on this! 🙏

@johnkerl
Owner

johnkerl commented Aug 12, 2021

Test case 55209bf

Comments 9286f5a

More doc details about Miller regexes upcoming at https://johnkerl.org/miller6 (subsequent PRs)

johnkerl added a commit that referenced this issue Aug 12, 2021
@torbiak
Contributor Author

torbiak commented Aug 14, 2021

@johnkerl

With the NodeTypeRegex relabelling scheme described in the Comments 9286f5a (leaves.go) link above, would \* still match * if a variable is given to a regex-accepting operator or function instead of a string literal? For example:

mlr put '
star_re = "\*";
if ($b > 0) {
    $a = gsub($a, star_re, "STAR");
} else {
    $c = gsub($c, star_re, "STAR");
}
' <<'EOF'
a=*,b=0,c=*
a=*,b=1,c=*
EOF

@johnkerl
Owner

johnkerl commented Aug 15, 2021

@torbiak no ... the 'implicit r' would only apply for string literals in that position.

To make this work with the star_re = "\*" example we'd need an 'explicit r' which would be easy enough to add to the DSL grammar.
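To illustrate why a variable misses out, here is a toy Python model of a relabelling pass. It is not the Go code in leaves.go: the `Node` class, the node kind names, and the `REGEX_ARG_INDEX` table are all hypothetical. The pass only ever inspects the node sitting in regex position, so a string literal there is relabelled while a variable reference is left alone, and the literal assigned to the variable elsewhere is never visited as a regex.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str            # e.g. "call", "string_literal", "regex_literal", "local_variable", "field"
    text: str
    children: tuple = ()


# Hypothetical table: index of the regex argument for each function.
REGEX_ARG_INDEX = {"sub": 1, "gsub": 1, "regextract": 1, "regextract_or_else": 1}


def relabel(node: Node) -> Node:
    """Relabel string literals that sit in regex position; leave everything else."""
    if node.kind == "call" and node.text in REGEX_ARG_INDEX:
        arg = node.children[REGEX_ARG_INDEX[node.text]]
        if arg.kind == "string_literal":
            arg.kind = "regex_literal"
    for child in node.children:
        relabel(child)
    return node


literal_call = relabel(Node("call", "gsub", (
    Node("field", "$a"),
    Node("string_literal", r"\*"),
    Node("string_literal", "STAR"),
)))
variable_call = relabel(Node("call", "gsub", (
    Node("field", "$a"),
    Node("local_variable", "star_re"),
    Node("string_literal", "STAR"),
)))

print(literal_call.children[1].kind, variable_call.children[1].kind)
```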

@johnkerl
Owner

johnkerl commented Aug 15, 2021

@torbiak another option would be to abandon the notion of 'implicit r' entirely, and only have 'explicit r' ... I'm not sure which approach follows the principle of least surprise ...

Option 1 -- implicit and explicit r:

$y = gsub($x, "\*", "star");  # matches the regex
$y = gsub($x, r"\*", "star"); # matches the regex
rstar = r"\*";
$y = gsub($x, rstar, "star"); # matches the regex

Option 2 -- explicit r only:

$y = gsub($x, "\*", "star");  # does not match the regex
$y = gsub($x, r"\*", "star"); # matches the regex
rstar = r"\*";
$y = gsub($x, rstar, "star"); # matches the regex

@johnkerl
Owner

I prefer option 1 personally ...

@torbiak
Contributor Author

torbiak commented Aug 17, 2021

Python's had a big influence on me, so I'm inclined towards explicit-only, but I'm having trouble writing a good argument against having implicit r-strings, too. I can't think of a case where you'd want the usual string literal behaviour for a regex. Having both feels consistent with the kind of tool that miller is and what I've read about your goals for it.

My biggest worry about implicit r-strings is that for someone who hasn't read about r-strings in the docs, it could be confusing that variable and literal arguments to the regex functions behave differently, and that this will make it harder for them to build an accurate mental model of how string literals and regexes work in miller.

@johnkerl
Owner

Good news is there's little backward behavior to protect so explicit-r would break almost nobody ...

... bad news is the one thing that does work is \. and explicit-r would break that ...

... other than that I am with you on the above.

@johnkerl
Owner

johnkerl commented Jan 9, 2022

This issue is resolved in Miller 6 except for the new feature-add of r-strings as discussed above -- probably a good candidate for 6.1.

@johnkerl added the on deck label and removed the go-port label (Things which will be addressed in the Go port AKA Miller 6) on Jan 17, 2022
@johnkerl removed the on deck label on Mar 6, 2023

4 participants