clarify newline behavior of regex #35751
cc @Tisit
docs/standard/base-types/backtracking-in-regular-expressions.md
Is anyone available to review?
Note that `\Z` matches `\n` but does not match `\r\n` (the CR/LF character combination). To match CR/LF, include `\r?\Z` in the regular expression pattern.
Note that `\Z` is satisfied by `\n` but is not satisfied by `\r\n` (the CR/LF character combination). To treat CR/LF as if it were `\n`, include `\r?\Z` in the regular expression pattern. Note that this will make the `\r` part of the match - if you are accessing the match, and do not want `\r` to be part of it, you may replace `\r?\Z` with a positive lookahead assert, `(?=\r\n\z|\z|(?<!\r)(?=\n\z))`. In this expression, `\z` asserts the end of the input string.
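The effect of `\r?\Z` on the matched text can be sketched as follows. This is a minimal illustration in Java rather than .NET: it assumes that Java's `Pattern.UNIX_LINES` flag, which limits `\Z` to recognizing only `\n`, is a fair stand-in for the behavior described above. The class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CrLfBeforeEnd {
    // Returns the text of the first match, or null if there is none.
    // UNIX_LINES restricts Java's \Z to "\n" only (assumed stand-in for the .NET behavior).
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.UNIX_LINES).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Makes control characters visible when printing.
    static String show(String s) {
        return s == null ? "no match" : s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        System.out.println(show(firstMatch("abc\\Z", "abc\n")));        // abc
        System.out.println(show(firstMatch("abc\\Z", "abc\r\n")));      // no match
        System.out.println(show(firstMatch("abc\\r?\\Z", "abc\r\n")));  // abc\r
    }
}
```

The last line shows the caveat from the text: once `\r?` is added, the `\r` becomes part of the match.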
Negative lookbehind inside of a positive lookahead is a really complicated construct. Is it necessary? If the first branch of the alternation is `\r\n\z`, why can't the later branch just be `\n\z`, or why not just have the `\r` be optional inside of the lookahead, e.g. `(?=\r\n\z|\n\z|\z)` or `(?=(?:\r?\n)?\z)` or some such thing?
The assumption is that we want to recognize `\r\n` but not treat it like two lines (no editor would treat that as two line breaks).
For example, given "abc\n" in Singleline mode, the pattern `.\Z` will match at position 2 (before the newline) and position 3 (before the end). Now take "abc\r\n". If we want `\r\n` to be treated like a newline, then we want to match at 2 (before the newline) and 4 (before the end).
`(?=\r\n\z|\z|(?<!\r)(?=\n\z))` does this, but `(?=\r\n\z|\n\z|\z)` and `(?=(?:\r?\n)?\z)` will add a third match at position 3.
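The difference between the three lookaheads can be checked with a short sketch. It is written in Java on the assumption that its lookaround syntax and `\z` behave the same as .NET's for these constructs. `Pattern.DOTALL` plays the role of Singleline mode, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadPositions {
    // Start indices of every match of "." followed by the given end-of-line assertion.
    // DOTALL lets "." match "\r" and "\n", mirroring Singleline mode in the comment above.
    static List<Integer> starts(String assertion, String input) {
        Matcher m = Pattern.compile("." + assertion, Pattern.DOTALL).matcher(input);
        List<Integer> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "abc\r\n";
        System.out.println(starts("(?=\\r\\n\\z|\\z|(?<!\\r)(?=\\n\\z))", input)); // [2, 4]
        System.out.println(starts("(?=\\r\\n\\z|\\n\\z|\\z)", input));             // [2, 3, 4]
        System.out.println(starts("(?=(?:\\r?\\n)?\\z)", input));                  // [2, 3, 4]
    }
}
```

Only the first assertion rejects the position between `\r` and `\n`, because its negative lookbehind refuses an `\n` that is preceded by `\r`.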
Recommending lookarounds is problematic because lookarounds don't work with non-backtracking, never mind their being complicated. It'd be better, for example, to recommend other patterns that would match the newlines and then have a capture group for just the piece you want.
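A capture-group version of that idea might look like the sketch below. It is written in Java with `Pattern.UNIX_LINES` as an assumed stand-in for .NET's `\Z`, so it says nothing about NonBacktracking itself. The pattern `(\w+)\r?\Z` and the helper name are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureBeforeLineEnd {
    // Consumes an optional "\r" before \Z without any lookaround,
    // and exposes only group 1, which never contains the "\r".
    static String lastWord(String input) {
        Matcher m = Pattern.compile("(\\w+)\\r?\\Z", Pattern.UNIX_LINES).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("abc\r\n")); // abc
        System.out.println(lastWord("abc\n"));   // abc
        System.out.println(lastWord("abc"));     // abc
    }
}
```

The overall match still includes the `\r`, but callers that read the group never see it.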
That would be a more extended discussion. I think there is value in including this subexpression for reference and for those that need it. Mind if I leave as is for now? I feel like what we have here is a net improvement and I can do more to the docs another day.
Adding support for AnyNewLine is the most useful thing we can do here.
If you recommend such patterns then you need to call out that they will prevent NonBacktracking from working / will fail with NonBacktracking.
> Amongst engines that do support backtracking it seems to be fairly common to support `\r\n` as an atomic unit

Can you elaborate? What is the syntax someone uses with PCRE, for example, to achieve the "treat `\n` or `\r\n` as a newline" semantic? Is that the `\R`?
Does PCRE support that with its non-backtracking functions?
We could support it as part of matching, but without changing the matching logic I don't think we can do it as written just during parsing.
\R consumes a "generic line break". In PCRE what \R considers that to be can be adjusted. Similarly, what $ and \Z consider a line break can be separately adjusted, at compile time or within the pattern.
https://www.pcre.org/current/doc/html/pcre2pattern.html#newlines https://www.pcre.org/current/doc/html/pcre2pattern.html#newlineseq
https://www.pcre.org/current/doc/html/pcre2api.html#newlines
E.g. `\R` might be equivalent to `(?>\r\n|\n|\x0b|\f|\r|\x85)`. And you might have `(*ANYCRLF)` in your pattern to mean that you want to recognize any of CR, LF, or CRLF.
The PCRE documentation makes clear that if CRLF is recognized as a newline, then it will be treated atomically:
> When the newline convention (see "Newline conventions" below) recognizes the two-character sequence CRLF as a newline, this is preferred, even if the single characters CR and LF are also recognized as newlines. For example, if the newline convention is "any", a multiline mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after CR, even though CR on its own is a valid newline. (It also matches at the very start of the string, of course.)
In Perl, `\R` matches a generic newline; it's not clear whether you can change what `$` recognizes.
In Java, `$`, `\Z`, and `.` as appropriate recognize a variety of line terminators, including the atomic `\r\n`. But there's an option, `UNIX_LINES`, to make them only recognize `\n`. In other words, they have AnyNewLine as proposed, but on by default.
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html (see Line terminators)
`\R` in Java seems to always match a variety, like Perl:
`\R` Any Unicode linebreak sequence, is equivalent to `\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]`
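The Java behavior described here can be observed with a small sketch using only `java.util.regex`. The class and method names are hypothetical; the flags and escapes are the documented ones.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JavaLineTerminators {
    // True if "c\Z" is found in the input under the given flags.
    static boolean endsAfterC(int flags, String input) {
        return Pattern.compile("c\\Z", flags).matcher(input).find();
    }

    // Number of \R matches; "\r\n" is expected to count as one line break.
    static int countLineBreaks(String input) {
        Matcher m = Pattern.compile("\\R").matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(endsAfterC(0, "abc\r\n"));                  // true
        System.out.println(endsAfterC(Pattern.UNIX_LINES, "abc\r\n")); // false
        System.out.println(countLineBreaks("a\r\nb"));                 // 1
    }
}
```

By default `\Z` treats the trailing `\r\n` as a single terminator, and with `UNIX_LINES` it no longer does, which matches the "on by default" observation above.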
Right. So a straightforward fragment is essentially the one where unbounded loops do not occur, or perhaps even stricter: no loops at all (either bounded or unbounded). One other issue is that in general NFA mode might still be needed during lookarounds, but that should be orthogonal to the rest and happen automatically. My understanding is also that all captures are disabled during lookarounds and not stored.
Co-authored-by: Stephen Toub <[email protected]>
@stephentoub I removed the lookahead here so we can merge the rest - does it look OK?
Fixes #35498
Attempt to clarify behavior of `$` in regex, to address confusion in dotnet/runtime#87368.