clarify newline behavior of regex #35751

danmoseley · 2023-06-11T20:00:50Z

Attempt to clarify behavior of $ in regex, to address confusion in dotnet/runtime#87368.

Internal previews

📄 File	🔗 Preview link
docs/standard/base-types/anchors-in-regular-expressions.md	Anchors in Regular Expressions
docs/standard/base-types/backtracking-in-regular-expressions.md	Backtracking in Regular Expressions
docs/standard/base-types/character-classes-in-regular-expressions.md	Character Classes in .NET Regular Expressions
docs/standard/base-types/regular-expression-options.md	Regular expression options
docs/standard/base-types/regular-expression-source-generators.md	docs/standard/base-types/regular-expression-source-generators

docs/standard/base-types/regular-expression-options.md

danmoseley · 2023-06-11T20:47:22Z

cc @Tisit

docs/standard/base-types/backtracking-in-regular-expressions.md

danmoseley · 2023-06-14T03:50:22Z

Is anyone available to review..

docs/standard/base-types/anchors-in-regular-expressions.md

stephentoub · 2023-06-14T12:53:24Z

docs/standard/base-types/anchors-in-regular-expressions.md


- Note that `\Z` matches `\n` but does not match `\r\n` (the CR/LF character combination). To match CR/LF, include `\r?\Z` in the regular expression pattern.  
+ Note that `\Z` is satisfied by `\n` but is not satisfied by `\r\n` (the CR/LF character combination). To treat CR/LF as if it were `\n`, include `\r?\Z` in the regular expression pattern. Note that this will make the `\r` part of the match - if you are accessing the match, and do not want `\r` to be part of it, you may replace `\r?\Z` with a positive lookahead assert, `(?=\r\n\z|\z|(?<!\r)(?=\n\z))`. In this expression, `\z` asserts the end of the input string.


Negative lookbehind inside of a positive lookahead is a really complicated construct. Is it necessary? If the first branch of the alternation is \r\n\z, why can't the later branch just be \n\z, or why not just have the \r be optional inside of the lookahead, e.g. (?=\r\n\z|\n\z|\z) or (?=(?:\r?\n)?\z) or some such thing?

The assumption is that we want to recognize \r\n but not treat it like two lines (no editor would treat that as two line breaks)

For example given "abc\n" in SingleLine mode the pattern ".\Z" will match at position 2 (before the newline) and position 3 (before the end). Now take "abc\r\n". If we want \r\n to be treated like a newline, then we want to match at 2 (before newline) and 4 (before end)

(?=\r\n\z|\z|(?<!\r)(?=\n\z)) does this but (?=\r\n\z|\n\z|\z) and (?=(?:\r?\n)?\z) will add a third match at position 3.

Recommending lookarounds is problematic because lookarounds don't work with non-backtracking, nevermind their being complicated. It'd be better, for example, to recommend other patterns that would match the newlines and then have a capture group for just the piece you want.

That would be a more extended discussion. I think there is value in including this subexpression for reference and for those that need it. Mind if I leave as is for now? I feel like what we have here is a net improvement and I can do more to the docs another day.

Adding support for AnyNewLine is the most useful thing we can do here.

If you recommend such patterns then you need to call out that they will prevent NonBacktracking from working / will fail with NonBacktracking.

Amongst engines that do support backtracking it seems to be fairly common to support \r\n as an atomic unit

Can you elaborate? What is the syntax someone uses with PCRE for example to achieve the "treat \n or \r\n as a newline" semantic? Is that the \R?

Does PCRE support that with its non-backtracking functions?

We could support it as part of matching, but without changing the matching logic I don't think we can do it as written just during parsing.

\R consumes a "generic line break". In PCRE what \R considers that to be can be adjusted. Similarly, what $ and \Z consider a line break can be separately adjusted, at compile time or within the pattern.

https://www.pcre.org/current/doc/html/pcre2pattern.html#newlines https://www.pcre.org/current/doc/html/pcre2pattern.html#newlineseq
https://www.pcre.org/current/doc/html/pcre2api.html#newlines

eg \R might be equivalent to (?>\r\n|\n|\x0b|\f|\r|\x85). And you might have (*ANYCRLF) in your pattern to mean that you want to recognize any of CR, LF, or CRLF.

It makes clear that if CRLF is recognized as a newline, then it will be treated atomically:

When the newline convention (see "Newline conventions" below) recognizes the two-character sequence CRLF as a newline, this is preferred, even if the single characters CR and LF are also recognized as newlines. For example, if the newline convention is "any", a multiline mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after CR, even though CR on its own is a valid newline. (It also matches at the very start of the string, of course.)

In Perl, \R matches generic newline: it's not clear whether you can change what $ recognizes.

In Java, $ and \Z and . as appropriate recognize a variety of line terminators including the atomic \r\n. But there's an option UNIX_LINES to make them only recognize \n. In other words, they have AnyNewLine as proposed, but on by default.
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html (see Line terminators)

\R in Java seems to always be matching a variety, like Perl.

\R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

@veanes / @olsaarik, I assume for the lookarounds support we discussed, we could by default only permit lookarounds with a max length limited by the pattern rather than by the input, eg no unbounded loops. That would enable the patterns discussed here.

Right. So a straightforward fragment is essentially the one where unbounded loops do not occur, perhaps even stricter -- no loops (either bounded or nonbounded). One other issue is that in general NFA mode might still be needed during lookaraounds, but that should be orthogonal to the rest and happen automatically. My understanding is also that all captures are disabled during lookarounds and not stored.

docs/standard/base-types/backtracking-in-regular-expressions.md

docs/standard/base-types/regular-expression-options.md

Co-authored-by: Stephen Toub <[email protected]>

danmoseley · 2023-06-24T21:59:53Z

@stephentoub I removed the lookahead here so we can merge the rest - does it look OK?

clarify newline

b4907f9

danmoseley requested a review from stephentoub June 11, 2023 20:00

danmoseley requested a review from adegeo as a code owner June 11, 2023 20:00

dotnet-bot added this to the June 2023 milestone Jun 11, 2023

dotnet-bot added the dotnet-fundamentals/svc label Jun 11, 2023

danmoseley added 3 commits June 11, 2023 15:20

adjust

5850452

more

406e9a5

simplify

7ef7e4b

danmoseley commented Jun 11, 2023

View reviewed changes

docs/standard/base-types/regular-expression-options.md Outdated Show resolved Hide resolved

add \Z

19ef874

danmoseley added 2 commits June 11, 2023 15:50

more

a6fef65

nit

3de70cf

danmoseley requested a review from a team June 11, 2023 21:01

danmoseley added 11 commits June 11, 2023 16:04

error

de289fd

advert

5090668

note about $

105af17

nit

f5bb159

nits

c167ed5

Advertise non backtrack

dc6eb09

recommend sg

230d4e0

fix existing error

8f51665

update backtracking

a16c17e

perf tipcs

e5afe6b

consistent

3c56a1e

danmoseley commented Jun 11, 2023

View reviewed changes

docs/standard/base-types/backtracking-in-regular-expressions.md Outdated Show resolved Hide resolved

danmoseley added 2 commits June 11, 2023 18:06

broken link

4f523b3

include

627e382

stephentoub reviewed Jun 14, 2023

View reviewed changes

danmoseley and others added 7 commits June 17, 2023 15:38

Apply suggestions from code review

fef7a92

Co-authored-by: Stephen Toub <[email protected]>

feedback

3c19085

nit

06722db

nit

6870c59

note

011d6e5

Update docs/standard/base-types/backtracking-in-regular-expressions.md

afe53d9

Co-authored-by: Stephen Toub <[email protected]>

remove lookahead

67cb4f9

fix link

54987e1

stephentoub approved these changes Jun 25, 2023

View reviewed changes

danmoseley enabled auto-merge (squash) June 25, 2023 01:55

danmoseley merged commit b1a1a20 into dotnet:main Jun 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clarify newline behavior of regex #35751

clarify newline behavior of regex #35751

danmoseley commented Jun 11, 2023 •

edited by github-actions bot

Loading

danmoseley commented Jun 11, 2023

danmoseley commented Jun 14, 2023

stephentoub Jun 14, 2023

danmoseley Jun 17, 2023 •

edited

Loading

stephentoub Jun 17, 2023 •

edited

Loading

danmoseley Jun 17, 2023

stephentoub Jun 17, 2023

stephentoub Jun 17, 2023 •

edited

Loading

danmoseley Jun 17, 2023 •

edited

Loading

danmoseley Jun 17, 2023 •

edited

Loading

stephentoub Jun 17, 2023

veanes Jun 24, 2023

danmoseley commented Jun 24, 2023


		Note that `\Z` matches `\n` but does not match `\r\n` (the CR/LF character combination). To match CR/LF, include `\r?\Z` in the regular expression pattern.
		Note that `\Z` is satisfied by `\n` but is not satisfied by `\r\n` (the CR/LF character combination). To treat CR/LF as if it were `\n`, include `\r?\Z` in the regular expression pattern. Note that this will make the `\r` part of the match - if you are accessing the match, and do not want `\r` to be part of it, you may replace `\r?\Z` with a positive lookahead assert, `(?=\r\n\z\|\z\|(?<!\r)(?=\n\z))`. In this expression, `\z` asserts the end of the input string.

clarify newline behavior of regex #35751

clarify newline behavior of regex #35751

Conversation

danmoseley commented Jun 11, 2023 • edited by github-actions bot Loading

Internal previews

danmoseley commented Jun 11, 2023

danmoseley commented Jun 14, 2023

stephentoub Jun 14, 2023

Choose a reason for hiding this comment

danmoseley Jun 17, 2023 • edited Loading

Choose a reason for hiding this comment

stephentoub Jun 17, 2023 • edited Loading

Choose a reason for hiding this comment

danmoseley Jun 17, 2023

Choose a reason for hiding this comment

stephentoub Jun 17, 2023

Choose a reason for hiding this comment

stephentoub Jun 17, 2023 • edited Loading

Choose a reason for hiding this comment

danmoseley Jun 17, 2023 • edited Loading

Choose a reason for hiding this comment

danmoseley Jun 17, 2023 • edited Loading

Choose a reason for hiding this comment

stephentoub Jun 17, 2023

Choose a reason for hiding this comment

veanes Jun 24, 2023

Choose a reason for hiding this comment

danmoseley commented Jun 24, 2023

danmoseley commented Jun 11, 2023 •

edited by github-actions bot

Loading

danmoseley Jun 17, 2023 •

edited

Loading

stephentoub Jun 17, 2023 •

edited

Loading

stephentoub Jun 17, 2023 •

edited

Loading

danmoseley Jun 17, 2023 •

edited

Loading

danmoseley Jun 17, 2023 •

edited

Loading