Strange behavior with multiline Regex #87368

Tisit · 2023-06-10T16:45:55Z

Description

Let's say I have a multiline string as follows:

Yoko
GO
Ono

I want to split this string at the GO line. For some reason the regex I am using matches only GO\r, and the remaining \n ends up attached to the second part, giving \nOno. So I get a lone newline at the beginning of that string. This is unexpected, because according to the documentation I found, \r$ should be the right combination to match both characters. Of course I can work around this by using \r\n. I still wanted to clarify whether I'm misinterpreting something.

Note: There is at least one other variant of this regex that behaves the same way: (?m)^GO\s$.
This variant fails to split the string at all, which I find surprising too: (?m)^GO$

Reproduction Steps

Run the following program and observe the hexadecimal representation of the Ono string. It has 0A (\n) attached at the beginning.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string regexPattern = "(?m)^GO\r$";

        string Content = @"
Yoko
GO
Ono
";
        string[] pieces = Regex.Split(Content, regexPattern);

        string MatchValue = Regex.Match(Content, regexPattern).Value;
        byte[] bytes = System.Text.Encoding.UTF8.GetBytes(MatchValue);
        string Hex = BitConverter.ToString(bytes).Replace("-", " ");

        Console.WriteLine(MatchValue);
        Console.WriteLine(Hex);

        foreach (string piece in pieces)
        {
            bytes = System.Text.Encoding.UTF8.GetBytes(piece);
            Hex = BitConverter.ToString(bytes).Replace("-", " ");

            Console.WriteLine(piece);
            Console.WriteLine(Hex);
        }
    }
}

Expected behavior

No newline character at the beginning of split string

Actual behavior

Output of the attached program. Notice the additional printed newline.

GO
47 4F 0D

Yoko

0D 0A 59 6F 6B 6F 0D 0A

Ono

0A 4F 6E 6F 0D 0A

Regression?

No response

Known Workarounds

No response

Configuration

.NET 7.0
Windows 10 Professional

Other information

No response

ghost · 2023-06-10T16:46:04Z

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Author:	Tisit
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	-

danmoseley · 2023-06-10T17:01:23Z

related
#66845
#25598

danmoseley · 2023-06-11T18:19:18Z

Three points:

1. Split returns each string fragment that appears BETWEEN the matches.
2. ^ and $ (multiline or not) are ANCHORS, i.e., conditions or predicates. They do not form part of the match.
3. The engine currently has no special treatment for \r (or \r\n) at all. (I guess you could say it was written for Unix-style text!) It only has special treatment for \n, in that \n will satisfy $ in multiline mode and will not match . unless in singleline mode.

Combining those three facts, in your case ^GO\r$ (in multiline mode) matches GO\r so long as it comes directly after the start of the string or a newline (\n), and directly before the end of the string or a newline (\n).

Then Split returns the pieces between those matches of GO\r, which in your case are \r\nYoko\r\n and \nOno\r\n.

So Split is working as specified.
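To make the three points above concrete, here is a minimal sketch. The class and member names (AnchorDemo, Show, Pattern, Content) are made up for illustration, and the input spells out its \r\n line endings explicitly, which assumes the original source file was saved with CRLF.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Makes CR and LF visible when printed.
    public static string Show(string s) =>
        s.Replace("\r", "\\r").Replace("\n", "\\n");

    // Same pattern as in the issue, written as a verbatim string.
    public const string Pattern = @"(?m)^GO\r$";

    // Line endings spelled out, so the input does not depend on how the file is saved.
    public const string Content = "\r\nYoko\r\nGO\r\nOno\r\n";

    public static void Main()
    {
        // $ holds because the next character is \n, but $ itself consumes nothing.
        Console.WriteLine(Show(Regex.Match(Content, Pattern).Value)); // GO\r

        // Split returns what lies between the matches, so the \n stays with "Ono".
        foreach (string piece in Regex.Split(Content, Pattern))
            Console.WriteLine(Show(piece)); // \r\nYoko\r\n, then \nOno\r\n

        // Without \r in the pattern, "GO" is followed by \r, which does not satisfy $.
        Console.WriteLine(Regex.IsMatch(Content, @"(?m)^GO$")); // False
    }
}
```

The last line is also why the (?m)^GO$ variant from the issue does not split CRLF input at all.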

The fact that Regex is ignorant of \r can be troublesome when line endings might be \r\n, as they often are on Windows. As you pointed out, the docs say:

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

This is perhaps a little misleading, because when it says match, what it really means is "find a match at" rather than actually "match" because $ does not match anything -- as an anchor, it's a condition, or predicate for the match. So even if you had no \r's at all, you might not be getting what you wanted because your Ono would still have a leading \n.
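A small sketch of that last point, using Unix-style input with no \r anywhere. The names LfOnlyDemo and SplitAtGo are hypothetical helpers, not anything from the docs.

```csharp
using System;
using System.Text.RegularExpressions;

static class LfOnlyDemo
{
    // Splits at lines that consist of exactly "GO", using plain multiline anchors.
    public static string[] SplitAtGo(string content) =>
        Regex.Split(content, @"(?m)^GO$");

    public static void Main()
    {
        string[] pieces = SplitAtGo("Yoko\nGO\nOno\n");

        // The match is just "GO"; the \n on either side belongs to the neighboring pieces.
        Console.WriteLine(pieces[0] == "Yoko\n");  // True
        Console.WriteLine(pieces[1] == "\nOno\n"); // True
    }
}
```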

If you're now asking what the correct way is to do your split while being tolerant of lines ending with \r\n instead of \n, I would suggest ignoring multiline mode, i.e., something like this (adjust depending on whether you want the newlines in the splits or not):

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEUCuA7AHwAEAmABgFgAoUgRmqLIAIjaA6AFRgQzYCUYAcxwAbAIZQAoggAOsAM7yAlhDzyA3NQYkWtAOzUA3tSamTppudOskLFEwCyYpXgAUASitNjVC393MsILcAApiGBgwUHhMALxMAEQAOlAA/El4AOIA8inpeAmaVF5+rMwAwqqReBhxiSkZAJoQANYQDVm5UBnZeO2dHQAyMHh9BUX+FiUWThhgABaVIiIwYBgqMQC24Qsw8nUCwQhss7vyrpU1IxhoTEGh4ZHR7hOTAQDaALp3e6IY+/E8DAAO4fbZzeZ7NiVfC1ADUTFon1ekxctQgADMMfIYLV4mQUf40UwlHUCdNTBjoDAxAsmK5TvMmOC6S5mTtIfJPL43j43v4FH95O8lN94pdqrwAMo4YBlVyY7G424s+ZsACSeAAJtwmHAmIqcRgXhT/IbcXVVRrtbqEVbhnhBBh5oT+Uo4XDXRYAL6m02CkT/EVipgS65sGVy2hkBVYo23MM1NgOp1M/Xm41esw8yZU2C0+auMpMfZsgP/bn8vn8iysACcrgAJAl3oZ5PwYDJxGAYK5klAErdkikEu4O13ab3kgUh0kMqPvZ9R1n/PWmy3DAAhJQYS4ANyiT04EClGCgLkERfYXB4bEkeEgWovbAAqhwAGIADjYmVxm4AnpE5xcmOAgTj2fZwIOiSJO4i4JEwCLTsupq+jmTBoWhQA
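For readers who cannot open the link, one way to split without multiline mode is sketched below. This is an assumption about the general approach, not a copy of the linked code; the pattern and the names ExplicitNewlineSplit and SplitAtGo are illustrative. It consumes the line terminator after GO, so the following piece does not start with a newline.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExplicitNewlineSplit
{
    // (?<=\A|\n)     "GO" must start the input or directly follow a \n.
    // \r?(?:\n|\z)   consume the line terminator (CRLF or LF), or stop at end of input.
    public const string Pattern = @"(?<=\A|\n)GO\r?(?:\n|\z)";

    public static string[] SplitAtGo(string content) =>
        Regex.Split(content, Pattern);

    public static void Main()
    {
        string[] crlf = SplitAtGo("\r\nYoko\r\nGO\r\nOno\r\n");
        Console.WriteLine(crlf[1] == "Ono\r\n"); // True: no leading newline

        string[] lf = SplitAtGo("Yoko\nGO\nOno\n");
        Console.WriteLine(lf[1] == "Ono\n");     // True
    }
}
```

The piece before GO keeps its own trailing line ending in this version; move the newline handling to the front of the pattern if you want the opposite.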

danmoseley · 2023-06-11T18:23:21Z

When someone implements #25598 you'll be able to use Split in the way you were expecting it to work.

danmoseley · 2023-06-11T18:46:36Z

As another option, just replace $ (which is equivalent to (?=\n|\z) in multiline mode) with (?=\r\n|\z|(?<!\r)(?=\n)), i.e.:

string regexPattern = @"(?m)^GO(?=\r\n|\z|(?<!\r)(?=\n))";

Again, not sure whether you want the newlines in the matches or not so you'll need to adjust that.
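Here is a rough sketch of what that pattern does on both kinds of line ending. It is written as a verbatim string so that \z reaches the regex engine instead of being read as a C# escape sequence. LookaheadDemo and SplitAtGo are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // The lookahead accepts CRLF, end of input, or a \n not preceded by \r.
    // It consumes nothing, so the match is always just "GO".
    public const string Pattern = @"(?m)^GO(?=\r\n|\z|(?<!\r)(?=\n))";

    public static string[] SplitAtGo(string content) =>
        Regex.Split(content, Pattern);

    public static void Main()
    {
        string[] crlf = SplitAtGo("\r\nYoko\r\nGO\r\nOno\r\n");
        Console.WriteLine(crlf[1] == "\r\nOno\r\n"); // True: the whole CRLF stays together

        string[] lf = SplitAtGo("Yoko\nGO\nOno\n");
        Console.WriteLine(lf[1] == "\nOno\n");       // True
    }
}
```

The line ending is no longer torn in half, but it still ends up in the following piece, since nothing in the pattern consumes it.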

Yes, it should be easier -- this is tracked by #25598

Almost nobody would bother with the above -- I expect they'd just trim the results.
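The trimming route could look roughly like this. It is a sketch under the assumption that you never want leading or trailing line breaks in the pieces; TrimDemo and SplitAndTrim are illustrative names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TrimDemo
{
    // \r? lets the same pattern match the GO line in CRLF and LF input.
    // Trim then strips whatever \r and \n characters are left at the piece edges.
    public static string[] SplitAndTrim(string content) =>
        Regex.Split(content, @"(?m)^GO\r?$")
             .Select(piece => piece.Trim('\r', '\n'))
             .ToArray();

    public static void Main()
    {
        foreach (string piece in SplitAndTrim("\r\nYoko\r\nGO\r\nOno\r\n"))
            Console.WriteLine(piece); // Yoko, then Ono
    }
}
```

Note that Trim also removes blank lines at the start or end of a piece, which matters if those are significant in your data.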

dotnet-issue-labeler bot added the area-System.Text.RegularExpressions label Jun 10, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Jun 10, 2023

danmoseley closed this as completed Jun 11, 2023

ghost removed the untriaged New issue has not been triaged by the area owner label Jun 11, 2023

danmoseley mentioned this issue Jun 11, 2023

clarify newline behavior of regex dotnet/docs#35751

Merged

ghost locked as resolved and limited conversation to collaborators Jul 12, 2023
