Strange behavior with multiline Regex #87368

Closed
Tisit opened this issue Jun 10, 2023 · 5 comments
Comments

@Tisit
Tisit commented Jun 10, 2023

Description

Let's say I have a multiline string as follows:

Yoko
GO
Ono

I want to split this string by the GO line. For some reason the Regex I am using matches only GO\r, and the remaining \n ends up attached to the second part: \nOno. So I get a lone newline at the beginning of that string. This is unexpected because, according to the documentation I found, \r$ should be the right combination to match both characters. Of course I can work around this by using \r\n. Still, I wanted to clarify whether I'm misinterpreting something.

Note: There is at least one other variant of this regex that behaves the same way: (?m)^GO\s$.
This variant fails to split the string at all, which I find surprising too: (?m)^GO$

Reproduction Steps

Run the following program and observe the hexadecimal representation of the Ono string. It has 0A (\n) attached at the beginning.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
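        // "\r" is a C# escape here, so the pattern contains a literal carriage return.
        string regexPattern = "(?m)^GO\r$";

        // The verbatim string takes its line endings from the source file (CRLF in this repro).
        string Content = @"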
Yoko
GO
Ono
";
        string[] pieces = Regex.Split(Content, regexPattern);

        string MatchValue = Regex.Match(Content, regexPattern).Value;
        byte[] bytes = System.Text.Encoding.UTF8.GetBytes(MatchValue);
        string Hex = BitConverter.ToString(bytes).Replace("-", " ");

        Console.WriteLine(MatchValue);
        Console.WriteLine(Hex);

        foreach (string piece in pieces)
        {
            bytes = System.Text.Encoding.UTF8.GetBytes(piece);
            Hex = BitConverter.ToString(bytes).Replace("-", " ");

            Console.WriteLine(piece);
            Console.WriteLine(Hex);
        }
    }
}

Expected behavior

No newline character at the beginning of split string

Actual behavior

Output of the attached program. Notice the additional printed newline.

GO
47 4F 0D

Yoko

0D 0A 59 6F 6B 6F 0D 0A

Ono

0A 4F 6E 6F 0D 0A

Regression?

No response

Known Workarounds

No response

Configuration

.NET 7.0
Windows 10 Professional

Other information

No response

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 10, 2023
@ghost
ghost commented Jun 10, 2023

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: Tisit
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@danmoseley
Member

related
#66845
#25598

@danmoseley
Member

danmoseley commented Jun 11, 2023

Three points

  1. The way Split works is it returns each string fragment that appears BETWEEN the matches.
  2. ^ and $ (multiline or not) are ANCHORS, i.e., conditions or predicates. They do not form part of the match.
  3. The engine currently does not have any special treatment for \r (or \r\n) at all. (I guess you could say it was written for Unix style text!). It only has special treatment for \n, in that \n will satisfy $ in multiline mode and will not match . unless in singleline mode.
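Points 2 and 3 can be checked with a small sketch. The class and method names here (AnchorDemo, IsWholeLine, MatchLength) are made up for illustration and the input strings are my own; only Regex.IsMatch, Regex.Match and RegexOptions.Multiline are real API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // True when "GO" is found as a whole line under RegexOptions.Multiline.
    public static bool IsWholeLine(string input) =>
        Regex.IsMatch(input, @"^GO$", RegexOptions.Multiline);

    // Number of characters consumed by ^GO$; the anchors contribute nothing.
    public static int MatchLength(string input) =>
        Regex.Match(input, @"^GO$", RegexOptions.Multiline).Length;

    public static void Main()
    {
        // LF line endings: $ is satisfied right before \n.
        Console.WriteLine(IsWholeLine("Yoko\nGO\nOno"));      // True
        // CRLF line endings: \r sits between "GO" and \n, so $ is not satisfied.
        Console.WriteLine(IsWholeLine("Yoko\r\nGO\r\nOno"));  // False
        // Anchors are zero-width: only "GO" is part of the match.
        Console.WriteLine(MatchLength("Yoko\nGO\nOno"));      // 2
        // \r gets no special treatment: . matches it, but not \n.
        Console.WriteLine(Regex.IsMatch("a\rb", "a.b"));      // True
        Console.WriteLine(Regex.IsMatch("a\nb", "a.b"));      // False
    }
}
```

The second call is why (?m)^GO$ does not split CRLF text at all.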

Combining those three facts, in your case ^GO\r$ (in multiline mode) is matching GO\r so long as it comes directly after the start of the string or a newline (\n) and so long as it comes directly before the end of the string or a newline (\n).

Then Split returns the pieces between those matches of GO\r, which in your case are \r\nYoko\r\n and \nOno\r\n.

So Split is working as specified.
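To make the pieces visible, here is a minimal sketch that runs the pattern from the issue and prints each piece with its control characters escaped. SplitDemo, Escape and SplitOnGoLine are hypothetical names, and the content is written with explicit \r\n so the result does not depend on how the source file is saved.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Makes \r and \n visible in console output.
    public static string Escape(string s) =>
        s.Replace("\r", "\\r").Replace("\n", "\\n");

    // Same pattern as in the issue: the C# escape "\r" puts a literal CR in the pattern.
    public static string[] SplitOnGoLine(string content) =>
        Regex.Split(content, "(?m)^GO\r$");

    public static void Main()
    {
        string content = "\r\nYoko\r\nGO\r\nOno\r\n";

        foreach (string piece in SplitOnGoLine(content))
        {
            Console.WriteLine(Escape(piece));
        }
        // Prints:
        // \r\nYoko\r\n
        // \nOno\r\n
    }
}
```

The match consumes GO\r and stops in front of the \n, so that \n is the first character of the next piece.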

The fact that Regex is ignorant of \r can be troublesome when line endings might be \r\n, as they often are on Windows. As you pointed out, the docs say

By default, $ matches only the end of the input string. If you specify the RegexOptions.Multiline option, it matches either the newline character (\n) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression \r?$ instead of just $.

This is perhaps a little misleading, because when it says match, what it really means is "find a match at" rather than actually "match" because $ does not match anything -- as an anchor, it's a condition, or predicate for the match. So even if you had no \r's at all, you might not be getting what you wanted because your Ono would still have a leading \n.
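A rough illustration of that distinction, using the \r?$ subexpression from the docs. DocsPatternDemo and its methods are hypothetical names and the inputs are my own. The pattern finds the GO line for both line-ending styles, yet the \n is never consumed, so it stays in the following piece.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DocsPatternDemo
{
    private const string Pattern = @"(?m)^GO\r?$";

    // Text actually consumed by the match; $ itself consumes nothing.
    public static string MatchedText(string input) =>
        Regex.Match(input, Pattern).Value;

    public static string[] Split(string input) =>
        Regex.Split(input, Pattern);

    public static string Escape(string s) =>
        s.Replace("\r", "\\r").Replace("\n", "\\n");

    public static void Main()
    {
        Console.WriteLine(Escape(MatchedText("Yoko\nGO\nOno")));     // GO
        Console.WriteLine(Escape(MatchedText("Yoko\r\nGO\r\nOno"))); // GO\r

        // Even with LF-only input the second piece keeps a leading \n.
        foreach (string piece in Split("Yoko\nGO\nOno"))
        {
            Console.WriteLine(Escape(piece));
        }
        // Prints:
        // Yoko\n
        // \nOno
    }
}
```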

If you're now asking what the correct way is to do your split while being tolerant of lines ending with \r\n instead of \n, I would suggest ignoring multiline mode, i.e., something like this (adjust depending on whether you want the newlines in the splits or not):

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEUCuA7AHwAEAmABgFgAoUgRmqLIAIjaA6AFRgQzYCUYAcxwAbAIZQAoggAOsAM7yAlhDzyA3NQYkWtAOzUA3tSamTppudOskLFEwCyYpXgAUASitNjVC393MsILcAApiGBgwUHhMALxMAEQAOlAA/El4AOIA8inpeAmaVF5+rMwAwqqReBhxiSkZAJoQANYQDVm5UBnZeO2dHQAyMHh9BUX+FiUWThhgABaVIiIwYBgqMQC24Qsw8nUCwQhss7vyrpU1IxhoTEGh4ZHR7hOTAQDaALp3e6IY+/E8DAAO4fbZzeZ7NiVfC1ADUTFon1ekxctQgADMMfIYLV4mQUf40UwlHUCdNTBjoDAxAsmK5TvMmOC6S5mTtIfJPL43j43v4FH95O8lN94pdqrwAMo4YBlVyY7G424s+ZsACSeAAJtwmHAmIqcRgXhT/IbcXVVRrtbqEVbhnhBBh5oT+Uo4XDXRYAL6m02CkT/EVipgS65sGVy2hkBVYo23MM1NgOp1M/Xm41esw8yZU2C0+auMpMfZsgP/bn8vn8iysACcrgAJAl3oZ5PwYDJxGAYK5klAErdkikEu4O13ab3kgUh0kMqPvZ9R1n/PWmy3DAAhJQYS4ANyiT04EClGCgLkERfYXB4bEkeEgWovbAAqhwAGIADjYmVxm4AnpE5xcmOAgTj2fZwIOiSJO4i4JEwCLTsupq+jmTBoWhQA
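For readers who cannot open the link, here is one possible shape of a split that avoids multiline mode. This is my own sketch, not necessarily what the linked snippet contains; the pattern and the names NonMultilineSplit and SplitOnGoLine are assumptions. It requires GO to start a line via a lookbehind and then consumes the line terminator explicitly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonMultilineSplit
{
    // (?<=\A|\n)   GO must start at the beginning of the input or right after \n
    // \r?(?:\n|\z) consume an optional \r, then \n or the end of the input
    private const string Pattern = @"(?<=\A|\n)GO\r?(?:\n|\z)";

    public static string[] SplitOnGoLine(string content) =>
        Regex.Split(content, Pattern);

    public static string Escape(string s) =>
        s.Replace("\r", "\\r").Replace("\n", "\\n");

    public static void Main()
    {
        foreach (string piece in SplitOnGoLine("\r\nYoko\r\nGO\r\nOno\r\n"))
        {
            Console.WriteLine(Escape(piece));
        }
        // Prints:
        // \r\nYoko\r\n
        // Ono\r\n
    }
}
```

Because the terminator of the GO line is part of the match, the piece that follows no longer starts with a stray \n. The same pattern handles LF-only input.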

@danmoseley
Member

When someone implements #25598 you'll be able to use Split in the way you were expecting it to work.

@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jun 11, 2023
@danmoseley
Member

danmoseley commented Jun 11, 2023

As another option, just replace $ (which is equivalent to (?=\n|\z) in multiline mode) with (?=\r\n|\z|(?<!\r)(?=\n)), i.e.

string regexPattern = @"(?m)^GO(?=\r\n|\z|(?<!\r)(?=\n))";

Again, not sure whether you want the newlines in the matches or not so you'll need to adjust that.
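A sketch of both adjustments, with hypothetical names (LookaroundDemo, KeepLineEndings, ConsumeLineEnding). The first method uses the lookahead as given, written as a verbatim string so that \z reaches the regex engine instead of being read as a C# escape. The second is my own variation that consumes the line terminator instead of only looking at it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Lookahead only: the match is just "GO", so line endings stay in the pieces.
    public static string[] KeepLineEndings(string content) =>
        Regex.Split(content, @"(?m)^GO(?=\r\n|\z|(?<!\r)(?=\n))");

    // Variation (assumption): consume \r\n, \n, or end of input along with "GO".
    public static string[] ConsumeLineEnding(string content) =>
        Regex.Split(content, @"(?m)^GO(?:\r?\n|\z)");

    public static string Escape(string s) =>
        s.Replace("\r", "\\r").Replace("\n", "\\n");

    public static void Main()
    {
        string content = "\r\nYoko\r\nGO\r\nOno\r\n";

        foreach (string piece in KeepLineEndings(content))
        {
            Console.WriteLine(Escape(piece));
        }
        // Prints:
        // \r\nYoko\r\n
        // \r\nOno\r\n

        foreach (string piece in ConsumeLineEnding(content))
        {
            Console.WriteLine(Escape(piece));
        }
        // Prints:
        // \r\nYoko\r\n
        // Ono\r\n
    }
}
```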

Yes, it should be easier -- this is tracked by #25598

Almost nobody would bother with the above -- I expect they'd just trim the results.
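The trimming route might look like the sketch below. TrimDemo and SplitAndTrim are hypothetical names, and the choice to strip \r and \n from both ends of every piece is an assumption about what the caller wants.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TrimDemo
{
    // Split on the GO line, then strip leftover line-ending characters from each piece.
    public static string[] SplitAndTrim(string content) =>
        Regex.Split(content, @"(?m)^GO\r?$")
             .Select(piece => piece.Trim('\r', '\n'))
             .ToArray();

    public static void Main()
    {
        foreach (string piece in SplitAndTrim("\r\nYoko\r\nGO\r\nOno\r\n"))
        {
            Console.WriteLine(piece);
        }
        // Prints:
        // Yoko
        // Ono
    }
}
```

Note that Trim also removes blank lines at the edges of each piece, which may or may not be acceptable.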

@ghost ghost locked as resolved and limited conversation to collaborators Jul 12, 2023