Add in a regex parser so that the IDE can provide services around regex editing. #23984

CyrusNajmabadi · 2017-12-31T06:15:04Z

I was bored flying to MN and DC. So i decided to scratch an itch that came up when discussing things over here: dotnet/csharplang#371

This PR introduces support in the Roslyn IDE for improved features for users working with .Net regexes. Specifically, the intent of this PR is to support the following (entirely optional) features around regexes:

Colorization. Users should be able to write a regex and have it colorized automatically, with a set of colors they can customize if they want.
Highlighting. Users should be able to place their cursor on something like a brace and have the matching brace highlighted. This should extend to complex regex concepts (like backreferences) as well.
Syntax errors. Users should see syntax errors directly in the editor and error list without having to compile or run their program.
Fixes. Errors in the regex should come with potential fixes to address the problem immediately.
Suggestions. The IDE should be able to suggest completely safe refactorings to a users regex to simplify it. For example, suggesting that "(a)(b)" be simplified directly to ab.

However, supporting regexes is complicated by how regexes are supported in the language today. Specifically because of the complexity present in how VB and C# escape values in strings on top of how complex .net regexes are as well. To deal with that, two new components have been added to Roslyn.

The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it:

"\\1"       // In a normal string, we have to escape the \
@"\1"       // But not in a verbatim string
"\\\u0031"  // The '1' could be escaped
"\\u005c1"  // Even the backslash *itself* may be escaped

These are all ways of writing the \1 regex.

In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e.

and

So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically \ and 1) while also maintaining the knowledge of where those characters came from (for example, that 1 came from \u0031 in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this.

With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between @"" and "" strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.

The second service is the actual regex parser that consumes the results of the VirtualCharService. The regex parser is a fork of the real .net regex parser (from https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs) however, changed heavily. The reason for this is that the existing regex parser is ill suited for all the needs an IDE has. Major problems the existing parser has are:

It is not error tolerant. With the first error it encounters it completely bails out. This would mean that the roslyn services would only work if the regex was written completely properly.
it does not produce a usable tree by IDE standards. While it does produce a tree, it is an extremely abstract one that not only does not contain much of the information that would be needed. It is also heavily processed and changed as parsing happens. i.e. the parser will actually merge, split, erase, and even reorder nodes within the tree.
it works with data formats that are intended to work extremely fast and efficiently when actually performing matching of regexes, but which would be very baroque and unpleasant for editing scenarios. For example, it compresses data into a compact binary format, which it then stores in strings themselves that it intersperses with chunks of the original string.
It will actually crash when passed in some strings (i was able to get index-out-of-bounds and even null-refs when using it).
It can use excessive amounts of memory (even OOMing) just when parsing.

However, the existing parser does have certain things going for it. Namely, it is the source of truth in the .net world. It doesn't matter what the docs say on how .net regexes work and are parsed. At the end of they day, this is the parser used, so it must always be deferred to when determining what is a syntactically legal regex and what isn't.

The regex parser this PR adds introduces a system far more in line with how other roslyn parsers work. Namely, the tree it produces is immutable and fully represents all the characters in the original token. The tree also follows roslyn invariants around nodes, tokens and trivia. Ideally, the tree would literally be built on the syntax classes defined at the Microsoft.CodeAnalysis layer. However, they contain internals that make them only implementable at the C# and Vb layers. Because of that, the tree produces by the new regex parser follows the spirit of the Roslyn syntax model even if it can't fit into it perfectly.

This parser has been extensively tested and comes with 750 seed cases that validate it. These are called 'test seeds' as the inputs to the engine are then mutated heavily to produce even more cases to validate the engine with (for example, testing all prefixes and suffixes of every test case to ensure no crashes or infinite loops occur). Validation is very stringent. All cases must parse without any problems, producing trees that abide by all roslyn invariants. Furthermore, all cases validate that the new parser does not produce errors if the native parser does not. Conversely, if the native parser produces an error, the new parser must produce the same error as it does. That means that we do not even define new messages to show the user. Instead, we use the same message (or message fragment) that the native regex parser gives the user. For example:

Because of this validation, it's very easy to feed in new tests into the system and have it validate that the roslyn regex parser is working in line with the native parser.

The test suite currently contains all the existing positive and negative tests from the existing .net regex suite, as well as many edge cases i added while writing the new parser.

--

Here's the current worklist for this feature:

Required:

Nice to have:

Add fixer for regex errors
Add simplifier for complex regexes
Add pattern-help for regex components (i.e. when typing (? something should pop up to show you all the legal forms there are).

I hope to get all this done in the next week or so. I hope that this could serve as a pleasant addition to the C# and Vb editing experience for so many of our devs who do use regexes day to day.

CyrusNajmabadi · 2017-12-31T06:22:27Z

Tagging @kuhlenh @dpoeschl @Pilchie

CyrusNajmabadi · 2017-12-31T06:23:44Z

src/Features/Core/Portable/FeaturesResources.resx

@@ -1334,4 +1334,7 @@ This version used in: {2}</value>
  <data name="indexer_" xml:space="preserve">
    <value>indexer</value>
  </data>
-</root>
+  <data name="Regex_issue_0" xml:space="preserve">
+    <value>Regex issue: {0}</value>


Note: localization seems to have changed a lot. There are a lot of these new .xlf files. They added a lot of bloat to this PR. But i wasn't sure what else to do. Let me know if i need to do something different here.

CyrusNajmabadi · 2017-12-31T06:25:04Z

src/Features/Core/Portable/ValidateRegexString/AbstractValidateRegexStringDiagnosticAnalyzer.cs

+        protected abstract IParameterSymbol DetermineParameter(SemanticModel semanticModel, SyntaxNode argumentNode, CancellationToken cancellationToken);
+
+        protected override void InitializeWorker(AnalysisContext context)
+            => context.RegisterSemanticModelAction(AnalyzeSemanticModel);


I did this as a semantic model analyzer so that we wouldn't need callbacks and processing for every string in the world. Instead, we could process a file at a time, checking the user options once, and then doing a full pass to check each string in it. @sharwell let me know if there's a preference on the best way to design these to be good perf citizens.

CyrusNajmabadi · 2017-12-31T06:26:47Z

Note: this PR may look large. However, 21k of the 29k is purely tests and 3k is just loc. So the actual core logic added is more like 5k.

CyrusNajmabadi · 2017-12-31T06:27:49Z

src/Workspaces/Core/Portable/RegularExpressions/RegexCharClass.cs

+    /// Minimal copy of https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
+    /// Used to accurately determine if something is a WordChar according to the .Net regex engine.
+    /// </summary>
+    internal static class RegexCharClass


this code was copied over near verbatim. All i did was clean up formatting and remove any methods that were not needed at all by the new regex parser.

CyrusNajmabadi · 2017-12-31T06:29:09Z

src/Workspaces/Core/Portable/RegularExpressions/RegexKind.cs

+
+namespace Microsoft.CodeAnalysis.RegularExpressions
+{
+    internal enum RegexKind


Note: should proably be called SyntaxKind. But i figured Roslyn devs would be annoyed if searching for SyntaxKind in something like NavigateTo now showed this as well.

CyrusNajmabadi · 2017-12-31T06:31:06Z

src/Workspaces/Core/Portable/WorkspacesResources.resx

@@ -613,4 +613,94 @@
  <data name="Changing_document_property_is_not_supported" xml:space="preserve">
    <value>Changing document properties is not supported</value>
  </data>
+  <data name="Alternation_conditions_cannot_be_comments" xml:space="preserve">


These values are all copied from corefx. I think the phrasing of many of them is quite poor (and some are downright misleading depending on what the user actually wrote). However, they match the errors the native parser produces, which i think is quite important.

CyrusNajmabadi · 2017-12-31T06:36:37Z

src/Workspaces/Core/Portable/RegularExpressions/RegexNodes.cs

+
+namespace Microsoft.CodeAnalysis.RegularExpressions
+{
+    internal sealed class RegexCompilationUnit : RegexNode


All the nodes we create in the parser. I considered autogenerating. But it wasn't that hard to just hand author. If we get to a point that we want to really flesh these out (for example, adding things like Update methods), then it might make sense to generate.

AdamSpeight2008 · 2017-12-31T07:40:29Z

VirtualCharService maybe useful for other roles like validation of format string.

Tragetaschen · 2018-01-01T15:22:50Z

...or showing in the IDE what character a unicode escape sequence actually is.

CyrusNajmabadi · 2018-01-02T00:02:26Z

@KirillOsenkov I'm trying to get classification working. However, i'm having a problem with adding more classifications for a region already classified as a string literal. Do you know the right MEF EditorFormatDefinition shenangians to make it possible for another type definition to have higher priority and be shown when these spans overlap.

Thanks!

KirillOsenkov · 2018-01-02T04:03:24Z

@CyrusNajmabadi yeah, Microsoft.VisualStudio.Language.Intellisense.PriorityAttribute defined in Microsoft.VisualStudio.Text.Internal.dll

[Priority(1)] // 1 wins over the default 0

CyrusNajmabadi · 2018-01-02T04:38:40Z

Figured it out! you just need to add:

        [Order(After = ClassificationTypeNames.StringLiteral)]
        [Order(After = ClassificationTypeNames.VerbatimStringLiteral)]

to your type definition.

CyrusNajmabadi · 2018-01-02T05:59:51Z

Working regex classification:

CyrusNajmabadi · 2018-01-02T08:26:48Z

@KirillOsenkov Do you know how to create separate light/dark theme values for format definitions. The ones i've picked have some issues in the light theme:

Note: i like these colors mostly in both light/dark. The exception is the plain-text color (which should be black in the light theme) and the | character (which likely needs to be a darker yellow/brown). But i'm not sure how to actually get that.

Thanks!

CyrusNajmabadi · 2018-01-02T08:34:53Z

@dpoeschl Can you help with this question (about light/dark theme colors) as well? Thanks!

jnm2 · 2018-01-02T15:46:42Z

@CyrusNajmabadi You're amazing. 😆

CyrusNajmabadi · 2018-01-02T16:12:48Z

@dotnet/roslyn-ide This is ready for an initial review pass. Main logic is complete (though i will probably do a commenting pass). Tests still need to be created for some of the high level features. However, they will be very light as the main testing happens at the virtual-char and regex-parsing layers. i.e. the classification tests only need to validate the main regex classifications. All the myriad regex permutations don't need to be tested at the classification layer as that has already been tested at the parser layer.

Tragetaschen · 2018-01-02T20:00:53Z

src/Features/Core/Portable/Diagnostics/Analyzers/IDEDiagnosticIds.cs

@@ -63,6 +63,9 @@ internal static class IDEDiagnosticIds

        public const string UseDeconstructionDiagnosticId = "IDE0042";

+        public const string RegexPatternDiagnosticId = "IDE0042";


Thanks. Fixed!

CyrusNajmabadi · 2018-01-03T15:39:34Z

@dotnet/roslyn-ide

There is one more error that the native parser produces that i don't currently produce today. Once i get that i think i will be at parity with it. However, even without that error, the PR is ready for review.

CyrusNajmabadi · 2018-01-04T00:41:38Z

Ok. That last native regex error was very wonky. It took me a while to get the new parser to properly emulate it. Seems highly likely that the existing behavior in the native parser is unintentional. But, hey, what can you do.

At this point i think i'm very close to 100% parity with the native parser. I've debugged through pretty much all its branches, and i've tried to create test cases to hit every flow (including all errors). There are likely still going to be some mistakes, but i think this is close enough to be shippable.

Note: i considered having the IDE code actually run both parsers, and only report errors from mine if the native parser reports errors. However, this wouldn't work well in practice because the native parser can easily exhaust memory given certain regex patterns. This would definitely be quite bad if this was triggered during normal processing.

yaakov-h · 2018-01-04T00:44:51Z

the native parser can easily exhaust memory given certain regex patterns

When compiling it without running it? How does this new parser avoid doing so?

CyrusNajmabadi · 2018-01-04T02:47:29Z

When compiling it without running it?

Yes.

How does this new parser avoid doing so?

Because the new one only creates a parse tree, and doesn't try to create an evaluation engine (which itself may do things that exhaust memory). Here's a simple pattern to try out which demonstrates this: a{2147483647,}

Inside the native regex engine, this ends up trying to make a string with literally 2 billion 'a's in a row. Doing that is bad :)

The new system doesn't try to do anything like that. It just makes a simple parse tree with 6 tokens in it. Easy peasy.

--

Note this is the same reason that it's possible for Roslyn to produce an in memory representation for C# when actually executing that C# might never work.

CyrusNajmabadi · 2018-01-04T05:57:51Z

Ever wanted to know what one of your backreferences is actually referring to? It's easy now!

Brace highlighting is also there to help you easily track down matching parens in complex expressions:

CyrusNajmabadi · 2018-01-04T19:07:57Z

Here's how things currently look in the Blue theme:

Dark theme looks like:

There are likely contrast issues here that will have to be nailed down. @dpoeschl @kuhlenh I'm hoping you can help with this as I'm totally clueless about that area.

Note: i've gone for a colorful medley for colorizing regexes. While this doesn't exactly fit the more austere style we have for C# and VB, i think it's appropriate given the complexity and information density present in regexes. In C#/Vb there is enough structure and spacing to help identify things quickly. In regexes this is really not the case, especially as the complexity and usage of escape sequence increases. As such, having clearly separated colors for all the different regex concepts is quite useful for better understanding things.

CyrusNajmabadi · 2018-01-04T19:10:26Z

For example, with appropriate coloring, would you know that the first dash here is very different from the second, third or fourth?

Similarly, colors can immediately tell you that the following is doing what you think:

vs:

CyrusNajmabadi · 2018-01-04T19:16:00Z

src/EditorFeatures/Core.Wpf/EditorFeaturesWpfResources.resx

-	<!-- 
-		Microsoft ResX Schema
+  <!-- 
+    Microsoft ResX Schema 


Not sure why this changed. All i did was use the normal VS designer to edit these files.

CyrusNajmabadi · 2018-01-04T19:25:52Z

src/Workspaces/CSharpTest/RegularExpressions/CSharpRegexParserTests_TestGeneration.cs

+
+namespace Microsoft.CodeAnalysis.CSharp.UnitTests.RegularExpressions
+{
+    internal class LogicalStringComparer : IComparer<string>


Useful code i used to quickly generate tons of test cases from lots of .net regex examples i found. Can be removed from final checkin.

CyrusNajmabadi · 2018-01-04T19:28:57Z

src/Workspaces/Core/Portable/RegularExpressions/RegexCharClass.cs

+            return s_escapeCategories.Contains(value);
+        }
+
+        public static bool IsWordChar(char ch)


this is the important function we needed. if someone has high confidence on how this can be translated to usages of existing APIs in .net (i.e. methods on char or UnicodeCategory and whatnot), i can switch over to using that and can remove all this code. @sharwell as you sounded like you were very familiar with unicode.

CyrusNajmabadi · 2018-08-14T19:32:01Z

I don't know if this was addressed. The question is why store an array of diagnostics? It seems we only need zero or one.

A few reasons:

I'm trying to match the Roslyn syntactic API fairly closely.
I don't want to limit this to one, then find out we need multiple later, and then have to go figure out if this is all working properly :)
the core syntax types are not Regex-specific. They are embedded-language-neutral. That way more languages can be added in the future if we decide we want that (like Json and whatnot). I don't want to limit to one just because that's how we currently implement Regex.

jcouv

LGTM (iteration 34) modulo open issue on licence header and a couple nits/suggestions for your consideration.
Thanks for this fantastic PR and your patience!

CyrusNajmabadi · 2018-08-14T19:45:39Z

LGTM (iteration 34) modulo open issue on licence header

@Pilchie @jinujoseph We include a corefx file here: https://github.com/dotnet/roslyn/pull/23984/files#diff-f689c52f48c471beeaea4252031ed8f3

It's unclear what we should do about copyright headers. Can you advise?

jcouv · 2018-08-14T19:48:14Z

@dpoeschl has a thread on the header question. He'll share the answer once the thread lands.

CyrusNajmabadi · 2018-08-14T19:49:23Z

Overall, i'm happy with where we've landed nit-wise. I've been over this code like a hundred times :) I think it's overall in a good shape.

CyrusNajmabadi · 2018-08-14T20:11:59Z

@jcouv This is also being merged into a feature branch. So ideally we can still move forward, wihle still waiting on the results of that email thread. Thanks!

jcouv · 2018-08-14T20:19:35Z

That makes sense. Let's merge once green.

CyrusNajmabadi · 2018-08-14T20:30:50Z

@jcouv Required tests are passing. Not sure what's up with integration tests. AFAICT, they never want to pass. I'm seeing the following in the logs:

13:26:30 FATAL: command execution failed
13:26:30 java.nio.channels.ClosedChannelException
13:26:30 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
13:26:30 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
13:26:30 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
13:26:30 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
13:26:30 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
13:26:30 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
13:26:30 	at java.lang.Thread.run(Thread.java:748)
13:26:30 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 40.112.170.247/40.112.170.247:6017' is disconnected.

CyrusNajmabadi requested a review from a team as a code owner December 31, 2017 06:15

CyrusNajmabadi commented Dec 31, 2017

View reviewed changes

Tragetaschen reviewed Jan 2, 2018

View reviewed changes

CyrusNajmabadi commented Jan 4, 2018

View reviewed changes

CyrusNajmabadi added 8 commits August 14, 2018 10:45

Fix comment.

66a782b

Expand message.

250203e

Simplify parsing loop

0d97cbc

Update comment.

b1d5765

Use helper.

22e175b

Ensure 'None' is 0

a30fa0f

Add comments

b58c0c7

Remove unused local.

8a3dd8f

jcouv approved these changes Aug 14, 2018

View reviewed changes

jcouv merged commit d0dfe31 into dotnet:features/embeddedRegex Aug 14, 2018

CyrusNajmabadi deleted the regexParsing4 branch August 14, 2018 20:51

0x53A mentioned this pull request Jul 2, 2019

regex literal fsharp/fslang-suggestions#763

Closed

yaakov-h mentioned this pull request Jul 17, 2019

Proposal: allow a [RegexColorization] pseudo-attribute for colorization of regex strings. dotnet/csharplang#2662

Closed

danmoseley mentioned this pull request Jan 31, 2020

Investigate porting some/all of Roslyn regex parsing checker tests dotnet/runtime#25270

Closed

CyrusNajmabadi mentioned this pull request Jul 24, 2020

Make enum RegexParseError and RegexParseException public dotnet/runtime#38872

Closed

CyrusNajmabadi mentioned this pull request Aug 18, 2021

[Proposal]: Raw string literal dotnet/csharplang#4304

Open

4 tasks

		@@ -63,6 +63,9 @@ internal static class IDEDiagnosticIds

		public const string UseDeconstructionDiagnosticId = "IDE0042";

		public const string RegexPatternDiagnosticId = "IDE0042";

Add in a regex parser so that the IDE can provide services around regex editing. #23984

Add in a regex parser so that the IDE can provide services around regex editing. #23984

Conversation

CyrusNajmabadi commented Dec 31, 2017 • edited Loading

CyrusNajmabadi commented Dec 31, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CyrusNajmabadi commented Dec 31, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AdamSpeight2008 commented Dec 31, 2017

Tragetaschen commented Jan 1, 2018

CyrusNajmabadi commented Jan 2, 2018

KirillOsenkov commented Jan 2, 2018

CyrusNajmabadi commented Jan 2, 2018

CyrusNajmabadi commented Jan 2, 2018 • edited Loading

CyrusNajmabadi commented Jan 2, 2018 • edited Loading

CyrusNajmabadi commented Jan 2, 2018

jnm2 commented Jan 2, 2018

CyrusNajmabadi commented Jan 2, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CyrusNajmabadi commented Jan 3, 2018

CyrusNajmabadi commented Jan 4, 2018

yaakov-h commented Jan 4, 2018

CyrusNajmabadi commented Jan 4, 2018

CyrusNajmabadi commented Jan 4, 2018 • edited Loading

CyrusNajmabadi commented Jan 4, 2018

CyrusNajmabadi commented Jan 4, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CyrusNajmabadi commented Aug 14, 2018 • edited Loading

jcouv left a comment

Choose a reason for hiding this comment

CyrusNajmabadi commented Aug 14, 2018

jcouv commented Aug 14, 2018

CyrusNajmabadi commented Aug 14, 2018

CyrusNajmabadi commented Aug 14, 2018

jcouv commented Aug 14, 2018

CyrusNajmabadi commented Aug 14, 2018

CyrusNajmabadi commented Dec 31, 2017 •

edited

Loading

CyrusNajmabadi commented Dec 31, 2017 •

edited

Loading

CyrusNajmabadi commented Jan 2, 2018 •

edited

Loading

CyrusNajmabadi commented Jan 2, 2018 •

edited

Loading

CyrusNajmabadi commented Jan 4, 2018 •

edited

Loading

CyrusNajmabadi commented Jan 4, 2018 •

edited

Loading

CyrusNajmabadi commented Aug 14, 2018 •

edited

Loading