Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add in a regex parser so that the IDE can provide services around regex editing. #23984

Merged
merged 34 commits into from
Aug 14, 2018

Conversation

CyrusNajmabadi
Copy link
Member

@CyrusNajmabadi CyrusNajmabadi commented Dec 31, 2017

I was bored flying to MN and DC. So i decided to scratch an itch that came up when discussing things over here: dotnet/csharplang#371

This PR introduces support in the Roslyn IDE for improved features for users working with .Net regexes. Specifically, the intent of this PR is to support the following (entirely optional) features around regexes:

  1. Colorization. Users should be able to write a regex and have it colorized automatically, with a set of colors they can customize if they want.
  2. Highlighting. Users should be able to place their cursor on something like a brace and have the matching brace highlighted. This should extend to complex regex concepts (like backreferences) as well.
  3. Syntax errors. Users should see syntax errors directly in the editor and error list without having to compile or run their program.
  4. Fixes. Errors in the regex should come with potential fixes to address the problem immediately.
  5. Suggestions. The IDE should be able to suggest completely safe refactorings to a users regex to simplify it. For example, suggesting that "(a)(b)" be simplified directly to ab.

However, supporting regexes is complicated by how regexes are supported in the language today. Specifically because of the complexity present in how VB and C# escape values in strings on top of how complex .net regexes are as well. To deal with that, two new components have been added to Roslyn.

The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it:

"\\1"       // In a normal string, we have to escape the \
@"\1"       // But not in a verbatim string
"\\\u0031"  // The '1' could be escaped
"\\u005c1"  // Even the backslash *itself* may be escaped

These are all ways of writing the \1 regex.

In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e.

image

and

image

So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically \ and 1) while also maintaining the knowledge of where those characters came from (for example, that 1 came from \u0031 in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this.

With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between @"" and "" strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.

The second service is the actual regex parser that consumes the results of the VirtualCharService. The regex parser is a fork of the real .net regex parser (from https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs) however, changed heavily. The reason for this is that the existing regex parser is ill suited for all the needs an IDE has. Major problems the existing parser has are:

  1. It is not error tolerant. With the first error it encounters it completely bails out. This would mean that the roslyn services would only work if the regex was written completely properly.
  2. it does not produce a usable tree by IDE standards. While it does produce a tree, it is an extremely abstract one that not only does not contain much of the information that would be needed. It is also heavily processed and changed as parsing happens. i.e. the parser will actually merge, split, erase, and even reorder nodes within the tree.
  3. it works with data formats that are intended to work extremely fast and efficiently when actually performing matching of regexes, but which would be very baroque and unpleasant for editing scenarios. For example, it compresses data into a compact binary format, which it then stores in strings themselves that it intersperses with chunks of the original string.
  4. It will actually crash when passed in some strings (i was able to get index-out-of-bounds and even null-refs when using it).
  5. It can use excessive amounts of memory (even OOMing) just when parsing.

However, the existing parser does have certain things going for it. Namely, it is the source of truth in the .net world. It doesn't matter what the docs say on how .net regexes work and are parsed. At the end of they day, this is the parser used, so it must always be deferred to when determining what is a syntactically legal regex and what isn't.

The regex parser this PR adds introduces a system far more in line with how other roslyn parsers work. Namely, the tree it produces is immutable and fully represents all the characters in the original token. The tree also follows roslyn invariants around nodes, tokens and trivia. Ideally, the tree would literally be built on the syntax classes defined at the Microsoft.CodeAnalysis layer. However, they contain internals that make them only implementable at the C# and Vb layers. Because of that, the tree produces by the new regex parser follows the spirit of the Roslyn syntax model even if it can't fit into it perfectly.

This parser has been extensively tested and comes with 750 seed cases that validate it. These are called 'test seeds' as the inputs to the engine are then mutated heavily to produce even more cases to validate the engine with (for example, testing all prefixes and suffixes of every test case to ensure no crashes or infinite loops occur). Validation is very stringent. All cases must parse without any problems, producing trees that abide by all roslyn invariants. Furthermore, all cases validate that the new parser does not produce errors if the native parser does not. Conversely, if the native parser produces an error, the new parser must produce the same error as it does. That means that we do not even define new messages to show the user. Instead, we use the same message (or message fragment) that the native regex parser gives the user. For example:

image

image

Because of this validation, it's very easy to feed in new tests into the system and have it validate that the roslyn regex parser is working in line with the native parser.

The test suite currently contains all the existing positive and negative tests from the existing .net regex suite, as well as many edge cases i added while writing the new parser.

--

Here's the current worklist for this feature:

Required:

  • Create VirtualCharService system
  • Implement VirtualCharService for C#
  • Test VirtualCharService for C#
  • Implement VirtualCharService for VB
  • Test VirtualCharService for VB
  • Create regular expression parser that will produce a Roslyn tree for a regex.
  • Test regular expression parser
  • Add stack overflow safeguards for regex parsing.
  • Add diagnostic analyzer that parses regexes and reports warnings for issues as Diagnostics.
  • Add tests for regex squiggles
  • Add colorizer and fonts/colors entries for regex components.
  • Add tests for regex colorizer
  • Add user facing options to disable language services for regexs.
  • Add way to apply this system to non-obvious regexes. For example allow users to write something like /* language=regex */ "[a-zA-Z]"
  • Add error for character classes that go in the wrong direction. i.e. [z-a].
  • Add highlighter for regex components

Nice to have:

  • Add fixer for regex errors
  • Add simplifier for complex regexes
  • Add pattern-help for regex components (i.e. when typing (? something should pop up to show you all the legal forms there are).

I hope to get all this done in the next week or so. I hope that this could serve as a pleasant addition to the C# and Vb editing experience for so many of our devs who do use regexes day to day.

@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner December 31, 2017 06:15
@CyrusNajmabadi
Copy link
Member Author

Tagging @kuhlenh @dpoeschl @Pilchie

@@ -1334,4 +1334,7 @@ This version used in: {2}</value>
<data name="indexer_" xml:space="preserve">
<value>indexer</value>
</data>
</root>
<data name="Regex_issue_0" xml:space="preserve">
<value>Regex issue: {0}</value>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: localization seems to have changed a lot. There are a lot of these new .xlf files. They added a lot of bloat to this PR. But i wasn't sure what else to do. Let me know if i need to do something different here.

protected abstract IParameterSymbol DetermineParameter(SemanticModel semanticModel, SyntaxNode argumentNode, CancellationToken cancellationToken);

protected override void InitializeWorker(AnalysisContext context)
=> context.RegisterSemanticModelAction(AnalyzeSemanticModel);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this as a semantic model analyzer so that we wouldn't need callbacks and processing for every string in the world. Instead, we could process a file at a time, checking the user options once, and then doing a full pass to check each string in it. @sharwell let me know if there's a preference on the best way to design these to be good perf citizens.

@CyrusNajmabadi
Copy link
Member Author

CyrusNajmabadi commented Dec 31, 2017

Note: this PR may look large. However, 21k of the 29k is purely tests and 3k is just loc. So the actual core logic added is more like 5k.

/// Minimal copy of https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
/// Used to accurately determine if something is a WordChar according to the .Net regex engine.
/// </summary>
internal static class RegexCharClass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code was copied over near verbatim. All i did was clean up formatting and remove any methods that were not needed at all by the new regex parser.


namespace Microsoft.CodeAnalysis.RegularExpressions
{
internal enum RegexKind
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: should proably be called SyntaxKind. But i figured Roslyn devs would be annoyed if searching for SyntaxKind in something like NavigateTo now showed this as well.

@@ -613,4 +613,94 @@
<data name="Changing_document_property_is_not_supported" xml:space="preserve">
<value>Changing document properties is not supported</value>
</data>
<data name="Alternation_conditions_cannot_be_comments" xml:space="preserve">
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are all copied from corefx. I think the phrasing of many of them is quite poor (and some are downright misleading depending on what the user actually wrote). However, they match the errors the native parser produces, which i think is quite important.


namespace Microsoft.CodeAnalysis.RegularExpressions
{
internal sealed class RegexCompilationUnit : RegexNode
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the nodes we create in the parser. I considered autogenerating. But it wasn't that hard to just hand author. If we get to a point that we want to really flesh these out (for example, adding things like Update methods), then it might make sense to generate.

@AdamSpeight2008
Copy link
Contributor

VirtualCharService maybe useful for other roles like validation of format string.

@Tragetaschen
Copy link

...or showing in the IDE what character a unicode escape sequence actually is.

@CyrusNajmabadi
Copy link
Member Author

@KirillOsenkov I'm trying to get classification working. However, i'm having a problem with adding more classifications for a region already classified as a string literal. Do you know the right MEF EditorFormatDefinition shenangians to make it possible for another type definition to have higher priority and be shown when these spans overlap.

Thanks!

@KirillOsenkov
Copy link
Member

@CyrusNajmabadi yeah, Microsoft.VisualStudio.Language.Intellisense.PriorityAttribute defined in Microsoft.VisualStudio.Text.Internal.dll

[Priority(1)] // 1 wins over the default 0

@CyrusNajmabadi
Copy link
Member Author

Figured it out! you just need to add:

        [Order(After = ClassificationTypeNames.StringLiteral)]
        [Order(After = ClassificationTypeNames.VerbatimStringLiteral)]

to your type definition.

@CyrusNajmabadi
Copy link
Member Author

CyrusNajmabadi commented Jan 2, 2018

Working regex classification:

image

@CyrusNajmabadi
Copy link
Member Author

CyrusNajmabadi commented Jan 2, 2018

@KirillOsenkov Do you know how to create separate light/dark theme values for format definitions. The ones i've picked have some issues in the light theme:

image

Note: i like these colors mostly in both light/dark. The exception is the plain-text color (which should be black in the light theme) and the | character (which likely needs to be a darker yellow/brown). But i'm not sure how to actually get that.

Thanks!

@CyrusNajmabadi
Copy link
Member Author

@dpoeschl Can you help with this question (about light/dark theme colors) as well? Thanks!

@jnm2
Copy link
Contributor

jnm2 commented Jan 2, 2018

@CyrusNajmabadi You're amazing. 😆

@CyrusNajmabadi
Copy link
Member Author

@dotnet/roslyn-ide This is ready for an initial review pass. Main logic is complete (though i will probably do a commenting pass). Tests still need to be created for some of the high level features. However, they will be very light as the main testing happens at the virtual-char and regex-parsing layers. i.e. the classification tests only need to validate the main regex classifications. All the myriad regex permutations don't need to be tested at the classification layer as that has already been tested at the parser layer.

@@ -63,6 +63,9 @@ internal static class IDEDiagnosticIds

public const string UseDeconstructionDiagnosticId = "IDE0042";

public const string RegexPatternDiagnosticId = "IDE0042";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

43?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed!

@CyrusNajmabadi
Copy link
Member Author

@dotnet/roslyn-ide

There is one more error that the native parser produces that i don't currently produce today. Once i get that i think i will be at parity with it. However, even without that error, the PR is ready for review.

@CyrusNajmabadi
Copy link
Member Author

Ok. That last native regex error was very wonky. It took me a while to get the new parser to properly emulate it. Seems highly likely that the existing behavior in the native parser is unintentional. But, hey, what can you do.

At this point i think i'm very close to 100% parity with the native parser. I've debugged through pretty much all its branches, and i've tried to create test cases to hit every flow (including all errors). There are likely still going to be some mistakes, but i think this is close enough to be shippable.

Note: i considered having the IDE code actually run both parsers, and only report errors from mine if the native parser reports errors. However, this wouldn't work well in practice because the native parser can easily exhaust memory given certain regex patterns. This would definitely be quite bad if this was triggered during normal processing.

@yaakov-h
Copy link
Member

yaakov-h commented Jan 4, 2018

the native parser can easily exhaust memory given certain regex patterns

When compiling it without running it? How does this new parser avoid doing so?

@CyrusNajmabadi
Copy link
Member Author

When compiling it without running it?

Yes.

How does this new parser avoid doing so?

Because the new one only creates a parse tree, and doesn't try to create an evaluation engine (which itself may do things that exhaust memory). Here's a simple pattern to try out which demonstrates this: a{2147483647,}

Inside the native regex engine, this ends up trying to make a string with literally 2 billion 'a's in a row. Doing that is bad :)

The new system doesn't try to do anything like that. It just makes a simple parse tree with 6 tokens in it. Easy peasy.

--

Note this is the same reason that it's possible for Roslyn to produce an in memory representation for C# when actually executing that C# might never work.

@CyrusNajmabadi
Copy link
Member Author

CyrusNajmabadi commented Jan 4, 2018

Ever wanted to know what one of your backreferences is actually referring to? It's easy now!

image

Brace highlighting is also there to help you easily track down matching parens in complex expressions:

image

@CyrusNajmabadi
Copy link
Member Author

Here's how things currently look in the Blue theme:

image

Dark theme looks like:

image

There are likely contrast issues here that will have to be nailed down. @dpoeschl @kuhlenh I'm hoping you can help with this as I'm totally clueless about that area.

Note: i've gone for a colorful medley for colorizing regexes. While this doesn't exactly fit the more austere style we have for C# and VB, i think it's appropriate given the complexity and information density present in regexes. In C#/Vb there is enough structure and spacing to help identify things quickly. In regexes this is really not the case, especially as the complexity and usage of escape sequence increases. As such, having clearly separated colors for all the different regex concepts is quite useful for better understanding things.

@CyrusNajmabadi
Copy link
Member Author

CyrusNajmabadi commented Jan 4, 2018

For example, with appropriate coloring, would you know that the first dash here is very different from the second, third or fourth?

image

Similarly, colors can immediately tell you that the following is doing what you think:

image

vs:

image

<!--
Microsoft ResX Schema
<!--
Microsoft ResX Schema
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this changed. All i did was use the normal VS designer to edit these files.


namespace Microsoft.CodeAnalysis.CSharp.UnitTests.RegularExpressions
{
internal class LogicalStringComparer : IComparer<string>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful code i used to quickly generate tons of test cases from lots of .net regex examples i found. Can be removed from final checkin.

return s_escapeCategories.Contains(value);
}

public static bool IsWordChar(char ch)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the important function we needed. if someone has high confidence on how this can be translated to usages of existing APIs in .net (i.e. methods on char or UnicodeCategory and whatnot), i can switch over to using that and can remove all this code. @sharwell as you sounded like you were very familiar with unicode.

@CyrusNajmabadi
Copy link
Member Author

CyrusNajmabadi commented Aug 14, 2018

I don't know if this was addressed. The question is why store an array of diagnostics? It seems we only need zero or one.

A few reasons:

  1. I'm trying to match the Roslyn syntactic API fairly closely.
  2. I don't want to limit this to one, then find out we need multiple later, and then have to go figure out if this is all working properly :)
  3. the core syntax types are not Regex-specific. They are embedded-language-neutral. That way more languages can be added in the future if we decide we want that (like Json and whatnot). I don't want to limit to one just because that's how we currently implement Regex.

Copy link
Member

@jcouv jcouv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (iteration 34) modulo open issue on licence header and a couple nits/suggestions for your consideration.
Thanks for this fantastic PR and your patience!

@CyrusNajmabadi
Copy link
Member Author

LGTM (iteration 34) modulo open issue on licence header

@Pilchie @jinujoseph We include a corefx file here: https://github.com/dotnet/roslyn/pull/23984/files#diff-f689c52f48c471beeaea4252031ed8f3

It's unclear what we should do about copyright headers. Can you advise?

@jcouv
Copy link
Member

jcouv commented Aug 14, 2018

@dpoeschl has a thread on the header question. He'll share the answer once the thread lands.

@CyrusNajmabadi
Copy link
Member Author

Overall, i'm happy with where we've landed nit-wise. I've been over this code like a hundred times :) I think it's overall in a good shape.

@CyrusNajmabadi
Copy link
Member Author

@jcouv This is also being merged into a feature branch. So ideally we can still move forward, wihle still waiting on the results of that email thread. Thanks!

@jcouv
Copy link
Member

jcouv commented Aug 14, 2018

That makes sense. Let's merge once green.

@CyrusNajmabadi
Copy link
Member Author

@jcouv Required tests are passing. Not sure what's up with integration tests. AFAICT, they never want to pass. I'm seeing the following in the logs:

13:26:30 FATAL: command execution failed
13:26:30 java.nio.channels.ClosedChannelException
13:26:30 	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
13:26:30 	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
13:26:30 	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
13:26:30 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
13:26:30 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
13:26:30 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
13:26:30 	at java.lang.Thread.run(Thread.java:748)
13:26:30 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from 40.112.170.247/40.112.170.247:6017' is disconnected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-IDE Community The pull request was submitted by a contributor who is not a Microsoft employee.
Projects
None yet
Development

Successfully merging this pull request may close these issues.