-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add in a regex parser so that the IDE can provide services around regex editing. #23984
Add in a regex parser so that the IDE can provide services around regex editing. #23984
Conversation
@@ -1334,4 +1334,7 @@ This version used in: {2}</value> | |||
<data name="indexer_" xml:space="preserve"> | |||
<value>indexer</value> | |||
</data> | |||
</root> | |||
<data name="Regex_issue_0" xml:space="preserve"> | |||
<value>Regex issue: {0}</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: localization seems to have changed a lot. There are a lot of these new .xlf files. They added a lot of bloat to this PR. But i wasn't sure what else to do. Let me know if i need to do something different here.
protected abstract IParameterSymbol DetermineParameter(SemanticModel semanticModel, SyntaxNode argumentNode, CancellationToken cancellationToken); | ||
|
||
protected override void InitializeWorker(AnalysisContext context) | ||
=> context.RegisterSemanticModelAction(AnalyzeSemanticModel); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this as a semantic model analyzer so that we wouldn't need callbacks and processing for every string in the world. Instead, we could process a file at a time, checking the user options once, and then doing a full pass to check each string in it. @sharwell let me know if there's a preference on the best way to design these to be good perf citizens.
Note: this PR may look large. However, 21k of the 29k is purely tests and 3k is just loc. So the actual core logic added is more like 5k. |
/// Minimal copy of https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs | ||
/// Used to accurately determine if something is a WordChar according to the .Net regex engine. | ||
/// </summary> | ||
internal static class RegexCharClass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code was copied over near verbatim. All i did was clean up formatting and remove any methods that were not needed at all by the new regex parser.
|
||
namespace Microsoft.CodeAnalysis.RegularExpressions | ||
{ | ||
internal enum RegexKind |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: should proably be called SyntaxKind. But i figured Roslyn devs would be annoyed if searching for SyntaxKind in something like NavigateTo now showed this as well.
@@ -613,4 +613,94 @@ | |||
<data name="Changing_document_property_is_not_supported" xml:space="preserve"> | |||
<value>Changing document properties is not supported</value> | |||
</data> | |||
<data name="Alternation_conditions_cannot_be_comments" xml:space="preserve"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values are all copied from corefx. I think the phrasing of many of them is quite poor (and some are downright misleading depending on what the user actually wrote). However, they match the errors the native parser produces, which i think is quite important.
|
||
namespace Microsoft.CodeAnalysis.RegularExpressions | ||
{ | ||
internal sealed class RegexCompilationUnit : RegexNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the nodes we create in the parser. I considered autogenerating. But it wasn't that hard to just hand author. If we get to a point that we want to really flesh these out (for example, adding things like Update methods), then it might make sense to generate.
|
...or showing in the IDE what character a unicode escape sequence actually is. |
@KirillOsenkov I'm trying to get classification working. However, i'm having a problem with adding more classifications for a region already classified as a string literal. Do you know the right MEF EditorFormatDefinition shenangians to make it possible for another type definition to have higher priority and be shown when these spans overlap. Thanks! |
@CyrusNajmabadi yeah,
|
Figured it out! you just need to add: [Order(After = ClassificationTypeNames.StringLiteral)]
[Order(After = ClassificationTypeNames.VerbatimStringLiteral)] to your type definition. |
@KirillOsenkov Do you know how to create separate light/dark theme values for format definitions. The ones i've picked have some issues in the light theme: Note: i like these colors mostly in both light/dark. The exception is the plain-text color (which should be black in the light theme) and the Thanks! |
@dpoeschl Can you help with this question (about light/dark theme colors) as well? Thanks! |
@CyrusNajmabadi You're amazing. 😆 |
@dotnet/roslyn-ide This is ready for an initial review pass. Main logic is complete (though i will probably do a commenting pass). Tests still need to be created for some of the high level features. However, they will be very light as the main testing happens at the virtual-char and regex-parsing layers. i.e. the classification tests only need to validate the main regex classifications. All the myriad regex permutations don't need to be tested at the classification layer as that has already been tested at the parser layer. |
@@ -63,6 +63,9 @@ internal static class IDEDiagnosticIds | |||
|
|||
public const string UseDeconstructionDiagnosticId = "IDE0042"; | |||
|
|||
public const string RegexPatternDiagnosticId = "IDE0042"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
43?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed!
@dotnet/roslyn-ide There is one more error that the native parser produces that i don't currently produce today. Once i get that i think i will be at parity with it. However, even without that error, the PR is ready for review. |
Ok. That last native regex error was very wonky. It took me a while to get the new parser to properly emulate it. Seems highly likely that the existing behavior in the native parser is unintentional. But, hey, what can you do. At this point i think i'm very close to 100% parity with the native parser. I've debugged through pretty much all its branches, and i've tried to create test cases to hit every flow (including all errors). There are likely still going to be some mistakes, but i think this is close enough to be shippable. Note: i considered having the IDE code actually run both parsers, and only report errors from mine if the native parser reports errors. However, this wouldn't work well in practice because the native parser can easily exhaust memory given certain regex patterns. This would definitely be quite bad if this was triggered during normal processing. |
When compiling it without running it? How does this new parser avoid doing so? |
Yes.
Because the new one only creates a parse tree, and doesn't try to create an evaluation engine (which itself may do things that exhaust memory). Here's a simple pattern to try out which demonstrates this: a{2147483647,} Inside the native regex engine, this ends up trying to make a string with literally 2 billion 'a's in a row. Doing that is bad :) The new system doesn't try to do anything like that. It just makes a simple parse tree with 6 tokens in it. Easy peasy. -- Note this is the same reason that it's possible for Roslyn to produce an in memory representation for C# when actually executing that C# might never work. |
Here's how things currently look in the Blue theme: Dark theme looks like: There are likely contrast issues here that will have to be nailed down. @dpoeschl @kuhlenh I'm hoping you can help with this as I'm totally clueless about that area. Note: i've gone for a colorful medley for colorizing regexes. While this doesn't exactly fit the more austere style we have for C# and VB, i think it's appropriate given the complexity and information density present in regexes. In C#/Vb there is enough structure and spacing to help identify things quickly. In regexes this is really not the case, especially as the complexity and usage of escape sequence increases. As such, having clearly separated colors for all the different regex concepts is quite useful for better understanding things. |
<!-- | ||
Microsoft ResX Schema | ||
<!-- | ||
Microsoft ResX Schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this changed. All i did was use the normal VS designer to edit these files.
|
||
namespace Microsoft.CodeAnalysis.CSharp.UnitTests.RegularExpressions | ||
{ | ||
internal class LogicalStringComparer : IComparer<string> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useful code i used to quickly generate tons of test cases from lots of .net regex examples i found. Can be removed from final checkin.
return s_escapeCategories.Contains(value); | ||
} | ||
|
||
public static bool IsWordChar(char ch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the important function we needed. if someone has high confidence on how this can be translated to usages of existing APIs in .net (i.e. methods on char or UnicodeCategory and whatnot), i can switch over to using that and can remove all this code. @sharwell as you sounded like you were very familiar with unicode.
A few reasons:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (iteration 34) modulo open issue on licence header and a couple nits/suggestions for your consideration.
Thanks for this fantastic PR and your patience!
@Pilchie @jinujoseph We include a corefx file here: https://github.com/dotnet/roslyn/pull/23984/files#diff-f689c52f48c471beeaea4252031ed8f3 It's unclear what we should do about copyright headers. Can you advise? |
@dpoeschl has a thread on the header question. He'll share the answer once the thread lands. |
Overall, i'm happy with where we've landed nit-wise. I've been over this code like a hundred times :) I think it's overall in a good shape. |
@jcouv This is also being merged into a feature branch. So ideally we can still move forward, wihle still waiting on the results of that email thread. Thanks! |
That makes sense. Let's merge once green. |
@jcouv Required tests are passing. Not sure what's up with integration tests. AFAICT, they never want to pass. I'm seeing the following in the logs:
|
I was bored flying to MN and DC. So i decided to scratch an itch that came up when discussing things over here: dotnet/csharplang#371
This PR introduces support in the Roslyn IDE for improved features for users working with .Net regexes. Specifically, the intent of this PR is to support the following (entirely optional) features around regexes:
"(a)(b)"
be simplified directly toab
.However, supporting regexes is complicated by how regexes are supported in the language today. Specifically because of the complexity present in how VB and C# escape values in strings on top of how complex .net regexes are as well. To deal with that, two new components have been added to Roslyn.
The first subsystem is called the VirtualCharService deals with the following issue. To the final .net regex, the following snippets of code appear completely identical to it:
These are all ways of writing the
\1
regex.In other words, C# allows a wide variety of input strings to all compile down to the same final 'value' (or 'ValueText') that the regex engine will finally see. This is a major issue as it means that any data reported by the regex engine must be accurate with respect to the text as the user wrote it. For example, in all of the equivalent cases above, there is the same error "Reference to undefined group number 1". However, for each form the user wrote, it's necessary to understand what the right value is to highlight as the problem. i.e.
and
So, the purpose of the VirtualCharService is to translate all of the above pieces of user code to the same final set of characters the regex engine will see (specifically
\
and1
) while also maintaining the knowledge of where those characters came from (for example, that1
came from\u0031
in the last example). In essence, the VirtualCharService is able to produce the ValueText for any string literal, while having a mapping back from each character in the ValueText back to the original source span of the document that formed this.With the VirtualCharService user code can be translated into a common format that then can be processed uniformly. This means that the part of the system that actually tries to understand the regex does not need to know about the differences between
@""
and""
strings, or the differences between C# and VB. It also means that it can be used by any roslyn language (for example, F#) if that is so desired.The second service is the actual regex parser that consumes the results of the VirtualCharService. The regex parser is a fork of the real .net regex parser (from https://github.com/dotnet/corefx/blob/master/src/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexParser.cs) however, changed heavily. The reason for this is that the existing regex parser is ill suited for all the needs an IDE has. Major problems the existing parser has are:
However, the existing parser does have certain things going for it. Namely, it is the source of truth in the .net world. It doesn't matter what the docs say on how .net regexes work and are parsed. At the end of they day, this is the parser used, so it must always be deferred to when determining what is a syntactically legal regex and what isn't.
The regex parser this PR adds introduces a system far more in line with how other roslyn parsers work. Namely, the tree it produces is immutable and fully represents all the characters in the original token. The tree also follows roslyn invariants around nodes, tokens and trivia. Ideally, the tree would literally be built on the syntax classes defined at the Microsoft.CodeAnalysis layer. However, they contain internals that make them only implementable at the C# and Vb layers. Because of that, the tree produces by the new regex parser follows the spirit of the Roslyn syntax model even if it can't fit into it perfectly.
This parser has been extensively tested and comes with 750 seed cases that validate it. These are called 'test seeds' as the inputs to the engine are then mutated heavily to produce even more cases to validate the engine with (for example, testing all prefixes and suffixes of every test case to ensure no crashes or infinite loops occur). Validation is very stringent. All cases must parse without any problems, producing trees that abide by all roslyn invariants. Furthermore, all cases validate that the new parser does not produce errors if the native parser does not. Conversely, if the native parser produces an error, the new parser must produce the same error as it does. That means that we do not even define new messages to show the user. Instead, we use the same message (or message fragment) that the native regex parser gives the user. For example:
Because of this validation, it's very easy to feed in new tests into the system and have it validate that the roslyn regex parser is working in line with the native parser.
The test suite currently contains all the existing positive and negative tests from the existing .net regex suite, as well as many edge cases i added while writing the new parser.
--
Here's the current worklist for this feature:
Required:
/* language=regex */ "[a-zA-Z]"
[z-a]
.Nice to have:
(?
something should pop up to show you all the legal forms there are).I hope to get all this done in the next week or so. I hope that this could serve as a pleasant addition to the C# and Vb editing experience for so many of our devs who do use regexes day to day.