diff --git a/docs/standard/base-types/anchors-in-regular-expressions.md b/docs/standard/base-types/anchors-in-regular-expressions.md index d58abba5bb5af..fa11783a74d01 100644 --- a/docs/standard/base-types/anchors-in-regular-expressions.md +++ b/docs/standard/base-types/anchors-in-regular-expressions.md @@ -62,9 +62,9 @@ Anchors, or atomic zero-width assertions, specify a position in the string where The `$` anchor specifies that the preceding pattern must occur at the end of the input string, or before `\n` at the end of the input string. - If you use `$` with the option, the match can also occur at the end of a line. Note that `$` matches `\n` but does not match `\r\n` (the combination of carriage return and newline characters, or CR/LF). To match the CR/LF character combination, include `\r?$` in the regular expression pattern. + If you use `$` with the option, the match can also occur at the end of a line. Note that `$` is satisfied at `\n` but not at `\r\n` (the combination of carriage return and newline characters, or CR/LF). To handle the CR/LF character combination, include `\r?$` in the regular expression pattern. Note that `\r?$` will include any `\r` in the match. - The following example adds the `$` anchor to the regular expression pattern used in the example in the [Start of String or Line](#start-of-string-or-line-) section. When used with the original input string, which includes five lines of text, the method is unable to find a match, because the end of the first line does not match the `$` pattern. When the original input string is split into a string array, the method succeeds in matching each of the five lines. When the method is called with the `options` parameter set to , no matches are found because the regular expression pattern does not account for the carriage return element (\u+000D). 
However, when the regular expression pattern is modified by replacing `$` with `\r?$`, calling the method with the `options` parameter set to again finds five matches. + The following example adds the `$` anchor to the regular expression pattern used in the example in the [Start of String or Line](#start-of-string-or-line-) section. When used with the original input string, which includes five lines of text, the method is unable to find a match, because the end of the first line does not match the `$` pattern. When the original input string is split into a string array, the method succeeds in matching each of the five lines. When the method is called with the `options` parameter set to , no matches are found because the regular expression pattern does not account for the carriage return character `\r`. However, when the regular expression pattern is modified by replacing `$` with `\r?$`, calling the method with the `options` parameter set to again finds five matches. [!code-csharp[Conceptual.RegEx.Language.Assertions#2](../../../samples/snippets/csharp/VS_Snippets_CLR/conceptual.regex.language.assertions/cs/endofstring1.cs#2)] [!code-vb[Conceptual.RegEx.Language.Assertions#2](../../../samples/snippets/visualbasic/VS_Snippets_CLR/conceptual.regex.language.assertions/vb/endofstring1.vb#2)] @@ -80,18 +80,18 @@ Anchors, or atomic zero-width assertions, specify a position in the string where ## End of String or Before Ending Newline: \Z - The `\Z` anchor specifies that a match must occur at the end of the input string, or before `\n` at the end of the input string. It is identical to the `$` anchor, except that `\Z` ignores the option. Therefore, in a multiline string, it can only match the end of the last line, or the last line before `\n`. + The `\Z` anchor specifies that a match must occur at the end of the input string, or before `\n` at the end of the input string. It is identical to the `$` anchor, except that `\Z` ignores the option. 
Therefore, in a multiline string, it can only be satisfied by the end of the last line, or the last line before `\n`. - Note that `\Z` matches `\n` but does not match `\r\n` (the CR/LF character combination). To match CR/LF, include `\r?\Z` in the regular expression pattern. + Note that `\Z` is satisfied at `\n` but is not satisfied at `\r\n` (the CR/LF character combination). To treat CR/LF as if it were `\n`, include `\r?\Z` in the regular expression pattern. Note that this will make the `\r` part of the match. - The following example uses the `\Z` anchor in a regular expression that is similar to the example in the [Start of String or Line](#start-of-string-or-line-) section, which extracts information about the years during which some professional baseball teams existed. The subexpression `\r?\Z` in the regular expression `^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+\r?\Z` matches the end of a string, and also matches a string that ends with `\n` or `\r\n`. As a result, each element in the array matches the regular expression pattern. + The following example uses the `\Z` anchor in a regular expression that is similar to the example in the [Start of String or Line](#start-of-string-or-line-) section, which extracts information about the years during which some professional baseball teams existed. The subexpression `\r?\Z` in the regular expression `^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+\r?\Z` is satisfied at the end of a string, and also at the end of a string that ends with `\n` or `\r\n`. As a result, each element in the array matches the regular expression pattern. 
[!code-csharp[Conceptual.RegEx.Language.Assertions#4](../../../samples/snippets/csharp/VS_Snippets_CLR/conceptual.regex.language.assertions/cs/endofstring2.cs#4)] [!code-vb[Conceptual.RegEx.Language.Assertions#4](../../../samples/snippets/visualbasic/VS_Snippets_CLR/conceptual.regex.language.assertions/vb/endofstring2.vb#4)] ## End of String Only: \z - The `\z` anchor specifies that a match must occur at the end of the input string. Like the `$` language element, `\z` ignores the option. Unlike the `\Z` language element, `\z` does not match a `\n` character at the end of a string. Therefore, it can only match the last line of the input string. + The `\z` anchor specifies that a match must occur at the end of the input string. Like the `\Z` language element, `\z` ignores the option. Unlike the `\Z` language element, `\z` is not satisfied by a `\n` character at the end of a string. Therefore, it can only match the end of the input string. The following example uses the `\z` anchor in a regular expression that is otherwise identical to the example in the previous section, which extracts information about the years during which some professional baseball teams existed. The example tries to match each of five elements in a string array with the regular expression pattern `^((\w+(\s?)){2,}),\s(\w+\s\w+),(\s\d{4}(-(\d{4}|present))?,?)+\r?\z`. Two of the strings end with carriage return and line feed characters, one ends with a line feed character, and two end with neither a carriage return nor a line feed character. As the output shows, only the strings without a carriage return or line feed character match the pattern.
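The differences between `$`, `\Z`, and `\z` described in these sections can be checked with a small experiment. The following is a minimal sketch, not one of the repository's snippets: the sample inputs, the `\d` prefix, and the `Show` helper are made up for illustration, and no options are set.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical inputs: the same text with no line ending, with \n, and with \r\n.
string[] inputs = { "Line 5", "Line 5\n", "Line 5\r\n" };

// Illustrative helper that makes the line-ending characters visible in the output.
string Show(string s) => s.Replace("\r", "\\r").Replace("\n", "\\n");

foreach (string anchor in new[] { @"$", @"\Z", @"\z", @"\r?\Z", @"\r?\z" })
{
    // Each pattern requires a digit immediately followed by the anchor.
    var regex = new Regex(@"\d" + anchor);
    foreach (string input in inputs)
    {
        Console.WriteLine($"{anchor,-6} {Show(input),-12} {regex.IsMatch(input)}");
    }
}
```

In this sketch, `\r?\Z` accepts all three inputs, while `\r?\z` accepts only the input that has no trailing line ending, which is the behavior the section describes.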
diff --git a/docs/standard/base-types/backtracking-in-regular-expressions.md b/docs/standard/base-types/backtracking-in-regular-expressions.md index c42d68776468b..00b4b3111d5a1 100644 --- a/docs/standard/base-types/backtracking-in-regular-expressions.md +++ b/docs/standard/base-types/backtracking-in-regular-expressions.md @@ -21,8 +21,7 @@ ms.assetid: 34df1152-0b22-4a1c-a76c-3c28c47b70d8 Backtracking occurs when a regular expression pattern contains optional [quantifiers](quantifiers-in-regular-expressions.md) or [alternation constructs](alternation-constructs-in-regular-expressions.md), and the regular expression engine returns to a previous saved state to continue its search for a match. Backtracking is central to the power of regular expressions; it makes it possible for expressions to be powerful and flexible, and to match very complex patterns. At the same time, this power comes at a cost. Backtracking is often the single most important factor that affects the performance of the regular expression engine. Fortunately, the developer has control over the behavior of the regular expression engine and how it uses backtracking. This topic explains how backtracking works and how it can be controlled. -> [!NOTE] -> In general, a Nondeterministic Finite Automaton (NFA) engine like .NET regular expression engine places the responsibility for crafting efficient, fast regular expressions on the developer. +[!INCLUDE [regex](../../../includes/regex.md)] ## Linear Comparison Without Backtracking @@ -61,7 +60,7 @@ Backtracking occurs when a regular expression pattern contains optional [quantif ## Backtracking with Optional Quantifiers or Alternation Constructs - When a regular expression includes optional quantifiers or alternation constructs, the evaluation of the input string is no longer linear. Pattern matching with an NFA engine is driven by the language elements in the regular expression and not by the characters to be matched in the input string. 
Therefore, the regular expression engine tries to fully match optional or alternative subexpressions. When it advances to the next language element in the subexpression and the match is unsuccessful, the regular expression engine can abandon a portion of its successful match and return to an earlier saved state in the interest of matching the regular expression as a whole with the input string. This process of returning to a previous saved state to find a match is known as backtracking. + When a regular expression includes optional quantifiers or alternation constructs, the evaluation of the input string is no longer linear. Pattern matching with a Nondeterministic Finite Automaton (NFA) engine is driven by the language elements in the regular expression and not by the characters to be matched in the input string. Therefore, the regular expression engine tries to fully match optional or alternative subexpressions. When it advances to the next language element in the subexpression and the match is unsuccessful, the regular expression engine can abandon a portion of its successful match and return to an earlier saved state in the interest of matching the regular expression as a whole with the input string. This process of returning to a previous saved state to find a match is known as backtracking. For example, consider the regular expression pattern `.*(es)`, which matches the characters "es" and all the characters that precede it. As the following example shows, if the input string is "Essential services are provided by regular expressions.", the pattern matches the whole string up to and including the "es" in "expressions". @@ -103,9 +102,17 @@ Backtracking occurs when a regular expression pattern contains optional [quantif Backtracking lets you create powerful, flexible regular expressions. However, as the previous section showed, these benefits may be coupled with unacceptably poor performance.
To prevent excessive backtracking, you should define a time-out interval when you instantiate a object or call a static regular expression matching method. This is discussed in the next section. In addition, .NET supports three regular expression language elements that limit or suppress backtracking and that support complex regular expressions with little or no performance penalty: [atomic groups](#atomic-groups), [lookbehind assertions](#lookbehind-assertions), and [lookahead assertions](#lookahead-assertions). For more information about each language element, see [Grouping Constructs](grouping-constructs-in-regular-expressions.md). -### Defining a Time-out Interval +### Consider the non-backtracking regular expression engine - Starting with .NET Framework 4.5, you can set a time-out value that represents the longest interval the regular expression engine will search for a single match before it abandons the attempt and throws a exception. You specify the time-out interval by supplying a value to the constructor for instance regular expressions. In addition, each static pattern matching method has an overload with a parameter that allows you to specify a time-out value. + If you do not need to use any constructs that require backtracking (e.g. lookarounds, backreferences, atomic groups), consider using the mode. This mode is designed to execute in time proportional to the length of the input. See [NonBacktracking mode](regular-expression-options.md#nonbacktracking-mode) for more information. You can additionally set a time-out value. + +### Limit the size of inputs + + Some regular expressions have acceptable performance unless the input is exceptionally large. If all reasonable text inputs in your scenario are known to be under a certain length, consider rejecting longer inputs before applying the regular expression to them. 
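The two suggestions above can be combined. The following is a minimal sketch under stated assumptions: it assumes .NET 7 or later (where `RegexOptions.NonBacktracking` is available), and the `MaxInputLength` limit, the pattern, and the `IsWordList` helper are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical limit; choose a value that fits the inputs in your scenario.
const int MaxInputLength = 10_000;

// The pattern uses no lookarounds, backreferences, or atomic groups,
// so it can run in non-backtracking mode (assumes .NET 7 or later).
var regex = new Regex(@"^(\w+\s?)+$", RegexOptions.NonBacktracking);

// Reject overly long inputs before the regular expression ever sees them.
bool IsWordList(string input) =>
    input.Length <= MaxInputLength && regex.IsMatch(input);

Console.WriteLine(IsWordList("alpha beta gamma"));
Console.WriteLine(IsWordList(new string('a', MaxInputLength + 1)));
```

With a backtracking engine, a nested quantifier such as `(\w+\s?)+` is the kind of construct that can backtrack excessively on input that almost matches, which is why it serves as the illustration here.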
+ +### Specify a Time-out Interval + + You can set a time-out value that represents the longest interval the regular expression engine will search for a single match before it abandons the attempt and throws a exception. You specify the time-out interval by supplying a value to the constructor for instance regular expressions. In addition, each static pattern matching method has an overload with a parameter that allows you to specify a time-out value. If you do not set a time-out value explicitly, the default time-out value is determined as follows: @@ -116,7 +123,7 @@ If you do not set a time-out value explicitly, the default time-out value is det By default, the time-out interval is set to and the regular expression engine does not time out. > [!IMPORTANT] -> We recommend that you always set a time-out interval if your regular expression relies on backtracking. +> When not using , we recommend that you always set a time-out interval if your regular expression relies on backtracking or operates on untrusted inputs. A exception indicates that the regular expression engine was unable to find a match within the specified time-out interval but does not indicate why the exception was thrown. The reason might be excessive backtracking, but it is also possible that the time-out interval was set too low given the system load at the time the exception was thrown. When you handle the exception, you can choose to abandon further matches with the input string or increase the time-out interval and retry the matching operation. @@ -127,7 +134,7 @@ If you do not set a time-out value explicitly, the default time-out value is det ### Atomic groups - The `(?>` *subexpression*`)` language element suppresses backtracking into the subexpression. Once it has successfully matched, it will not give up any part of its match to subsequent backtracking. 
For example, in the pattern `(?>\w*\d*)1`, if the `1` cannot be matched, the `\d*` will not give up any of its match even if that means it would allow the `1` to successfully match. Atomic groups can help prevent the performance problems associated with failed matches. + The `(?>` *subexpression*`)` language element is an atomic grouping. It prevents backtracking into the subexpression. Once this language element has successfully matched, it will not give up any part of its match to subsequent backtracking. For example, in the pattern `(?>\w*\d*)1`, if the `1` cannot be matched, the `\d*` will not give up any of its match even if that means it would allow the `1` to successfully match. Atomic groups can help prevent the performance problems associated with failed matches. The following example illustrates how suppressing backtracking improves performance when using nested quantifiers. It measures the time required for the regular expression engine to determine that an input string does not match two regular expressions. The first regular expression uses backtracking to attempt to match a string that contains one or more occurrences of one or more hexadecimal digits, followed by a colon, followed by one or more hexadecimal digits, followed by two colons. The second regular expression is identical to the first, except that it disables backtracking. As the output from the example shows, the performance improvement from disabling backtracking is significant. @@ -198,6 +205,20 @@ If you do not set a time-out value explicitly, the default time-out value is det |`[A-Z]\w*`|Match an alphabetical character followed by zero or more word characters.| |`$`|End the match at the end of the input string.| +### General Performance Considerations + +The following suggestions are not specifically to prevent excessive backtracking, but may help increase the performance of your regular expression: + +1. Precompile heavily used patterns. 
The best way to do this is to use the [regular expression source generator](regular-expression-source-generators.md) to precompile it. If the source generator is not available for your app, for example, because you are not targeting .NET 7 or later, or you do not know the pattern at compile time, use the option. + +1. Cache heavily used Regex objects. This implicitly occurs when you are using the source generator. Otherwise, create a Regex object and store it for reuse, rather than using the static Regex methods or creating and throwing away a Regex object. + +1. Start matching from an offset. If you know that matches will always start beyond a certain offset into the input, pass the offset in using an overload like . This will reduce the amount of the text the engine needs to consider. + +1. Gather only the information you need. If you only need to know whether a match occurs but not where the match occurs, prefer . If you only need to know how many times something matches, prefer using . If you only need to know the bounds of a match but not anything about a match's captures, prefer using . The less information the engine needs to provide, the better. + +1. Avoid unnecessary captures. Parentheses in your pattern by default form a capturing group. If you don't need captures, either specify or use [non-capturing groups](grouping-constructs-in-regular-expressions.md#noncapturing-groups) instead. This saves the engine keeping track of those captures.
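Several of the suggestions in this list can be sketched together. The API names are elided in the text above, so the calls used here (`IsMatch` with a start offset, `Count`, `EnumerateMatches`, and `RegexOptions.ExplicitCapture`) are an assumption about which members the list refers to. `Count` and `EnumerateMatches` require .NET 7 or later, and the pattern is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Essential services are provided by regular expressions.";

// Create the Regex once and reuse it. ExplicitCapture keeps the parentheses
// from capturing, and the constructor also accepts a time-out interval.
var word = new Regex(@"\b(es|ex)\w*", RegexOptions.ExplicitCapture, TimeSpan.FromSeconds(1));

bool any = word.IsMatch(input);            // only "is there a match?"
bool anyLater = word.IsMatch(input, 10);   // start searching at offset 10 of the input
int count = word.Count(input);             // only "how many matches?"

Console.WriteLine($"{any} {anyLater} {count}");

// Only the bounds of each match; no Match objects or captures are created.
foreach (ValueMatch m in word.EnumerateMatches(input))
{
    Console.WriteLine($"index {m.Index}, length {m.Length}");
}
```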
+ ## See also - [.NET Regular Expressions](regular-expressions.md) diff --git a/docs/standard/base-types/character-classes-in-regular-expressions.md b/docs/standard/base-types/character-classes-in-regular-expressions.md index 4e3a6441f2fa5..33d07250460f3 100644 --- a/docs/standard/base-types/character-classes-in-regular-expressions.md +++ b/docs/standard/base-types/character-classes-in-regular-expressions.md @@ -156,17 +156,17 @@ where *firstCharacter* is the character that begins the range and *lastCharacter ## Any character: . - The period character (.) matches any character except `\n` (the newline character, \u000A), with the following two qualifications: + The period character (.) matches any character except `\n` (the newline character), with the following two qualifications: - If a regular expression pattern is modified by the option, or if the portion of the pattern that contains the `.` character class is modified by the `s` option, `.` matches any character. For more information, see [Regular Expression Options](regular-expression-options.md). - The following example illustrates the different behavior of the `.` character class by default and with the option. The regular expression `^.+` starts at the beginning of the string and matches every character. By default, the match ends at the end of the first line; the regular expression pattern matches the carriage return character, `\r` or \u000D, but it does not match `\n`. Because the option interprets the entire input string as a single line, it matches every character in the input string, including `\n`. + The following example illustrates the different behavior of the `.` character class by default and with the option. The regular expression `^.+` starts at the beginning of the string and matches every character. By default, the match ends at the end of the first line; the regular expression pattern matches the carriage return character, `\r`, but it does not match `\n`. 
Because the option interprets the entire input string as a single line, it matches every character in the input string, including `\n`. [!code-csharp[Conceptual.Regex.Language.CharacterClasses#5](../../../samples/snippets/csharp/VS_Snippets_CLR/conceptual.regex.language.characterclasses/cs/any2.cs#5)] [!code-vb[Conceptual.Regex.Language.CharacterClasses#5](../../../samples/snippets/visualbasic/VS_Snippets_CLR/conceptual.regex.language.characterclasses/vb/any2.vb#5)] > [!NOTE] -> Because it matches any character except `\n`, the `.` character class also matches `\r` (the carriage return character, \u000D). +> Because it matches any character except `\n`, the `.` character class also matches `\r` (the carriage return character). - In a positive or negative character group, a period is treated as a literal period character, and not as a character class. For more information, see [Positive Character Group](#PositiveGroup) and [Negative Character Group](#NegativeGroup) earlier in this topic. The following example provides an illustration by defining a regular expression that includes the period character (`.`) both as a character class and as a member of a positive character group. The regular expression `\b.*[.?!;:](\s|\z)` begins at a word boundary, matches any character until it encounters one of five punctuation marks, including a period, and then matches either a white-space character or the end of the string. diff --git a/docs/standard/base-types/regular-expression-options.md b/docs/standard/base-types/regular-expression-options.md index 8b68454f57387..8c90cbdb2f50b 100644 --- a/docs/standard/base-types/regular-expression-options.md +++ b/docs/standard/base-types/regular-expression-options.md @@ -21,7 +21,7 @@ By default, the comparison of an input string with any literal characters in a r |-|-|-|-| | | Not available | Use default behavior. | [Default options](#default-options) | | | `i` | Use case-insensitive matching. 
| [Case-insensitive matching](#case-insensitive-matching) | -| | `m` | Use multiline mode, where `^` and `$` match the beginning and end of each line (instead of the beginning and end of the input string). | [Multiline mode](#multiline-mode) | +| | `m` | Use multiline mode, where `^` and `$` indicate the beginning and end of each line (instead of the beginning and end of the input string). | [Multiline mode](#multiline-mode) | | | `s` | Use single-line mode, where the period (.) matches every character (instead of every character except `\n`). | [Single-line mode](#single-line-mode) | | | `n` | Do not capture unnamed groups. The only valid captures are explicitly named or numbered groups of the form `(?<`*name*`>` *subexpression*`)`. | [Explicit captures only](#explicit-captures-only) | | | Not available | Compile the regular expression to an assembly. | [Compiled regular expressions](#compiled-regular-expressions) | @@ -113,7 +113,7 @@ The option, or the `m` inline option, enables the regular expression engine to handle an input string that consists of multiple lines. It changes the interpretation of the `^` and `$` language elements so that they match the beginning and end of a line, instead of the beginning and end of the input string. +The option, or the `m` inline option, enables the regular expression engine to handle an input string that consists of multiple lines. It changes the interpretation of the `^` and `$` language elements so that they indicate the beginning and end of a line, instead of the beginning and end of the input string. -By default, `$` matches only the end of the input string. If you specify the option, it matches either the newline character (`\n`) or the end of the input string. It does not, however, match the carriage return/line feed character combination. To successfully match them, use the subexpression `\r?$` instead of just `$`. +By default, `$` will be satisfied only at the end of the input string. 
If you specify the option, it will be satisfied by either the newline character (`\n`) or the end of the input string. + +In neither case does `$` recognize the carriage return/line feed character combination (`\r\n`). `$` always ignores any carriage return (`\r`). To end your match with either `\r\n` or `\n`, use the subexpression `\r?$` instead of just `$`. Note that this will make the `\r` part of the match. The following example extracts bowlers' names and scores and adds them to a collection that sorts them in descending order. The method is called twice. In the first method call, the regular expression is `^(\w+)\s(\d+)$` and no options are set. As the output shows, because the regular expression engine cannot match the input pattern along with the beginning and end of the input string, no matches are found. In the second method call, the regular expression is changed to `^(\w+)\s(\d+)\r?$` and the options are set to . As the output shows, the names and scores are successfully matched, and the scores are displayed in descending order. @@ -171,11 +173,9 @@ The following example is equivalent to the previous one, except that it uses the ## Single-line mode -The option, or the `s` inline option, causes the regular expression engine to treat the input string as if it consists of a single line. It does this by changing the behavior of the period (`.`) language element so that it matches every character, instead of matching every character except for the newline character `\n` or `\u000A`. - -The `$` language element will match the end of the string or a trailing newline character `\n`. +The option, or the `s` inline option, causes the regular expression engine to treat the input string as if it consists of a single line. It does this by changing the behavior of the period (`.`) language element so that it matches every character, instead of matching every character except for the newline character `\n`. 
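The multiline and single-line behaviors described above can be observed directly. The following is a minimal sketch with made-up input strings, not the repository's snippet; it assumes input with CR/LF line endings.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input with CR/LF line endings.
string scores = "Joe 164\r\nSam 208\r\n";

// In multiline mode `$` is satisfied just before `\n`,
// but the `\r` that precedes it blocks the match.
int withoutCr = Regex.Matches(scores, @"^(\w+)\s(\d+)$", RegexOptions.Multiline).Count;

// `\r?$` consumes the optional `\r`, so each line matches;
// the `\r` becomes part of the overall match.
MatchCollection withCr = Regex.Matches(scores, @"^(\w+)\s(\d+)\r?$", RegexOptions.Multiline);

Console.WriteLine($"{withoutCr} vs {withCr.Count}");
Console.WriteLine(withCr[0].Value.EndsWith('\r'));   // does the whole match keep the \r?
Console.WriteLine(withCr[0].Groups[2].Value);        // the captured score group

// Single-line mode only changes `.`, which then matches `\n` as well.
string text = "This is one line and\r\nthis is the second.";
string first = Regex.Match(text, "^.+").Value;
string all = Regex.Match(text, "^.+", RegexOptions.Singleline).Value;
Console.WriteLine($"{first.Length} of {text.Length} characters, then {all.Length}");
```

Because the `\r` ends up in the match value, reading the capture groups (or trimming the value) is the safer way to get the text of a line.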
-The following example illustrates how the behavior of the `.` language element changes when you use the option. The regular expression `^.+` starts at the beginning of the string and matches every character. By default, the match ends at the end of the first line; the regular expression pattern matches the carriage return character, `\r` or \u000D, but it does not match `\n`. Because the option interprets the entire input string as a single line, it matches every character in the input string, including `\n`. +The following example illustrates how the behavior of the `.` language element changes when you use the option. The regular expression `^.+` starts at the beginning of the string and matches every character. By default, the match ends at the end of the first line; the regular expression pattern matches the carriage return character `\r`, but it does not match `\n`. Because the option interprets the entire input string as a single line, it matches every character in the input string, including `\n`. [!code-csharp[Conceptual.Regex.Language.CharacterClasses#5](../../../samples/snippets/csharp/VS_Snippets_CLR/conceptual.regex.language.characterclasses/cs/any2.cs#5)] [!code-vb[Conceptual.Regex.Language.CharacterClasses#5](../../../samples/snippets/visualbasic/VS_Snippets_CLR/conceptual.regex.language.characterclasses/vb/any2.vb#5)] @@ -225,6 +225,9 @@ Finally, you can use the inline group element `(?n:)` to suppress automatic capt ## Compiled regular expressions +> [!NOTE] +> Where possible, use [source generated regular expressions](regular-expression-source-generators.md) instead of compiling regular expressions using the option. Source generation can help your app start faster, run more quickly and be more trimmable. To learn when source generation is possible, see [When to use it](regular-expression-source-generators.md#when-to-use-it). + By default, regular expressions in .NET are interpreted. 
When a object is instantiated or a static method is called, the regular expression pattern is parsed into a set of custom opcodes, and an interpreter uses these opcodes to run the regular expression. This involves a tradeoff: The cost of initializing the regular expression engine is minimized at the expense of run-time performance. You can use compiled instead of interpreted regular expressions by using the option. In this case, when a pattern is passed to the regular expression engine, it is parsed into a set of opcodes and then converted to Microsoft intermediate language (MSIL), which can be passed directly to the common language runtime. Compiled regular expressions maximize run-time performance at the expense of initialization time. diff --git a/docs/standard/base-types/regular-expression-source-generators.md b/docs/standard/base-types/regular-expression-source-generators.md index a14dbda95948e..d4384095bb7a2 100644 --- a/docs/standard/base-types/regular-expression-source-generators.md +++ b/docs/standard/base-types/regular-expression-source-generators.md @@ -11,6 +11,9 @@ ms.author: dapine A regular expression, or regex, is a string that enables a developer to express a pattern being searched for, making it a very common way to search text and extract results as a subset from the searched string. In .NET, the `System.Text.RegularExpressions` namespace is used to define instances and static methods, and match on user-defined patterns. In this article, you'll learn how to use source generation to generate `Regex` instances to optimize performance. +> [!NOTE] +> Where possible, use source generated regular expressions instead of compiling regular expressions using the option. Source generation can help your app start faster, run more quickly and be more trimmable. To learn when source generation is possible, see [When to use it](#when-to-use-it). + ## Compiled regular expressions When you write `new Regex("somepattern")`, a few things happen. 
The specified pattern is parsed, both to ensure the validity of the pattern and to transform it into an internal tree that represents the parsed regex. The tree is then optimized in various ways, transforming the pattern into a functionally equivalent variation that can be more efficiently executed. The tree is written into a form that can be interpreted as a series of opcodes and operands that provide instructions to the regex interpreter engine on how to match. When a match is performed, the interpreter simply walks through those instructions, processing them against the input text. When instantiating a new `Regex` instance or calling one of the static methods on `Regex`, the interpreter is the default engine employed.
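As a small illustration of the construction step described above, the following sketch builds the same pattern with the default interpreter and with the compiled option, and shows that an invalid pattern is rejected while it is parsed. The input string and the invalid pattern are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// No options: the pattern is parsed, optimized, and executed by the interpreter.
var interpreted = new Regex("somepattern");

// RegexOptions.Compiled spends more time at construction for faster matching later.
var compiled = new Regex("somepattern", RegexOptions.Compiled);

Console.WriteLine(interpreted.Options);
Console.WriteLine(compiled.Options);

// Both engines implement the same matching semantics.
string input = "xx somepattern xx";
Console.WriteLine(interpreted.IsMatch(input) == compiled.IsMatch(input));

// Parsing validates the pattern, so an invalid one fails at construction time.
bool rejected = false;
try { _ = new Regex("(unclosed"); }
catch (ArgumentException e) { rejected = true; Console.WriteLine(e.GetType().Name); }
```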