When matching a literal string in a regex (RegexNode.Multi), we unroll the comparison for shorter strings. Rather than reading and comparing a single character at a time, we try to read and compare two or four characters at a time by reading the data as a UInt32 or UInt64. We were never able to do this with RegexOptions.IgnoreCase, however, because the design required calling char.ToLower on each individual character read from the input. We're now switching to a design where casing is almost entirely handled at parse time (#61048), where most letters are parsed into a set, e.g. 'a' becomes '[Aa]'. For most of ASCII, then, we can apply this optimization: the string "abcd" becomes "[Aa][Bb][Cc][Dd]", and we can read a UInt64 from the input, OR it with a mask that sets the 0x20 bit of every character (0x0020002000200020 for four 16-bit chars), and compare the result against the precomputed UInt64 value OR'd with the same mask. This would give case-insensitive matching most of the vectorization gains we already see for case-sensitive matching.
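To make the bit trick concrete, here is a minimal sketch of the comparison, written in Python for brevity rather than in the C# the regex engine would actually emit. The names `CASE_BIT_MASK` and `make_ignore_case_matcher` are hypothetical and do not come from the runtime. The sketch assumes the input is little-endian UTF-16 and that every character of the literal is an ASCII letter whose parsed set is exactly its upper/lower pair.

```python
# Illustrative only: models four UTF-16 chars read as one 64-bit integer.
# Bit 0x20 is the only bit that differs between 'A'..'Z' and 'a'..'z'.
CASE_BIT_MASK = 0x0020_0020_0020_0020  # 0x20 set in each of four 16-bit chars


def is_ascii_letter(ch: str) -> bool:
    return "a" <= ch <= "z" or "A" <= ch <= "Z"


def make_ignore_case_matcher(literal: str):
    """Precompute the masked 64-bit value for a four-letter ASCII literal."""
    if len(literal) != 4 or not all(is_ascii_letter(c) for c in literal):
        # Non-letters are excluded: OR-ing with 0x20 would also conflate
        # unrelated pairs such as '@' (0x40) and '`' (0x60).
        raise ValueError("sketch only handles four ASCII letters")
    expected = int.from_bytes(literal.encode("utf-16-le"), "little") | CASE_BIT_MASK

    def matches(input_utf16: bytes, char_index: int) -> bool:
        start = char_index * 2  # two bytes per UTF-16 code unit
        window = input_utf16[start:start + 8]
        if len(window) < 8:
            return False  # fewer than four chars remain
        actual = int.from_bytes(window, "little") | CASE_BIT_MASK
        return actual == expected  # one comparison covers all four chars

    return matches


matcher = make_ignore_case_matcher("abcd")
text = "xxAbCd".encode("utf-16-le")
print(matcher(text, 2))  # compares "AbCd" against "abcd", ignoring case
```

Because the expected value is computed once up front, the per-match cost is one read, one OR, and one compare, the same shape as the case-sensitive path plus the OR.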