Switching ReadOnlySpan<char> with constant strings #1881

Happypig375 · 2018-09-23T05:14:53Z

Speclet: https://github.com/dotnet/csharplang/blob/main/proposals/csharp-11.0/pattern-match-span-of-char-on-string.md

With the recent addition of Spans, we can reduce a lot of allocations when dealing with strings. For example, substrings can be replaced with slices. However, there is one place where using strings is better:

string str = ...;
switch (str)
{
    case "string 1":
        break;
    case "string 2":
        break;
    case "string 3":
        break;
    case "string 4":
        break;
}

With spans, you are forced to write boilerplate code:

ReadOnlySpan<char> span = ...;
switch (span)
{
    case var _ where span == "string 1":
        break;
    case var _ where span == "string 2":
        break;
    case var _ where span == "string 3":
        break;
    case var _ where span == "string 4":
        break;
}

Which is identical to the if-else mess that switches aim to solve.

ReadOnlySpan<char> span = ...;
if (span == "string 1")
else if (span == "string 2")
else if (span == "string 3")
else if (span == "string 4")

It would be better if spans can be switched with constant strings.

ReadOnlySpan<char> span = ...;
switch (span)
{
    case "string 1":
        break;
    case "string 2":
        break;
    case "string 3":
        break;
    case "string 4":
        break;
}

Special-casing spans as a compiler-known type already have precedent in stackalloc expressions, it can also be done here.
Possible extension to regular Span<char> can also be considered, but it is not as important as ReadOnlySpan<char>. We could also enable this with ReadOnlyMemory<char> and Memory<char>.

More possible extensions: switching (ReadOnly)Span<char> with char, (ReadOnly)Span<T> with T where T is a constant type.

Design meetings

https://github.com/dotnet/csharplang/blob/main/meetings/2022/LDM-2022-02-23.md#pattern-matching-over-spanchar

agocke · 2018-09-24T06:31:41Z

This does seem really annoying. I'm not sure I have a language proposal that would fix this, but it would be something I'd like to address.

The obvious route would be something like an implicit conversion on string to ReadOnlySpan<char>. I'm not sure if that would work.

Happypig375 · 2018-09-24T07:41:55Z

The obvious route would be something like an implicit conversion on string to ReadOnlySpan<char>. I'm not sure if that would work.

Yes it does exist in .NET Core 2.1 (NOT with the portable version!!)
However, it is not considered to be a constant conversion (e.g. int → byte). The compiler prohibits that inside switch statements.

If this was implemented, this should work with the portable version too.

CyrusNajmabadi · 2018-09-24T18:00:43Z

Note: is there an idea on what the codegen here would actually be? The normal string-switch works by building a dictionary and switching on that. That likely wouldn't work for ReadOnlySpans. Perhaps emitting code for an inline-trie might be appropriate?

svick · 2018-09-24T21:23:13Z

@CyrusNajmabadi

The normal string-switch works by building a dictionary and switching on that. That likely wouldn't work for ReadOnlySpans.

Why not? As far as I can tell, it doesn't actually use Dictionary. Instead it computes the hash code (using Roslyn-generated method) and then uses binary search on that. I think all of that should work for ROS.

scalablecory · 2018-09-24T21:43:02Z

@CyrusNajmabadi @svick Dictionary was generated back in the day. hash-based switch was introduced with the Roslyn migration.

scalablecory · 2018-09-24T21:49:37Z

I'd like for switch to be extended to support ReadOnlySpan. Seems like low-hanging fruit.

CyrusNajmabadi · 2018-09-24T22:25:35Z

Oh terrific. Good to know!

It does seem a bit interesting to me that it both has to scan the string to produce the hash, and then has to do the equality check again. It feels like a double pass of the string, when only one would be necessary since you're generating the code here manually.

But maybe it would be a bit too much codebloat, or too branchy, when this can be very simple codegen.

scalablecory · 2018-09-24T22:39:00Z

@CyrusNajmabadi the comparison is necessary. you can make a perfect hash function to avoid collisions among your cases, but it's still possible for your input string to not be one of the switch options.

We'd need something akin to VC++'s __assume(0).

CyrusNajmabadi · 2018-09-24T22:47:45Z

@scalablecory I understand the need for the compare if you go the hash route.

I was saying that i was suprised that the hash-route was used in the first place, isntead of just codegening something like a trie.

Happypig375 · 2018-09-25T00:05:09Z

https://visualstudiomagazine.com/Articles/2015/10/20/Text-Pattern-Search-Trie-Class-NET.aspx?m=1&Page=2

It states that a trie would be slower than a HashSet when searching entire strings. A trie would be faster if we are using pattern searching, but switches are not for that.

HaloFour · 2018-09-25T01:06:22Z

@Happypig375

The trie clearly searches more efficiently but loses the battle in the Insertion operation.

There would be no insertion operations. The compiler wouldn't be populating a generic trie container, it would be generating the trie-comparisons directly based on the literal strings provided.

CyrusNajmabadi · 2018-09-25T01:30:35Z

Your link says this:

The trie clearly searches more efficiently but loses the battle in the Insertion operation. But as I noted earlier, if you know you'll get frequent search operations instead of frequent insertion/deletion operations, it definitely pays off to implement the trie.

CyrusNajmabadi · 2018-09-25T01:32:38Z

it would be generating the trie-comparisons directly based on the literal strings provided.

yes. that's what i'm trying to say. Effectivley, right now the compiler has to walk every character of the string, to produce the hash. It then has to switch on the hash, and once done, the runtime has to then walk every character of the string a second time.

This is basically the poster case for when you want a trie. Instead of having two loops of the string, you just walk each character of it one at a time, and you either end up exactly in the matching branch, or you fall out into the default branch label.

Happypig375 · 2018-09-25T02:44:05Z

@HaloFour @CyrusNajmabadi
Only Figure 7 in the link is relevant to my point. The rest are irrelevant.

I am talking about trie vs HashSet, not trie vs List or trie vs SortedList!!!

CyrusNajmabadi · 2018-09-25T02:52:25Z

I don't see how it's relevant. It's not talkign about what i was talking about. Specifically that's a benchmark of a very specific trie impl against a hashset. Importantly, it's a trie actually built of in-memory objects. So you're incurring all sorts of hits trying to walk that.

I'm talking about a trie represented as code. These are entirely different things. It would be like me complaining about the hashing that the compiler does here because of some aspect of a HashSet. They're entirely different beasts, even though they both involve 'hashing'.

CyrusNajmabadi · 2018-09-25T05:31:29Z

Ok. I actually did a perf comparison of the two. My Trie version beats the Standard C# compiler hashing version by about 2.2x. Suprisingly, i made an unsafe version as well... but it performed worse than the safe version. The numbers came out to:

           Method |      Mean |     Error |    StdDev |    Median |
----------------- |----------:|----------:|----------:|----------:|
       HashSwitch | 21.083 us | 0.4214 us | 0.9682 us | 20.599 us |
   SafeTrieSwitch |  9.101 us | 0.3892 us | 0.3997 us |  8.938 us |
 TrieUnsafeSwitch |  9.508 us | 0.1891 us | 0.1579 us |  9.564 us |

The test worked by generating switches using the top hundred most common english words. It then performed the switch using the top thousand words (so 10% hitrate, 90% miss rate). Codefiles of interest:

HashSwitch
TrieSwitch (safe and unsafe)
CodeGen and Benchmarking tool

There are very likely opportunities for improvement here. I didn't spend much time optimizing anything.

CyrusNajmabadi · 2018-09-25T05:37:00Z

Note: the numbers are even worse for hashing if i go for a 100% hitrate on the switch:

           Method |       Mean |     Error |   StdDev |     Median |
----------------- |-----------:|----------:|---------:|-----------:|
       HashSwitch | 1,369.5 ns | 27.363 ns | 74.90 ns | 1,348.5 ns |
   SafeTrieSwitch |   484.0 ns |  9.703 ns | 21.09 ns |   472.1 ns |
 TrieUnsafeSwitch |   484.7 ns |  9.668 ns | 21.22 ns |   476.4 ns |

Here the hashing version is 2.8x worse. This is unsurprising as the hashing version will definitely do better on misses than matches (after all, on a match, it has to actually then compare the strings, but on a miss, it can bail without any string compare). However, even with the boost it gets from misses, it's still more than 2x slower than the trie version.

svick · 2018-09-25T20:18:56Z

@CyrusNajmabadi Isn't that off-topic in this issue? Maybe it would make sense to create a new issue about that in the roslyn repo?

CyrusNajmabadi · 2018-09-25T20:25:42Z

@svick Yes. It's arguably off topic in that the language shouldn't have to care about the impl details. however, i thought it was a bit worthwhile to understand if there were at least feasible implementation strategies. I assume if we wanted switching over readonlyspans, we'd want to know if it could be done efficiently. So i thought this exploration was valuable in terms of demonstrating that that was the case :)

This wasn't an attempt to say: C# should switch (har har) to this. It was more: this language feature can sensibly be implemented, and here are some good perf results to demonstrate that it likely would work well and would not conflict with the goals of C# or ReadOnlySpans.

Cheers!

Happypig375 · 2018-10-01T05:08:46Z

@CyrusNajmabadi How would cases with conditions play with the trie implementation?
For example,

switch(someReadOnlySpan)
{
    case "\u0444":
         return 444;
    case var inSomeList when someList.Contains(inSomeList[0]):
         Console.Write(inSomeList[0]);
         return (int)inSomeList[0];
    case "\u0456":
         return 456;
}

If '\u0456' is in someList, then it will be written to the console.
What would be emitted?

jnm2 · 2018-10-01T13:37:52Z

You'd need a separate trie before and after the center case, since the center case makes the switch become order-dependent. Imagine splitting into if statements and then back into order-independent switch statements above and below the center if.

Happypig375 · 2018-10-01T14:36:03Z

Welp, that would be a new design guideline for me.
Now, would anyone please champion this? :)

Happypig375 · 2018-10-06T11:10:46Z

We might want to enable this with ReadOnlyMemory<char> and Memory<char> too.

AceHack · 2019-04-12T22:18:28Z

Any update? I would really like this feature.

gafter · 2019-04-13T22:32:40Z

I'll bring this up at the LDM next time we triage.

Happypig375 · 2019-04-14T02:21:52Z

My first suggestion that gets championed!! ❤️😆

gafter · 2019-04-14T03:01:19Z

hez2010 · 2022-04-14T17:33:54Z

It doesn't work with Utf8String:

public int Foo(ReadOnlySpan<byte> s)
{
    return s switch
    {
        "foo"u8 => 1,
        "bar"u8 => 2,
        _ => 0
    };
}

error CS0150: A constant value is expected

https://sharplab.io/#v2:EYLgtghglgdgPgAQEwEYCwAoTsAuACAMQHsiAKAJQFMIATAeRgBsBPAZQAcIYAeYZnSgD48AZwCUmAN6Y8svAgDso0QHcoOAMYALGXOkY5hvACIAZiWMBXABx4AvMJQAaXUZPAIAJyu2HeJC4GbngA+vbCAAyueAC+ANyYMUA===

333fred · 2022-04-14T18:29:12Z

Correct. We did not enable this feature for Utf8Strings, as they are not constants.

slang25 · 2022-08-13T12:24:25Z

Correct. We did not enable this feature for Utf8Strings, as they are not constants.

That seems like a great opportunity to improve the feature.

Could supported pattern expressions be expanded to include UTF8 strings when they are expressed as literals (and only when)?

agocke · 2023-01-06T08:05:36Z

Correct. We did not enable this feature for Utf8Strings, as they are not constants.

This doesn't make sense to me. Why are UTF8 string literals not constants, according to the language?

bernd5 · 2023-01-06T08:18:42Z

Logically they are constants, but:

A u8 literal doesn't have a constant value. That is because ReadOnlySpan cannot be type of a constant today. If the definition of const is expanded in the future to consider ReadOnlySpan, then this value should also be considered a constant. Practically though this means a u8 literal cannot be used as the default value of an optional parameter.

agocke · 2023-01-06T08:52:31Z

That feels like circular logic -- utf8 literals can't be constants because they're not defined as constants. Pure language rules.

As it stands it doesn't seem like a good justification. There are good reasons to make these constants and I haven't seen any reasons not to.

Also, I don't think it has to be the case that all constant values have to be allowed as optional parameter values. That's another language rule that can be changed if implementation restrictions require it.

bernd5 · 2023-01-06T10:40:26Z

And pratically they are not constants:

using System;
using System.Runtime.CompilerServices;

ReadOnlySpan<byte> x = "Hallo"u8;

Console.WriteLine(x[0]); //72
ref readonly var b = ref x[0];
ref var bwrite = ref Unsafe.AsRef(in b);
bwrite = 7;

Console.WriteLine(b); //7
Console.WriteLine(bwrite); //7
Console.WriteLine(x[0]); //7

As written in the proposal, those strings / bytes are stored in the .data section - not in the .rodata section. They are constantly initialized - but not const.
But I agree - that this could be interpreted as an implementation detail. The spec could say that it is not allowed to change the underlying data.

alrz · 2023-01-06T11:09:40Z

If you want to go unsafe, regular strings are not "constant" either.

var x ="Hallo";
fixed(char* p = x) p[0]='X';
Console.Write("Hallo");

But I think that's besides the point. I understand #1881 (comment) as a call for a feature to make more types viable for a constant (including span or other structs) which will cover utf8 strings as a side effect.

That would be still different from making the utf8 string literal itself a constant as that may require special handling - this issue.

(There's a few other proposals to allow typeof(T), typeof(T).FullName as constants, or support any of the above in attributes.)

bernd5 · 2023-01-06T11:34:08Z

For string it is an implementation detail - they are actually not allocated in ro memory sections.
But the spec says:

Modifying objects of managed type through fixed pointers can result in undefined behavior.

Note: For example, because strings are immutable, it is the programmer’s responsibility to ensure that the characters referenced by a pointer to a fixed string are not modified. end note

Maybe this should be added to the spec for UT8-literals, too.

FaustVX · 2023-01-06T14:13:17Z

@bernd5

And pratically they are not constants:
using System;
using System.Runtime.CompilerServices;

ReadOnlySpan<byte> x = "Hallo"u8;

Console.WriteLine(x[0]); //72
ref readonly var b = ref x[0];
ref var bwrite = ref Unsafe.AsRef(in b);
bwrite = 7;

Console.WriteLine(b); //7
Console.WriteLine(bwrite); //7
Console.WriteLine(x[0]); //7
As written in the proposal, those strings / bytes are stored in the .data section - not in the .rodata section. They are constantly initialized - but not const. But I agree - that this could be interpreted as an implementation detail. The spec could say that it is not allowed to change the underlying data.

Even simpler:

unsafe {
    byte* datas = stackalloc byte[]
    {
        1,2,3,4,5,
    };
    ReadOnlySpan<byte> x = new(datas, 5);

    Console.WriteLine(datas[0]); //1
    Console.WriteLine(x[0]); //1
    
    datas[0] = 7;

    Console.WriteLine(datas[0]); //7
    Console.WriteLine(x[0]); //7
}

ROS isn't a frozen collection, it's just a read only view on some datas.

agocke · 2023-01-06T18:36:16Z

@FaustVX You're misunderstanding the proposal. It's not to allow all ROS<byte> as constant values, any more than all ints are constant values. It's specifically that UTF8 string literals should be constants.

And yes, I regard implementation restrictions, like what's allowed in metadata, as being questions about the language implementation, not the language definition (the spec). If the use cases we're thinking of (switching) are completely unimplementable -- that's a good reason not to do it in the language. But if there simply need to be some restrictions on where they can go (e.g., not allowed in optional parameters), that seems fine to me.

CyrusNajmabadi · 2023-01-06T18:47:15Z

it def feels (to me) at least) that there's some space for value to be constant even if the type it is isn't always a constant. ROS<char> is not always a constant, like so:

        char[] chars = { 'a', 'b', 'c' };
        ReadOnlySpan<char> span = chars.AsSpan();

However, in a cast where there the value is a literal, we could make the claim it's a constant. So this would be ok:

const ReadOnlySpan<byte> ConstantString = "abc"u8;

agocke · 2023-01-06T18:58:46Z

Right, to be clear, that's how all constants work. There are no "constant types" in the language, there are only constant values, and it happens to be the case that those values are only of certain types.

That is,

int x = 5;
// x is not constant

const int x = 5;
// x is constant

Moreover, allowing const on ROS is itself a separate decision. The question is whether or not the literal expression "abc"u8 is a constant and should be allowed in certain places where only constants are allowed, like switch statements and pattern matches. I think so.

agocke · 2023-01-06T19:03:36Z

(btw this issue is about switching, but I don't see any reason why we can't allow utf8 string literals in optional parameters either. It would be implemented like DecimalConstantAttribute)

CyrusNajmabadi · 2023-01-06T19:06:06Z

@agocke Good points. Thanks! I"m good with championing this alongside Julien :)

agocke · 2023-01-07T20:16:36Z

One more thing that I thought of.

Making this change has some interesting implications for other types. In particular, this would be the only way in the language to produce a constant-expr byte string, and the only way to match against a constant-expr byte string. Depending on which particular byte string you're using, I could see people encoding it in a UTF8 string (assuming that it's valid UTF8) just so they could use the constant-expr functionality.

Honestly, it seems like a pretty reasonable usage. I wouldn't try to block this, but instead look it is as potentially another language deficiency -- there's currently no way to produce a constant byte array in the language. I know there's a separate working group looking at new literals for lists and dictionaries. It might be interesting to either explore one of those syntaxes producing a constant expression.

Alternatively, it might be possible to make new[] { 1, 2, 3 } a valid constant expression, if the type is ROS. This one might need a little bit more reworking since the current underlying type is int[]. Obviously it couldn't be a constant expression in the expression (new[] { 1, 2, 3})[0] = 5.

alex-jitbit · 2024-03-19T09:11:37Z

Actually, even the workaround that the OP's suggests is not correct. The if (span == "string 1") code wont' work.

"test".AsSpan().Slice(0, 2) == "te"; //this is FALSE

We have to use

"test".AsSpan().Slice(0, 2).Equals("te".AsSpan(), StringComparison.Ordinal);

We do need a better way to compare spans with strings, with lees hoop-jumping

Foxtrek64 · 2024-10-04T15:20:05Z

Alternatively, it might be possible to make new[] { 1, 2, 3 } a valid constant expression

Are there any other places in the language where new is allowed in a const expr?

To me at least, [1, 2, 3] feels better as an "array literal" or whatever we want to call it. That is, it feels more like declaring an explicit value rather than newing up an array, similar to why const string str = "ccccc"; is allowed but const string str = new string('c', 5); is not, even if they produce the same end value.

I imagine this probably comes with its own strings attached but I thought I'd at least toss it out there. I'm definitely interested in hearing about any potential drawbacks to this solution. I'm not an expert in this space.

agocke added the Discussion label Sep 24, 2018

gafter added the Proposal champion label Apr 13, 2019

This was referenced Mar 10, 2022

Revert named colors dotnet/Microsoft.Maui.Graphics#345

Merged

Pattern matching ReadOnlySpan<char> is worse than String dotnet/runtime#66529

Closed

This was referenced Mar 15, 2022

Performance optimizations for SvgColourConverter parser svg-net/SVG#806

Closed

Performance optimization for SvgPointCollectionConverter parser svg-net/SVG#808

Closed

Memory usage optimization for SvgElement svg-net/SVG#813

Closed

siegfriedpammer mentioned this issue Apr 9, 2022

C# language support status icsharpcode/ILSpy#829

Open

jcouv modified the milestones: Working Set, 11.0 Sep 26, 2022

davidfowl mentioned this issue Nov 21, 2022

Update to .NET 7 TechEmpower/FrameworkBenchmarks#7698

Merged

CyrusNajmabadi self-assigned this Jan 6, 2023

333fred added the Implemented Needs ECMA Spec This feature has been implemented in C#, but still needs to be merged into the ECMA specification label Jan 9, 2023

jcouv added the Proposal label Sep 17, 2024

Switching ReadOnlySpan<char> with constant strings #1881

Switching ReadOnlySpan<char> with constant strings #1881

Comments

Happypig375 commented Sep 23, 2018 • edited by jcouv Loading

Design meetings

agocke commented Sep 24, 2018

Happypig375 commented Sep 24, 2018 • edited Loading

CyrusNajmabadi commented Sep 24, 2018

svick commented Sep 24, 2018

scalablecory commented Sep 24, 2018

scalablecory commented Sep 24, 2018

CyrusNajmabadi commented Sep 24, 2018

scalablecory commented Sep 24, 2018

CyrusNajmabadi commented Sep 24, 2018

Happypig375 commented Sep 25, 2018 • edited Loading

HaloFour commented Sep 25, 2018

CyrusNajmabadi commented Sep 25, 2018

CyrusNajmabadi commented Sep 25, 2018

Happypig375 commented Sep 25, 2018 • edited Loading

CyrusNajmabadi commented Sep 25, 2018

CyrusNajmabadi commented Sep 25, 2018 • edited Loading

CyrusNajmabadi commented Sep 25, 2018 • edited Loading

svick commented Sep 25, 2018

CyrusNajmabadi commented Sep 25, 2018

Happypig375 commented Oct 1, 2018

jnm2 commented Oct 1, 2018 • edited Loading

Happypig375 commented Oct 1, 2018

Happypig375 commented Oct 6, 2018

AceHack commented Apr 12, 2019

gafter commented Apr 13, 2019

Happypig375 commented Apr 14, 2019

gafter commented Apr 14, 2019

hez2010 commented Apr 14, 2022 • edited Loading

333fred commented Apr 14, 2022

slang25 commented Aug 13, 2022

agocke commented Jan 6, 2023 • edited Loading

bernd5 commented Jan 6, 2023

agocke commented Jan 6, 2023 • edited Loading

bernd5 commented Jan 6, 2023

alrz commented Jan 6, 2023

bernd5 commented Jan 6, 2023 • edited Loading

FaustVX commented Jan 6, 2023

agocke commented Jan 6, 2023 • edited Loading

CyrusNajmabadi commented Jan 6, 2023

agocke commented Jan 6, 2023

agocke commented Jan 6, 2023

CyrusNajmabadi commented Jan 6, 2023

agocke commented Jan 7, 2023

alex-jitbit commented Mar 19, 2024 • edited Loading

Foxtrek64 commented Oct 4, 2024

Happypig375 commented Sep 23, 2018 •

edited by jcouv

Loading

Happypig375 commented Sep 24, 2018 •

edited

Loading

Happypig375 commented Sep 25, 2018 •

edited

Loading

Happypig375 commented Sep 25, 2018 •

edited

Loading

CyrusNajmabadi commented Sep 25, 2018 •

edited

Loading

CyrusNajmabadi commented Sep 25, 2018 •

edited

Loading

jnm2 commented Oct 1, 2018 •

edited

Loading

hez2010 commented Apr 14, 2022 •

edited

Loading

agocke commented Jan 6, 2023 •

edited

Loading

agocke commented Jan 6, 2023 •

edited

Loading

bernd5 commented Jan 6, 2023 •

edited

Loading

agocke commented Jan 6, 2023 •

edited

Loading

alex-jitbit commented Mar 19, 2024 •

edited

Loading