Implement System.Buffers.Text.Base64.DecodeFromUtf8 for Arm64 #70336

Closed
wants to merge 1 commit

Conversation

a74nh
Contributor

@a74nh a74nh commented Jun 7, 2022

Like the AVX2 and SSE3 versions, this is based on the Aklomp base64
algorithm.

The AdvSimd API does not yet have support for sequential multi-register
instructions, such as TBL4/LD4/ST3. This code implements those
instructions using single-register instructions. Once API support is
added, this code can be greatly simplified and gain an additional
performance boost.

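As an illustrative sketch of the emulation described above (placeholder names, not this PR's exact helpers), a 4-register TBL/TBX over a 64-byte table becomes a chain of single-register TBX instructions, stepping the indices down by 16 each time:

// Assumes: using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.Arm;
static Vector128<byte> EmulatedTbx4(Vector128<byte> table0, Vector128<byte> table1,
                                    Vector128<byte> table2, Vector128<byte> table3,
                                    Vector128<byte> indices)
{
    Vector128<byte> v16  = Vector128.Create((byte)16);
    Vector128<byte> dest = Vector128.Create((byte)0xFF); // lanes that miss every table stay 0xFF

    dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table0, indices); // TBX: keeps dest on miss
    indices = AdvSimd.Subtract(indices, v16);
    dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table1, indices);
    indices = AdvSimd.Subtract(indices, v16);
    dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table2, indices);
    indices = AdvSimd.Subtract(indices, v16);
    return AdvSimd.Arm64.VectorTableLookupExtension(dest, table3, indices);
}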
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jun 7, 2022
@ghost

ghost commented Jun 7, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.


@a74nh
Contributor Author

a74nh commented Jun 7, 2022

For this implementation I first started with the C implementation here:
https://github.com/aklomp/base64/blob/e516d769a2a432c08404f1981e73b431566057be/lib/arch/neon64
I rewrote it so that it didn’t use LD4/ST3/TBX4 instructions.
Running that under perf on an Altra showed a slowdown vs the original NEON version, but it was still quite a bit faster than the non-NEON version.

I took that and rewrote it in C# using the new Vector API (then falling back to the old API where equivalents don’t exist).

Examining the assembler output, the code looks fairly decent. However, running the performance Base64Decode test on an Altra shows a 3x slowdown:

|       Method |        Job |                                                                                                        Toolchain | NumberOfBytes |      Mean |     Error |    StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|------------- |----------- |----------------------------------------------------------------------------------------------------------------- |-------------- |----------:|----------:|----------:|----------:|----------:|----------:|------:|---------------- |----------:|------------:|
| Base64Decode | Job-PXGVDP |       /runtime_HEAD/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |          1000 |  7.237 ns | 0.0032 ns | 0.0025 ns |  7.237 ns |  7.231 ns |  7.241 ns |  1.00 |            Base |         - |          NA |
| Base64Decode | Job-VFUVWH | /runtime_intrinsics/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |          1000 | 23.444 ns | 0.0289 ns | 0.0256 ns | 23.439 ns | 23.409 ns | 23.489 ns |  3.24 |          Slower |         - |          NA |

It's not immediately obvious to me why it's slower. I'm going to investigate further, given that the C version was still decent.

I didn't merge this code with the AVX2 and SSE3 versions because, although they do similar things, a combined version would just be a bunch of "if x86 else if arm" clauses with almost no commonality. As in the Aklomp code, the Arm version loads a lot more data each iteration (it relies on a full table lookup because there is no shuffle instruction), so making sure those tables are only loaded once from memory is crucial for performance.

@a74nh
Contributor Author

a74nh commented Jun 7, 2022

This is the code being produced for the main loop:

IN0032: 0000C8                    ld1     {v29.16b}, [x11]
IN0033: 0000CC                    add     x12, x11, #16
IN0034: 0000D0                    ld1     {v30.16b}, [x12]
IN0035: 0000D4                    add     x12, x11, #32
IN0036: 0000D8                    ld1     {v31.16b}, [x12]
IN0037: 0000DC                    add     x12, x11, #48
IN0038: 0000E0                    ld1     {v7.16b}, [x12]
IN0039: 0000E4                    uzp1    v6.8h, v29.8h, v30.8h
IN003a: 0000E8                    uzp2    v5.8h, v29.8h, v30.8h
IN003b: 0000EC                    uzp1    v30.8h, v31.8h, v7.8h
IN003c: 0000F0                    uzp2    v7.8h, v31.8h, v7.8h
IN003d: 0000F4                    uzp1    v29.16b, v6.16b, v30.16b
IN003e: 0000F8                    uzp2    v30.16b, v6.16b, v30.16b
IN003f: 0000FC                    uzp1    v31.16b, v5.16b, v7.16b
IN0040: 000100                    uzp2    v7.16b, v5.16b, v7.16b
IN0041: 000104                    mov     v6.16b, v27.16b
IN0042: 000108                    tbx     v6.16b, {v16.16b}, v29.16b
IN0043: 00010C                    sub     v29.16b, v29.16b, v28.16b
IN0044: 000110                    tbx     v6.16b, {v17.16b}, v29.16b
IN0045: 000114                    sub     v29.16b, v29.16b, v28.16b
IN0046: 000118                    tbx     v6.16b, {v18.16b}, v29.16b
IN0047: 00011C                    sub     v29.16b, v29.16b, v28.16b
IN0048: 000120                    tbx     v6.16b, {v19.16b}, v29.16b
IN0049: 000124                    sub     v29.16b, v29.16b, v28.16b
IN004a: 000128                    tbx     v6.16b, {v20.16b}, v29.16b
IN004b: 00012C                    sub     v29.16b, v29.16b, v28.16b
IN004c: 000130                    tbx     v6.16b, {v21.16b}, v29.16b
IN004d: 000134                    sub     v29.16b, v29.16b, v28.16b
IN004e: 000138                    tbx     v6.16b, {v22.16b}, v29.16b
IN004f: 00013C                    sub     v29.16b, v29.16b, v28.16b
IN0050: 000140                    tbx     v6.16b, {v23.16b}, v29.16b
IN0051: 000144                    mov     v29.16b, v6.16b
IN0052: 000148                    mov     v6.16b, v27.16b
IN0053: 00014C                    tbx     v6.16b, {v16.16b}, v30.16b
IN0054: 000150                    sub     v30.16b, v30.16b, v28.16b
IN0055: 000154                    tbx     v6.16b, {v17.16b}, v30.16b
IN0056: 000158                    sub     v30.16b, v30.16b, v28.16b
IN0057: 00015C                    tbx     v6.16b, {v18.16b}, v30.16b
IN0058: 000160                    sub     v30.16b, v30.16b, v28.16b
IN0059: 000164                    tbx     v6.16b, {v19.16b}, v30.16b
IN005a: 000168                    sub     v30.16b, v30.16b, v28.16b
IN005b: 00016C                    tbx     v6.16b, {v20.16b}, v30.16b
IN005c: 000170                    sub     v30.16b, v30.16b, v28.16b
IN005d: 000174                    tbx     v6.16b, {v21.16b}, v30.16b
IN005e: 000178                    sub     v30.16b, v30.16b, v28.16b
IN005f: 00017C                    tbx     v6.16b, {v22.16b}, v30.16b
IN0060: 000180                    sub     v30.16b, v30.16b, v28.16b
IN0061: 000184                    tbx     v6.16b, {v23.16b}, v30.16b
IN0062: 000188                    mov     v30.16b, v6.16b
IN0063: 00018C                    mov     v6.16b, v27.16b
IN0064: 000190                    tbx     v6.16b, {v16.16b}, v31.16b
IN0065: 000194                    sub     v31.16b, v31.16b, v28.16b
IN0066: 000198                    tbx     v6.16b, {v17.16b}, v31.16b
IN0067: 00019C                    sub     v31.16b, v31.16b, v28.16b
IN0068: 0001A0                    tbx     v6.16b, {v18.16b}, v31.16b
IN0069: 0001A4                    sub     v31.16b, v31.16b, v28.16b
IN006a: 0001A8                    tbx     v6.16b, {v19.16b}, v31.16b
IN006b: 0001AC                    sub     v31.16b, v31.16b, v28.16b
IN006c: 0001B0                    tbx     v6.16b, {v20.16b}, v31.16b
IN006d: 0001B4                    sub     v31.16b, v31.16b, v28.16b
IN006e: 0001B8                    tbx     v6.16b, {v21.16b}, v31.16b
IN006f: 0001BC                    sub     v31.16b, v31.16b, v28.16b
IN0070: 0001C0                    tbx     v6.16b, {v22.16b}, v31.16b
IN0071: 0001C4                    sub     v31.16b, v31.16b, v28.16b
IN0072: 0001C8                    tbx     v6.16b, {v23.16b}, v31.16b
IN0073: 0001CC                    mov     v31.16b, v6.16b
IN0074: 0001D0                    mov     v6.16b, v27.16b
IN0075: 0001D4                    tbx     v6.16b, {v16.16b}, v7.16b
IN0076: 0001D8                    sub     v7.16b, v7.16b, v28.16b
IN0077: 0001DC                    tbx     v6.16b, {v17.16b}, v7.16b
IN0078: 0001E0                    sub     v7.16b, v7.16b, v28.16b
IN0079: 0001E4                    tbx     v6.16b, {v18.16b}, v7.16b
IN007a: 0001E8                    sub     v7.16b, v7.16b, v28.16b
IN007b: 0001EC                    tbx     v6.16b, {v19.16b}, v7.16b
IN007c: 0001F0                    sub     v7.16b, v7.16b, v28.16b
IN007d: 0001F4                    tbx     v6.16b, {v20.16b}, v7.16b
IN007e: 0001F8                    sub     v7.16b, v7.16b, v28.16b
IN007f: 0001FC                    tbx     v6.16b, {v21.16b}, v7.16b
IN0080: 000200                    sub     v7.16b, v7.16b, v28.16b
IN0081: 000204                    tbx     v6.16b, {v22.16b}, v7.16b
IN0082: 000208                    sub     v7.16b, v7.16b, v28.16b
IN0083: 00020C                    tbx     v6.16b, {v23.16b}, v7.16b
IN0084: 000210                    mov     v7.16b, v6.16b
IN0085: 000214                    umaxp   v6.16b, v29.16b, v30.16b
IN0086: 000218                    umaxp   v5.16b, v31.16b, v7.16b
IN0087: 00021C                    umaxp   v6.16b, v6.16b, v5.16b
IN0088: 000220                    umov    x12, v6.d[0]
IN0089: 000224                    tst     x12, #0xc0c0c0c0c0c0c0c0
IN008a: 000228                    bne     G_M5947_IG06
IN008b: 00022C                    shl     v29.16b, v29.16b, #2
IN008c: 000230                    ushr    v6.16b, v30.16b, #4
IN008d: 000234                    orr     v29.16b, v29.16b, v6.16b
IN008e: 000238                    shl     v30.16b, v30.16b, #4
IN008f: 00023C                    ushr    v6.16b, v31.16b, #2
IN0090: 000240                    orr     v30.16b, v30.16b, v6.16b
IN0091: 000244                    shl     v31.16b, v31.16b, #6
IN0092: 000248                    orr     v31.16b, v31.16b, v7.16b
IN0093: 00024C                    movi    v7.4s, #0x00
IN0094: 000250                    tbx     v7.16b, {v29.16b}, v24.16b
IN0095: 000254                    sub     v6.16b, v24.16b, v28.16b
IN0096: 000258                    tbx     v7.16b, {v30.16b}, v6.16b
IN0097: 00025C                    sub     v6.16b, v6.16b, v28.16b
IN0098: 000260                    tbx     v7.16b, {v31.16b}, v6.16b
IN0099: 000264                    movi    v6.4s, #0x00
IN009a: 000268                    tbx     v6.16b, {v29.16b}, v25.16b
IN009b: 00026C                    sub     v5.16b, v25.16b, v28.16b
IN009c: 000270                    tbx     v6.16b, {v30.16b}, v5.16b
IN009d: 000274                    sub     v5.16b, v5.16b, v28.16b
IN009e: 000278                    tbx     v6.16b, {v31.16b}, v5.16b
IN009f: 00027C                    movi    v5.4s, #0x00
IN00a0: 000280                    tbx     v5.16b, {v29.16b}, v26.16b
IN00a1: 000284                    sub     v29.16b, v26.16b, v28.16b
IN00a2: 000288                    tbx     v5.16b, {v30.16b}, v29.16b
IN00a3: 00028C                    sub     v29.16b, v29.16b, v28.16b
IN00a4: 000290                    tbx     v5.16b, {v31.16b}, v29.16b
IN00a5: 000294                    st1     {v7.16b}, [x13]
IN00a6: 000298                    add     x12, x13, #16
IN00a7: 00029C                    st1     {v6.16b}, [x12]
IN00a8: 0002A0                    add     x12, x13, #32
IN00a9: 0002A4                    st1     {v5.16b}, [x12]
IN00aa: 0002A8                    add     x11, x11, #64
IN00ab: 0002AC                    add     x13, x13, #48
IN00ac: 0002B0                    cmp     x11, x10
IN00ad: 0002B4                    bls     G_M5947_IG05

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void AdvSimdDecode(ref byte* srcBytes, ref byte* destBytes, byte* srcEnd, int sourceLength, int destLength, byte* srcStart, byte* destStart)
Member

Just curious - does AggressiveInlining here show benefits in the benchmarks?

Contributor Author

If I remove this one it doesn't make a difference.
If I remove the two around AdvSimdTbx3Byte and AdvSimdTbx8Byte it gets 2x worse.

Member

Those definitely make sense, while this one is a bit questionable.
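For context, a hedged sketch of the shape of those helpers (a hypothetical reduction, not the PR's exact AdvSimdTbx8Byte signature). Without forced inlining, each call in the hot loop would re-marshal its vector arguments through the calling convention:

// Assumes: using System.Runtime.CompilerServices; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.Arm;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector128<byte> TbxStep(Vector128<byte> dest, Vector128<byte> table, Vector128<byte> indices) =>
    AdvSimd.Arm64.VectorTableLookupExtension(dest, table, indices);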

// Compress each four bytes into three.
Vector128<byte> dec0 = Vector128.BitwiseOr(Vector128.ShiftLeft(str0, 2), Vector128.ShiftRightLogical(str1, 4));
Vector128<byte> dec1 = Vector128.BitwiseOr(Vector128.ShiftLeft(str1, 4), Vector128.ShiftRightLogical(str2, 2));
Vector128<byte> dec2 = Vector128.BitwiseOr(Vector128.ShiftLeft(str2, 6), str3);
Member

@EgorBo Jun 7, 2022

The JIT doesn't do instruction selection, so if you want your shifts to be side by side for better pipelining you need to extract them into temp locals, e.g.:

var sl0 = Vector128.ShiftLeft(str0, 2);
var sl1 = Vector128.ShiftLeft(str1, 4);
var sl2 = Vector128.ShiftLeft(str2, 6);

var sr1 = Vector128.ShiftRightLogical(str1, 4);
var sr2 = Vector128.ShiftRightLogical(str2, 2);

Vector128<byte> dec0 = Vector128.BitwiseOr(sl0, sr1);
Vector128<byte> dec1 = Vector128.BitwiseOr(sl1, sr2);
Vector128<byte> dec2 = Vector128.BitwiseOr(sl2, str3);

not sure it matters much in terms of perf

Contributor Author

Copying that verbatim didn't make any difference.

Contributor Author

Another way to do this might be to treat the vector register as a single value and do something fancy with masks and a single-value shift. Not quite sure how that would look. It might gain some performance, but it'd be messy.

Vector128<byte> str0 = Vector128.LoadUnsafe(ref *src);
Vector128<byte> str1 = Vector128.LoadUnsafe(ref *src, 16);
Vector128<byte> str2 = Vector128.LoadUnsafe(ref *src, 32);
Vector128<byte> str3 = Vector128.LoadUnsafe(ref *src, 48);
Member

I suspect you can use AdvSimd.Arm64.LoadPairVector128 (and even nontemporal if it makes sense) here - jit is not smart enough yet to do it by itself
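A minimal sketch of that suggestion (assuming src is a byte* into the input; LoadPairVector128 returns the two loaded vectors as a tuple):

var (str0, str1) = AdvSimd.Arm64.LoadPairVector128(src);      // one paired load for bytes 0..31
var (str2, str3) = AdvSimd.Arm64.LoadPairVector128(src + 32); // one paired load for bytes 32..63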

Contributor Author

Doing this worked, but didn't give any improvement

@EgorBo
Member

EgorBo commented Jun 7, 2022

I compiled your code locally and here is what I've got on Main:

; Method C:AdvSimdDecode(byref,byref,long,int,int,long,long)
G_M23748_IG01:
            stp     fp, lr, [sp,#-16]!
            mov     fp, sp
						;; size=8 bbWeight=1    PerfScore 1.50

G_M23748_IG02:
            ldr     x3, [x0]
            ldr     x4, [x1]
						;; size=8 bbWeight=1    PerfScore 6.00

G_M23748_IG03:
            ld1     {v16.16b}, [x3]
            add     x5, x3, #16
            ld1     {v17.16b}, [x5]
            add     x5, x3, #32
            ld1     {v18.16b}, [x5]
            add     x5, x3, #48
            ld1     {v19.16b}, [x5]
            uzp1    v20.8h, v16.8h, v17.8h
            uzp2    v21.8h, v16.8h, v17.8h
            uzp1    v17.8h, v18.8h, v19.8h
            uzp2    v19.8h, v18.8h, v19.8h
            uzp1    v16.16b, v20.16b, v17.16b
            uzp2    v17.16b, v20.16b, v17.16b
            uzp1    v18.16b, v21.16b, v19.16b
            uzp2    v19.16b, v21.16b, v19.16b
            mvni    v20.4s, #0x00
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            ldr     q21, [@RWD16]
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            ldr     q21, [@RWD32]
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            ldr     q21, [@RWD48]
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            ldr     q21, [@RWD64]
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            ldr     q21, [@RWD80]
            tbx     v20.16b, {v21.16b}, v16.16b
            ldr     q21, [@RWD00]
            sub     v16.16b, v16.16b, v21.16b
            ldr     q21, [@RWD96]
            tbx     v20.16b, {v21.16b}, v16.16b
            mov     v16.16b, v20.16b
            mvni    v20.4s, #0x00
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            ldr     q21, [@RWD16]
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            ldr     q21, [@RWD32]
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            ldr     q21, [@RWD48]
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            ldr     q21, [@RWD64]
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            ldr     q21, [@RWD80]
            tbx     v20.16b, {v21.16b}, v17.16b
            ldr     q21, [@RWD00]
            sub     v17.16b, v17.16b, v21.16b
            ldr     q21, [@RWD96]
            tbx     v20.16b, {v21.16b}, v17.16b
            mov     v17.16b, v20.16b
            mvni    v20.4s, #0x00
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            ldr     q21, [@RWD16]
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            ldr     q21, [@RWD32]
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            ldr     q21, [@RWD48]
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            ldr     q21, [@RWD64]
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            ldr     q21, [@RWD80]
            tbx     v20.16b, {v21.16b}, v18.16b
            ldr     q21, [@RWD00]
            sub     v18.16b, v18.16b, v21.16b
            ldr     q21, [@RWD96]
            tbx     v20.16b, {v21.16b}, v18.16b
            mov     v18.16b, v20.16b
            mvni    v20.4s, #0x00
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v19.16b
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            mvni    v21.4s, #0x00
            tbx     v20.16b, {v21.16b}, v19.16b
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            ldr     q21, [@RWD16]
            tbx     v20.16b, {v21.16b}, v19.16b
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            ldr     q21, [@RWD32]
            tbx     v20.16b, {v21.16b}, v19.16b
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            ldr     q21, [@RWD48]
            tbx     v20.16b, {v21.16b}, v19.16b
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            ldr     q21, [@RWD64]
            tbx     v20.16b, {v21.16b}, v19.16b
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            ldr     q21, [@RWD80]
            tbx     v20.16b, {v21.16b}, v19.16b
						;; size=552 bbWeight=8    PerfScore 1496.00

G_M23748_IG04:
            ldr     q21, [@RWD00]
            sub     v19.16b, v19.16b, v21.16b
            ldr     q21, [@RWD96]
            tbx     v20.16b, {v21.16b}, v19.16b
            mov     v19.16b, v20.16b
            umaxp   v20.16b, v16.16b, v17.16b
            umaxp   v21.16b, v18.16b, v19.16b
            umaxp   v20.16b, v20.16b, v21.16b
            umov    x5, v20.d[0]
            tst     x5, #0xd1ffab1e
            bne     G_M23748_IG06
						;; size=44 bbWeight=8    PerfScore 96.00

G_M23748_IG05:
            shl     v16.16b, v16.16b, #2
            ushr    v20.16b, v17.16b, #4
            orr     v16.16b, v16.16b, v20.16b
            shl     v17.16b, v17.16b, #4
            ushr    v20.16b, v18.16b, #2
            orr     v17.16b, v17.16b, v20.16b
            shl     v18.16b, v18.16b, #6
            orr     v18.16b, v18.16b, v19.16b
            movi    v19.4s, #0x00
            ldr     q20, [@RWD112]
            tbx     v19.16b, {v16.16b}, v20.16b
            ldr     q20, [@RWD112]
            ldr     q21, [@RWD00]
            sub     v20.16b, v20.16b, v21.16b
            tbx     v19.16b, {v17.16b}, v20.16b
            ldr     q21, [@RWD00]
            sub     v20.16b, v20.16b, v21.16b
            tbx     v19.16b, {v18.16b}, v20.16b
            st1     {v19.16b}, [x4]
            movi    v19.4s, #0x00
            ldr     q20, [@RWD128]
            tbx     v19.16b, {v16.16b}, v20.16b
            ldr     q20, [@RWD128]
            ldr     q21, [@RWD00]
            sub     v20.16b, v20.16b, v21.16b
            tbx     v19.16b, {v17.16b}, v20.16b
            ldr     q21, [@RWD00]
            sub     v20.16b, v20.16b, v21.16b
            tbx     v19.16b, {v18.16b}, v20.16b
            add     x5, x4, #16
            st1     {v19.16b}, [x5]
            movi    v19.4s, #0x00
            ldr     q20, [@RWD144]
            tbx     v19.16b, {v16.16b}, v20.16b
            ldr     q16, [@RWD144]
            ldr     q20, [@RWD00]
            sub     v16.16b, v16.16b, v20.16b
            tbx     v19.16b, {v17.16b}, v16.16b
            ldr     q17, [@RWD00]
            sub     v16.16b, v16.16b, v17.16b
            tbx     v19.16b, {v18.16b}, v16.16b
            add     x5, x4, #32
            st1     {v19.16b}, [x5]
            add     x3, x3, #64
            add     x4, x4, #48
            cmp     x3, x2
            bls     G_M23748_IG03
						;; size=188 bbWeight=4    PerfScore 214.00

G_M23748_IG06:
            str     x3, [x0]
            str     x4, [x1]
						;; size=8 bbWeight=1    PerfScore 2.00

G_M23748_IG07:
            ldp     fp, lr, [sp],#16
            ret     lr
						;; size=8 bbWeight=1    PerfScore 2.00
RWD00  	dq	1010101010101010h, 1010101010101010h
RWD16  	dq	FFFFFFFFFFFFFFFFh, 3FFFFFFF3EFFFFFFh
RWD32  	dq	3B3A393837363534h, FFFFFFFFFFFF3D3Ch
RWD48  	dq	06050403020100FFh, 0E0D0C0B0A090807h
RWD64  	dq	161514131211100Fh, FFFFFFFFFF191817h
RWD80  	dq	201F1E1D1C1B1AFFh, 2827262524232221h
RWD96  	dq	302F2E2D2C2B2A29h, FFFFFFFFFF333231h
RWD112 	dq	1202211101201000h, 0524140423130322h
RWD128 	dq	2717072616062515h, 1A0A291909281808h
RWD144 	dq	0D2C1C0C2B1B0B2Ah, 2F1F0F2E1E0E2D1Dh

Emitting R2R PE file: aot
; Total bytes of code: 816

Don't know what is happening but it feels like all the constants were "propagated" back to the loop body. cc @tannergooding @dotnet/jit-contrib

@EgorBo
Member

EgorBo commented Jun 7, 2022

Ah, looks like it's CSE that decided to do so.

@a74nh
Contributor Author

a74nh commented Jun 7, 2022

I compiled your code locally and here is what I've got on Main:

@EgorBo - how are you running and dumping that?

How I was doing it was copying and pasting my code into a test app and running with COMPlus_TieredCompilation=0

It's quite possible there is something different happening when the code is in the library

@EgorBo
Member

EgorBo commented Jun 7, 2022

How I was doing it was copying and pasting my code into a test app and running with COMPlus_TieredCompilation=0

So did I. The only exception - I was using crossgen. Can you paste the whole codegen for your case? (not just loop body)

@a74nh
Contributor Author

a74nh commented Jun 7, 2022

So did I. The only exception - I was using crossgen. Can you paste the whole codegen for your case? (not just loop body)

Ahh, good :) I wasn't sure if there was a better way of doing it.

This is my full dump of DecodeFromUtf8:

*************** After end code gen, before unwindEmit()
G_M16407_IG01:        ; func=00, offs=000000H, size=0014H, bbWeight=1    PerfScore 4.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN014a: 000000                    stp     fp, lr, [sp,#-48]!
IN014b: 000004                    stp     x19, x20, [sp,#32]
IN014c: 000008                    mov     fp, sp
IN014d: 00000C                    str     xzr, [fp,#24]	// [V06 loc1]
IN014e: 000010                    str     xzr, [fp,#16]	// [V08 loc3]

G_M16407_IG02:        ; offs=000014H, size=0004H, bbWeight=1    PerfScore 1.00, gcrefRegs=0000 {}, byrefRegs=0035 {x0 x2 x4 x5}, BB01 [0000], byref, isz

IN0001: 000014                    cbnz    w1, G_M16407_IG04

G_M16407_IG03:        ; offs=000018H, size=000CH, bbWeight=0.50 PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB02 [0001], byref, isz

IN0002: 000018                    str     wzr, [x4]
IN0003: 00001C                    str     wzr, [x5]
IN0004: 000020                    b       G_M16407_IG18
IN0005: 000024                    align   [0 bytes for IG05]
IN0006: 000024                    align   [0 bytes]
IN0007: 000024                    align   [0 bytes]
IN0008: 000024                    align   [0 bytes]

G_M16407_IG04:        ; offs=000024H, size=00ACH, bbWeight=0.50 PerfScore 21.75, gcrefRegs=0000 {}, byrefRegs=0035 {x0 x2 x4 x5}, BB03 [0002], BB04 [0054], BB07 [0005], BB08 [0014], byref, isz

IN0009: 000024                    str     x0, [fp,#24]	// [V06 loc1]
IN000a: 000028                    str     x2, [fp,#16]	// [V08 loc3]
IN000b: 00002C                    and     w7, w1, #0xfffffffc
IN000c: 000030                    mov     w8, w7
IN000d: 000034                    cmp     w8, #0
IN000e: 000038                    blt     G_M16407_IG26
IN000f: 00003C                    asr     w8, w8, #2
IN0010: 000040                    mov     w9, #3
IN0011: 000044                    mul     w8, w8, w9
IN0012: 000048                    sub     w9, w8, #2
IN0013: 00004C                    cmp     w3, w9
IN0014: 000050                    movz    w9, #0x5556
IN0015: 000054                    movk    w9, #0x5555 LSL #16
IN0016: 000058                    smull   x9, w9, w3
IN0017: 00005C                    asr     x9, x9, #32
IN0018: 000060                    lsr     w10, w9, #31
IN0019: 000064                    add     w9, w9, w10
IN001a: 000068                    lsl     w9, w9, #2
IN001b: 00006C                    csel    w10, w9, w7, lt
IN001c: 000070                    mov     x11, x0
IN001d: 000074                    mov     x13, x2
IN001e: 000078                    add     x14, x11, w7, UXTW
IN001f: 00007C                    add     x12, x11, w10, UXTW
IN0020: 000080                    cmp     w10, #24
IN0021: 000084                    blt     G_M16407_IG07
IN0022: 000088                    sub     x10, x12, #96
IN0023: 00008C                    cmp     x10, x0
IN0024: 000090                    blo     G_M16407_IG07
IN0025: 000094                    mvni    v16.4s, #0x00
IN0026: 000098                    mvni    v17.4s, #0x00
IN0027: 00009C                    ldr     q18, [@RWD00]
IN0028: 0000A0                    ldr     q19, [@RWD16]
IN0029: 0000A4                    ldr     q20, [@RWD32]
IN002a: 0000A8                    ldr     q21, [@RWD48]
IN002b: 0000AC                    ldr     q22, [@RWD64]
IN002c: 0000B0                    ldr     q23, [@RWD80]
IN002d: 0000B4                    ldr     q24, [@RWD96]
IN002e: 0000B8                    ldr     q25, [@RWD112]
IN002f: 0000BC                    ldr     q26, [@RWD128]
IN0030: 0000C0                    mov     x11, x0
IN0031: 0000C4                    mov     x13, x2
IN0032: 0000C8                    movi    v27.16b, #0xff
IN0033: 0000CC                    movi    v28.16b, #0x10

G_M16407_IG05:        ; offs=0000D0H, size=01F0H, bbWeight=4    PerfScore 482.00, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, loop=IG05, BB09 [0057], BB10 [0058], byref, isz

IN0034: 0000D0                    ld1     {v29.16b}, [x11]
IN0035: 0000D4                    add     x12, x11, #16
IN0036: 0000D8                    ld1     {v30.16b}, [x12]
IN0037: 0000DC                    add     x12, x11, #32
IN0038: 0000E0                    ld1     {v31.16b}, [x12]
IN0039: 0000E4                    add     x12, x11, #48
IN003a: 0000E8                    ld1     {v7.16b}, [x12]
IN003b: 0000EC                    uzp1    v6.8h, v29.8h, v30.8h
IN003c: 0000F0                    uzp2    v5.8h, v29.8h, v30.8h
IN003d: 0000F4                    uzp1    v30.8h, v31.8h, v7.8h
IN003e: 0000F8                    uzp2    v7.8h, v31.8h, v7.8h
IN003f: 0000FC                    uzp1    v29.16b, v6.16b, v30.16b
IN0040: 000100                    uzp2    v30.16b, v6.16b, v30.16b
IN0041: 000104                    uzp1    v31.16b, v5.16b, v7.16b
IN0042: 000108                    uzp2    v7.16b, v5.16b, v7.16b
IN0043: 00010C                    mov     v6.16b, v27.16b
IN0044: 000110                    tbx     v6.16b, {v16.16b}, v29.16b
IN0045: 000114                    sub     v29.16b, v29.16b, v28.16b
IN0046: 000118                    tbx     v6.16b, {v17.16b}, v29.16b
IN0047: 00011C                    sub     v29.16b, v29.16b, v28.16b
IN0048: 000120                    tbx     v6.16b, {v18.16b}, v29.16b
IN0049: 000124                    sub     v29.16b, v29.16b, v28.16b
IN004a: 000128                    tbx     v6.16b, {v19.16b}, v29.16b
IN004b: 00012C                    sub     v29.16b, v29.16b, v28.16b
IN004c: 000130                    tbx     v6.16b, {v20.16b}, v29.16b
IN004d: 000134                    sub     v29.16b, v29.16b, v28.16b
IN004e: 000138                    tbx     v6.16b, {v21.16b}, v29.16b
IN004f: 00013C                    sub     v29.16b, v29.16b, v28.16b
IN0050: 000140                    tbx     v6.16b, {v22.16b}, v29.16b
IN0051: 000144                    sub     v29.16b, v29.16b, v28.16b
IN0052: 000148                    tbx     v6.16b, {v23.16b}, v29.16b
IN0053: 00014C                    mov     v29.16b, v6.16b
IN0054: 000150                    mov     v6.16b, v27.16b
IN0055: 000154                    tbx     v6.16b, {v16.16b}, v30.16b
IN0056: 000158                    sub     v30.16b, v30.16b, v28.16b
IN0057: 00015C                    tbx     v6.16b, {v17.16b}, v30.16b
IN0058: 000160                    sub     v30.16b, v30.16b, v28.16b
IN0059: 000164                    tbx     v6.16b, {v18.16b}, v30.16b
IN005a: 000168                    sub     v30.16b, v30.16b, v28.16b
IN005b: 00016C                    tbx     v6.16b, {v19.16b}, v30.16b
IN005c: 000170                    sub     v30.16b, v30.16b, v28.16b
IN005d: 000174                    tbx     v6.16b, {v20.16b}, v30.16b
IN005e: 000178                    sub     v30.16b, v30.16b, v28.16b
IN005f: 00017C                    tbx     v6.16b, {v21.16b}, v30.16b
IN0060: 000180                    sub     v30.16b, v30.16b, v28.16b
IN0061: 000184                    tbx     v6.16b, {v22.16b}, v30.16b
IN0062: 000188                    sub     v30.16b, v30.16b, v28.16b
IN0063: 00018C                    tbx     v6.16b, {v23.16b}, v30.16b
IN0064: 000190                    mov     v30.16b, v6.16b
IN0065: 000194                    mov     v6.16b, v27.16b
IN0066: 000198                    tbx     v6.16b, {v16.16b}, v31.16b
IN0067: 00019C                    sub     v31.16b, v31.16b, v28.16b
IN0068: 0001A0                    tbx     v6.16b, {v17.16b}, v31.16b
IN0069: 0001A4                    sub     v31.16b, v31.16b, v28.16b
IN006a: 0001A8                    tbx     v6.16b, {v18.16b}, v31.16b
IN006b: 0001AC                    sub     v31.16b, v31.16b, v28.16b
IN006c: 0001B0                    tbx     v6.16b, {v19.16b}, v31.16b
IN006d: 0001B4                    sub     v31.16b, v31.16b, v28.16b
IN006e: 0001B8                    tbx     v6.16b, {v20.16b}, v31.16b
IN006f: 0001BC                    sub     v31.16b, v31.16b, v28.16b
IN0070: 0001C0                    tbx     v6.16b, {v21.16b}, v31.16b
IN0071: 0001C4                    sub     v31.16b, v31.16b, v28.16b
IN0072: 0001C8                    tbx     v6.16b, {v22.16b}, v31.16b
IN0073: 0001CC                    sub     v31.16b, v31.16b, v28.16b
IN0074: 0001D0                    tbx     v6.16b, {v23.16b}, v31.16b
IN0075: 0001D4                    mov     v31.16b, v6.16b
IN0076: 0001D8                    mov     v6.16b, v27.16b
IN0077: 0001DC                    tbx     v6.16b, {v16.16b}, v7.16b
IN0078: 0001E0                    sub     v7.16b, v7.16b, v28.16b
IN0079: 0001E4                    tbx     v6.16b, {v17.16b}, v7.16b
IN007a: 0001E8                    sub     v7.16b, v7.16b, v28.16b
IN007b: 0001EC                    tbx     v6.16b, {v18.16b}, v7.16b
IN007c: 0001F0                    sub     v7.16b, v7.16b, v28.16b
IN007d: 0001F4                    tbx     v6.16b, {v19.16b}, v7.16b
IN007e: 0001F8                    sub     v7.16b, v7.16b, v28.16b
IN007f: 0001FC                    tbx     v6.16b, {v20.16b}, v7.16b
IN0080: 000200                    sub     v7.16b, v7.16b, v28.16b
IN0081: 000204                    tbx     v6.16b, {v21.16b}, v7.16b
IN0082: 000208                    sub     v7.16b, v7.16b, v28.16b
IN0083: 00020C                    tbx     v6.16b, {v22.16b}, v7.16b
IN0084: 000210                    sub     v7.16b, v7.16b, v28.16b
IN0085: 000214                    tbx     v6.16b, {v23.16b}, v7.16b
IN0086: 000218                    mov     v7.16b, v6.16b
IN0087: 00021C                    umaxp   v6.16b, v29.16b, v30.16b
IN0088: 000220                    umaxp   v5.16b, v31.16b, v7.16b
IN0089: 000224                    umaxp   v6.16b, v6.16b, v5.16b
IN008a: 000228                    umov    x12, v6.d[0]
IN008b: 00022C                    tst     x12, #0xc0c0c0c0c0c0c0c0
IN008c: 000230                    bne     G_M16407_IG06
IN008d: 000234                    shl     v29.16b, v29.16b, #2
IN008e: 000238                    ushr    v6.16b, v30.16b, #4
IN008f: 00023C                    orr     v29.16b, v29.16b, v6.16b
IN0090: 000240                    shl     v30.16b, v30.16b, #4
IN0091: 000244                    ushr    v6.16b, v31.16b, #2
IN0092: 000248                    orr     v30.16b, v30.16b, v6.16b
IN0093: 00024C                    shl     v31.16b, v31.16b, #6
IN0094: 000250                    orr     v31.16b, v31.16b, v7.16b
IN0095: 000254                    movi    v7.4s, #0x00
IN0096: 000258                    tbx     v7.16b, {v29.16b}, v24.16b
IN0097: 00025C                    sub     v6.16b, v24.16b, v28.16b
IN0098: 000260                    tbx     v7.16b, {v30.16b}, v6.16b
IN0099: 000264                    sub     v6.16b, v6.16b, v28.16b
IN009a: 000268                    tbx     v7.16b, {v31.16b}, v6.16b
IN009b: 00026C                    movi    v6.4s, #0x00
IN009c: 000270                    tbx     v6.16b, {v29.16b}, v25.16b
IN009d: 000274                    sub     v5.16b, v25.16b, v28.16b
IN009e: 000278                    tbx     v6.16b, {v30.16b}, v5.16b
IN009f: 00027C                    sub     v5.16b, v5.16b, v28.16b
IN00a0: 000280                    tbx     v6.16b, {v31.16b}, v5.16b
IN00a1: 000284                    movi    v5.4s, #0x00
IN00a2: 000288                    tbx     v5.16b, {v29.16b}, v26.16b
IN00a3: 00028C                    sub     v29.16b, v26.16b, v28.16b
IN00a4: 000290                    tbx     v5.16b, {v30.16b}, v29.16b
IN00a5: 000294                    sub     v29.16b, v29.16b, v28.16b
IN00a6: 000298                    tbx     v5.16b, {v31.16b}, v29.16b
IN00a7: 00029C                    st1     {v7.16b}, [x13]
IN00a8: 0002A0                    add     x12, x13, #16
IN00a9: 0002A4                    st1     {v6.16b}, [x12]
IN00aa: 0002A8                    add     x12, x13, #32
IN00ab: 0002AC                    st1     {v5.16b}, [x12]
IN00ac: 0002B0                    add     x11, x11, #64
IN00ad: 0002B4                    add     x13, x13, #48
IN00ae: 0002B8                    cmp     x11, x10
IN00af: 0002BC                    bls     G_M16407_IG05

G_M16407_IG06:        ; offs=0002C0H, size=0008H, bbWeight=0.50 PerfScore 0.75, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB11 [0059], byref, isz

IN00b0: 0002C0                    cmp     x11, x14
IN00b1: 0002C4                    beq     G_M16407_IG17

G_M16407_IG07:        ; offs=0002C8H, size=0010H, bbWeight=0.50 PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB12 [0015], BB13 [0016], byref, isz

IN00b2: 0002C8                    uxtb    w6, w6
IN00b3: 0002CC                    cbnz    w6, G_M16407_IG08
IN00b4: 0002D0                    mov     w10, wzr
IN00b5: 0002D4                    b       G_M16407_IG09
IN00b6: 0002D8                    align   [0 bytes for IG11]
IN00b7: 0002D8                    align   [0 bytes]
IN00b8: 0002D8                    align   [0 bytes]
IN00b9: 0002D8                    align   [0 bytes]

G_M16407_IG08:        ; offs=0002D8H, size=0004H, bbWeight=0.50 PerfScore 0.25, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB14 [0017], byref

IN00ba: 0002D8                    mov     w10, #4

G_M16407_IG09:        ; offs=0002DCH, size=0010H, bbWeight=0.50 PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB15 [0018], BB16 [0019], byref, isz

IN00bb: 0002DC                    cmp     w3, w8
IN00bc: 0002E0                    blt     G_M16407_IG10
IN00bd: 0002E4                    sub     w9, w7, w10
IN00be: 0002E8                    b       G_M16407_IG10

G_M16407_IG10:        ; offs=0002ECH, size=0018H, bbWeight=0.50 PerfScore 2.00, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB18 [0021], byref, isz

IN00bf: 0002EC                    movz    x8, #0xfbe4
IN00c0: 0002F0                    movk    x8, #0xef51 LSL #16
IN00c1: 0002F4                    movk    x8, #0xffff LSL #32
IN00c2: 0002F8                    add     x12, x0, w9, UXTW
IN00c3: 0002FC                    cmp     x11, x12
IN00c4: 000300                    bhs     G_M16407_IG12

G_M16407_IG11:        ; offs=000304H, size=0074H, bbWeight=4    PerfScore 158.00, gcrefRegs=0000 {}, byrefRegs=0130 {x4 x5 x8}, loop=IG11, BB19 [0022], BB20 [0023], byref, isz

IN00c5: 000304                    ldrb    w15, [x11]
IN00c6: 000308                    ldrb    wip0, [x11,#1]
IN00c7: 00030C                    ldrb    w19, [x11,#2]
IN00c8: 000310                    ldrb    w20, [x11,#3]
IN00c9: 000314                    mov     w15, w15
IN00ca: 000318                    ldrsb   w15, [x8, x15]
IN00cb: 00031C                    mov     wip0, wip0
IN00cc: 000320                    ldrsb   wip0, [x8, xip0]
IN00cd: 000324                    mov     w19, w19
IN00ce: 000328                    ldrsb   w19, [x8, x19]
IN00cf: 00032C                    mov     w20, w20
IN00d0: 000330                    ldrsb   w20, [x8, x20]
IN00d1: 000334                    lsl     w15, w15, #18
IN00d2: 000338                    lsl     wip0, wip0, #12
IN00d3: 00033C                    lsl     w19, w19, #6
IN00d4: 000340                    orr     w15, w15, w20
IN00d5: 000344                    orr     wip0, wip0, w19
IN00d6: 000348                    orr     w15, w15, wip0
IN00d7: 00034C                    cmp     w15, #0
IN00d8: 000350                    blt     G_M16407_IG24
IN00d9: 000354                    asr     wip0, w15, #16
IN00da: 000358                    strb    wip0, [x13]
IN00db: 00035C                    asr     wip0, w15, #8
IN00dc: 000360                    strb    wip0, [x13,#1]
IN00dd: 000364                    strb    w15, [x13,#2]
IN00de: 000368                    add     x11, x11, #4
IN00df: 00036C                    add     x13, x13, #3
IN00e0: 000370                    cmp     x11, x12
IN00e1: 000374                    blo     G_M16407_IG11

G_M16407_IG12:        ; offs=000378H, size=0040H, bbWeight=0.50 PerfScore 6.00, gcrefRegs=0000 {}, byrefRegs=0130 {x4 x5 x8}, BB21 [0025], BB22 [0026], BB23 [0027], BB24 [0028], BB25 [0045], byref, isz

IN00e2: 000378                    sub     w10, w7, w10
IN00e3: 00037C                    cmp     w10, w9
IN00e4: 000380                    bne     G_M16407_IG20
IN00e5: 000384                    cmp     x11, x14
IN00e6: 000388                    bne     G_M16407_IG13
IN00e7: 00038C                    cbnz    w6, G_M16407_IG24
IN00e8: 000390                    add     x14, x0, w1, SXTW
IN00e9: 000394                    cmp     x14, x11
IN00ea: 000398                    beq     G_M16407_IG17
IN00eb: 00039C                    sub     x8, x11, x0
IN00ec: 0003A0                    mov     w3, w8
IN00ed: 0003A4                    str     w3, [x4]
IN00ee: 0003A8                    sub     x2, x13, x2
IN00ef: 0003AC                    mov     w13, w2
IN00f0: 0003B0                    str     w13, [x5]
IN00f1: 0003B4                    b       G_M16407_IG22

G_M16407_IG13:        ; offs=0003B8H, size=0084H, bbWeight=0.50 PerfScore 21.50, gcrefRegs=0000 {}, byrefRegs=0130 {x4 x5 x8}, BB26 [0030], BB27 [0031], BB28 [0032], BB29 [0033], byref, isz

IN00f2: 0003B8                    ldrb    w9, [x14,#-4]
IN00f3: 0003BC                    ldrb    w10, [x14,#-3]
IN00f4: 0003C0                    ldrb    w12, [x14,#-2]
IN00f5: 0003C4                    ldrb    w14, [x14,#-1]
IN00f6: 0003C8                    mov     w9, w9
IN00f7: 0003CC                    ldrsb   w9, [x8, x9]
IN00f8: 0003D0                    mov     w10, w10
IN00f9: 0003D4                    ldrsb   w10, [x8, x10]
IN00fa: 0003D8                    lsl     w9, w9, #18
IN00fb: 0003DC                    lsl     w10, w10, #12
IN00fc: 0003E0                    orr     w9, w9, w10
IN00fd: 0003E4                    add     x3, x2, w3, UXTW
IN00fe: 0003E8                    cmp     w14, #61
IN00ff: 0003EC                    beq     G_M16407_IG14
IN0100: 0003F0                    mov     w12, w12
IN0101: 0003F4                    ldrsb   w10, [x8, x12]
IN0102: 0003F8                    mov     w14, w14
IN0103: 0003FC                    ldrsb   w8, [x8, x14]
IN0104: 000400                    lsl     w10, w10, #6
IN0105: 000404                    orr     w9, w9, w8
IN0106: 000408                    orr     w9, w9, w10
IN0107: 00040C                    cmp     w9, #0
IN0108: 000410                    blt     G_M16407_IG24
IN0109: 000414                    add     x8, x13, #3
IN010a: 000418                    cmp     x8, x3
IN010b: 00041C                    bhi     G_M16407_IG20
IN010c: 000420                    asr     w3, w9, #16
IN010d: 000424                    strb    w3, [x13]
IN010e: 000428                    asr     w6, w9, #8
IN010f: 00042C                    strb    w6, [x13,#1]
IN0110: 000430                    strb    w9, [x13,#2]
IN0111: 000434                    mov     x13, x8
IN0112: 000438                    b       G_M16407_IG16

G_M16407_IG14:        ; offs=00043CH, size=0044H, bbWeight=0.50 PerfScore 7.75, gcrefRegs=0000 {}, byrefRegs=0130 {x4 x5 x8}, BB30 [0034], BB31 [0035], BB32 [0036], BB33 [0037], byref, isz

IN0113: 00043C                    cmp     w12, #61
IN0114: 000440                    beq     G_M16407_IG15
IN0115: 000444                    mov     w10, w12
IN0116: 000448                    ldrsb   w8, [x8, x10]
IN0117: 00044C                    lsl     w8, w8, #6
IN0118: 000450                    orr     w9, w9, w8
IN0119: 000454                    cmp     w9, #0
IN011a: 000458                    blt     G_M16407_IG24
IN011b: 00045C                    add     x8, x13, #2
IN011c: 000460                    cmp     x8, x3
IN011d: 000464                    bhi     G_M16407_IG20
IN011e: 000468                    asr     w3, w9, #16
IN011f: 00046C                    strb    w3, [x13]
IN0120: 000470                    asr     w9, w9, #8
IN0121: 000474                    strb    w9, [x13,#1]
IN0122: 000478                    mov     x13, x8
IN0123: 00047C                    b       G_M16407_IG16

G_M16407_IG15:        ; offs=000480H, size=0020H, bbWeight=0.50 PerfScore 3.00, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB34 [0038], BB35 [0039], BB36 [0040], byref, isz

IN0124: 000480                    cmp     w9, #0
IN0125: 000484                    blt     G_M16407_IG24
IN0126: 000488                    add     x8, x13, #1
IN0127: 00048C                    cmp     x8, x3
IN0128: 000490                    bhi     G_M16407_IG20
IN0129: 000494                    asr     w6, w9, #16
IN012a: 000498                    strb    w6, [x13]
IN012b: 00049C                    mov     x13, x8

G_M16407_IG16:        ; offs=0004A0H, size=000CH, bbWeight=0.50 PerfScore 1.00, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB37 [0041], byref, isz

IN012c: 0004A0                    add     x11, x11, #4
IN012d: 0004A4                    cmp     w7, w1
IN012e: 0004A8                    bne     G_M16407_IG24

G_M16407_IG17:        ; offs=0004ACH, size=0010H, bbWeight=0.50 PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB38 [0042], byref

IN012f: 0004AC                    sub     x0, x11, x0
IN0130: 0004B0                    str     w0, [x4]
IN0131: 0004B4                    sub     x0, x13, x2
IN0132: 0004B8                    str     w0, [x5]

G_M16407_IG18:        ; offs=0004BCH, size=0004H, bbWeight=0.50 PerfScore 0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB39 [0103], byref

IN0133: 0004BC                    mov     w0, wzr

G_M16407_IG19:        ; offs=0004C0H, size=000CH, bbWeight=0.50 PerfScore 1.50, epilog, nogc, extend

IN014f: 0004C0                    ldp     x19, x20, [sp,#32]
IN0150: 0004C4                    ldp     fp, lr, [sp],#48
IN0151: 0004C8                    ret     lr

G_M16407_IG20:        ; offs=0004CCH, size=0024H, bbWeight=0.50 PerfScore 3.00, gcVars= {}, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB40 [0043], BB41 [0044], gcvars, byref, isz

IN0134: 0004CC                    cmp     w7, w1
IN0135: 0004D0                    cset    x1, ne
IN0136: 0004D4                    tst     w1, w6
IN0137: 0004D8                    bne     G_M16407_IG24
IN0138: 0004DC                    sub     x0, x11, x0
IN0139: 0004E0                    str     w0, [x4]
IN013a: 0004E4                    sub     x0, x13, x2
IN013b: 0004E8                    str     w0, [x5]
IN013c: 0004EC                    mov     w0, #1

G_M16407_IG21:        ; offs=0004F0H, size=000CH, bbWeight=0.50 PerfScore 1.50, epilog, nogc, extend

IN0152: 0004F0                    ldp     x19, x20, [sp,#32]
IN0153: 0004F4                    ldp     fp, lr, [sp],#48
IN0154: 0004F8                    ret     lr

G_M16407_IG22:        ; offs=0004FCH, size=0004H, bbWeight=0.50 PerfScore 0.25, gcVars= {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB42 [0105], gcvars, byref

IN013d: 0004FC                    mov     w0, #2

G_M16407_IG23:        ; offs=000500H, size=000CH, bbWeight=0.50 PerfScore 1.50, epilog, nogc, extend

IN0155: 000500                    ldp     x19, x20, [sp,#32]
IN0156: 000504                    ldp     fp, lr, [sp],#48
IN0157: 000508                    ret     lr

G_M16407_IG24:        ; offs=00050CH, size=0014H, bbWeight=0.50 PerfScore 1.75, gcVars= {}, gcrefRegs=0000 {}, byrefRegs=0030 {x4 x5}, BB43 [0046], gcvars, byref

IN013e: 00050C                    sub     x0, x11, x0
IN013f: 000510                    str     w0, [x4]
IN0140: 000514                    sub     x0, x13, x2
IN0141: 000518                    str     w0, [x5]
IN0142: 00051C                    mov     w0, #3

G_M16407_IG25:        ; offs=000520H, size=000CH, bbWeight=0.50 PerfScore 1.50, epilog, nogc, extend

IN0158: 000520                    ldp     x19, x20, [sp,#32]
IN0159: 000524                    ldp     fp, lr, [sp],#48
IN015a: 000528                    ret     lr

G_M16407_IG26:        ; offs=00052CH, size=001CH, bbWeight=0    PerfScore 0.00, gcVars= {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB45 [0053], gcvars, byref

IN0143: 00052C                    mov     w0, wzr
IN0144: 000530                    movz    x1, #0x3bb8
IN0145: 000534                    movk    x1, #0x7ea8 LSL #16
IN0146: 000538                    movk    x1, #0xffff LSL #32
IN0147: 00053C                    ldr     x1, [x1]
IN0148: 000540                    blr     x1
IN0149: 000544                    brk_unix #0

One of my thoughts was that loading the table was very inefficient - but increasing the text size in the test by 10x didn't improve things

@a74nh
Contributor Author

a74nh commented Jun 7, 2022

It's worth noting that when I did this in C using Aklomp, I was comparing the plain C version vs the NEON C version. Once ported to C#, I'm comparing the plain C# version to the NEON C# version.

If the Aklomp plain C version isn't optimised and the plain C# version is optimised, then that would explain why I'm not seeing any improvement when I go from the optimised plain C# version to the semi-optimised NEON C# version.

indicies_sub = AdvSimd.Subtract(indicies_sub, offset);
dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table6, indicies_sub);
indicies_sub = AdvSimd.Subtract(indicies_sub, offset);
dest = AdvSimd.Arm64.VectorTableLookupExtension(dest, table7, indicies_sub);
Contributor Author

Looking at this using perf, a lot of time is spent in this function. One reason for that is due to the chain of dependencies.

Splitting the indicies_sub into separate variables didn't make any noticeable difference:
var indicies1 = AdvSimd.Subtract(indicies, Vector128.Create((byte)(16U*1)));
var indicies2 = AdvSimd.Subtract(indicies, Vector128.Create((byte)(16U*2)));
var indicies3 = AdvSimd.Subtract(indicies, Vector128.Create((byte)(16U*3)));
etc

The TBXs could be split out: increment everything in the lookup table so that it starts from 1, do the lookups using TBLs (so failures come back as 0), combine all the results with ORs, and then subtract 1 from the result - see the sketch below.
That would add more complexity, and I very much doubt it's going to give much benefit overall.
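A hedged sketch of that idea, shown for the first two tables (tablePlusOne0/tablePlusOne1 stand for the original tables with every entry incremented by one; indices and v16 as in the code above):

// TBL returns 0 for an out-of-range index, so with +1-biased tables a miss is 0.
Vector128<byte> r0 = AdvSimd.Arm64.VectorTableLookup(tablePlusOne0, indices);
Vector128<byte> r1 = AdvSimd.Arm64.VectorTableLookup(tablePlusOne1, AdvSimd.Subtract(indices, v16));
Vector128<byte> hits = AdvSimd.Or(r0, r1);                                  // at most one table hits per lane
Vector128<byte> result = AdvSimd.Subtract(hits, Vector128.Create((byte)1)); // misses wrap to 0xFF, the invalid marker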

@a74nh
Contributor Author

a74nh commented Jun 8, 2022

Without use of LD4/ST3/TBX4, I very much doubt we can get this above the performance of the existing code. There are some remaining ideas above, but it's very doubtful they will be enough to go from 3x slower to faster, plus they'll just add code complexity.

I'll leave this PR open for a bit longer in case anyone has any good ideas.

@tannergooding
Member

tannergooding commented Jun 8, 2022

@a74nh could you elaborate on why the Arm64 implementation can't basically be a 1-to-1 port of the x86/x64 logic?

There isn't really anything in Ssse3 that isn't in AdvSimd, and Ssse3Decode is pretty trivial (there is nothing there that isn't a simple translation: And to And; Shuffle to TableLookup, which just needs a double-check on edge-case handling; MultiplyAddAdjacent to Unzip + Multiply + AddPairwiseWidening in the worst case).

I do understand using TBX2/3/4 might be even faster, but it'd be better to start with something that gives us the initial gains than to have nothing at all.
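A hedged, naïve rendering of that mapping (a, b, table, and idx are assumed Vector128<byte> values, not code from the PR):

Vector128<byte> andRes  = AdvSimd.And(a, b);                           // Sse2.And(a, b)
Vector128<byte> shufRes = AdvSimd.Arm64.VectorTableLookup(table, idx); // Ssse3.Shuffle(table, idx)
// Edge case to double-check: Ssse3.Shuffle zeroes a lane when the index's high
// bit is set, while TBL zeroes on any index >= 16.
// Ssse3.MultiplyAddAdjacent: see the Multiply + AddPairwiseWidening sketch below.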

@kunalspathak
Member

and the C# plain version is optimised

I don't think that is the case, and I agree with @tannergooding about trying to map it to Ssse3Decode to get an initial perf boost.

Just a long shot: can we spread out the dependencies so that unrelated loads can happen in parallel (likewise, doing unrelated 4-byte loads in batches)?

                 Vector128<short> tmp0 = AdvSimd.Arm64.UnzipEven(str0.AsInt16(), str1.AsInt16());
-                Vector128<short> tmp1 = AdvSimd.Arm64.UnzipOdd(str0.AsInt16(), str1.AsInt16());
                 Vector128<short> tmp2 = AdvSimd.Arm64.UnzipEven(str2.AsInt16(), str3.AsInt16());
-                Vector128<short> tmp3 = AdvSimd.Arm64.UnzipOdd(str2.AsInt16(), str3.AsInt16());
                 str0 = AdvSimd.Arm64.UnzipEven(tmp0.AsByte(), tmp2.AsByte());
                 str1 = AdvSimd.Arm64.UnzipOdd(tmp0.AsByte(), tmp2.AsByte());
-                str2 = AdvSimd.Arm64.UnzipEven(tmp1.AsByte(), tmp3.AsByte());
-                str3 = AdvSimd.Arm64.UnzipOdd(tmp1.AsByte(), tmp3.AsByte());
-
-                // Table lookup on each 16 bytes.
                 str0 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str0, v16);
                 str1 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str1, v16);
+
+                Vector128<short> tmp1 = AdvSimd.Arm64.UnzipOdd(str0.AsInt16(), str1.AsInt16());
+                Vector128<short> tmp3 = AdvSimd.Arm64.UnzipOdd(str2.AsInt16(), str3.AsInt16());
+                str2 = AdvSimd.Arm64.UnzipEven(tmp1.AsByte(), tmp3.AsByte());
+                str3 = AdvSimd.Arm64.UnzipOdd(tmp1.AsByte(), tmp3.AsByte());
                 str2 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str2, v16);
                 str3 = AdvSimdTbx8Byte(v255, dec_lut0, dec_lut1, dec_lut2, dec_lut3, dec_lut4, dec_lut5, dec_lut6, dec_lut7, str3, v16);

@a74nh
Contributor Author

a74nh commented Jun 8, 2022

@a74nh could you elaborate on why the Arm64 implementation can't basically be a 1-to-1 port of the x86/x64 logic?

Let me check this.....

@a74nh
Contributor Author

a74nh commented Jun 8, 2022

@tannergooding: looks like this might be workable. However, I'm just a little stumped on the MultiplyAddAdjacent equivalent. You suggested "Unzip + Multiply + AddPairwiseWidening", but I'm not quite sure exactly how you were doing that - I'm assuming you're thinking of something special here that takes the mergeConstant0 into account (instead of just doing a full MultiplyAddAdjacent in NEON instructions).

@tannergooding
Member

tannergooding commented Jun 8, 2022

You suggested "Unzip + Multiply + AddPairwiseWidening", not quite sure exactly how you were doing that? I'm assuming you're thinking something special here and taking the mergeConstant0 into account. (Instead of just doing a full MultiplyAddAdjacent in neon instructions)

I was simply giving a naïve equivalent. Notably, including Unzip there was an accident; I was conflating two things in my head.

MultiplyAddAdjacent effectively does:

tmp[0] = left[0] * right[0];
tmp[1] = left[1] * right[1];
tmp[2] = left[2] * right[2];
tmp[3] = left[3] * right[3];

res[0] = widen(tmp[0]) + widen(tmp[1]);
res[1] = widen(tmp[2]) + widen(tmp[3]);

So this is just AdvSimd.Multiply followed by AdvSimd.AddPairwiseWidening
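In C#, a literal (hedged) rendering of that description, assuming sbyte lanes and ignoring pmaddubsw's saturation:

Vector128<sbyte> tmp = AdvSimd.Multiply(left, right);    // lane-wise multiply
Vector128<short> res = AdvSimd.AddPairwiseWidening(tmp); // widen and add adjacent pairs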

@a74nh
Contributor Author

a74nh commented Jun 9, 2022

So this is just AdvSimd.Multiply followed by AdvSimd.AddPairwiseWidening

Ahh, good - that's what I was thinking already.

My first version based on the SSE3 version is looking more promising - it's only 1.28x slower, instead of 3.5x.

|       Method |        Job |                                                                                                        Toolchain | NumberOfBytes |     Mean |     Error |    StdDev |   Median |      Min |      Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|------------- |----------- |----------------------------------------------------------------------------------------------------------------- |-------------- |---------:|----------:|----------:|---------:|---------:|---------:|------:|---------------- |----------:|------------:|
| Base64Decode | Job-FVXTZE |       /runtime_HEAD/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |          1000 | 7.189 ns | 0.0054 ns | 0.0050 ns | 7.192 ns | 7.180 ns | 7.195 ns |  1.00 |            Base |         - |          NA |
| Base64Decode | Job-HSKPSG | /runtime_intrinsics/artifacts/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |          1000 | 9.218 ns | 0.0128 ns | 0.0120 ns | 9.223 ns | 9.194 ns | 9.234 ns |  1.28 |          Slower |         - |          NA |

Will see if I can improve it a bit more....

@tannergooding
Member

it's only 1.28x slower,

This is 1.28x slower than the scalar implementation?

@a74nh
Contributor Author

a74nh commented Jun 9, 2022

it's only 1.28x slower,

This is 1.28x slower than the scalar implementation?

Yes. I was comparing against a build with no changes.

So it's quite a bit faster than the NEON version with all the TBX instructions.

@tannergooding
Member

Yeah, but it's surprising that it's slower than the scalar version in this case. I'd be interested in the C# code and the disassembly if you have it :)

@a74nh
Contributor Author

a74nh commented Jun 10, 2022

It looks to me like the micro benchmark for DecodeFromUtf8 isn't quite right:

https://github.com/dotnet/performance/blob/5c59ba485b58516bc4392a24ac605707c48ab998/src/benchmarks/micro/libraries/System.Buffers/Base64Tests.cs#L65

_encodedBytes will be filled with random byte values. However, that means some of them will be invalid base64 values.

That'll cause the SIMD function to exit early with an invalid-value result here:

if (Sse2.MoveMask(Sse2.CompareGreaterThan(Sse2.And(lo, hi), zero)) != 0)

Instead, look at the unit tests:

Base64TestHelper.InitializeDecodableBytes(source, numBytes);

That's using InitializeDecodableBytes to create an array of valid base64 values:

internal static void InitializeDecodableBytes(Span<byte> bytes, int seed = 100)

I've rewritten the microbenchmark to use a copy/pasted version of InitializeDecodableBytes.
Doing this causes my latest NEON version (based on the SSE3 version) to run at 0.4x compared to the scalar-only version. That makes a lot more sense now.
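A hedged sketch of one way to build guaranteed-decodable benchmark input (illustrative only, not the actual dotnet/performance fix): encode random raw bytes up front and benchmark decoding of that output.

private byte[] _encodedBytes;

[GlobalSetup]
public void Setup()
{
    byte[] raw = new byte[NumberOfBytes];
    new Random(100).NextBytes(raw);                         // arbitrary payload, fixed seed
    _encodedBytes = new byte[Base64.GetMaxEncodedToUtf8Length(raw.Length)];
    Base64.EncodeToUtf8(raw, _encodedBytes, out _, out _);  // output is valid base64 by construction
}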

@a74nh
Contributor Author

a74nh commented Jun 13, 2022

I now have two approaches to base64 decode:

NEON specific: (this patch)

  • Uses Vector128 where possible, but is its own function
  • On an Altra, runs at 0.46x compared to the scalar version
  • Some minor improvements could be made (by incorporating review comments)
  • When LD4 etc is available, it’ll be able to go faster.

Vector128 combined: (#70654)

  • Single version for SSE3 and NEON.
  • Still some differences between the two architectures.
  • On an Altra, runs at 0.48x compared to the scalar version (so slightly slower than the NEON version)
  • Don’t expect to be able to get any further improvements

Ultimately, the NEON-specific version is the way to go for pure performance.
Given we don't have LD4 etc. yet, I'm happy with either version for now.

Once we decide which version to use, I can look at Encode (which has exactly the same choices as this function).

I'll also look at posting my fixes to the performance suite too.

@tannergooding
Member

I think the Vector128 combined is the better approach here long term. We can investigate the codegen inefficiencies and fix them up where possible here in .NET 7 or later in .NET 8.

When the JIT gets LD4 support, we can look at if things can be improved. We can likewise look at things like improving the JIT to automatically fold sequential loads into LD2/3/4 and sequential stores into ST2/3/4.

@a74nh
Contributor Author

a74nh commented Jun 13, 2022

I think the Vector128 combined is the better approach here long term. We can investigate the codegen inefficiencies and fix them up where possible here in .NET 7 or later in .NET 8.

I'm coming to this conclusion too. For .NET 7 it is the neater and easier code.

I'll leave this request open for a day and then close it if there are no objections.

@kunalspathak
Member

I've rewritten the microbenchmark to use a copy/pasted version of InitializeDecodableBytes.

Thanks @a74nh for finding that out. Could you please also send a PR to dotnet/performance?

@a74nh
Contributor Author

a74nh commented Jun 13, 2022

I've rewritten the microbenchmark to use a copy/pasted version of InitializeDecodableBytes.

Thanks @a74nh for finding that out. Could you please also send a PR to dotnet/performance?

Will do once I fix up the Encode tests too

@a74nh
Contributor Author

a74nh commented Jun 14, 2022

Thanks @a74nh for finding that out. Could you please also send a PR to dotnet/performance?

dotnet/performance#2479

@a74nh
Contributor Author

a74nh commented Jun 14, 2022

Closing this pull request in favour of #70654

@a74nh a74nh closed this Jun 14, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 15, 2022