[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

saucecontrol · 2017-07-02T08:12:44Z

In image processing, it's common to map byte values (8-bit-per-channel pixels) through a lookup table.

RyuJit x64 apparently uses a 32-bit movzx to read byte values, then widens using movsxd before using them as address offsets.

Here's a quick sample program demonstrating the issue:

using System;
using System.Runtime.CompilerServices;

class Program
{
	[MethodImpl(MethodImplOptions.NoInlining)]
	unsafe static void LUTMap(byte* src, float* dest, float* lut, int cb)
	{
		byte* end = src + cb - 4;
		while(src <= end)
		{
			dest[0] = lut[src[0]];
			dest[1] = lut[src[1]];
			dest[2] = lut[src[2]];
			dest[3] = lut[src[3]];

			src  += 4;
			dest += 4;
		}
	}

	unsafe static void Main(string[] args)
	{
		var bytes = new byte[1024 * 1024];
		var floats = new float[bytes.Length];
		var lut = new float[256];


		new Random(42).NextBytes(bytes);
		for (int i = 0; i < lut.Length; i++)
			lut[i] = (float)Math.Pow(i, 1 / 2.4);

		fixed (byte* pbytes = &bytes[0])
		fixed (float* pfloats = &floats[0], plut = &lut[0])
		{
			LUTMap(pbytes, pfloats, plut, bytes.Length);
		}
	}
}

RyuJit x64 generates the following for the LUTMap method

; Assembly listing for method Program:LUTMap(long,long,long,int)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 32   )    long  ->  rcx
;  V01 arg1         [V01,T01] (  8, 26   )    long  ->  rdx
;  V02 arg2         [V02,T02] (  6, 18   )    long  ->   r8
;  V03 arg3         [V03,T04] (  3,  3   )     int  ->   r9
;  V04 loc0         [V04,T03] (  3,  6   )    long  ->  rax
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M15454_IG01:
       C5F877               vzeroupper

G_M15454_IG02:
       4963C1               movsxd   rax, r9d
       488D4401FC           lea      rax, [rcx+rax-4]
       483BC8               cmp      rcx, rax
       7760                 ja       SHORT G_M15454_IG04

G_M15454_IG03:
       440FB609             movzx    r9, byte  ptr [rcx]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A1102           vmovss   dword ptr [rdx], xmm0
       440FB64901           movzx    r9, byte  ptr [rcx+1]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114204         vmovss   dword ptr [rdx+4], xmm0
       440FB64902           movzx    r9, byte  ptr [rcx+2]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114208         vmovss   dword ptr [rdx+8], xmm0
       440FB64903           movzx    r9, byte  ptr [rcx+3]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A11420C         vmovss   dword ptr [rdx+12], xmm0
       4883C104             add      rcx, 4
       4883C210             add      rdx, 16
       483BC8               cmp      rcx, rax
       76A5                 jbe      SHORT G_M15454_IG03

G_M15454_IG04:
       C3                   ret

; Total bytes of code 108, prolog size 3 for method Program:LUTMap(long,long,long,int)
; ============================================================

Can it be modified to use the 64-bit movzx in cases like this?

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium
impact:small

The text was updated successfully, but these errors were encountered:

mikedn · 2017-07-02T08:28:00Z

Can it be modified to use the 64-bit movzx in cases like this?

Likely but it's not trivial unfortunately. Making the memory load produce a 64 bit value directly it's problematic due to the they way loads are represented in JIT. But it may be possible to turn the cast that follows it into a no-op because the register (r9 in this case) happens to already contain the correct value.

As a workaround you can cast the byte to uint, you'll still get a redundant instruction but at least it will be movzx (which is cheaper) instead of movsx.

saucecontrol · 2017-07-02T08:38:40Z

The uint cast actually flips the movsxd r9, r9d to mov r9d, r9d, so I guess I can get a no-op now, albeit a 3 byte one. Thanks for the hint.

mikedn · 2017-07-02T08:46:29Z

The uint cast actually flips the movsxd r9, r9d to mov r9d, r9d, so I guess I can get a no-op now

Right, it generates mov rather than movzx. Both mov and movzx have the advantage of having 0 latency (the CPU frontend handles them directly without generating a uop).

RussKeldorph · 2017-07-07T00:35:13Z

@dotnet/jit-contrib

mikedn · 2017-07-07T05:57:53Z

Looks like we should be able to handle this by making the cast operand contained (when it's an indir). Codegen needs a bit of work, it doesn't seem possible to handle this right now.

mikedn · 2017-07-07T16:58:59Z

@RussKeldorph I'm working on a fix in dotnet/coreclr#12676, you can assign to me

mikedn · 2017-07-19T05:21:17Z

As far as I can tell this also affects ARM64:

static long Test(ref sbyte x) => (long)x;

generates

G_M55886_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
G_M55886_IG02:
        39800000          ldrsb   x0, [x0]
        93407C00          sxtw    x0, x0
G_M55886_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

@sdmaclea The sxtw instruction is redundant since ldrsb already sign extended to 64 bit, right?

sdmaclea · 2017-07-19T14:15:40Z

@mikedn Correct sxtw is redundant correct

https://github.com/dotnet/coreclr/issues/12595

saucecontrol · 2018-08-01T19:39:32Z

I've just noticed similar redundant instructions when narrowing (u)int->(u)short as well. Here, the i0-i3 variables are int, and they're being stored to ushort* dest

   235: 					dest[0] = (ushort)i0;
00007FFC74A01C41  movzx       r9d,r9w  
00007FFC74A01C45  mov         word ptr [r8],r9w  
   236: 					dest[1] = (ushort)i1;
00007FFC74A01C4A  movzx       r9d,r10w  
00007FFC74A01C4E  mov         word ptr [r8+2],r9w  
   237: 					dest[2] = (ushort)i2;
00007FFC74A01C53  movzx       r9d,r11w  
00007FFC74A01C57  mov         word ptr [r8+4],r9w

Would this be the same issue?

mikedn · 2018-08-01T19:49:50Z

Nope, different issue. What is i0, do you have a complete example?

mikedn · 2018-08-01T19:51:17Z

BTW, after encountering don't remember how many bugs while working on this issue I'm finally back on it. Hopefully I'll fix it this month.

saucecontrol · 2018-08-01T19:56:28Z

i0 is a local int variable used for some fixed-point math. After calculation, it's stored in 16bits. This was something I just spotted while debugging some code on net472. I'll create a repro and verify it's still an issue on the current master. Will open a new issue if so.

saucecontrol · 2018-08-01T21:35:32Z

Sorry, that was a false alarm. It repros under netcoreapp2.0 but not under netcoreapp2.1 or current master.

jakobbotsch · 2024-07-08T09:26:29Z

Looks fixed today:

; Assembly listing for method Program:LUTMap(ulong,ulong,ulong,int) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 32   )    long  ->  rcx        
;  V01 arg1         [V01,T01] (  8, 26   )    long  ->  rdx        
;  V02 arg2         [V02,T02] (  6, 18   )    long  ->   r8         single-def
;  V03 arg3         [V03,T04] (  3,  3   )     int  ->   r9         single-def
;  V04 loc0         [V04,T03] (  3,  6   )    long  ->  rax         single-def
;# V05 OutArgs      [V05    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M55172_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M55172_IG02:  ;; offset=0x0000
       movsxd   rax, r9d
       lea      rax, [rcx+rax-0x04]
       cmp      rcx, rax
       ja       SHORT G_M55172_IG04
       align    [0 bytes for IG03]
						;; size=13 bbWeight=1 PerfScore 2.50
G_M55172_IG03:  ;; offset=0x000D
       movzx    r10, byte  ptr [rcx]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx], xmm0
       movzx    r10, byte  ptr [rcx+0x01]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx+0x04], xmm0
       movzx    r10, byte  ptr [rcx+0x02]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx+0x08], xmm0
       movzx    r10, byte  ptr [rcx+0x03]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx+0x0C], xmm0
       add      rcx, 4
       add      rdx, 16
       cmp      rcx, rax
       jbe      SHORT G_M55172_IG03
						;; size=75 bbWeight=4 PerfScore 135.00
G_M55172_IG04:  ;; offset=0x0058
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

BruceForstall assigned mikedn Jul 7, 2017

saucecontrol referenced this issue in saucecontrol/PhotoSauce Jul 27, 2017

partial fix for RyuJIT perf issue

2a92d02

https://github.com/dotnet/coreclr/issues/12595

mikedn removed their assignment Jan 18, 2020

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

saucecontrol mentioned this issue Apr 30, 2020

Code inefficiencies in loop array indexing #35618

Closed

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022

jakobbotsch closed this as completed Jul 8, 2024

github-actions bot locked and limited conversation to collaborators Aug 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

saucecontrol commented Jul 2, 2017 •

edited by BruceForstall

Loading

mikedn commented Jul 2, 2017

saucecontrol commented Jul 2, 2017 •

edited

Loading

mikedn commented Jul 2, 2017

RussKeldorph commented Jul 7, 2017

mikedn commented Jul 7, 2017

mikedn commented Jul 7, 2017

mikedn commented Jul 19, 2017

sdmaclea commented Jul 19, 2017

saucecontrol commented Aug 1, 2018

mikedn commented Aug 1, 2018

mikedn commented Aug 1, 2018

saucecontrol commented Aug 1, 2018

saucecontrol commented Aug 1, 2018

jakobbotsch commented Jul 8, 2024

[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

Comments

saucecontrol commented Jul 2, 2017 • edited by BruceForstall Loading

mikedn commented Jul 2, 2017

saucecontrol commented Jul 2, 2017 • edited Loading

mikedn commented Jul 2, 2017

RussKeldorph commented Jul 7, 2017

mikedn commented Jul 7, 2017

mikedn commented Jul 7, 2017

mikedn commented Jul 19, 2017

sdmaclea commented Jul 19, 2017

saucecontrol commented Aug 1, 2018

mikedn commented Aug 1, 2018

mikedn commented Aug 1, 2018

saucecontrol commented Aug 1, 2018

saucecontrol commented Aug 1, 2018

jakobbotsch commented Jul 8, 2024

saucecontrol commented Jul 2, 2017 •

edited by BruceForstall

Loading

saucecontrol commented Jul 2, 2017 •

edited

Loading