Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RyuJit/x64] Generates extra movsxd when using a byte value as an address offset #8465

Closed
saucecontrol opened this issue Jul 2, 2017 · 14 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@saucecontrol
Copy link
Member

saucecontrol commented Jul 2, 2017

In image processing, it's common to map byte values (8-bit-per-channel pixels) through a lookup table.

RyuJit x64 apparently uses a 32-bit movzx to read byte values, then widens using movsxd before using them as address offsets.

Here's a quick sample program demonstrating the issue:

using System;
using System.Runtime.CompilerServices;

class Program
{
	[MethodImpl(MethodImplOptions.NoInlining)]
	unsafe static void LUTMap(byte* src, float* dest, float* lut, int cb)
	{
		byte* end = src + cb - 4;
		while(src <= end)
		{
			dest[0] = lut[src[0]];
			dest[1] = lut[src[1]];
			dest[2] = lut[src[2]];
			dest[3] = lut[src[3]];

			src  += 4;
			dest += 4;
		}
	}

	unsafe static void Main(string[] args)
	{
		var bytes = new byte[1024 * 1024];
		var floats = new float[bytes.Length];
		var lut = new float[256];


		new Random(42).NextBytes(bytes);
		for (int i = 0; i < lut.Length; i++)
			lut[i] = (float)Math.Pow(i, 1 / 2.4);

		fixed (byte* pbytes = &bytes[0])
		fixed (float* pfloats = &floats[0], plut = &lut[0])
		{
			LUTMap(pbytes, pfloats, plut, bytes.Length);
		}
	}
}

RyuJit x64 generates the following for the LUTMap method

; Assembly listing for method Program:LUTMap(long,long,long,int)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 32   )    long  ->  rcx
;  V01 arg1         [V01,T01] (  8, 26   )    long  ->  rdx
;  V02 arg2         [V02,T02] (  6, 18   )    long  ->   r8
;  V03 arg3         [V03,T04] (  3,  3   )     int  ->   r9
;  V04 loc0         [V04,T03] (  3,  6   )    long  ->  rax
;# V05 OutArgs      [V05    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0

G_M15454_IG01:
       C5F877               vzeroupper

G_M15454_IG02:
       4963C1               movsxd   rax, r9d
       488D4401FC           lea      rax, [rcx+rax-4]
       483BC8               cmp      rcx, rax
       7760                 ja       SHORT G_M15454_IG04

G_M15454_IG03:
       440FB609             movzx    r9, byte  ptr [rcx]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A1102           vmovss   dword ptr [rdx], xmm0
       440FB64901           movzx    r9, byte  ptr [rcx+1]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114204         vmovss   dword ptr [rdx+4], xmm0
       440FB64902           movzx    r9, byte  ptr [rcx+2]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A114208         vmovss   dword ptr [rdx+8], xmm0
       440FB64903           movzx    r9, byte  ptr [rcx+3]
       4D63C9               movsxd   r9, r9d
       C4817A100488         vmovss   xmm0, dword ptr [r8+4*r9]
       C4E17A11420C         vmovss   dword ptr [rdx+12], xmm0
       4883C104             add      rcx, 4
       4883C210             add      rdx, 16
       483BC8               cmp      rcx, rax
       76A5                 jbe      SHORT G_M15454_IG03

G_M15454_IG04:
       C3                   ret

; Total bytes of code 108, prolog size 3 for method Program:LUTMap(long,long,long,int)
; ============================================================

Can it be modified to use the 64-bit movzx in cases like this?

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium
impact:small

@mikedn
Copy link
Contributor

mikedn commented Jul 2, 2017

Can it be modified to use the 64-bit movzx in cases like this?

Likely but it's not trivial unfortunately. Making the memory load produce a 64 bit value directly it's problematic due to the they way loads are represented in JIT. But it may be possible to turn the cast that follows it into a no-op because the register (r9 in this case) happens to already contain the correct value.

As a workaround you can cast the byte to uint, you'll still get a redundant instruction but at least it will be movzx (which is cheaper) instead of movsx.

@saucecontrol
Copy link
Member Author

saucecontrol commented Jul 2, 2017

The uint cast actually flips the movsxd r9, r9d to mov r9d, r9d, so I guess I can get a no-op now, albeit a 3 byte one. Thanks for the hint.

@mikedn
Copy link
Contributor

mikedn commented Jul 2, 2017

The uint cast actually flips the movsxd r9, r9d to mov r9d, r9d, so I guess I can get a no-op now

Right, it generates mov rather than movzx. Both mov and movzx have the advantage of having 0 latency (the CPU frontend handles them directly without generating a uop).

@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

@mikedn
Copy link
Contributor

mikedn commented Jul 7, 2017

Looks like we should be able to handle this by making the cast operand contained (when it's an indir). Codegen needs a bit of work, it doesn't seem possible to handle this right now.

@mikedn
Copy link
Contributor

mikedn commented Jul 7, 2017

@RussKeldorph I'm working on a fix in dotnet/coreclr#12676, you can assign to me

@mikedn
Copy link
Contributor

mikedn commented Jul 19, 2017

As far as I can tell this also affects ARM64:

static long Test(ref sbyte x) => (long)x;

generates

G_M55886_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
G_M55886_IG02:
        39800000          ldrsb   x0, [x0]
        93407C00          sxtw    x0, x0
G_M55886_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

@sdmaclea The sxtw instruction is redundant since ldrsb already sign extended to 64 bit, right?

@sdmaclea
Copy link
Contributor

@mikedn Correct sxtw is redundant correct

@saucecontrol
Copy link
Member Author

I've just noticed similar redundant instructions when narrowing (u)int->(u)short as well. Here, the i0-i3 variables are int, and they're being stored to ushort* dest

   235: 					dest[0] = (ushort)i0;
00007FFC74A01C41  movzx       r9d,r9w  
00007FFC74A01C45  mov         word ptr [r8],r9w  
   236: 					dest[1] = (ushort)i1;
00007FFC74A01C4A  movzx       r9d,r10w  
00007FFC74A01C4E  mov         word ptr [r8+2],r9w  
   237: 					dest[2] = (ushort)i2;
00007FFC74A01C53  movzx       r9d,r11w  
00007FFC74A01C57  mov         word ptr [r8+4],r9w

Would this be the same issue?

@mikedn
Copy link
Contributor

mikedn commented Aug 1, 2018

Nope, different issue. What is i0, do you have a complete example?

@mikedn
Copy link
Contributor

mikedn commented Aug 1, 2018

BTW, after encountering don't remember how many bugs while working on this issue I'm finally back on it. Hopefully I'll fix it this month.

@saucecontrol
Copy link
Member Author

i0 is a local int variable used for some fixed-point math. After calculation, it's stored in 16bits. This was something I just spotted while debugging some code on net472. I'll create a repro and verify it's still an issue on the current master. Will open a new issue if so.

@saucecontrol
Copy link
Member Author

Sorry, that was a false alarm. It repros under netcoreapp2.0 but not under netcoreapp2.1 or current master.

@mikedn mikedn removed their assignment Jan 18, 2020
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@TIHan TIHan removed the JitUntriaged CLR JIT issues needing additional triage label Oct 31, 2022
@jakobbotsch
Copy link
Member

Looks fixed today:

; Assembly listing for method Program:LUTMap(ulong,ulong,ulong,int) (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 11, 32   )    long  ->  rcx        
;  V01 arg1         [V01,T01] (  8, 26   )    long  ->  rdx        
;  V02 arg2         [V02,T02] (  6, 18   )    long  ->   r8         single-def
;  V03 arg3         [V03,T04] (  3,  3   )     int  ->   r9         single-def
;  V04 loc0         [V04,T03] (  3,  6   )    long  ->  rax         single-def
;# V05 OutArgs      [V05    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M55172_IG01:  ;; offset=0x0000
						;; size=0 bbWeight=1 PerfScore 0.00
G_M55172_IG02:  ;; offset=0x0000
       movsxd   rax, r9d
       lea      rax, [rcx+rax-0x04]
       cmp      rcx, rax
       ja       SHORT G_M55172_IG04
       align    [0 bytes for IG03]
						;; size=13 bbWeight=1 PerfScore 2.50
G_M55172_IG03:  ;; offset=0x000D
       movzx    r10, byte  ptr [rcx]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx], xmm0
       movzx    r10, byte  ptr [rcx+0x01]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx+0x04], xmm0
       movzx    r10, byte  ptr [rcx+0x02]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx+0x08], xmm0
       movzx    r10, byte  ptr [rcx+0x03]
       vmovss   xmm0, dword ptr [r8+4*r10]
       vmovss   dword ptr [rdx+0x0C], xmm0
       add      rcx, 4
       add      rdx, 16
       cmp      rcx, rax
       jbe      SHORT G_M55172_IG03
						;; size=75 bbWeight=4 PerfScore 135.00
G_M55172_IG04:  ;; offset=0x0058
       ret      
						;; size=1 bbWeight=1 PerfScore 1.00

@github-actions github-actions bot locked and limited conversation to collaborators Aug 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants