Changes:

G_M25396_IG10:
mov eax, dword ptr [rcx+8]
sub eax, edx
cmp r10d, eax
jle SHORT G_M25396_IG11
jmp SHORT G_M25396_IG12
G_M25396_IG11:
mov eax, r10d
G_M25396_IG12:
mov r11d, dword ptr [r8+8]
sub r11d, r9d
cmp r10d, r11d
jle SHORT G_M25396_IG13
jmp SHORT G_M25396_IG14
G_M25396_IG13:
mov r11d, r10d

to:

G_M25400_IG10:
mov eax, dword ptr [rcx+8]
sub eax, edx
cmp r10d, eax
setle r11b
movzx r11, r11b
neg r11d
mov esi, r10d
sub esi, eax
and r11d, esi
add r11d, eax
mov eax, dword ptr [r8+8]
sub eax, r9d
cmp r10d, eax
setle sil
movzx rsi, sil
neg esi
mov edi, r10d
sub edi, eax
and esi, edi
add esi, eax

But size regressions...

Total bytes of diff: 706 (0.02% of base)
diff is a regression.
Total byte diff includes 38 bytes from reconciling methods
Base had 0 unique methods, 0 unique bytes
Diff had 1 unique methods, 38 unique bytes
Top file regressions by size (bytes):
706 : System.Private.CoreLib.dasm (0.02% of base)
1 total files with size differences (0 improved, 1 regressed), 0 unchanged.
Top method regressions by size (bytes):
46 : System.Private.CoreLib.dasm - EventSource:CreateManifestAndDescriptors(ref,ref,ref,int):ref
38 : System.Private.CoreLib.dasm - JitHelpers:ConditionalSelect(bool,int,int):int (0/2 methods)
28 : System.Private.CoreLib.dasm - ConcurrentDictionary`2:Clear():this (2 methods)
27 : System.Private.CoreLib.dasm - String:CompareOrdinal(ref,int,ref,int,int):int
23 : System.Private.CoreLib.dasm - FileStream:ReadAsyncInternal(struct,struct,byref):ref:this
22 : System.Private.CoreLib.dasm - StringBuilder:CopyTo(int,struct,int):this
22 : System.Private.CoreLib.dasm - MemberInfoCache`1:MergeWithGlobalList(ref):this
20 : System.Private.CoreLib.dasm - ConcurrentDictionary`2:GrowTable(ref):this (2 methods)
19 : System.Private.CoreLib.dasm - String:Compare(ref,int,ref,int,int,ref,int):int
19 : System.Private.CoreLib.dasm - StringBuilder:ReplaceInPlaceAtChunk(byref,byref,long,int):this
17 : System.Private.CoreLib.dasm - String:Compare(ref,int,ref,int,int,bool):int
16 : System.Private.CoreLib.dasm - StringBuilder:.ctor(ref,struct):this
16 : System.Private.CoreLib.dasm - StringBuilder:MakeRoom(int,int,byref,byref,bool):this
15 : System.Private.CoreLib.dasm - ConcurrentQueue`1:EnqueueSlow(ref):this
15 : System.Private.CoreLib.dasm - PerCoreLockedStacks:.ctor():this (3 methods)
14 : System.Private.CoreLib.dasm - Math:Max(int,int):int (2 methods)
14 : System.Private.CoreLib.dasm - Math:Min(int,int):int (2 methods)
14 : System.Private.CoreLib.dasm - NameInfo:ReserveEventIDsBelow(int)
14 : System.Private.CoreLib.dasm - StringBuilder:AppendCore(ref,int,int):ref:this
14 : System.Private.CoreLib.dasm - StringBuilder:Replace(ushort,ushort,int,int):ref:this
13 : System.Private.CoreLib.dasm - String:CompareOrdinal(struct,struct):int
13 : System.Private.CoreLib.dasm - CalendarData:GetCalendars(ref,bool,ref):int
12 : System.Private.CoreLib.dasm - PinnableBufferCache:TrimFreeListIfNeeded():bool:this
12 : System.Private.CoreLib.dasm - EventSource:SendManifest(ref):bool:this
12 : System.Private.CoreLib.dasm - CompareInfo:CompareOrdinalIgnoreCase(struct,struct):int
Top method improvements by size (bytes):
-14 : System.Private.CoreLib.dasm - String:Compare(ref,int,ref,int,int,int):int
-2 : System.Private.CoreLib.dasm - Number:TryInt32ToHexStr(int,ushort,int,struct,byref):bool
-1 : System.Private.CoreLib.dasm - Number:TryUInt32ToDecStr(int,int,struct,byref):bool
61 total methods with size differences (3 improved, 58 regressed), 17092 unchanged.
That's likely a relatively minor issue. The real problem is that when branch mispredictions aren't an issue then even the real ...
Presumably the String.Compare code you're showing is this:

int lengthA = length;
int lengthB = length;
if (strA != null)
    lengthA = Math.Min(lengthA, strA.Length - indexA);
if (strB != null)
    lengthB = Math.Min(lengthB, strB.Length - indexB);

If you have reasons to try to improve this then probably the first attempt would be to get rid of the Math.Min call:

if (strA.Length - indexA < lengthA)
    lengthA = strA.Length - indexA;

And then if branch prediction is an issue then you'd probably want to try something like below. Not sure if it's completely correct but if it works, this approach comes before a solution that requires a compare and conversion from bool to int.

int delta = (strA.Length - indexA) - length;
delta = delta >> 31;
lengthA = (delta & (strA.Length - indexA)) | (~delta & lengthA);
Our experience dealing with string comparison suggests @mikedn is right. The size checks are not typically the issue, but the actual memory comparison itself, where the biggest bottleneck is memory latency. I would seriously kill for the ability to emit CMOV for the individual cases where you can certainly win, but in the context of strings where Math.Min/Math.Max appears in the framework it is probably not worth it. EDIT: Now that you used a C# shim, emitting the ...
Yeah, for Math.Min I'm not convinced there's much value add since this pattern really works better with constant inputs. But there are other APIs that show substantial perf gains, e.g., +60% for Int32.CompareTo with a slightly tweaked pattern. I'll investigate PRs for those.
Really I'd just like ...
It might be reasonable as a hardware intrinsic.
I do not think so. It is something that the JIT should just pattern match.
@jkotas Not always... there are cases where ...
Is there prior art where folks have to tell C/C++ compilers when to emit CMOV?
I believe most of the C/C++ compilers will automatically do it, but it also depends on several options (such as emit smaller or emit faster code) and sometimes depends on more complex analysis, which they would have time to do (since the conditions under which ...
@jkotas @tannergooding Actually you are asking yourself a trick question here. If the underlying question is the strict "Can you force the C/C++ compiler to emit a CMOV?", then no. However, if the underlying question is not the strict version but "Can I in C/C++ emit a CMOV?", then yes, with inline assembly. Branchless binary search core loop: [EDIT: Wrong algorithm]

__asm__ (
    "comiss %[xi_mid],%[z]\n"
    "cmovb %[mid],%[hi]\n"
    "cmovae %[mid],%[lo]\n"
    : [hi] "+r"(hi), [lo] "+r"(lo)
    : [mid] "rm"(mid), [xi_mid] "xm"(xi[mid]), [z] "x"(z)
    : "cc"
);
I'm not aware of any such way to influence native compilers, but you never know. The analysis for cmov isn't particularly complex or costly. Certain idioms are well known to be inherently unpredictable (max, min, etc) and you can start with things that look like those. But it's best done fairly late and the JIT's IR is not really as friendly here as it could be. So the issue around cmov is more of ROI: it is a lot of work for something that has very selective impact.
Adding to what @AndyAyersMS commented. A year or so ago I proposed being able to give extra information to the JIT about what branch is ...
Here's an alternative:

[NonVersionable]
public static int Max(int val1, int val2)
{
    var selectVal2Mask = JitHelpers.BoolToByteNonNormalized(val1 >= val2) - 1;
    return (val2 & selectVal2Mask) + (val1 & ~selectVal2Mask);
}

This gives a 280% throughput boost on my machine for random inputs (old: 2,150 ms for 500MM iterations; new: 770 ms for 500MM iterations). Of course, ...
Based on ...
All well known C/C++ compilers are pretty aggressive at CMOV generation. Small/Fast code options usually don't impact this since CMOV tends to be more or less good at both. What I would expect to impact CMOV generation is things like PGO, if a compiler knows that a branch is rarely taken then it could fallback to normal conditional code.
I don't know. I did it in lowering and the JIT's IR worked well enough. The only thing that I found to be rather cumbersome is the fact that the JIT's IR doesn't have instructions and operands but a soup of nodes that are both. This tends to make the pattern matching code slightly more complex. It probably would have helped if containment pulled contained nodes out of the linear order so they look more like "operands".
I'm happy to do such work if the JIT team supports it by reviewing, suggesting improvements etc. I already did an experiment about a year ago but then the JIT team provided basically no feedback...
How well does that work on not-so-random input (e.g. a mostly sorted array)?
Re cmov: I worry about introducing mid-block flag consumers given that we don't track flag liveness. But I suppose, given decomposition of large int ops and unordered float compares, we're already in this boat. And it's also a blessing that we don't do other late peepholes, though the code cries out for it in places. Sorry that we didn't get back to you about your earlier experiment. If you're willing to give it another go I'll make sure we pay attention.
I worry about that too but it seems to work quite well in practice. At least for as long as LSRA doesn't start doing more fancy things...
Any examples? It would be useful to have a list to see what may potentially interfere with flags.
@mikedn Assuming perfect branch prediction, the perf hit for the non-branching version is around -30%. The ...
Right, with ...
Yes, most of this revolves around the JIT recognizing ... I need to dust off my if-conversion experiment. The biggest unknown about it is the impact on JIT throughput; back then I didn't have the necessary tools to measure it. Then maybe we can put this matter to rest one way or another.
@mikedn I have a benchmark I am working on that has this particular idiom. We can use it for benchmarking:
Code generated running on master (2 days ago) is ugly as hell:

mov rbp,qword ptr [rdx]
xor rbp,qword ptr [rdx+rcx*8]
je 00007FFC5F8CA8EA
mov ebp,1
jmp 00007FFC5F8CA8EC
xor ebp,ebp
or r9,rbp

EDIT: @AndyAyersMS @jkotas any plan to expose JitHelpers as a public API? (If it is exposed already, what package do I import to use it?)
This doesn't even require CMOV; it should generate something like:

mov rbp,qword ptr [rdx]
xor rbp,qword ptr [rdx+rcx*8]
setne al
movzx eax, al ; actually optional, or can be narrowed to "or r9l, al"
or r9, rax
That code can use ...
@mikedn yes I know... It is a special case for that last one. It probably should have read 'that idiom' instead of 'this idiom' to disambiguate.
I think we're in somewhat murky waters here as far as what we are willing to do for performance.

On the one hand, and speaking as a jit developer, I'd really prefer that we invest in improving the capabilities of the jit. The sorts of patterns discussed here and in other related issues are common and arise in both predictable and unexpected ways in user code (in particular a lot of them only become visible after inlining). The optimal expansions are often machine and context dependent, and there are a lot of them. Encapsulating these patterns via helpers/intrinsics/etc leads to code that is now studded with odd performance barnacles. It leads to a kind of performance mysticism, giving off the impression that one must resort to contortions and somewhat arcane practices to get good performance out of C#. And by presuming to handle the important perf cases this way there's then a bit less incentive for the jit team to go in and fix the underlying issues, as "anyone who cares" has presumably already coded around the problems.

On the other hand, modern CPUs have a number of performance cliffs and the perf gap between naive codegen and optimal codegen can be quite large. So I get that it's frustrating as a developer to not be able to close this gap when performance really is a key factor. In cases like the HW intrinsics where it's very unlikely the JIT will ever be able to organically use the full set of operations the machine provides, exposing the raw hardware capabilities is reasonable. But in cases where we're talking about more basic idiomatic patterns I'd prefer we hold off and give the JIT team a chance to step up and fix the problems.

On the (third?) hand we also need to be pragmatic and realize that the JIT team is relatively small and moves relatively slowly. So I think that leads us to a case-by-case evaluation of various proposals, and here I would like to see us more clearly provide guidance on what sorts of perf justify what sorts of changes.
I agree. These are good experiments to demonstrate the benefits, but I do not think we want to be adding cryptic code like this to the framework. We should wait for the JIT fix.
I wonder if some of this work can be offloaded onto Roslyn instead of RyuJIT. For example, the pattern ...
I'm not sure where the String.Compare example came from; the one I found has some null checks that are missing in the presented example code. In any case, the resulting code is:

test rsi, rsi
je SHORT G_M30462_IG04
mov eax, dword ptr [rsi+8]
sub eax, ebx
cmp r12d, eax
mov r15d, eax
cmovle r15d, r12d
G_M30462_IG04:
test rdi, rdi
je SHORT G_M30462_IG05
mov eax, dword ptr [rdi+8]
sub eax, r14d
cmp r12d, eax
cmovle eax, r12d
mov r12d, eax
G_M30462_IG05:
String:Compare was mostly from looking at the regressions; not as an example of a good result 😄
int Math.Clamp(int value, int min, int max) might be something to look at; and it's probably what the ...
Only found 2 that should be clamp, both in StringBuilder #16219 |
Closing this as investigation into ...
I wonder why it can't be just:

[NonVersionable]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int Max(int val1, int val2)
{
    if (Sse41.IsSupported)
    {
        Vector128<int> vec1 = Vector128.CreateScalarUnsafe(val1);
        Vector128<int> vec2 = Vector128.CreateScalarUnsafe(val2);
        return Sse41.Max(vec1, vec2).ToScalar();
    }
    return (val1 >= val2) ? val1 : val2;
    // or Avx2.Max and ymm
}

A small benchmark: https://gist.github.com/EgorBo/9e99d6958f4db162c075badaac29b5f2
case 2: equal data
Some C++ compilers also do it: https://godbolt.org/z/8TDdoh
Mostly an experiment
/cc @mikedn @GrabYourPitchforks