-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: ARM64 SVE format encodings, SVE_IF_4A
to SVE_JK_4B
#97739
Conversation
…er SVE format group.
…w for encoding elem size.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsContributes to #94549 Adds 35 formats. This is a large one, but it is better to do these all at once. Progress:
|
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.04%)
MinOpts (+0.05% to +0.07%)
FullOpts (+0.02% to +0.03%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.00% to +0.01%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.05%)
MinOpts (+0.06% to +0.10%)
FullOpts (+0.02% to +0.03%)
Details here |
@dotnet/jit-contrib @dotnet/arm64-contrib @kunalspathak @a74nh this is ready. |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to -0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.06%)
MinOpts (+0.07% to +0.12%)
FullOpts (+0.02% to +0.03%)
Details here |
I can look. My first guess is the additional cases in emitIns_R_R_R_R. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need another pass through (there's a lot of code!), but so far this is looking good.
INS_OPTS_SCALABLE_S); // LDNT1SB {<Zt>.S }, <Pg>/Z, [<Zn>.S{, <Xm>}] | ||
theEmitter->emitIns_R_R_R_R(INS_sve_ldnt1sh, EA_SCALABLE, REG_V3, REG_P4, REG_V1, REG_R2, | ||
INS_OPTS_SCALABLE_S); // LDNT1SH {<Zt>.S }, <Pg>/Z, [<Zn>.S{, <Xm>}] | ||
// REG_ZR can be used due to the optional {, <Xm>} of the format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could use emitIns_R_R_R()
for these, but I think your way is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did think about it. We still have to pass and encode REG_ZR regardless so it just made sense to keep it as emitIns_R_R_R_R
.
*/ | ||
|
||
void emitter::emitIns_BARR(instruction ins, insBarrier barrier) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right in thinking you've not touched emitIns_BARR()
or emitIns_R_R_R_COND()
and this is just the diff getting confused?
Which emitIns_
functions have you changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff is confused... never touched those. I only touched emitIns_R_R_R_R
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably going to create a new emitIns
called emitInsSve_R_R_R_R
and put all the handling of the sve instructions in that and have emitIns_R_R_R_R
call into it as a fallback. I'm hopeful that would limit the TP regressions and have it be more organized.
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Throughput diffs for windows/arm64 ran on windows/x64MinOpts (-0.01% to +0.00%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.06%)
MinOpts (+0.07% to +0.12%)
FullOpts (+0.02% to +0.03%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (+0.02% to +0.06%)
MinOpts (+0.07% to +0.12%)
FullOpts (+0.02% to +0.03%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.66% to -0.26%)
MinOpts (-1.28% to -0.76%)
FullOpts (-0.43% to -0.23%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.59% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.43% to -0.23%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.57% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.43% to -0.24%)
Details here |
@kunalspathak looks like I solved the TP regression :) |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.01% to -0.00%)
MinOpts (-0.04% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the one thing below, LGTM.
Disclaimer: I've not checked checked every single pattern by hand as there are a lot.
@@ -6344,6 +6344,384 @@ void CodeGen::genArm64EmitterUnitTestsSve() | |||
INS_OPTS_SCALABLE_D); // LDFF1SH {<Zt>.D }, <Pg>/Z, [<Xn|SP>, <Zm>.D] | |||
theEmitter->emitIns_R_R_R_R(INS_sve_ldff1w, EA_SCALABLE, REG_V4, REG_P3, REG_R2, REG_V1, | |||
INS_OPTS_SCALABLE_D); // LDFF1W {<Zt>.D }, <Pg>/Z, [<Xn|SP>, <Zm>.D] | |||
|
|||
// IF_SVE_IF_4A | |||
theEmitter->emitIns_R_R_R_R(INS_sve_ldnt1b, EA_SCALABLE, REG_V3, REG_P2, REG_V1, REG_R0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably call emitInsSve_R_R_R_R()
directly here and remove the changes from emitIns_R_R_R_R()
?
@kunalspathak : not sure what your future plans for splitting things up was?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could go either way. The benefit of calling emitInsSve
is intent, and we could remove the scalable options from the emitIns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for covering many formats. Added some suggestions.
INS_SCALABLE_OPTS_LSL_N); // LD3Q {<Zt1>.Q, <Zt2>.Q, <Zt3>.Q }, <Pg>/Z, [<Xn|SP>, | ||
// <Xm>, | ||
// LSL #4] | ||
theEmitter->emitIns_R_R_R_R(INS_sve_ld4q, EA_SCALABLE, REG_V5, REG_P1, REG_R4, REG_R3, INS_OPTS_SCALABLE_Q, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hhm, lot of instructions that needs consecutive registers.
src/coreclr/jit/emitarm64.cpp
Outdated
* for the 'dtype' field. | ||
*/ | ||
|
||
/*static*/ emitter::code_t emitter::insEncodeSveElemsize_dtype_ld1w(instruction ins, insFormat fmt, emitAttr size, code_t code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason why this can't be part of insEncodeSveElemsize_dtype()
? For each size
, you can have a case INS_sve_ld1w
and then a switch/case for the formats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, ld1w
has to be handled differently for two formats while the other cases in insEncodeSveElemsize_dtype
don't need to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, so can you send the fmt
to insEncodeSveElemsize_dtype
and handle it accordingly instead of creating a new method?
Edit: possibly, we might need fmt
to handle future instructions as well, so should be ok to send it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt
isn't needed for the other instructions, it's only needed for ld1w
. Passing in fmt
for insEncodeSveElemsize_dtype
would imply that it will check the format for the other instructions which I didn't want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What makes ld1w
special is that we have to set specific bits that are actually unrelated to dtype
, but related to elemsize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing in fmt for insEncodeSveElemsize_dtype would imply that it will check the format for the other instructions which I didn't want
You can just a case ld1w
and do whatever you are doing in insEncodeSveElemsize_dtype_ld1w
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed this offline - we decided to keep insEncodeSveElemsize_dtype_ld1w
but if more instructions need similar special handling, then we will merge it in insEncodeSveElemsize_dtype
.
break; | ||
|
||
default: | ||
assert(!"Invalid instruction"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is little odd that we have 4 cases that can land us inside this switch/case and here too we need to add default
and assert
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default
case should never be hit here. The assert is more for sanity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, I am just pointing that we are writing odd looking pattern and testing C++ compiler here :) But honestly, not sure what's the best alternative here.
switch (ins)
{
case insA:
case insB:
case insC:
case insD:
...
...
switch (ins)
{
case insA:
case insB:
case insC:
case insD:
...
default:
assert("c++ compiler messed up :)");
}
}
fmt = IF_SVE_IK_4A_I; | ||
break; | ||
|
||
default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise here and down below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with this, default case should never be hit.
* Returns true if the SVE instruction has a LSL addr. | ||
* This is for formats that have [<Xn|SP>, <Xm>, LSL #N], [<Xn|SP>{, <Xm>, LSL #N}] | ||
*/ | ||
/*static*/ bool emitter::insSveIsLslN(instruction ins, insFormat fmt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this kind of methods should really be converted to a table driven lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added this in one of the task of #93095
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, though we shouldn't do it in this PR. A follow-up will be better.
src/coreclr/jit/emit.h
Outdated
@@ -2397,6 +2398,7 @@ class emitter | |||
void emitAdvanceInstrDesc(instrDesc** id, size_t idSize) const; | |||
size_t emitIssue1Instr(insGroup* ig, instrDesc* id, BYTE** dp); | |||
size_t emitOutputInstr(insGroup* ig, instrDesc* id, BYTE** dp); | |||
BYTE* emitInstrSve(instrDesc* id, BYTE* dst); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename it to emitOutputInstrSve()
. Also should be guarded by #ifdef TARGET_ARM64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally wanted to name it that, but emitOutputInstrSve
can't be called independently of emitOutputInstr
, and the signature is different, such as emitOutputInstr
returns a size_t
where as the emitOutputInstrSve
would return a BYTE*
.
Maybe I'm being too picky on the name. I noticed we also have a emitOutput_Instr
which has a similar signature to emitInstrSve
. Maybe I can just call it emitOutput_InstrSve
... just putting an underscore in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I wasn't thinking, this should have just been defined in emitarm64.h
rather than in emit.h
.
if (isVectorRegister(reg1)) | ||
{ | ||
// If the overall instruction is working on 128-bit | ||
// registers, the size of this register for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the capstone matches with JITDisasm, I just skimmed through the display code.
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.62% to -0.26%)
MinOpts (-1.27% to -0.76%)
FullOpts (-0.44% to -0.25%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.71% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.43% to -0.24%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.58% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.02% to -0.00%)
MinOpts (-0.08% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.62% to -0.26%)
MinOpts (-1.28% to -0.76%)
FullOpts (-0.44% to -0.25%)
Throughput diffs for osx/arm64 ran on linux/x64Overall (-0.71% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Throughput diffs for windows/arm64 ran on linux/x64Overall (-0.58% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.62% to -0.26%)
MinOpts (-1.28% to -0.76%)
FullOpts (-0.44% to -0.25%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.71% to -0.27%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.58% to -0.26%)
MinOpts (-1.29% to -0.76%)
FullOpts (-0.44% to -0.24%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.01% to -0.00%)
MinOpts (-0.06% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
Diff results for #97739Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.01% to -0.00%)
MinOpts (-0.06% to -0.01%)
FullOpts (-0.01% to -0.00%)
Details here |
Contributes to #94549
Adds 35 formats. This is a large one, but it is better to do these all at once.
Progress:
Left: Capstone,
Right: Jit