Branching design for Tier 2 (uops) interpreter #106529
Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice. UPDATE: See gh-109039. |
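A minimal sketch of the shift-register idea described above, written in Python for readability. The names `record_branch` and `jump_likelihood` are hypothetical, and the "undecided" result for an exact half-half split is only a placeholder for the tough choice mentioned above.

```python
HISTORY_BITS = 16
HISTORY_MASK = (1 << HISTORY_BITS) - 1
# Alternating ones and zeros, as suggested for the initial cache value.
INITIAL_HISTORY = 0b1010101010101010


def record_branch(history: int, taken: bool) -> int:
    """Shift the history left by one, shifting in 1 (taken) or 0 (not taken)."""
    return ((history << 1) | int(taken)) & HISTORY_MASK


def jump_likelihood(history: int) -> str:
    """Count the set bits and compare against half the history width."""
    ones = bin(history).count("1")
    if ones > HISTORY_BITS // 2:
        return "likely"
    if ones < HISTORY_BITS // 2:
        return "unlikely"
    return "undecided"  # half-half: the caller has to pick a policy


history = INITIAL_HISTORY
for _ in range(3):
    history = record_branch(history, True)
print(jump_likelihood(INITIAL_HISTORY), jump_likelihood(history))
```

Starting from the alternating pattern means a fresh instruction reports neither direction as likely until a few real outcomes have been shifted in.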
Q. In Tier 2, should we have POP_JUMP_IF_XXX, which (like its Tier 1 counterpart) combines a pop and a branch, or should we emit separate POP_TOP instructions? Emitting separate POP_TOP uops seems perhaps simpler, but it's less performant to interpret, since the explicit POP_TOP will call Py_DECREF() on the stack top, whereas POP_JUMP_IF_TRUE/FALSE won't need to do that at all (True and False are immortal), and POP_JUMP_IF_[NOT_]NONE can skip that in the None case. I don't have much intuition yet about which will be easier for the optimizer to handle. It's easy to change. This also reminds me of a question for @brandtbucher: assuming that at some point the copy-and-patch machinery will use a Tier 2 uop sequence as input, how would you want to handle hand-written uops like JUMP_IF_FALSE or SAVE_IP? (If the answer is long or complicated, we should take this to the faster-cpython/ideas tracker.) |
Similarly, in his original guidance @markshannon suggests that UPDATE: Never mind, Mark did this in GH-106599 using a macro in bytecodes.c. |
The point of my original suggestions was that using additional micro-ops can help us to reduce the number of branch micro-ops. As you point out
All micro-ops are hand written, so |
How is this different from a counter, where you add for jump taken and subtract for jump not taken? I understand it records some recency information, but if that isn't used, isn't it better to get a more accurate number (with a memory larger than 16 entries?) |
A counter doesn't tell you anything about the recent history. An alternative is to count the length and direction of the last two runs, but that's more complex and slower. |
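To make the difference concrete, here is a small sketch comparing the two schemes on a branch that changes behaviour. The unbounded up/down counter and the helper names are assumptions for illustration; a real counter would be bounded or saturating.

```python
def run_counter(outcomes):
    """Up/down counter: +1 for a taken jump, -1 for a jump not taken."""
    counter = 0
    for taken in outcomes:
        counter += 1 if taken else -1
    return counter


def run_history(outcomes, bits=16):
    """16-bit shift register: newest outcome in the lowest bit."""
    mask = (1 << bits) - 1
    history = 0b1010101010101010 & mask
    for taken in outcomes:
        history = ((history << 1) | int(taken)) & mask
    return history


# A branch that was taken 100 times and then stopped being taken.
outcomes = [True] * 100 + [False] * 16
counter = run_counter(outcomes)
history = run_history(outcomes)
print(counter, bin(history).count("1"))
```

The counter still reports a strong bias towards "taken", while the shift register only remembers the last 16 outcomes and reports that the jump is currently not being taken.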
There seems to be some terminological confusion here. I've been thinking of uops as anything that the Tier 2 interpreter can interpret. Most of those (over 100 of them) are generated by the generator. There are only a handful of hand-written ones (i.e., explicitly written down in the switch in But it doesn't matter, we seem to be in agreement. |
To clarify, my concern about the few hand-written uops is that IIUC @brandtbucher's copy-and-patch tooling uses as its input either bytecodes.c or one of its .c.h outputs in order to generate its templates. For the hand-written uops it will need a way to use the hand-written code as template source. Maybe we need to add them to bytecodes.c after all, marked as uops. |
Off-line we discussed the The first one to land is likely |
- Hand-written uops JUMP_IF_{TRUE,FALSE}. These peek at the top of the stack. The jump target (in superblock space) is absolute.
- Hand-written translation for POP_JUMP_IF_{TRUE,FALSE}, assuming the jump is unlikely. Once we implement jump-likelihood profiling, we can implement the jump-likely case (in another PR).
- Tests (including some test cleanup).
- Improvements to len(ex) and ex[i] to expose the whole trace.
FOR_ITER specializations, revisited
Note that
If the iterator is exhausted, it logically pushes a dummy value on top of the stack and jumps to the Take
Turning each of these three steps into a separate micro-op would cause a 3x3 explosion, since the details of each depends on the type (list, tuple or range). Perhaps a somewhat better approach would be to have one new micro-op for each type that pushes either the next value or So we could possibly do with 5 new micro-ops:
We can't use If we wanted to split the first micro-op into separate guard and action micro-ops, we'd still be slightly ahead: 8 new micro-ops instead of 9. But there's probably not much point in separating the guard from the action; it's unlikely that any optimization pass would be able to do much guard elimination, given how Here's a sample implementation of
This appears to work in the Tier 1 interpreter. For Tier 2 I have to finagle a few more things, like hand-written versions of |
I don't see why 9 micro-ops is an "explosion", but 5 is fine.
This will prevent the optimizer from removing guards. Please keep guards and actions separate. |
For all instructions, we want to keep them efficient for the tier 1 interpreter, while producing optimizable micro-ops for tier 2. The problem with instructions like
    DEOPT_IF(Py_TYPE(iter) != &PyRangeIter_Type);
    if (!is_exhausted(iter)) {
        PUSH_RANGE_ITER_NEXT_VALUE;
    }
    else {
        POP_AND_CLEANUP_RANGE_ITERATOR;
        JUMP_BY(oparg + 1); // The +1 is for the END_FOR
    }
We want a definition of
or, for the uncommon case (exhausted)
|
The question is, how to express this? Something like:
    macro(FOR_ITER_RANGE) =
        CHECK_ITER_RANGE +
        (IS_RANGE_ITERATOR_NOT_EXHAUSTED ?
            PUSH_RANGE_ITER_NEXT_VALUE :
            (POP_AND_CLEANUP_RANGE_ITERATOR + JUMPBY(oparg+1))
        );
I'm glossing over a bunch of awkward details here, I realize. This could be quite fiddly to implement, but it should be doable. |
During superblock generation, a JUMP_BACKWARD instruction is translated to either a JUMP_TO_TOP micro-op (when the target of the jump is exactly the beginning of the superblock, closing the loop), or a SAVE_IP + EXIT_TRACE pair, when the jump goes elsewhere. The new JUMP_TO_TOP instruction includes a CHECK_EVAL_BREAKER() call, so a closed loop can still be interrupted.
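The rule above can be sketched as a small translation helper. This is an illustrative Python model, not the C code in the superblock generator; the function name and the (name, operand) tuple representation of uops are assumptions.

```python
def translate_jump_backward(target, trace_start):
    """Translate a Tier 1 JUMP_BACKWARD into a list of (uop, operand) pairs.

    `target` and `trace_start` are Tier 1 instruction offsets (hypothetical
    representation; the real generator works on code unit pointers).
    """
    if target == trace_start:
        # The jump closes the loop: stay inside the superblock.
        # JUMP_TO_TOP is the uop that also checks the eval breaker.
        return [("JUMP_TO_TOP", 0)]
    # The jump goes elsewhere: record where to resume and leave the trace.
    return [("SAVE_IP", target), ("EXIT_TRACE", 0)]


print(translate_jump_backward(10, 10))
print(translate_jump_backward(4, 10))
```

Only the loop-closing case keeps execution in Tier 2; every other backward jump hands control back to Tier 1 at the saved instruction pointer.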
Hmm... Implementing the ternary op in the macro syntax looks fairly costly. And it looks like this will only ever be used for 3-4 instructions (basically just the The parser changes are straightforward enough, but we'd also need to be able to handle that in the Tier 1 code generation (combining ops efficiently is already some of the most complex code in the generator, and that will become more complicated with the conditional), as well as in the Tier 2 code generation. And then the Tier translation will need to become aware of it as well, and the Tier 2 interpreter. Separately, it looks like you are proposing a syntax addition where in a macro you'd be able to write
which gets special-cased in the superblock generator so it just emits a I'm still thinking about how best to do the ternary expression. I'll keep you posted. |
Note that this may generate two SAVE_IP uops in a row. Removing unneeded SAVE_IP uops is the optimizer's job.
So here's a possible way to deal with My primary goals here are:
I'm willing to sacrifice some complexity in the translation -- in particular, I propose to have three hand-written cases in the translator (one for each Taking
where
In Tier 2, the expansion will be slightly different. We reuse I'll whip up a demo. |
Correction. Following Mark's suggestion, the Tier 2 translation (which I propose to do manually) should be
together with a stub doing
|
PR: gh-106638
The manual Tier 2 translation code is here.
Demo function (from test_misc.py):
    def testfunc(n):
        total = 0
        for i in range(n):
            total += i
        return total
Tier 1 disassembly (by dis):
Tier 2 disassembly (output from commented-out code in the test, annotated):
|
When you say "do manually", I assume that means "special case these instructions in the code generator". I see two problems with doing this:
If you think that it is worth taking on this technical debt to unblock other work, then I accept that, but we will need to schedule implementing a more principled approach. |
Okay, let's go with this for now. I tried to come up with a way to encode translations like this in the "macro expansion" data structure and it would become very hacky. If we end up wanting to add new FOR_ITER specializations regularly we can iterate. I will land the FOR_ITER_RANGE PR and then work on FOR_ITER_LIST/TUPLE. After that I'll return to CALL specialization. |
For an example of what this does for Tier 1 and Tier 2, see #106529 (comment)
Also rename `_ITER_EXHAUSTED_XXX` to `_IS_ITER_EXHAUSTED_XXX` to make it clear this is a test.
…on#106756) The Tier 2 opcode _IS_ITER_EXHAUSTED_LIST (and _TUPLE) didn't set it->it_seq to NULL, causing a subtle bug that made test_exhausted_iterator in list_tests.py fail when running all tests with -Xuops. The bug was introduced in pythongh-106696. Added this as an explicit test. Also fixed the dependencies for ceval.o -- it depends on executor_cases.c.h.
These aren't automatically translated because (ironically) they are macros deferring to POP_JUMP_IF_{TRUE,FALSE}, which are not viable uops (being manually translated). The hack is that we emit IS_NONE and then set opcode and jump to the POP_JUMP_IF_{TRUE,FALSE} translation code.
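A rough Python model of that hack, for illustration only. The helper name is hypothetical, and the pairing of POP_JUMP_IF_NONE with POP_JUMP_IF_TRUE (and POP_JUMP_IF_NOT_NONE with POP_JUMP_IF_FALSE) is stated here as an assumption about how the macros defer.

```python
def expand_none_branch(opcode):
    """Return (uops to emit first, opcode whose translation code is reused).

    Assumption: IS_NONE replaces the stack top with a bool, so the bool-based
    branch translation can take over from there.
    """
    if opcode == "POP_JUMP_IF_NONE":
        return ["IS_NONE"], "POP_JUMP_IF_TRUE"
    if opcode == "POP_JUMP_IF_NOT_NONE":
        return ["IS_NONE"], "POP_JUMP_IF_FALSE"
    # Every other opcode is translated as itself.
    return [], opcode


print(expand_none_branch("POP_JUMP_IF_NONE"))
print(expand_none_branch("POP_JUMP_IF_NOT_NONE"))
```

In the C translator the second element corresponds to overwriting `opcode` and jumping to the existing POP_JUMP_IF_{TRUE,FALSE} case rather than returning a value.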
The design for However, we have another case: unspecialized Alas, |
That seems to work, except the deopt code must be duplicated (since we can't use |
This uses the new mechanism whereby certain uops are replaced by others during translation, using the `_PyUop_Replacements` table. We further special-case the `_FOR_ITER_TIER_TWO` uop to update the deoptimization target to point just past the corresponding `END_FOR` opcode. Two tiny code cleanups are also part of this PR.
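The replacement step might look roughly like the following Python model. The dictionary stands in for `_PyUop_Replacements`; its single entry, the `emit_uop` helper, and the `end_for_index` parameter are all assumptions made for the sake of the example.

```python
# Hypothetical stand-in for the `_PyUop_Replacements` table; the entry shown
# is an assumption for illustration.
UOP_REPLACEMENTS = {"_FOR_ITER": "_FOR_ITER_TIER_TWO"}


def emit_uop(trace, name, operand, target, end_for_index=None):
    """Append a uop to the trace, applying replacements during translation."""
    name = UOP_REPLACEMENTS.get(name, name)
    if name == "_FOR_ITER_TIER_TWO":
        if end_for_index is None:
            raise ValueError("need the index of the matching END_FOR")
        # Special case: deoptimize to just past the corresponding END_FOR.
        target = end_for_index + 1
    trace.append((name, operand, target))


trace = []
emit_uop(trace, "_FOR_ITER", 7, target=12, end_for_index=20)
emit_uop(trace, "_POP_TOP", 0, target=13)
print(trace)
```

Uops without an entry in the table pass through unchanged, so the table only has to list the few uops that need a Tier 2 specific variant.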
- Double max trace size to 256
- Add a dependency on executor_cases.c.h for ceval.o
- Mark `_SPECIALIZE_UNPACK_SEQUENCE` as `TIER_ONE_ONLY`
- Add debug output back showing the optimized trace
- Bunch of cleanups to Tools/cases_generator/
This issue is part of the larger epic of gh-104584. In PR gh-106393 I tried to implement branching, but it was premature. Here's a better design, following @markshannon's guidance.
We have the following jump instructions (not counting the instrumented versions):
Unconditional jumps:
Branches, a.k.a. conditional jumps:
POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE, POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE
FOR_ITER's specializations:
FOR_ITER_GEN
SEND
Add counters to POP_JUMP_IF_{FALSE,TRUE} to determine likeliness
The translation strategies could be as follows:
Unconditional jumps
JUMP_BACKWARD
JUMP_BACKWARD_NO_INTERRUPT
JUMP_FORWARD
instr
to the destination of the jump).
Conditional jumps (branches)
POP_JUMP_IF_FALSE and friends
Consider the following Python code:
This translates roughly to the following Tier 1 bytecode (using B1, B2, ... to label Tier 1 instructions, and <cond>, <block> etc. to represent code blocks that evaluate or execute the corresponding Python fragments):
I propose the following translation into Tier 2 uops, assuming the branch is "unlikely":
Where JUMP_IF_FALSE inspects the top of stack but doesn't pop it, and has an argument that executes a jump in the Tier 2 uop instruction sequence.
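To illustrate the peek-then-pop behaviour, here is a toy Python interpreter for a trace laid out for the "unlikely" case. Everything about the layout is an assumption made for this sketch: the `PUSH` and `RUN` pseudo-uops stand in for `<cond>`, `<block>` and `<rest>`, the stub sits at the end of the uop list, and `B4` is a hypothetical label for the Tier 1 jump destination.

```python
def run_trace(uops, stack):
    """Interpret a list of (name, operand) uops; return (saved_ip, executed)."""
    pc = 0
    saved_ip = None
    executed = []
    while pc < len(uops):
        name, operand = uops[pc]
        pc += 1
        if name == "PUSH":              # stand-in for evaluating <cond>
            stack.append(operand)
        elif name == "JUMP_IF_FALSE":
            # Peek only; the operand is an absolute index into the uop list.
            # Assumes the stack top is an exact bool.
            if stack[-1] is False:
                pc = operand
        elif name == "POP_TOP":
            stack.pop()
        elif name == "RUN":             # stand-in for <block> or <rest>
            executed.append(operand)
        elif name == "SAVE_IP":
            saved_ip = operand
        elif name == "EXIT_TRACE":
            break
        else:
            raise ValueError(f"unknown uop {name}")
    return saved_ip, executed


def unlikely_branch_trace(cond_value):
    return [
        ("PUSH", cond_value),     # 0: <cond>
        ("JUMP_IF_FALSE", 6),     # 1: branch to the stub, leaving the value
        ("POP_TOP", None),        # 2: fall-through path pops the condition
        ("RUN", "<block>"),       # 3
        ("RUN", "<rest>"),        # 4
        ("EXIT_TRACE", None),     # 5
        ("POP_TOP", None),        # 6: stub pops the condition as well
        ("SAVE_IP", "B4"),        # 7: resume Tier 1 at the jump destination
        ("EXIT_TRACE", None),     # 8
    ]


stack = []
print(run_trace(unlikely_branch_trace(True), stack), stack)
stack = []
print(run_trace(unlikely_branch_trace(False), stack), stack)
```

Because the branch uop does not pop, both the fall-through path and the stub need their own POP_TOP, which keeps the stack depth identical on either exit.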
If the branch is "likely", we do this instead:
Note how in this case, <rest> is projected as part of the trace, while <block> is not, since the likely case is that we jump over <block> to <rest>.
For the other simple conditional jumps (POP_JUMP_IF_TRUE, POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE) we do the same: if the jump is unlikely, emit a JUMP_IF_XXX uop and a stub; if the jump is likely, emit the inverse JUMP_IF_NOT_XXX uop and a different stub, and continue projecting at the destination of the original jump bytecode.
I propose to have hand-written cases both in the superblock generator and in the Tier 2 interpreter for these, since the translations are too irregular to fit easily in the macro expansion data structure. The stub generation will require additional logic and data structures in translate_bytecode_to_trace() to keep track of the stubs required so far, the available space for those, and the back-patching required to set the operands for the JUMP_IF_XXX uops.
FOR_ITER and (especially) its specializations
The common case for these is not to jump. The bytecode definitions are too complex to duplicate in hand-written Tier 2 uops. My proposal is to change these in bytecodes.c so that, instead of using the JUMPBY(n) macro, they use JUMPBY_POP_DISPATCH(n), which in Tier 1 translates into just JUMPBY(n), but in Tier 2 translates into roughly
thereby exiting the trace when the corresponding for-loop terminates.
I am assuming here that most loops have several iterations. I don't think it's worth special-casing the occasional for-loop that almost always immediately terminates.
SEND
Possibly we could treat this the same as FOR_ITER. But right now I propose to just punt here, and when we encounter it, stop projecting, as we do with any other unsupported bytecode instruction.