gh-112354: Add executor for less-taken branch #112902
Conversation
Offline we discussed a different way to satisfy the progress requirement: Instead of inserting an ENTER_EXECUTOR in the Tier 1 stream for "side exit" traces/executors, have a pointer to the secondary executor directly in the executor whose side exit it represents. Then the requirement becomes "executors attached to ENTER_EXECUTOR must make progress" and all is well.
Force-pushed from c2f00b0 to 65c41cb.
Regarding forward progress, we could add a boolean parameter to the optimize function requiring forward progress of the executor. In the optimize function, we can ensure forward progress by de-specializing the first instruction. For example, if the first (tier 1) instruction is …
But in this PR we won't need that.
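The de-specialization idea above can be sketched in Python. This is an illustrative model only, not CPython's C code: the table is a tiny hypothetical subset analogous in spirit to the internal `_PyOpcode_Deopt` mapping, which maps a specialized instruction back to its generic form so a rebuilt trace cannot start with a guard that would fail the same way again.

```python
# Hypothetical sketch: map specialized opcodes back to their generic
# ("deopt") form. The table is an illustrative subset, not CPython's.
DEOPT = {
    "FOR_ITER_LIST": "FOR_ITER",
    "FOR_ITER_RANGE": "FOR_ITER",
    "LOAD_ATTR_INSTANCE_VALUE": "LOAD_ATTR",
}

def despecialize(opcode: str) -> str:
    """Return the generic form of a possibly-specialized opcode."""
    # Opcodes not in the table are already generic.
    return DEOPT.get(opcode, opcode)

print(despecialize("FOR_ITER_LIST"))  # FOR_ITER
print(despecialize("JUMP_BACKWARD"))  # JUMP_BACKWARD (already generic)
```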
(The latter as yet unused.)
Force-pushed from 864d68d to 0f64231.
Here's a new version.
TODO: The side exit executors are leaked when the main executor is deallocated. Easy to fix, I just forgot, and it's time to stop working today. Also, I need to add tests, and for that, I need to add a way to get the sub-executors from a main executor (since the sub-executors are not reachable via …). @markshannon Please have another look.
There seems to be a lot of unnecessary work happening when moving from trace to trace or from tier to tier.
The stack_pointer and frame attributes should be correctly handled in both tier 1 and tier 2 interpreters. They shouldn't need fixing up.
opcode = next_instr[1].op.code;
}

// For selected opcodes build a new executor and enter it now.
Why "selected opcodes", why not everywhere?
In an earlier version that somehow didn't work. Right now the check whether the new trace isn't going to immediately deopt again relies on these opcodes. I figured once we have the side exit machinery working we could gradually increase the scope to other deoptimizations. Also, not all deoptimizations are worthy of the effort (e.g. the PEP 523 test).
No special cases, please; it just makes the code more complicated and slower.
If we want to treat some exits differently, let's do it properly (faster-cpython/ideas#638), not here.
There are several reasons. First, as I explain below, for bytecodes other than branches, I can't promise an exact check for whether the newly created sub-executor doesn't just repeat the same deoptimizing uop that triggered its creation (in which case the sub-executor would always deopt immediately if it is entered at all).
Second, for most bytecodes other than branches, deoptimization paths are relatively rare (IIRC this is apparent from the pystats data -- with the exception of some LOAD_ATTR specializations).
For branches, we expect many cases where the "common" path is not much more common than the "uncommon" path (e.g. 60/40 or 70/30). Now, it might make sense to have a different special case here, where if e.g. _GUARD_IS_TRUE_POP has a hot side-exit, we know that the branch goes the other way, so we can simply create a sub-executor starting at the less-common branch target. The current approach doesn't do this (mostly because I'd have to thread the special case all the way through the various optimizer functions) but just creates a new executor starting at the original Tier 1 branch bytecode -- in the expectation that, if the counters are tuned just right, we will have executed the less-common branch in Tier 1 while taking the common branch in Tier 2, so that Tier 1's shift register has changed state and now indicates that the less-common branch is actually taken more frequently. The checks at L1161 and ff. below are a safeguard in case that hasn't happened yet (there are all kinds of interesting scenarios, e.g. loops that don't iterate much -- remember that the first iteration each time we enter a loop is done in Tier 1, where we stay until we hit the JUMP_BACKWARD bytecode at the end of the loop).
I propose this PR as a starting point for further iterations, not as the ultimate design for side-exits. Let's discuss this Monday.
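The "shift register" mentioned above can be modeled as a fixed-width bit history of recent branch outcomes. The sketch below is an assumption-laden simplification (the width and the majority test are illustrative, not CPython's exact scheme): each execution shifts in one outcome bit, and a popcount indicates which direction is currently more common.

```python
# Illustrative model of a 16-bit branch-history shift register.
# Not CPython's implementation; width and threshold are assumptions.
WIDTH = 16

def record(history: int, taken: bool) -> int:
    """Shift the latest branch outcome into a WIDTH-bit history."""
    return ((history << 1) | int(taken)) & ((1 << WIDTH) - 1)

def mostly_taken(history: int) -> bool:
    """True if more than half of the recorded outcomes were 'taken'."""
    return bin(history).count("1") > WIDTH // 2

h = 0
# 12 taken followed by 4 not-taken: the register still reads "mostly taken".
for taken in [True] * 12 + [False] * 4:
    h = record(h, taken)
print(mostly_taken(h))  # True
```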
...and set resume_threshold so they are actually produced.
@markshannon Please review again. I did some of the things you asked for; for a few others I explained why not. TODO: …
This has no business being in this PR. This reverts commit f1998c0.
@@ -2353,6 +2353,7 @@ int
 void
 _Py_Specialize_ForIter(PyObject *iter, _Py_CODEUNIT *instr, int oparg)
 {
+    assert(_PyOpcode_Deopt[instr->op.code] == FOR_ITER);
We should really add such asserts to many specialization functions; I ran into this one during an intense debugging session.
The assert can be instr->op.code == FOR_ITER, and it shouldn't be necessary, as _Py_Specialize_ForIter is only called from FOR_ITER.
I tried that and I get a Bus error. And of course it's not supposed to be called with something else! But a logic error in my early prototype caused that to happen, and it took me quite a while to track it down.
goto enter_tier_two;  // All systems go!
}

// The trace is guaranteed to deopt again; forget about it.
Is it? Why?
See explanation above.
Py_DECREF(current_executor);
current_executor = (_PyUOpExecutorObject *)*pexecutor;

// Reject trace if it repeats the uop that just deoptimized.
Why?
This test may be a bit imprecise(*), but it tries to discard the case where, even though the counter in the executor indicated that this side exit is "hot", the Tier 1 bytecode hasn't been re-specialized yet. In that case the trace projection will just repeat the uop that just took a deopt side exit, causing it to immediately deopt again. This seems a waste of time and executors -- eventually the sub-executor's deopt counter will also indicate it is hot, and then we'll try again, but it seems better (if we can catch it) to avoid creating the sub-executor in the first place, relying on exponential backoff for the side-exit counter instead (implemented below at L1180 and ff.).
For various reasons, the side-exit counters and the Tier 1 deopt counters don't run in sync, so it's possible that the side-exit counter triggers before the Tier 1 counter has re-specialized. This check gives that another chance.
The test that I would like to use here would be to check whether the Tier 1 opcode is still unchanged (i.e., not re-specialized), but the executor doesn't record that information (and it would take up a lot of space; we'd need at least an extra byte for each uop that can deoptimize).
(*) The test I wrote is exact for the conditional branches I special-cased above (that's why there's a further special case here for _IS_NONE). For other opcodes it may miss a few cases, e.g. when a single T1 bytecode translates to multiple guards and the failing guard is not the first uop in the translation (this would always happen for calls, whose translation always starts with _PEP_523, which never deopts in cases we care about). In those cases we can produce a sub-executor that immediately deoptimizes. (And we never try to re-create executors, no matter how often one deoptimizes -- that's a general flaw in the current executor architecture that we should probably file separately.)
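The exponential-backoff idea for the side-exit counter can be sketched as follows. This is a hedged illustration only; the class, method names, and thresholds are hypothetical, not CPython's data structures. Each time sub-executor creation is rejected, the hotness threshold doubles, so a side exit that keeps producing useless traces triggers re-attempts progressively less often.

```python
# Hypothetical sketch of exponential backoff for a side-exit counter.
# All names and constants here are illustrative assumptions.
class SideExitCounter:
    def __init__(self, initial_threshold: int = 16, max_threshold: int = 4096):
        self.threshold = initial_threshold
        self.max_threshold = max_threshold
        self.count = 0

    def hit(self) -> bool:
        """Record one side exit; return True once the exit is 'hot'."""
        self.count += 1
        return self.count >= self.threshold

    def backoff(self) -> None:
        """Called when the projected trace was rejected: reset the count
        and double the threshold (capped) before the next attempt."""
        self.count = 0
        self.threshold = min(self.threshold * 2, self.max_threshold)

c = SideExitCounter()
hot = False
for _ in range(16):
    hot = c.hit()
print(hot)  # True: threshold of 16 reached
c.backoff()
print(c.threshold)  # 32
```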
#113104 unifies …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's talk offline about special cases for side exits on Monday. I would prefer to do only the special cases first and generalize later, but I hear that you prefer a different development strategy.
Some food for thought for Monday's discussion.
Offline we decided to give this a rest. Also, we are back to requiring executors to make progress. A few ideas out of the discussion: …
Closing in preference of the data structures proposed in faster-cpython/ideas#644.
(For a description of what changed since the first version, see later comments -- almost everything is taken care of.)
This brings up many questions, but shows a possible way forward.

What's wrong so far?
- … EXTENDED_ARG)
- … _PyExecutorObject and _PyOptimizerObject
- … #108866 (but for now is the easiest way out)
- … uint16_t type/size)

What's right?
- EXTENDED_ARG: The trick is that the deopt goes to the EXTENDED_ARG, so we must decode the instruction before checking for ENTER_EXECUTOR.
- The src and dest arguments to _PyOptimizer_BackEdge and friends are slightly different when EXTENDED_ARG is present: src points to the actual instruction (e.g. JUMP_BACKWARD) while dest may point to an EXTENDED_ARG opcode. This is important when reusing the code for inserting an executor at a place that's not a JUMP_BACKWARD.

@markshannon @brandtbucher
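The EXTENDED_ARG point above can be illustrated with a small Python model. This is a simplified sketch, not CPython's C decoder: the opcode numbers and the instruction representation are assumptions. It shows why the prefixes must be consumed (accumulating the oparg) before the real opcode can be compared against ENTER_EXECUTOR.

```python
# Simplified model of EXTENDED_ARG decoding; opcode numbers are
# illustrative assumptions, and instructions are (opcode, arg) pairs.
EXTENDED_ARG = 144
ENTER_EXECUTOR = 230

def decode(code, i):
    """Skip EXTENDED_ARG prefixes starting at index i, accumulating
    their bytes into the oparg; return (opcode, oparg, index)."""
    oparg = 0
    while code[i][0] == EXTENDED_ARG:
        oparg = (oparg << 8) | code[i][1]
        i += 1
    opcode, arg = code[i]
    return opcode, (oparg << 8) | arg, i

# A deopt target pointing at an EXTENDED_ARG prefix: only after
# decoding do we see the real opcode and the full oparg 0x0102.
code = [(EXTENDED_ARG, 0x01), (ENTER_EXECUTOR, 0x02)]
opcode, oparg, i = decode(code, 0)
print(opcode == ENTER_EXECUTOR, hex(oparg))  # True 0x102
```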