-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Constructing Executors using stack contents #635
Comments
Let's try to go over those steps in a bit more detail for This is currently special-cased in optimizer.c by special-casing the
So now at some point we have a new buffer containing Tier 2 IR produced by projecting a trace from A simple thing we could do at this point would be to replace the But it would be more effective if we could extend the trace that was "interrupted" by the It's probably not a good idea to update the original Executor object in place (for one, Executor objects are variable-size objects so we can't make it longer). Instead, however, we can construct a new Executor and use it to replace the original in the First we have to allocate another buffer. Into this buffer we first copy the IR from the original Executor, but stopping at the There's a wrinkle though: stubs. The original Executor may contain one or more stub sequences at the end. The new trace buffer may contain another set of stub sequences (but at the very end of the buffer, with a gap in between). For optimization purposes, IIUC we don't optimize stubs, but they may become dead code. We should probably somehow collect these stubs in a separate buffer and add them back later, patching up the jumps. (This requires us to be somewhat clever, because there's nothing to tell us where the stubs are except the conditional branch uops. We can do something fairly stupid initially and make it faster once everything's working.) Replacing Executors in-flight looks easy enough -- there are reference counts that will cause the original Executor to be deallocated once the last thread or frame executing it releases it. Continuing in the new Executor where the original terminated may be tricky. The problem is that if we have an existing trace
and we combine it with a new trace
the optimizer may replace (for example)
with a single entirely different uop, say
Now where do we continue execution? We can't start before An alternative would be to optimize the new trace buffer before merging it with the original Executor's trace buffer. But that would reduce the optimizer's effectiveness, as it would not consider optimizations spanning both buffers (e.g. dropping a guard in the new trace that's already checked in the original trace). So it's probably better to just bail to Tier 1 for one iteration. Note that it is possible that the new Executor also ends in a |
Note that this is really just a form of trace stitching. I think we should design a mechanism that works for both. For example, we can just let the exit at the end of the trace "heat up" like normal side exits, and then project a new trace from there and stitch them together. I think, if we do it right, things can grow organically and we don't need to throw away and rebuild the "first" executor every time we rebuild the trace (and the nasty quadratic time complexity that introduces for a trace with a bunch of calls or whatever). If we wanted to, we could even use this for branches (and get rid of the existing counters). |
Agreed that the quadratic effect is undesirable. OTOH with trace stitching, we don't get the benefits of running the optimizer pass across all "segments". Maybe we could do something that only runs the optimizer once the last segment doesn't end in |
So that brings up another question: do we want to have all of the segments be exclusive to the first executor, or have each segment be its own executor that can be stitched into arbitrarily many traces? In the first case, we'd want to optimize each branch of the tree, but those branches can't easily be stitched into other trees. On the other hand, the second case would cut down on code duplication and make stitching all the hot paths of the program together easy (so we don't leave the trace graph once the program warms up), but we can't optimize across segments. I sort of like the latter, assuming it doesn't prevent too many useful optimizations. I think if we wanted to optimize the segments more heavily in that case, something like lazy basic block versioning would really shine. |
In terms of what's easiest to build right now, as a start for incremental improvements, turning each segment into its own Executor would be simplest. We'd basically just create a new Executor starting at the Oh, we would need to overwrite the |
Could we just have it check if there's an executor installed at the target instruction, and load it if so? Otherwise, it projects a new trace and installs it itself? If we do that, then we get cross-trace stitching for free. (We just have to be careful that a I’d have to think about how tricky this is for the JIT… but it shouldn’t be too bad, especially considering the benefits it brings. It does seem like the simplest of the ideas we’ve considered here, though. The tricky part is tail-calling into the new executor. |
Also, modifying traces after the fact is, uh, pretty painful for jitted code. So let’s continue to think of traces as immutable. ;) |
Right now, loading an executor is a fair amount of code (see if (operand) {
Py_DECREF(current_executor);
current_executor = Py_NewRef((PyObject *)operand);
GOTO_TIER_TWO();
} (Though I've probably missed a step or two.) |
Oh, never mind then. Then instead of stuffing the executor in if (next_instr->op.code == ENTER_EXECUTOR) {
PyCodeObject *code = _PyFrame_GetCode(frame);
_PyExecutorObject *executor = (_PyExecutorObject *)code->co_executors->executors[oparg&255];
int original_oparg = executor->vm_data.oparg | (oparg & 0xfffff00);
JUMPBY(1-original_oparg); // <-- This is problematic since we're not at a JUMP_BACKWARD
frame->instr_ptr = next_instr;
Py_INCREF(executor);
if (executor->execute == _PyUopExecute) {
current_executor = (_PyUOpExecutorObject *)executor;
GOTO_TIER_TWO();
}
} (There are definitely a few more nits that have to change here.) |
This will also force us to deal with the question of "what if an executor makes no progress?" (I'm sure there's a discussion about that in this tracker but I can't find it quickly.) An executor that replaces |
We could also create an “unfinished” executor that projects a trace the first (or n-th) time its I guess the main change would be that executors wouldn’t store the trace in their tails, it would be a pointer to a buffer. But that’s probably fine, since the JIT likely wants to throw the trace buffer away anyways. |
My mind is boggling. I am going to concentrate on creating a second executor using |
I think the question of forward progress can probably be answered by sticking the uops for the “base” opcode in any side exit stubs for the first instruction in the trace. So it’s as-if the base opcode already ran when we exit. |
Not sure what you're proposing here. We don't yet have And by "base opcode" I presume you're referring to the |
Yeah, I'm imagining this building upon python/cpython#111611.
No, I mean |
We can discuss more on Monday, I think. |
Oh, got it. That would make the jitted code for the side exit bulky though -- Plus, currently |
Bulky side exits are the future. ;) In all seriousness, I've been thinking of only jitting the mainline and interpreting the side exits. Trickier, but could be something to consider down the road if we're jitting too much code. |
I have a working prototype (need to clean it up before I push it to a branch). I realized there's another issue with not fusing the executors/segments: if a loop contains a call, it is broken up into two or more separate executors, and we can no longer use I've also discovered a problem with checking in |
Actually, checking whether an executor made progress is hard, period. I'll probably need a new variable to hold some state. I have a feeling Mark knew this but didn't explain it well enough for me to understand. :-( |
A tentative solution compares the current instruction to the target of the |
Oof, this is much harder than it seemed. One thing I haven't tried yet but should (although I doubt it'll fix any of the current blockers): instead of changing the signature of |
This all seems rather complicated. It might be worth dropping tracing across calls at all, and our efforts into making sure that we can stitch together the caller trace and callee trace. Let's not attempt to start executing traces in the middle.
|
For now we have the function (version) cache that enables us to trace across at least some calls. (We should try to figure out how effective it is.) Stitching will require re-optimizing, e.g. if we have a function that does something trivial (e.g. |
FWIW, @brandtbucher just showed me his version of the part (really just one yak I shaved) that deals with the requirement that |
This started at #628 (comment).
We can do away with the function version cache using this approach. The idea is that when the Tier 1 to Tier 2 translator encounters an instruction that requires tracing through another code object, instead of using the function cache to find that code object, we stop translation with a special uop (let's say
UPDATE_TRACE
) instead ofEXIT_TRACE
. When that uop is encountered during Tier 2 interpretation, we have to do the following:CALL
and its specializations).The text was updated successfully, but these errors were encountered: