We need to change the contract and interface of `_PyExecutorObject` and `_PyOptimizerObject` #108866
Comments
We haven't implemented optimization on
We may also need the executor to return the
I think you meant this makes life harder for the executor? Or maybe it's harder for both. :-)

I'm not sure how to implement the proposal except by ensuring in the optimizer that the first Tier 1 instruction it translates doesn't deopt. This is easily checked, since gh-108583 introduced the

I wouldn't want the Tier 2 executor to be responsible for executing the deoptimized Tier 1 instruction. I suppose in some cases we could define the "family head" in Tier 1 as a macro that is translatable to Tier 2 (in some cases it may already be translatable, though that isn't the case for
Are you thinking of having

Anyway, I see the attractiveness of the idea (you can plunk
> We start optimizing on

Indeed. So don't. That's what
No
Being obliged to execute the first instruction would make things complicated, except that there is no obligation to optimize at all. If it is too complicated, just let tier 1 handle it.
Okay, since the "first instruction" in this case is always
I just realized I don't think I understood this example before. :-( But maybe another motivation works for me. If

However, consider the example you gave, where we want to replace a specialization of

So I'm still a little bit stumped. Aren't we introducing a lot of complexity with this requirement?
Separately, it would be nice if the

The special-casing required wouldn't be too terrible, would it?
That is all true, but easily solved for our tier 2 optimizer; we simply don't insert

However, the ability to add an
We want to enter the executor at the end of the loop. It effectively gives us loop peeling for free, which improves type stability.
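The loop-peeling point can be illustrated with a toy example (illustrative only; the function `f` is hypothetical, not from the issue):

```python
# Illustrative only: why a trace entered at the end of the first iteration
# (at the loop's backward jump) sees more stable types than a trace entered
# at the loop head.

def f(n):
    total = 0                     # at the loop head, total starts as an int
    for i in range(n):
        total = total + i * 0.5   # after the first iteration, total is a float
    return total

# A trace projected from the loop head must handle total being an int on the
# first pass and a float on later passes; a trace entered at the backward
# jump only ever sees the float case, so its type guards stay stable.
print(f(3))  # 1.5
```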
Only if it makes things measurably better.
Offline there was considerable skepticism about the feasibility -- e.g. even

Point taken about loop peeling.
There seems to be some reluctance to do this, so let me give the reasons why we need to do it.

We want a guarantee of progress when the flow of control becomes obscure, as it does when we are moving from tier 1 to tier 2 and from trace to trace; otherwise the program can get stuck. If executors are not required to make progress, we cannot stitch them together freely, as we might end up forming a loop that makes no progress. Consider three traces

If the requirement that an executor makes progress were difficult to implement, I would understand the reluctance.
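The progress requirement can be shown with a toy model (hypothetical names, not the real CPython API): if stitched executors may execute zero Tier 1 instructions before transferring to each other, a cycle A → B → C → A never advances the instruction pointer.

```python
# Toy model, not the CPython implementation: an "executor" takes an
# instruction pointer and returns (new_ip, name_of_next_executor).

def make_executor(next_name, progress):
    def execute(ip):
        # Under the proposed contract, the executor must advance ip by >= 1.
        return (ip + 1 if progress else ip), next_name
    return execute

def run(executors, ip, max_hops):
    name, hops = "A", 0
    while hops < max_hops:
        ip, name = executors[name](ip)
        hops += 1
        if ip >= 3:          # pretend the region of interest is 3 instructions
            return ip, hops  # escaped the cycle, having made progress
    return ip, hops          # gave up: the program is stuck

edges = [("A", "B"), ("B", "C"), ("C", "A")]
stuck = {n: make_executor(nxt, progress=False) for n, nxt in edges}
ok = {n: make_executor(nxt, progress=True) for n, nxt in edges}

print(run(stuck, 0, 100))  # (0, 100): 100 hops, ip never advanced
print(run(ok, 0, 100))     # (3, 3): with guaranteed progress, done in 3 hops
```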
The example is too abstract, because I haven't seen the future yet where traces are joined. Is joining the same as stitching? Where would

Regarding tricks to force progress, consider
All right, offline I got a better understanding of stitching. The idea is that (almost) every exit from a trace (as of gh-112045 always through a deopt) has a counter, and if it becomes hot enough, we project a new trace from that point, and change the deopt exit to transfer directly to that new trace. This should give us polymorphism as well (e.g. the deopt from

Anyway, I do see that this kind of stitching would be complicated if the target executor makes no progress. One simple way to guarantee that, BTW, would be to just leave certain instructions (like unspecializable
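The exit-counter scheme described in this comment could be sketched roughly as follows (a toy model with hypothetical names and an arbitrary threshold, not the actual implementation):

```python
# Toy sketch of hot-exit stitching: each deopt exit of a trace carries a
# counter; once the counter crosses a threshold, a new trace is projected
# from the exit's target and the exit is patched to jump straight to it.

HOT = 16  # arbitrary threshold for this sketch

class Trace:
    def __init__(self, start_ip, exits):
        self.start_ip = start_ip
        self.exits = exits

class Exit:
    def __init__(self, target_ip):
        self.target_ip = target_ip   # Tier 1 instruction to resume at
        self.counter = 0
        self.stitched = None         # patched to a Trace once hot

def take_exit(exit_, project):
    """Called when a trace deopts through exit_; project() builds a new trace."""
    if exit_.stitched is not None:
        return exit_.stitched        # direct trace-to-trace transfer
    exit_.counter += 1
    if exit_.counter >= HOT:
        exit_.stitched = project(exit_.target_ip)
        return exit_.stitched
    return None                      # still cold: fall back to Tier 1

e = Exit(target_ip=42)
for _ in range(HOT):
    t = take_exit(e, project=lambda ip: Trace(ip, exits=[]))
print(t.start_ip)  # the 16th deopt projects a trace starting at ip 42
```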
The contract of `_PyExecutorObject` currently is that it executes zero or more bytecode instructions. We should change that so that it must execute at least one instruction.

The reason for this change is that we want to be able to use `ENTER_EXECUTOR` anywhere, and that we will need to replace arbitrary instructions with `ENTER_EXECUTOR` (see below for why).

If a `_PyExecutorObject` executes zero instructions, then `ENTER_EXECUTOR` is responsible for executing the original instruction. If it executes one or more instructions, the behavior of the first instruction is handled by the `_PyExecutorObject`, so `ENTER_EXECUTOR` is just a simple, and fast, (tail) call.

We also want to change the signature of the `execute` function pointer to take `_PyExecutorObject **` instead of `_PyExecutorObject *`. See faster-cpython/ideas#621 for details. We might as well make both changes at once.

I think our only "real" optimizer already executes at least three instructions, so it should be a fairly easy change.

Why do we need to insert executors at arbitrary instructions?

Consider a nested if with at least two balanced hot paths, and at least one cold path. At the join point, we want both paths to continue in optimized code, but as neither represents more than 50% of the flow, they will likely stop at the join point. Ideally, they will both jump into the same, following optimized code. But in order to find it, it needs to be attached to the tier 1 instructions using `ENTER_EXECUTOR`, and that join point could be an arbitrary instruction, likely a `LOAD_FAST` or `LOAD_GLOBAL`.

Note that this makes life harder for the optimizer, as it cannot simply exit the optimized code if a guard fails in the first instruction. It is obliged to fully execute that instruction.

Other optimizers might also want to overwrite instructions with `ENTER_EXECUTOR`; PyTorch Dynamo, for example.

@gvanrossum
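As a rough illustration of the two proposed changes, here is a toy Python model (hypothetical names; the real interface is the C `_PyExecutorObject` struct, and the slot here only mimics what `_PyExecutorObject **` would allow):

```python
# Toy model of the proposed contract:
# (1) an executor must execute at least one instruction, so the dispatcher
#     never has to re-execute the original instruction itself; and
# (2) execute() receives the executor slot it was invoked through (standing
#     in for _PyExecutorObject **), so it could replace itself in place.

class Executor:
    def __init__(self, n_instrs):
        # The proposed contract: at least one instruction is executed.
        assert n_instrs >= 1, "executor must execute at least one instruction"
        self.n_instrs = n_instrs

    def execute(self, ip, slot):
        # 'slot' models _PyExecutorObject **: the executor may write a
        # replacement executor through it while running.
        return ip + self.n_instrs

def enter_executor(ip, slot):
    # Under the new contract this is just a plain (tail) call: the executor
    # handles the first instruction itself, so there is no zero-progress
    # case for ENTER_EXECUTOR to patch up.
    return slot[0].execute(ip, slot)

slot = [Executor(n_instrs=3)]
print(enter_executor(10, slot))  # 13: the instruction pointer advanced by 3
```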