gh-112354: Add executor for less-taken branch #112902
Conversation
Offline we discussed a different way to satisfy the progress requirement: Instead of inserting an ENTER_EXECUTOR in the Tier 1 stream for "side exit" traces/executors, have a pointer to the secondary executor directly in the executor whose side exit it represents. Then the requirement becomes "executors attached to ENTER_EXECUTOR must make progress" and all is well.
Force-pushed from c2f00b0 to 65c41cb.
Regarding forward progress, we could add a boolean parameter to the optimize function requiring forward progress of the executor. In the optimize function, we can ensure forward progress by de-specializing the first instruction. For example, if the first (tier 1) instruction is …
But in this PR we won't need that.
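The de-specialization idea above can be sketched in Python. This is an illustrative model only, not CPython's C code: the table is a tiny hypothetical subset analogous in spirit to the internal `_PyOpcode_Deopt` mapping, which maps a specialized instruction back to its generic form so a rebuilt trace cannot start with a guard that would fail the same way again.

```python
# Hypothetical sketch: map specialized opcodes back to their generic
# ("deopt") form. The table is an illustrative subset, not CPython's.
DEOPT = {
    "FOR_ITER_LIST": "FOR_ITER",
    "FOR_ITER_RANGE": "FOR_ITER",
    "LOAD_ATTR_INSTANCE_VALUE": "LOAD_ATTR",
}

def despecialize(opcode: str) -> str:
    """Return the generic form of a possibly-specialized opcode."""
    # Opcodes not in the table are already generic.
    return DEOPT.get(opcode, opcode)

print(despecialize("FOR_ITER_LIST"))  # FOR_ITER
print(despecialize("JUMP_BACKWARD"))  # JUMP_BACKWARD (already generic)
```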
(The latter as yet unused.)
Force-pushed from 864d68d to 0f64231.
Here's a new version.
TODO: The side exit executors are leaked when the main executor is deallocated. Easy to fix, I just forgot, and it's time to stop working today. Also, I need to add tests, and for that, I need to add a way to get the sub-executors from a main executor (since the sub-executors are not reachable via …). @markshannon Please have another look.
There seems to be a lot of unnecessary work happening when moving from trace to trace or from tier to tier.
The stack_pointer and frame attributes should be correctly handled in both tier 1 and tier 2 interpreters. They shouldn't need fixing up.
opcode = next_instr[1].op.code;
}

// For selected opcodes build a new executor and enter it now.
Why "selected opcodes", why not everywhere?
In an earlier version that somehow didn't work. Right now the check whether the new trace isn't going to immediately deopt again relies on these opcodes. I figured once we have the side exit machinery working we could gradually increase the scope to other deoptimizations. Also, not all deoptimizations are worthy of the effort (e.g. the PEP 523 test).
No special cases, please; it just makes the code more complicated and slower.
If we want to treat some exits differently, let's do it properly (faster-cpython/ideas#638), not here.
There are several reasons. First, as I explain below, for bytecodes other than branches, I can't promise an exact check for whether the newly created sub-executor doesn't just repeat the same deoptimizing uop that triggered its creation (in which case the sub-executor would always deopt immediately if it is entered at all).
Second, for most bytecodes other than branches, deoptimization paths are relatively rare (IIRC this is apparent from the pystats data -- with the exception of some LOAD_ATTR specializations).
For branches, we expect many cases where the "common" path is not much more common than the "uncommon" path (e.g. 60/40 or 70/30). Now, it might make sense to have a different special case here, where if e.g. _GUARD_IS_TRUE_POP has a hot side-exit, we know that the branch goes the other way, so we can simply create a sub-executor starting at the less-common branch target. The current approach doesn't do this (mostly because I'd have to thread the special case all the way through the various optimizer functions) but just creates a new executor starting at the original Tier 1 branch bytecode -- in the expectation that, if the counters are tuned just right, we will have executed the less-common branch in Tier 1 while taking the common branch in Tier 2, so that Tier 1's shift register has changed state and now indicates that the less-common branch is actually taken more frequently. The checks at L1161 and ff. below are a safeguard in case that hasn't happened yet (there are all kinds of interesting scenarios, e.g. loops that don't iterate much -- remember that the first iteration each time we enter a loop is done in Tier 1, where we stay until we hit the JUMP_BACKWARD bytecode at the end of the loop).
I propose this PR as a starting point for further iterations, not as the ultimate design for side-exits. Let's discuss this Monday.
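The "shift register" mentioned above can be modeled as a fixed-width bit history of recent branch outcomes. The sketch below is an assumption-laden simplification (the width and the majority test are illustrative, not CPython's exact scheme): each execution shifts in one outcome bit, and a popcount indicates which direction is currently more common.

```python
# Illustrative model of a 16-bit branch-history shift register.
# Not CPython's implementation; width and threshold are assumptions.
WIDTH = 16

def record(history: int, taken: bool) -> int:
    """Shift the latest branch outcome into a WIDTH-bit history."""
    return ((history << 1) | int(taken)) & ((1 << WIDTH) - 1)

def mostly_taken(history: int) -> bool:
    """True if more than half of the recorded outcomes were 'taken'."""
    return bin(history).count("1") > WIDTH // 2

h = 0
# 12 taken followed by 4 not-taken: the register still reads "mostly taken".
for taken in [True] * 12 + [False] * 4:
    h = record(h, taken)
print(mostly_taken(h))  # True
```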
...and set resume_threshold so they are actually produced.
@markshannon Please review again. I did some of the things you asked for; for a few others I explained why not. TODO: …
This has no business being in this PR. This reverts commit f1998c0.
@@ -2353,6 +2353,7 @@ int
 void
 _Py_Specialize_ForIter(PyObject *iter, _Py_CODEUNIT *instr, int oparg)
 {
+    assert(_PyOpcode_Deopt[instr->op.code] == FOR_ITER);
We should really add such asserts to many specialization functions; I ran into this one during an intense debugging session.
The assert can be instr->op.code == FOR_ITER, and it shouldn't be necessary, as _Py_Specialize_ForIter is only called from FOR_ITER.
I tried that and I get a Bus error. And of course it's not supposed to be called with something else! But a logic error in my early prototype caused that to happen, and it took me quite a while to track it down.
goto enter_tier_two;  // All systems go!
}

// The trace is guaranteed to deopt again; forget about it.
Is it? Why?
See explanation above.
Py_DECREF(current_executor);
current_executor = (_PyUOpExecutorObject *)*pexecutor;

// Reject trace if it repeats the uop that just deoptimized.
Why?
This test may be a bit imprecise(*), but it tries to discard the case where, even though the counter in the executor indicated that this side exit is "hot", the Tier 1 bytecode hasn't been re-specialized yet. In that case the trace projection will just repeat the uop that just took a deopt side exit, causing it to immediately deopt again. This seems a waste of time and executors -- eventually the sub-executor's deopt counter will also indicate it is hot, and then we'll try again, but it seems better (if we can catch it) to avoid creating the sub-executor in the first place, relying on exponential backoff for the side-exit counter instead (implemented below at L1180 and ff.).
For various reasons, the side-exit counters and the Tier 1 deopt counters don't run in sync, so it's possible that the side-exit counter triggers before the Tier 1 counter has re-specialized. This check gives that another chance.
The test that I would like to use here would be to check whether the Tier 1 opcode is still unchanged (i.e., not re-specialized), but the executor doesn't record that information (and it would take up a lot of space; we'd need at least an extra byte for each uop that can deoptimize).
(*) The test I wrote is exact for the conditional branches I special-cased above (that's why there's a further special case here for _IS_NONE). For other opcodes it may miss a few cases, e.g. when a single T1 bytecode translates to multiple guards and the failing guard is not the first uop in the translation (this would always happen for calls, whose translation always starts with _PEP_523, which never deopts in cases we care about). In those cases we can produce a sub-executor that immediately deoptimizes. (And we never try to re-create executors, no matter how often one deoptimizes -- that's a general flaw in the current executor architecture that we should probably file separately.)
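The exponential-backoff idea for the side-exit counter can be sketched as follows. This is a hedged illustration only; the class, method names, and thresholds are hypothetical, not CPython's data structures. Each time sub-executor creation is rejected, the hotness threshold doubles, so a side exit that keeps producing useless traces triggers re-attempts progressively less often.

```python
# Hypothetical sketch of exponential backoff for a side-exit counter.
# All names and constants here are illustrative assumptions.
class SideExitCounter:
    def __init__(self, initial_threshold: int = 16, max_threshold: int = 4096):
        self.threshold = initial_threshold
        self.max_threshold = max_threshold
        self.count = 0

    def hit(self) -> bool:
        """Record one side exit; return True once the exit is 'hot'."""
        self.count += 1
        return self.count >= self.threshold

    def backoff(self) -> None:
        """Called when the projected trace was rejected: reset the count
        and double the threshold (capped) before the next attempt."""
        self.count = 0
        self.threshold = min(self.threshold * 2, self.max_threshold)

c = SideExitCounter()
hot = False
for _ in range(16):
    hot = c.hit()
print(hot)  # True: threshold of 16 reached
c.backoff()
print(c.threshold)  # 32
```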
#113104 unifies …
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's talk offline about special cases for side exits on Monday. I would prefer to do only the special cases first and generalize later, but I hear that you prefer a different development strategy.
Some food for thought for Monday's discussion.
Offline we decided to give this a rest. Also, we are back to requiring executors to make progress. A few ideas out of the discussion: …
Closing in preference of the data structures proposed in faster-cpython/ideas#644.
(For a description of what changed since the first version, see later comments -- almost everything is taken care of.)
This brings up many questions, but shows a possible way forward.

What's wrong so far?
- … EXTENDED_ARG)
- … _PyExecutorObject and _PyOptimizerObject
- … #108866 (but for now is the easiest way out)
- … uint16_t type/size)

What's right?
- EXTENDED_ARG: The trick is that the deopt goes to the EXTENDED_ARG, so we must decode the instruction before checking for ENTER_EXECUTOR.
- The src and dest arguments to _PyOptimizer_BackEdge and friends are slightly different when EXTENDED_ARG is present: src points to the actual instruction (e.g. JUMP_BACKWARD) while dest may point to an EXTENDED_ARG opcode. This is important when reusing the code for inserting an executor at a place that's not a JUMP_BACKWARD.

@markshannon @brandtbucher
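The EXTENDED_ARG point above can be illustrated with a small Python model. This is a simplified sketch, not CPython's C decoder: the opcode numbers and the instruction representation are assumptions. It shows why the prefixes must be consumed (accumulating the oparg) before the real opcode can be compared against ENTER_EXECUTOR.

```python
# Simplified model of EXTENDED_ARG decoding; opcode numbers are
# illustrative assumptions, and instructions are (opcode, arg) pairs.
EXTENDED_ARG = 144
ENTER_EXECUTOR = 230

def decode(code, i):
    """Skip EXTENDED_ARG prefixes starting at index i, accumulating
    their bytes into the oparg; return (opcode, oparg, index)."""
    oparg = 0
    while code[i][0] == EXTENDED_ARG:
        oparg = (oparg << 8) | code[i][1]
        i += 1
    opcode, arg = code[i]
    return opcode, (oparg << 8) | arg, i

# A deopt target pointing at an EXTENDED_ARG prefix: only after
# decoding do we see the real opcode and the full oparg 0x0102.
code = [(EXTENDED_ARG, 0x01), (ENTER_EXECUTOR, 0x02)]
opcode, oparg, i = decode(code, 0)
print(opcode == ENTER_EXECUTOR, hex(oparg))  # True 0x102
```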