Jit/NoJit builds #631
Can't say I like the idea of making
It is "just" to save a register, yes. Saving a register can be worth a few percent performance on x64. Almost all of the uses of

Having said that, there is no reason why we can't use separate variables to start with, then unify them later.
I am going to make a detailed plan of all the things we need to do here. Fortunately executor.c is not even 150 lines. These are its locals:

```c
_Py_CODEUNIT *ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
int pc = 0;
int opcode;
int oparg;
uint64_t operand;
```

More later.
Here are some things I've thought of so far.
It would be nice if we could make

Which reminds me, another variable that may add register pressure is
Not every executor is necessarily a "Uop" executor (at least in theory; and there are tests using the "counter" executor). The inline dispatch from Tier 1 to Tier 2, at

```c
inst(ENTER_EXECUTOR, (--)) {
    CHECK_EVAL_BREAKER();
    PyCodeObject *code = _PyFrame_GetCode(frame);
    _PyExecutorObject *executor = (_PyExecutorObject *)code->co_executors->executors[oparg&255];
    int original_oparg = executor->vm_data.oparg | (oparg & 0xfffff00);
    JUMPBY(1-original_oparg);
    frame->instr_ptr = next_instr;
    Py_INCREF(executor);
    /************ New code *************/
    if (executor->execute == _PyUopExecute) {
        self = (_PyUOpExecutorObject *)executor;
        goto dispatch_tier_two;
    }
    /************ End new code *************/
    frame = executor->execute(executor, frame, stack_pointer);
    if (frame == NULL) {
        frame = tstate->current_frame;
        goto resume_with_error;
    }
    next_instr = frame->instr_ptr;
    goto resume_frame;
}
```
Also, that would prevent us from introducing cleanup code in side exits. For example, if we want to re-materialize frames lazily or do other such work, this approach would block many optimizations.
The prototype exhibits an interesting problem on Windows: some tests using deep recursion are failing (without enabling uops!), which makes me think that the C stack frame for

```c
_Py_CODEUNIT *ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
_PyUOpInstruction *next_uop = self->trace;
uint64_t operand;
```

Possible hacks to get rid of these could include:
In good news, the benchmarks show this doesn't make Tier 1 slower ("1.00x slower at 99th %ile").
The benchmarks with uops aren't any better -- if anything, it's worse ("1.07x slower").
It's not much worse though -- with just "uops-forever" I get "1.06x slower".
The merge of Tier 2 into Tier 1 has happened. Merging the resulting code into @brandtbucher's justin branch should be simple.
Hm, one thing I missed from this plan is that operands are now fetched using

How hard would it be to add the local back (like it is on debug builds) and use that instead? Does it really blow the stack?
Oh, sorry! Can you give that a try as a new PR? If the Windows CI passes it should be good.
Yep, I’ll do that tomorrow.
Much later: We now have the Tier 1 and 2 interpreters in the same function. @markshannon @brandtbucher Your thoughts?
Did merging the interpreters improve performance at all? According to a comment above, it actually made things slower.

I'm not opposed to splitting them back up. It's sort of annoying keeping track of all of the different locals and labels (like what's safe to use, jump to, etc.). Plus we could distinguish between the two tiers in

I'd just want to make sure that we're not seriously hurting tier two performance for platforms without JIT support. Unless, as you say, we never expect it to be turned on except for debugging.
Tier 2 performance is anywhere from 7-10% slower on the Linux machine at the benchmarking repo. At this rate, I don't think it will ever be faster. Mark might disagree though.
Here's a comment suggesting that the benchmarks were neutral (following comments suggest it might even be slower).
Yeah, I tend to think that if performance is neutral, the benefits of having it separate outweigh the cost, especially on Windows, where it puts a lot of stack pressure and there's some evidence of hitting compiler optimization limits.
In the Faster CPython team's internal meeting, we decided that separating this out makes sense, though the prioritization isn't clear. |
Not all platforms will support a JIT compiler, and even for those that do, building without the JIT is useful for fast build time and for testing.
The optimizer design allows us to jump from tier 1 code into optimized (tier 2) code at arbitrary points, and back from tier 2 code to exit to tier 1 code, but it does so with calls, which is a problem for a couple of reasons:
So we need a transfer mechanism that allows us to pass as much information as we need, ideally in registers, and that won't blow the stack.
JIT build
For a JIT build, we can use a custom calling convention and use tail calls everywhere. We need this for the JIT itself, so it makes sense to build the interpreter to use the same conventions.
Non-JIT build
For the non-JIT build, we should implement the tier 1 and tier 2 interpreters in a single giant function.
Transfer of control should be implemented as `goto`s, and information is passed in (C) local variables.

Types of transfer:

- Tier 1 to tier 2 (at `ENTER_EXECUTOR`) (patchable)

Maybe others?
What does this look like in `bytecodes.c`?

My preferred approach would be that each of the above transfers is expressed as a macro-like expression that is understood by the code generator and replaced with the relevant C code. Using actual C macros tends to get confusing.
Implementing this in the interpreter.
Code examples assume no computed gotos. Those are left as an exercise for the reader 🙂
`_Py_CODEUNIT *next_instr` becomes `union { _Py_CODEUNIT *tier1; _PyUOpInstruction *tier2; } next_instr`
`DISPATCH()` (although it is mostly implicit) becomes

```c
goto tier2_dispatch;
tier2_dispatch:
    switch (next_instr->tier2.opcode) {
```

or, for tier 1,

```c
goto tier1_dispatch;
tier1_dispatch:
    switch (next_instr->tier1.op.code) {
```
Patchable jumps need to pass their own address to the next piece of code.
We can pass this in a register for JIT code, for the interpreter we can pass it in memory.