Jit/NoJit builds #631
Can't say I like the idea of making
It is "just" to save a register, yes. Saving a register can be worth a few percent performance on x64. Almost all of the uses of

Having said that, there is no reason why we can't use separate variables to start with, then unify them later.
I am going to make a detailed plan of all the things we need to do here. Fortunately executor.c is not even 150 lines. These are its locals:

```c
_Py_CODEUNIT *ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
int pc = 0;
int opcode;
int oparg;
uint64_t operand;
```

More later.
Here are some things I've thought of so far.
It would be nice if we could make

Which reminds me, another variable that may add register pressure is
Not every executor is necessarily a "Uop" executor (at least in theory; and there are tests using the "counter" executor). The inline dispatch from Tier 1 to Tier 2, at

```c
inst(ENTER_EXECUTOR, (--)) {
    CHECK_EVAL_BREAKER();
    PyCodeObject *code = _PyFrame_GetCode(frame);
    _PyExecutorObject *executor = (_PyExecutorObject *)code->co_executors->executors[oparg&255];
    int original_oparg = executor->vm_data.oparg | (oparg & 0xfffff00);
    JUMPBY(1-original_oparg);
    frame->instr_ptr = next_instr;
    Py_INCREF(executor);
    /************ New code *************/
    if (executor->execute == _PyUopExecute) {
        self = (_PyUOpExecutorObject *)executor;
        goto dispatch_tier_two;
    }
    /************ End new code *************/
    frame = executor->execute(executor, frame, stack_pointer);
    if (frame == NULL) {
        frame = tstate->current_frame;
        goto resume_with_error;
    }
    next_instr = frame->instr_ptr;
    goto resume_frame;
}
```
Also, that would prevent us from introducing cleanup code in side exits. For example, if we want to re-materialize frames lazily or do other such work, this approach would block many optimizations.
The prototype exhibits an interesting problem on Windows: some tests using deep recursion are failing (without enabling uops!), which makes me think that the C stack frame for

```c
_Py_CODEUNIT *ip_offset = (_Py_CODEUNIT *)_PyFrame_GetCode(frame)->co_code_adaptive;
_PyUOpInstruction *next_uop = self->trace;
uint64_t operand;
```

Possible hacks to get rid of these could include:
In good news, the benchmarks show this doesn't make Tier 1 slower ("1.00x slower at 99th %ile").
The benchmarks with uops aren't any better -- if anything, it's worse ("1.07x slower").
It's not much worse though -- with just "uops-forever" I get "1.06x slower".
The merge of Tier 2 into Tier 1 has happened. Merging the resulting code into @brandtbucher's justin branch should be simple.
Hm, one thing I missed from this plan is that operands are now fetched using

How hard would it be to add the local back (like it is on debug builds) and use that instead? Does it really blow the stack?
Oh, sorry! Can you give that a try as a new PR? If the Windows CI passes it should be good.
Yep, I’ll do that tomorrow.
Much later: We now have the Tier 1 and 2 interpreters in the same function. @markshannon @brandtbucher Your thoughts?
Did merging the interpreters improve performance at all? According to a comment above, it actually made things slower.

I'm not opposed to splitting them back up. It's sort of annoying keeping track of all of the different locals and labels (like what's safe to use, jump to, etc.). Plus we could distinguish between the two tiers in

I'd just want to make sure that we're not seriously hurting tier two performance for platforms without JIT support. Unless, as you say, we never expect it to be turned on except for debugging.
Tier 2 performance is anywhere from 7-10% slower on the Linux machine at the benchmarking repo. At this rate, I don't think it will ever be faster. Mark might disagree though.
Here's a comment suggesting that the benchmarks were neutral (following comments suggest it might even be slower).
Yeah, I tend to think that if performance is neutral, the benefits of having it separate outweigh the cost, especially on Windows, where it puts a lot of stack pressure and there's some evidence of hitting compiler optimization limits.
In the Faster CPython team's internal meeting, we decided that separating this out makes sense, though the prioritization isn't clear. |
Not all platforms will support a JIT compiler, and even for those that do, building without the JIT is useful for fast build time and for testing.
The optimizer design allows us to jump from tier 1 code into optimized (tier 2) code at arbitrary points, and back from tier 2 code to exit to tier 1 code, but it does so with calls, which is a problem for a couple of reasons:
So we need a transfer mechanism that allows us to pass as much information as we need, ideally in registers, and that won't blow the stack.
JIT build
For a JIT build, we can use a custom calling convention and use tail calls everywhere. We need this for the JIT itself, so it makes sense to build the interpreter to use the same conventions.
Non-JIT build
For the non-JIT build, we should implement the tier 1 and tier 2 interpreters in a single giant function.
Transfer of control should be implemented as `goto`s, and information is passed in (C) local variables.

Types of transfer:

- Tier 1 to tier 2 (at `ENTER_EXECUTOR`) (patchable)

Maybe others?
What does this look like in `bytecodes.c`?

My preferred approach would be that each of the above transfers is expressed as a macro-like expression that is understood by the code generator and replaced with the relevant C code. Using actual C macros tends to get confusing.
Implementing this in the interpreter.
Code examples assume no computed gotos. Those are left as an exercise for the reader 🙂
`_Py_CODEUNIT *next_instr` becomes `union { _Py_CODEUNIT *tier1; _PyUOpInstruction *tier2; } next_instr`
`DISPATCH()` (although it is mostly implicit) becomes

```c
goto tier2_dispatch;
tier2_dispatch:
    switch (next_instr->tier2.opcode) {
```

or, for tier 1,

```c
goto tier1_dispatch;
tier1_dispatch:
    switch (next_instr->tier1.op.code) {
```
Patchable jumps need to pass their own address to the next piece of code.
We can pass this in a register for JIT code, for the interpreter we can pass it in memory.