Better specialization of calls, post introduction of `PRECALL`. #267

markshannon · 2022-02-10T17:35:21Z

With the new opcodes, the sequence of instructions executed during a call looks like this:

The stats https://github.com/faster-cpython/ideas/blob/main/stats.md#call show that we only have a 72% hit rate on the standard benchmark suite.

Specialization attempts

	Count	Ratio
Success	2909102	24.4%
Failure	9035032	75.6%

Failure kind	Count	Ratio
bound method	2136774	23.6%
complex parameters	1252534	13.9%
python class	1245231	13.8%
pycfunction with keywords	859904	9.5%
class no vectorcall	723781	8.0%
kwnames	634422	7.0%
pycfunction	586186	6.5%
pycfunction noargs	407864	4.5%
class mutable	327462	3.6%
other	232681	2.6%
bad call flags	179897	2.0%
cmethod	168390	1.9%
pycfunction fast with keywords	162744	1.8%
str	79833	0.9%
method wrapper	32561	0.4%
operator wrapper	4768	0.1%

Each of these failures need a different strategy, so multiple strategies.

Bound methods.

Bounds methods can come from two places, bound methods object explicitly used in the program, and classmethods.
The first should be handled in PRECALL_FUNCTION, the second in LOAD_METHOD.

Complex parameters.

This needs more investigation, to see if some of these can be specialized.

Python class

Should be handled in PRECALL_FUNCTION which will create the self object and push a clean-up frame, leaving the __init__ method to be handled by CALL.

Builtin functions and classes using the older caller conventions

These should be fixed by modernizing the callee, not accommodating them in the interpreter.

Mutable classes

Mystery classification. Needs investigation, possibly a bug in the classification.

The text was updated successfully, but these errors were encountered:

markshannon · 2022-02-16T12:06:51Z

The above sequence is still a bit messy, with an unnecessary distinction between calling an attribute, and other calls.
Ideally we want one PRECALL instruction that can be specialized for the type (or in some cases, value) of the callable, and the CALL instruction that can be specialized for the "shape" of the call (i.e. number of args, named args, etc).

LOAD_METHOD puts two values on the stack, either func, self or NULL, func.
The extra stack space allows us to unpack a bound-method without expensive stack manipulations.
There is no reason why we can't generalize this to all calls, allowing us that freedom to unpack (or repack) as needed.
By prefixing non-method call sequences with a PUSH_NULL instruction we can unify the two PRECALLs and reduce the number of special cases required.

Now PRECALL n expects the n+1 and n+2 items on the stack to be func, self or NULL, func, regardless of how the callable was evaluated.

The downside is that we are adding yet another instruction into the call sequence. Stats show that PUSH_NULL represents about 3% of instructions executed. Given it does so little work, the cost is just dispatching, so should be ~0.3%.

However, PUSH_NULL occurs before LOAD_GLOBAL in about 2/3s of cases.
Specializations of LOAD_GLOBAL can include a flag to push a NULL and skip an instruction without branching, so the expected overhead should be < 0.2%

More uniform and effective specialization should pay this back and plenty more.

markshannon · 2022-02-17T12:50:38Z

Specialization strategy for all currently specialized types/values and most of the types that we currently fail on.

What	`PRECALL`	Named keywords?	`CALL`
Python function	pass	yes	Two specializations: One for exact args. One to handle defaults and kwnames.
type(1), str(1), tuple(1)	Here, skip over `CALL`	no	no
Most builtin functions	Here, skip over `CALL`	no	no
builtin functions with `METH_FASTCALL \| METH_KEYWORDS`	pass	yes	No change to current specialization
len/isinstance	Here, skip over `CALL`	no	no
Most method descriptors	Here, skip over `CALL`	no	no
method descriptors with `METH_FASTCALL \| METH_KEYWORDS`	pass	yes	Much the same as for the equivalent builtin functions
Bound methods	Unpack onto stack, if space^*	yes	No, should be unpacked so will specialize as Python functions.
Python class	Allocate object, push cleanup frame. Leave `__init__` `self` on stack, if space^*	yes	No, expect `__init__` to be a Python function

^* If the lower slot on the stack is NULL we treat can this the same as PRECALL_FUNCTION.

The only class of failures in https://github.com/faster-cpython/ideas/blob/main/stats.md not covered is "complex parameters" which are Python functions, so would be handled by adding more cases to the specialization of Python functions.

markshannon · 2022-02-17T13:05:06Z

Looking at the above table, it is clear that we should place KW_NAMES before PRECALL.

Builtin functions and method descriptors with METH_FASTCALL | METH_KEYWORDS cannot currently be specialized in the PRECALL because we don't know if we have kwnames.

Swapping KW_NAMES and PRECALL fixes that. It makes no difference to any other calls.

markshannon · 2022-02-17T13:19:54Z

Resulting in this table:

What	`PRECALL`	Named keywords?	`CALL`
Python function	pass	yes	Two specializations: One for exact args. One to handle defaults and kwnames.
type(1), str(1), tuple(1)	Here, skip over `CALL`	no	no
Most builtin functions	Here, skip over `CALL`	no	no
builtin functions with `METH_FASTCALL \| METH_KEYWORDS`	As current specialization	yes	no
len/isinstance	Here, skip over `CALL`	no	no
Most method descriptors	Here, skip over `CALL`	no	no
method descriptors with `METH_FASTCALL \| METH_KEYWORDS`	Much the same as for the equivalent builtin functions	yes	no
Bound methods	Unpack onto stack, if space	yes	No, should be unpacked so will specialize as Python functions.
Python class	Allocate object, push cleanup frame. Leave `__init__` `self` on stack, if space	yes	No, expect `__init__` to be a Python function

markshannon · 2022-02-17T13:22:24Z

In summary, we will:

Move KW_NAMES before PRECALL.
Move all CALL specializations, except for Python functions, to PRECALL
Add specializations for bound methods and Python classes to PRECALL.
Extend CALL specializations to handle Python functions with more complex arguments.

markshannon · 2022-03-09T10:54:30Z

It looks like specializing calls to Python classes is more complex than I allowed for.

The three possible approaches we have so far come up with are:

My original idea: Push a temporary frame to clean. I've done some work on implementing this and it involves some moving of arguments to the new frame, which has a cost. It also needs one or more dynamically created Python functions for each class to allow effective specialization of the call to __init__ and do the cleanup.
Add some state to the frame add cleanup in RETURN_VALUE bpo-46939: Specialize calls to Python classes python/cpython#31707
Add a third bytecode to call sequence, POSTCALL. POSTCALL would be a no-op for all calls, except calls to Python classes. POSTCALL could be jumped over by the specializations of PRECALL as they do with CALL, so the cost would be small for those cases and a bit larger for calls to bound methods and Python functions.

TBH, all these solutions are a bit unsatisfying.

Adds some overhead and a fair bit of complexity, for what might be quite limited performance advantages.
Seems fragile and may slow down calls to normal Python functions and bound methods.
Makes the code objects even larger and will slow down calls to normal Python functions and bound methods.

I think we should put this on hold for now.
Maybe a better approach will emerge.
Maybe we will need to change other things and one of the existing approaches will become more viable.

TeamSpen210 · 2022-03-09T11:43:07Z

I'm not sure how well this would work, but would it be possible to do modifications to the __init__ function itself? When specialising class calls go through that function and swap out all the RETURN_VALUE opcodes, and add another opcode to the start. Adding a flag indicating if the code object was altered this way would ensure it only gets scanned once. There would need to also be a frame flag like in the second method that the new opcodes would check to deactivate the extra functionality if __init__ was called manually.

Fidget-Spinner · 2022-03-09T11:55:54Z

Spencer, I toyed with the idea too but ditched it because it means specialization attempts will be very expensive for large __init__ functions.

Mark, I will bench approach 2 once with pyperformance. If there's an overall speedup, I think it won't matter if functions slow down a little (specialization slows down all non-specialized stuff anyways and we still do it :). However, the fragility will definitely still be a concern even with good pyperformance numbers.

brandtbucher · 2022-03-09T14:36:14Z

Ideally our solution would be general enough to also work for other cases where we may want to do some "cleanup" after calling a Python function. A few possible examples:

BINARY_OP: raise TypeError on NotImplemented
FOR_ITER: jump on StopIteration (or when returning from a generator frame)
STORE_SUBSCR: ignore return value
STORE_ATTR: ignore return value

markshannon · 2022-03-09T15:29:53Z

Regarding STORE_SUBSCR and STORE_ATTR, see #241

markshannon · 2022-03-09T15:53:35Z

The benchmarks for 2 look promising.
I will try to get some numbers for 3 asap.
I'm going to ditch 1 for now, as too complex.

Fidget-Spinner · 2022-03-09T16:04:48Z

I hope I'm not repeating after anyone else here, but maybe a fourth approach that merges 2 and 3 would work:

Taking inspiration from how LOAD_METHOD deals with this problem (push [NULL, meth] or [meth, self] onto the stack depending on success): We can make PRECALL set [self, func, self] onto the stack when it sees __init__, or [NULL, func, self] otherwise. POSTCALL then looks for either [self, retvalue] or [NULL, retvalue], and POP/PUSH to make it become [self/retvalue]. The only question here is whether all the stack manipulation will slow things even more. The slowdown from POSTCALL to everything else should be the same as approach 2 (ie it's still worth it) so it's not in question. IMO it shouldn't be noticeable. As Mark mentioned above, anything that uses PRECALL to call won't be affected. This includes most callables.

Fidget-Spinner · 2022-03-10T10:57:01Z

Update:
I tried a modified version of 4. ([meth, self, arg, ..., self|NULL]) instead of [self|NULL, meth, self, arg,..] because it requires less stack manipulation. I couldn't get it to work and I have no clue why so I'm abandoning it :(. https://github.com/Fidget-Spinner/cpython/tree/specialize_py_class_postcall

The idea there is that apart from the PRECALLs which SKIP_CALL(), all other PRECALL pushes NULL or self onto TOS. CALL will then call without that NULL|self. POSTCALL just picks either the self or retval to be TOS. Full illustration here. I spent a few hours debugging but I can't find out why things on the stack aren't always cleaned up.

Fidget-Spinner · 2022-03-16T15:22:20Z

Update again:
I managed to get modified 4 to work at python/cpython#31936. I'm now in favour of POSTCALL over the other methods. It allows us to specialize for other complex call types (e.g. call a class with builtin __init__), whereas method 2 is a one-trick pony that works only for python __init__ calls. Also there's only extra overhead for calls to python functions. All the other call opcodes skip over the POSTCALL.

markshannon · 2022-04-07T09:29:46Z

I'm still not liking any of the approaches to specializing calls to Python classes much, but the complexity of the other solutions means that I'm inclining to 1.

Approach 1 (pushing a temporary frame) is robust and doesn't bulk out the calling sequence even more.

Approach 4 seems reasonably robust, but adds an extra cache and instruction to the call sequence, which will have knock on effects.

We can make approach 1 reasonably efficient by adding a trampoline function, that calls __init__, to each class.
By having one trampoline function per class, the trampoline functions can be tailored to exactly match the parameters of __init__.
That way we specialize in the usual place, but can move the cleanup code out of line.

This init_trampoline would have the following semantics, where <args> exactly match the arguments of the __init__ function.

def init_trampoline(self, <args>):
     res = type(self).__dict__["__init__"](self, <args>)
     if res is not None:
           raise ...
     return self

We can implement type(self).__dict__["__init__"](self, <args>) as a single ENTER_INIT n instruction which pushes the frame for __init__ and jumps to it.
This might not work for *args and **kwargs, but I think we can assume that they are rare.
It could also work for classes with builtin __init__ functions, although that might need a bit of tweaking.

As @Fidget-Spinner points out when talking about approach 4, this approach has the potential to be generalized to other calls, where we want to substitute one callable for another, to expose better optimization opportunities.

markshannon · 2022-04-07T09:33:46Z

@Fidget-Spinner thanks for help and perseverance with this issue.
The performance of calls is important, and Python's uniquely complex function parameters and variety of callables do make this challenging.
We are getting there, if somewhat slowly.

Fidget-Spinner · 2022-04-07T13:02:15Z

@markshannon I think you got it right when talking about the pros and cons of 1 vs 4 . I'll reiterate here so I remember in the future:
1's Pros:

Doesn't slow down python-to-python calls.
Doesn't use more memory.

1's Cons:

Needs more effort to specialize for as many shapes of __init__.

4's Pros:

Simpler implementation (though this is questionable)
Extremely versatile. Currently delegates specialization of __init__ shape to CALL so all current specialized forms of CALL can be reused. E.g CALL_PY_EXACT_ARGS deals with simple __init__s, CALL_PY_WITH_DEFAULTS deals with __init__ with defaults, and so on for all future specializations added to CALL.

4's Cons:

Uses more memory (1 POSTCALL instruction, 2 cache entries in PRECALL)
Slightly slower calling of python functions due to additional stack manipulation and POSTCALL overhead.

Personally, I still lean a little towards approach 4. Mainly because it saves us a lot of code. Every time we add a new CALL_X specialization, it will also speed up calls to __init__ since approach 4 delegates specialization of __init__s shape away to CALL. But I'm not opposed to 1 either, I'm happy as long as class calls speed up someway or another.

markshannon · 2022-05-18T09:53:01Z

It's time for the next chapter in this ongoing saga...

If we choose option 1 above, and make an assumption about the use of bound-methods, then we can remove the PRECALL instruction from the calling sequence, which becomes:

The assumption I want to make is this:
Almost all calls to bound-methods have simple arguments. I think this is a reasonable assumption as bound-methods are generally used as a performance optimization, so it is reasonable to assume that they will be called with simple arguments, as that has historically been the fastest way to make a call.

Given that, we can add a CALL specialization CALL_BOUND_METHOD_EXACT_ARGS which will do the job of PRECALL_BOUND_METHOD followed by CALL_PY_EXACT_ARGS.

markshannon · 2023-08-03T12:17:41Z

Since PRECALL has been removed, this issue is now obsolete. See #447 for how we plan to specialize the remaining calls.

markshannon self-assigned this Feb 10, 2022

markshannon mentioned this issue Feb 16, 2022

bpo-46329: Change calling sequence (again) python/cpython#31373

Merged

markshannon mentioned this issue Feb 21, 2022

bpo-46329: Streamline calling sequence a bit. python/cpython#31465

Merged

Fidget-Spinner mentioned this issue Mar 6, 2022

bpo-46939: Specialize calls to Python classes python/cpython#31707

Closed

3 tasks

markshannon mentioned this issue May 18, 2022

Improve efficiency of PRECALL/CALL instructions python/cpython#90690

Closed

mdboom added the epic-specialization More specialization work for 3.12 label Aug 2, 2022

markshannon closed this as completed Aug 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better specialization of calls, post introduction of `PRECALL`. #267

Better specialization of calls, post introduction of `PRECALL`. #267

markshannon commented Feb 10, 2022

markshannon commented Feb 16, 2022

markshannon commented Feb 17, 2022 •

edited

Loading

markshannon commented Feb 17, 2022 •

edited

Loading

markshannon commented Feb 17, 2022 •

edited

Loading

markshannon commented Feb 17, 2022

markshannon commented Mar 9, 2022

TeamSpen210 commented Mar 9, 2022

Fidget-Spinner commented Mar 9, 2022 •

edited

Loading

brandtbucher commented Mar 9, 2022 •

edited

Loading

markshannon commented Mar 9, 2022

markshannon commented Mar 9, 2022

Fidget-Spinner commented Mar 9, 2022 •

edited

Loading

Fidget-Spinner commented Mar 10, 2022 •

edited

Loading

Fidget-Spinner commented Mar 16, 2022 •

edited

Loading

markshannon commented Apr 7, 2022 •

edited

Loading

markshannon commented Apr 7, 2022

Fidget-Spinner commented Apr 7, 2022 •

edited

Loading

markshannon commented May 18, 2022

markshannon commented Aug 3, 2023

Better specialization of calls, post introduction of PRECALL. #267

Better specialization of calls, post introduction of PRECALL. #267

Comments

markshannon commented Feb 10, 2022

Specialization attempts

Bound methods.

Complex parameters.

Python class

Builtin functions and classes using the older caller conventions

Mutable classes

markshannon commented Feb 16, 2022

markshannon commented Feb 17, 2022 • edited Loading

Specialization strategy for all currently specialized types/values and most of the types that we currently fail on.

markshannon commented Feb 17, 2022 • edited Loading

markshannon commented Feb 17, 2022 • edited Loading

markshannon commented Feb 17, 2022

markshannon commented Mar 9, 2022

TeamSpen210 commented Mar 9, 2022

Fidget-Spinner commented Mar 9, 2022 • edited Loading

brandtbucher commented Mar 9, 2022 • edited Loading

markshannon commented Mar 9, 2022

markshannon commented Mar 9, 2022

Fidget-Spinner commented Mar 9, 2022 • edited Loading

Fidget-Spinner commented Mar 10, 2022 • edited Loading

Fidget-Spinner commented Mar 16, 2022 • edited Loading

markshannon commented Apr 7, 2022 • edited Loading

markshannon commented Apr 7, 2022

Fidget-Spinner commented Apr 7, 2022 • edited Loading

markshannon commented May 18, 2022

markshannon commented Aug 3, 2023

Better specialization of calls, post introduction of `PRECALL`. #267

Better specialization of calls, post introduction of `PRECALL`. #267

markshannon commented Feb 17, 2022 •

edited

Loading

markshannon commented Feb 17, 2022 •

edited

Loading

markshannon commented Feb 17, 2022 •

edited

Loading

Fidget-Spinner commented Mar 9, 2022 •

edited

Loading

brandtbucher commented Mar 9, 2022 •

edited

Loading

Fidget-Spinner commented Mar 9, 2022 •

edited

Loading

Fidget-Spinner commented Mar 10, 2022 •

edited

Loading

Fidget-Spinner commented Mar 16, 2022 •

edited

Loading

markshannon commented Apr 7, 2022 •

edited

Loading

Fidget-Spinner commented Apr 7, 2022 •

edited

Loading