Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better specialization of calls, post introduction of PRECALL. #267

Closed
markshannon opened this issue Feb 10, 2022 · 19 comments
Closed

Better specialization of calls, post introduction of PRECALL. #267

markshannon opened this issue Feb 10, 2022 · 19 comments
Assignees
Labels
epic-specialization More specialization work for 3.12

Comments

@markshannon
Copy link
Member

With the new opcodes, the sequence of instructions executed during a call looks like this:

call_sequence

The stats https://github.com/faster-cpython/ideas/blob/main/stats.md#call show that we only have a 72% hit rate on the standard benchmark suite.

Specialization attempts

Count Ratio
Success 2909102 24.4%
Failure 9035032 75.6%
Failure kind Count Ratio
bound method 2136774 23.6%
complex parameters 1252534 13.9%
python class 1245231 13.8%
pycfunction with keywords 859904 9.5%
class no vectorcall 723781 8.0%
kwnames 634422 7.0%
pycfunction 586186 6.5%
pycfunction noargs 407864 4.5%
class mutable 327462 3.6%
other 232681 2.6%
bad call flags 179897 2.0%
cmethod 168390 1.9%
pycfunction fast with keywords 162744 1.8%
str 79833 0.9%
method wrapper 32561 0.4%
operator wrapper 4768 0.1%

Each of these failures need a different strategy, so multiple strategies.

Bound methods.

Bounds methods can come from two places, bound methods object explicitly used in the program, and classmethods.
The first should be handled in PRECALL_FUNCTION, the second in LOAD_METHOD.

Complex parameters.

This needs more investigation, to see if some of these can be specialized.

Python class

Should be handled in PRECALL_FUNCTION which will create the self object and push a clean-up frame, leaving the __init__ method to be handled by CALL.

Builtin functions and classes using the older caller conventions

These should be fixed by modernizing the callee, not accommodating them in the interpreter.

Mutable classes

Mystery classification. Needs investigation, possibly a bug in the classification.

@markshannon markshannon self-assigned this Feb 10, 2022
@markshannon
Copy link
Member Author

The above sequence is still a bit messy, with an unnecessary distinction between calling an attribute, and other calls.
Ideally we want one PRECALL instruction that can be specialized for the type (or in some cases, value) of the callable, and the CALL instruction that can be specialized for the "shape" of the call (i.e. number of args, named args, etc).

LOAD_METHOD puts two values on the stack, either func, self or NULL, func.
The extra stack space allows us to unpack a bound-method without expensive stack manipulations.
There is no reason why we can't generalize this to all calls, allowing us that freedom to unpack (or repack) as needed.
By prefixing non-method call sequences with a PUSH_NULL instruction we can unify the two PRECALLs and reduce the number of special cases required.

call_sequence_with_null

Now PRECALL n expects the n+1 and n+2 items on the stack to be func, self or NULL, func, regardless of how the callable was evaluated.

The downside is that we are adding yet another instruction into the call sequence. Stats show that PUSH_NULL represents about 3% of instructions executed. Given it does so little work, the cost is just dispatching, so should be ~0.3%.

However, PUSH_NULL occurs before LOAD_GLOBAL in about 2/3s of cases.
Specializations of LOAD_GLOBAL can include a flag to push a NULL and skip an instruction without branching, so the expected overhead should be < 0.2%

More uniform and effective specialization should pay this back and plenty more.

@markshannon
Copy link
Member Author

markshannon commented Feb 17, 2022

Specialization strategy for all currently specialized types/values and most of the types that we currently fail on.

What PRECALL Named keywords? CALL
Python function pass yes Two specializations: One for exact args. One to handle defaults and kwnames.
type(1), str(1), tuple(1) Here, skip over CALL no no
Most builtin functions Here, skip over CALL no no
builtin functions with METH_FASTCALL | METH_KEYWORDS pass yes No change to current specialization
len/isinstance Here, skip over CALL no no
Most method descriptors Here, skip over CALL no no
method descriptors with METH_FASTCALL | METH_KEYWORDS pass yes Much the same as for the equivalent builtin functions
Bound methods Unpack onto stack, if space* yes No, should be unpacked so will specialize as Python functions.
Python class Allocate object, push cleanup frame. Leave __init__ self on stack, if space* yes No, expect __init__ to be a Python function

* If the lower slot on the stack is NULL we treat can this the same as PRECALL_FUNCTION.

The only class of failures in https://github.com/faster-cpython/ideas/blob/main/stats.md not covered is "complex parameters" which are Python functions, so would be handled by adding more cases to the specialization of Python functions.

@markshannon
Copy link
Member Author

markshannon commented Feb 17, 2022

Looking at the above table, it is clear that we should place KW_NAMES before PRECALL.

Builtin functions and method descriptors with METH_FASTCALL | METH_KEYWORDS cannot currently be specialized in the PRECALL because we don't know if we have kwnames.

Swapping KW_NAMES and PRECALL fixes that. It makes no difference to any other calls.

call_sequence_with_null_v3

@markshannon
Copy link
Member Author

markshannon commented Feb 17, 2022

Resulting in this table:

What PRECALL Named keywords? CALL
Python function pass yes Two specializations: One for exact args. One to handle defaults and kwnames.
type(1), str(1), tuple(1) Here, skip over CALL no no
Most builtin functions Here, skip over CALL no no
builtin functions with METH_FASTCALL | METH_KEYWORDS As current specialization yes no
len/isinstance Here, skip over CALL no no
Most method descriptors Here, skip over CALL no no
method descriptors with METH_FASTCALL | METH_KEYWORDS Much the same as for the equivalent builtin functions yes no
Bound methods Unpack onto stack, if space yes No, should be unpacked so will specialize as Python functions.
Python class Allocate object, push cleanup frame. Leave __init__ self on stack, if space yes No, expect __init__ to be a Python function

@markshannon
Copy link
Member Author

In summary, we will:

  1. Move KW_NAMES before PRECALL.
  2. Move all CALL specializations, except for Python functions, to PRECALL
  3. Add specializations for bound methods and Python classes to PRECALL.
  4. Extend CALL specializations to handle Python functions with more complex arguments.

@markshannon
Copy link
Member Author

It looks like specializing calls to Python classes is more complex than I allowed for.

The three possible approaches we have so far come up with are:

  1. My original idea: Push a temporary frame to clean. I've done some work on implementing this and it involves some moving of arguments to the new frame, which has a cost. It also needs one or more dynamically created Python functions for each class to allow effective specialization of the call to __init__ and do the cleanup.
  2. Add some state to the frame add cleanup in RETURN_VALUE bpo-46939: Specialize calls to Python classes python/cpython#31707
  3. Add a third bytecode to call sequence, POSTCALL. POSTCALL would be a no-op for all calls, except calls to Python classes. POSTCALL could be jumped over by the specializations of PRECALL as they do with CALL, so the cost would be small for those cases and a bit larger for calls to bound methods and Python functions.

TBH, all these solutions are a bit unsatisfying.

  1. Adds some overhead and a fair bit of complexity, for what might be quite limited performance advantages.
  2. Seems fragile and may slow down calls to normal Python functions and bound methods.
  3. Makes the code objects even larger and will slow down calls to normal Python functions and bound methods.

I think we should put this on hold for now.
Maybe a better approach will emerge.
Maybe we will need to change other things and one of the existing approaches will become more viable.

@TeamSpen210
Copy link

I'm not sure how well this would work, but would it be possible to do modifications to the __init__ function itself? When specialising class calls go through that function and swap out all the RETURN_VALUE opcodes, and add another opcode to the start. Adding a flag indicating if the code object was altered this way would ensure it only gets scanned once. There would need to also be a frame flag like in the second method that the new opcodes would check to deactivate the extra functionality if __init__ was called manually.

@Fidget-Spinner
Copy link
Collaborator

Fidget-Spinner commented Mar 9, 2022

Spencer, I toyed with the idea too but ditched it because it means specialization attempts will be very expensive for large __init__ functions.

Mark, I will bench approach 2 once with pyperformance. If there's an overall speedup, I think it won't matter if functions slow down a little (specialization slows down all non-specialized stuff anyways and we still do it :). However, the fragility will definitely still be a concern even with good pyperformance numbers.

@brandtbucher
Copy link
Member

brandtbucher commented Mar 9, 2022

Ideally our solution would be general enough to also work for other cases where we may want to do some "cleanup" after calling a Python function. A few possible examples:

  • BINARY_OP: raise TypeError on NotImplemented
  • FOR_ITER: jump on StopIteration (or when returning from a generator frame)
  • STORE_SUBSCR: ignore return value
  • STORE_ATTR: ignore return value

@markshannon
Copy link
Member Author

Regarding STORE_SUBSCR and STORE_ATTR, see #241

@markshannon
Copy link
Member Author

The benchmarks for 2 look promising.
I will try to get some numbers for 3 asap.
I'm going to ditch 1 for now, as too complex.

@Fidget-Spinner
Copy link
Collaborator

Fidget-Spinner commented Mar 9, 2022

I hope I'm not repeating after anyone else here, but maybe a fourth approach that merges 2 and 3 would work:

  1. Taking inspiration from how LOAD_METHOD deals with this problem (push [NULL, meth] or [meth, self] onto the stack depending on success): We can make PRECALL set [self, func, self] onto the stack when it sees __init__, or [NULL, func, self] otherwise. POSTCALL then looks for either [self, retvalue] or [NULL, retvalue], and POP/PUSH to make it become [self/retvalue]. The only question here is whether all the stack manipulation will slow things even more. The slowdown from POSTCALL to everything else should be the same as approach 2 (ie it's still worth it) so it's not in question. IMO it shouldn't be noticeable. As Mark mentioned above, anything that uses PRECALL to call won't be affected. This includes most callables.

@Fidget-Spinner
Copy link
Collaborator

Fidget-Spinner commented Mar 10, 2022

Update:
I tried a modified version of 4. ([meth, self, arg, ..., self|NULL]) instead of [self|NULL, meth, self, arg,..] because it requires less stack manipulation. I couldn't get it to work and I have no clue why so I'm abandoning it :(. https://github.com/Fidget-Spinner/cpython/tree/specialize_py_class_postcall

The idea there is that apart from the PRECALLs which SKIP_CALL(), all other PRECALL pushes NULL or self onto TOS. CALL will then call without that NULL|self. POSTCALL just picks either the self or retval to be TOS. Full illustration here. I spent a few hours debugging but I can't find out why things on the stack aren't always cleaned up.

@Fidget-Spinner
Copy link
Collaborator

Fidget-Spinner commented Mar 16, 2022

Update again:
I managed to get modified 4 to work at python/cpython#31936. I'm now in favour of POSTCALL over the other methods. It allows us to specialize for other complex call types (e.g. call a class with builtin __init__), whereas method 2 is a one-trick pony that works only for python __init__ calls. Also there's only extra overhead for calls to python functions. All the other call opcodes skip over the POSTCALL.

@markshannon
Copy link
Member Author

markshannon commented Apr 7, 2022

I'm still not liking any of the approaches to specializing calls to Python classes much, but the complexity of the other solutions means that I'm inclining to 1.

Approach 1 (pushing a temporary frame) is robust and doesn't bulk out the calling sequence even more.

Approach 4 seems reasonably robust, but adds an extra cache and instruction to the call sequence, which will have knock on effects.

We can make approach 1 reasonably efficient by adding a trampoline function, that calls __init__, to each class.
By having one trampoline function per class, the trampoline functions can be tailored to exactly match the parameters of __init__.
That way we specialize in the usual place, but can move the cleanup code out of line.

This init_trampoline would have the following semantics, where <args> exactly match the arguments of the __init__ function.

def init_trampoline(self, <args>):
     res = type(self).__dict__["__init__"](self, <args>)
     if res is not None:
           raise ...
     return self

We can implement type(self).__dict__["__init__"](self, <args>) as a single ENTER_INIT n instruction which pushes the frame for __init__ and jumps to it.
This might not work for *args and **kwargs, but I think we can assume that they are rare.
It could also work for classes with builtin __init__ functions, although that might need a bit of tweaking.

As @Fidget-Spinner points out when talking about approach 4, this approach has the potential to be generalized to other calls, where we want to substitute one callable for another, to expose better optimization opportunities.

@markshannon
Copy link
Member Author

@Fidget-Spinner thanks for help and perseverance with this issue.
The performance of calls is important, and Python's uniquely complex function parameters and variety of callables do make this challenging.
We are getting there, if somewhat slowly.

@Fidget-Spinner
Copy link
Collaborator

Fidget-Spinner commented Apr 7, 2022

@markshannon I think you got it right when talking about the pros and cons of 1 vs 4 . I'll reiterate here so I remember in the future:
1's Pros:

  • Doesn't slow down python-to-python calls.
  • Doesn't use more memory.

1's Cons:

  • Needs more effort to specialize for as many shapes of __init__.

4's Pros:

  • Simpler implementation (though this is questionable)
  • Extremely versatile. Currently delegates specialization of __init__ shape to CALL so all current specialized forms of CALL can be reused. E.g CALL_PY_EXACT_ARGS deals with simple __init__s, CALL_PY_WITH_DEFAULTS deals with __init__ with defaults, and so on for all future specializations added to CALL.

4's Cons:

  • Uses more memory (1 POSTCALL instruction, 2 cache entries in PRECALL)
  • Slightly slower calling of python functions due to additional stack manipulation and POSTCALL overhead.

Personally, I still lean a little towards approach 4. Mainly because it saves us a lot of code. Every time we add a new CALL_X specialization, it will also speed up calls to __init__ since approach 4 delegates specialization of __init__s shape away to CALL. But I'm not opposed to 1 either, I'm happy as long as class calls speed up someway or another.

@markshannon
Copy link
Member Author

It's time for the next chapter in this ongoing saga...

If we choose option 1 above, and make an assumption about the use of bound-methods, then we can remove the PRECALL instruction from the calling sequence, which becomes:

call_sequence_with_null_v4

The assumption I want to make is this:
Almost all calls to bound-methods have simple arguments. I think this is a reasonable assumption as bound-methods are generally used as a performance optimization, so it is reasonable to assume that they will be called with simple arguments, as that has historically been the fastest way to make a call.

Given that, we can add a CALL specialization CALL_BOUND_METHOD_EXACT_ARGS which will do the job of PRECALL_BOUND_METHOD followed by CALL_PY_EXACT_ARGS.

@markshannon
Copy link
Member Author

Since PRECALL has been removed, this issue is now obsolete. See #447 for how we plan to specialize the remaining calls.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic-specialization More specialization work for 3.12
Projects
Development

No branches or pull requests

5 participants