Reduce stack consumption of PyObject_CallFunctionObjArgs() and like #73056
Comments
The following patch is my attempt to decrease the stack consumption of PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs() and _PyObject_CallMethodIdObjArgs(). But it doesn't affect the stack consumption. I haven't measured its performance effect yet. It seems to make the code a little cleaner. |
What do you think of using alloca() instead of the fixed-size "PyObject *small_stack[5];" buffer? Note: about your patch, try to avoid _PyObject_CallArg1() if you care about C stack usage, see issue bpo-28858. |
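For readers following along, here is a minimal sketch (names and structure invented for illustration, not the actual abstract.c code) of the trade-off being discussed: a fixed small_stack buffer that sits in the caller's C frame on every call, with a heap fallback for larger argument counts; the alloca() variant appears only in a comment.

```c
#include <Python.h>

#define SMALL_STACK_SIZE 5   /* stand-in for CPython's _PY_FASTCALL_SMALL_STACK */

/* Sketch of the two buffer strategies.  The fixed small_stack occupies
 * SMALL_STACK_SIZE pointers of this function's C frame on every call,
 * used or not; calls with more arguments fall back to the heap. */
static PyObject *
sketch_call(PyObject *callable, PyObject *const *args, Py_ssize_t nargs)
{
    PyObject *small_stack[SMALL_STACK_SIZE];
    PyObject **stack = small_stack;
    PyObject *result = NULL;

    if (nargs > SMALL_STACK_SIZE) {
        /* Heap fallback: costs an allocation, but no extra C stack.
         * The alloca() alternative,
         *     stack = alloca(nargs * sizeof(PyObject *));
         * sizes the buffer exactly, but it is still C stack memory, it is
         * not in POSIX.1, and there is no way to detect overflow. */
        stack = PyMem_Malloc(nargs * sizeof(PyObject *));
        if (stack == NULL) {
            return PyErr_NoMemory();
        }
    }

    for (Py_ssize_t i = 0; i < nargs; i++) {
        stack[i] = args[i];   /* borrowed references, as the fastcall code does */
    }

    /* The real helpers pass `stack`/`nargs` to the private fastcall entry
     * points here; this sketch falls back to a tuple just to stay
     * self-contained with public APIs only. */
    PyObject *tuple = PyTuple_New(nargs);
    if (tuple != NULL) {
        for (Py_ssize_t i = 0; i < nargs; i++) {
            Py_INCREF(stack[i]);
            PyTuple_SET_ITEM(tuple, i, stack[i]);
        }
        result = PyObject_Call(callable, tuple, NULL);
        Py_DECREF(tuple);
    }

    if (stack != small_stack) {
        PyMem_Free(stack);
    }
    return result;
}
```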
How do you check the stack consumption of PyObject_CallFunctionObjArgs()? |
alloca() is not in POSIX.1. I'm afraid it would make CPython less portable.
I don't understand how I can avoid it.
Using a script from bpo-28858. |
New changeset 71876e4abce4 by Victor Stinner in branch 'default': |
I reworked abstract.c to prepare for the work on this issue:
I wrote a _testcapi function to measure the C stack consumption. I was surprised by the results: calling PyObject_CallFunctionObjArgs(func, arg1, arg2, NULL) consumes 560 bytes! I measured this on a Python compiled in release mode. The attached less_stack.patch rewrites _PyObject_VaCallFunctionObjArgs(); it reduces the stack consumption from 560 bytes to 384 bytes (-176 bytes!). Changes:
Effect of _PY_FASTCALL_SMALL_STACK:
|
I don't propose to add _testcapi.pyobjectl_callfunctionobjargs_stacksize(); it's just there to test the patch. I'm using it with: $ ./python -c 'import _testcapi; n=100; print(_testcapi.pyobjectl_callfunctionobjargs_stacksize(n) / (n+1))' The value of n has no impact on the stack; it gives the same value with n=0. |
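The actual helper from testcapi_stacksize.patch is not reproduced here. The following is a hypothetical, simplified variant (it probes a single call rather than taking a recursion count n): record the address of a local variable in the caller, call into a C callback through PyObject_CallFunctionObjArgs(), and report the difference between the two addresses, which approximates the C stack consumed by the call machinery.

```c
#include <Python.h>
#include <stdint.h>

static uintptr_t callee_sp;            /* filled in by the callback */

/* Callback invoked through the call machinery under measurement. */
static PyObject *
probe_callback(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args))
{
    volatile char marker;
    callee_sp = (uintptr_t)&marker;    /* approximate stack pointer in the callee */
    Py_RETURN_NONE;
}

static PyMethodDef probe_def = {"probe", probe_callback, METH_VARARGS, NULL};

/* Hypothetical _testcapi entry point: C stack used by one call through
 * PyObject_CallFunctionObjArgs(). */
static PyObject *
callfunctionobjargs_stacksize(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args))
{
    volatile char marker;
    uintptr_t caller_sp = (uintptr_t)&marker;

    PyObject *func = PyCFunction_New(&probe_def, NULL);
    if (func == NULL) {
        return NULL;
    }
    PyObject *res = PyObject_CallFunctionObjArgs(func, NULL);
    Py_DECREF(func);
    if (res == NULL) {
        return NULL;
    }
    Py_DECREF(res);

    /* Stack growth direction is platform dependent, hence the absolute diff. */
    uintptr_t used = caller_sp > callee_sp ? caller_sp - callee_sp
                                           : callee_sp - caller_sp;
    return PyLong_FromSize_t((size_t)used);
}
```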
I also tried to use alloca(): see attached alloca.patch. But the result is quite bad: 528 bytes of stack memory per call. I only attach the patch to discuss the issue, but I now dislike the option: the result is bad, it's less portable and more dangerous. |
I also tried Serhiy's approach of splitting the function into subfunctions, but the result is not as good as expected: 496 bytes. See attached subfunc.patch. |
For comparison, Python 3.5 (before fast calls) uses 448 bytes of C stack per call. Python 3.5 uses a tuple allocated in the heap memory. |
I have tested all three patches with the stack_overflow.py script. The only affected cases are recursive Python implementations of __call__, __getitem__ and __iter__.
test_python_call 9696 9876 9880 9876 |
testcapi_stacksize.patch: add _testcapi.pyobjectl_callfunctionobjargs_stacksize(), function used to measure the stack consumption. |
no_small_stack.patch: And now for something completely different, a patch to remove the "small stack" allocated on the C stack and always use heap memory. FYI I created no_small_stack.patch from less_stack.patch. As expected, the stack usage is lower:
I didn't check the performance of no_small_stack.patch yet. |
In Python 3.5, PyObject_CallFunctionObjArgs() calls objargs_mktuple() which uses Py_VA_COPY(countva, va) and creates a tuple. The tuple constructor uses a free list to reduce the cost of heap memory allocations. |
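As a rough sketch (not the exact 3.5 source), the tuple-based approach works like this: count the NULL-terminated va_list, build a heap-allocated tuple, and let the caller pass it to PyObject_Call(). The function name here is made up.

```c
#include <Python.h>
#include <stdarg.h>

/* Sketch of the pre-fastcall approach: pack a NULL-terminated va_list of
 * PyObject* arguments into a tuple allocated on the heap.  The small-tuple
 * free list keeps the allocation cheap. */
static PyObject *
mktuple_from_va(va_list va)
{
    va_list countva;
    Py_ssize_t n = 0;

    va_copy(countva, va);              /* Py_VA_COPY in the CPython source */
    while (va_arg(countva, PyObject *) != NULL) {
        n++;
    }
    va_end(countva);

    PyObject *args = PyTuple_New(n);
    if (args == NULL) {
        return NULL;
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *item = va_arg(va, PyObject *);
        Py_INCREF(item);               /* the tuple owns a reference to each argument */
        PyTuple_SET_ITEM(args, i, item);
    }
    return args;                       /* caller passes this to PyObject_Call() */
}
```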
I modified Serhiy's stack_overflow.py of bpo-28858:
Maximum number of Python calls before a crash:
(*) Reference (unpatched): 560 bytes/call, test_python_call 7172
(1) no_small_stack.patch: 368 bytes/call, test_python_call 7172 (=)
(2) less_stack.patch: 384 bytes/call, test_python_call 7272 (+100)
(3) subfunc.patch: 496 bytes/call, test_python_call 7272 (+100)
(4) alloca.patch: 528 bytes/call, test_python_call 7272 (+100)
Patches sorted by bytes/call, from best to worst: no_small_stack.patch (368) > less_stack.patch (384) > subfunc.patch (496) > alloca.patch (528) > reference (560).
Patches sorted by number of calls before a crash: subfunc.patch (20 004) > alloca.patch (19 488) > no_small_stack.patch (19 288) > less_stack.patch (19 112) > reference (18 838).
I expected a correlation between the bytes/call measured by testcapi_stacksize.patch and the number of calls before a crash, but I fail to see an obvious one :-/ Maybe the compiler is smarter than I expect and emits efficient code that uses less stack memory? Maybe the Linux kernel does weird things which make the behaviour on stack overflow non-obvious :-) At the very least, I would expect no_small_stack.patch to be the clear winner, since it uses the least C stack. |
Impact of the _PY_FASTCALL_SMALL_STACK constant:
test_python_call 7376
test_python_call 7272
test_python_call 7172
test_python_call 6984
Increasing _PY_FASTCALL_SMALL_STACK has a clear effect on the total: the total decreases when _PY_FASTCALL_SMALL_STACK increases.
---
no_small_stack.patch with _PY_FASTCALL_SMALL_STACK=3: 368 bytes/call, test_python_call 7272 |
I'm not sure that the result of pyobjectl_callfunctionobjargs_stacksize() has a direct relation to the stack consumption in test_python_call, test_python_getitem and test_python_iterator. Try to measure the stack consumption in these cases. This can be done with a _testcapi helper that just returns the value of the stack pointer. Run all three tests with a fixed level of recursion and measure the difference between the stack pointers. It would also be nice to measure the performance effect of the patches. |
testcapi_stack_pointer.patch: add _testcapi.stack_pointer() function. |
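The patch itself is not reproduced here; a plausible minimal implementation of such a helper could look like the sketch below (the real one may differ): return the address of a local variable as an approximation of the current C stack pointer. A script can then call it at two fixed recursion depths and divide the difference by the depth delta to get bytes per call.

```c
#include <Python.h>

/* Hypothetical _testcapi.stack_pointer(): the address of a local variable
 * approximates the current C stack pointer closely enough to compute
 * per-call differences between two recursion depths. */
static PyObject *
stack_pointer(PyObject *Py_UNUSED(self), PyObject *Py_UNUSED(args))
{
    volatile char here;
    return PyLong_FromVoidPtr((void *)&here);
}

static PyMethodDef stack_pointer_methods[] = {
    {"stack_pointer", stack_pointer, METH_NOARGS,
     "Return an approximation of the current C stack pointer."},
    {NULL, NULL, 0, NULL}
};
```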
stack_overflow_28870-sp.py: script using testcapi_stack_pointer.patch to compute the usage of the C stack. Results of this script:
(*) Reference: test_python_call: 7175 calls before crash, stack: 1168 bytes/call => total: 18754 calls, 4080 bytes
(1) no_small_stack.patch: test_python_call: 7175 calls before crash, stack: 1168 bytes/call => total: 19294 calls, 3952 bytes
test_python_call is clearly not impacted by no_small_stack.patch. test_python_call loops on method_call():
method_call()
=> _PyObject_Call_Prepend()
=> _PyObject_FastCallDict()
=> _PyFunction_FastCallDict()
=> _PyEval_EvalCodeWithName()
=> PyEval_EvalFrameEx()
=> _PyEval_EvalFrameDefault()
=> call_function()
=> _PyObject_FastCallKeywords()
=> slot_tp_call()
=> PyObject_Call()
=> method_call()
=> (...)
_PyObject_Call_Prepend() is in the middle of the chain. This function uses a "small stack" of _PY_FASTCALL_SMALL_STACK "PyObject*" items. We can clearly see the impact of modifying _PY_FASTCALL_SMALL_STACK on the maximum number of calls before the crash. |
no_small_stack-2.patch: Remove all "small_stack" buffers.
Reference: test_python_call: 7175 calls before crash, stack: 1168 bytes/call => total: 18754 calls, 4080 bytes
no_small_stack.patch: test_python_call: 7482 calls (+307) before crash, stack: 1120 bytes/call (-48) => total: 19890 calls (+1136), 3840 bytes (-240)
The total gain is the removal of 5 small buffers of 48 bytes: 240 bytes. |
Oops, you should read no_small_stack-2.patch in my previous message ;-) |
Python 3.5 (revision 8125d9a8152b), before all fastcall changes: test_python_call: 8314 calls before crash, stack: 1008 bytes/call => total: 22599 calls, 3360 bytes |
Python 3.4 (rev 6340c9fcc111): test_python_call: 9700 calls before crash, stack: 864 bytes/call => total: 25832 calls, 2944 bytes
Python 2.7 (rev 0d4e0a736688): test_python_call: 6162 calls before crash, stack: 1360 bytes/call => total: 17999 calls, 4192 bytes
Nice. At least, Python 3.7 is better than Python 2.7 (4080 bytes < 4192 bytes). |
no_small_stack-2.patch has a very bad impact on performance: haypo@speed-python$ python3 -m perf compare_to 2017-01-04_12-02-default-ee1390c9b585.json no_small_stack-2_refee1390c9b585.json -G --min-speed=5 Slower (59):
Benchmark ran on speed-python with PGO+LTO, Linux configured for benchmarks using python3 -m perf system tune. |
Thus Python 3.6 stack usage is about 20% larger than Python 3.5 and about 40% larger than Python 3.4. This is significant. :-( no_small_stack-2.patch decreases it only by 6% (with possible performance loss). |
Yeah, if we want to get back to Python 3.4 efficiency, we need to find the other functions which now use more stack memory ;-) The discussed "small stack" buffers are only responsible for 96 bytes, not a big deal compared to the total of 4080 bytes. |
Stack used by each C function of test_python_call.
3.4: (a) method_call: 64 (b) PyObject_Call: 48 (c) PyEval_EvalFrameEx: 256 (d) slot_tp_call: 64 => total: 864
default: (a) method_call: 80 (b) _PyObject_FastCallDict: 64 (c) _PyEval_EvalFrameDefault: 320 (d) slot_tp_call: 64 => total: 1120
Groups of functions, 3.4 => default: (a) 64 => 80 (+16)
I used gdb: (gdb) set $last=0
|
I created the issue bpo-29227 "Reduce C stack consumption in function calls" which contains a first simple patch with a significant effect on the C stack. |
It seems like the subfunc.patch approach of using the "no inline" attribute helps.
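To illustrate the technique (a generic, invented example, not the actual subfunc.patch split): moving a rarely-taken, stack-hungry branch into a separate function marked "noinline" keeps its locals out of the hot function's frame, since an inlined callee's locals would otherwise be reserved in the caller's frame on every call. The attribute spelling shown is GCC/Clang-specific; MSVC spells it __declspec(noinline).

```c
#include <Python.h>

/* Rare, stack-hungry path: formatting the error message needs a sizeable
 * local buffer.  Marked noinline so the 256 bytes are only reserved on the
 * C stack when this path actually runs, instead of being folded into every
 * caller's frame by inlining. */
static __attribute__((noinline)) PyObject *
raise_detailed_error(PyObject *callable)
{
    char buffer[256];
    PyOS_snprintf(buffer, sizeof(buffer), "cannot call object of type %.200s",
                  Py_TYPE(callable)->tp_name);
    PyErr_SetString(PyExc_TypeError, buffer);
    return NULL;
}

/* Hot path: stays small, so its own frame stays small. */
static PyObject *
call_checked(PyObject *callable, PyObject *args)
{
    if (!PyCallable_Check(callable)) {
        return raise_detailed_error(callable);   /* cold, out-of-line path */
    }
    return PyObject_Call(callable, args, NULL);
}
```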
I pushed 3 changes:
Before (rev a30cdf366c02): test_python_call: 7175 calls before crash, stack: 1168 bytes/call => total: 18754 calls, 4080 bytes
With these 3 changes (rev 6478e6d0476f): test_python_call: 8587 calls before crash, stack: 976 bytes/call => total: 25712 calls, 2944 bytes
The default branch is now as good as Python 3.4 in terms of stack consumption, and Python 3.4 was the Python version which used the least stack memory according to my tests. I didn't touch the _PY_FASTCALL_SMALL_STACK value; it's still 5 arguments (40 bytes), so my changes should not impact performance. |
Result of the attached bench_recursion-2.py comparing before/after the 3 changes reducing the stack consumption: test_python_call: Median +- std dev: [a30cdf366c02] 512 us +- 12 us -> [6478e6d0476f] 467 us +- 21 us: 1.10x faster (-9%) At least, it doesn't seem to be slower. Maybe the speedup comes from call_function() inlining; this function was probably already inlined when using a PGO build. The script was written by Serhiy in issue bpo-29227; I modified it to use the Runner.timeit() API for convenience. |
Awesome! You are great Victor! |
I also ran the reliable performance benchmark suite with LTO+PGO. There is no significant performance change on these benchmarks. The largest change is on scimark_lu (-13%), but there was a hiccup on the previous change, which is probably small instability in the benchmark; it's not a speedup from these changes. The second largest change is on spectral_norm: +9%. But this benchmark is known to be unstable and there was already a small peak previously; again, I don't think it's related to these changes. |
"The default branch is now as good as Python 3.4, in term of stack consumption, and Python 3.4 was the Python version which used the least stack memory according to my tests." I consider that the initial issue is now fixed, so I close the issue. Thanks Serhiy for the tests, reviews, ideas and obvious the bug report ;-) I never looked at the stack usage before. |