Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-115754: Get singletons via function calls #116572

Closed
wants to merge 2 commits into from

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Mar 10, 2024

@vstinner
Copy link
Member Author

PR created to measure the impact on performance of getting the 5 singletons via function calls.

@vstinner
Copy link
Member Author

vstinner commented Mar 11, 2024

Patch adding a benchmark function:

diff --git a/Modules/_testcapimodule.c b/Modules/_testcapimodule.c
index b6536045e6..03dcbd0cb9 100644
--- a/Modules/_testcapimodule.c
+++ b/Modules/_testcapimodule.c
@@ -3284,6 +3284,39 @@ test_weakref_capi(PyObject *Py_UNUSED(module), PyObject *Py_UNUSED(args))
     _Py_COMP_DIAG_POP
 }
 
+static PyObject *
+bench_none(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+
+    PyTime_t t1;
+    (void)PyTime_PerfCounter(&t1);
+    for (Py_ssize_t i=0; i < loops; i++) {
+        PyObject *obj;
+        int res = 0;
+
+        obj = Py_None;
+        res += Py_IsNone(obj) + Py_IsFalse(obj) + Py_IsTrue(obj);
+
+        obj = Py_False;
+        res += Py_IsNone(obj) + Py_IsFalse(obj) + Py_IsTrue(obj);
+
+        obj = Py_True;
+        res += Py_IsNone(obj) + Py_IsFalse(obj) + Py_IsTrue(obj);
+
+        if (res != 3) {
+            abort();
+        }
+    }
+    PyTime_t t2;
+    (void)PyTime_PerfCounter(&t2);
+
+    return PyFloat_FromDouble(PyTime_AsSecondsDouble(t2 - t1));
+}
+
 
 static PyMethodDef TestMethods[] = {
     {"set_errno",               set_errno,                       METH_VARARGS},
@@ -3423,6 +3456,7 @@ static PyMethodDef TestMethods[] = {
     {"function_set_kw_defaults", function_set_kw_defaults, METH_VARARGS, NULL},
     {"check_pyimport_addmodule", check_pyimport_addmodule, METH_VARARGS},
     {"test_weakref_capi", test_weakref_capi, METH_NOARGS},
+    {"bench_none", bench_none, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };
 

Benchmark script:

import pyperf
import _testcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testcapi.bench_none)

Results of the microbenchmark with CPU isolation on Linux. Python built with -O3 without PGO nor LTO. PGO/LTO doesn't matter here, since _testcapi is built as a shared library: Py_None becomes a regular function call, there is nothing to optimize.

$ python3 -m pyperf  compare_to ref.json func.json 
Mean +- std dev: [ref] 25.8 ns +- 0.0 ns -> [func] 56.3 ns +- 0.1 ns: 2.18x slower

Converting Py_None constant to a function call makes accessing this variable around 2x slower. Well, that's not surprising, and that's the maximum overhead on a microbenchmark.

@vstinner
Copy link
Member Author

vstinner commented Mar 11, 2024

I also ran pyperformance 1.11.0, this time with PGO+LTO and CPU isolation on Linux. My shell script run to run pyperformance:

set -e -x

cd main
git clean -fdx
rm -rf /opt/bench_python

./configure --enable-optimizations --with-lto --prefix /opt/bench_python
make
make install
cd ..

/opt/bench_python/bin/python3 -m pip install pyperformance
/opt/bench_python/bin/python3 -m pyperformance run -o bench.json

Results.

Benchmarks with tag 'apps':
===========================

+----------------+----------+------------------------+
| Benchmark      | ref      | func                   |
+================+==========+========================+
| 2to3           | 429 ms   | 434 ms: 1.01x slower   |
+----------------+----------+------------------------+
| chameleon      | 11.0 ms  | 11.0 ms: 1.01x slower  |
+----------------+----------+------------------------+
| docutils       | 4.24 sec | 4.23 sec: 1.00x faster |
+----------------+----------+------------------------+
| tornado_http   | 194 ms   | 195 ms: 1.01x slower   |
+----------------+----------+------------------------+
| Geometric mean | (ref)    | 1.01x slower           |
+----------------+----------+------------------------+

Benchmark hidden because not significant (1): html5lib

Benchmarks with tag 'asyncio':
==============================

+-------------------------------+----------+------------------------+
| Benchmark                     | ref      | func                   |
+===============================+==========+========================+
| async_tree_io_tg              | 1.61 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_cpu_io_mixed | 669 ms   | 674 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_io                 | 1.60 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_io_tg        | 1.67 sec | 1.69 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_tg           | 124 ms   | 125 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_none_tg            | 655 ms   | 665 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| async_tree_memoization_tg     | 797 ms   | 811 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| Geometric mean                | (ref)    | 1.01x slower           |
+-------------------------------+----------+------------------------+

Benchmark hidden because not significant (9): async_tree_eager_cpu_io_mixed_tg, async_tree_eager, async_tree_eager_memoization_tg, async_tree_eager_memoization, async_tree_cpu_io_mixed, async_tree_cpu_io_mixed_tg, async_tree_eager_io, async_tree_memoization, async_tree_none

Benchmarks with tag 'math':
===========================

+----------------+--------+----------------------+
| Benchmark      | ref    | func                 |
+================+========+======================+
| nbody          | 132 ms | 128 ms: 1.03x faster |
+----------------+--------+----------------------+
| float          | 125 ms | 126 ms: 1.01x slower |
+----------------+--------+----------------------+
| Geometric mean | (ref)  | 1.01x faster         |
+----------------+--------+----------------------+

Benchmark hidden because not significant (1): pidigits

Benchmarks with tag 'regex':
============================

+----------------+---------+-----------------------+
| Benchmark      | ref     | func                  |
+================+=========+=======================+
| regex_v8       | 37.5 ms | 37.1 ms: 1.01x faster |
+----------------+---------+-----------------------+
| Geometric mean | (ref)   | 1.00x faster          |
+----------------+---------+-----------------------+

Benchmark hidden because not significant (3): regex_effbot, regex_compile, regex_dna

Benchmarks with tag 'serialize':
================================

+-------------------+---------+-----------------------+
| Benchmark         | ref     | func                  |
+===================+=========+=======================+
| unpickle_list     | 7.63 us | 7.53 us: 1.01x faster |
+-------------------+---------+-----------------------+
| xml_etree_process | 97.8 ms | 98.6 ms: 1.01x slower |
+-------------------+---------+-----------------------+
| json_loads        | 40.8 us | 41.3 us: 1.01x slower |
+-------------------+---------+-----------------------+
| xml_etree_parse   | 238 ms  | 241 ms: 1.01x slower  |
+-------------------+---------+-----------------------+
| json_dumps        | 15.5 ms | 16.5 ms: 1.06x slower |
+-------------------+---------+-----------------------+
| pickle            | 17.6 us | 19.7 us: 1.12x slower |
+-------------------+---------+-----------------------+
| pickle_dict       | 43.1 us | 50.2 us: 1.17x slower |
+-------------------+---------+-----------------------+
| pickle_list       | 6.64 us | 7.87 us: 1.19x slower |
+-------------------+---------+-----------------------+
| Geometric mean    | (ref)   | 1.04x slower          |
+-------------------+---------+-----------------------+

Benchmark hidden because not significant (6): unpickle, tomli_loads, unpickle_pure_python, xml_etree_generate, pickle_pure_python, xml_etree_iterparse

Benchmarks with tag 'startup':
==============================

+------------------------+---------+-----------------------+
| Benchmark              | ref     | func                  |
+========================+=========+=======================+
| python_startup         | 15.4 ms | 15.5 ms: 1.00x slower |
+------------------------+---------+-----------------------+
| python_startup_no_site | 13.6 ms | 13.6 ms: 1.01x slower |
+------------------------+---------+-----------------------+
| Geometric mean         | (ref)   | 1.00x slower          |
+------------------------+---------+-----------------------+

Benchmarks with tag 'template':
===============================

+-----------------+---------+-----------------------+
| Benchmark       | ref     | func                  |
+=================+=========+=======================+
| mako            | 16.1 ms | 16.2 ms: 1.01x slower |
+-----------------+---------+-----------------------+
| genshi_xml      | 83.8 ms | 84.6 ms: 1.01x slower |
+-----------------+---------+-----------------------+
| django_template | 56.5 ms | 57.4 ms: 1.01x slower |
+-----------------+---------+-----------------------+
| genshi_text     | 35.6 ms | 36.7 ms: 1.03x slower |
+-----------------+---------+-----------------------+
| Geometric mean  | (ref)   | 1.02x slower          |
+-----------------+---------+-----------------------+

All benchmarks:
===============

+-------------------------------+----------+------------------------+
| Benchmark                     | ref      | func                   |
+===============================+==========+========================+
| coverage                      | 499 ms   | 152 ms: 3.29x faster   |
+-------------------------------+----------+------------------------+
| dask                          | 885 ms   | 628 ms: 1.41x faster   |
+-------------------------------+----------+------------------------+
| logging_silent                | 166 ns   | 155 ns: 1.07x faster   |
+-------------------------------+----------+------------------------+
| nbody                         | 132 ms   | 128 ms: 1.03x faster   |
+-------------------------------+----------+------------------------+
| telco                         | 12.6 ms  | 12.3 ms: 1.03x faster  |
+-------------------------------+----------+------------------------+
| scimark_fft                   | 522 ms   | 511 ms: 1.02x faster   |
+-------------------------------+----------+------------------------+
| unpickle_list                 | 7.63 us  | 7.53 us: 1.01x faster  |
+-------------------------------+----------+------------------------+
| regex_v8                      | 37.5 ms  | 37.1 ms: 1.01x faster  |
+-------------------------------+----------+------------------------+
| pprint_pformat                | 2.45 sec | 2.42 sec: 1.01x faster |
+-------------------------------+----------+------------------------+
| logging_simple                | 9.51 us  | 9.43 us: 1.01x faster  |
+-------------------------------+----------+------------------------+
| pprint_safe_repr              | 1.20 sec | 1.19 sec: 1.01x faster |
+-------------------------------+----------+------------------------+
| generators                    | 45.4 ms  | 45.1 ms: 1.01x faster  |
+-------------------------------+----------+------------------------+
| coroutines                    | 35.4 ms  | 35.2 ms: 1.01x faster  |
+-------------------------------+----------+------------------------+
| docutils                      | 4.24 sec | 4.23 sec: 1.00x faster |
+-------------------------------+----------+------------------------+
| sympy_integrate               | 32.3 ms  | 32.4 ms: 1.00x slower  |
+-------------------------------+----------+------------------------+
| python_startup                | 15.4 ms  | 15.5 ms: 1.00x slower  |
+-------------------------------+----------+------------------------+
| sympy_sum                     | 254 ms   | 255 ms: 1.00x slower   |
+-------------------------------+----------+------------------------+
| dulwich_log                   | 129 ms   | 129 ms: 1.00x slower   |
+-------------------------------+----------+------------------------+
| sqlglot_optimize              | 86.5 ms  | 86.9 ms: 1.00x slower  |
+-------------------------------+----------+------------------------+
| sqlglot_normalize             | 172 ms   | 173 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| richards_super                | 83.0 ms  | 83.4 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| python_startup_no_site        | 13.6 ms  | 13.6 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| chameleon                     | 11.0 ms  | 11.0 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| asyncio_tcp_ssl               | 2.36 sec | 2.37 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| asyncio_tcp                   | 689 ms   | 694 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_io_tg              | 1.61 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| mako                          | 16.1 ms  | 16.2 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| scimark_monte_carlo           | 99.3 ms  | 100.0 ms: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_cpu_io_mixed | 669 ms   | 674 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| float                         | 125 ms   | 126 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| xml_etree_process             | 97.8 ms  | 98.6 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| deepcopy                      | 553 us   | 557 us: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_generators              | 686 ms   | 692 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| tornado_http                  | 194 ms   | 195 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| mdp                           | 3.95 sec | 3.99 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| genshi_xml                    | 83.8 ms  | 84.6 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| go                            | 222 ms   | 224 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| async_tree_io                 | 1.60 sec | 1.62 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_io_tg        | 1.67 sec | 1.69 sec: 1.01x slower |
+-------------------------------+----------+------------------------+
| async_tree_eager_tg           | 124 ms   | 125 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| nqueens                       | 121 ms   | 122 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| hexiom                        | 8.78 ms  | 8.87 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| sympy_expand                  | 751 ms   | 759 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| 2to3                          | 429 ms   | 434 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| json_loads                    | 40.8 us  | 41.3 us: 1.01x slower  |
+-------------------------------+----------+------------------------+
| richards                      | 74.0 ms  | 74.9 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| bench_thread_pool             | 1.42 ms  | 1.43 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| xml_etree_parse               | 238 ms   | 241 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| raytrace                      | 402 ms   | 408 ms: 1.01x slower   |
+-------------------------------+----------+------------------------+
| django_template               | 56.5 ms  | 57.4 ms: 1.01x slower  |
+-------------------------------+----------+------------------------+
| deepcopy_reduce               | 4.89 us  | 4.96 us: 1.01x slower  |
+-------------------------------+----------+------------------------+
| async_tree_none_tg            | 655 ms   | 665 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| crypto_pyaes                  | 103 ms   | 104 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| deepcopy_memo                 | 57.9 us  | 58.8 us: 1.02x slower  |
+-------------------------------+----------+------------------------+
| sqlglot_transpile             | 2.50 ms  | 2.54 ms: 1.02x slower  |
+-------------------------------+----------+------------------------+
| sqlglot_parse                 | 1.99 ms  | 2.02 ms: 1.02x slower  |
+-------------------------------+----------+------------------------+
| async_tree_memoization_tg     | 797 ms   | 811 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| spectral_norm                 | 155 ms   | 158 ms: 1.02x slower   |
+-------------------------------+----------+------------------------+
| pathlib                       | 26.5 ms  | 27.2 ms: 1.02x slower  |
+-------------------------------+----------+------------------------+
| scimark_sor                   | 199 ms   | 204 ms: 1.03x slower   |
+-------------------------------+----------+------------------------+
| genshi_text                   | 35.6 ms  | 36.7 ms: 1.03x slower  |
+-------------------------------+----------+------------------------+
| sqlite_synth                  | 3.73 us  | 3.90 us: 1.05x slower  |
+-------------------------------+----------+------------------------+
| typing_runtime_protocols      | 172 us   | 180 us: 1.05x slower   |
+-------------------------------+----------+------------------------+
| json_dumps                    | 15.5 ms  | 16.5 ms: 1.06x slower  |
+-------------------------------+----------+------------------------+
| pickle                        | 17.6 us  | 19.7 us: 1.12x slower  |
+-------------------------------+----------+------------------------+
| pickle_dict                   | 43.1 us  | 50.2 us: 1.17x slower  |
+-------------------------------+----------+------------------------+
| pickle_list                   | 6.64 us  | 7.87 us: 1.19x slower  |
+-------------------------------+----------+------------------------+
| Geometric mean                | (ref)    | 1.01x faster           |
+-------------------------------+----------+------------------------+

Benchmark hidden because not significant (35): unpickle, regex_effbot, gc_traversal, tomli_loads, meteor_contest, scimark_lu, logging_format, comprehensions, async_tree_eager_cpu_io_mixed_tg, regex_compile, async_tree_eager, unpack_sequence, bench_mp_pool, unpickle_pure_python, async_tree_eager_memoization_tg, pyflate, xml_etree_generate, deltablue, asyncio_websockets, fannkuch, pidigits, regex_dna, scimark_sparse_mat_mult, create_gc_cycles, html5lib, async_tree_eager_memoization, pickle_pure_python, xml_etree_iterparse, async_tree_cpu_io_mixed, sympy_str, chaos, async_tree_cpu_io_mixed_tg, async_tree_eager_io, async_tree_memoization, async_tree_none

@vstinner vstinner marked this pull request as ready for review March 11, 2024 08:40
@vstinner vstinner requested a review from encukou as a code owner March 11, 2024 08:40
@vstinner vstinner changed the title WIP: gh-115754: Get singletons via function calls gh-115754: Get singletons via function calls Mar 11, 2024
In the limited C API version 3.13, getting Py_None, Py_False,
Py_True, Py_Ellipsis and Py_NotImplemented singletons is now
implemented as function calls at the stable ABI level to hide
implementation details. Getting these constants still return borrowed
references.
@vstinner vstinner requested a review from a team as a code owner March 11, 2024 12:11
@vstinner
Copy link
Member Author

@encukou @zooba: Would you mind to review my PR?

@zooba
Copy link
Member

zooba commented Mar 11, 2024

Is that benchmark using the macro Py_IsNone or the exported Py_IsNone? Do we see a difference between the two? Not that I'm at all concerned by the benchmark - the operation is trivially optimised out of a tight loop. But having the check occur on our side means we can more easily optimise for comparison if we ever change how "get" works.

Did you try with the single Py_GetSingleton(int) approach as well? I'd expect that to be slower again, but far more forwards/backwards compatible than individual functions. And behind the macro it still looks the same to users.

@vstinner
Copy link
Member Author

Is that benchmark using the macro Py_IsNone or the exported Py_IsNone?

My microbenchmark used the macro.

@vstinner
Copy link
Member Author

Did you try with the single Py_GetSingleton(int) approach as well?

I didn't try. I dislike Py_GetSingleton(int) since we need a slower switch/case, the function can fail (for unknown constant and invalid values), and we need to implement new constants (identifiers for these constants).

@zooba
Copy link
Member

zooba commented Mar 11, 2024

I think it's best to have limited API functions that do the checks inside of Python, so that callers who have an unknown object would pass it in and we would return a bool saying if it is/isn't that singleton (or in the Py_GetSingleton approach, they could get the singleton id for an object). That allows us to do faster checks, completely avoids the borrowed/stolen reference count issues, and avoids any potential not-really-singleton issues that may arise in the future or in other implementations. (This is similar to _PyUnicode_EqualToASCIIString which is an API that I like.)

Getting another instance of the singleton and running a normal comparison is like converting your char * to PyUnicodeObject and doing the comparison. Technically it's correct and more flexible, but it's also easy and faster for us to do the check on our side.


Edit So what I'm saying is, hopefully the code that is calling the new getters is also calling the function comparisons, not the macros. I assume the benchmarks would look the same right now since the comparison is still cheap, though.

@vstinner
Copy link
Member Author

Alternative: PR gh-116883 "Add Py_GetConstantRef() function".

@vstinner
Copy link
Member Author

Alternative: PR #116883 "Add Py_GetConstantRef() function".

Apparently, this approach is preferred. I close this PR.

@vstinner vstinner closed this Mar 18, 2024
@vstinner vstinner deleted the getnone branch March 18, 2024 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants