Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-115754: Expose constants (ex: Py_None) as public ABI symbols (stable ABI) #16

Closed
vstinner opened this issue Mar 5, 2024 · 13 comments

Comments

@vstinner
Copy link

vstinner commented Mar 5, 2024

Currently, the C API implements Py_None, Py_True, Py_False, Py_Ellipsis and Py_NotImplemented constants by referring to private symbols at the ABI level:

#define Py_None (&_Py_NoneStruct)
#define Py_False _PyObject_CAST(&_Py_FalseStruct)
#define Py_True _PyObject_CAST(&_Py_TrueStruct)
#define Py_Ellipsis (&_Py_EllipsisObject)
#define Py_NotImplemented (&_Py_NotImplementedStruct)

In the stable ABI, I would like to avoid referring to private symbols, but only use public symbols.

The issue python/cpython#115754 discussed two options:

  • (A) Expose "Py_None" API as PyObject *Py_None symbol at the ABI level.
  • (B) Implement "Py_None" API as a function call. Example: #define Py_None Py_GetNone() with PyObject* Py_GetNone(void), where Py_GetNone() cannot fail.

Please read the issue python/cpython#115754 discussion, especially @zooba's comment python/cpython#115754 (comment) which lists possible constraints of such API. Extract:

  • lazy type initialization
  • thread/interpreter-local types
  • synchronized/locked access
  • switch from singleton to non-singleton
  • always return new references
  • run-time library delegation (e.g. load python3.dll and have it forward to whichever python3x.dll it's using)
  • static analysis for reference counts
  • "fake" singletons
  • deprecation/warnings/errors
  • GIL enforcement

Please vote for your preferred option, or add a comment if you want a 3rd option.

I wrote python/cpython#115755 which is a simple change implementing option (A).


I prefer option (A). I agree that option (B) avoids any potential/hypothetical issue in the future, but I don't want to write a perfect API/ABI, only make pragmatic changes.

Option (B) can have a (significant?) impact on performance, especially on hot code. So far, nobody ran microbenchmarks, no number is available.

I would prefer to only consider option (B) once we will deeply design the stable ABI. Right now, we are far from being able to support tagged pointers for example. I would prefer to not pay a performance overhead for a theoretical issue :-(

Note: Subinterpreters are sharing singletons thanks to immortal objects (PEP 683, Python 3.12).


By the way, I already added Py_Is(x, y), Py_IsNone(x), Py_IsTrue(x) and Py_IsNone(x) for PyPy. In PyPy, you cannot simply compare an object to a constant memory address (if (x == Py_None) ...), since PyPy objects can be moved in memory.

This issue is specific to CPython. Other Python implementations are free to use a different implementation and a different ABI.

@vstinner
Copy link
Author

vstinner commented Mar 6, 2024

@gvanrossum:

I don't have time for this, but I'm surprised you haven't asked @ericsnowcurrently yet.

I wrote:

Note: Subinterpreters are sharing singletons thanks to immortal objects (PEP 683, Python 3.12).

In the past, I proposed adding Py_GetNone() and have per-interpreter singletons: python/cpython#83692 Apparently, @ericsnowcurrently didn't like this idea and it was decided to go with immortal objects instead.

@encukou
Copy link
Collaborator

encukou commented Mar 6, 2024

I prefer that stable ABI is useful for alternative implementations: they nicely show the design space that CPython might want to explore in the future. If we want to hide implementation details, looking at PyPy is a pretty good way to do it.
Hence, I voted for option B.

I would prefer to only consider option (B) once we will deeply design the stable ABI.

I'd prefer that we do that now -- for new additions.
If we don't want to design new API now, let's stick to what works.

@ericsnowcurrently
Copy link

  • (B) Implement "Py_None" API as a function call. Example: #define Py_None Py_GetNone() with PyObject* Py_GetNone(void), where Py_GetNone() cannot fail.

A similar option, which I believe @zooba suggested:

  • (C) Implement a generic getter. Example: #define Py_None Py_GetBuiltinObject(Py_NONE).

@ericsnowcurrently
Copy link

  • (A) Expose "Py_None" API as PyObject *Py_None symbol at the ABI level.

Objects exposed directly by the C-API were one of the few hard problems I had to solve for a per-interpreter GIL. If it hadn't been for immortal objects, I would have had to either do some messy macro trickery or introduce costly fixup code across most of the C-API implementation (using the existing symbols as . In addition to the extra complexity, the macro hacks would have probably been a further pain point for other compilers/implementations. The fixup code solution would have been fragile. It also doesn't help that we expose values for some of the objects (e.g. the non-exception builtin types) rather than just pointers. Note that even with object immortality I still had to add a trick for static builtin types, to accommodate tp_dict and tp_subclasses. Overall, the whole situation was a real pain.

Everything would have been drastically simpler if we had never exposed the objects directly in the first place. With that in mind, I firmly favor option (B). Let's move away from exposing objects directly in the C-API, rather than doing it more.

@vstinner
Copy link
Author

vstinner commented Mar 7, 2024

Option (B) can have a (significant?) impact on performance, especially on hot code. So far, nobody ran microbenchmarks, no number is available.

Hum, I will try to work on a benchmark to have numbers. It may help to take a decision.

@vstinner
Copy link
Author

I wrote python/cpython#116572 to measure the performance overhead. I modified the constant everywhere for this benchmark, including inside Python internals. So the measurement is the worst case scenario.

Results with PGO, LTO and CPU isolation on Linux: python/cpython#116572 (comment)

  • Around 2x slower on a microbenchmark.
  • I'm not sure how to read pyperformance benchmark results. The overall summary is 1.01x faster: so it's not 2x slower, and "a little bit faster" can be just noise in the measurement.

6 benchmarks are at least 1.05x slower:

| sqlite_synth                  | 3.73 us  | 3.90 us: 1.05x slower  |
+-------------------------------+----------+------------------------+
| typing_runtime_protocols      | 172 us   | 180 us: 1.05x slower   |
+-------------------------------+----------+------------------------+
| json_dumps                    | 15.5 ms  | 16.5 ms: 1.06x slower  |
+-------------------------------+----------+------------------------+
| pickle                        | 17.6 us  | 19.7 us: 1.12x slower  |
+-------------------------------+----------+------------------------+
| pickle_dict                   | 43.1 us  | 50.2 us: 1.17x slower  |
+-------------------------------+----------+------------------------+
| pickle_list                   | 6.64 us  | 7.87 us: 1.19x slower  |

3 benchmarks are at least 1.05x faster:

| coverage                      | 499 ms   | 152 ms: 3.29x faster   |
+-------------------------------+----------+------------------------+
| dask                          | 885 ms   | 628 ms: 1.41x faster   |
+-------------------------------+----------+------------------------+
| logging_silent                | 166 ns   | 155 ns: 1.07x faster   |

3x faster for coverage looks really strange. Maybe I didn't run benchmarks properly, I'm not sure.

@vstinner
Copy link
Author

Since the majority seems to prefer function calls, I prepared a full PR which documents the change: python/cpython#116572 It's ready to be merged, waiting for a decision :-)

@iritkatriel: Do you have an opinion on this decision? Tell me if you need more details.

@iritkatriel
Copy link
Member

There doesn't seem much of a perf impact, so yeah I think option B makes more sense.

@gvanrossum
Copy link

gvanrossum commented Mar 11, 2024 via email

@vstinner
Copy link
Author

Maybe ask Mike Droettboom to help with benchmarking this?

I already ran a microbenchmark and pyperformance. Do you need more benchmarks? I'm not sure that I get your comment.

@gvanrossum
Copy link

Maybe ask Mike Droettboom to help with benchmarking this?

I already ran a microbenchmark and pyperformance. Do you need more benchmarks? I'm not sure that I get your comment.

I am used to having the benchmark output in the specific presentation that we use for our team's benchmarking:
https://github.com/faster-cpython/benchmarking-public/blob/main/README.md

But I now see that you did run pyperformance and found that it doesn't have a significant effect so I'll accept that.

My vote would currently be (A), since I abhor the idea of a macro that looks like a variable reference but actually expands to a call. Didn't we introduce immortal objects specifically so that we could have a single None object shared between interpreters?

@vstinner
Copy link
Author

I wrote 3 implementations so they can be compared:

So the majority prefers option (B): Implement "Py_None" API as a function call.

Between adding 5 functions for 5 constants and adding 2 functions for 10 constants, I prefer the "Py_GetConstantRef()" approach: only add 2 functions. Moreover, it gives the choice between borrowed and strong references, even if in practice, both are the same.

@vstinner
Copy link
Author

I close the issue, the decision is to use function calls.

The discussion on the actual implementation can happen directly on the pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants