gh-111545: Add PyHash_Double() function #112095

Closed

Conversation

vstinner
Member

@vstinner vstinner commented Nov 15, 2023

  • Clean up the PyHash_Double() implementation, based on _Py_HashDouble():

    • Move variable declarations to their first assignment.
    • Add braces (PEP 7).
    • Cast the result to signed Py_hash_t before the final "== -1" test, to reduce the number of casts.
    • Add an assertion on Py_IS_NAN(v) in the only code path which can return -1.
  • Add tests: Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py.


📚 Documentation preview 📚: https://cpython-previews--112095.org.readthedocs.build/

@vstinner
Member Author

numpy has a Npy_HashDouble() compatibility layer to handle the _Py_HashDouble() of Python 3.9 and older, which only took a single C double argument. numpy calls Npy_HashDouble(obj, value) in scalartypes.c.src.

The proposed API has a single argument and cannot be used as a drop-in replacement for _Py_HashDouble(). I would prefer to let C extensions decide how to handle the not-a-number special case. Example:

    Py_hash_t hash = PyHash_Double(value);
    if (hash == -1) {
        return _Py_HashPointer(obj);
    }
    return hash;

Problem: _Py_HashPointer() is also private. Once this PR is merged, I will prepare a follow-up PR to add a public PyHash_Pointer() function.

Since numpy already has a Npy_HashDouble() compatibility layer, it can reimplement such "PyHash_Double() or _Py_HashPointer()" logic in a single place (npy_pycompat.h).


The _Py_HashDouble() function had one argument (Py_hash_t _Py_HashDouble(double v)) but was changed in Python 3.10a7 (or b1) to take a second argument: Py_hash_t _Py_HashDouble(PyObject *inst, double v).

Python now uses the identity for the hash when the value is a NaN, see gh-87641. In Python 3.9, hash(float("nan")) returned 0 (#define _PyHASH_NAN 0).

By the way, in Python 3.13, sys.hash_info.nan still exists and is equal to 0, even if hash(float("nan")) no longer returns 0! See the https://docs.python.org/dev/library/sys.html#sys.hash_info documentation:

hash_info.nan: (This attribute is no longer used)

@vstinner
Member Author

Another slice of Python history. In Python 3.2, the PyObject_Hash() return type changed from long to Py_hash_t to reduce hash collisions on 64-bit Windows, where long is only 32-bit (whereas Py_hash_t is 64-bit). The Py_hash_t and Py_uhash_t types were added in Python 3.2. See commit 8f67d08 and commit 8035bc5 of issue gh-53987 (bpo-9778).

@vstinner
Member Author

I'm not a fan of a signed type for hashes. For example, I prefer to avoid signed values with the modulo operator (x % y), since in C the result of x % y takes the sign of x. I would prefer to use the unsigned Py_uhash_t type.

The signed Py_hash_t type is used as the return type of the PyTypeObject.tp_hash function.

@vstinner
Member Author

vstinner commented Nov 15, 2023

Problem: _Py_HashPointer() is also private. Once this PR is merged, I will prepare a follow-up PR to add a public PyHash_Pointer() function.

Draft PR gh-112096.

I prefer to only start with PyHash_Double() to discuss the PyHash API first:

  • Do we want to continue using annoying signed Py_hash_t, or should we move to unsigned Py_uhash_t?
  • I propose moving from _Py_HashXXX() naming convention to PyHash_XXX() to put hash functions in a PyHash namespace (share a common prefix). If we expose constants such as _PyHASH_MODULUS or _PyHASH_INF, I propose to expose them as PyHash_MODULUS or PyHash_INF. It would be consistent with the existing PyHash_GetFuncDef() function name.
  • Add Doc/c-api/hash.rst documentation.
  • Add Modules/_testcapi/hash.c and Lib/test/test_capi/test_hash.py to test the PyHash C API.

@vstinner
Member Author

I merged a first change to make this PR smaller and therefore easier to review. PR #112098 added documentation and tests for the PyHash_GetFuncDef() function, which was added by PEP 456.

@vstinner
Member Author

The PR adds Py_hash_t PyHash_Double(double value).

The proposed API has a single argument and cannot be used as a drop-in replacement for _Py_HashDouble(). I would prefer to let C extensions decide how to handle the not-a-number special case.

If needed, a second function can be added:

Py_hash_t PyHash_DoubleOrPointer(double value, const void *ptr)

=> Compute the hash of value, or the hash of ptr if value is not-a-number (NaN).

For me, it's surprising that when a Python object is passed to _Py_HashDouble(), the function doesn't call PyObject_Hash(obj) but _Py_HashPointer(obj).


Or do you prefer to just expose Py_hash_t _Py_HashDouble(PyObject *inst, double v) as it is?

@skirpichev
Member

By the way, in Python 3.13, sys.hash_info.nan still exists and is equal to 0

I think this attribute should be deprecated (or just removed?).


Hash a C double number.

Return ``-1`` if *value* is not-a-number (NaN).

Maybe you should document the return value for inf too? It is exposed in sys.hash_info.

@vstinner vstinner Nov 15, 2023

I prefer to wait until the _PyHASH_INF constant is added to the API. That's the C API documentation, not the Python documentation.

Functions
^^^^^^^^^

.. c:function:: Py_hash_t PyHash_Double(double value)

I would prefer to expose this as an unstable API. Hashing of numeric types is a relatively low-level implementation detail, which has changed in past minor (3.x.0) releases (NaN hashing was the most recent change). Why not keep this freedom in the future?


I don't expect the PyHash_Double() API to change in the future. The result of the function can change in Python 3.x.0 releases, but I don't consider that this qualifies the function for the PyUnstable API.

The PyUnstable API is more for cases where there is a risk that the function will be removed, that its API will change, or that a major change will happen in a Python 3.x.0 release.

In Python 3.2 (2010), _Py_HashDouble() was written in commit dc787d2 of issue gh-52435.

commit dc787d2055a7b562b64ca91b8f1af6d49fa39f1c
Author: Mark Dickinson <[email protected]>
Date:   Sun May 23 13:33:13 2010 +0000

    Issue #8188: Introduce a new scheme for computing hashes of numbers
    (instances of int, float, complex, decimal.Decimal and
    fractions.Fraction) that makes it easy to maintain the invariant that
    hash(x) == hash(y) whenever x and y have equal value.

As written above, the latest major change was in Python 3.10 to treat NaN differently.


The result of the function can change in Python 3.x.0 releases

Previously, the function signature for _Py_HashDouble() was changed too.

@vstinner
Member Author

I think this attribute should be deprecated (or just removed?).

If you consider that something should be changed, you can open a new issue about sys.hash_info.nan.

@serhiy-storchaka serhiy-storchaka left a comment

This PR contains many cosmetic changes, but also some changes that can affect performance in theory (like adding the "else" branch or an additional check for -1). Please run precise benchmarks for this. Also consult the previous authors of this code.

To make PyHash_Double a replacement for _Py_HashDouble you need Py_HashPointer. Maybe add it first?

BTW, should it be PyHash_Double or Py_HashDouble?

@serhiy-storchaka serhiy-storchaka requested review from mdickinson and removed request for a team November 15, 2023 08:45
@encukou
Member

encukou commented Nov 15, 2023

In terms of the proposed C API guidelines: this violates the one where negative results are reserved for errors. Actually, hashes might be a good argument for reserving only -1 for errors, letting the other negative numbers mean success.
A bigger problem is that the -1 is returned without raising an exception, leaving the function with no way to signal unexpected failure -- see capi-workgroup/api-evolution#5

@vstinner
Member Author

@serhiy-storchaka:

To make PyHash_Double a replacement for _Py_HashDouble you need Py_HashPointer. Maybe add it first?

Ok. I updated PR #112096 so it can be merged first: my PR #112096 adding Py_HashPointer() is now ready for review.

@vstinner vstinner force-pushed the hash_double2 branch 3 times, most recently from 24c3f6d to a58dcd1 on November 15, 2023 12:12
* Add again _PyHASH_NAN constant.
* _Py_HashDouble(NULL, value) now returns _PyHASH_NAN.
* Add tests: Modules/_testcapi/hash.c and
  Lib/test/test_capi/test_hash.py.
@vstinner
Member Author

vstinner commented Nov 15, 2023

I updated the PR to address @serhiy-storchaka and @encukou's comments.

@serhiy-storchaka and @encukou: Please review the updated PR.

The API changed to:

Py_hash_t PyHash_Double(double value, PyObject *obj)

The interesting case is that obj can be NULL! In that case, the function returns sys.hash_info.nan (0), which comes back from the dead!

Changes:

  • Add again _PyHASH_NAN constant.
  • Revert _Py_HashDouble() cleanup to focus on the proposed API. (I plan to propose a follow-up PR just for that.)
  • Just use hash(x) in tests, rather than reimplementing hash(int) in pure Python.
  • Add tests on positive and negative zeros.
  • The function cannot fail: it never returns -1. This reserves -1, so that if the need arises later, it will be possible to raise an exception and return -1 on error.
  • The function respects the latest C API guidelines.

* If *obj* is not ``NULL``, return the hash of the *obj* pointer.
* Otherwise, return :data:`sys.hash_info.nan <sys.hash_info>` (``0``).

The function cannot fail: it cannot return ``-1``.

But we do want users to check the result, so that the function can start failing in some cases in the future.

Suggested change:

    - The function cannot fail: it cannot return ``-1``.
    + On failure, the function returns ``-1`` and sets an exception.
    + (``-1`` is not a valid hash value; it is only returned on failure.)


I don't see the point of asking developers to make their code slower for a case which cannot happen. It would make C extensions slower for no reason, no?

PyObject_Hash(obj) can call an arbitrary __hash__() method in Python and so can fail. But PyHash_Double() is simple and cannot fail. It's just that it has the same API as PyObject_Hash() and PyTypeObject.tp_hash for convenience.

For me, it's the same as PyType_CheckExact(obj): the function cannot fail. Do you want to suggest that users start checking for -1 because the API says it may set an exception and return -1? IMO practicality beats purity here.


I am strongly for allowing deprecation via runtime warnings, and for keeping new API consistent in that respect.

If the speed is an issue (which I doubt, with branch prediction around), let's solve that in a way that still allows the API to report errors.


I am strongly for allowing deprecation via runtime warnings, and for keeping new API consistent in that respect.

I created capi-workgroup/api-evolution#43 to discuss functions which cannot fail: when the caller is not expected to check for errors.

If the speed is an issue (which I doubt, with branch prediction around), let's solve that in a way that still allows the API to report errors.

Would you mind elaborating on how you plan to solve this issue?

My concern is more about usability of the API than performance here.

But yeah, performance matters as well. Such a function can be used in a hash table (when floats are used as keys), and making it as fast as possible matters.


Would you mind elaborating on how you plan to solve this issue?

It's possible for specific compilers: add a static inline wrapper with if (result == -1) __builtin_unreachable(); or __assume(result != -1).
That way the compiler can optimize error checking away, until a later Python version decides to allow failures.

@vstinner
Member Author

numpy issue: numpy/numpy#25035

@encukou
Member

encukou commented Nov 20, 2023

Stepping back, I think I figured out why this API seems awkward to me.
It looks like this should be used to get the hash of a Python float object, given only the “unboxed” double value. But it can't actually be used for that: to be able to hash a NaN, you'd need to pass in the Python float object. (And if you already have that, you could call its tp_hash instead.)
You could pass in NULL, but then the result for NaN is indistinguishable from hash(0.0).
Also, it seems that to know what obj “should” be, you need to know internal details that aren't quite apparent from the documentation: namely, NaNs are hashed using PyHash_Pointer.

The actual use case the proposed function allows is implementing a custom object that needs to hash the same as a Python double, where NaN should be hashable (like Python doubles are). That use case is what Python needs, and I assume it's what NumPy needs, but IMO it's unnecessarily limited -- which makes the function well suited for an internal (or unstable) function, but worse for public API.

With a signature like:

int PyHash_Double(double value, Py_hash_t *result)
// -> 1 for non-NaN (*result is set to the hash)
// -> 0 for NaN (*result is set to 0)

the function would cover the more general use case as well.

The docs can note that when implementing hash for a custom object, if you get a 0 result you can:

  • fail, or
  • use a fallback that ensures hash(obj) == hash(obj) -- for example, use PyHash_Pointer(obj).

(I'll leave the fallibility argument to the WG repos, just noting that this API would allow adding -1 as a possible result.)

@mdickinson mdickinson removed their request for review November 26, 2023 11:54
@mdickinson
Member

FWIW, I also find the proposed API (Py_hash_t PyHash_Double(double value, PyObject *obj)) awkward. The interesting part of the hash computation, and the part that I presume would be useful to NumPy and others, is the part that converts a non-NaN double to its integer hash. I'd prefer an API that just exposed that part, so that third parties can use it in whatever way they want to.

The signature proposed by @encukou looks reasonable to me. I don't want to get involved in the discussion about error returns, but if we really wanted to we could have a documented sentinel hash value that's only ever returned for NaNs (e.g., 2**62).

@vstinner
Member Author

The signature proposed by @encukou looks reasonable to me. I don't want to get involved in the discussion about error returns, but if we really wanted to we could have a documented sentinel hash value that's only ever returned for NaNs (e.g., 2**62).

There is sys.hash_info.nan, which is equal to 0, but it's no longer used, as I explained before.

The problem with 2**62 is that it doesn't fit into Py_hash_t on 32-bit platforms. The Py_hash_t type is defined as Py_ssize_t which is 32-bit on a platform with 32-bit pointers.

@vstinner
Member Author

The first PR version used the API Py_hash_t PyHash_Double(double value), returning _PyHASH_NAN if value is NaN.

The second PR version changed the API to Py_hash_t PyHash_Double(double value, PyObject *obj) to micro-optimize the code, avoid having to check the PyHash_Double() result for NaN, and provide a drop-in replacement for numpy.

Well, nobody (including me, to be honest) likes the Py_hash_t PyHash_Double(double value, PyObject *obj) API.

So I wrote PR #112449 which implements the API proposed by @encukou: int PyHash_Double(double value, Py_hash_t *result), returning 0 if value is NaN and 1 otherwise (if value is finite or infinite).

@serhiy-storchaka
Member

I like the simpler API proposed in this PR more; my only concern is performance. Can the difference be observed in microbenchmarks, or is it insignificant?

Instead of checking the result of the function, the user code can check the argument before calling the function:

    if (Py_IS_NAN(value)) {
        return _Py_HashPointer(obj);
    }
    return PyHash_Double(value);

As for names, "Hash" is a verb, and "Double" is an object. _Py_HashPointer(), _Py_HashBytes() and _Py_HashDouble() names all say what to do ("to hash") and with what object ("double", "bytes" or "pointer"). PyHash_GetFuncDef() also has a verb ("Get"). If you want to use the PyHash_ prefix, you need to repeat Hash twice: PyHash_HashDouble(), PyHash_HashBytes(), PyHash_HashPointer(). I think it is better to use simple prefix Py_.

@encukou
Member

encukou commented Nov 27, 2023

I don't see this used in tight loops, so I'd go for the prettier API even if it's a few instructions slower. (Note that NumPy uses it for scalars -- degenerate arrays of size 1 -- to ensure compatibility with Python doubles.)
If this does get performance-critical, IMO the solution (in version-specific builds) is to make it a static inline function, and let the compiler optimize the shape of the API away.

+1 on the naming note, Py_HashDouble (or PyHash_HashDouble) is a bit better.

@mdickinson
Member

There is sys.hash_info.nan, which is equal to 0, but it's no longer used, as I explained before.

Yes, I'm rather familiar with the history here. :-) I was simply suggesting that if we picked and documented a value that's not used for the hash of any non-NaN, then the hash value itself could be a way of detecting NaNs after the fact. 0 isn't viable for that because it's already the hash of 0.0. And yes, of course a different value would be needed for 32-bit builds; the actual value could be stored in sys.hash_info, as before.

In any case, I was just mentioning this as a possibility. I'd prefer a more direct method of NaN detection, like what you have currently implemented (or not having the API do NaN detection at all, but leave that to the user to do separately if necessary).

@vstinner
Member Author

vstinner commented Nov 27, 2023

I ran microbenchmarks on 3 different APIs. Measured performance is between 13.6 ns and 14.7 ns. The maximum difference is 1.1 ns: 1.08x slower. It seems like the current _Py_HashDouble() API is the fastest.

In the 3 APIs, the C double input value is passed through the xmm0 register at the ABI level.

I expected Py_hash_t PyHash_Double(double value) to be the fastest API. It's not the case. I always get bad surprises in my benchmarks. You should try to reproduce them and play with the code to double-check that I didn't mess up my measurements.


I wrote 3 benchmarks (A, B and C, described below).

Results using CPU isolation, Python built with gcc -O3 (without PGO, without LTO, just ./configure):

  • A: 13.6 ns +- 0.0 ns
  • B: 14.0 ns +- 0.0 ns
  • C: 14.7 ns +- 0.1 ns
+-----------+---------+-----------------------+-----------------------+
| Benchmark | A       | B                     | C                     |
+===========+=========+=======================+=======================+
| bench     | 13.6 ns | 14.0 ns: 1.03x slower | 14.7 ns: 1.08x slower |
+-----------+---------+-----------------------+-----------------------+

I added benchmark code in the _testinternalcapi extension, which is built as a shared library, so calls to _Py_HashDouble() and PyHash_Double() go through the PLT (procedure linkage table) indirection.

I added an artificial test on the hash value to use it in the benchmark, so the compiler doesn't remove the whole function call, and the code is a little bit more realistic.

(A) assembly code:

Py_hash_t hash = _Py_HashDouble(obj, d);
if (hash == -1) {
    ...
}

mov    rax,QWORD PTR [rip+0x698e]        # 0x7ffff7c09690
mov    rdi,rbp
movq   xmm0,rax
call   0x7ffff7c022b0 <_Py_HashDouble@plt>
cmp    rax,0xffffffffffffffff
jne    ...

(B) assembly code:

Py_hash_t hash;
if (PyHash_Double(d, &hash) == 0) {
    return NULL;
}

mov    rax,QWORD PTR [rip+0x6a1f]        # 0x7ffff7c09690
mov    rdi,rbp
movq   xmm0,rax
call   0x7ffff7c02990 <PyHash_Double@plt>
test   eax,eax
jne    ...

(C) assembly code:

Py_hash_t hash = PyHash_Double(d);
if (hash == -1) {
    ...
}

mov    rax,QWORD PTR [rip+0x6a26]        # 0x7ffff7c09690
movq   xmm0,rax
call   0x7ffff7c02990 <PyHash_Double@plt>
cmp    rax,0xffffffffffffffff
jne ...

Change used to benchmark (A) and (B). Measuring (C) requires minor changes.

diff --git a/Modules/_testinternalcapi.c b/Modules/_testinternalcapi.c
index 4607a3faf17..9b671acf916 100644
--- a/Modules/_testinternalcapi.c
+++ b/Modules/_testinternalcapi.c
@@ -1625,6 +1625,51 @@ get_type_module_name(PyObject *self, PyObject *type)
 }
 
 
+static PyObject *
+test_bench_private_hash_double(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+    PyObject *obj = Py_None;
+    double d = 1.0;
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (Py_ssize_t i=0; i < loops; i++) {
+        Py_hash_t hash = _Py_HashDouble(obj, d);
+        if (hash == -1) {
+            return NULL;
+        }
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+    return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
+static PyObject *
+test_bench_public_hash_double(PyObject *Py_UNUSED(module), PyObject *args)
+{
+    Py_ssize_t loops;
+    if (!PyArg_ParseTuple(args, "n", &loops)) {
+        return NULL;
+    }
+    double d = 1.0;
+
+    _PyTime_t t1 = _PyTime_GetPerfCounter();
+    for (Py_ssize_t i=0; i < loops; i++) {
+        Py_hash_t hash;
+        if (PyHash_Double(d, &hash) == 0) {
+            return NULL;
+        }
+    }
+    _PyTime_t t2 = _PyTime_GetPerfCounter();
+
+    return PyFloat_FromDouble(_PyTime_AsSecondsDouble(t2 - t1));
+}
+
+
 static PyMethodDef module_functions[] = {
     {"get_configs", get_configs, METH_NOARGS},
     {"get_recursion_depth", get_recursion_depth, METH_NOARGS},
@@ -1688,6 +1733,8 @@ static PyMethodDef module_functions[] = {
     {"restore_crossinterp_data", restore_crossinterp_data,       METH_VARARGS},
     _TESTINTERNALCAPI_TEST_LONG_NUMBITS_METHODDEF
     {"get_type_module_name",    get_type_module_name,            METH_O},
+    {"bench_private_hash_double", test_bench_private_hash_double, METH_VARARGS},
+    {"bench_public_hash_double", test_bench_public_hash_double, METH_VARARGS},
     {NULL, NULL} /* sentinel */
 };
 

Bench script for (A), private API:

import pyperf
import _testinternalcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testinternalcapi.bench_private_hash_double)

Bench script for (B) and (C), public API:

import pyperf
import _testinternalcapi
runner = pyperf.Runner()
runner.bench_time_func('bench', _testinternalcapi.bench_public_hash_double)

@vstinner
Member Author

Other benchmark results using PGO:

  • (A) 12.7 ns +- 0.0 ns
  • (B) 15.2 ns +- 0.1 ns: 1.19x slower than (A)
  • (C) 15.8 ns +- 0.0 ns: 1.24x slower than (A)

These numbers are surprising.

@vstinner
Member Author

vstinner commented Nov 27, 2023

These numbers are surprising.

I wrote PR #112476 to run this benchmark differently. Results look better (less surprising, more reliable than my previous benchmark).

Results with CPU isolation, gcc -O3 and PGO, but without LTO:

  • (A) 12.3 ns +- 0.0 ns: Py_hash_t hash_api_A(PyObject *inst, double v)
  • (B) 13.2 ns +- 0.0 ns: int hash_api_B(double v, Py_hash_t *result)
  • (C) 12.3 ns +- 0.0 ns: Py_hash_t hash_api_C(double v)

API (A) and (C) have the same performance.

API (B) is 0.9 ns slower than (A) and (C): it is 1.07x slower than (A) and (C).

I ran the benchmark with v = 1.0.

@vstinner
Member Author

API (B) is 0.9 ns slower than (A) and (C): it is 1.07x slower than (A) and (C).

If we only care about performance, an alternative is API (D): Py_hash_t hash_api_D(double v, int *is_nan) where is_nan can be NULL.

  • (A) 12.3 ns +- 0.1 ns
  • (B) 13.2 ns +- 0.0 ns: 0.9 ns slower / 1.07x slower than API (A) and (C)
  • (C) 12.3 ns +- 0.0 ns
  • (D) 12.7 ns +- 0.0 ns: 0.4 ns slower / 1.03x slower than API (A) and (C)

Note: passing a non-NULL is_nan or a NULL is_nan has no significant impact on API (D) performance.

That variant may be interesting if you know that the number cannot be NaN.

@vstinner
Member Author

+1 on the naming note, Py_HashDouble (or PyHash_HashDouble) is a bit better.

It would be interesting to design the C API namespace in a similar way to Python packages and package sub-modules: import pyhash; h = pyhash.hash_double(1.0) would become h = PyHash_HashDouble(1.0) in C. So yes, repeat "Hash" here: PyHash is the namespace, HashDouble() is the function in the namespace.

But the Python C API is far from respecting such a design :-) The "Py_" namespace is a giant bag full of "anything".

@vstinner
Member Author

By the way, see also PR #112096 which adds the PyHash_Pointer() function.

@serhiy-storchaka
Member

How much difference does it make if the check for NaN is removed from the function, but added before calling the function, like in #112095 (comment)?

@encukou
Member

encukou commented Nov 28, 2023

As far as I know, microbenchmarks at this level are susceptible to “random” variations due to code layout. An individual function should be benchmarked in a variety of calling situations to get a meaningful result.

(But to reiterate: I don't think this is the place for micro-optimizations.)

@vstinner
Member Author

How much difference does it make if the check for NaN is removed from the function, but added before calling the function, like in #112095 (comment)?

It would be bad to return the same hash value for +inf, -inf and NaN values. Current code:

    if (!Py_IS_FINITE(v)) {
        if (Py_IS_INFINITY(v))
            return v > 0 ? _PyHASH_INF : -_PyHASH_INF;
        else
            return _Py_HashPointer(inst);    // v is NaN
    }

@vstinner
Member Author

As far as I know, microbenchmarks at this level are susceptible to “random” variations due to code layout. An individual function should be benchmarked in a variety of calling situations to get a meaningful result.

We can take that into account in the API design. Now we know that the int hash_api_B(double v, Py_hash_t *result) API is around 1.07x slower than the other discussed APIs, which means less than a nanosecond (0.9 ns) in absolute timing.

I don't think that 0.9 ns is a significant difference. If a workload is impacted by 0.9 ns per function call, maybe it should copy the PyHash_Double() code and design a very specialized flavor for its own use case (e.g. remove the code for infinity and NaN, and inline everything).

In terms of performance, I think that any proposed API is fine.

@vstinner
Member Author

@serhiy-storchaka @encukou: Do you prefer the PyHash_Double(), PyHash_HashDouble() or Py_HashDouble() name?

Now that I have read the previous discussions, I prefer the Py_HashDouble() name. It fits better into the current C API naming scheme.

@encukou
Member

encukou commented Nov 30, 2023

Yeah, Py_HashDouble sounds best to me.

@vstinner
Member Author

First, I proposed the Py_hash_t PyHash_Double(double v) API in this PR. Then I modified it to use the int PyHash_Double(double v, Py_hash_t *result) API.

I'm not fully comfortable with this API either, so I'm closing this PR.

Let's continue the discussion in PR #112449 which implements the API proposed by @encukou.
