DISC: Supporting numpy StringDType in Pandas #58503

lithomas1 · 2024-05-01T03:25:55Z

Motivation

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas. As long as numpy is a required dependency of ours, I think it should make sense that we support these strings natively and not force conversion to object/Arrow. It'll also provide an alternative to needing Arrow to have a performant string dtype.

Supporting the new StringDType also has maintenance benefits, since it'll provide a path to getting rid of the object dtype that doesn't depend on requiring Arrow, because the string ufuncs that operate on StringDType are designed to match the Python semantics.

Also, it might be able to supplant the pyarrow_numpy dtype. Not sure what the plan for the pyarrow_numpy stuff will be long term, but I think that if the performance of numpy strings is OK, we can just infer to them by default (if numpy is selected as the dtype backend).

Implementation Details

One thing that we'll probably want to discuss is the dtype naming conventions for pyarrow/numpy strings.

I'm really not a fan of the string[pyarrow_numpy] naming scheme (since there's ambiguity in this name as to whether the array is actually backed by an arrow or numpy array, since both are in the name :) ).

Maybe we can (deprecate) and rename this to something like string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether
(where nplike will default to numpy 2.0 if you have that installed, and fall back to Arrow if not, assuming the pyarrow_numpy dtype goes away in the future).
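
The fallback rule in the parenthetical could be sketched roughly as follows. This is only an illustration: `resolve_nplike_backend` and its `get_version` parameter are hypothetical names, not pandas API, and the only part taken from the proposal is "numpy 2.0 if installed, otherwise Arrow".

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_nplike_backend(get_version=version):
    """Pick the storage for a hypothetical string[nplike] dtype.

    Returns "numpy" when numpy >= 2.0 is installed (so StringDType exists),
    otherwise falls back to "pyarrow". `get_version` is injectable so the
    rule can be exercised without changing the environment.
    """
    try:
        numpy_major = int(get_version("numpy").split(".")[0])
    except PackageNotFoundError:
        numpy_major = 0

    if numpy_major >= 2:
        return "numpy"

    try:
        get_version("pyarrow")
    except PackageNotFoundError:
        raise ImportError("string[nplike] needs numpy>=2.0 or pyarrow") from None
    return "pyarrow"
```

Calling `resolve_nplike_backend()` with no arguments inspects the installed packages. One consequence of this design is that the same dtype name can mean different storage on different machines, which is part of what needs discussing.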

PDEP-13 may also be tangentially related here (I haven't had the time to go through the discussion there yet though).

Anyone have any thoughts on this?

cc @pandas-dev/pandas-core @ngoldbaum (who I'm working with on this)

WillAyd · 2024-05-01T15:02:37Z

I think PDEP-13 is going to be important for this. We have so many new string dtypes...while they all have merits in their own right I don't think this makes for a good end user experience and it is confusing how to produce and control them throughout their lifecycle in our codebase

jbrockmendel · 2024-05-01T16:16:03Z

and not force conversion to object/Arrow

100%

simonjayhawkins · 2024-05-03T09:35:48Z

Maybe we can (deprecate) and rename this to something like string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether

what about numpy_numpy? 😉

(where nplike will default to numpy 2.0 if you have that installed, and fall back to Arrow if not, assuming the pyarrow_numpy dtype goes away in the future).

Mixing the dtype systems is a concern to others as well as myself.

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.

simonjayhawkins · 2024-05-03T10:47:12Z

Once numpy 2.0 becomes commonplace, users will probably try to pass in StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.

Ah. I see that you did mention this in #57073 (comment). But there has been no direct response to that comment to date.

WillAyd · 2024-05-03T11:09:17Z

Thanks @simonjayhawkins for providing all of this input. I would support what I think you are asking for with either a new PDEP or a revote/reclarification on PDEP-10 before investing a lot of effort into these.

I do agree that we have quasi worked around what we agreed to in a lot of smaller PRs and are not in an ideal state with our string dtypes. Between the different string implementations, nullability semantics, infer_strings settings, dtype_backend arguments, requiring versus not requiring pyarrow, etc... I find it personally challenging to navigate where we stand now. At the very least having this discussed and communicated in one central location should be beneficial

jorisvandenbossche · 2024-05-03T12:27:35Z

Also, it might be able to supplant the pyarrow_numpy dtype.

Whether to use the new numpy 2.0 string dtype in pandas is IMO unrelated to the string[pyarrow_numpy] dtype. We can use numpy to eventually replace the currently existing numpy-backed string dtypes (string[python]), but not the pyarrow-backed ones (pyarrow still has a performance benefit compared to numpy).

string[pyarrow_numpy] was only introduced to have a pyarrow-based string dtype suitable to make the default in pandas 3.0, on the aspect of missing value semantics. The same consideration will have to be made for a np.StringDType based dtype.

As long as numpy is a required dependency of ours, I think it should make sense that we support these strings natively and not force conversion to object/Arrow.

If we would require pyarrow as a hard dependency, then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype, and to converge on a single string dtype implementation in pandas.
As long as we keep pyarrow optional and have a "fallback" string dtype using numpy under the hood, then of course we can use newer numpy features to improve our existing numpy-backed string dtype.

jbrockmendel · 2024-05-03T17:08:09Z

then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype

reasonable but not obvious. e.g. if the user expects to be doing __setitem__s there is likely to be a performance difference. But the more important point is one on which you (joris) and I very much agree: we don't need to decide on that right now, and so shouldn't.

lithomas1 added the labels API Design, Strings, Compat, Needs Discussion, and API - Consistency on May 1, 2024

WillAyd mentioned this issue May 3, 2024

DISC: nanoarrow-backed ArrowStringArray #58552

Open

3 tasks

flying-sheep mentioned this issue Jul 9, 2024

Nullable string columns scverse/anndata#679

Closed
