
DISC: Supporting numpy StringDType in Pandas #58503

Open
lithomas1 opened this issue May 1, 2024 · 7 comments
Labels
API - Consistency Internal Consistency of API/Behavior API Design Compat pandas objects compatability with Numpy or Python functions Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@lithomas1
Member

Motivation

Once numpy 2.0 becomes commonplace, users will probably try to pass StringDType strings into pandas. As long as numpy is a required dependency of ours, I think it makes sense that we support these strings natively and not force conversion to object/Arrow. It'll also provide an alternative to needing Arrow to have a performant string dtype.

Supporting the new StringDType also has maintenance benefits, since it'll provide a path to getting rid of the object dtype that doesn't depend on requiring Arrow, because the string ufuncs that operate on StringDType are designed to match the Python semantics.

Also, it might be able to supplant the pyarrow_numpy dtype. I'm not sure what the long-term plan for the pyarrow_numpy stuff is, but I think that if the performance of numpy strings is OK, we can just infer to them by default (if numpy is selected as the dtype backend).

Implementation Details

One thing that we'll probably want to discuss is the dtype naming conventions for pyarrow/numpy strings.

I'm really not a fan of the string[pyarrow_numpy] naming scheme, since the name is ambiguous as to whether the array is actually backed by an Arrow or a numpy array (both are in the name :) ).

Maybe we can deprecate and rename this to something like string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether
(where nplike would default to numpy 2.0 if you have that installed, and fall back to Arrow if not, assuming the pyarrow_numpy dtype goes away in the future).
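For reference, a quick sketch of how the existing named string dtypes are spelled today (string[python] needs no optional dependency; the pyarrow-backed storages require pyarrow, and string[pyarrow_numpy] additionally requires pandas >= 2.1):

```python
import pandas as pd

# pure-Python-backed extension string dtype; no pyarrow required
s = pd.Series(["a", "b", None], dtype="string[python]")
print(s.dtype)            # string
print(s.isna().tolist())  # [False, False, True]

# with pyarrow installed, the other storages are spelled the same way:
#   pd.Series([...], dtype="string[pyarrow]")
#   pd.Series([...], dtype="string[pyarrow_numpy]")  # pandas >= 2.1
```

Any new numpy-2.0-backed storage would presumably slot into this same `string[<storage>]` scheme, which is why the naming question matters.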

PDEP-13 may also be tangentially related here (I haven't had the time to go through the discussion there yet though).

Anyone have any thoughts on this?

cc @pandas-dev/pandas-core @ngoldbaum (who I'm working on this with)

@lithomas1 lithomas1 added API Design Strings String extension data type and string data Compat pandas objects compatability with Numpy or Python functions Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels May 1, 2024
@WillAyd
Member

WillAyd commented May 1, 2024

I think PDEP-13 is going to be important for this. We have so many new string dtypes... while they all have merits in their own right, I don't think this makes for a good end-user experience, and it is confusing how to produce and control them throughout their lifecycle in our codebase.

@jbrockmendel
Member

and not force conversion to object/Arrow

100%

@simonjayhawkins
Member

Maybe we can deprecate and rename this to something like string[pyarrow_nplike], or just string[nplike] if we want to replace the pyarrow_numpy strings altogether

what about numpy_numpy? 😉

(where nplike would default to numpy 2.0 if you have that installed, and fall back to Arrow if not, assuming the pyarrow_numpy dtype goes away in the future).

Mixing the dtype systems is a concern to others as well as myself.

Once numpy 2.0 becomes commonplace, users will probably try to pass StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.

@simonjayhawkins
Member

Once numpy 2.0 becomes commonplace, users will probably try to pass StringDType strings into pandas.

I think this is a valid point that could be part of the discussion in #57073. After all, interoperability was one of the 3 benefits cited in PDEP-10.

Ah, I see that you did mention this in #57073 (comment), but there has been no direct response to that comment to date.

@WillAyd
Member

WillAyd commented May 3, 2024

Thanks @simonjayhawkins for providing all of this input. I would support what I think you are asking for, either a new PDEP or a revote/reclarification on PDEP-10, before investing a lot of effort into these.

I do agree that we have quasi-worked around what we agreed to in a lot of smaller PRs and are not in an ideal state with our string dtypes. Between the different string implementations, nullability semantics, infer_strings settings, dtype_backend arguments, requiring versus not requiring pyarrow, etc., I find it personally challenging to navigate where we stand now. At the very least, having this discussed and communicated in one central location should be beneficial.

@jorisvandenbossche
Member

Also, it might be able to supplant the pyarrow_numpy dtype.

Whether to use the new numpy 2.0 string dtype in pandas is IMO unrelated to the string[pyarrow_numpy] dtype. We can use numpy to eventually replace the currently existing numpy-backed string dtypes (string[python]), but not the pyarrow-backed ones (pyarrow still has a performance benefit compared to numpy).

string[pyarrow_numpy] was only introduced to have a pyarrow-based string dtype suitable to make the default in pandas 3.0 with respect to missing value semantics. The same consideration will have to be made for a np.StringDType-based dtype.

As long as numpy is a required dependency of ours, I think it makes sense that we support these strings natively and not force conversion to object/Arrow.

If we were to require pyarrow as a hard dependency, then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype, and to converge on a single string dtype implementation in pandas.
As long as we keep pyarrow optional and have a "fallback" string dtype using numpy under the hood, then of course we can use newer numpy features to improve our existing numpy-backed string dtype.

@jbrockmendel
Member

then IMO it would be perfectly reasonable to force conversion of any string-like input to a pyarrow string array, including the numpy string dtype

Reasonable, but not obvious. E.g. if the user expects to be doing __setitem__s, there is likely to be a performance difference. But the more important point is one on which you (Joris) and I very much agree: we don't need to decide on that right now, and so shouldn't.
