
API/ENH: dtype='string' / pd.String #8640

Closed
jreback opened this issue Oct 26, 2014 · 63 comments
Labels
Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Performance Memory or execution speed performance Strings String extension data type and string data
Milestone

Comments

@jreback
Contributor

jreback commented Oct 26, 2014

update for 2019-10-07: We have a StringDtype extension dtype. Its memory model is the same as the old implementation, an object-dtype ndarray of strings. The next step is to store & process it natively.
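
For reference, a minimal sketch of using the StringDtype mentioned in this update (assuming pandas >= 1.0, where it was introduced):

import pandas as pd

# "string" is the alias for pd.StringDtype()
s = pd.Series(["a", "b", None], dtype="string")
s.dtype        # string (StringDtype); missing values become pd.NA
s.str.upper()  # .str methods keep the string dtype instead of object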


xref #8627
xref #8643, #8350

Since we introduced Categorical in 0.15.0, I think we have found 2 main uses.

  1. as a 'real' Categorical/Factor type to represent a limited subset of values that the column can take on
  2. as a memory saving representation for object dtypes (see the sketch just below).
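
As a quick sketch of use (2): the memory saving comes from storing each distinct string once, plus a small integer-codes array (exact savings depend on the data):

import pandas as pd

obj = pd.Series(list("abcab") * 100_000)  # object dtype: a Python str per element
cat = obj.astype("category")              # 5 category strings + small integer codes

obj.memory_usage(deep=True)  # counts every Python string object
cat.memory_usage(deep=True)  # much smaller for low-cardinality data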

I could see introducing a dtype='string' where String is a slightly specialized sub-class of Categorical, with 2 differences compared to a 'regular' Categorical:

  • it allows unions of arbitrary other string types, currently Categorical will complain if you do this:
In [1]: import pandas as pd; from pandas import DataFrame, Series
In [2]: df = DataFrame({'A': Series(list('abc'), dtype='category')})
In [3]: df2 = DataFrame({'A': Series(list('abd'), dtype='category')})
In [4]: pd.concat([df, df2])
ValueError: incompatible levels in categorical block merge

Note that this works if they are Series (and probably should raise as well; side issue)

But if these were both 'string' dtypes, then it's a simple matter to combine them efficiently (see the sketch after these two points).

  • you can restrict the 'sub-dtype' (e.g. the dtype of the categories) to string/unicode (in other words, don't allow numbers / arbitrary objects); this makes the constructor a bit simpler, but more importantly, you now have a 'real' non-object string dtype.
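
A sketch of the desired union behavior, written against the StringDtype that eventually shipped (an assumption here: pandas >= 1.0):

import pandas as pd

df = pd.DataFrame({'A': pd.Series(list('abc'), dtype='string')})
df2 = pd.DataFrame({'A': pd.Series(list('abd'), dtype='string')})
pd.concat([df, df2])  # combines cleanly; no "incompatible levels" error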

I don't think this would be that complicated to do. The big change here would be to essentially convert any object dtypes that are strings to dtype='string', e.g. on reading/conversion/etc. This might be a perf issue for some things, but I think the memory savings greatly outweigh the cost.

We would then have a 'real' looking string dtype (and object would be relegated to actual python object types, so would be used much less).

cc @shoyer
cc @JanSchulz
cc @jorisvandenbossche
cc @mwiebe
thoughts?

@jreback jreback added Enhancement Performance Memory or execution speed performance API Design Strings String extension data type and string data Categorical Categorical Data Type labels Oct 26, 2014
@jreback jreback added this to the 0.16.0 milestone Oct 26, 2014
@jorisvandenbossche
Member

I think it would be a very nice improvement to have a real 'string' dtype in pandas.
So we would no longer have the confusion in pandas of object dtype actually being a string in most cases, and only sometimes a 'real' object.

However, I don't know if this should be 'coupled' to categorical. Maybe that is only a technical implementation detail, but for me it should just be a string dtype, a dtype that holds string values, and has in essence nothing to do with categorical.

If I think about a string dtype, I am more thinking about numpy's string types (though those have their own impracticalities, such as fixed sizes), or CHAR/VARCHAR in sql.

@shoyer
Member

shoyer commented Oct 26, 2014

I'm of two minds about this. This could be quite useful, but on the other hand, it would be way better if this could be done upstream in numpy or dynd. Pandas-specific array types are not great for compatibility with the broader ecosystem.

I understand there are good reasons it may not be feasible to implement this upstream (#8350), but these solutions do feel very stop-gap. For example, if @teoliphant is right that dynd could be hooked up in the near future to replace numpy in pandas internals, I would be much more excited about exploring that possibility.

As for this specific proposal:

  1. Would we really use this in place of object dtype for almost all string data in pandas? If so, this needs to meet a much higher standard than if it's merely an option.
  2. It would be premature to call this the dtype "string" rather than "interned_string", unless we're sure interning is always a good idea. Also, libraries like dynd do implement a true variable length string type (unlike numpy), and I think it is a good long term goal to align pandas dtypes with dtypes on the ndarray used for storage.
  3. The worst of the performance consequences might be avoided if we do not guarantee that the string "categories" are unique. Otherwise every str op requires a call to factorize (see the sketch after this list).
  4. Especially if this is the default/standard, I really think we should try to make it work for N-dimensional data (I still need to finish up my patch for categorical).
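
For context on point 3, a minimal illustration of the factorize step (a sketch using the public pd.factorize API):

import numpy as np
import pandas as pd

values = np.array(["a", "b", "a", "c"], dtype=object)
codes, uniques = pd.factorize(values)
# codes   -> array([0, 1, 0, 2])   one integer code per element
# uniques -> array(['a', 'b', 'c'], dtype=object)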

@jreback
Contributor Author

jreback commented Oct 26, 2014

So I have tagged a related issue about including integer NA support by using libdynd (#8643). This will actually be the first thing I do (as it's new and cool, and I think a slightly more straightforward path to include dynd as an optional dep).

@mwiebe

can you maybe explain a bit about the tradeoffs involved with representing strings in 2 ways using libdynd

  • as a libdynd categorical (like proposed above, but using the native categorical type which DOES exist in libdynd currently)
  • as vlen strings (another libdynd feature that DOES exist).

cc @teoliphant

@mwiebe
Contributor

mwiebe commented Oct 31, 2014

I've been intending to tweak the string representation in dynd slightly, and have written that up now: libdynd/libdynd#158. The vlen string in dynd does work presently, but it has slightly different properties than what I'm describing here.

This vlen string has a 16-byte representation, using the small string optimization. This means strings whose utf-8 encoding is <= 15 bytes will fit in that memory. Bigger strings will involve a dynamic memory allocation per string, a little like Python's string, but with the utf-8 encoding and the knowledge that it is a string, instead of having to go through dynamic dispatch as in numpy object arrays of strings.
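
A toy sketch of that inline-vs-heap decision (illustrative only, not dynd's actual code; the 15-byte inline capacity comes from the description above):

def storage_kind(s: str, inline_capacity: int = 15) -> str:
    # Where a 16-byte small-string-optimized representation would put `s`.
    encoded = s.encode("utf-8")
    return "inline" if len(encoded) <= inline_capacity else "heap"

assert storage_kind("short ascii") == "inline"
assert storage_kind("a string longer than fifteen bytes") == "heap"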

Representing strings as a dynd categorical is a bit more complicated, and wouldn't be dynamically updatable in the same way. The types in dynd are immutable, so a categorical type, once created, has a fixed memory layout, etc. This allows for optimized storage, e.g. if the total number of categories is <= 256, each element can be stored as one byte in the array, but it does not allow the assignment of a new string that was not already in the array of categories.
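
The same optimized-storage idea is visible in pandas' own Categorical, whose codes use the smallest integer width that fits the number of categories (a quick sketch):

import pandas as pd

cat = pd.Categorical(["a", "b", "a", "c"])
cat.codes       # array([0, 1, 0, 2], dtype=int8), one byte per element
cat.categories  # Index(['a', 'b', 'c'], dtype='object')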

@jankatins
Contributor

The issue mentioned in the last comment is now at libdynd/libdynd#158

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
jreback pushed a commit that referenced this issue Aug 4, 2016
xref #8640

Author: sinhrks <[email protected]>

Closes #13827 from sinhrks/categorical_subclass and squashes the following commits:

13c456c [sinhrks] COMPAT: Categorical Subclassing
@sinhrks
Member

sinhrks commented Aug 4, 2016

Are there any opinions on working on this for 0.19? Hopefully I'll have some time during the summer :)

There are a few comments in #13827, and I think it's OK if it can be done without breaking existing users' code. We may need some breaking changes in 2.0, but the same limitation should be applied to Categorical...

@jreback
Contributor Author

jreback commented Aug 4, 2016

I think we want to release 0.19.0 shortly (RC in a couple of weeks). So let's slate this for the next major release (which will be 1.0, rather than 0.20.0), I think.

@sinhrks
Member

sinhrks commented Aug 4, 2016

yep, but let me try this weekend. of course it's ok to put it off to 1.0 if there is no time to review:)

@jreback
Contributor Author

jreback commented Aug 4, 2016

@sinhrks hey, I think a real-string pandas dtype would be great. It would allow us to be much more strict about object dtype.

@wesm
Member

wesm commented Aug 9, 2016

How much work / additional code complexity would this require? I see this as a "nice to have" rather than something that adds fundamentally new functionality to the library

@jreback
Contributor Author

jreback commented Aug 9, 2016

maybe @sinhrks can comment more here, but I think at the very least this allows for quite some code simplification. We would then know, without having to constantly infer, whether something is all strings or includes actual objects.

I think it could be done without changing much top-level API (e.g. adding another pandas dtype); we have most of this machinery already done.

@wesm
Member

wesm commented Aug 9, 2016

My concern is that it may introduce new user APIs / semantics which may be in the line of fire for future API breakage. If the immediate user benefits (vs. developer benefits) warrant this risk then it may be worth it

@sinhrks
Member

sinhrks commented Aug 9, 2016

I have worked on this a little, and currently expect minimal API change, because it behaves like a Categorical, which internally handles categories and codes automatically (users don't need to care about its internal repr).

I assume the implementation consists of 2 parts, mostly done by re-using / cleaning up the current code:

  • a String class which wraps the .str methods (this should simplify string.py; maybe replaced by a StringArray(?) or its wrapper in the future)
  • a string dtype (shares most of its internals with Categorical)

I agree that we shouldn't force unnecessary migration costs on users/devs. I expect this can be achieved by minimizing Categorical API breakage (the same should apply to String).

@jorisvandenbossche
Member

Thanks for that write-up Tom!

@xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?
When doing mutations, you would indeed need to create a new buffer, copying the existing strings while inserting the ones you want to mutate. For sure, this will decrease the performance of mutating (certainly if you mutate one by one in a for loop). But that might be a worthy trade-off for better memory usage / more performant algorithms (which I think will benefit more people than efficient mutation).
In such a case, we would need to build a set of tools to do "batch mutations" still relatively efficiently (e.g. a replace-like method, or a "put" with a bunch of values to set).
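
A rough sketch of what such a batched "put" could look like over an immutable Arrow string array (hypothetical helper; it rebuilds the array once, applying all updates in a single pass):

import pyarrow as pa

def put(arr: pa.Array, updates: dict) -> pa.Array:
    # Copy-on-write: build one new immutable array with all updates applied.
    return pa.array([updates.get(i, v.as_py()) for i, v in enumerate(arr)])

arr = pa.array(["a", "b", "c", "d"])
arr2 = put(arr, {1: "z", 3: "w"})  # -> ["a", "z", "c", "w"]; arr is unchanged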

@xhochy
Contributor

xhochy commented Jul 31, 2019

@xhochy wouldn't it be possible to provide the same end-user experience of mutability as we have now?

Yes, just with a different performance feel as you described.

@maartenbreddels

I wonder if it makes sense to have a stringarray module for Python, that uses the arrow spec but does not have an arrow dependency. Pandas and vaex could use that, or other projects that work with arrays of strings.

In vaex, almost all of the string operations are implemented in C++ (utf8 support and regex), it would not be a bad idea to split off that library. The code needs a cleanup, but it's pretty well tested, and pretty fast: https://towardsdatascience.com/vaex-a-dataframe-with-super-strings-789b92e8d861

I don't have a ton of resources to put in this, but I think it will not cost me much time. If there is serious interest in this (someone from pandas wants to do the pandas part), I'm happy to put in some hours.

Ideally, I'd like to see a clean c++ header-only library that this library (pystringarray) and arrow could use, possibly built on xtensor (cc @SylvainCorlay @wolfv), but that can be considered an implementation detail (as long as the API and the memory model stay the same).

@jorisvandenbossche
Member

I think Arrow also plans to have some string processing methods at some point, and would welcome contributions. So that could also be a place to have such functionality live.
But you explicitly mention a library compatible with, but not dependent on, Arrow? In vaex, is Arrow already a dependency, or only optional? Do you have in mind potential use cases / users that would be interested in this, but for whom an Arrow dependency is a problem? (it's a heavy dependency for sure)

@maartenbreddels

In vaex-core, we are currently not depending on arrow (a choice related to arrow's 32-bit limitation), although the string memory layout is arrow compatible. The vaex-arrow package is required for loading/writing arrow files/streams, so it's an optional dependency; vaex-core does not need it.

I think we could now have a pyarrow dependency for vaex-core, although we'll inherit all the installation issues that might come with it (I don't have much experience with it), so I'm still not 100% sure (I read there were windows wheel issues).

But the same approach can be used by other libraries, such as a hypothetical pystringarray package, which would follow the arrow spec, and expose its buffers, but not have a direct pyarrow dependency.

Another approach, discussed with @xhochy, is to have a c++ library (c++ could use a header-only string and stringarray library), possibly built on xtensor or compatible with it. This library could be something that arrow could use, and possibly pystringarray could use.

My point is: if general algorithms (especially string algos) go into arrow, I think they will be 'lost' for use outside of arrow, because it's such a big dependency.

@wesm
Member

wesm commented Sep 3, 2019

Arrow is only a large dependency if you build all the optional components. I'm concerned there's some FUD being spread here about this topic -- I think it is important to develop a collaborative community that is working together on this (with open community governance) and ensure that downstream consumers can reuse the code that they need without being burdened by optional dependencies.

@wesm
Member

wesm commented Sep 20, 2019

We are taking two measures in Apache Arrow to make it easier for third party projects to take on the project as a dependency:

@SylvainCorlay
Contributor

There is a bit of a divide between people who are uncomfortable with e.g. having second-order dependencies, and people who are uncomfortable with a large monolithic dependency.

Having a large tree of dependencies between small packages is very well addressed by a package manager. It allows a separation of concerns between components, and the teams developing them, as soon as APIs and extension points are well-defined. This has been the path of Project Jupyter since the Big Split (tm). Monolithic projects make me somewhat more uncomfortable in general. I am rarely interested in everything in a large monolithic project...

The way we have been doing things in the xtensor stack is recommending the use of a package manager. We maintain the conda packages, but xtensor packages have been packaged for Fedora, Arch Linux, etc.

@wesm
Member

wesm commented Sep 22, 2019

I assure you that we hear your concerns and we will do everything we can to address them in time but it will not happen overnight. Our top priority is ensuring that our developer/contributor community is as productive as possible. Based on our contribution graph I would say we have done a good job of this.

The area where we have made the most progress on modular installs is actually in our .deb and .yum packages.

https://github.com/apache/arrow/tree/master/dev/tasks/linux-packages/debian

With recent improvements to conda / conda-forge, we can similarly achieve modularization, at least at the C++ package level.

To have modular Python installs will not be easy. We need help from more people to figure out how to address this from a tooling perspective. The current solution is optimized for developer productivity, so we have to make sure that any changes that are made to the packaging process don't make things much more difficult for contributors.

@8080labs

8080labs commented Oct 3, 2019

So until this enhancement is implemented (and adopted by most users via upgrading the library), what is the fastest way to check whether a series with dtype object consists only of strings?

For example, I have the following series with dtype object and want to detect if there are any non-string values:

import pandas as pd

series = pd.Series(["string" for i in range(1_000)])
series.loc[0] = 1

def series_has_nonstring_values(series):
    # TODO: how to implement this efficiently?
    return False

assert series_has_nonstring_values(series) is True

I hope that this is the right place to address this issue/question?

@jorisvandenbossche
Member

@8080labs with the current public API, you can use infer_dtype for this:

In [48]: series = pd.Series(["string" for i in range(1_000)])

In [49]: pd.api.types.infer_dtype(series, skipna=True)
Out[49]: 'string'

In [50]: series.loc[0] = 1 

In [51]: pd.api.types.infer_dtype(series, skipna=True) 
Out[51]: 'mixed-integer'

There is a faster is_string_array, but that is not public; it will be exposed indirectly through the string dtype that will be included in 1.0: #27949
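
Putting that together, a sketch of the requested helper built on infer_dtype (the function name comes from the question above):

import pandas as pd

def series_has_nonstring_values(series):
    # infer_dtype returns "string" only when every non-NA value is a string
    return pd.api.types.infer_dtype(series, skipna=True) != "string"

series = pd.Series(["string" for i in range(1_000)])
series.loc[0] = 1
assert series_has_nonstring_values(series) is True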

@WillAyd
Member

WillAyd commented Nov 11, 2019

closed via #27949

@WillAyd WillAyd closed this as completed Nov 11, 2019
@jorisvandenbossche
Member

There is still relevant discussion here on the second part of this enhancement: native storage (Tom also updated the top comment to reflect this)

@maartenbreddels

After learning more about the goal of Apache Arrow, vaex will happily depend on it in the (near?) future.

I want to set aside the discussion of where the c++ string library code should live (in or outside arrow), so as not to get sidetracked.

I'm happy to spend a bit of my time to see if I can move algorithms and unit tests to Apache Arrow, but it would be good if some pandas/arrow devs could assist me a bit (I believe @xhochy offered me help once, does that offer still stand?).

Vaex's string API is modeled on Pandas (80-90% compatible), so my guess is that Pandas should be able to make use of this move to Arrow, since it could simply forward many of the string method calls directly to Arrow once the algorithms are moved.
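
For illustration, pyarrow.compute already exposes some string kernels that pandas could forward to (a sketch; utf8_upper is one such existing compute function):

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["hello", "world", None])
pc.utf8_upper(arr)  # -> ["HELLO", "WORLD", null], computed natively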

In short:

  • Is Arrow interested in string contributions from vaex' codebase (with cleanups), and willing to assist me?
  • Would pandas benefit from this, i.e. would it use Arrow for string processing if all of the vaex algorithms are in Arrow?

@TomAugspurger
Contributor

Thanks for the update @maartenbreddels.

Speaking for myself (not pandas-dev) I don't have a strong opinion on where these algorithms should live. I think pandas will find a way to use them regardless. Putting them in Arrow is probably convenient since we're dancing around a hard dependency on pyarrow in a few places.

I may be wrong, but I don't think any of the core pandas maintainers has C++ experience. One of us could likely help with the Python bindings though, if that'd be helpful.

@TomAugspurger
Contributor

I opened #35169 for discussing how we can expose an Arrow-backed StringArray to users.

@jbrockmendel
Member

@mroeschke closable?

@mroeschke
Member

Yeah, I believe the current StringDtype(storage="pyarrow"|"python") has satisfied the goal of this issue, so closing. We can open up more specific issues if there are follow-ups.
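
For reference, a quick sketch of the two storage options mentioned (assuming a pandas version with the pyarrow extra installed):

import pandas as pd

s_py = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="python"))
s_pa = pd.Series(["a", "b"], dtype=pd.StringDtype(storage="pyarrow"))
# the alias form also works, e.g. dtype="string[pyarrow]"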
