API: Add Dictionary-encoded Extension Type #20899
Comments
From #20583, @jankatins:
Sure: an exception can be when the new categories include the old ones, but that's a very specific case, and I agree the ability to add categories is less relevant for ordered categories (by "less relevant" I mean that it could, for instance, lose the ordering, or append the new ones according to their own order).
... or even arbitrary Python objects (for which you save not only RAM but possibly CPU, e.g. when comparing). Did I understand correctly that for strings we could expect a significant advantage from storing them as a single unique string and (internally) mapping categories to slices of it, rather than mapping categories directly to individual strings?
Not happening for 1.0.
@mroeschke is this covered by ArrowDtype? Closable?
I believe so, but I am not sure if the operations described in the OP are fully covered by a |
Currently, Categorical serves two main purposes
This proposal is to add a new extension type (let's call it DictEncodedArray
for now) for the second use case. The storage format would be the same as
Categorical: an Index of the unique "keys" (categories) and an array of codes.
Much of the implementation would be shared. But they would have different
semantics on operations.
This is most useful for strings, but could even be useful for storing a large
array of 64-bit precision items (store the 64-bit items once, then use an int16
or int32 array for the codes).
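
To make the storage idea concrete, here is a minimal pure-Python sketch of dictionary encoding for 64-bit floats with 16-bit codes. It is not the proposed DictEncodedArray and does not use pandas internals; the helper names dict_encode and dict_decode are made up for illustration, and the stdlib array module stands in for the Index of categories and the codes array.

```python
from array import array


def dict_encode(values):
    """Return (categories, codes) for a sequence of hashable values.

    Hypothetical helper for illustration only, not a pandas API.
    """
    categories = []       # unique values, in first-seen order
    positions = {}        # value -> position in `categories`
    codes = array("h")    # signed 16-bit codes (the int16 case);
                          # raises OverflowError past 32767 categories
    for value in values:
        code = positions.get(value)
        if code is None:
            code = len(categories)
            positions[value] = code
            categories.append(value)
        codes.append(code)
    return categories, codes


def dict_decode(categories, codes):
    """Rebuild the original sequence from categories and codes."""
    return [categories[code] for code in codes]


# Two distinct 64-bit floats repeated many times.
values = [3.141592653589793, 2.718281828459045, 3.141592653589793] * 1000
categories, codes = dict_encode(values)

dense = array("d", values)  # plain 64-bit storage, for comparison
dense_bytes = dense.itemsize * len(dense)
encoded_bytes = codes.itemsize * len(codes) + 8 * len(categories)
```

With only two distinct values, the encoded form stores each 64-bit value once plus one 2-byte code per element, so encoded_bytes comes out well below dense_bytes. The saving shrinks as the number of distinct values grows, and beyond the int16 range a wider code type (for example array("i")) would be needed.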