API: Expand read_csv dtype for categoricals #14503
Comments
If it matters, there would be at least a little performance to be picked up by modifying the parsing code: you could pass the categories in here (Line 1523 in 7a2bcb6) and shortcut building the categories array, sorting, etc. It would also cause an error to be thrown much earlier if the data has a value not in the specified categories.
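To make the idea concrete, here is a minimal pure-Python sketch of what "passing the categories in" buys you. It is not the Cython parser code referenced above; `categorical_from_known` is a hypothetical helper, and raising on an unknown value is just one of the behaviours under discussion. With the categories known up front, each value is looked up directly, so there is no need to collect and sort the unique values, and a value outside the categories fails at the first offending position.

```python
import pandas as pd


def categorical_from_known(values, categories):
    """Hypothetical helper: build a Categorical from pre-specified categories.

    Illustrates the shortcut only; the real parser works on raw tokens in Cython.
    """
    # Map each category to its integer code once, up front.
    lookup = {cat: code for code, cat in enumerate(categories)}
    codes = []
    for pos, value in enumerate(values):
        try:
            codes.append(lookup[value])
        except KeyError:
            # Fail as soon as a value outside the categories is seen.
            raise ValueError(
                f"value {value!r} at position {pos} is not in the specified categories"
            ) from None
    # from_codes skips category inference and sorting entirely.
    return pd.Categorical.from_codes(codes, categories=categories)


cat = categorical_from_known(["b", "a", "b"], ["a", "b", "c"])
```

Here the unused category `c` is kept in the result, which is the point of specifying categories rather than inferring them from the data.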
Yes, I think you're right that it'd be better to do it in the Cython code. Though I guess we'll need to have an API discussion about what to do if the
The second option would be consistent with:
In [1]: pd.Categorical(['a', 'b', 'c']).set_categories(['a', 'b'])
Out[1]:
[a, b, NaN]
Categories (2, object): [a, b]
On Mon, Nov 7, 2016 at 12:07 PM, chris-b1 wrote:
In #13406 Chris added support for read_csv(..., dtype={'col': 'category'}) (thanks!). This issue is for expanding that syntax to allow a more complete specification of the resulting categorical.
Implementation-wise, I think we can keep all the parsing logic as is, and simply loop over dtype and call set_categories (and maybe as_ordered) on all the categoricals just before returning to the user.
This would help a bit in dask, where their category type inference can fail if the first partition doesn't contain all the categories (see dask/dask#1705). This is why it'd be preferable to do it as an option to read_csv, rather than putting it on the user to follow up with a set_categories.
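For reference, the post-processing approach described above can be sketched from user code today. This is only an illustration under stated assumptions: `read_csv_with_categories` and its `categories` and `ordered` parameters are hypothetical names, not a proposed pandas API. It parses the listed columns with dtype 'category' and then applies set_categories (and optionally as_ordered) before returning.

```python
import io

import pandas as pd


def read_csv_with_categories(source, categories, ordered=(), **kwargs):
    """Hypothetical wrapper: read a CSV, then pin each column's categories.

    categories: dict mapping column name -> list of allowed categories.
    ordered: collection of column names that should become ordered categoricals.
    """
    # Let the parser infer categories first, as read_csv already supports.
    df = pd.read_csv(source, dtype={col: "category" for col in categories}, **kwargs)
    for col, cats in categories.items():
        # Values outside `cats` become NaN, matching set_categories semantics.
        values = df[col].cat.set_categories(cats)
        if col in ordered:
            values = values.cat.as_ordered()
        df[col] = values
    return df


csv_text = "col,x\na,1\nb,2\nc,3\n"
df = read_csv_with_categories(
    io.StringIO(csv_text), {"col": ["a", "b", "d"]}, ordered={"col"}
)
```

In this example `c` is not among the requested categories, so it is read as missing, while `d` is kept as a category even though it never appears in the data. Every partition read this way ends up with the same categories, which is what the dask case needs.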