ENH: parse categoricals in read_csv #13406
Conversation
# as a sentinel to specialize the function for reading
# from the parser.

# This is to avoid duplicating a bunch of code or
why is this needed?
The type inference functions (e.g. `_try_double`) read data straight from the parser - the `COLITER_NEXT(it, word)` stuff. I wanted to also be able to pass in a hash table to that function, i.e. my categories, and this was my workaround.
you would instead set this in parser state and pass it directly to the function when it's called (e.g. maybe a list of the cols that you want to do hash tracking on)
if I put it in the parser state then checks like this would get moved to runtime, which I was trying to avoid, though to be fair I haven't profiled to see what it costs.
@jreback - see if you like this any better. I am still using a fused type with the inference functions; that's really the best I can think of to make them generic, still very open to suggestions though. I did add Cython's type specialization syntax, e.g.
Current coverage is 85.30% (diff: 100%)

@@             master   #13406   diff @@
========================================
  Files           139      139
  Lines         50108    50138    +30
  Methods           0        0
  Messages          0        0
  Branches          0        0
========================================
+ Hits          42744    42770    +26
- Misses         7364     7368     +4
  Partials          0        0
@chris-b1 like to update this. but also remove the fused type hack. see what you can do.
@jreback - just getting back to this. How would you feel about only parsing string/object
@chris-b1 yes that sounds reasonable
Force-pushed 4a657ca to a3b694f
@jreback - I've updated this to remove the type inference, if you want to have another look.
looks pretty - will have to have a more detailed look. can u post perf numbers? (or update top)
Updated performance at the top, although it's basically the same as it was before.
@@ -440,6 +440,41 @@ individual columns:

    Specifying ``dtype`` with ``engine`` other than 'c' raises a
    ``ValueError``.
Specifying Categorical dtype
add a reference to this (and reference from v0.19.0)
can you test when chunking with dtype='category', both with and w/o different categories in each chunk (it looks to work), but pls test.
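The chunked scenario being requested can be sketched like this: each chunk may see a different subset of categories, and the per-chunk results can be combined with `union_categoricals`. The file contents and column name here are made up for illustration.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import union_categoricals

# Six rows split into chunks of two; categories differ per chunk
csv = StringIO("col\na\na\nb\nc\nc\nd\n")
chunks = pd.read_csv(csv, dtype={"col": "category"}, chunksize=2)
parts = [chunk["col"] for chunk in chunks]

# Each part is categorical, but e.g. ['a'] vs ['b', 'c'] vs ['c', 'd'];
# union_categoricals merges them into one Categorical with the full set
combined = union_categoricals([p.values for p in parts])
```

Without the union step, naively concatenating chunks whose categories differ would fall back to object dtype.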
Force-pushed bd4e515 to 822a423
@jreback - updated for your comments
``Categorical`` columns can be parsed directly by specifying ``dtype='category'``
.. ipython :: python
no space after ipython
otherwise it is interpreted as a comment I think (rst ..)
This actually did work when I built the docs locally, but I also had meant to change it, for consistency if nothing else.
ah, good to know, then the github rst renderer is more picky than sphinx itself (as it is rendered as a comment)
you could add this to the lint check
just grep thru rst and validate the pattern
The resulting categories will always be parsed as string (object dtype).
If the categories are numeric they can be converted using the
:func:`pd.to_numeric` function, or as appropriate, another converter
such as :func:`pd.to_datetime`.
same comment about pd here as above
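The documented recipe above can be sketched as follows. This is a minimal illustration, not code from the PR; the file contents are made up, and the conversion uses `rename_categories`, one way to apply `pd.to_numeric` to the (string) categories after parsing.

```python
from io import StringIO

import pandas as pd

csv = StringIO("col\n1\n2\n1\n3\n")
df = pd.read_csv(csv, dtype={"col": "category"})

# At this point the categories are strings: ['1', '2', '3'].
# Convert the categories (not every row) to numeric values.
df["col"] = df["col"].cat.rename_categories(
    pd.to_numeric(df["col"].cat.categories)
)
```

Converting the categories rather than the full column only touches the small categories array, which is the point of the recommendation.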
expected = self.read_csv(pth, header=None, encoding=encoding)
actual = self.read_csv(pth, header=None, encoding=encoding,
                       dtype={1: 'category'})
actual[1] = actual[1].astype(object)
@chris-b1 Is it possible you switched expected and actual? (it seems more logical to me to test that reading in with an encoding actually returns category dtype, rather than converting that to object and then comparing - done this way, you didn't check that it was category dtype)
I was just being lazy here - related to your other comment, because the current impl doesn't sort categories, it was easier to cast to object (mainly I wanted to test that the decode path was working). I'll update this too.
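The reviewer's suggested direction can be sketched like this: assert the parsed column really is category dtype and compare against the expected values cast to category, instead of downcasting the result to object. The inline CSV and column layout here are illustrative, not the test file from the PR.

```python
from io import StringIO

import pandas as pd

csv = StringIO("0,a\n1,b\n2,a\n")
actual = pd.read_csv(csv, header=None, dtype={1: "category"})

# Check the dtype directly rather than casting it away
assert actual[1].dtype == "category"

# Build the expectation as category, so the comparison covers the dtype too
expected = pd.Series(["a", "b", "a"], name=1).astype("category")
```

This keeps the encoding/decode-path check while still verifying the category dtype survives the round trip.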
- needed for #13406, follow-up to #13763

Author: Chris <[email protected]>
Author: sinhrks <[email protected]>

Closes #13846 from chris-b1/union_categoricals_ordered and squashes the following commits:

3a710f0 [Chris] lint fix
ff0bb5e [Chris] add follow-up PRs to whatsnew
ecb2ae9 [Chris] more tests; handle sort with ordered
eea1777 [Chris] skip re-sort when possible on fastpath
c559662 [sinhrks] ENH: add sort_categories argument to union_categoricals
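The `sort_categories` argument added in that follow-up PR can be sketched as follows; the inputs are made up for illustration.

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "a"])
b = pd.Categorical(["c", "a"])

# sort_categories=True sorts the unioned categories
# (the values themselves keep their original order)
u = union_categoricals([a, b], sort_categories=True)
```

Without `sort_categories=True`, the unioned categories keep the order in which they are encountered across the inputs.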
Force-pushed 822a423 to 1f6093a
@jreback, @jorisvandenbossche - this is updated and rebased if you'd like to take another look - I think I have all the previous comments addressed. In particular the
cats = Index(cats)
if not cats.is_monotonic_increasing:
    unsorted = cats.copy()
    cats = cats.sort_values()
You can use `return_indexer=True` to get the indexer at the same time.
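The suggestion above refers to `Index.sort_values(return_indexer=True)`, which returns the sorted index and the argsort-style indexer in one call, so a separate pass to recover the original positions isn't needed. A minimal sketch:

```python
import pandas as pd

cats = pd.Index(["b", "c", "a"])
if not cats.is_monotonic_increasing:
    # One call yields both the sorted Index and the positions
    # of the sorted elements in the original Index
    sorted_cats, indexer = cats.sort_values(return_indexer=True)
```

The indexer can then be used to remap category codes after sorting, instead of keeping an `unsorted` copy and recomputing positions.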
thanks @chris-b1 as usual, check out the docs when built and issue a followup if needed!
Closes #10153 (at least partly)

Adds the ability to directly parse a `Categorical` through the `dtype` parameter to `read_csv`. Currently it just uses whatever is there as the categories; a possible enhancement would be to allow and enforce user-specified categories, though not quite sure what the api would be. This only parses string categories - originally I had an impl that did type inference on the categories, but it added a lot of complication without much benefit, so now the recommendation in the docs is to convert after parsing.

Here's an example timing. For reasonably sparse data, a slightly worse than 2x speedup is what I'm typically seeing, along with much better memory usage.
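The memory-usage claim can be illustrated with a small sketch along these lines (not the author's benchmark; the data is synthetic, and actual timings vary by machine - the roughly 2x speedup figure above is the author's own measurement):

```python
from io import StringIO

import numpy as np
import pandas as pd

# Low-cardinality string column: 10,000 rows drawn from 3 distinct values
rng = np.random.RandomState(0)
data = "col\n" + "\n".join(rng.choice(["aaaa", "bbbb", "cccc"], size=10000))

obj = pd.read_csv(StringIO(data))["col"]                         # object dtype
cat = pd.read_csv(StringIO(data), dtype={"col": "category"})["col"]  # category

# Category storage is small integer codes plus a short categories array,
# so it should use substantially less memory than per-row Python strings
print(obj.memory_usage(deep=True), cat.memory_usage(deep=True))
```

The sparser the column (few distinct values relative to row count), the larger the memory and parsing-speed advantage of `dtype='category'`.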