Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: parse categoricals in read_csv #13406

Closed
wants to merge 11 commits into from

Conversation

chris-b1
Copy link
Contributor

@chris-b1 chris-b1 commented Jun 8, 2016

Closes #10153 (at least partly)

Adds the ability to directly parse a Categorical through the dtype parameter to read_csv. Currently just uses whatever is there as the categories, a possible enhancement would be to allow and enforce user-specified categories, through not quite sure what the api would be.

This only parses string categories - originally I had an impl that did type inference on the categories, but it added a lot of complication without much benefit, now the recommendation in the docs is to convert after parsing.

Here's an example timing. For reasonably sparse data, a slightly worse than 2x speedup is what I'm typically seeing, along with much better memory usage.

group1 = ['aaaaa', 'bbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']

df = pd.DataFrame({'a': np.random.choice(group1, 10000000).astype('object'),
                   'b': np.random.choice(group1, 10000000).astype('object'),
                   'c': np.random.choice(group1, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)


In [14]: %timeit pd.read_csv('strings.csv').apply(pd.Categorical)
1 loops, best of 3: 6.66 s per loop

In [13]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loop, best of 3: 3.68 s per loop

@jreback jreback added IO CSV read_csv, to_csv Categorical Categorical Data Type labels Jun 8, 2016
# as a sentinel to specialize the function for reading
# from the parser.

# This is to avoid duplicating a bunch of code or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type inference functions (e.g. _try_double) read data straight from the parser - the COLITER_NEXT(it,word) stuff. I wanted to also be able to pass in a hash table to that function, i.e. my categories, and this was my workaround.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you would instead set this in parser state and pass directly to the function when its called (e.g. maybe a list off the cols that you want to do hash tracking)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I put it in the parser state then checks like this would get moved to runtime, which I was trying to avoid, though to be fair I haven't profiled to see what it costs.

@chris-b1
Copy link
Contributor Author

@jreback - see if you like this any better. I am still using a fused type with inference functions, that's really the best I can think of to make them generic, still very open to suggestions though.

I did add Cython's type specialization syntax, e.g. _try_double[use_parser_data](self.parser, ...) which at least make the intent pretty explicit. It's not that far off something like tag based dispatch in C++, though not sure if it's really idiomatic Cython.

@codecov-io
Copy link

codecov-io commented Jun 12, 2016

Current coverage is 85.30% (diff: 100%)

Merging #13406 into master will increase coverage by <.01%

@@             master     #13406   diff @@
==========================================
  Files           139        139          
  Lines         50108      50138    +30   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42744      42770    +26   
- Misses         7364       7368     +4   
  Partials          0          0          

Powered by Codecov. Last update 2beab41...c78f39f

@jreback
Copy link
Contributor

jreback commented Jun 24, 2016

@chris-b1 like to update this. but also remove the fused type hack. see what you can do.

@chris-b1
Copy link
Contributor Author

@jreback - just getting back to this. How would you feel about only parsing string/object Categoricals? That would significantly clean up the code but still cover the biggest use-case, and I could just show in the docs how to convert to numeric after the fact if needed.

@jreback
Copy link
Contributor

jreback commented Jul 13, 2016

@chris-b1 yes that sounds reasonable

@chris-b1
Copy link
Contributor Author

@jreback - I've updated this to remove the type inference, if you want to have another look.

@jreback
Copy link
Contributor

jreback commented Jul 16, 2016

looks pretty - will have to have a more detailed look

can u post perf numbers? (or update top)
with new impl

@chris-b1
Copy link
Contributor Author

Updated performance at the top, although it's basically the same as it was before.

@@ -440,6 +440,41 @@ individual columns:
Specifying ``dtype`` with ``engine`` other than 'c' raises a
``ValueError``.

Specifying Categorical dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a reference to this (and refernce from v0.19.0)

@jreback
Copy link
Contributor

jreback commented Jul 20, 2016

can you test when chunking with dtype='category', for with and w/o different categories in each chunk (it looks to work), but pls test.

@chris-b1
Copy link
Contributor Author

@jreback - updated for your comments


``Categorical`` columns can be parsed directly by specifying ``dtype='category'``

.. ipython :: python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no space after ipython

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise it is interpreted as a comment I think (rst ..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually did work when I built the docs locally, but I also had meant to change it, for consistency if nothing else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good to know, then the github rst renderer is more picky than sphinx itself (as it is rendered as a comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could add this to the lint check

just grep thru rst and validate the pattern

The resulting categories will always be parsed as string (object dtype).
If the categories are numeric they can be converted using the
:func:`pd.to_numeric` function, or as appropriate, another converter
such as :func:`pd.to_datetime`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about pd here as above

expected = self.read_csv(pth, header=None, encoding=encoding)
actual = self.read_csv(pth, header=None, encoding=encoding,
dtype={1: 'category'})
actual[1] = actual[1].astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chris-b1 Is it possible you did switch expected and actual? (it seems more logical to me to test that reading in with an encoding actually returns category dtypes, rather than converting that to object and then comparing, as in this way, you didn't check that it was category dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just being lazy here - related to your other comment, because the current impl doesn't sort categories, it was easier to cast to object (mainly I wanted to test that the decode path was working). I'll update this too.

jreback pushed a commit that referenced this pull request Aug 3, 2016
 - needed for #13406, follow-up to #13763

Author: Chris <[email protected]>
Author: sinhrks <[email protected]>

Closes #13846 from chris-b1/union_categoricals_ordered and squashes the following commits:

3a710f0 [Chris] lint fix
ff0bb5e [Chris] add follow-up PRs to whatsnew
ecb2ae9 [Chris] more tests; handle sorth with ordered
eea1777 [Chris] skip r-esort when possible on fastpath
c559662 [sinhrks] ENH: add sort_categories argument to union_categoricals
@chris-b1
Copy link
Contributor Author

chris-b1 commented Aug 6, 2016

@jreback, @jorisvandenbossche - this is updated and rebased if you'd like to take another look - I think I have all the previous comments addressed. In particular the categories of the parsed Categorical are now sorted.

cats = Index(cats)
if not cats.is_monotonic_increasing:
unsorted = cats.copy()
cats = cats.sort_values()
Copy link
Member

@sinhrks sinhrks Aug 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use return_indexer=True to get indexer at the same time.

@jreback jreback closed this in a292c13 Aug 6, 2016
@jreback
Copy link
Contributor

jreback commented Aug 6, 2016

thanks @chris-b1

as usual, check out the docs when built and issue a followup if needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants