ENH: parse categoricals in read_csv #13406

chris-b1 · 2016-06-08T23:31:14Z

Closes #10153 (at least partly)

Adds the ability to directly parse a Categorical through the dtype parameter to read_csv. Currently it just uses whatever values are present as the categories. A possible enhancement would be to allow and enforce user-specified categories, though I'm not quite sure what the API would be.

This only parses string categories. Originally I had an impl that did type inference on the categories, but it added a lot of complication without much benefit, so the recommendation in the docs is now to convert after parsing.
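As a minimal sketch of the "convert after parsing" recommendation (the column name and inline CSV text are made up for illustration, and a reasonably recent pandas is assumed), the string categories can be turned into numbers once the file has been read:

```python
import io

import pandas as pd

# Hypothetical inline CSV; a file path or any other buffer works the same way.
csv_text = "col\n1\n2\n1\n3\n"

df = pd.read_csv(io.StringIO(csv_text), dtype="category")

# With dtype="category" the categories come back as strings.
raw_categories = list(df["col"].cat.categories)

# Convert only the categories (not every row) to numeric after parsing.
numeric_categories = pd.to_numeric(df["col"].cat.categories)
df["col"] = df["col"].cat.rename_categories(numeric_categories)
converted_categories = list(df["col"].cat.categories)
```

Only the handful of categories is converted, so this stays cheap even for long columns.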

Here's an example timing. For reasonably sparse data, I'm typically seeing a speedup of slightly under 2x, along with much better memory usage.

import numpy as np
import pandas as pd

group1 = ['aaaaa', 'bbbbb', 'cccccc', 'ddddddd', 'eeeeeeee']

df = pd.DataFrame({'a': np.random.choice(group1, 10000000).astype('object'),
                   'b': np.random.choice(group1, 10000000).astype('object'),
                   'c': np.random.choice(group1, 10000000).astype('object')})
df.to_csv('strings.csv', index=False)


In [14]: %timeit pd.read_csv('strings.csv').apply(pd.Categorical)
1 loops, best of 3: 6.66 s per loop

In [13]: %timeit pd.read_csv('strings.csv', dtype='category')
1 loop, best of 3: 3.68 s per loop
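The memory side of the comparison is not measured in the timings above. A rough way to check it, assuming the same kind of low-cardinality string data but at a much smaller, illustrative size, is to compare `memory_usage(deep=True)` for an object column and its categorical equivalent:

```python
import numpy as np
import pandas as pd

# Much smaller than the benchmark data so it runs quickly; sizes are illustrative.
group1 = ["aaaaa", "bbbbb", "cccccc", "ddddddd", "eeeeeeee"]
rng = np.random.RandomState(0)
as_object = pd.Series(rng.choice(group1, 100000).astype("object"))
as_category = as_object.astype("category")

# deep=True includes the string payloads, not just the per-row pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The exact numbers depend on the pandas version and platform, so none are quoted here.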

jreback · 2016-06-08T23:55:57Z

pandas/parser.pyx

+# as a sentinel to specialize the function for reading
+# from the parser.
+
+# This is to avoid duplicating a bunch of code or


why is this needed?

The type inference functions (e.g. _try_double) read data straight from the parser - the COLITER_NEXT(it,word) stuff. I wanted to also be able to pass in a hash table to that function, i.e. my categories, and this was my workaround.

you would instead set this in the parser state and pass it directly to the function when it's called (e.g. maybe a list of the cols that you want to do hash tracking on)

if I put it in the parser state then checks like this would get moved to runtime, which I was trying to avoid, though to be fair I haven't profiled to see what it costs.
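For readers unfamiliar with the hash-table idea being discussed, here is a rough pure-Python sketch. It is not the Cython code in this PR; `encode_tokens` and its `na_values` handling are hypothetical names chosen for illustration. Each parsed token is looked up in a hash table, which yields an integer code and collects the categories in first-seen order:

```python
def encode_tokens(tokens, na_values=("", "NA")):
    """Map string tokens to integer codes, using a dict as the hash table."""
    table = {}  # token -> code, filled in first-seen order
    codes = []
    for token in tokens:
        if token in na_values:
            codes.append(-1)  # missing values get the sentinel code -1
            continue
        # setdefault inserts the next free code only for unseen tokens.
        codes.append(table.setdefault(token, len(table)))
    categories = list(table)  # dicts preserve insertion order (Python 3.7+)
    return codes, categories


codes, categories = encode_tokens(["b", "a", "", "b", "c"])
```

In this sketch the categories come out in first-seen order rather than sorted.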

chris-b1 · 2016-06-11T21:20:31Z

@jreback - see if you like this any better. I am still using a fused type with the inference functions; that's really the best I can think of to make them generic, though I'm still very open to suggestions.

I did add Cython's type specialization syntax, e.g. _try_double[use_parser_data](self.parser, ...), which at least makes the intent pretty explicit. It's not that far off something like tag-based dispatch in C++, though I'm not sure if it's really idiomatic Cython.

codecov-io · 2016-06-12T15:18:51Z

Current coverage is 85.30% (diff: 100%)

Merging #13406 into master will increase coverage by <.01%

@@             master     #13406   diff @@
==========================================
  Files           139        139          
  Lines         50108      50138    +30   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          42744      42770    +26   
- Misses         7364       7368     +4   
  Partials          0          0

Powered by Codecov. Last update 2beab41...c78f39f

jreback · 2016-06-24T10:08:55Z

@chris-b1 like to update this. but also remove the fused type hack. see what you can do.

chris-b1 · 2016-07-13T13:25:13Z

@jreback - just getting back to this. How would you feel about only parsing string/object Categoricals? That would significantly clean up the code but still cover the biggest use-case, and I could just show in the docs how to convert to numeric after the fact if needed.

jreback · 2016-07-13T19:31:47Z

@chris-b1 yes that sounds reasonable

chris-b1 · 2016-07-16T19:23:40Z

@jreback - I've updated this to remove the type inference, if you want to have another look.

jreback · 2016-07-16T19:56:45Z

looks pretty - will have to have a more detailed look

can u post perf numbers? (or update top)
with new impl

chris-b1 · 2016-07-16T20:06:21Z

Updated performance at the top, although it's basically the same as it was before.

jreback · 2016-07-20T11:51:37Z

doc/source/io.rst

@@ -440,6 +440,41 @@ individual columns:
    Specifying ``dtype`` with ``engine`` other than 'c' raises a
    ``ValueError``.

+Specifying Categorical dtype


add a reference to this (and reference it from v0.19.0)

jreback · 2016-07-20T12:09:49Z

can you test chunking with dtype='category', both with and w/o different categories in each chunk? (it looks to work, but pls test)
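A minimal sketch of the chunked case, assuming a recent pandas where `union_categoricals` is importable from `pandas.api.types` (the inline data is hypothetical):

```python
import io

import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical data: the second chunk contains a category ("c") the first lacks.
csv_text = "col\na\nb\nb\nc\n"

chunks = list(pd.read_csv(io.StringIO(csv_text), dtype="category", chunksize=2))

# Each chunk only knows about the categories it has seen.
per_chunk_categories = [list(chunk["col"].cat.categories) for chunk in chunks]

# union_categoricals merges the per-chunk categories into one Categorical.
combined = union_categoricals([chunk["col"] for chunk in chunks])
```

Combining through `union_categoricals` keeps the result categorical even though the chunks disagree on their categories.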

chris-b1 · 2016-07-22T00:57:32Z

@jreback - updated for your comments

jorisvandenbossche · 2016-07-22T09:15:36Z

doc/source/io.rst

+
+``Categorical`` columns can be parsed directly by specifying ``dtype='category'``
+
+.. ipython :: python


no space after ipython

otherwise it is interpreted as a comment I think (rst ..)

This actually did work when I built the docs locally, but I also had meant to change it, for consistency if nothing else.

ah, good to know, then the github rst renderer is more picky than sphinx itself (as it is rendered as a comment)

you could add this to the lint check

just grep thru rst and validate the pattern
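One way such a check might look, as a sketch only: the regular expression and the `find_bad_directives` helper below are hypothetical and not part of the pandas lint scripts. It flags any directive that has whitespace between its name and the `::`:

```python
import re

# Matches a directive marker such as ".. ipython :: python", where whitespace
# sits between the directive name and the "::".
BAD_DIRECTIVE = re.compile(r"^\s*\.\.\s+[\w-]+\s+::")


def find_bad_directives(text):
    """Return (line_number, line) pairs for directives with a space before '::'."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if BAD_DIRECTIVE.match(line)
    ]


sample = ".. ipython :: python\n\n.. ipython:: python\n"
problems = find_bad_directives(sample)
```

A lint step could run this over the contents of each .rst file and fail when the returned list is non-empty.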

jorisvandenbossche · 2016-07-22T09:22:02Z

doc/source/whatsnew/v0.19.0.txt

+   The resulting categories will always be parsed as string (object dtype).
+   If the categories are numeric they can be converted using the
+   :func:`pd.to_numeric` function, or as appropriate, another converter
+   such as :func:`pd.to_datetime`.


same comment about pd here as above

jorisvandenbossche · 2016-07-23T15:45:56Z

pandas/io/tests/parser/c_parser_only.py

+        expected = self.read_csv(pth, header=None, encoding=encoding)
+        actual = self.read_csv(pth, header=None, encoding=encoding,
+                               dtype={1: 'category'})
+        actual[1] = actual[1].astype(object)


@chris-b1 Is it possible you switched expected and actual? (It seems more logical to me to test that reading in with an encoding actually returns category dtype, rather than converting to object and then comparing, since that way you never check that it was category dtype.)

I was just being lazy here - related to your other comment, because the current impl doesn't sort categories, it was easier to cast to object (mainly I wanted to test that the decode path was working). I'll update this too.
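A sketch of the kind of check being asked for, using hypothetical latin-1 data in place of the test file referenced in the diff and assuming a pandas recent enough to expose `pd.CategoricalDtype`. It asserts the dtype directly and only then compares values:

```python
import io

import pandas as pd

# Hypothetical two-column data with non-ASCII text, encoded as latin-1.
raw = "a,é\nb,ü\nc,é\n".encode("latin-1")

expected = pd.read_csv(io.BytesIO(raw), header=None, encoding="latin-1")
actual = pd.read_csv(io.BytesIO(raw), header=None, encoding="latin-1",
                     dtype={1: "category"})

# Check the dtype itself first, then compare values against the plain parse.
is_categorical = isinstance(actual[1].dtype, pd.CategoricalDtype)
same_values = actual[1].astype(object).tolist() == expected[1].tolist()
```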

- needed for #13406, follow-up to #13763

Author: Chris
Author: sinhrks

Closes #13846 from chris-b1/union_categoricals_ordered and squashes the following commits:

3a710f0 [Chris] lint fix
ff0bb5e [Chris] add follow-up PRs to whatsnew
ecb2ae9 [Chris] more tests; handle sorth with ordered
eea1777 [Chris] skip r-esort when possible on fastpath
c559662 [sinhrks] ENH: add sort_categories argument to union_categoricals

chris-b1 · 2016-08-06T13:42:44Z

@jreback, @jorisvandenbossche - this is updated and rebased if you'd like to take another look - I think I have all the previous comments addressed. In particular the categories of the parsed Categorical are now sorted.

sinhrks · 2016-08-06T20:52:27Z

pandas/parser.pyx

+            cats = Index(cats)
+            if not cats.is_monotonic_increasing:
+                unsorted = cats.copy()
+                cats = cats.sort_values()


You can use return_indexer=True to get indexer at the same time.
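To illustrate the suggestion, here is a small sketch with made-up categories and codes. It is not the code that was merged. `Index.sort_values(return_indexer=True)` returns the sorted Index together with the indexer, which can then be inverted to remap the existing codes:

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted categories with codes pointing into them (-1 = missing).
cats = pd.Index(["b", "c", "a"])
codes = np.array([0, 1, 2, 0, -1])

# sort_values can return the sorted Index and the sorting indexer in one call.
sorted_cats, indexer = cats.sort_values(return_indexer=True)

# indexer maps new position -> old position; invert it to recode the old codes.
old_to_new = np.empty(len(indexer), dtype=np.intp)
old_to_new[indexer] = np.arange(len(indexer))
new_codes = np.where(codes >= 0, old_to_new[codes], -1)

result = pd.Categorical.from_codes(new_codes, categories=sorted_cats)
```

This avoids keeping a separate copy of the unsorted categories just to compute the mapping.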

jreback · 2016-08-06T22:54:16Z

thanks @chris-b1

as usual, check out the docs when built and issue a followup if needed!

jreback added the IO CSV (read_csv, to_csv) and Categorical (Categorical Data Type) labels Jun 8, 2016

jreback reviewed Jun 8, 2016

chris-b1 force-pushed the categorical-parse branch from 4a657ca to a3b694f Compare July 16, 2016 17:07

jreback reviewed Jul 20, 2016

chris-b1 force-pushed the categorical-parse branch from bd4e515 to 822a423 Compare July 22, 2016 00:55

jorisvandenbossche reviewed Jul 22, 2016

chris-b1 mentioned this pull request Jul 22, 2016

BUG: union_categoricals can't handle NaN #13759

Closed

3 tasks

jorisvandenbossche reviewed Jul 23, 2016

This was referenced Jul 23, 2016

ENH: union_categorical supports identical categories with ordered #13763

Closed

ENH: add sort_categories argument to union_categoricals #13846

Closed

chris-b1 added 9 commits August 4, 2016 18:09

ENH: parse categoricals in read_csv

286d907

clean up dtype checking, add function specialization

cfa0ce4

fix some dtype checking

849a112

undo type inference add docs and asv

4e0722d

fix hash table ordering, null categories

2490949

doc fixups; addl tests

1254768

flake8 fix

da5c5b5

wip

0f0dba6

rebase

1f6093a

chris-b1 force-pushed the categorical-parse branch from 822a423 to 1f6093a Compare August 4, 2016 23:17

chris-b1 added 2 commits August 4, 2016 18:30

doc fixups

75ed6ba

rebase fixup

c78f39f

sinhrks reviewed Aug 6, 2016

jreback closed this in a292c13 Aug 6, 2016

jreback added this to the 0.19.0 milestone Aug 6, 2016

TomAugspurger mentioned this pull request Oct 26, 2016

API: Expand read_csv dtype for categoricals #14503

Open

jcrist mentioned this pull request Jan 3, 2017

Support non-uniform categoricals dask/dask#1877

Merged

12 tasks
