ENH: pd.cut should be able to return a Series of IntervalDtype #31459

Dr-Irv · 2020-01-30T16:01:59Z

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: import random

In [3]: df=pd.DataFrame({'x': [random.randint(0,10) for i in range(100)]})

In [4]: pd.cut(df.x, [2*i for i in range(6)])
Out[4]:
0      (6.0, 8.0]
1             NaN
2      (0.0, 2.0]
3      (2.0, 4.0]
4      (0.0, 2.0]
         ...
95    (8.0, 10.0]
96     (2.0, 4.0]
97    (8.0, 10.0]
98     (0.0, 2.0]
99    (8.0, 10.0]
Name: x, Length: 100, dtype: category
Categories (5, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8] < (8, 10]]

Problem description

By default, pd.cut returns a categorical, but we know we are creating intervals, so shouldn't the result have IntervalDtype?

Expected Output

In [5]: pd.cut(df.x, [2*i for i in range(6)]).astype('Interval')
Out[5]:
0      (6.0, 8.0]
1             NaN
2      (0.0, 2.0]
3      (2.0, 4.0]
4      (0.0, 2.0]
         ...
95    (8.0, 10.0]
96     (2.0, 4.0]
97    (8.0, 10.0]
98     (0.0, 2.0]
99    (8.0, 10.0]
Name: x, Length: 100, dtype: interval

IMHO, I shouldn't need to do the `astype('Interval')`.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.6
numba : None


jschendel · 2020-01-31T07:07:43Z

The logic behind returning a Categorical here is to save memory. We expect there to typically be many repeated intervals, as the purpose of binning is to group like elements together, and the cut/qcut implementations do not require us to materialize all the intervals, so we can keep memory low throughout.

The memory savings apply doubly when dealing with intervals since we use two arrays to represent them (left/right endpoints). For an extreme example, see the difference in memory usage between an IntervalIndex and Categorical:

In [2]: ii = pd.interval_range(0, 10).repeat(10**4)

In [3]: cat_ii = pd.Categorical(ii)

In [4]: ii.memory_usage()
Out[4]: 1600000

In [5]: cat_ii.memory_usage()
Out[5]: 100480

In [6]: cat_ii.memory_usage()/ii.memory_usage()
Out[6]: 0.0628

I wouldn't be opposed to allowing users to opt-in for getting actual intervals back. Not sure what the best way to do that is though. Perhaps by adding a new option to the labels parameter? At that point it might be the same amount of effort as an astype though.
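
To make the trade-off concrete for the output of pd.cut itself, here is a minimal sketch. The data, the seed, and the variable names are made up for illustration, and the exact byte counts will vary by pandas version and platform, so none are quoted here. It compares the default categorical result with the same data after astype('interval'):

```python
import numpy as np
import pandas as pd

# Illustrative data only: 100,000 integers in 1..10, so every value lands in a bin.
rng = np.random.default_rng(0)
values = pd.Series(rng.integers(1, 11, size=100_000), name="x")

# Default behaviour: a categorical whose categories are the five intervals.
binned = pd.cut(values, bins=[0, 2, 4, 6, 8, 10])

# Opt-in conversion: every row now stores its own left/right endpoints.
as_interval = binned.astype("interval")

cat_bytes = binned.memory_usage(index=False)
interval_bytes = as_interval.memory_usage(index=False)

print(binned.dtype, cat_bytes)
print(as_interval.dtype, interval_bytes)
```

The categorical stores one small integer code per row plus the handful of interval categories, whereas the interval version stores two endpoint arrays of full length.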

jreback · 2020-01-31T10:47:53Z

This was done quite a while ago, both to keep cut backwards compatible and for performance.

That is, you generally cut and then use the result for indexing or in a groupby, for example. The result is most useful as a categorical for efficiency: both memory and time are pretty well optimized.

I don't think an IntervalIndex (II) is directly useful here, and as @jschendel points out, you can simply astype.

Maybe you can elaborate on why you think an II is useful here?
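
As a small illustration of the cut-then-groupby pattern described above (a sketch with made-up data, not taken from the thread), grouping directly on the categorical result keeps every bin in the output, including bins that received no rows. Passing observed=False explicitly is an assumption made here to keep the behaviour the same across pandas versions:

```python
import pandas as pd

# Made-up data: nothing falls in (4, 6] or (6, 8].
df = pd.DataFrame({"x": [1, 2, 3, 9, 10]})

# pd.cut returns a categorical Series whose categories are the five bins.
bins = pd.cut(df["x"], bins=[0, 2, 4, 6, 8, 10])

# observed=False asks groupby to report all categories, not only those present.
counts = df.groupby(bins, observed=False).size()
print(counts)
```

The resulting index is a CategoricalIndex over the interval categories, which is why a later pd.concat with differently binned results can complain about mismatched categories.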

Dr-Irv · 2020-01-31T18:15:42Z

So the reason that I found this is I had a dataset where I was doing different cuts, and counting the size of each cut, and wanted to use pd.concat to put the results in one DataFrame. Here is an example (not from my real data):

In [1]: import pandas as pd

In [2]: import random

In [3]: df=pd.DataFrame({'x': [random.random()*12 for i in range(100)]})

In [4]: df
Out[4]:
            x
0    3.874860
1    0.942185
2   10.269179
3   11.327410
4    3.022187
..        ...
95   5.467304
96  11.046697
97   7.480295
98   9.023103
99   6.972999

[100 rows x 1 columns]

In [5]: evencut = pd.cut(df.x, [2*i for i in range(7)])
   ...: threecut = pd.cut(df.x, [3*i for i in range(5)])

In [6]: pd.concat([df.groupby(evencut).size(), df.groupby(threecut).size()])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-19319e3586e1> in <module>
----> 1 pd.concat([df.groupby(evencut).size(), df.groupby(threecut).size()])

-------------------- traceback omitted --------------------------------

TypeError: categories must match existing categories when appending

In [7]: pd.concat([df.groupby(evencut.astype('interval')).size(),
   ...:            df.groupby(threecut.astype('interval')).size()])
Out[7]:
x
(0, 2]      17
(2, 4]      12
(4, 6]      13
(6, 8]      18
(8, 10]     16
(10, 12]    24
(0, 3]      22
(3, 6]      20
(6, 9]      27
(9, 12]     31
dtype: int64

This was the first time that I used pd.cut, so I expected the index after the groupby() to be an IntervalIndex and the pd.concat to work. And, the TypeError that occurred when doing pd.concat on categorical dtype is something I commented on before at #14177 (I'm amazed how some other issue that I encountered in a very different context came up again!)

So I solved it by using .astype('interval') as shown in [7] above, but it just didn't seem intuitive that I'd have to do that.

Also, after doing the binning via the groupby(), I wanted to get access to the left and right values of each interval, which also required an astype('interval').

We could modify the API, or maybe this is just a matter of improving the documentation to make it clear that the result is categorical and not an interval, which affects the downstream groupby() operation, and one can do astype('interval') to work around that.
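
For reference, the workaround can be wrapped up as follows. This is only a sketch: cut_as_interval is a hypothetical helper name, not a pandas function, and the data is made up rather than taken from the session above.

```python
import pandas as pd


def cut_as_interval(values, bins, **kwargs):
    """Hypothetical helper: bin with pd.cut, then convert to IntervalDtype."""
    return pd.cut(values, bins, **kwargs).astype("interval")


# Made-up data; every value falls inside both sets of bins.
df = pd.DataFrame({"x": [0.5, 1.5, 2.5, 4.5, 7.0, 11.0]})

evencut = cut_as_interval(df["x"], [2 * i for i in range(7)])
threecut = cut_as_interval(df["x"], [3 * i for i in range(5)])

# With interval keys, the two results concatenate without the categories
# TypeError, because neither index is categorical.
combined = pd.concat([df.groupby(evencut).size(), df.groupby(threecut).size()])
print(combined)

# The endpoints of each bin are available directly on the index.
print(combined.index.left)
print(combined.index.right)
```

One difference from the categorical result is that grouping on interval keys only reports bins that actually contain rows, so empty bins do not show up in the counts.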

jschendel added the Enhancement, Needs Discussion (requires discussion from core team before further action), and Reshaping (concat, merge/join, stack/unstack, explode) labels on Jan 31, 2020

jbrockmendel added the cut (cut, qcut) label on Feb 25, 2020

mroeschke added the Interval (interval data type) label and removed the Reshaping (concat, merge/join, stack/unstack, explode) label on Jul 27, 2021
