ENH: pd.cut should be able to return a Series of IntervalDtype #31459

Open
Dr-Irv opened this issue Jan 30, 2020 · 3 comments
Labels
cut (cut, qcut) · Enhancement · Interval (Interval data type) · Needs Discussion (Requires discussion from core team before further action)

Comments

@Dr-Irv
Contributor

Dr-Irv commented Jan 30, 2020

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: import random

In [3]: df=pd.DataFrame({'x': [random.randint(0,10) for i in range(100)]})

In [4]: pd.cut(df.x, [2*i for i in range(6)])
Out[4]:
0      (6.0, 8.0]
1             NaN
2      (0.0, 2.0]
3      (2.0, 4.0]
4      (0.0, 2.0]
         ...
95    (8.0, 10.0]
96     (2.0, 4.0]
97    (8.0, 10.0]
98     (0.0, 2.0]
99    (8.0, 10.0]
Name: x, Length: 100, dtype: category
Categories (5, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8] < (8, 10]]

Problem description

By default, pd.cut returns a categorical, but we know we are creating intervals, so shouldn't the result have IntervalDtype?

Expected Output

In [5]: pd.cut(df.x, [2*i for i in range(6)]).astype('Interval')
Out[5]:
0      (6.0, 8.0]
1             NaN
2      (0.0, 2.0]
3      (2.0, 4.0]
4      (0.0, 2.0]
         ...
95    (8.0, 10.0]
96     (2.0, 4.0]
97    (8.0, 10.0]
98     (0.0, 2.0]
99    (8.0, 10.0]
Name: x, Length: 100, dtype: interval

IMHO, I shouldn't need to call astype('Interval').
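
For readers who want to reproduce the dtype difference without random data, here is a minimal sketch. The fixed input values are an assumption made for reproducibility; the calls themselves are the same pd.cut and astype used above.

```python
import pandas as pd

# Fixed values instead of random ones so the result is reproducible.
x = pd.Series([1, 3, 5, 7, 9], name="x")
bins = [2 * i for i in range(6)]  # [0, 2, 4, 6, 8, 10]

binned = pd.cut(x, bins)                 # categorical; its categories are intervals
as_interval = binned.astype("interval")  # the explicit conversion from the report

print(binned.dtype)       # category
print(as_interval.dtype)  # an interval dtype; the exact repr varies by pandas version
```

The values print the same either way; only the dtype differs, which is what the report is about.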

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.0
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.6
numba : None

@jschendel
Member

The logic behind returning a Categorical here is to save memory. We expect there to typically be many repeated intervals, since the purpose of binning is to group like elements together, and the cut/qcut implementations do not require us to materialize all the intervals, so we can keep memory low throughout.

The memory savings apply doubly when dealing with intervals since we use two arrays to represent them (left/right endpoints). For an extreme example, see the difference in memory usage between an IntervalIndex and Categorical:

In [2]: ii = pd.interval_range(0, 10).repeat(10**4)

In [3]: cat_ii = pd.Categorical(ii)

In [4]: ii.memory_usage()
Out[4]: 1600000

In [5]: cat_ii.memory_usage()
Out[5]: 100480

In [6]: cat_ii.memory_usage()/ii.memory_usage()
Out[6]: 0.0628

I wouldn't be opposed to allowing users to opt-in for getting actual intervals back. Not sure what the best way to do that is though. Perhaps by adding a new option to the labels parameter? At that point it might be the same amount of effort as an astype though.
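
As a rough illustration of what opting in could look like today, here is a sketch of a small wrapper. The name cut_to_interval is hypothetical and not part of pandas; it only chains the existing pd.cut and astype calls, and it assumes the default interval labels are kept.

```python
import pandas as pd

def cut_to_interval(x, bins, **cut_kwargs):
    """Hypothetical helper, not part of pandas: bin with pd.cut, then
    convert the categorical result to interval dtype."""
    # Assumes the default interval labels (no `labels=` override), since
    # custom or integer labels cannot be converted to intervals.
    return pd.cut(x, bins, **cut_kwargs).astype("interval")

s = pd.Series([1, 3, 5, 7, 9] * 1000)
bins = [0, 2, 4, 6, 8, 10]

as_category = pd.cut(s, bins)
as_interval = cut_to_interval(s, bins)

# The categorical stores small integer codes plus the 5 distinct intervals;
# the interval version stores two endpoints for every row.
print(as_category.memory_usage(index=False))
print(as_interval.memory_usage(index=False))
```

The wrapper makes the memory trade-off described above explicit: the caller chooses the larger interval representation rather than getting it by default.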

@jschendel jschendel added Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 31, 2020
@jreback
Contributor

jreback commented Jan 31, 2020

This was done quite a while ago to keep cut backwards compatible and performant.

Meaning, you generally cut and then use the result for indexing or groupby, for example; the result is most useful as a categorical, where both memory and time are well optimized.

I don't think an IntervalIndex (II) is directly useful here and, as @jschendel points out, you can simply astype.

Maybe you can elaborate on why you think an II is useful here?
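
To make the cut-then-groupby pattern concrete, here is a minimal sketch with made-up data. One consequence of grouping on the categorical result is that every bin is reported, including empty ones, when observed=False is passed.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 1, 3, 9]})
bins = [0, 2, 4, 6, 8, 10]

binned = pd.cut(df["x"], bins)

# observed=False is passed explicitly so every category (bin) is reported,
# including the two bins that contain no rows.
sizes = df.groupby(binned, observed=False).size()
print(sizes)
```

Here the (4, 6] and (6, 8] bins appear with a count of 0, which is possible because the categorical carries the full set of bins even when no row falls into them.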

@Dr-Irv
Contributor Author

Dr-Irv commented Jan 31, 2020

So the reason that I found this is I had a dataset where I was doing different cuts, and counting the size of each cut, and wanted to use pd.concat to put the results in one DataFrame. Here is an example (not from my real data):

In [1]: import pandas as pd

In [2]: import random

In [3]: df=pd.DataFrame({'x': [random.random()*12 for i in range(100)]})

In [4]: df
Out[4]:
            x
0    3.874860
1    0.942185
2   10.269179
3   11.327410
4    3.022187
..        ...
95   5.467304
96  11.046697
97   7.480295
98   9.023103
99   6.972999

[100 rows x 1 columns]

In [5]: evencut = pd.cut(df.x, [2*i for i in range(7)])
   ...: threecut = pd.cut(df.x, [3*i for i in range(5)])

In [6]: pd.concat([df.groupby(evencut).size(), df.groupby(threecut).size()])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-19319e3586e1> in <module>
----> 1 pd.concat([df.groupby(evencut).size(), df.groupby(threecut).size()])

-------------------- traceback omitted --------------------------------

TypeError: categories must match existing categories when appending

In [7]: pd.concat([df.groupby(evencut.astype('interval')).size(),
   ...:            df.groupby(threecut.astype('interval')).size()])
Out[7]:
x
(0, 2]      17
(2, 4]      12
(4, 6]      13
(6, 8]      18
(8, 10]     16
(10, 12]    24
(0, 3]      22
(3, 6]      20
(6, 9]      27
(9, 12]     31
dtype: int64

This was the first time that I used pd.cut, so I expected the index after the groupby() to be an IntervalIndex and the pd.concat to work. And, the TypeError that occurred when doing pd.concat on categorical dtype is something I commented on before at #14177 (I'm amazed how some other issue that I encountered in a very different context came up again!)

So I solved it by using .astype('interval') as shown in [7] above, but it just didn't seem intuitive that I'd have to do that.

Also, after doing the binning via the groupby(), I wanted to access the left and right values of each interval, which also required astype('interval').
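
A minimal sketch of that last step, using fixed made-up data rather than my real dataset. The summary frame and its column names are only for illustration, and the pd.IntervalIndex(...) call is a defensive wrapper in case the groupby result index is not already an IntervalIndex.

```python
import pandas as pd

df = pd.DataFrame({"x": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]})
bins = [0, 2, 4, 6]

# Convert to interval dtype before grouping, as in the workaround above.
key = pd.cut(df["x"], bins).astype("interval")
sizes = df.groupby(key).size()

# Wrap defensively so .left and .right are available on the index.
idx = pd.IntervalIndex(sizes.index)

summary = pd.DataFrame({
    "left": idx.left,
    "right": idx.right,
    "count": sizes.to_numpy(),
})
print(summary)
```

Without the conversion, the bounds are still reachable per category, since the categories of the cut result are themselves an IntervalIndex (for example pd.cut(df["x"], bins).cat.categories.left).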

We could modify the API, or maybe this is just a matter of improving the documentation to make it clear that the result is categorical and not an interval, which affects the downstream groupby() operation, and one can do astype('interval') to work around that.

@jbrockmendel jbrockmendel added the cut cut, qcut label Feb 25, 2020
@mroeschke mroeschke added Interval Interval data type and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 27, 2021
5 participants