-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pandas take
blocking the GIL
#1404
Comments
now we just need a release |
This has been backported to 2.2.1 FWIW I'm seeing to a much lesser extend It looks like higher bit uints are not handled in https://github.com/phofl/pandas/blob/a1ce9a64384fbce80ecac443c97bf33982c5b7c7/pandas/_libs/algos_take_helper.pxi.in#L15-L35 and the 8bit uint seems to have a duplicate entry here https://github.com/phofl/pandas/blob/a1ce9a64384fbce80ecac443c97bf33982c5b7c7/pandas/_libs/algos_take_helper.pxi.in#L16-L17 I suspect uint is falling back to obj? |
Do you have a query where this happens? I'll take a look later today |
This is query 2. This is the partition_index column, see also dask/distributed#8530 the partition index column is currently a unit64 so casting this to a smaller value is appropriate. If pandas is not implementing a fastpath for uint we can also use a signed integer for now |
I don't actually know the history behind not adding uint here (maybe wheel size, but not sure), so using a singed integer should be the fastest workaround for now |
The uint thing is on purpose btw, we treat bool as uint8, so that's probably the reason that this was added |
I noticed that tpch query 1 is spending only about half it's time in parquet IO when using the dataset that's been produced by pyarrow
However, the run is heavily GIL congested and another GIL+native profile reveals that actually very few things (in python) are holding on to the GIL (native-only threads, e.g. of pyarrow are not tracked by py-spy so we don't see how arrow holds the gil to produce the dataframe)
which points to the take pandas function. There's already been a recent fix to this code area (see pandas-dev/pandas#54483) for axis0
The text was updated successfully, but these errors were encountered: