Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pandas take blocking the GIL #1404

Open
fjetter opened this issue Feb 16, 2024 · 6 comments
Open

Pandas take blocking the GIL #1404

fjetter opened this issue Feb 16, 2024 · 6 comments

Comments

@fjetter
Copy link
Member

fjetter commented Feb 16, 2024

I noticed that tpch query 1 is spending only about half it's time in parquet IO when using the dataset that's been produced by pyarrow

image

However, the run is heavily GIL congested and another GIL+native profile reveals that actually very few things (in python) are holding on to the GIL (native-only threads, e.g. of pyarrow are not tracked by py-spy so we don't see how arrow holds the gil to produce the dataframe)

image

which points to the take pandas function. There's already been a recent fix to this code area (see pandas-dev/pandas#54483) for axis0

@fjetter
Copy link
Member Author

fjetter commented Feb 16, 2024

pandas-dev/pandas#57454

now we just need a release

@fjetter
Copy link
Member Author

fjetter commented Feb 26, 2024

This has been backported to 2.2.1

FWIW I'm seeing to a much lesser extend take_2d_axis1_object_object calls being invoked and blocking the GIL despite the dataframe not having any object columns. From what I can tell the columns this is referring to are uint64.

It looks like higher bit uints are not handled in https://github.com/phofl/pandas/blob/a1ce9a64384fbce80ecac443c97bf33982c5b7c7/pandas/_libs/algos_take_helper.pxi.in#L15-L35 and the 8bit uint seems to have a duplicate entry here https://github.com/phofl/pandas/blob/a1ce9a64384fbce80ecac443c97bf33982c5b7c7/pandas/_libs/algos_take_helper.pxi.in#L16-L17

I suspect uint is falling back to obj?

@phofl
Copy link
Contributor

phofl commented Feb 26, 2024

Do you have a query where this happens?

I'll take a look later today

@fjetter
Copy link
Member Author

fjetter commented Feb 26, 2024

This is query 2. This is the partition_index column, see also dask/distributed#8530

the partition index column is currently a unit64 so casting this to a smaller value is appropriate. If pandas is not implementing a fastpath for uint we can also use a signed integer for now

@phofl
Copy link
Contributor

phofl commented Feb 26, 2024

I don't actually know the history behind not adding uint here (maybe wheel size, but not sure), so using a singed integer should be the fastest workaround for now

@phofl
Copy link
Contributor

phofl commented Feb 26, 2024

The uint thing is on purpose btw, we treat bool as uint8, so that's probably the reason that this was added

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants