
datafusion-cli scanning a single large parquet file uses only a single core #5995

Closed
alamb opened this issue Apr 13, 2023 · 6 comments · Fixed by #5997
Labels
bug (Something isn't working)

Comments

alamb (Contributor) commented Apr 13, 2023

Describe the bug

datafusion-cli scanning a single large parquet file uses only a single core

This is a problem, as it makes DataFusion look slow compared to other systems such as DuckDB.

To Reproduce

Download this file:
slow_tpch_query_repro.zip

and follow the instructions:

Set up a virtual env and generate the files:

```shell
python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
pip install --pre --upgrade duckdb

# make the files:
python generate.py
```

Then run the query:

```shell
# Run the query using datafusion-cli:
cd tpch_1
datafusion-cli -f q1.txt
```

Only one core is used, and the query takes several seconds to complete.

Expected behavior

I expect all the cores on the machine to be used to process the query.
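For context, the number of partitions a scan produces (and therefore how many cores it can use) is governed by the session's target_partitions setting, which defaults to the number of cores. A minimal standalone sketch of that relationship, assuming the datafusion and tokio crates; the table name, file path, and query are placeholders, and this is not datafusion-cli's actual code:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Make the default explicit: one target partition per core.
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let config = SessionConfig::new().with_target_partitions(cores);
    let ctx = SessionContext::with_config(config);

    // Placeholder table and file; with enough target partitions, the scan of a
    // single large parquet file should be spread across cores.
    ctx.register_parquet("lineitem", "lineitem.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM lineitem").await?.show().await?;
    Ok(())
}
```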

Additional context

Found while looking at #5942

alamb added the bug label on Apr 13, 2023
alamb (Contributor, Author) commented Apr 13, 2023

Suggestion from @tustvold on #5942 (comment)

Here is the result of running parquet-layout on the lineitem.parquet file:

layout.json.txt

tustvold (Contributor) commented

Do you have a profile of the CPU usage? I would have expected it to parallelize the parquet scanning part; perhaps the bottleneck is elsewhere?

alippai (Contributor) commented Apr 13, 2023

There are 49 row groups, and 48 of them have size 124928 (if anything they are too small, rather than one big row group that would prevent parallel processing).
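For reference, a layout like this can be double-checked directly against the file's metadata with the parquet crate; a minimal sketch (the file path is a placeholder, and this is an illustration rather than the parquet-layout tool itself):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; point this at the file under investigation.
    let file = File::open("lineitem.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // Print one line per row group: its row count and total byte size.
    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        println!(
            "row group {i}: {} rows, {} bytes",
            rg.num_rows(),
            rg.total_byte_size()
        );
    }
    Ok(())
}
```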

alamb (Contributor, Author) commented Apr 13, 2023

> Do you have a profile of the CPU usage? I would have expected it to parallelize the parquet scanning part; perhaps the bottleneck is elsewhere?

I do not have a profile.

tustvold (Contributor) commented Apr 13, 2023

I added the following debug print to ParquetOpener::open:

```rust
println!(
    "Parquet partition {} reading row groups {:?}",
    partition_index, row_groups
);
```

And got:

```text
Parquet partition 1 reading row groups []
Parquet partition 2 reading row groups []
Parquet partition 0 reading row groups [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48]
Parquet partition 29 reading row groups []
Parquet partition 18 reading row groups []
Parquet partition 27 reading row groups []
Parquet partition 28 reading row groups []
Parquet partition 13 reading row groups []
Parquet partition 22 reading row groups []
Parquet partition 30 reading row groups []
Parquet partition 21 reading row groups []
Parquet partition 7 reading row groups []
Parquet partition 16 reading row groups []
Parquet partition 3 reading row groups []
Parquet partition 23 reading row groups []
Parquet partition 8 reading row groups []
Parquet partition 31 reading row groups []
Parquet partition 10 reading row groups []
Parquet partition 20 reading row groups []
Parquet partition 11 reading row groups []
Parquet partition 12 reading row groups []
Parquet partition 24 reading row groups []
Parquet partition 26 reading row groups []
Parquet partition 5 reading row groups []
Parquet partition 25 reading row groups []
Parquet partition 14 reading row groups []
Parquet partition 17 reading row groups []
Parquet partition 6 reading row groups []
Parquet partition 9 reading row groups []
Parquet partition 19 reading row groups []
Parquet partition 4 reading row groups []
Parquet partition 15 reading row groups []
```

So whilst it is creating lots of partitions, all the row groups appear to lie in a single partition. This explains why we are not seeing any parallelism. Why this is the case needs more investigation.

Edit: using https://github.com/apache/arrow-rs/pull/4086/files I confirmed the byte ranges should distribute the row groups.
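To make the expected behaviour concrete, here is an illustrative sketch of byte-range assignment (made-up types and function names, not the DataFusion or arrow-rs code): each partition scans a contiguous byte range of the file, and a row group is read by the single partition whose range contains, say, its midpoint.

```rust
// Illustrative sketch only. Each partition covers a contiguous byte range of
// the file; a row group is assigned to exactly one partition, decided by where
// the row group's midpoint falls.
struct ByteRange {
    start: u64, // inclusive
    end: u64,   // exclusive
}

/// Return the indices of the row groups whose midpoint falls inside `partition`.
fn row_groups_for_partition(row_groups: &[ByteRange], partition: &ByteRange) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let mid = rg.start + (rg.end - rg.start) / 2;
            mid >= partition.start && mid < partition.end
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Two row groups of 100 bytes each, file split into two equal partitions:
    // each partition should read exactly one row group.
    let row_groups = vec![
        ByteRange { start: 0, end: 100 },
        ByteRange { start: 100, end: 200 },
    ];
    let partitions = vec![
        ByteRange { start: 0, end: 100 },
        ByteRange { start: 100, end: 200 },
    ];
    for (p, part) in partitions.iter().enumerate() {
        println!(
            "partition {p} reads row groups {:?}",
            row_groups_for_partition(&row_groups, part)
        );
    }
}
```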

tustvold (Contributor) commented

With the fix in #5997 we have much more parallelism and the query runs much faster.

