
datafusion-cli scanning a single large parquet file uses only a single core #5995

Closed
alamb opened this issue Apr 13, 2023 · 6 comments · Fixed by #5997
Labels
bug (Something isn't working)

Comments

alamb (Contributor) commented Apr 13, 2023

Describe the bug

datafusion-cli scanning a single large parquet file uses only a single core

This is a problem, as it makes DataFusion look slow compared to other systems such as DuckDB.

To Reproduce

Download this file:
slow_tpch_query_repro.zip

and follow the instructions:

Set up a virtual env and generate the files:

```shell
python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
pip install --pre --upgrade duckdb

# make the files:
python generate.py
```

Then run the query:

```shell
# Run the query using datafusion-cli:
cd tpch_1
datafusion-cli -f q1.txt
```

Only one core is used, and the query takes several seconds to complete.

Expected behavior

I expect all the cores on the machine to be used to process the query.
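For context, the number of partitions a scan produces (and therefore how many cores it can use) is governed by the session's target_partitions setting, which defaults to the number of cores. A minimal standalone sketch of that relationship, assuming the datafusion and tokio crates; the table name, file path, and query are placeholders, and this is not datafusion-cli's actual code:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Make the default explicit: one target partition per core.
    let cores = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let config = SessionConfig::new().with_target_partitions(cores);
    let ctx = SessionContext::with_config(config);

    // Placeholder table and file; with enough target partitions, the scan of a
    // single large parquet file should be spread across cores.
    ctx.register_parquet("lineitem", "lineitem.parquet", ParquetReadOptions::default())
        .await?;
    ctx.sql("SELECT count(*) FROM lineitem").await?.show().await?;
    Ok(())
}
```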

Additional context

Found while looking at #5942

alamb added the bug label on Apr 13, 2023
alamb (Contributor, Author) commented Apr 13, 2023

Suggestion from @tustvold on #5942 (comment)

Here is the result of running parquet-layout on the lineitem.parquet file:

layout.json.txt

tustvold (Contributor) commented

Do you have a profile of the CPU usage? I would have expected it to parallelize the parquet scanning part; perhaps the bottleneck is elsewhere?

alippai (Contributor) commented Apr 13, 2023

There are 49 row groups, and 48 of them have size 124928 (if anything they are too small, rather than one big row group that would prevent parallel processing).
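For reference, a layout like this can be double-checked directly against the file's metadata with the parquet crate; a minimal sketch (the file path is a placeholder, and this is an illustration rather than the parquet-layout tool itself):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path; point this at the file under investigation.
    let file = File::open("lineitem.parquet")?;
    let reader = SerializedFileReader::new(file)?;

    // Print one line per row group: its row count and total byte size.
    for (i, rg) in reader.metadata().row_groups().iter().enumerate() {
        println!(
            "row group {i}: {} rows, {} bytes",
            rg.num_rows(),
            rg.total_byte_size()
        );
    }
    Ok(())
}
```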

alamb (Contributor, Author) commented Apr 13, 2023

> Do you have a profile of the CPU usage? I would have expected it to parallelize the parquet scanning part; perhaps the bottleneck is elsewhere?

I do not have a profile.

tustvold (Contributor) commented Apr 13, 2023

I added the following debug print to ParquetOpener::open:

```rust
println!(
    "Parquet partition {} reading row groups {:?}",
    partition_index, row_groups
);
```

And got:

```text
Parquet partition 1 reading row groups []
Parquet partition 2 reading row groups []
Parquet partition 0 reading row groups [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48]
Parquet partition 29 reading row groups []
Parquet partition 18 reading row groups []
Parquet partition 27 reading row groups []
Parquet partition 28 reading row groups []
Parquet partition 13 reading row groups []
Parquet partition 22 reading row groups []
Parquet partition 30 reading row groups []
Parquet partition 21 reading row groups []
Parquet partition 7 reading row groups []
Parquet partition 16 reading row groups []
Parquet partition 3 reading row groups []
Parquet partition 23 reading row groups []
Parquet partition 8 reading row groups []
Parquet partition 31 reading row groups []
Parquet partition 10 reading row groups []
Parquet partition 20 reading row groups []
Parquet partition 11 reading row groups []
Parquet partition 12 reading row groups []
Parquet partition 24 reading row groups []
Parquet partition 26 reading row groups []
Parquet partition 5 reading row groups []
Parquet partition 25 reading row groups []
Parquet partition 14 reading row groups []
Parquet partition 17 reading row groups []
Parquet partition 6 reading row groups []
Parquet partition 9 reading row groups []
Parquet partition 19 reading row groups []
Parquet partition 4 reading row groups []
Parquet partition 15 reading row groups []
```

So whilst it is creating lots of partitions, all the row groups appear to lie in a single partition. This explains why we are not seeing any parallelism. Why this is the case needs more investigation.

Edit: using https://github.com/apache/arrow-rs/pull/4086/files I confirmed the byte ranges should distribute the row groups.
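To make the expected behaviour concrete, here is an illustrative sketch of byte-range assignment (made-up types and function names, not the DataFusion or arrow-rs code): each partition scans a contiguous byte range of the file, and a row group is read by the single partition whose range contains, say, its midpoint.

```rust
// Illustrative sketch only. Each partition covers a contiguous byte range of
// the file; a row group is assigned to exactly one partition, decided by where
// the row group's midpoint falls.
struct ByteRange {
    start: u64, // inclusive
    end: u64,   // exclusive
}

/// Return the indices of the row groups whose midpoint falls inside `partition`.
fn row_groups_for_partition(row_groups: &[ByteRange], partition: &ByteRange) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let mid = rg.start + (rg.end - rg.start) / 2;
            mid >= partition.start && mid < partition.end
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // Two row groups of 100 bytes each, file split into two equal partitions:
    // each partition should read exactly one row group.
    let row_groups = vec![
        ByteRange { start: 0, end: 100 },
        ByteRange { start: 100, end: 200 },
    ];
    let partitions = vec![
        ByteRange { start: 0, end: 100 },
        ByteRange { start: 100, end: 200 },
    ];
    for (p, part) in partitions.iter().enumerate() {
        println!(
            "partition {p} reads row groups {:?}",
            row_groups_for_partition(&row_groups, part)
        );
    }
}
```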

tustvold (Contributor) commented

With the fix in #5997 we have much more parallelism and the query runs much faster.

