group by high cardinality column in datafusion 10 times slower than low cardinality column #1246

jiangzhx · 2021-11-05T05:30:49Z

Describe the bug
Grouping by a high-cardinality column in DataFusion is about 10 times slower than grouping by a low-cardinality column.
I also tested other OLAP engines; there, the high-cardinality query is only about 2 times slower or less.

Trino (OLAP engine written in Java)

low cardinality: ~1400 ms
high cardinality: ~2700 ms

Doris (OLAP engine written in C++)

low cardinality: ~350 ms
high cardinality: ~500 ms

To Reproduce
Steps to reproduce the behavior:
A Parquet table with 60,000,000 rows; data generated by ssb-dbgen.

group by LO_ORDERPRIORITY

SELECT sum(LO_EXTENDEDPRICE) AS revenue  FROM lineorder_flat group by LO_ORDERPRIORITY;
5 rows in set. Query took 0.341 seconds.

group by S_ADDRESS

SELECT sum(LO_EXTENDEDPRICE) AS revenue  FROM lineorder_flat group by S_ADDRESS;
20000 rows in set. Query took 2.582 seconds.
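To put the timings side by side, the slowdown factor for each engine can be computed directly from the numbers reported above. This is only a small arithmetic sketch; the `slowdown` helper is a hypothetical name, and the DataFusion figures are the 0.341 s and 2.582 s query times converted to milliseconds.

```rust
/// Ratio of the high-cardinality time to the low-cardinality time.
fn slowdown(low_ms: f64, high_ms: f64) -> f64 {
    high_ms / low_ms
}

fn main() {
    // (engine, low-cardinality ms, high-cardinality ms), copied from this report.
    let timings = [
        ("trino", 1400.0, 2700.0),
        ("doris", 350.0, 500.0),
        ("datafusion", 341.0, 2582.0),
    ];
    for (engine, low, high) in timings {
        println!("{}: {:.1}x slower", engine, slowdown(low, high));
    }
}
```

By these numbers the DataFusion slowdown is closer to 7.6x than 10x, against roughly 1.9x for Trino and 1.4x for Doris.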

Expected behavior
The slowdown should be similar to that of the other engines.

xudong963 · 2021-11-05T05:45:31Z

If I recall correctly, DataFusion doesn't do fine-grained optimization of GROUP BY and aggregate functions at present. Since DataFusion is an AP (analytical processing) system, this is worth adding to our roadmap and doing in the future.

jiangzhx · 2021-11-05T05:57:33Z

> If I recall correctly, datafusion doesn't do fine optimization about group by and aggregate functions at present. It's worth adding it to our RoadMap and doing it in the future.

I tried digging into the code of Trino and Doris; both have a streaming aggregate node, but I can't understand how they work.
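For readers with the same question, here is a minimal sketch of one common meaning of "streaming aggregation": when rows arrive already sorted (or clustered) by the group key, an operator can keep a single running group and emit it as soon as the key changes, with no hash table. This is a generic illustration under that assumption, not the Trino or Doris implementation; `streaming_sum` is a hypothetical name.

```rust
/// Sum values per group, assuming rows arrive already sorted by key.
/// Only one running group is held in memory at a time.
fn streaming_sum(rows: &[(&str, i64)]) -> Vec<(String, i64)> {
    let mut out = Vec::new();
    let mut current: Option<(&str, i64)> = None;
    for &(key, value) in rows {
        current = match current {
            // Same key as the previous row: extend the running group.
            Some((k, sum)) if k == key => Some((k, sum + value)),
            // Key changed: the previous group is complete, so emit it.
            Some((k, sum)) => {
                out.push((k.to_string(), sum));
                Some((key, value))
            }
            // First row of the input.
            None => Some((key, value)),
        };
    }
    // Emit the last running group, if any.
    if let Some((k, sum)) = current {
        out.push((k.to_string(), sum));
    }
    out
}

fn main() {
    let rows: [(&str, i64); 5] = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)];
    for (key, sum) in streaming_sum(&rows) {
        println!("{}: {}", key, sum);
    }
}
```

The trade-off is that the input must already be ordered on the group key; on unsorted input the same key would be emitted more than once.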

The aggregate functions themselves work fine: with or without sum(LO_EXTENDEDPRICE), performance shows no big difference, and the high-cardinality query is still 5~10 times slower.

low cardinality:

select 1  FROM lineorder_flat group by LO_ORDERPRIORITY;
5 rows in set. Query took 0.236 seconds.

high cardinality:

select 1  FROM lineorder_flat group by S_ADDRESS;
20000 rows in set. Query took 1.429 seconds.

xudong963 · 2021-11-05T06:05:46Z

I'll take a look at Doris on the weekend. Until then, we can wait for someone else to answer your questions. Thanks for your comparison @jiangzhx

Dandandan · 2021-11-05T17:02:39Z

Some relevant tickets:

#418
#956

alamb · 2021-11-05T17:35:12Z

Accidentally closed

alamb · 2021-11-05T17:36:16Z

I think there is a lot of overhead in creating and managing group keys via ScalarValues; that would be a good thing to look into if we want to optimize performance here.
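As a rough illustration of the kind of per-key overhead meant here, the sketch below counts rows per group in two ways: one builds an owned, heap-allocated key for every input row, the other looks up with the borrowed value and allocates only when a new group appears. It is not DataFusion's code; `String` stands in for an owned key value, and both function names are hypothetical.

```rust
use std::collections::HashMap;

/// Count rows per group, building an owned key for every input row.
fn count_owned_keys(column: &[&str]) -> HashMap<String, u64> {
    let mut groups: HashMap<String, u64> = HashMap::new();
    for &value in column {
        // One heap allocation per row, even when the group already exists.
        *groups.entry(value.to_string()).or_insert(0) += 1;
    }
    groups
}

/// Same result, but allocate a key only when a new group is first seen.
fn count_borrowed_lookup(column: &[&str]) -> HashMap<String, u64> {
    let mut groups: HashMap<String, u64> = HashMap::new();
    for &value in column {
        if let Some(count) = groups.get_mut(value) {
            // Existing group: no allocation needed.
            *count += 1;
        } else {
            // New group: allocate the owned key exactly once.
            groups.insert(value.to_string(), 1);
        }
    }
    groups
}

fn main() {
    let column = ["addr1", "addr2", "addr1", "addr3", "addr2", "addr1"];
    let a = count_owned_keys(&column);
    let b = count_borrowed_lookup(&column);
    assert_eq!(a, b);
    println!("{} groups, addr1 seen {} times", b.len(), b["addr1"]);
}
```

Both versions return the same map; the difference is one allocation per row versus one per distinct group. Any real gain depends on the key types and the hash table in use, so this should be read as a direction to investigate rather than a measured fix.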

alamb · 2023-06-25T11:52:46Z

Related PR: #6657

alamb · 2023-06-27T18:24:19Z

see #4973 (comment) for proposal

alamb · 2023-07-13T14:26:30Z

This should be closed by #6904

jiangzhx added the bug label Nov 5, 2021

xudong963 mentioned this issue Nov 5, 2021

Update roadmap #1247

Merged

alamb closed this as completed in #1247 Nov 5, 2021

alamb reopened this Nov 5, 2021

ic4y mentioned this issue Dec 6, 2021

Make aggregate accumulators storage column-based #956

Closed

ic4y mentioned this issue Dec 16, 2021

The Eq method in HashAggregate takes up a lot of time, how to optimize it #1456

Closed

alamb mentioned this issue Mar 3, 2023

Improve the performance of Aggregator, grouping, aggregation #4973

Closed

alamb closed this as completed Jul 13, 2023
