Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group by high cardinality column in datafusion 10 times slower than low cardinality column #1246

Closed
Tracked by #4973
jiangzhx opened this issue Nov 5, 2021 · 9 comments · Fixed by #1247
Closed
Tracked by #4973
Labels
bug Something isn't working

Comments

@jiangzhx
Copy link
Contributor

jiangzhx commented Nov 5, 2021

Describe the bug
group by high cardinality column in datafusion 10 times slower than low cardinality column.
also i tested on other olap engine, there are only 2 times slow or less;

  • trino olap engine write by java

    low cardinality  usage ms: 1400ms±
    high cardinality  usage ms: 2700ms±
    
  • doris olap engine write by c++

    low cardinality  usage ms: 350ms±
    high cardinality  usage ms: 500ms±
    

To Reproduce
Steps to reproduce the behavior:
parquet table with 60,000,000 rows; data generate by ssb-dbgen

group by LO_ORDERPRIORITY

SELECT sum(LO_EXTENDEDPRICE) AS revenue  FROM lineorder_flat group by LO_ORDERPRIORITY;
5 rows in set. Query took 0.341 seconds.

group by S_ADDRESS

SELECT sum(LO_EXTENDEDPRICE) AS revenue  FROM lineorder_flat group by S_ADDRESS;
20000 rows in set. Query took 2.582 seconds.

Expected behavior
should some with other engine;

Additional context
Add any other context about the problem here.

@jiangzhx jiangzhx added the bug Something isn't working label Nov 5, 2021
@xudong963
Copy link
Member

xudong963 commented Nov 5, 2021

If I recall correctly, datafusion doesn't do fine optimization about group by and aggregate functions at present. Datafusion as an AP system, It's worth adding to our RoadMap and doing it in the future.

@jiangzhx
Copy link
Contributor Author

jiangzhx commented Nov 5, 2021

If I recall correctly, datafusion doesn't do fine optimization about group by and aggregate functions at present. It's worth adding it to our RoadMap and doing it in the future.

i try to dig code in trino and doris; there are all have streaming aggregate node; but i can't understand how they working.

aggregate functions was working fine; with sum(LO_EXTENDEDPRICE) or without; the performence has no big difference,there are also have 5~10 times slow;

low cardinality:

select 1  FROM lineorder_flat group by LO_ORDERPRIORITY;
5 rows in set. Query took 0.236 seconds.

high cardinality:

select 1  FROM lineorder_flat group by S_ADDRESS;
20000 rows in set. Query took 1.429 seconds.

@xudong963
Copy link
Member

I'll take a look at Doris on the weekend. Until then, we can wait for someone else to answer your questions. Thanks for your comparison @jiangzhx

@Dandandan
Copy link
Contributor

Some relevant tickets:

#418
#956

@alamb alamb reopened this Nov 5, 2021
@alamb
Copy link
Contributor

alamb commented Nov 5, 2021

Accidentally closed

@alamb
Copy link
Contributor

alamb commented Nov 5, 2021

I think there is a lot of overhead creating and managing group keys via ScalarValues that is a good thing to look into if we want to optimize the performance here

@alamb
Copy link
Contributor

alamb commented Jun 25, 2023

Related PR: #6657

@alamb
Copy link
Contributor

alamb commented Jun 27, 2023

see #4973 (comment) for proposal

@alamb
Copy link
Contributor

alamb commented Jul 13, 2023

This should be closed by #6904

@alamb alamb closed this as completed Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants