Investigate performance drop for DISTINCT queries #5313

Closed
comphead opened this issue Feb 16, 2023 · 10 comments
Labels
enhancement New feature or request

Comments

@comphead
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
ClickBench reported a performance drop for COUNT(DISTINCT) computation: #5276 (comment)

Describe the solution you'd like
We need to investigate the root cause of the performance drop in DISTINCT queries and find out how to make them faster.

Describe alternatives you've considered
Not doing this

Additional context
Metrics can be found in #5276 (comment)
Steps to reproduce the case: #5276 (comment)

@comphead
Contributor Author

[Image: df-distinct-graph]
Adding a flamegraph profile for the query:
SELECT COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" LIMIT 10;

@Dandandan
Contributor

So here we have the answer - it's very inefficient because of inefficiently tracking memory (which was added some versions ago). FYI @alamb

@alamb
Contributor

alamb commented Feb 17, 2023

> So here we have the answer - it's very inefficient because of inefficiently tracking memory (which was added some versions ago). FYI @alamb

Nice sleuthing -- thank you @comphead and @Dandandan -- I can file a ticket for the regression related to tracking memory if that would help

@comphead comphead reopened this Feb 17, 2023
@comphead
Contributor Author

So yeah, the size function is needed to calculate the additional number of bytes allocated during the aggregate_batch process.
I commented it out, and the same query runs in 4 sec instead of 100 sec.

@Dandandan @alamb I'll file a ticket to optimize this part
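
For readers who have not seen this pattern, here is a minimal sketch of what the comment above describes. It is not DataFusion's actual code: `DistinctState`, its `size` and `update` methods, and this `aggregate_batch` helper are hypothetical names chosen for illustration. The sketch assumes the state's size is estimated by walking every stored value, and that the bytes added by a batch are obtained by calling `size` before and after the update.

```rust
use std::collections::HashSet;
use std::mem::size_of;

/// Hypothetical accumulator state for COUNT(DISTINCT) on one group.
struct DistinctState {
    values: HashSet<u64>,
}

impl DistinctState {
    fn new() -> Self {
        Self { values: HashSet::new() }
    }

    /// Estimated bytes held by the state. It visits every stored value,
    /// so a single call costs O(number of distinct values seen so far).
    fn size(&self) -> usize {
        size_of::<Self>() + self.values.iter().map(|_| size_of::<u64>()).sum::<usize>()
    }

    /// Adds the values of one batch to the distinct set.
    fn update(&mut self, batch: &[u64]) {
        self.values.extend(batch.iter().copied());
    }
}

/// Returns the estimated number of additional bytes held after this batch,
/// measured by calling `size` before and after the update.
fn aggregate_batch(state: &mut DistinctState, batch: &[u64]) -> usize {
    let before = state.size();
    state.update(batch);
    let after = state.size();
    after - before
}
```

Under these assumptions every batch pays for two full walks of the state, no matter how few new values the batch contributes.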

@comphead comphead changed the title Improve performance for DISTINCT queries Investigate performance drop for DISTINCT queries Feb 17, 2023
@Dandandan
Contributor

> So yeah, the size function is needed to calculate the additional number of bytes allocated during the aggregate_batch process. I commented it out, and the same query runs in 4 sec instead of 100 sec.
>
> @Dandandan @alamb I'll file a ticket to optimize this part

Yes, some quadratic complexity because of the growing state.
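
To make the quadratic growth concrete, the following sketch counts how many stored values get visited under two bookkeeping strategies. It is a simplified cost model, not a measurement of DataFusion: both function names are hypothetical, and it assumes every batch adds the same number of new distinct values.

```rust
/// Element visits when the full state size is recomputed after every batch.
/// Each recomputation walks everything accumulated so far.
fn visits_with_full_recompute(batches: usize, new_values_per_batch: usize) -> usize {
    let mut state_len = 0;
    let mut visits = 0;
    for _ in 0..batches {
        state_len += new_values_per_batch;
        visits += state_len; // walk the whole state to measure it
    }
    visits
}

/// Element visits when only the newly added values are accounted for.
fn visits_with_incremental_tracking(batches: usize, new_values_per_batch: usize) -> usize {
    let mut visits = 0;
    for _ in 0..batches {
        visits += new_values_per_batch; // account for the new values only
    }
    visits
}
```

In this model, doubling the number of batches roughly quadruples the work for the full recompute and only doubles it for incremental tracking. Whether the eventual fix takes the incremental route is not stated in this thread.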

@Dandandan
Contributor

Dandandan commented Feb 17, 2023

(you could identify it from your flamegraph, it's just the longest bar(s) at the top consuming all the time)

@comphead
Contributor Author

Right, the function was easy to identify, but it was introduced recently, so it took some time to figure out its purpose.

@comphead
Contributor Author

Filed #5325

@Dandandan @alamb should we close this ticket?

@alamb
Contributor

alamb commented Feb 18, 2023

I agree the analysis is done -- thank you @comphead

@alamb alamb closed this as completed Feb 18, 2023