Speed up person-related trends queries for large users #7548

Closed
1 of 5 tasks
macobo opened this issue Dec 7, 2021 · 6 comments
Labels
performance - Has to do with performance. For PRs, runs the clickhouse query performance suite

Comments

@macobo
Contributor

macobo commented Dec 7, 2021

In what situation are you experiencing subpar performance?

Some of our largest users have run into both memory and time limit issues in insights

Example queries affected

  • Daily active users?

Potential solution(s)

Steps to solve

  • Reproduce the problem, mark down product areas this affects
  • Create benchmark(s), verify similar behavior on benchmarking servers
  • Attempt to apply the proposed fix

Environment

  • PostHog Cloud
  • self-hosted PostHog, version/commit: please provide

Additional context

Relevant slack thread: https://posthog.slack.com/archives/C01MM7VT7MG/p1638272862302100

Additional potential optimization: #7537 (comment)

Thank you for your performance issue report – we want PostHog to go supersonic!

@macobo added the clickhouse and performance labels on Dec 7, 2021
@marcushyett-ph
Contributor

@macobo I have another potential "approach" (I won't call it a solution, yet) #7537

https://metabase.posthog.net/question/204-optimized-dau-query-for-2635

@EDsCODE
Member

EDsCODE commented Dec 7, 2021

Some thoughts:

Balancing which table is larger isn't as concrete as we originally assumed (and may still be assuming). There are cases where the person table will be larger (100 persons and only 20 events have been fired and are being considered in the join). There are cases where persons are smaller (10 persons and each person performed the event in question 10 times). At scale, this could cause the memory issues mentioned above.

Something I was trying was guaranteeing that the persons joined in would always be smaller than the events table.

https://metabase.posthog.net/question/205-dau-query-that-ensures-limit-persons-table

This query filters events and then only retrieves the person distinct ids that are associated with the events in consideration. It's not particularly fast (weekly aggregation for 5 weeks returns in 15 seconds), but could be useful to think about. Also needs to be checked for correctness.
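A rough sketch of the shape of that query (not the exact Metabase question; the event name, team_id, date range and the is_deleted handling here are illustrative/simplified):

-- Sketch: filter events first, then only resolve person_ids for the
-- distinct_ids that actually appear in those events.
SELECT
    toStartOfWeek(e.timestamp) AS week,
    count(DISTINCT pdi.person_id) AS wau
FROM events e
INNER JOIN
(
    SELECT distinct_id, argMax(person_id, _timestamp) AS person_id
    FROM person_distinct_id
    WHERE team_id = 2635
      AND distinct_id IN (
          SELECT DISTINCT distinct_id
          FROM events
          WHERE team_id = 2635
            AND event = '$pageview'
            AND timestamp >= now() - INTERVAL 5 WEEK
      )
    GROUP BY distinct_id
    HAVING argMax(is_deleted, _timestamp) = 0
) pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = 2635
  AND e.event = '$pageview'
  AND e.timestamp >= now() - INTERVAL 5 WEEK
GROUP BY week
ORDER BY week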

@macobo
Contributor Author

macobo commented Dec 7, 2021

Gonna dump some raw debugging notes here, will be editing this post as I continue

  • The team in question has between 10M and 20M users versus ~20k events in the time range.
    • Regarding query optimization, it feels like we're in the opposite situation from usual. Usually with autocapture events >> persons, but for this team that is not the case. Implication: the query optimization might not be generally applicable.

  • Without increasing max_execution_time it failed with: Estimated query execution time (297.7685079144023 seconds) is too long. Maximum: 180.
    • Note that this failure happens pretty quickly and isn't handled by the frontend in any way. Future improvement!
  • After increasing max_execution_time for the query, I got these results (note execution time):
    event_time:        2021-12-07 08:31:19
    query_duration_ms: 24447
    read_rows:         318197218
    read_size:         20.86 GiB
    result_rows:       1
    result_size:       512.00 B
    memory_usage:      10.50 GiB
  • Running it a few more times gave 22500ms and 26272ms. Variance is due to disk cache(s) and how much they could be used. (See the sketch after this list for one way to pull such per-query stats from the query log.)

  • Flamegraphs can be found in https://github.com/PostHog/scratchpad/tree/main/karl/2021-12-07-dau-debugging-original.

    • Realtime flamegraph has the typical shape when trouble is happening - each thread spends most of its time waiting on other threads
    • CPU graph is hard to gauge, but it seems the majority of time is spent joining, copying memory and doing hash-table-related work.
      • ~20% of the time is spent reading, but seemingly in a paused way (continueReading)
      • Aside: datetime handling takes ~3% of the work - could potentially be optimized

  • Experiment 1: Get rid of the unneeded event subquery, group by within the events query (query-1 in the scratchpad). (Note that this likely has some bug in it.)
    event_time:        2021-12-07 08:27:37
    query_duration_ms: 21455
    read_rows:         318191965
    read_size:         20.88 GiB
    result_rows:       1
    result_size:       512.00 B
    memory_usage:      10.62 GiB
  • Subsequent runs: 18800ms, 18767ms
  • I was expecting this to result in smoother pipelines thanks to being able to do the aggregation in one go.
  • This didn't help as much as I was expecting it to - it seems the join dominates over the group by?

  • Experiment 2: Removing the parent queries and unions.
  • Hypothesis: This shouldn't affect results
  • Measurements: 18594ms - seems to be the case?

  • Experiment 3: Moving distinct_id query out of the event query
  • Didn't affect results much

  • Experiment 5: Do pdis join before events
  • Measurements: 15010ms, 14894ms, 22273ms.
  • This does seem to improve things!

  • Experiment 7: Use sharded_events instead of events (5 as base)
  • Idea: Maybe networking/sharding is causing issues and we're loading half the dataset into memory
  • Measurements: 14879ms, 26014ms, 18470ms
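For reference, the per-query stats quoted above (query_duration_ms, read_rows, memory_usage, ...) match ClickHouse query log columns; a minimal sketch for pulling them, assuming system.query_log and a purely illustrative LIKE filter:

-- Sketch: stats of the most recent finished query matching a pattern.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS read_size,
    result_rows,
    formatReadableSize(result_bytes) AS result_size,
    formatReadableSize(memory_usage) AS peak_memory
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%team_id = 2635%'
ORDER BY event_time DESC
LIMIT 1
FORMAT Vertical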

Talked with marcus and ran a few experiments together:

  • Got an intuition for moving pdis to the left-hand side: you stream it instead of having to do lookups in a huge hash table in memory.
  • We ran the query w/o pdis. It took 1.5s.
  • We ran only the pdis subquery. It took 13s.
    • This explains the general slowness - that single subquery dominates everything w/ >10M users.
-- Current pdis subquery, takes 13s
                SELECT
                    count(1) as data
                FROM
                    (
                        SELECT
                            distinct_id,
                            argMax(person_id, _timestamp) as person_id
                        FROM
                            (
                                SELECT
                                    distinct_id,
                                    person_id,
                                    max(_timestamp) as _timestamp
                                FROM
                                    person_distinct_id
                                WHERE
                                    team_id = 2635
                                GROUP BY
                                    person_id,
                                    distinct_id
                                HAVING
                                    max(is_deleted) = 0
                            )
                        GROUP BY
                            distinct_id
                    ) pdi


-- Attempt at optimizing. Isn't quite correct.
SELECT count(1) FROM (
                        SELECT
                            distinct_id,
                            argMax(person_id, _timestamp) as person_id
                        FROM
                            person_distinct_id
                        WHERE
                            team_id = 2635
                        GROUP BY distinct_id
                        HAVING argMax(is_deleted, _timestamp) = 0
)

Action steps from this:

  • Talk with @yakkomajuri or @tiina303 regarding kafka message ordering and if we can make person_distinct_id query happen in a single pass.
  • Investigate using the join table engine for PDIs

Plan: new distinct_ids table

Talked with Yakko. We ended up at a different schema for person_distinct_ids table, basically:

team_id
distinct_id,
person_id,
version,
is_deleted

SORT BY team_id, distinct_id
Engine ReplacingMergeTree

with version starting at 0 and increasing by 1 every time a distinct_id should point at a new person. We wouldn't emit "deletions" per (distinct_id, person_id) pair anymore, but only when the distinct_id itself gets deleted.
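A minimal DDL sketch of that schema (the column types, the ReplacingMergeTree version argument and any partitioning/replication settings here are assumptions, not the actual migration):

CREATE TABLE person_distinct_id2
(
    team_id     Int64,
    distinct_id String,
    person_id   UUID,
    version     Int64,
    is_deleted  Int8
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (team_id, distinct_id)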

This should allow for more optimized queries. The optimized query would then look something like this:

SELECT distinct_id, argMax(person_id, version)
FROM person_distinct_id2
WHERE team_id = X
GROUP BY distinct_id
HAVING argMax(is_deleted, version) = 0

This requires:

  1. Benchmarking this
  2. Creating a new table & updating plugin-server code to emit to a new kafka topic
    • Create topic on heroku
    • Figure out if we need to notify self-hosted users
  3. Getting it live, populating data on cloud
  4. Updating queries for select clients only to test out the fix, with everyone else retaining the old queries.
  5. Issuing a long-running migration
  6. Fixing any potential data-consistency issues around out-of-order message processing

I'll be tackling 1-4 right now and will hand off 5 and 6 to team platform when it's in. This way they don't need to drop their current tasks and we don't block the release on 5 and 6.


Benchmarking new person_distinct_ids table:

I ran 5 different daily active users query variants, taking measurements of each 4 times to account for variance in cluster load:

  1. Original query ran in 24948ms, 26371ms, 21322ms, 19795ms
  2. Same query, swapping out person_distinct_ids reads took 9343ms, 12999ms, 15556ms, 13312ms
  3. Same query, swapping out person_distinct_ids and avoiding events subquery: 9298ms, 15651ms, 7683ms, 14539ms
  4. Reversing the order of joins: pdis first, join in events: 2147ms, 2052ms, 2038ms, 1949ms. metabase
  5. 2 + 4 but no 3: 1822ms, 1781ms, 3206ms, 3127ms.

Taking the 2nd fastest of each:

  • (2) Swapping out the table on its own speeds up the query by 39%
  • (2+3) Doing the daily active users specific optimization on top of that speeds up the query by 57%
  • (2+3+4) Doing the query in the reverse join order speeds it up by 90% on top of that (!)

Note that (4) uses ~5GB of memory compared to the rest using ~8GB. (5) uses slightly more memory but might be easier to implement. (A sketch of the reversed-join shape is below.)
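A minimal sketch of the reversed join from (4), assuming the new person_distinct_id2 table; the event name, team_id and date range are placeholders:

-- Sketch: resolve PDIs on the left-hand side (streamed) and join the
-- pre-filtered, much smaller events subquery on the right (hash table).
SELECT
    toStartOfDay(e.timestamp) AS day,
    count(DISTINCT pdi.person_id) AS dau
FROM
(
    SELECT distinct_id, argMax(person_id, version) AS person_id
    FROM person_distinct_id2
    WHERE team_id = 2635
    GROUP BY distinct_id
    HAVING argMax(is_deleted, version) = 0
) pdi
INNER JOIN
(
    SELECT distinct_id, timestamp
    FROM events
    WHERE team_id = 2635
      AND event = '$pageview'
      AND timestamp >= now() - INTERVAL 7 DAY
) e ON e.distinct_id = pdi.distinct_id
GROUP BY day
ORDER BY day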

Given this I'd:

  1. Get the new table live
  2. Do more benchmarking on companies with more events per user, try out swapping tables in DAU and other queries

Benchmarks on the benchmarking server (before -> after):

track_trends_dau 4906.0±52 -> 4463.5±77
track_trends_dau_person_property_filter 6492.5±75 -> 6200.0±65
track_trends_dau_person_property_filter_materialized 5791.0±66 -> 5322.5±72

These wins are significant, but less so than on cloud for the large team. The issue is the test data - we have ~700k users instead of >10M in there. To be improved!

@marcushyett-ph
Contributor

marcushyett-ph commented Dec 7, 2021

I attempted to optimize the PDI query by converting the outer argMax query into a window function.

The query is shorter but just as slow - there's something I can't quite work out about ordering by MAX(_timestamp)

SELECT
    count(1) as data
FROM (
    SELECT
        distinct_id,
        FIRST_VALUE(person_id) OVER (PARTITION BY distinct_id, person_id ORDER BY MAX(_timestamp) DESC) as person_id_2
    FROM
        person_distinct_id
    WHERE
        team_id = 2635
        AND _timestamp < today()
    GROUP BY
        distinct_id,
        person_id
    HAVING
        max(is_deleted) = 0
)

@macobo
Contributor Author

macobo commented Dec 8, 2021

Status update:

Technical:

  • We should no longer be running into unexpected timeouts in trends: Change timeout to check execution time to 60 seconds #7562
  • A new, more efficient person_distinct_ids table is going live today. Expecting this to speed up DAU queries by another 40%. TODOs:
    • Get PRs merged introducing the feature
    • Create topic on heroku
    • Verify data is flowing in
    • Run populating query on cloud
    • Verify speed increase for client
    • Remove kafka tables from backups
    • Write up ticket to get this table live for everyone
  • Next steps:
    • Verify whether any other queries will fail after the previous step and whether their failing is urgent.
    • Experiment with swapping the order of queries (persons -> pdi -> events) on a conditional basis in trends. cc @EDsCODE
      • Will likely tackle next sprint if DAU/MAU improvement is good enough

For the client this means:

We had an issue with how we were storing people in our codebase, causing it not to scale well beyond 10M users.

  1. We now have a fix live which should make their queries not time out anymore (but they can still be slow)
  2. We're going live in the next few hours with a new table for storing person distinct ids we can query much faster.

Over the next 4 weeks, over the holidays, getting PostHog performance even better is going to be one of the main focuses of my team.

@macobo
Contributor Author

macobo commented Dec 13, 2021

Closing this, following up with #7663
