Speed up person-related trends queries for large users #7548

Closed
1 of 5 tasks
macobo opened this issue Dec 7, 2021 · 6 comments
Labels
performance - Has to do with performance. For PRs, runs the clickhouse query performance suite

Comments

@macobo
Contributor

macobo commented Dec 7, 2021

In what situation are you experiencing subpar performance?

Some of our largest users have run into both memory and time limit issues in insights

Example queries affected

  • Daily active users?

Potential solution(s)

Steps to solve

  • Reproduce the problem, mark down product areas this affects
  • Create benchmark(s), verify similar behavior on benchmarking servers
  • Attempt to apply the proposed fix

Environment

  • PostHog Cloud
  • self-hosted PostHog, version/commit: please provide

Additional context

Relevant slack thread: https://posthog.slack.com/archives/C01MM7VT7MG/p1638272862302100

Additional potential optimization: #7537 (comment)

Thank you for your performance issue report – we want PostHog to go supersonic!

@macobo added the clickhouse and performance labels on Dec 7, 2021
@marcushyett-ph
Contributor

@macobo I have another potential "approach" (I won't call it a solution, yet) #7537

https://metabase.posthog.net/question/204-optimized-dau-query-for-2635

@EDsCODE
Member

EDsCODE commented Dec 7, 2021

Some thoughts:

Balancing which table is larger isn't as concrete as we originally assumed (and may still be assuming). There are cases where the person table will be larger (100 persons and only 20 events have been fired and are being considered in the join). There are cases where persons are smaller (10 persons and each person performed the event in question 10 times). At scale, this could cause the memory issues mentioned above.

Something I was trying was guaranteeing that the persons joined in would always be smaller than the events table.

https://metabase.posthog.net/question/205-dau-query-that-ensures-limit-persons-table

This query filters events and then only retrieves the person distinct ids that are associated with the events in consideration. It's not particularly fast (weekly aggregation for 5 weeks returns in 15 seconds), but could be useful to think about. Also needs to be checked for correctness.
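A rough sketch of the shape of that query (not the exact Metabase question; the event name, team_id, date range and the is_deleted handling here are illustrative/simplified):

-- Sketch: filter events first, then only resolve person_ids for the
-- distinct_ids that actually appear in those events.
SELECT
    toStartOfWeek(e.timestamp) AS week,
    count(DISTINCT pdi.person_id) AS wau
FROM events e
INNER JOIN
(
    SELECT distinct_id, argMax(person_id, _timestamp) AS person_id
    FROM person_distinct_id
    WHERE team_id = 2635
      AND distinct_id IN (
          SELECT DISTINCT distinct_id
          FROM events
          WHERE team_id = 2635
            AND event = '$pageview'
            AND timestamp >= now() - INTERVAL 5 WEEK
      )
    GROUP BY distinct_id
    HAVING argMax(is_deleted, _timestamp) = 0
) pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = 2635
  AND e.event = '$pageview'
  AND e.timestamp >= now() - INTERVAL 5 WEEK
GROUP BY week
ORDER BY week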

@macobo
Contributor Author

macobo commented Dec 7, 2021

Gonna dump some raw debugging notes here, will be editing this post as I continue

  • The team in question has between 10M and 20M users versus ~20k events in the time range.
    • Regarding query optimization, it feels like we're in the opposite situation from usual. Usually with autocapture events >> persons, but for this team that is not the case. Implication: the query optimization might not be generally applicable.

  • Without increasing max_execution_time it failed with: Estimated query execution time (297.7685079144023 seconds) is too long. Maximum: 180.
    • Note that this failure happens pretty quickly and isn't handled by the frontend in any way. Future improvement!
  • After increasing max_execution_time for the query, I got these results (note execution time):
    event_time:        2021-12-07 08:31:19
    query_duration_ms: 24447
    read_rows:         318197218
    read_size:         20.86 GiB
    result_rows:       1
    result_size:       512.00 B
    memory_usage:      10.50 GiB
  • Running it a few more times gave 22500ms and 26272ms. Variance is due to disk cache(s) and how much they could be used. (See the sketch after this list for one way to pull such per-query stats from the query log.)

  • Flamegraphs can be found in https://github.com/PostHog/scratchpad/tree/main/karl/2021-12-07-dau-debugging-original.

    • Realtime flamegraph has the typical shape when trouble is happening - each thread spends most of its time waiting on other threads
    • CPU graph is hard to gauge, but it seems the majority of time is spent joining, copying memory and doing hash-table-related work.
      • ~20% of the time is spent reading, but seemingly in a paused way (continueReading)
      • Aside: datetime handling takes ~3% of the work - could potentially be optimized

  • Experiment 1: Get rid of the unneeded event subquery, group by within the events query (query-1 in the scratchpad). (Note that this likely has some bug in it.)
    event_time:        2021-12-07 08:27:37
    query_duration_ms: 21455
    read_rows:         318191965
    read_size:         20.88 GiB
    result_rows:       1
    result_size:       512.00 B
    memory_usage:      10.62 GiB
  • Subsequent runs: 18800ms, 18767ms
  • I was expecting this to result in smoother pipelines thanks to being able to do the aggregation in one go.
  • This didn't help as much as I was expecting it to - it seems the join dominates over the group by?

  • Experiment 2: Removing the parent queries and unions.
  • Hypothesis: This shouldn't affect results
  • Measurements: 18594ms - seems to be the case?

  • Experiment 3: Moving distinct_id query out of the event query
  • Didn't affect results much

  • Experiment 5: Do pdis join before events
  • Measurements: 15010ms, 14894ms, 22273ms.
  • This does seem to improve things!

  • Experiment 7: Use sharded_events instead of events (5 as base)
  • Idea: Maybe networking/sharding is causing issues and we're loading half the dataset into memory
  • Measurements: 14879ms, 26014ms, 18470ms
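For reference, the per-query stats quoted above (query_duration_ms, read_rows, memory_usage, ...) match ClickHouse query log columns; a minimal sketch for pulling them, assuming system.query_log and a purely illustrative LIKE filter:

-- Sketch: stats of the most recent finished query matching a pattern.
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes)   AS read_size,
    result_rows,
    formatReadableSize(result_bytes) AS result_size,
    formatReadableSize(memory_usage) AS peak_memory
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%team_id = 2635%'
ORDER BY event_time DESC
LIMIT 1
FORMAT Vertical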

Talked with marcus and ran a few experiments together:

  • Got an intuition for moving pdis to the left-hand side: you stream it instead of having to do lookups in a huge hash table in memory.
  • We ran the query w/o pdis. It took 1.5s.
  • We ran only the pdis subquery. It took 13s.
    • This explains the general slowness - that single subquery dominates everything w/ >10M users.
-- Current pdis subquery, takes 13s
                SELECT
                    count(1) as data
                FROM
                    (
                        SELECT
                            distinct_id,
                            argMax(person_id, _timestamp) as person_id
                        FROM
                            (
                                SELECT
                                    distinct_id,
                                    person_id,
                                    max(_timestamp) as _timestamp
                                FROM
                                    person_distinct_id
                                WHERE
                                    team_id = 2635
                                GROUP BY
                                    person_id,
                                    distinct_id
                                HAVING
                                    max(is_deleted) = 0
                            )
                        GROUP BY
                            distinct_id
                    ) pdi


-- Attempt at optimizing. Isn't quite correct.
SELECT count(1) FROM (
                        SELECT
                            distinct_id,
                            argMax(person_id, _timestamp) as person_id
                        FROM
                            person_distinct_id
                        WHERE
                            team_id = 2635
                        GROUP BY distinct_id
                        HAVING argMax(is_deleted, _timestamp) = 0
)

Action steps from this:

  • Talk with @yakkomajuri or @tiina303 regarding kafka message ordering and if we can make person_distinct_id query happen in a single pass.
  • Investigate using the join table engine for PDIs

Plan: new distinct_ids table

Talked with Yakko. We ended up at a different schema for person_distinct_ids table, basically:

team_id
distinct_id,
person_id,
version,
is_deleted

SORT BY team_id, distinct_id
Engine ReplacingMergeTree

with version starting at 0 and increasing by 1 every time a distinct_id should point at a new person. We wouldn't emit "deletions" per (distinct_id, person_id) pair anymore, but only when the distinct_id itself gets deleted.
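A minimal DDL sketch of that schema (the column types, the ReplacingMergeTree version argument and any partitioning/replication settings here are assumptions, not the actual migration):

CREATE TABLE person_distinct_id2
(
    team_id     Int64,
    distinct_id String,
    person_id   UUID,
    version     Int64,
    is_deleted  Int8
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (team_id, distinct_id)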

This should allow for more optimized queries. The optimized query would then look something like this:

SELECT distinct_id, argMax(person_id, version)
FROM person_distinct_id2
WHERE team_id = X
GROUP BY distinct_id
HAVING argMax(is_deleted, version) = 0

This requires:

  1. Benchmarking this
  2. Creating a new table & updating plugin-server code to emit to a new kafka topic
    • Create topic on heroku
    • Figure out if we need to notify self-hosted users
  3. Getting it live, populating data on cloud
  4. Updating queries for select clients only to test out the fix, with everyone else retaining the old queries.
  5. Issuing a long-running migration
  6. Fixing any potential data-consistency issues around out-of-order message processing

I'll be tackling 1-4 right now and will hand off 5 and 6 to team platform when it's in. This way they don't need to drop their current tasks and we don't block the release on 5 and 6.


Benchmarking new person_distinct_ids table:

I ran 5 different daily active users query variants, taking measurements of each 4 times to account for variance in cluster load:

  1. Original query ran in 24948ms, 26371ms, 21322ms, 19795ms
  2. Same query, swapping out person_distinct_ids reads took 9343ms, 12999ms, 15556ms, 13312ms
  3. Same query, swapping out person_distinct_ids and avoiding events subquery: 9298ms, 15651ms, 7683ms, 14539ms
  4. Reversing the order of joins: pdis first, join in events: 2147ms, 2052ms, 2038ms, 1949ms. metabase
  5. 2 + 4 but no 3: 1822ms, 1781ms, 3206ms, 3127ms.

Taking the 2nd fastest of each:

  • (2) Swapping out the table on its own speeds up the query by 39%
  • (2+3) Doing the daily active users specific optimization on top of that speeds up the query by 57%
  • (2+3+4) Doing the query in the reverse join order speeds it up by 90% on top of that (!)

Note that (4) uses ~5GB of memory compared to the rest using ~8GB. (5) uses slightly more memory but might be easier to implement. (A sketch of the reversed-join shape is below.)
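A minimal sketch of the reversed join from (4), assuming the new person_distinct_id2 table; the event name, team_id and date range are placeholders:

-- Sketch: resolve PDIs on the left-hand side (streamed) and join the
-- pre-filtered, much smaller events subquery on the right (hash table).
SELECT
    toStartOfDay(e.timestamp) AS day,
    count(DISTINCT pdi.person_id) AS dau
FROM
(
    SELECT distinct_id, argMax(person_id, version) AS person_id
    FROM person_distinct_id2
    WHERE team_id = 2635
    GROUP BY distinct_id
    HAVING argMax(is_deleted, version) = 0
) pdi
INNER JOIN
(
    SELECT distinct_id, timestamp
    FROM events
    WHERE team_id = 2635
      AND event = '$pageview'
      AND timestamp >= now() - INTERVAL 7 DAY
) e ON e.distinct_id = pdi.distinct_id
GROUP BY day
ORDER BY day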

Given this I'd:

  1. Get the new table live
  2. Do more benchmarking on companies with more events per user, try out swapping tables in DAU and other queries

Benchmarks on the benchmarking server (before -> after):

track_trends_dau 4906.0±52 -> 4463.5±77
track_trends_dau_person_property_filter 6492.5±75 -> 6200.0±65
track_trends_dau_person_property_filter_materialized 5791.0±66 -> 5322.5±72

These wins are significant, but less so than on cloud for the large team. The issue is the test data - we have ~700k users instead of >10M in there. To be improved!

@marcushyett-ph
Contributor

marcushyett-ph commented Dec 7, 2021

I attempted to optimize the PDI query by converting the outer argMax query into a window function.

The query is shorter but just as slow - there's something I can't quite work out about ordering by MAX(_timestamp)

SELECT
    count(1) as data
FROM (
    SELECT
        distinct_id,
        FIRST_VALUE(person_id) OVER (PARTITION BY distinct_id, person_id ORDER BY MAX(_timestamp) DESC) as person_id_2
    FROM
        person_distinct_id
    WHERE
        team_id = 2635
        AND _timestamp < today()
    GROUP BY
        distinct_id,
        person_id
    HAVING
        max(is_deleted) = 0
)

@macobo
Contributor Author

macobo commented Dec 8, 2021

Status update:

Technical:

  • We should no longer be running into unexpected timeouts in trends: Change timeout to check execution time to 60 seconds #7562
  • A new, more efficient person_distinct_ids table is going live today. Expecting this to speed up DAU queries by another 40%. TODOs:
    • Get PRs merged introducing the feature
    • Create topic on heroku
    • Verify data is flowing in
    • Run populating query on cloud
    • Verify speed increase for client
    • Remove kafka tables from backups
    • Write up ticket to get this table live for everyone
  • Next steps:
    • Verify whether any other queries will fail after the previous step and whether their failing is urgent.
    • Experiment with swapping the order of queries (persons -> pdi -> events) on a conditional basis in trends. cc @EDsCODE
      • Will likely tackle next sprint if DAU/MAU improvement is good enough

For the client this means:

We had an issue with how we were storing people in our codebase, causing it not to scale well beyond 10M users.

  1. We now have a fix live which should make their queries not time out anymore (but they can still be slow)
  2. We're going live in the next few hours with a new table for storing person distinct ids we can query much faster.

Over the next 4 weeks, over the holidays, getting PostHog performance even better is going to be one of the main focuses of my team.

@macobo
Contributor Author

macobo commented Dec 13, 2021

Closing this, following up with #7663
