Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(web-analytics): HogQL sessions where extractor #20986

Merged
merged 20 commits into from
Mar 22, 2024

Conversation

robbie-c
Copy link
Collaborator

@robbie-c robbie-c commented Mar 18, 2024

Problem

The sessions table is an AggregatingMergeTree, and will have one row per session per day. This means that when we want to query sessions, we need to pre-group these rows, so that we only have one row per session.

We can hide this detail using a lazy table, but to make querying the table faster, we can inline the timestamp where conditions from the select on the outer lazy table to the select on the inner real table. That's what this PR adds.

In this PR I have made the assumption that all the events from the same session happen in a 3 day window, it wouldn't be a terrible problem to increase that window, it just narrows down the search space more to have a smaller number.

Changes

Add a where clause extractor for the lazy sessions table

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Added a lot of tests for the where clause extractor, also added a test to ensure the clickhouse printer as a whole did the right thing.

@robbie-c robbie-c force-pushed the feat/hogql-sessions-where-extractor branch 5 times, most recently from 99c8961 to f31a9e7 Compare March 21, 2024 10:29
@robbie-c robbie-c requested a review from Gilbert09 March 21, 2024 11:26
@robbie-c robbie-c marked this pull request as ready for review March 21, 2024 11:26
Comment on lines +267 to +317
if node.name in [
"parseDateTime64BestEffortOrNull",
"toDateTime",
"toTimeZone",
"assumeNotNull",
"toStartOfDay",
"toStartOfWeek",
"toStartOfMonth",
"toStartOfQuarter",
"toStartOfYear",
]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where our improved type system is much needed.. remembering to keep this list up to date as we add more clickhouse functions will be hard

posthog/hogql/database/schema/cohort_people.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a pretty neat system of visitors!

When matching timestamp fields, you're currently checking if the chain matches any certain shape like node.chain == ["e", "timestamp"].

I think there's a direct route available if you check node.type.resolve_database_field(). This could give you a pointer back to the actual database table, and/or some other Type you can compare it with. This feels more robust and alias-proof.

...

Thinking ahead just one step, we want something very similar for the persons table. The query select * from persons where properties.email like '%posthog.com' will swap persons with a subquery from raw_persons, which will select ALL email properties from all users, and return all of them to the external where. This is wasteful. We could prefilter. The same is true for distinct_ids, etc.

This PR takes us half of the way there... if only there was a way to scope creep some abstractions into here... 😅

What would such abstractions look like? Basically what's inside SessionWhereClauseExtractor.visit_compare_operation should be all the code you need to write per table.

The class SessionWhereClauseExtractor could be split into TableComparisonCollector, which is tasked with gathering all the CompareOperation-s that touch the table in question. The lazy selects would then do with this list of operations all that they can do. In this case, we'd pick out the timestamps and add them to the internal filter.

I wouldn't block this PR without such an abstraction, but if it's easy to sneak in when making the table comparison more deterministic, why not? 😅

@robbie-c robbie-c force-pushed the feat/hogql-sessions-where-extractor branch 3 times, most recently from 75a9d99 to d6d9760 Compare March 21, 2024 20:01
@robbie-c
Copy link
Collaborator Author

robbie-c commented Mar 21, 2024

@mariusandra I appreciate your scope creep suggestions, I'd love to get this PR merged and get the sessions table in front of people before scope-creeping too much :D

Copy link
Collaborator

@mariusandra mariusandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely stuff! 👍

Some minor nits inline, and yeah, that scope creep was definitely out of scope 😅

from posthog.hogql.database.models import DatabaseField
from posthog.hogql.visitor import clone_expr, CloningVisitor, Visitor

SESSION_BUFFER_DAYS = 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should double check with the session folks, but there might have even been some 24h-ish limits on sessions. So moving this to 1 might be good enough, and faster.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last time I checked, there's no actual agreed upon limit, it's just 24 hours of inactivity. I'll leave this at 3 but I can change this later

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have a maximum session length of 24 hours, so I'll make a PR to change this number to 1.

@robbie-c robbie-c force-pushed the feat/hogql-sessions-where-extractor branch from 25fca1b to 3b00da1 Compare March 21, 2024 21:01
@robbie-c robbie-c force-pushed the feat/hogql-sessions-where-extractor branch from 3b00da1 to ab64f8a Compare March 21, 2024 21:01
@robbie-c robbie-c enabled auto-merge (squash) March 21, 2024 22:35
@robbie-c robbie-c merged commit f43d7a9 into master Mar 22, 2024
86 checks passed
@robbie-c robbie-c deleted the feat/hogql-sessions-where-extractor branch March 22, 2024 08:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants