-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(web-analytics): HogQL sessions where extractor #20986
Conversation
99c8961
to
f31a9e7
Compare
posthog/hogql/database/schema/util/session_where_clause_extractor.py
Outdated
Show resolved
Hide resolved
posthog/hogql/database/schema/util/session_where_clause_extractor.py
Outdated
Show resolved
Hide resolved
if node.name in [ | ||
"parseDateTime64BestEffortOrNull", | ||
"toDateTime", | ||
"toTimeZone", | ||
"assumeNotNull", | ||
"toStartOfDay", | ||
"toStartOfWeek", | ||
"toStartOfMonth", | ||
"toStartOfQuarter", | ||
"toStartOfYear", | ||
]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where our improved type system is much needed.. remembering to keep this list up to date as we add more clickhouse functions will be hard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a pretty neat system of visitors!
When matching timestamp
fields, you're currently checking if the chain matches any certain shape like node.chain == ["e", "timestamp"]
.
I think there's a direct route available if you check node.type.resolve_database_field()
. This could give you a pointer back to the actual database table, and/or some other Type
you can compare it with. This feels more robust and alias-proof.
...
Thinking ahead just one step, we want something very similar for the persons table. The query select * from persons where properties.email like '%posthog.com'
will swap persons
with a subquery from raw_persons
, which will select ALL email properties from all users, and return all of them to the external where. This is wasteful. We could prefilter. The same is true for distinct_ids, etc.
This PR takes us half of the way there... if only there was a way to scope creep some abstractions into here... 😅
What would such abstractions look like? Basically what's inside SessionWhereClauseExtractor.visit_compare_operation
should be all the code you need to write per table.
The class SessionWhereClauseExtractor
could be split into TableComparisonCollector
, which is tasked with gathering all the CompareOperation
-s that touch the table in question. The lazy selects would then do with this list of operations all that they can do. In this case, we'd pick out the timestamps and add them to the internal filter.
I wouldn't block this PR without such an abstraction, but if it's easy to sneak in when making the table comparison more deterministic, why not? 😅
posthog/hogql/database/schema/util/session_where_clause_extractor.py
Outdated
Show resolved
Hide resolved
75a9d99
to
d6d9760
Compare
@mariusandra I appreciate your scope creep suggestions, I'd love to get this PR merged and get the sessions table in front of people before scope-creeping too much :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely stuff! 👍
Some minor nits inline, and yeah, that scope creep was definitely out of scope 😅
from posthog.hogql.database.models import DatabaseField | ||
from posthog.hogql.visitor import clone_expr, CloningVisitor, Visitor | ||
|
||
SESSION_BUFFER_DAYS = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should double check with the session folks, but there might have even been some 24h-ish limits on sessions. So moving this to 1
might be good enough, and faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last time I checked, there's no actual agreed upon limit, it's just 24 hours of inactivity. I'll leave this at 3 but I can change this later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have a maximum session length of 24 hours, so I'll make a PR to change this number to 1.
25fca1b
to
3b00da1
Compare
3b00da1
to
ab64f8a
Compare
Problem
The sessions table is an AggregatingMergeTree, and will have one row per session per day. This means that when we want to query sessions, we need to pre-group these rows, so that we only have one row per session.
We can hide this detail using a lazy table, but to make querying the table faster, we can inline the timestamp where conditions from the select on the outer lazy table to the select on the inner real table. That's what this PR adds.
In this PR I have made the assumption that all the events from the same session happen in a 3 day window, it wouldn't be a terrible problem to increase that window, it just narrows down the search space more to have a smaller number.
Changes
Add a where clause extractor for the lazy sessions table
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Added a lot of tests for the where clause extractor, also added a test to ensure the clickhouse printer as a whole did the right thing.