# rfc(feature): Fix Memory Limitiations in Session Replay's Access Pattern #88
## Conversation
> - This would require another team assuming the burden for us.
> - Otherwise, additional budget would need to be allocated to the Replays team to hire outside experts.

google's Dataproc makes this pretty straightforward; hopefully we wouldn't need too much outside help other than ops: https://cloud.google.com/dataproc

> google's dataproc makes this pretty straightforward

What's the alternative for self-hosted? Can we have an adapter so folks can plug in Spark?

Yeah, it would be some amount of work, but we can put it in a Docker container.

> **Drawbacks**
>
> - To search for a unique value a user would need to issue a query for each day of the retention period, or until the value was found.
> - Small customers would not see any replays on their index page due to the limited window size.

I think you could probably issue multiple queries in this case (a sketch follows this list):

1. Fetch a small initial window (past 24 hours).
2. If you get more than N results, stop.
3. If you get fewer than N results, query again for another time window (last week minus the last 24 hours).
4. Repeat step 3 until you get N results.

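A minimal sketch of that loop in Python. The `fetch_replays(start, end, limit)` helper, the window sizes, and the page size are all illustrative assumptions, not the real API:

```python
from datetime import datetime, timedelta

# Illustrative widening windows: 24 hours, then out to a week, two weeks, 90 days.
WINDOWS = [timedelta(hours=24), timedelta(days=7), timedelta(days=14), timedelta(days=90)]

def fetch_index_page(fetch_replays, limit=50):
    """Issue progressively wider queries until `limit` replays are found."""
    now = datetime.utcnow()
    results = []
    scanned_until = now  # lower bound of the range we have already scanned
    for window in WINDOWS:
        start = now - window
        # Only query the slice we have not scanned yet: [start, scanned_until).
        results.extend(fetch_replays(start=start, end=scanned_until, limit=limit - len(results)))
        if len(results) >= limit:
            break  # Stop as soon as we have N results.
        scanned_until = start
    return results[:limit]
```
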
> **Proposal**
>
> - Normalize schema.

Can you elaborate on this?

No, I think it should be removed.

> # Motivation
>
> Our goal is to make the Session Replay product highly scalable. We expect changes to the product will eliminate our OOM issues and increase performance for customers with large amounts of ingested data.

I didn't see this in the solution options considered, if I understand them correctly: would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?

Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary. I assume the number of replays we display on this view will result in fetching more info on sort/filter actions and retention-period changes as well.

> But would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?

Unfortunately, no. What's displayed and what was actually aggregated in the database are not the same in this case. The "scan range" (the stats-period value: 24 hours, 14 days, 90 days, etc.) is what creates these OOM issues.

> Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary.

If it makes the product better we should do it! Reducing the row set will have a positive impact on performance.

> **Proposal**
>
> Use a cron job, which runs at some `interval` (e.g. 1 hour), that would select the finished replays from the last `interval`, aggregate them, and write them to a materialized view or some destination table. We then alter the index page to reflect two different datasets: the "live" dataset and the "video-on-demand" dataset. A "live" page would fetch replays from the last `interval`. A "video-on-demand" dataset would function similarly to the current replays index page; however, it would only contain data that has been aggregated by the cron job.

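A hedged sketch of what that cron job's write step could look like. The table names (`replay_segments`, `replays_aggregated`), the columns, and the use of clickhouse-driver are assumptions for illustration, not the real schema:

```python
from datetime import datetime, timedelta
from clickhouse_driver import Client  # assumed driver choice

INTERVAL = timedelta(hours=1)

def rollup_finished_replays(client: Client, now: datetime) -> None:
    """Aggregate segments from the last interval into the 'video-on-demand' table."""
    # A real job would also need to exclude replays that are still in progress.
    client.execute(
        """
        INSERT INTO replays_aggregated
        SELECT
            replay_id,
            min(timestamp) AS started_at,
            max(timestamp) AS finished_at,
            sum(count_errors) AS count_errors
        FROM replay_segments
        WHERE timestamp >= %(start)s AND timestamp < %(end)s
        GROUP BY replay_id
        """,
        {"start": now - INTERVAL, "end": now},
    )
```
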
I can't speak to the details of the cron job described here in this solution, but I would be in favour of a solution proposal that does not display "in progress" replays on the Index page by default -- as this leads to many short (and 'unfinished') replays for customers that have a high volume.
> I would be in favour of a solution proposal that does not display "in progress" replays on the Index page by default

This is good to know. This gives us a lot of flexibility to solve the problem.

From a product perspective I agree as well. Some of our large customers have low-utility index pages because of the number of new replays.

> Use `optimize_aggregation_in_order` to limit the number of rows we need to aggregate. In testing this eliminates OOM errors, and queries complete reasonably quickly.
>
> **Drawbacks**

Would be good to mention here how this may show incorrect data for replays with a timestamp around the partition cutoff time.

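For concreteness, a hedged sketch of applying the setting per query with clickhouse-driver. Table and column names are assumptions, and the setting only streams the aggregation when the `GROUP BY` key is a prefix of the table's sorting key:

```python
from clickhouse_driver import Client

def query_index_page(client: Client, project_id: int):
    return client.execute(
        """
        SELECT replay_id, sum(count_errors) AS count_errors
        FROM replay_segments
        WHERE project_id = %(project_id)s
        GROUP BY replay_id
        """,
        {"project_id": project_id},
        # Aggregate in primary-key order instead of materializing the whole
        # GROUP BY hash table in memory at once.
        settings={"optimize_aggregation_in_order": 1},
    )
```
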
> **Drawbacks**
>
> There are aggregated values which users want to sort and filter on, such as `count_errors` and `activity`, which are prominently featured on our index page. Sorting and filtering by these values disables the preflight query and only runs the aggregated query.

Examples of these queries would be very helpful for designing a performant solution. As someone who doesn't use Replay, it's hard for me to map these abstract concepts onto a material reality.
> The non-aggregated query does not allow **exclusive** filter conditions against **non-static** columns. For example, we can not say "find a replay where this url does not exist". The query will find _rows_ where that condition is true, but it will not find _replays_ where that condition is true.

What makes a column static or not?

Discussed on call. Static means the value of the column on segment 0 can be presumed to be the value for every segment.
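
To make the row-vs-replay distinction concrete, two contrived queries against an assumed `replay_segments` schema (all names illustrative):

```python
# Row-level exclusion: matches any *segment row* whose url differs, so a
# replay still appears even if another of its segments visited the url.
row_level = """
SELECT DISTINCT replay_id
FROM replay_segments
WHERE url != 'https://example.com/checkout'
"""

# Replay-level exclusion: aggregate first, then drop any replay in which
# *any* segment matched the url.
replay_level = """
SELECT replay_id
FROM replay_segments
GROUP BY replay_id
HAVING countIf(url = 'https://example.com/checkout') = 0
"""
```
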
> The Session Replay index page will run out of memory when processing queries for our largest customers. This document contains proposed solutions for addressing this shortcoming.
>
> # Motivation

What is the expected timeline for this? Those constraints would be good to know when evaluating solutions.
> - Race conditions will require single-threaded processing of events sharing a replay-id.
> - Duplicate messages will necessitate ordering requirements.
> - There is always a possibility of dropped and duplicated data, regardless of safeguards.

What is the tolerance for dropped and duplicated data?

Grouping can handle duplicated data. Dropped data is not ideal but is manageable depending on the data model.

One thing that is missing from this is any solution involving the client-side SDK. I think we've talked about this on Slack, but documenting that discussion would be useful.

For example, why doesn't the client SDK set a "final" flag on the last segment of a replay? The SDK could also update each segment with the data from all the previous segments, so that the "final" segment would effectively be a fully formed replay sent all at once. We could then store that directly in a replay table and/or filter on the "final" flag.

@evanh Updated with two new proposals. Both rely on the SDK to buffer metadata.

> **Questions**
>
> - Is it possible for the SDK to know when it's done and mark the request object with a `final` attribute?

@billyvg Can we assess this? Is it possible for the SDK to know when the replay has finished? If so, do we have any idea how reliable it is?

@cmanallen It's possible, but potentially unreliable. Some cases off the top of my head:

- User closes tab -> mark as final
  - sendBeacon/fetch w/ keepalive should generally be able to send off this request, given that it's likely we lose the last segment
- User switches tab and goes idle -> set a timer based on session expiration to mark as final
  - timer drift? especially with a background tab?
- User turns off computer
  - Potentially can detect when the computer/browser is "active" again and mark as final
- Network flakiness

> **Drawbacks**
>
> - Requires SDK upgrade.
> - API will need to fall back to aggregating behavior if no final segments can be found (or otherwise detect old SDK usage prior to querying).

This is only true if you want to show "in progress" replays. If we exclude those then this isn't a problem.

Old SDKs won't set the `final` attribute and will require us to aggregate. Eventually we could drop this requirement, but only after some portion of customers upgraded.
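
A sketch of that fallback, assuming a hypothetical `is_final` column and clickhouse-driver (names and query shapes are illustrative):

```python
from clickhouse_driver import Client

def fetch_replays(client: Client, project_id: int):
    # Cheap path: new SDKs mark their last segment, so no aggregation is needed.
    rows = client.execute(
        "SELECT replay_id FROM replay_segments "
        "WHERE project_id = %(p)s AND is_final = 1",
        {"p": project_id},
    )
    if rows:
        return rows
    # Fallback for old SDKs that never set the final attribute: aggregate.
    return client.execute(
        "SELECT replay_id FROM replay_segments "
        "WHERE project_id = %(p)s GROUP BY replay_id",
        {"p": project_id},
    )
```
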
> **Questions**
>
> - Will using `FINAL` be a problem? Is it a big deal relative to the problems we're experiencing currently?

In this context, the `FINAL` keyword will have the same effect as `GROUP BY replay_id`. Under the hood, ClickHouse is doing the same thing: collapsing all the parts in memory to collect the correct result.

Are we sure it's exactly the same? My reading of the documentation suggests that it would return the highest-version row, whereas `GROUP BY` will return an aggregation of the non-collapsed rows.

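Two queries (assumed schema) that illustrate why the two can diverge: `FINAL` keeps the surviving row's column values, while `GROUP BY` aggregates across every not-yet-collapsed row:

```python
# FINAL: collapse parts by the sorting key and return the surviving rows'
# column values as-is.
with_final = "SELECT replay_id, count_errors FROM replay_segments FINAL"

# GROUP BY: compute an aggregate over all non-collapsed rows, which can
# differ whenever individual segments carry partial values.
with_group_by = """
SELECT replay_id, sum(count_errors) AS count_errors
FROM replay_segments
GROUP BY replay_id
"""
```
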
> Any option may be accepted in whole or in part. Multiple options can be accepted to achieve the desired outcome.
>
> ### 1. Change the Product's Query Access Pattern

- Infra Burden: Low
- Customer Impact: High
- Maintainability: Low

> **Questions**
>
> ### 2. Reduce the Scan Range on the Front End

- Infra Burden: Low
- Customer Impact: Medium
- Maintainability: Low
- Notes: most people will query over a 14-day window

> **Questions**
>
> ### 3. Reduce the Scan Range on the Back End

Scratch this, it probably won't solve the problem.
> **Questions**
>
> ### 4. Normalize Schema and Remove Snuba Join Restriction

🔪 Remove
> **Questions**
>
> ### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store

Infra Burden: Medium

Suggested change: retitle option 5 from "Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store" to "Make a replays table as opposed to a replay_segments table".

- Infra burden: High
- Customer Impact: Low
- Maintainability: High
- Notes: If we have a `replays` table instead of a `replay_segments` table, we can filter better even if we aggregate the same number of replays.

> - Colton Allen asks: Is it possible to write to tables partitioned by the hour? When the hour expires, the table is indexed for fast read performance. Replays on the most recent hour's partition would be unavailable while we are writing to it. Does PostgreSQL expose behavior that allows us to query over multiple partitions transparently? Is this even necessary given our scale? We are currently processing 200 messages per second.

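On the PostgreSQL question: declarative range partitioning does let callers query the parent table transparently, with partition pruning selecting the relevant child tables. A minimal sketch, with all names illustrative:

```python
# DDL for an hourly-partitioned table; the parent table is what queries target.
DDL = """
CREATE TABLE replays (
    replay_id  uuid        NOT NULL,
    started_at timestamptz NOT NULL
) PARTITION BY RANGE (started_at);

CREATE TABLE replays_2023_05_01_13 PARTITION OF replays
    FOR VALUES FROM ('2023-05-01 13:00') TO ('2023-05-01 14:00');
"""

# Queries never name a partition; the planner prunes to the hours that
# overlap the WHERE clause.
QUERY = "SELECT * FROM replays WHERE started_at >= now() - interval '1 day'"
```
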
> ### 10. Manually Manage An Aggregated Materialized View

- Infra burden: Medium (cron-job-generated inserts are a new mechanism)
- Maintainability: Medium
- Customer Impact: Low/Medium (depending on whether the product changes)

Questions:

- Do we need to merge the last hour's segments with the aggregated data?