
rfc(feature): Fix Memory Limitations in Session Replay's Access Pattern #88

Merged

Conversation


@cmanallen commented Apr 24, 2023

TODO. Rendered RFC

@cmanallen force-pushed the rfc/fix-memory-limitiations-in-session-replays-access-pattern branch from 68c7c83 to 01a816d on April 24, 2023 20:49
Comment on lines 158 to 159
- This would require another team assuming the burden for us.
- Otherwise, additional budget would need to be allocated to the Replays team to hire outside experts.
@bruno-garcia (Member) commented Apr 25, 2023:

Google's Dataproc makes this pretty straightforward.

What's the alternative for self-hosted? Can we have an adapter so folks can plug in Spark?

Member:

Yeah, it would be some amount of work, but we could put it in a Docker container.


**Drawbacks**

- To search for a unique value a user would need to issue a query for each day of the retention period or until the value was found.
- Small customers would not see any replays on their index page due to the limited window size.
Member:

I think you could probably issue multiple queries in this case (see the sketch after this list):

  1. Fetch a small initial window (past 24 hours).
  2. If you get more than N results, stop.
  3. If you get fewer than N results, query again for another time window (last week minus the last 24 hours).
  4. Repeat step 3 until you get N results.
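
A minimal sketch of that loop, assuming a hypothetical `query_replays` helper in place of the real index-page query:

```python
from datetime import datetime, timedelta, timezone

def query_replays(start, end, limit):
    """Placeholder for the real index-page query against the replays dataset."""
    return []

def fetch_replays(n, retention_days=90):
    # Progressively widen the scan range until we have N results or we
    # have covered the full retention period.
    windows = [timedelta(hours=24), timedelta(days=7), timedelta(days=retention_days)]
    now = datetime.now(timezone.utc)
    results, covered_until = [], now
    for window in windows:
        start = now - window
        # Only scan the slice not already covered by the previous, smaller window.
        results += query_replays(start, covered_until, limit=n - len(results))
        if len(results) >= n:
            break
        covered_until = start
    return results
```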


**Proposal**

- Normalize schema.
Member:

Can you elaborate on this?

Member Author:

No, I think it should be removed.


# Motivation

Our goal is to make the Session Replay product highly scalable. We expect the proposed changes to eliminate our OOM issues and improve performance for customers with large amounts of ingested data.
Member:

If I understand the solution options correctly, I didn't see this considered: would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?

Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary. I assume the number of replays we display on this view will also drive how much we fetch on sort/filter actions and retention-period changes.

Member Author:

> Would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?

Unfortunately, no. What's displayed and what was actually aggregated in the database are not the same in this case. The "scan range" (the stats period value: 14 days, 24 hours, 90 days, etc.) is what creates these OOM issues.

> Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary.

If it makes the product better, we should do it! Reducing the row set will have a positive impact on performance.


**Proposal**

Use a cron job, which runs at some `interval` (e.g. 1 hour), that selects the finished replays from the last `interval`, aggregates them, and writes them to a materialized view or some destination table. We then alter the index page to reflect two different datasets: a "live" dataset and a "video-on-demand" dataset. A "live" page would fetch replays from the last `interval`. A "video-on-demand" page would function similarly to the current replays index page, but it would only contain data that has been aggregated by the cron job.
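
A minimal sketch of what such a job might look like, assuming a hypothetical `replay_segments` source table and `replays_aggregated` destination table (the real schema and scheduling mechanism may differ):

```python
from clickhouse_driver import Client

# Roll the last interval's replays into the destination table. A real job
# would also need a rule for "finished" (e.g. no new segments for longer
# than the session timeout).
AGGREGATE_SQL = """
INSERT INTO replays_aggregated
SELECT
    replay_id,
    min(timestamp)    AS started_at,
    max(timestamp)    AS finished_at,
    sum(count_errors) AS count_errors,
    sum(activity)     AS activity
FROM replay_segments
WHERE timestamp >= now() - INTERVAL 1 HOUR
  AND timestamp <  now()
GROUP BY replay_id
"""

def run_aggregation_job():
    # Scheduled to run once per `interval`, e.g. via cron or Celery beat.
    Client("clickhouse-host").execute(AGGREGATE_SQL)
```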
Member:

I can't speak to the details of the cron job described in this solution, but I would be in favour of a solution proposal that does not display "in progress" replays on the Index page by default, as this leads to many short (and "unfinished") replays for customers that have a high volume.

Member Author:

> I would be in favour of a solution proposal that does not display "in progress" replays on the Index page by default

This is good to know. This gives us a lot of flexibility to solve the problem.

From a product perspective I agree as well. Some of our large customers have low-utility index pages because of the number of new replays.


Use `optimize_aggregation_in_order` to limit the number of rows we need to aggregate. In testing, this eliminates OOM errors, and queries complete reasonably quickly.
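
For reference, the setting is applied per query. A hedged sketch via clickhouse-driver, with illustrative table and column names; note that the setting only helps when the `GROUP BY` key is a prefix of the table's sorting key:

```python
from clickhouse_driver import Client

client = Client("clickhouse-host")
rows = client.execute(
    """
    SELECT replay_id,
           min(timestamp)    AS started_at,
           sum(count_errors) AS count_errors
    FROM replay_segments
    WHERE project_id = %(project_id)s
    GROUP BY replay_id
    LIMIT 50
    """,
    {"project_id": 1},
    # Stream the aggregation in table order instead of building the whole
    # hash table in memory.
    settings={"optimize_aggregation_in_order": 1},
)
```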

**Drawbacks**
Member:

It would be good to mention here how this may show incorrect data for replays with a timestamp around the partition cutoff time.


**Drawbacks**

There are aggregated values which users want to sort and filter on: columns such as `count_errors` and `activity`, which are prominently featured on our index page. Sorting and filtering by these values disables the preflight query and only runs the aggregated query.
Member:

Examples of these queries would be very helpful for designing a performant solution. As someone who doesn't use Replay, it's hard for me to map these abstract concepts onto a material reality.


The non-aggregated query does not allow **exclusive** filter conditions against **non-static** columns. For example, we cannot say "find a replay where this url does not exist". The query will find _rows_ where that condition is true, but it will not find _replays_ where that condition is true.
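
To illustrate the distinction, a sketch of the two query shapes, with illustrative table and column names (assuming `urls` is an array column on each segment row):

```python
# Row-level filter: matches *rows* missing the URL. A replay whose other
# segments do contain the URL will still slip through.
ROW_QUERY = """
SELECT DISTINCT replay_id
FROM replay_segments
WHERE NOT has(urls, 'https://example.com/checkout')
"""

# Aggregated filter: collapses segments per replay first, so the exclusive
# condition holds for the *replay* as a whole.
REPLAY_QUERY = """
SELECT replay_id
FROM replay_segments
GROUP BY replay_id
HAVING sum(has(urls, 'https://example.com/checkout')) = 0
"""
```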
Member:

What makes a column static or not?

Member Author:

Discussed on call. Static means the value of the column on segment 0 can be presumed to be the value for every segment.


The Session Replay index page will run out of memory when processing queries for our largest customers. This document contains proposed solutions for addressing this shortcoming.

# Motivation
Member:

What is the expected timeline for this? Those constraints would be good to know when evaluating solutions.


- Race conditions will require single-threaded processing of events with the same replay-id.
- Duplicate messages will necessitate ordering requirements.
- There is always a possibility of dropped or duplicated data regardless of safeguards.
Member:

What is the tolerance for dropped and duplicated data?

Member Author:

Grouping can handle duplicated data. Dropped data is not ideal but is manageable depending on the data model.

@evanh (Member) left a comment:

One thing that is missing from this is any solution involving the client-side SDK. I think we've talked about this on Slack, but documenting that discussion would be useful.

For example, why doesn't the client SDK set a "final" flag on the last segment of a replay? The SDK could also update each segment with the data from all the previous segments, so that the "final" segment would effectively be a fully formed replay sent all at once. We could then store that directly in a replay table and/or filter on the "final" flag.

@cmanallen (Member Author):

@evanh Updated with two new proposals. Both rely on the SDK to buffer metadata.


**Questions**

- Is it possible for the SDK to know when it's done and mark the request object with a final attribute?
Member Author:

@billyvg Can we assess this? Is it possible for the SDK to know when the replay has finished? If so, do we have any idea how reliable it is?

Member:

@cmanallen It's possible, but potentially unreliable. Some cases off the top of my head:

- User closes tab -> mark as final
  - `sendBeacon`/`fetch` w/ keepalive should generally be able to send off this request, though it's likely we lose the last segment
- User switches tab and goes idle -> set a timer based on session expiration to mark as final
  - Timer drift? Especially with a background tab?
- User turns off computer
  - We can potentially detect when the computer/browser is "active" again and mark as final
- Network flakiness

**Drawbacks**

- Requires SDK upgrade.
- API will need to fall back to aggregating behavior if no final segments can be found (or otherwise detect old SDK usage prior to querying).
Member:

This is only true if you want to show "in progress" replays. If we exclude those then this isn't a problem.

Member Author:

Old SDKs won't set the final attribute and will require us to aggregate. Eventually we could drop this requirement, but only after some portion of customers have upgraded. A sketch of the fallback is below.
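
A hedged sketch of that fallback, assuming a hypothetical `is_final` column and illustrative table and column names:

```python
FINAL_SQL = """
SELECT replay_id, timestamp, count_errors
FROM replay_segments
WHERE project_id = %(project_id)s AND is_final = 1
LIMIT 50
"""

AGGREGATED_SQL = """
SELECT replay_id,
       min(timestamp)    AS started_at,
       sum(count_errors) AS count_errors
FROM replay_segments
WHERE project_id = %(project_id)s
GROUP BY replay_id
LIMIT 50
"""

def query_index_page(client, project_id):
    # Prefer cheap final-segment rows when the SDK provides them.
    rows = client.execute(FINAL_SQL, {"project_id": project_id})
    if rows:
        return rows
    # Old SDKs never set is_final, so fall back to the aggregating query.
    return client.execute(AGGREGATED_SQL, {"project_id": project_id})
```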


**Questions**

- Will using `FINAL` be a problem? Is it a big deal relative to the problems we're experiencing currently?
Member:

In this context, the `FINAL` keyword will have the same effect as the `GROUP BY replay_id`. Under the hood, ClickHouse is doing the same thing: collapsing all the parts in memory to collect the correct result.

Member Author:

Are we sure it's exactly the same? My reading of the documentation suggests that it would return the highest-version row, whereas `GROUP BY` will return an aggregation of the non-collapsed rows.
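
For comparison, the two forms side by side (illustrative names; `FINAL` semantics depend on the table engine, e.g. ReplacingMergeTree keeps the latest version per sorting-key value):

```python
# FINAL collapses rows at read time: one surviving row per sorting-key
# value (for ReplacingMergeTree, the highest-version row).
FINAL_SQL = """
SELECT replay_id, count_errors
FROM replay_segments FINAL
WHERE replay_id = %(replay_id)s
"""

# GROUP BY instead aggregates across all of the uncollapsed segment rows.
GROUP_BY_SQL = """
SELECT replay_id, sum(count_errors) AS count_errors
FROM replay_segments
WHERE replay_id = %(replay_id)s
GROUP BY replay_id
"""
```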


Any option may be accepted in whole or in part. Multiple options can be accepted to achieve the desired outcome.

### 1. Change the Product's Query Access Pattern
@volokluev (Member) commented May 9, 2023:

Infra Burden: Low
Customer Impact: High
Maintainability: Low


**Questions**

### 2. Reduce the Scan Range on the Front End
Member:

Infra Burden: Low
Customer Impact: Medium
Maintainability: Low

Notes: most people will query over a 14-day window.


**Questions**

### 3. Reduce the Scan Range on the Back End
Member:

Scratch this; it probably won't solve the problem.


**Questions**

### 4. Normalize Schema and Remove Snuba Join Restriction
Member:

🔪 Remove


**Questions**

### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store
Member:

Infra Burden: Medium

Suggested change:
- ### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store
+ ### 5. Make a replays table as opposed to a replay_segments table

Infra burden: High
Customer Impact: Low
Maintainability: High

Notes: If we have a replays table instead of a replay_segments table, we can filter better even if we aggregate the same number of replays.


- Colton Allen asks: Is it possible to write to tables partitioned by the hour? When the hour expires, the table is indexed for fast read performance. Replays on the most recent hour partition would be unavailable while we are writing to it. Does PostgreSQL expose behavior that allows us to query over multiple partitions transparently? Is this even necessary given our scale? We are currently processing 200 messages per second.
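
On the PostgreSQL question above: declarative partitioning (PostgreSQL 10+) does let queries against the parent table transparently prune to the matching partitions. A minimal sketch, with illustrative names:

```python
CREATE_PARENT = """
CREATE TABLE replays (
    replay_id uuid        NOT NULL,
    timestamp timestamptz NOT NULL
) PARTITION BY RANGE (timestamp);
"""

# One child table per hour; queries against `replays` are automatically
# routed to the partitions that overlap the WHERE clause.
CREATE_HOURLY_PARTITION = """
CREATE TABLE replays_2023_04_24_h20 PARTITION OF replays
    FOR VALUES FROM ('2023-04-24 20:00+00') TO ('2023-04-24 21:00+00');
"""
```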

### 10. Manually Manage An Aggregated Materialized View
Member:

Infra burden: Medium (cron-job-generated inserts are a new mechanism)
Maintainability: Medium
Customer Impact: Low/Medium (depending on whether the product changes)

Questions:
Do we need to merge the last hour's segments with the aggregated data?

@cmanallen marked this pull request as ready for review May 22, 2023 15:37
@cmanallen merged commit 14f1063 into main May 22, 2023
@cmanallen deleted the rfc/fix-memory-limitiations-in-session-replays-access-pattern branch May 22, 2023 15:39