rfc(feature): Fix Memory Limitations in Session Replay's Access Pattern #88

Merged

Commits (31)

- `01a816d` rfc(feature): Fix Memory Limitiations in Session Replay's Access Pattern (cmanallen, Apr 24, 2023)
- `9e6aec7` Add outline (cmanallen, Apr 25, 2023)
- `77e62dd` Further progress (cmanallen, Apr 25, 2023)
- `acaf8dd` Pinot and VersionedCollapsingMergeTree (cmanallen, Apr 25, 2023)
- `1fb126e` More changes (cmanallen, Apr 25, 2023)
- `e77831d` Stateful streaming, clickhouse upgrade (cmanallen, Apr 25, 2023)
- `97c9603` Add OLTP section (cmanallen, Apr 25, 2023)
- `03e7f64` Add disclaimer (cmanallen, Apr 25, 2023)
- `78421a3` Remove assumptions and disclaimer (cmanallen, Apr 25, 2023)
- `4875bc0` Number proposals (cmanallen, Apr 25, 2023)
- `3681298` Add emphasis (cmanallen, Apr 25, 2023)
- `dd4dc3c` Add more emphasis (cmanallen, Apr 25, 2023)
- `d43ebb8` Add drawback (cmanallen, Apr 25, 2023)
- `f1479d4` Reduce idle time (cmanallen, Apr 25, 2023)
- `edb90fa` More stuff (cmanallen, Apr 25, 2023)
- `724ec63` Add questions to cron job (cmanallen, Apr 26, 2023)
- `0f6212f` Typo (cmanallen, Apr 26, 2023)
- `4dfbd2d` Language (cmanallen, Apr 26, 2023)
- `62b134d` Ryans question (cmanallen, Apr 26, 2023)
- `c67a8e3` More updates (cmanallen, Apr 26, 2023)
- `2dc2f02` Add note (cmanallen, Apr 26, 2023)
- `4686c0d` Add memory limit configuration proposal (cmanallen, Apr 27, 2023)
- `6bdb97a` Add optimize_aggregation_in_order proposal (cmanallen, Apr 27, 2023)
- `f882d08` Fix typo (cmanallen, Apr 27, 2023)
- `9d02960` Bruno's question (cmanallen, Apr 27, 2023)
- `9c7fb54` Add SDK buffering proposals (cmanallen, May 8, 2023)
- `ef6f948` Add comment on viability (cmanallen, May 8, 2023)
- `1714f47` Add queries (cmanallen, May 8, 2023)
- `88bdc23` Update outcome (cmanallen, May 10, 2023)
- `b2730ea` Update solution (cmanallen, May 12, 2023)
- `1f9e83c` Add proposal 10 rejection statement (cmanallen, May 22, 2023)

README.md (1 addition, 0 deletions)

@@ -45,3 +45,4 @@ This repository contains RFCs and DACIs. Lost?
- [0081-sourcemap-debugid](text/0081-sourcemap-debugid.md): Reliable JavaScript/SourceMap processing via `DebugId`
- [0084-move-docs-to-sentry-repository](text/0084-move-docs-to-sentry-repository.md): Move onboarding docs from sentry-docs over to sentry repository
- [0086-sentry-bundler-plugins-api](text/0086-sentry-bundler-plugins-api.md): Sentry Bundler Plugins API
- [0088-fix-memory-limitiations-in-session-replays-access-pattern](text/0088-fix-memory-limitiations-in-session-replays-access-pattern.md): Fix Memory Limitiations in Session Replay's Access Pattern

text/0088-fix-memory-limitiations-in-session-replays-access-pattern.md (216 additions, 0 deletions)

@@ -0,0 +1,216 @@
- Start Date: 2023-04-24
- RFC Type: feature
- RFC PR: https://github.com/getsentry/rfcs/pull/88
- RFC Status: draft

# Summary

The Session Replay index page will run out of memory when processing queries for our largest customers. This document contains proposed solutions for addressing this shortcoming.

# Motivation

> **Member:** What is the expected timeline of this? Those constraints would be good to know when evaluating solutions.


Our goal is to make the Session Replay product highly scalable. We expect these changes to eliminate our OOM issues and improve performance for customers with large amounts of ingested data.

> **Member:** I didn't see this in the solution options considered, if I understand them correctly: but would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?
>
> Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary. I assume the number of replays we display on this view will result in fetching more info on sort/filter actions and retention period changes as well.

> **cmanallen (Author):**
>
> > Would limiting the number of replays shown on the Index page materially improve the OOM issues or probability of them occurring?
>
> Unfortunately, no. What's displayed and what was actually aggregated in the database are not the same in this case. The "scan range" (the stats period value: 14 days, 24 hours, 90 days, etc.) is what creates these OOM issues.
>
> > Right now we display 50 replays upon load of this page which is *a lot* and seems unnecessary.
>
> If it makes the product better we should do it! Reducing the row set will have a positive impact on performance.


# Background

The Session Replay data model is different from most at Sentry. Replays are received in parts referred to as "segments". One row does not represent one replay. Instead, many rows are aggregated together to represent a single replay. When we ask a question such as "which replays did not visit this url?", we have to aggregate every row in the database (minus whatever rows were reduced by conditions in the WHERE clause). As it turns out, the number of rows is very large, and the amount of data is significant enough that we run out of memory for large customers.
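
To make the aggregation concrete, the index-page query implied by this model looks roughly like the sketch below. The table and column names are illustrative, not the production schema:

```python
# Illustrative ClickHouse-style query: every segment row in the scan range
# must be grouped into its replay before a replay-level condition such as
# "did not visit this url" can be evaluated. Memory scales with the number
# of replays in the range, not with the page size.
INDEX_PAGE_QUERY = """
SELECT
    replay_id,
    min(timestamp) AS started_at,
    max(timestamp) AS finished_at,
    groupUniqArray(url) AS urls,
    sum(error_count) AS count_errors
FROM replay_segments
WHERE project_id = %(project_id)s
  AND timestamp BETWEEN %(start)s AND %(end)s
GROUP BY replay_id
HAVING NOT has(urls, %(excluded_url)s)
ORDER BY started_at DESC
LIMIT 50
"""
```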

**Key Ingest Considerations**

- Replay segments are not guaranteed to be received in order.
- Replay segments are typically sent every 5 seconds while the session is active.
- Replays can be "idle" for up to 15 minutes before sending a new segment when user activity resumes.
- Replays can be up to 1 hour in duration.

# Existing Solution

We have introduced a second, non-aggregated query which precedes our main aggregation query. This query will trigger if we are exclusively querying by static values which are consistent throughout the lifespan of a Replay. It will return a list of replay IDs which we can then pass to our main aggregation query. The aggregation query is then only responsible for aggregating a subset of replays (the pagination limit). This optimization allows us to query very efficiently while still providing aggregated results to the end user.
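
A sketch of this two-phase pattern, with illustrative table and column names rather than the production schema:

```python
# Phase 1 (preflight): a non-aggregated scan touching only static columns,
# returning candidate replay IDs cheaply.
PREFLIGHT_QUERY = """
SELECT DISTINCT replay_id
FROM replay_segments
WHERE project_id = %(project_id)s
  AND timestamp BETWEEN %(start)s AND %(end)s
  AND browser_name = %(browser_name)s  -- static: identical on every segment
LIMIT 50
"""

# Phase 2: aggregate only those candidates, so the GROUP BY covers at most
# the pagination limit rather than every replay in the scan range.
AGGREGATION_QUERY = """
SELECT
    replay_id,
    min(timestamp) AS started_at,
    sum(error_count) AS count_errors
FROM replay_segments
WHERE replay_id IN %(replay_ids)s
GROUP BY replay_id
"""
```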

**Drawbacks**

There are aggregated values which users want to sort and filter on, such as the count_errors and activity columns that are prominently featured on our index page. Sorting or filtering by these values disables the preflight query and runs only the aggregation query.

> **Member:** Examples of these queries would be very helpful in designing a performant solution. As someone who doesn't use Replay, it's hard for me to map these abstract concepts onto a material reality.
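
For illustration, a query that sorts on an aggregated value looks roughly like the sketch below (schema simplified). Because the sort key exists only after aggregation, the preflight replay-ID scan cannot help here:

```python
# Illustrative count_errors sort: the ORDER BY key is an aggregate, so every
# replay in the scan range must be fully aggregated before the first page of
# 50 results can be determined.
SORT_BY_AGGREGATE_QUERY = """
SELECT
    replay_id,
    sum(error_count) AS count_errors
FROM replay_segments
WHERE project_id = %(project_id)s
  AND timestamp BETWEEN %(start)s AND %(end)s
GROUP BY replay_id
ORDER BY count_errors DESC
LIMIT 50
"""
```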


The non-aggregated query does not allow **exclusive** filter conditions against **non-static** columns. For example, we cannot say "find a replay where this url does not exist". The query will find _rows_ where that condition is true but it will not find _replays_ where that condition is true.

> **Member:** What makes a column static or not?

> **cmanallen (Author):** Discussed on call. Static means the value of the column on segment 0 can be presumed to be the value for every segment.


The non-aggregated query does not allow multiple, **inclusive** filter conditions against **non-static** columns. For example, we cannot say "find a replay where this url exists and this other url exists". It will find **ROWS** which have both urls but not **REPLAYS** which have both urls. Transforming the `AND` operator to an `OR` operator does not satisfy the condition because it will match replays which contain either one of the urls, not necessarily both.
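
The row-versus-replay distinction can be made concrete with a pair of illustrative queries (schema simplified to a scalar `url` column per row):

```python
# Row-level semantics: no single segment row can equal both urls, so this
# matches nothing even when a replay visited both.
ROW_LEVEL_QUERY = """
SELECT DISTINCT replay_id
FROM replay_segments
WHERE url = '/a' AND url = '/b'
"""

# Replay-level semantics, which is what the user means: the urls must be
# aggregated per replay first, which is exactly the expensive full GROUP BY.
REPLAY_LEVEL_QUERY = """
SELECT replay_id
FROM replay_segments
GROUP BY replay_id
HAVING has(groupUniqArray(url), '/a')
   AND has(groupUniqArray(url), '/b')
"""
```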

# Options Considered

Any option may be accepted in whole or in part. Multiple options can be accepted to achieve the desired outcome. The options are ordered from perceived ease to perceived difficulty.

### 1. Change the Product's Query Access Pattern

> **@volokluev** (May 9, 2023):
> Infra Burden: Low
> Customer Impact: High
> Maintainability: Low


**Proposal**

Our current subquery solution works very well. However, there are escape hatches which require us to issue an aggregation query over the whole dataset. Those escape hatches should be closed.

- Remove ability to filter and sort by aggregated values (e.g. count_errors, activity_score).
- Remove ability to filter by negation values which change (e.g. urls).
- Remove ability to use the AND operator when a non-static value is present.
- Filtering by browser_name AND os_name would be permitted.
- Filtering by browser_name AND click_action would be disallowed.
- Click rows do not store browser information because they are internally generated.
- Should this condition change these filters could be supported.
- Filtering by error_id AND url would be disallowed.
- Filtering by error_id OR url would be allowed.

**Drawbacks**

- Significant reduction in index page quality.
- Removes the most important features of our index page: count_errors sort, activity sort, duration sort and filter.

**Questions**

### 2. Reduce the Scan Range on the Front End

> **Member:**
> Infra Burden: Low
> Customer Impact: Medium
> Maintainability: Low
>
> Notes: most people will query over a 14 day window


**Proposal**

The number of rows aggregated can be reduced by restricting the maximum time range we query over. We should validate the timestamp range such that it does not exceed a 24-hour period. This would satisfy every organization which ingests fewer than 1 billion replay-segments every 90 days.

**Drawbacks**

- To search for a unique value a user would need to issue a query for each day of the retention period or until the value was found.
- Small customers would not see any replays on their index page due to the limited window size.

> **Member:** I think you could probably issue multiple queries in this case (see the sketch after this list):
>
> 1. Fetch a small initial window (past 24 hours).
> 2. If you get more than N results, stop.
> 3. If you get fewer than N results, query again for another time window (last week minus the last 24 hours).
> 4. Repeat 3 until you get N results.

- Necessitates a special flag for large customers to enable this optimization.
- We may not know who needs this flag in advance, and we may present a degraded customer experience without realizing it.
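
A minimal Python sketch of the windowed fetch described in the comment above, assuming a `run_query` helper that executes the index query over a single time slice:

```python
from datetime import datetime, timedelta
from typing import Callable, List

def fetch_windowed(
    run_query: Callable[[datetime, datetime], List[dict]],  # assumed helper
    end: datetime,
    page_size: int = 50,
) -> List[dict]:
    """Query small, recent windows first; widen until the page is filled."""
    windows = [timedelta(days=1), timedelta(days=7), timedelta(days=90)]
    results: List[dict] = []
    slice_end = end
    for window in windows:
        slice_start = end - window
        # Each query covers only the slice that was not already scanned.
        results += run_query(slice_start, slice_end)
        if len(results) >= page_size:
            break
        slice_end = slice_start
    return results[:page_size]
```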

**Questions**

### 3. Reduce the Scan Range on the Back End

> **Member:** Scratch this, it probably won't solve the problem.


**Proposal**

The number of rows aggregated can be reduced by restricting the maximum time range we query over. For select queries the back end can issue multiple queries on subsets of the range. For example, if we assume that no sort value was provided, or that the sort was applied to the timestamp column, then the back end can transparently query a subset of the window, attempting to populate the result set without querying the entire range.

**Drawbacks**

- Requires O(retention_days) queries to satisfy the result set in the worst case.
- Adds an additional layer of complexity and does not solve our OOM issue.

**Questions**

### 4. Normalize Schema and Remove Snuba Join Restriction

> **Member:** 🔪 Remove


**Proposal**

- Normalize schema.

> **Member:** Can you elaborate on this?

> **cmanallen (Author):** No, I think it should be removed.

- Use joins.

**Drawbacks**

- Performance.
- Requires Snuba work.

**Questions**

### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store

> **Member:** Infra Burden: Medium
>
> Suggested change: retitle this section from "### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store" to "### 5. Make a replays table as opposed to a replay_segments table".
>
> Infra burden: High
> Customer Impact: Low
> Maintainability: High
>
> Notes: If we have a replays table instead of a replay_segments table we can filter better even if we aggregate the same amount of replays.


**Proposal**

Instead of aggregating in ClickHouse we should maintain a stateful representation of the Replay in an alternative service (such as a key-value store). Additionally, we will migrate to the "VersionedCollapsingMergeTree" table engine and write our progress incrementally.

The versioned engine works as follows:

- A sign is used to determine whether the row is a state row or cancel row.
- A row can be canceled by re-writing the row with a negative sign.
- New state rows can be written by incrementing the version and assigning a positive sign.
- The version and row state must be stored or otherwise fetched from ClickHouse before each write.

The ingestion process can be described as follows (see the sketch after this list):

- A replay-segment is received.
- The replay-id is used to look up previously aggregated data in the KV store.
- If data was returned, collate it with the new data and store the result.
- If data was not returned, store the new data.
- The new row is written to ClickHouse with the version column incremented by 1.
- The old row is re-written to ClickHouse with the sign value set to -1.
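
A minimal sketch of that write path, assuming a key-value client `kv` and a ClickHouse client `ch` (all names and the aggregated columns are illustrative):

```python
def new_state(segment: dict) -> dict:
    # The first segment seen for a replay becomes the initial state.
    return {
        "replay_id": segment["replay_id"],
        "segment_count": 1,
        "count_errors": segment.get("count_errors", 0),
        "version": 1,
    }

def merge(state: dict, segment: dict) -> dict:
    # Collate the stored aggregate with the newly received segment.
    return {
        "replay_id": state["replay_id"],
        "segment_count": state["segment_count"] + 1,
        "count_errors": state["count_errors"] + segment.get("count_errors", 0),
        "version": state["version"] + 1,
    }

def process_segment(kv, ch, segment: dict) -> None:
    previous = kv.get(segment["replay_id"])
    if previous is None:
        state = new_state(segment)
    else:
        # Cancel the previously written row: identical values, sign = -1.
        ch.insert("replay_aggregates", {**previous, "sign": -1})
        state = merge(previous, segment)
    kv.set(segment["replay_id"], state)
    # Write the new state row with sign = +1 and the incremented version.
    # Readers then collapse cancel/state pairs, e.g. with FINAL.
    ch.insert("replay_aggregates", {**state, "sign": 1})
```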

**Drawbacks**

- The aggregation process is not expected to be atomic, and we will encounter race conditions where two segments with the same replay-id attempt to mutate the same key at the same time.
- To solve this we will need to partition our Kafka messages by replay-id and process sequentially.
- This has scalability limitations but those limitations are likely to be less than existing limitations.
- Aggregation states can be updated atomically with some databases. See the "Use an Alternative OLAP Database" proposal.
- Alternatively we can tolerate losing aggregation states and continue using parallel consumers.
- Kafka will sometimes produce duplicate messages. If we assume ordering, we can set a requirement that segment_id > previous_segment_id.
- Order cannot be assumed. Valid aggregation states could be lost.
- A delay in scheduling in Relay could cause segment_id 1 to be processed prior to segment_id 0.
- This is expected to be uncommon but not impossible.
- Alternatively, we could tolerate duplicates and accept that segment_count and other values have some margin of error.
- To eliminate the need for grouping queries we would need to supply the `FINAL` keyword.
- Our column types are not ideal for this use case.

**Questions**

### 6. Stateful Streaming using Apache Spark or Similar Service

**Proposal**

Instead of aggregating in ClickHouse we should aggregate our replays in a stateful service such as Apache Spark before writing the final result to ClickHouse.

TODO: Josh to fill in specifics.

**Drawbacks**

- Replays are not available until they have finished.
- There are no existing Apache Spark installations within the Sentry org.
- The Replays team is not large enough or experienced enough to manage a Spark installation.
- This would require another team assuming the burden for us.
- Otherwise, additional budget would need to be allocated to the Replays team to hire outside experts.

> **@bruno-garcia** (Apr 25, 2023): google's dataproc makes this pretty straightforward
>
> what's the alternative for self-hosted? Can we have an adapter so folks can plug spark?

> **Member:** yeah it would be some amount of work but we can put it in a docker container.


**Questions**

### 7. Upgrade ClickHouse Version to Utilize Experimental Features

**Proposal**

Upgrading to a newer version of ClickHouse will enable us to use experimental features such as "Live View" and "Window View".

**Drawbacks**

- Would require Sentry to manage the ClickHouse installation.
- Feature is not present in a stable release.
- Feature is not present on any version of ClickHouse in Altinity.

**Questions**

### 8. Use an Alternative OLAP Database

**Proposal**

OLAP databases such as Apache Pinot support upserts, which appear to be a key requirement for the Replays product. An aggregation state schema can be defined for merging columns in real time (see https://docs.pinot.apache.org/basics/data-import/upsert).

**Drawbacks**

- Changes to the data model would be necessary. We will not be able to aggregate array columns.
- Pinot has ordering constraints.
- If segments arrive after their successor has already been ingested, the aggregated state will not contain those rows.
- There are no existing Apache Pinot installations within the Sentry org.
- The Replays team is not large enough or experienced enough to manage a Pinot installation.
- This would require another team assuming the burden for us.
- Otherwise, additional budget would need to be allocated to the Replays team to hire outside experts.
- We need to re-write our application logic for querying the datastore.
- Migration pains.

**Questions**

### 9. Use an Alternative OLTP Database

**Proposal**

OLTP databases such as PostgreSQL and AlloyDB support updates, which appear to be a key requirement for the Replays product.
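
For illustration, the per-segment write could then be a single upsert; the schema below is an assumption, not a finished data model:

```python
# Hypothetical PostgreSQL upsert maintaining one row per replay. Each
# incoming segment either creates the row or folds itself into it.
UPSERT_REPLAY = """
INSERT INTO replays (replay_id, started_at, finished_at, segment_count, count_errors)
VALUES (%(replay_id)s, %(timestamp)s, %(timestamp)s, 1, %(count_errors)s)
ON CONFLICT (replay_id) DO UPDATE SET
    started_at    = LEAST(replays.started_at, EXCLUDED.started_at),
    finished_at   = GREATEST(replays.finished_at, EXCLUDED.finished_at),
    segment_count = replays.segment_count + 1,
    count_errors  = replays.count_errors + EXCLUDED.count_errors
"""
```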

**Drawbacks**

- Race conditions will require single-threaded processing of replay events.
- Duplicate messages will necessitate ordering requirements.
- There is always a possibility of dropped and duplicated data regardless of safeguards.

> **Member:** what is the tolerance for dropped and duplicated data?

> **cmanallen (Author):** Grouping can handle duplicated data. Dropped data is not ideal but is manageable depending on the data model.

- AlloyDB is still in developer preview on Google Cloud.
- We need to re-write our application logic for querying the datastore.
- Migration pains.

**Questions**

# Selected Outcome

No outcome has been decided.