
rfc(feature): Fix Memory Limitations in Session Replay's Access Pattern #88

Merged

Conversation


@cmanallen commented Apr 24, 2023

TODO. Rendered RFC

@cmanallen force-pushed the rfc/fix-memory-limitiations-in-session-replays-access-pattern branch from 68c7c83 to 01a816d on April 24, 2023 20:49
Comment on lines 158 to 159
- This would require another team assuming the burden for us.
- Otherwise, additional budget would need to be allocated to the Replays team to hire outside experts.
@bruno-garcia (Member) commented Apr 25, 2023:

Google's Dataproc makes this pretty straightforward.

What's the alternative for self-hosted? Can we have an adapter so folks can plug in Spark?

Member:

Yeah, it would be some amount of work, but we could put it in a Docker container.


**Drawbacks**

- To search for a unique value a user would need to issue a query for each day of the retention period or until the value was found.
- Small customers would not see any replays on their index page due to the limited window size.
Member:

I think you could probably issue multiple queries in this case (see the sketch after this list):

  1. Fetch a small initial window (past 24 hours).
  2. If you get more than N results, stop.
  3. If you get fewer than N results, query again for another time window (last week minus the last 24 hours).
  4. Repeat step 3 until you get N results.
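
A minimal sketch of that loop, assuming a hypothetical `query_replays` helper in place of the real index-page query:

```python
from datetime import datetime, timedelta, timezone

def query_replays(start, end, limit):
    """Placeholder for the real index-page query against the replays dataset."""
    return []

def fetch_replays(n, retention_days=90):
    # Progressively widen the scan range until we have N results or we
    # have covered the full retention period.
    windows = [timedelta(hours=24), timedelta(days=7), timedelta(days=retention_days)]
    now = datetime.now(timezone.utc)
    results, covered_until = [], now
    for window in windows:
        start = now - window
        # Only scan the slice not already covered by the previous, smaller window.
        results += query_replays(start, covered_until, limit=n - len(results))
        if len(results) >= n:
            break
        covered_until = start
    return results
```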


**Proposal**

- Normalize schema.
Member:

Can you elaborate on this?

Member Author:

No, I think it should be removed.


# Motivation

Our goal is to make the Session Replay product highly scalable. We expect the proposed changes to eliminate our OOM issues and improve performance for customers with large amounts of ingested data.
Member:

If I understand the solution options correctly, I didn't see this considered: would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?

Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary. I assume the number of replays we display on this view will also drive how much we fetch on sort/filter actions and retention-period changes.

Member Author:

> Would limiting the number of replays shown on the Index page materially improve the OOM issues or the probability of them occurring?

Unfortunately, no. What's displayed and what was actually aggregated in the database are not the same in this case. The "scan range" (the stats period value: 14 days, 24 hours, 90 days, etc.) is what creates these OOM issues.

> Right now we display 50 replays upon load of this page, which is *a lot* and seems unnecessary.

If it makes the product better, we should do it! Reducing the row set will have a positive impact on performance.


**Proposal**

Use a cron job, which runs at some `interval` (e.g. 1 hour), that selects the finished replays from the last `interval`, aggregates them, and writes them to a materialized view or some destination table. We then alter the index page to reflect two different datasets: a "live" dataset and a "video-on-demand" dataset. A "live" page would fetch replays from the last `interval`. A "video-on-demand" page would function similarly to the current replays index page, but it would only contain data that has been aggregated by the cron job.
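
A minimal sketch of what such a job might look like, assuming a hypothetical `replay_segments` source table and `replays_aggregated` destination table (the real schema and scheduling mechanism may differ):

```python
from clickhouse_driver import Client

# Roll the last interval's replays into the destination table. A real job
# would also need a rule for "finished" (e.g. no new segments for longer
# than the session timeout).
AGGREGATE_SQL = """
INSERT INTO replays_aggregated
SELECT
    replay_id,
    min(timestamp)    AS started_at,
    max(timestamp)    AS finished_at,
    sum(count_errors) AS count_errors,
    sum(activity)     AS activity
FROM replay_segments
WHERE timestamp >= now() - INTERVAL 1 HOUR
  AND timestamp <  now()
GROUP BY replay_id
"""

def run_aggregation_job():
    # Scheduled to run once per `interval`, e.g. via cron or Celery beat.
    Client("clickhouse-host").execute(AGGREGATE_SQL)
```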
Member:

I can't speak to the details of the cron job described in this solution, but I would be in favour of a solution proposal that does not display "in progress" replays on the Index page by default, as this leads to many short (and "unfinished") replays for customers that have a high volume.

Member Author:

> I would be in favour of a solution proposal that does not display "in progress" replays on the Index page by default

This is good to know. This gives us a lot of flexibility to solve the problem.

From a product perspective I agree as well. Some of our large customers have low-utility index pages because of the number of new replays.


Use `optimize_aggregation_in_order` to limit the number of rows we need to aggregate. In testing, this eliminates OOM errors, and queries complete reasonably quickly.
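
For reference, the setting is applied per query. A hedged sketch via clickhouse-driver, with illustrative table and column names; note that the setting only helps when the `GROUP BY` key is a prefix of the table's sorting key:

```python
from clickhouse_driver import Client

client = Client("clickhouse-host")
rows = client.execute(
    """
    SELECT replay_id,
           min(timestamp)    AS started_at,
           sum(count_errors) AS count_errors
    FROM replay_segments
    WHERE project_id = %(project_id)s
    GROUP BY replay_id
    LIMIT 50
    """,
    {"project_id": 1},
    # Stream the aggregation in table order instead of building the whole
    # hash table in memory.
    settings={"optimize_aggregation_in_order": 1},
)
```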

**Drawbacks**
Member:

It would be good to mention here how this may show incorrect data for replays with a timestamp around the partition cutoff time.


**Drawbacks**

There are aggregated values which users want to sort and filter on: columns such as `count_errors` and `activity`, which are prominently featured on our index page. Sorting and filtering by these values disables the preflight query and only runs the aggregated query.
Member:

Examples of these queries would be very helpful for designing a performant solution. As someone who doesn't use Replay, it's hard for me to map these abstract concepts onto a material reality.


The non-aggregated query does not allow **exclusive** filter conditions against **non-static** columns. For example, we cannot say "find a replay where this url does not exist". The query will find _rows_ where that condition is true, but it will not find _replays_ where that condition is true.
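
To illustrate the distinction, a sketch of the two query shapes, with illustrative table and column names (assuming `urls` is an array column on each segment row):

```python
# Row-level filter: matches *rows* missing the URL. A replay whose other
# segments do contain the URL will still slip through.
ROW_QUERY = """
SELECT DISTINCT replay_id
FROM replay_segments
WHERE NOT has(urls, 'https://example.com/checkout')
"""

# Aggregated filter: collapses segments per replay first, so the exclusive
# condition holds for the *replay* as a whole.
REPLAY_QUERY = """
SELECT replay_id
FROM replay_segments
GROUP BY replay_id
HAVING sum(has(urls, 'https://example.com/checkout')) = 0
"""
```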
Member:

What makes a column static or not?

Member Author:

Discussed on call. Static means the value of the column on segment 0 can be presumed to be the value for every segment.


The Session Replay index page will run out of memory when processing queries for our largest customers. This document contains proposed solutions for addressing this shortcoming.

# Motivation
Member:

What is the expected timeline for this? Those constraints would be good to know when evaluating solutions.


- Race conditions will require single-threaded processing of events with the same replay-id.
- Duplicate messages will necessitate ordering requirements.
- There is always a possibility of dropped or duplicated data regardless of safeguards.
Member:

What is the tolerance for dropped and duplicated data?

Member Author:

Grouping can handle duplicated data. Dropped data is not ideal but is manageable depending on the data model.

@evanh (Member) left a comment:

One thing that is missing from this is any solution involving the client-side SDK. I think we've talked about this on Slack, but documenting that discussion would be useful.

For example, why doesn't the client SDK set a "final" flag on the last segment of a replay? The SDK could also update each segment with the data from all the previous segments, so that the "final" segment would effectively be a fully formed replay sent all at once. We could then store that directly in a replay table and/or filter on the "final" flag.

@cmanallen (Member Author):

@evanh Updated with two new proposals. Both rely on the SDK to buffer metadata.


**Questions**

- Is it possible for the SDK to know when it's done and mark the request object with a final attribute?
Member Author:

@billyvg Can we assess this? Is it possible for the SDK to know when the replay has finished? If so, do we have any idea how reliable it is?

Member:

@cmanallen It's possible, but potentially unreliable. Some cases off the top of my head:

- User closes tab -> mark as final
  - `sendBeacon`/`fetch` w/ keepalive should generally be able to send off this request, though it's likely we lose the last segment
- User switches tab and goes idle -> set a timer based on session expiration to mark as final
  - Timer drift? Especially with a background tab?
- User turns off computer
  - We can potentially detect when the computer/browser is "active" again and mark as final
- Network flakiness

**Drawbacks**

- Requires SDK upgrade.
- API will need to fall back to aggregating behavior if no final segments can be found (or otherwise detect old SDK usage prior to querying).
Member:

This is only true if you want to show "in progress" replays. If we exclude those then this isn't a problem.

Member Author:

Old SDKs won't set the final attribute and will require us to aggregate. Eventually we could drop this requirement, but only after some portion of customers have upgraded. A sketch of the fallback is below.
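
A hedged sketch of that fallback, assuming a hypothetical `is_final` column and illustrative table and column names:

```python
FINAL_SQL = """
SELECT replay_id, timestamp, count_errors
FROM replay_segments
WHERE project_id = %(project_id)s AND is_final = 1
LIMIT 50
"""

AGGREGATED_SQL = """
SELECT replay_id,
       min(timestamp)    AS started_at,
       sum(count_errors) AS count_errors
FROM replay_segments
WHERE project_id = %(project_id)s
GROUP BY replay_id
LIMIT 50
"""

def query_index_page(client, project_id):
    # Prefer cheap final-segment rows when the SDK provides them.
    rows = client.execute(FINAL_SQL, {"project_id": project_id})
    if rows:
        return rows
    # Old SDKs never set is_final, so fall back to the aggregating query.
    return client.execute(AGGREGATED_SQL, {"project_id": project_id})
```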


**Questions**

- Will using `FINAL` be a problem? Is it a big deal relative to the problems we're experiencing currently?
Member:

In this context, the `FINAL` keyword will have the same effect as the `GROUP BY replay_id`. Under the hood, ClickHouse is doing the same thing: collapsing all the parts in memory to collect the correct result.

Member Author:

Are we sure it's exactly the same? My reading of the documentation suggests that it would return the highest-version row, whereas `GROUP BY` will return an aggregation of the non-collapsed rows.
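
For comparison, the two forms side by side (illustrative names; `FINAL` semantics depend on the table engine, e.g. ReplacingMergeTree keeps the latest version per sorting-key value):

```python
# FINAL collapses rows at read time: one surviving row per sorting-key
# value (for ReplacingMergeTree, the highest-version row).
FINAL_SQL = """
SELECT replay_id, count_errors
FROM replay_segments FINAL
WHERE replay_id = %(replay_id)s
"""

# GROUP BY instead aggregates across all of the uncollapsed segment rows.
GROUP_BY_SQL = """
SELECT replay_id, sum(count_errors) AS count_errors
FROM replay_segments
WHERE replay_id = %(replay_id)s
GROUP BY replay_id
"""
```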


Any option may be accepted in whole or in part. Multiple options can be accepted to achieve the desired outcome.

### 1. Change the Product's Query Access Pattern
@volokluev (Member) commented May 9, 2023:

Infra Burden: Low
Customer Impact: High
Maintainability: Low


**Questions**

### 2. Reduce the Scan Range on the Front End
Member:

Infra Burden: Low
Customer Impact: Medium
Maintainability: Low

Notes: most people will query over a 14-day window.


**Questions**

### 3. Reduce the Scan Range on the Back End
Member:

Scratch this; it probably won't solve the problem.


**Questions**

### 4. Normalize Schema and Remove Snuba Join Restriction
Member:

🔪 Remove


**Questions**

### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store
Member:

Infra Burden: Medium

Suggested change:
- ### 5. Migrate to CollapsingMergeTree Table Engine and Pre-Aggregate Data in Key-Value Store
+ ### 5. Make a replays table as opposed to a replay_segments table

Infra burden: High
Customer Impact: Low
Maintainability: High

Notes: If we have a replays table instead of a replay_segments table, we can filter better even if we aggregate the same number of replays.


- Colton Allen asks: Is it possible to write to tables partitioned by the hour? When the hour expires, the table is indexed for fast read performance. Replays on the most recent hour partition would be unavailable while we are writing to it. Does PostgreSQL expose behavior that allows us to query over multiple partitions transparently? Is this even necessary given our scale? We are currently processing 200 messages per second.
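
On the PostgreSQL question above: declarative partitioning (PostgreSQL 10+) does let queries against the parent table transparently prune to the matching partitions. A minimal sketch, with illustrative names:

```python
CREATE_PARENT = """
CREATE TABLE replays (
    replay_id uuid        NOT NULL,
    timestamp timestamptz NOT NULL
) PARTITION BY RANGE (timestamp);
"""

# One child table per hour; queries against `replays` are automatically
# routed to the partitions that overlap the WHERE clause.
CREATE_HOURLY_PARTITION = """
CREATE TABLE replays_2023_04_24_h20 PARTITION OF replays
    FOR VALUES FROM ('2023-04-24 20:00+00') TO ('2023-04-24 21:00+00');
"""
```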

### 10. Manually Manage An Aggregated Materialized View
Member:

Infra burden: Medium (cron-job-generated inserts are a new mechanism)
Maintainability: Medium
Customer Impact: Low/Medium (depending on whether the product changes)

Questions:
Do we need to merge the last hour's segments with the aggregated data?

@cmanallen marked this pull request as ready for review May 22, 2023 15:37
@cmanallen merged commit 14f1063 into main May 22, 2023
@cmanallen deleted the rfc/fix-memory-limitiations-in-session-replays-access-pattern branch May 22, 2023 15:39