Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow to respond concurrent requests for new shapes #1785

Open
msfstef opened this issue Oct 3, 2024 · 1 comment
Open

Slow to respond concurrent requests for new shapes #1785

msfstef opened this issue Oct 3, 2024 · 1 comment
Assignees

Comments

@msfstef
Copy link
Contributor

msfstef commented Oct 3, 2024

When multiple concurrent requests for new shapes come in, requests and responses are all routed through the ShapeCache causing delays and flooding of the process mailbox. Requests can instead be routed to individual shape worker processes to handle their needs, like waiting for snapshots and requesting a log etc.

The goal is to move stuff off of the critical path of the requests and enable quick responses, even if the responses are 429s because of e.g. pool connection resource limitations.

@msfstef
Copy link
Contributor Author

msfstef commented Oct 3, 2024

#1787 should address sending requests directly to the shape workers

There's more work that's easily done to remove responsibilities from ShapeCache, but I'd rather do it after merging the above PR.

msfstef added a commit that referenced this issue Oct 9, 2024
Addresses #1785 and
partially addresses #1770

Moves a lot of the operations that went through `ShapeCache` directly
into the `Shape.Consumer`, so that requests can be replied to directly
from the shape consumers rather than flooding the `ShapeCache` with
casts that take a while to reach the requesters.

I've tried to keep changes to a minimum in order to do this
incrementally and keep these PRs easily reviewable - the `ShapeStatus`
still persists data on every call, the relations and truncates still go
through `ShapeCache` rather than individual shapes, etc

I've also caught the `DBConnection.ConnectionError`s for queue timeouts
and converted them to 429 errors.
We need to also handle `GenServer.call` timeouts as sometimes the PG
query might not fail but take longer than the default 5 seconds for the
GenServer call.


NOTE: I have not updated any tests yet as I first want to ensure people
agree with the approach

PERFORMANCE CHECK:
- On my local machine, using in memory stores, running 1000 concurrent
new shape connections consistently took ~20sec with these changes,
compared to the ~33sec on main, so a ~30% improvement.
- I was also able to succesfully run 10k concurrent connections with
this, although it took ~10min to serve, but on main I wasn't able to
succsefully run it (@robacourt I think that was the case for you too?) -
at least we know it does not get into an unrecoverable state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants