
feat: Move operations from ShapeCache to Shape.Consumer #1787

Merged
merged 7 commits into main on Oct 9, 2024

Conversation

msfstef
Contributor

@msfstef msfstef commented Oct 3, 2024

Addresses #1785 and partially addresses #1770

Moves a lot of the operations that went through ShapeCache directly into the Shape.Consumer, so that requests can be replied to directly from the shape consumers rather than flooding the ShapeCache with casts that take a while to reach the requesters.

I've tried to keep changes to a minimum in order to do this incrementally and keep these PRs easily reviewable - the ShapeStatus still persists data on every call, the relations and truncates still go through ShapeCache rather than individual shapes, etc.

I've also caught the DBConnection.ConnectionErrors for queue timeouts and converted them to 429 errors.
We also need to handle GenServer.call timeouts, as sometimes the PG query might not fail but take longer than the default 5 seconds for the GenServer call.
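The two failure modes described above could be handled roughly as in the following sketch - note this is illustrative only, and the module and function names (`call_with_overload_handling/2`) are hypothetical, not taken from the actual diff:

```elixir
# Hedged sketch: convert pool-overload and call-timeout failures to 429s.
# `call_with_overload_handling/2` is a hypothetical helper, not the PR's code.
defp call_with_overload_handling(conn, fun) do
  try do
    fun.()
  rescue
    DBConnection.ConnectionError ->
      # Pool queue timeout: the connection pool is exhausted,
      # so ask the client to back off and retry.
      Plug.Conn.send_resp(conn, 429, "Too many concurrent requests")
  catch
    :exit, {:timeout, {GenServer, :call, _}} ->
      # The GenServer.call itself timed out (default 5s) even though
      # the underlying PG query may not have failed.
      Plug.Conn.send_resp(conn, 429, "Too many concurrent requests")
  end
end
```

The `catch :exit` clause matters because a `GenServer.call` timeout surfaces as a process exit rather than a raised exception, so a `rescue` alone would not see it.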

NOTE: I have not updated any tests yet as I first want to ensure people agree with the approach

PERFORMANCE CHECK:

  • On my local machine, using in-memory stores, running 1000 concurrent new shape connections consistently took ~20sec with these changes, compared to ~33sec on main - a ~30% improvement.
  • I was also able to successfully run 10k concurrent connections with this, although it took ~10min to serve; on main I wasn't able to run it successfully at all (@robacourt I think that was the case for you too?) - at least we know it does not get into an unrecoverable state.

@msfstef msfstef requested review from alco and robacourt October 3, 2024 09:30

netlify bot commented Oct 3, 2024

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit edf9eed
🔍 Latest deploy log https://app.netlify.com/sites/electric-next/deploys/66feb82caa29010008b9c0e0
😎 Deploy Preview https://deploy-preview-1787--electric-next.netlify.app

@@ -293,9 +296,12 @@ defmodule Electric.ShapeCache do
{:reply, :started, [], state}

true ->
Logger.debug("Starting a wait on the snapshot #{shape_id} for #{inspect(from)}}")
GenServer.cast(
Contributor Author

We could avoid going through the shape cache and directly go to a shape consumer, but for the sake of incrementally refactoring this I kept it like this as I think there are more concurrency considerations in the other case

Consumer.Supervisor.clean_and_stop(%{
electric_instance_id: electric_instance_id,
shape_id: shape_id
})
Contributor Author

I was unsure if we need to also have a fallback here to forcefully do DynamicSupervisor.stop(name, pid) - I figured if the Supervisor is alive it should be able to handle its own shutdown gracefully

Member

I'm not entirely sure what you're referring to here. The code feels weird in the sense that the passed name argument is not used but a name is constructed for the consumer process. It's not your changes, it was weird before them.

It's not important to dwell on this right now, IMO. We can clear up process lifecycles at another incremental step of the overall move from ShapeCache to individual shape process trees.

Contributor Author

I think the idea was that we would terminate processes via the dynamic supervisor, hence the name being passed - but for some reason we check if the consumer is alive (rather than handling the error from the DynamicSupervisor)

I've changed things around such that processes clean up after themselves, so an external termination would only be a fallback

Either way, I agree that we should do these in steps to keep things reviewable!

Member

@alco alco left a comment

Looking great so far 👍

@msfstef msfstef marked this pull request as ready for review October 3, 2024 15:27
@msfstef
Contributor Author

msfstef commented Oct 3, 2024

@balegas PR should be ready. @robacourt has been kind enough to kick off a benchmark run for this PR, so we can see results on Monday and hopefully merge (?). I expect an improvement in concurrent shape creation and no regressions on other benchmarks - let's see.

@balegas
Contributor

balegas commented Oct 3, 2024

Yeah, looking forward to that! If the benchmarks show anything odd, we'll be glad we ran them beforehand :).

Contributor

@robacourt robacourt left a comment

Great work! This shows a massive speed improvement for concurrent shape creation while the other benchmarks remain unchanged:
[Screenshot: benchmark results, 2024-10-08]

@msfstef msfstef merged commit edb0f72 into main Oct 9, 2024
23 checks passed
@msfstef msfstef deleted the msfstef/handle-overloads-gracefully branch October 9, 2024 08:17
msfstef added a commit that referenced this pull request Oct 10, 2024
Fixes #1770

With #1787 we've managed to
return 429s whenever there's too many concurrent shape creations that
cause the database connection pool to be exhausted.

This PR just ensures that the client does indeed retry on 429s - for now
just with our regular exponential backoff, as there is no standard for
retry headers to respect.
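The retry behaviour described above - exponential backoff with no retry headers to consult - could be sketched as follows. The actual client is TypeScript; this is an illustrative Elixir sketch with hypothetical names (`request_with_backoff/2` and the delay constants are assumptions):

```elixir
# Hedged sketch of retry-on-429 with exponential backoff and jitter.
# Names and constants are illustrative, not the client's actual code.
defmodule BackoffExample do
  @initial_delay_ms 100
  @max_delay_ms 10_000

  def request_with_backoff(fun, attempt \\ 0) do
    case fun.() do
      {:ok, resp} ->
        {:ok, resp}

      {:error, %{status: 429}} ->
        # Double the delay on each attempt, capped at the maximum,
        # with some jitter so retrying clients don't synchronise.
        delay = min(@max_delay_ms, @initial_delay_ms * Integer.pow(2, attempt))
        Process.sleep(delay + :rand.uniform(div(delay, 2)))
        request_with_backoff(fun, attempt + 1)

      {:error, other} ->
        {:error, other}
    end
  end
end
```

The jitter is the important detail: without it, every client that received a 429 at the same moment would retry at the same moment, recreating the overload the 429 was meant to shed.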

P.S. additional changes to the openapi spec done by my formatter 👀 I can
roll them back if you think they are worse than before