
Split apart web server and build server #795

Closed
jyn514 opened this issue May 31, 2020 · 15 comments · Fixed by #2038
Labels
E-hard (Effort: This will require a lot of work)
S-blocked (Status: marked as blocked ❌ on something else such as an RFC or other implementation work)
S-needs-design (Status: There's a problem here, but no obvious solution; or the solution raises other questions)

Comments

@jyn514
Member

jyn514 commented May 31, 2020

Currently, all web requests are served from the same server that performs the crate builds. This has given us trouble in the past, since builds consume a lot of resources, driving up response times for web requests. It would be useful to instead have different servers for the website and builds.

Possible implementation:

  • The web server continues having primary access to the database. Builders don't have access at all; instead, they receive webhooks (or similar) telling them to start a build. When the build finishes, they report back to the main server.
  • We have a 'heartbeat' mechanism similar to crater: if a build server doesn't send a message to the web server within 5 minutes, we mark it as crashed.
  • We add a new state for crates in the queue, so the states become queued, assigned, and finished, where assigned means a build server is currently building that crate (Add crates to the database before building them #1011). A sketch of these states and the heartbeat check follows this list.
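
A minimal sketch of what those states and the heartbeat timeout could look like, assuming an in-memory representation (all names here are hypothetical, not existing docs.rs code):

```rust
use std::time::{Duration, Instant};

/// Hypothetical queue states as described above.
enum QueueState {
    Queued,
    /// A build server is currently building this crate.
    Assigned { builder_id: u32, last_heartbeat: Instant },
    Finished,
}

/// Crater-style heartbeat check: if an assigned builder has been
/// silent for five minutes, treat it as crashed and re-queue the crate.
fn check_heartbeat(state: &mut QueueState) {
    const TIMEOUT: Duration = Duration::from_secs(5 * 60);
    if let QueueState::Assigned { last_heartbeat, .. } = state {
        if last_heartbeat.elapsed() > TIMEOUT {
            *state = QueueState::Queued;
        }
    }
}
```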

Benefits:

@pietroalbini has experience with implementing this for crater and is helping a lot with the design aspect.

@jyn514 added the E-hard and S-needs-design labels Jun 27, 2020
@syphar
Member

syphar commented Apr 1, 2021

Just a thought: wouldn't a possible intermediate step be to just add a second server that uses the same database and bucket? The local index repo could also move to the build server.

Then the webserver would benefit from having less load, and we can slowly work on the webhook logic.

@Nemo157
Member

Nemo157 commented Apr 1, 2021

IIRC the blocker last time I touched this was the metrics: currently the builder metrics are merged into the web server metrics to get them to Prometheus. We would either need the metrics from the builder to be stored in the database and then served by the web server, or pushed directly from the builder to Prometheus somehow.

Once those are sorted, running two processes for this should be trivial.

@pietroalbini
Member

A possible approach for the metrics would be to run a really small webserver out of the builder that just exposes a /metrics endpoint.
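
A minimal sketch of that endpoint, assuming the prometheus and tiny_http crates (the port and the use of the default registry are assumptions, not how docs.rs is actually wired up):

```rust
use prometheus::{Encoder, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Bind a tiny HTTP server next to the build loop; Prometheus
    // scrapes this directly instead of going through the web server.
    let server = tiny_http::Server::http("0.0.0.0:9090")?;
    for request in server.incoming_requests() {
        if request.url() != "/metrics" {
            request.respond(tiny_http::Response::empty(404))?;
            continue;
        }
        // Encode everything registered in the default registry
        // into the Prometheus text exposition format.
        let mut buf = Vec::new();
        TextEncoder::new().encode(&prometheus::gather(), &mut buf)?;
        request.respond(tiny_http::Response::from_data(buf))?;
    }
    Ok(())
}
```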

@syphar added the S-blocked label Jan 9, 2022
@syphar
Member

syphar commented May 10, 2022

Just a thought, not sure if we should do it in the first version already:

when the goal is to be able to just rebuild instances via ansible, then we should perhaps start

  • moving state from local files to the database (for example the last seen commit of the index?)
  • moving scripts from the server to the repo (or any repo)

IMO as long as we only have one build server (that also manages the queue & crates-index-diff) we don't need #1011 and we can just reboot / recreate the machine.

There would be an initial performance hit on every new build server (downloading the toolchain, docker image, etc.), and also for every new webserver (local archive indexes come to mind; not sure if there is more).

For webservers it would theoretically be possible to run multiple machines behind an ELB, though I'd actually prefer better caching on the CDN (#1552 and/or #1560).

@syphar
Member

syphar commented Jul 16, 2022

Some additional thoughts on this: steps to be able to horizontally scale the build server.

separate registry watcher

We move the registry watcher into a separate container / process that exists only once, additionally saving the last-seen ref in the database so that it survives restarts and we can just recreate the machine any time we want.
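
A minimal sketch of that persistence with the postgres crate; the config table and key name are hypothetical, not existing docs.rs schema:

```rust
use postgres::Client;

/// Store the last-seen registry ref so a freshly recreated machine
/// can pick up where the previous watcher left off.
fn save_last_seen_ref(client: &mut Client, git_ref: &str) -> Result<(), postgres::Error> {
    client.execute(
        "INSERT INTO config (name, value) VALUES ('last_seen_index_ref', $1)
         ON CONFLICT (name) DO UPDATE SET value = $1",
        &[&git_ref],
    )?;
    Ok(())
}

/// Load it again on startup; None means a fresh deployment.
fn load_last_seen_ref(client: &mut Client) -> Result<Option<String>, postgres::Error> {
    let row = client.query_opt(
        "SELECT value FROM config WHERE name = 'last_seen_index_ref'",
        &[],
    )?;
    Ok(row.map(|row| row.get(0)))
}
```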

allow multiple build servers

  1. move the queue-lock into the database
  2. allow multiple builders by using SELECT FOR UPDATE SKIP LOCKED, and wrapping the build in a transaction

Step 2 should scale well enough for a low number of build servers (<50) and wouldn't need much change to the current logic.
This would even allow autoscaling the build servers, though of course the initial setup on a fresh machine means its first build would take some time. A sketch of the locking pattern follows below.
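
A minimal sketch of step 2 with the postgres crate; the queue table layout here is hypothetical:

```rust
use postgres::Client;

/// Claim the next queued crate. SKIP LOCKED lets several builders run
/// this concurrently: each gets a different row instead of blocking on
/// rows other builders hold, and keeping the transaction open for the
/// whole build doubles as the queue lock.
fn claim_and_build(client: &mut Client) -> Result<(), postgres::Error> {
    let mut tx = client.transaction()?;
    if let Some(row) = tx.query_opt(
        "SELECT id, name, version FROM queue
         WHERE state = 'queued'
         ORDER BY priority
         LIMIT 1
         FOR UPDATE SKIP LOCKED",
        &[],
    )? {
        let id: i32 = row.get("id");
        let name: String = row.get("name");
        let version: String = row.get("version");
        println!("building {name} {version}"); // stand-in for the real build
        tx.execute("UPDATE queue SET state = 'finished' WHERE id = $1", &[&id])?;
    }
    tx.commit()?;
    Ok(())
}
```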

@pietroalbini
Member

We move the registry watcher into a separate container / process that exists only once, additionally saving the last-seen ref in the database so that it survives restarts and we can just recreate the machine any time we want.

That's a great idea; it could be deployed to ECS.

@syphar
Member

syphar commented Jul 24, 2022

In addition to the separate registry watcher, we could also have a separate job for the repository stats updater. Strike that: we already have cratesfyi database update-repository-fields, which does exactly what the daemon triggers once an hour.

I assume ECS has something that can run a command every hour? @pietroalbini

@pietroalbini
Member

I would create a single binary that includes the registry watcher and update-repository-fields all together, tbh. ECS has something to run periodic tasks, but if we can avoid more pieces of infra I'd be happier.
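
A minimal sketch of that single binary, assuming tokio; watch_registry and update_repository_fields are hypothetical stand-ins for the existing watcher and stats-updater logic:

```rust
use std::time::Duration;

async fn watch_registry() { /* the existing registry-watcher loop */ }
async fn update_repository_fields() { /* the existing stats updater */ }

#[tokio::main]
async fn main() {
    // Run the hourly stats update as a background task, so a single
    // binary (and a single ECS service) covers both periodic jobs.
    tokio::spawn(async {
        let mut tick = tokio::time::interval(Duration::from_secs(60 * 60));
        loop {
            tick.tick().await; // first tick fires immediately
            update_repository_fields().await;
        }
    });

    // The registry watcher runs as the main task.
    watch_registry().await;
}
```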

@syphar
Member

syphar commented Jul 28, 2022

I would create a single binary that includes the registry watcher and update-repository-fields all together, tbh. ECS has something to run periodic tasks, but if we can avoid more pieces of infra I'd be happier.

that's possible, thank you for the input!

@rylev
Member

rylev commented Feb 8, 2023

@syphar does #1785 close this issue?

@Nemo157
Member

Nemo157 commented Feb 8, 2023

One additional thing that might make sense is to do a complete redesign of the web/build/watcher entrypoints now, before we move to the new infra. The current CLI interface doesn't entirely make sense with the separation. Probably by making sure we have good new entrypoints used by the new infra, then deleting the daemon entrypoint used in current production. (And maybe even separating into multiple binaries containing just the web/build/watcher/utility parts of the codebase, instead of one combined binary.)

@rylev
Member

rylev commented Feb 8, 2023

@Nemo157 there is already the start-build-server command for running the build server independently of the web server or registry watcher. While I think separating into different binaries could be useful, I don't see how waiting on that to set up this infrastructure actually makes our lives easier.

@Nemo157
Member

Nemo157 commented Feb 8, 2023

I don't think we should wait on it before setting up the infrastructure; it'd make more sense as a back-and-forth of "here's the independent services we are running on the new infrastructure, how can we improve the interface for them", and to try to get that done before we switch over production. (I guess we're most of the way there with start-{{build,web}-server,registry-watcher}; I forgot that #1785 had already pulled the registry-watcher + stats-updater out too.)

@rylev
Member

rylev commented Feb 8, 2023

It probably makes sense to close this issue then, as the split has successfully been done. Perhaps we can then create new issues for improving the interface further.

@syphar
Member

syphar commented Feb 8, 2023

@syphar does #1785 close this issue?

While the current design is not what was envisioned in the initial issue description, I would say we're nearly there.

The only piece missing IMO is the metrics endpoints.
