Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add remote persistent worker support #787

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

aherrmann
Copy link
Contributor

Closes #776.

Implements support for persistent workers in remote builds using the Bazel remote execution protocol and the approach documented in the Bazel remote persistent workers proposal:
https://github.com/bazelbuild/proposals/blob/main/designs/2021-03-06-remote-persistent-workers.md

Includes an example setup that works with

  • local builds without persistent worker
  • local builds with persistent worker (Buck2 protocol)
  • remote builds without persistent worker
  • remote builds with persistent worker (Bazel protocol)

The Bazel remote persistent worker protocol includes an automatic fallback in cases where the remote execution system does not yet support persistent workers. To that end actions take the shape

WORKER WORKER_ARGS... @REQUEST_ARGS_FILE

The remote execution system separates worker arguments on the command-line from request arguments in the response file and adds the --persistent_worker flag.

The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's WORKER_SOCKET, Bazel's --persistent_worker flag, or neither is set.

The example includes a README with detailed instructions how to test this feature.

  • Add example for remote persistent workers
  • Implement support for remote persistent workers

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 25, 2024
Copy link
Contributor

@sluongng sluongng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from BuildBuddy side.

CI failed due to a Github outage yesterday, but I think a rebase + push should retry it.

Comment on lines 23 to 24
export BUILDBUDDY_CONTAINER_USER=... # GitHub user name
export BUILDBUDDY_CONTAINER_PASSWORD=... # GitHub access token
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to note that this is optional if the container image you are using is publicly downloadable.

remote_execution_properties = {
"OSFamily": "Linux",
"container-image": image,
"workload-isolation-type": "podman",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: don't need to set isolation type specifically. We(BuildBuddy) may want to change the default isolation type underneath while maintaining backward compatibility. (In fact, we did recently stopped using podman as default isolation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to set it this way because I got a credentials error on image download with the default.

@christolliday
Copy link
Contributor

christolliday commented Oct 1, 2024

Thanks for the PR! This looks good.

We started working on support for this internally but unfortunately it doesn't match the bazel spec exactly. For reasons that aren't entirely clear to me we can't attach the 'worker key' to the platform so it's attached elsewhere on the action.

I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior.

The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?). Similarly we can support both in 'bazel mode' though.

Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', but I suppose we can deal with that when we get to it.

If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is.

@aherrmann
Copy link
Contributor Author

I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior, though.

That's great to hear! Yes, I was hoping for something along those lines.

The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?).

In the Bazel version the worker key is used to associate a given action with a potentially already running worker instance on a remote executor node. But, it is not directly tied to any kind of previously uploaded blob. Bazel calculates a digest of the worker command and its inputs and uses that as a worker key. In this PR I went for the same approach.

Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', [...] If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is.

That makes sense. I'll look into how to test this on the CI.

@aherrmann
Copy link
Contributor Author

I noticed that the example did not use the WorkerRunInfo attributes worker and exe appropriately to distinguish between (remote) persistent worker and non-worker modes respectively. I've updated the example accordingly. This highlighted that the implementation did wrongly use request.all_args_vec for the remote persistent worker case, when it should be composing worker.exe and request.args instead. I've updated the implementation accordingly.

I will continue looking into ways to test this feature on CI.

@aherrmann
Copy link
Contributor Author

Just to give a heads up on expected progress here. I'm on leave for the next two weeks and will get back to this when I'm back.

I've started making the setup independent on Nix, so that it is easier to integrate with CI here. That's already working locally.
The other thing is the remote execution system to test the remote persistent worker mode. I spoke with BuildBuddy, it would be possible to use BuildBuddy for this CI use-case, but it would require setting up a free account for the Meta Buck2 repository and configuring an access token for CI. @christolliday would that be possible for Meta to do? Otherwise, I'd have to look into other options to set up a compatible remote execution system on CI.

@christolliday
Copy link
Contributor

Hi @aherrmann, @KapJI is looking into setting up a build buddy account.

`WorkerRunInfo` has two fields `worker` and `exe` for the persistent worker command or the
non-worker mode command respectively. This commit changes the remote persistent worker example to
use these appropriately to distinguish between worker and non-worker mode execution.
The Buck2 Rust implementation needs to use the worker command instead of the non-worker command for
remote persistent worker execution mode.
We want to use a Buck2 managed hermetic Python toolchain. However, Python binaries generated by such
a toolchain are sensitive to `PWD` and don't work in other working directories than the repository
root. Unfortunately, `genrule` changes directory and is hence incompatible with such Python
binaries. This commit defines a dedicated rule to invoke protoc without changing the working
directory.
The previous setup used Nix to provide a Python toolchain and packages in a reproducible fashion.
However, this requires dedicated remote worker images with the Nix store paths pre-populated which
complicates the setup for testing on Buck2 CI.

Using a hermetic Python toolchain avoids these issues and works on the standard remote execution
image.
Remove the old genrule targets to generate the Python gRPC/protobuf bindings.
Without Nix it is no longer required to use a custom remote worker image.
It's useful to test multiple builds in parallel to catch potential issues related parallel worker
requests. However, we also don't need to be excessive in the number of tests to not unnecessarily
waste resources.
Requires a repository secret to be set up for the BuildBuddy API key named `BUILDBUDDY_API_KEY`.
@aherrmann
Copy link
Contributor Author

@christolliday @KapJI I've rebased this PR and added the changes to make it independent of Nix so we no longer require a custom worker image. I've also added a CI test for the persistent worker examples, the test requires a GitHub secret named BUILDBUDDY_API_KEY to hold the BuildBuddy token. I've tested the test script locally, but things may still fail under the GitHub actions environment. Please let me know when the BuildBuddy account is set up and a token is added as a GitHub secret, then I can test and debug the CI configuration.

@KapJI
Copy link
Member

KapJI commented Oct 29, 2024

I added BUILDBUDDY_API_KEY which holds my 20 chars API token to our github secrets. Does my BuildBuddy account need any extra setup?

@aherrmann
Copy link
Contributor Author

aherrmann commented Oct 29, 2024

@KapJI Thank you! That sounds great, I don't think it should require any additional configuration. I'll test and debug the CI setup and let you know if I run into anything.

@aherrmann
Copy link
Contributor Author

One issue I'm encountering is that repository secrets are not exposed to GH actions runs that are initiated from forks (as is the case with this PR) for security reasons (see here). Here's what I've done now:

  • I've added a CI step for the persistent worker steps.
  • In there I check if the token is available or not. If not, I skip the remote execution tests and generate a GH actions notice annotation. So, by default the remote execution cases will only be checked on main CI or PRs coming from Meta engineers.
  • I've added the workflow_dispatch trigger to the CI workflow, this should allow Meta engineers to manually trigger a CI run that should have the token set (will only be available once the workflow_dispatch trigger is merged). E.g. to test that an external PR doesn't break these tests.

@KapJI could I ask you to trigger a CI run of this PR from within the Buck2 repo to test the remote execution cases? (After convincing yourself that this PR doesn't do anything dodgy with the token).
You can do this by pulling this PR's branch and then pushing it to the facebook/buck2 repo (not onto the main branch, just as a separate branch). Something along the lines of gh pr checkout 787; git push origin persistent-remote-worker. I'd expect the push trigger to fire at that point, if not you may need to add a dummy commit.

@facebook-github-bot
Copy link
Contributor

@KapJI has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@KapJI
Copy link
Member

KapJI commented Oct 30, 2024

@aherrmann It's running on #800

@aherrmann
Copy link
Contributor Author

@KapJI Thanks! Unfortunately, it looks like the token is still not available:

Is BUILDBUDDY_API_KEY configured as a repository secret? Does gh secret list include it in its output?

@aherrmann
Copy link
Contributor Author

@KapJI friendly ping, did you have a chance to look into the above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature request - support remote persistent workers
5 participants