Scheduler unavailability should not impact cache operations #220

Open
CaerusKaru opened this issue Sep 19, 2024 · 8 comments

Comments

@CaerusKaru

In the scenario where a bb-storage frontend points to both a remote cache and a bb-scheduler instance, if the scheduler suddenly goes down, the entire frontend instance is effectively crippled. However, cache operations should be totally unaffected by the scheduler's availability (as in the case where a customer passes --remote_cache but not --remote_executor).

Can we make unavailability of the scheduler a log in the console for cache API calls, while still returning an error for remote execution API calls?

@EdSchouten
Member

I suspect that what you’re seeing is that GetCapabilities() calls fail. Those need to merge properties returned by both the storage nodes and scheduler process. It’s also hard to cache/memoize these, as they depend on the credentials of the user.

@CaerusKaru
Author

Sure, but could the call return the equivalent of false (or an aborted merge) for everyone if the scheduler is down, instead of failing outright?

@EdSchouten
Member

As in, announce that the cluster supports remote caching? No, because that would cause flakiness if people try to do builds that only use remote execution without local fallback.

@CaerusKaru
Author

If they don’t have local fallback enabled but they do have remote executor specified, wouldn’t the CLI simply error that the endpoint doesn’t support RBE and then fail the build?

@EdSchouten
Member

Exactly. And that’s bad, because under the current model it’s possible to set --remote_retries sufficiently high, causing Bazel to simply wait for the scheduler to come online and run the build to completion.

@CaerusKaru
Author

True, but we have to weigh that against the penalty of the remote cache being completely inaccessible to everyone for that duration. Maybe this should be a configuration option, then? Fail open on scheduler unavailability vs. not?

@moroten
Contributor

moroten commented Sep 20, 2024

If we know the configuration of the scheduler, it should be possible to implement configuration of its capabilities straight in the frontend.

@EdSchouten
Member

The scheduler is such a simple process to operate, I don’t see the value in that to be honest. Just run health checking against it and make sure it gets launched elsewhere if your server fails.
