Scheduler unavailability should not impact cache operations #220

CaerusKaru · 2024-09-19T19:07:56Z

In the scenario where bb-storage frontend is pointing to both a remote cache and a bb-scheduler instance, if the scheduler suddenly goes down, the entire frontend instance essentially becomes crippled. However, cache actions should be totally unaffected by the scheduler's availability (as in the case where a customer passes --remote_cache but not --remote_executor).

Can we make unavailability of the scheduler a log in the console for cache API calls, while still returning an error for remote execution API calls?


EdSchouten · 2024-09-20T16:30:35Z

I suspect that what you’re seeing is that GetCapabilities() calls fail. Those need to merge properties returned by both the storage nodes and scheduler process. It’s also hard to cache/memoize these, as they depend on the credentials of the user.

CaerusKaru · 2024-09-20T16:34:17Z

Sure, but can we have it be that the call returns the equivalent of false (or aborted merge) for everyone if scheduler is down instead of crashing?

EdSchouten · 2024-09-20T16:35:59Z

As in, announce that the cluster supports remote caching? No, because that would cause flakiness if people try to do builds that only use remote execution without local fallback.

CaerusKaru · 2024-09-20T16:43:03Z

If they don’t have local fallback enabled but they do have remote executor specified, wouldn’t the CLI simply error that the endpoint doesn’t support RBE and then fail the build?

EdSchouten · 2024-09-20T16:45:08Z

Exactly. And that’s bad, because under the current model it’s possible to set --remote_retries sufficiently high, causing Bazel to simply wait for the scheduler to come online and run the build to completion.

CaerusKaru · 2024-09-20T16:52:34Z

True, but we have to weigh that against the remote cache being completely inaccessible to everyone for that duration as a penalty. Maybe this should be a configuration option, then? Fail open on scheduler unavailability vs. not?

moroten · 2024-09-20T16:52:49Z

If we know the configuration of the scheduler, it should be possible to implement configuration of its capabilities straight in the frontend.

EdSchouten · 2024-09-20T16:55:00Z

The scheduler is such a simple process to operate, I don’t see the value in that to be honest. Just run health checking against it and make sure it gets launched elsewhere if your server fails.
