Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Implementation for Circuit breaker #18136

Closed
amishra-u opened this issue Apr 18, 2023 · 0 comments
Closed

Add Implementation for Circuit breaker #18136

amishra-u opened this issue Apr 18, 2023 · 0 comments
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

@amishra-u
Copy link
Contributor

amishra-u commented Apr 18, 2023

Description of the feature request:

The interface for circuitbreaker was defined long back but it has just NOOP implementation. As it was left unimplemented any problem in remote cache has detrimental effect on build times.

What underlying problem are you trying to solve with this feature?

Implement a basic failure circuit breaker to prevent any potential issues in the remote cache from negatively impacting the build time. This would ensure that the performance with remote cache is not worse than without it

@sgowroji sgowroji added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Apr 19, 2023
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged P2 We'll consider working on this in future. (Assignee optional) labels Apr 25, 2023
amishra-u added a commit to amishra-u/bazel that referenced this issue May 30, 2023
Copy of bazelbuild#18120: I accidentally closed bazelbuild#18120 during rebase and doesn't have permission to reopen.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes bazelbuild#18136

Closes bazelbuild#18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704
iancha1992 pushed a commit that referenced this issue Jun 1, 2023
* feat: Implement failure circuit breaker

Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake
iancha1992 added a commit that referenced this issue Jun 13, 2023
…#18559)

* feat: Implement failure circuit breaker

Copy of #18120: I accidentally closed #18120 during rebase and doesn't have permission to reopen.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_emote_failure_window.

In this implementation, I have implemented failure strategy for circuit breaker and used failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate : To measure failure rate I also need the success count. While both the failure and success count need to be an AtomicInteger as both will be modified concurrently by multiple threads. Even though getAndIncrement is very light weight operation, at very high request it might contribute to latency.

Reasoning behind using failure circuit breaker : A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again
for the next build as a new instance of Retrier.CircuitBreaker will be created. If needed in the future we may add cool down strategy also. e.g. failure_and_cool_down_startegy.

closes #18136

Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake

* Use failure_rate instead of failure count for circuit breaker

---------

Co-authored-by: Ian (Hee) Cha <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request
Projects
None yet
4 participants