Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix execution of failpoint should not block deactivation #65

Conversation

henrybear327
Copy link
Contributor

There are 2 main flows of the gofail library: namely, enable/disable and execution (Acquire) of the failpoints.

Currently, a mutex is protecting both flows, thus, only one action can make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated RWMutex. The existing failpointsMu will only be protecting the main shared data structures, such as the failpoints map.

Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within eval on the term being executed)

Reference:

@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from acc1691 to 361aac9 Compare April 18, 2024 15:17
@henrybear327
Copy link
Contributor Author

I made a small demo program to demonstrate the effect after applying this change.

runtime/runtime.go Outdated Show resolved Hide resolved
@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from 7e76f7f to f903576 Compare April 19, 2024 21:04
@henrybear327
Copy link
Contributor Author

cc: @ahrtr @serathius

@ahrtr
Copy link
Member

ahrtr commented Apr 20, 2024

Please read

gofail/runtime/http.go

Lines 40 to 45 in 93c579a

// This prevents all failpoints from being triggered. It ensures
// the server(runtime) doesn't panic due to any failpoints during
// processing the HTTP request.
// It may be inefficient, but correctness is more important than
// efficiency. Usually users will not enable too many failpoints
// at a time, so it (the efficiency) isn't a problem.

Overall, I agree that enable/disable and execution (Acquire) of the failpoints should have separate locks, but the failpoint panic is a special case. When the http server is processing request, we should block the execution of panic failpoint; otherwise it may affect the client side of the http request.

gofail/runtime/terms.go

Lines 284 to 291 in 93c579a

var actMap = map[string]actFunc{
"off": actOff,
"return": actReturn,
"sleep": actSleep,
"panic": actPanic,
"break": actBreak,
"print": actPrint,
}

@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from f903576 to e51e11e Compare May 3, 2024 15:21
@henrybear327
Copy link
Contributor Author

Please read

gofail/runtime/http.go

Lines 40 to 45 in 93c579a

// This prevents all failpoints from being triggered. It ensures
// the server(runtime) doesn't panic due to any failpoints during
// processing the HTTP request.
// It may be inefficient, but correctness is more important than
// efficiency. Usually users will not enable too many failpoints
// at a time, so it (the efficiency) isn't a problem.

Overall, I agree that enable/disable and execution (Acquire) of the failpoints should have separate locks, but the failpoint panic is a special case. When the http server is processing request, we should block the execution of panic failpoint; otherwise it may affect the client side of the http request.

gofail/runtime/terms.go

Lines 284 to 291 in 93c579a

var actMap = map[string]actFunc{
"off": actOff,
"return": actReturn,
"sleep": actSleep,
"panic": actPanic,
"break": actBreak,
"print": actPrint,
}

Hey @ahrtr,

Thanks for reviewing and pointing out this issue.

Sorry that it took a while to fix this. I didn't know if there exists a cleaner way to implement this.

Currently, I introduced a special mutex called panicMu, which ensures that panic failpoints and serving of the HTTP requests won't be able to be executed at the same time.

/cc @ahrtr

@k8s-ci-robot k8s-ci-robot requested a review from ahrtr May 3, 2024 15:40
@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from 1c33e68 to c591bde Compare May 3, 2024 15:41
@henrybear327
Copy link
Contributor Author

/cc @serathius

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me.

Have you manually verified that it can resolve the issue #64?

runtime/failpoint.go Outdated Show resolved Hide resolved
@ahrtr
Copy link
Member

ahrtr commented May 13, 2024

@ArkaSaha30, please let me know if you have any comments.

runtime/runtime.go Outdated Show resolved Hide resolved
runtime/runtime.go Outdated Show resolved Hide resolved
runtime/runtime.go Outdated Show resolved Hide resolved
@henrybear327
Copy link
Contributor Author

Overall looks good to me.

Have you manually verified that it can resolve the issue #64?

Hi @ahrtr,

#65 (comment)

I used this small program to perform testing :) Maybe I should add an integration test to make sure that there isn't any regression. :)

@serathius
Copy link
Member

Agree about the integration test, it should be enough to test that we can setup a failpoint with 1 second sleep and deactivate it before 1 second passes.

@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch 2 times, most recently from 28bde52 to 5f5a127 Compare May 13, 2024 12:21
@henrybear327
Copy link
Contributor Author

Agree about the integration test, it should be enough to test that we can setup a failpoint with 1 second sleep and deactivate it before 1 second passes.

That's similar to what I have in the demo program! :)

I will implement this in the integration test! Thanks @serathius

henrybear327 and others added 5 commits May 14, 2024 18:17
There are 2 main flows of the gofail library: namely enable/disable and
execution (`Acquire`) of the failpoints.

Currently, a mutex is protecting both flows, thus only one action can
make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under
a dedicated `RWMutex`.

The existing `failpointsMu` will only be protecting the main shared
data structures, such as `failpoints` map.

Notice that in our current implementation, the execution of the same
failpoint is still sequential (there is a lock within `eval` on the
term being executed)

Reference:
- etcd-io#64

Signed-off-by: Chun-Hung Tseng <[email protected]>
Signed-off-by: Chun-Hung Tseng <[email protected]>
Ensures that panic failpoints and serving of the http requests won't
be able to be executed at the same time.

Signed-off-by: Chun-Hung Tseng <[email protected]>
Co-authored-by: Benjamin Wang <[email protected]>
Co-authored-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Chun-Hung Tseng <[email protected]>
@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from 5f5a127 to 499c363 Compare May 14, 2024 16:17
@ahrtr
Copy link
Member

ahrtr commented May 14, 2024

I will implement this in the integration test!

PLease let me know if you want to do it in this PR or in a separate PR.

@henrybear327
Copy link
Contributor Author

@ahrtr let me try to add the integration test in this PR, so we can have the code and the test merged together.

What do you think :)

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Thanks @henrybear327

Please feel free to add e2e or integration test in a separate PR.

@henrybear327
Copy link
Contributor Author

Great work!

Thanks @henrybear327

Please feel free to add e2e or integration test in a separate PR.

I have started another PR to temporarily park my integration test implementation (#69).

@ahrtr can you review and see if we need to have a separate discussion regarding the folder/code structure of the integration test, or I can just use that commit here and we merge everything in one go! :)

Thanks!

runtime/failpoint.go Outdated Show resolved Hide resolved
@ahrtr
Copy link
Member

ahrtr commented May 15, 2024

I have started another PR to temporarily park my integration test implementation (#69).

@ahrtr can you review and see if we need to have a separate discussion regarding the folder/code structure of the integration test, or I can just use that commit here and we merge everything in one go! :)

I was only expecting your to do manual test, because it needs big effort to setup the e2e or integration test. But big thanks for offering the e2e/integration test. My proposal:

After all of above is done, we will release v0.2.0.

@henrybear327
Copy link
Contributor Author

I have started another PR to temporarily park my integration test implementation (#69).
@ahrtr can you review and see if we need to have a separate discussion regarding the folder/code structure of the integration test, or I can just use that commit here and we merge everything in one go! :)

I was only expecting your to do manual test, because it needs big effort to setup the e2e or integration test. But big thanks for offering the e2e/integration test. My proposal:

After all of above is done, we will release v0.2.0.

Thanks for the action items @ahrtr, I will push out changes ASAP.

I would add one more step in between. I am also using etcd-io/etcd#18018 to double-check if the gofail changes are working as intended!

Signed-off-by: Chun-Hung Tseng <[email protected]>
Co-authored-by: Marek Siarkowicz <[email protected]>
@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from 837e826 to 60b65e2 Compare May 15, 2024 19:32
@henrybear327
Copy link
Contributor Author

henrybear327 commented May 15, 2024

I have started another PR to temporarily park my integration test implementation (#69).
@ahrtr can you review and see if we need to have a separate discussion regarding the folder/code structure of the integration test, or I can just use that commit here and we merge everything in one go! :)

I was only expecting your to do manual test, because it needs big effort to setup the e2e or integration test. But big thanks for offering the e2e/integration test. My proposal:

After all of above is done, we will release v0.2.0.

Thanks for the action items @ahrtr, I will push out changes ASAP.

I would add one more step in between. I am also using etcd-io/etcd#18018 to double-check if the gofail changes are working as intended!

Hi @ahrtr and @serathius,

I have renamed the mutex as suggested, and added a comment to the new mutex :)

I also created etcd-io/etcd#18018, which

  • replaces all go.etcd.io/gofail with github.com/henrybear327/gofail
  • rolled back the changes that were required due to gofail's original behavior (enabling/disabling failpoints will block the execution of failpoints)
    All the tests can pass (besides the static analysis test failed, because the PR is not using the package from go.etcd.io), which I think means that our current gofail changes aren't breaking the intended functionalities.

Signed-off-by: Chun-Hung Tseng <[email protected]>
Co-authored-by: Marek Siarkowicz <[email protected]>
@henrybear327 henrybear327 force-pushed the improvement/Execution-of-failpoint-should-not-block-deactivation branch from 96066ba to 13d483e Compare May 15, 2024 20:08
@serathius serathius merged commit baefa98 into etcd-io:master May 17, 2024
3 checks passed
@henrybear327 henrybear327 deleted the improvement/Execution-of-failpoint-should-not-block-deactivation branch May 17, 2024 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants