Fix/plugin server release lock #10561
Conversation
kong/runloop/plugin_servers/init.lua
Outdated

-- to prevent a potential dead loop when someone failed to release the ID
wait_count = wait_count + 1
if wait_count > 10 then
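For context, the guard in the diff could sit inside a loop shaped roughly like the following (a minimal sketch, not the actual Kong source; `wait_for_release`, `instance_busy`, and `MAX_WAIT_COUNT` are names invented for illustration, and the injected `sleep` stands in for `ngx.sleep`):

```lua
local MAX_WAIT_COUNT = 10

-- Spin while the instance ID is still held, but bail out after
-- MAX_WAIT_COUNT rounds instead of looping forever if the ID is
-- never released. instance_busy and sleep are injected so the
-- sketch is self-contained.
local function wait_for_release(instance_busy, sleep)
  local wait_count = 0
  while instance_busy() do
    -- to prevent a potential dead loop when someone failed to release the ID
    wait_count = wait_count + 1
    if wait_count > MAX_WAIT_COUNT then
      return nil, "gave up waiting: instance ID was not released"
    end
    sleep(0.01) -- in Kong this would be ngx.sleep, which yields
  end
  return true
end
```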
The max wait count looks okay to me, but the sleep change doesn't seem necessary, as the sleep call will cause a yield. Can you elaborate on it?
For the fix itself, only the reset_instance change should be necessary (and LGTM); let's please make sure it's in its own commit so we can do a rebase + merge (squash + merge would mix things up).
For example, when Kong starts, the Go plugin server needs time to come up; if we have already waited a while, we can expect to wait longer still.
When the Go plugin server does not respond, Kong can do nothing except wait for more rounds, and it could quickly exhaust the retry count.
Got your point @StarlightIbuki. Just checking: is this a theoretical scenario, or have you confirmed that the initial wait count (10) can be exceeded, making the wait increase useful in practice?
@samugi Each round may take less than 0.01 seconds, and it is quite possible for the Go plugin server to take 0.1 seconds to start.
Instead of relying on a loop count (which would vary in time on different hardware), does it make sense to go with a naive approach that:
- waits explicitly 0.1 seconds in every loop, then checks if we are good
- returns an error if we have waited for more than 10 seconds
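The time-based approach suggested above could be sketched as follows (a hedged illustration only, not Kong's implementation; `wait_until_ready`, `is_ready`, and the injected `sleep` are assumptions made for the example, with `ngx.sleep` being what Kong would actually call):

```lua
local WAIT_STEP    = 0.1 -- seconds slept per iteration
local WAIT_TIMEOUT = 10  -- total seconds before giving up

-- Waits until is_ready() returns true, sleeping WAIT_STEP between
-- checks. Returns true on success, or nil plus an error message once
-- roughly WAIT_TIMEOUT seconds have elapsed, independent of how fast
-- the hardware runs the loop.
local function wait_until_ready(is_ready, sleep)
  local waited = 0
  while not is_ready() do
    if waited >= WAIT_TIMEOUT then
      return nil, "timeout while waiting for the plugin server"
    end
    sleep(WAIT_STEP) -- ngx.sleep yields, so other requests keep running
    waited = waited + WAIT_STEP
  end
  return true
end
```

Bounding by elapsed time rather than iteration count keeps the effective timeout stable across machines of different speeds, which is the concern raised above.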
Is it possible to add a test to verify this behavior?
It's difficult and could be a source of flakiness.
LGTM
Now this PR has 4 commits, should we rebase it?
Let's squash and merge, but keep the two commits' messages.
Summary
NOTE: when merging, please make sure all changes are included in the commit message.
When the plugin server socket is not ready and a request comes in, the "not ready" error leaks the lock, fails to release the instance, and causes CPU usage to reach 100%.
Checklist
- The Pull Request has tests: N/A. This is hard to test with CI.
- There is a user-facing docs PR: N/A

Issue reference
KAG-1204