fix(clustering): data-plane node id keeps changing after restarted in hybrid mode #9067
Conversation
Great idea @vm-001 😀 I am surprised this was not done before.
Mind adding an integration test for this?
Added.
Sorry for leaving this PR for some time; I had some high-priority tasks that needed work. After some digging, I found that using a single file to store the node id causes problems across different ngx modes, so I decided to use a different file to store the node id for each ngx mode.
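A minimal sketch of that idea, assuming "ngx mode" here refers to the subsystem the PDK is running under (http vs stream); the file name, helper name, and layout are illustrative, not the exact PR code:

```lua
-- Illustrative only: derive a per-subsystem path for the node id file,
-- so the http and stream instances do not clobber each other's id.
local function node_id_path(prefix)
  local subsystem = ngx.config.subsystem or "http" -- "http" or "stream"
  return prefix .. "/node_id_" .. subsystem
end
```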
Co-authored-by: Xumin <[email protected]>
Co-authored-by: Datong Sun <[email protected]>
It has tests and is a simple enough change that I'm confident any potential issues can be corrected before code freeze. I will merge once the latest CI run has concluded. @vm-001 thanks for enduring the long road to get this merged in. Could you please give all the review comments another read to see if there are any follow-up items that need another PR? Also, please look at the additional integration tests I've added to make sure the correct behavior is reflected. I had to mark one test as …
@@ -255,6 +257,23 @@ local function new(self)
    return gsub(hostname, "\n$", "")
  end

  local prefix = self and self.configuration and self.configuration.prefix
I can't help but feel like all of the logic should go into kong.pdk.private.node instead. As it is, it's still hard to follow the order of operations to know if there's a chicken/egg situation. You have spent more time working through the problem, so maybe I'm just missing something?
Here's how I picture the call stack:
kong.init()
-> kong.global.init_pdk(self, kong_config)
-> kong.pdk.new(kong_config, self)
-> kong.pdk.node.new(self)
-> kong.pdk.private.node.init(self)
...and kong.pdk.private.node.init() could look something like this:
-- init() is a no-op if called more than once
if already_initialized() then
  return
end

-- check shm first, since this might have been a reload
local shm_id = shm:get(MY_ID_KEY)

-- the simple case
if self.configuration.role ~= "data_plane" then
  if not shm_id then
    shm_id = generate_uuid()
    store_to_shm(shm_id)
  end
  return
end

local fs_id = read_id_from_filesystem()

if shm_id and not fs_id then
  store_to_filesystem(shm_id)

elseif fs_id and not shm_id then
  store_to_shm(fs_id)

elseif not fs_id and not shm_id then
  local id = generate_uuid()
  store_to_filesystem(id)
  store_to_shm(id)

elseif fs_id ~= shm_id then
  -- uh oh, need to pick a winner
  -- probably should keep shm_id in this case because if it exists,
  -- that means we're a reload, so it's already being used by the
  -- current process group
  log("uh oh this is weird...")
  store_to_filesystem(shm_id)

else
  -- no action needed if this is the case
  assert(fs_id == shm_id)
end
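One note on the sketch above (helper names like store_to_shm() and read_id_from_filesystem() are placeholders): as its comments say, preferring shm_id on a mismatch is deliberate, because an existing shm value means this is a reload and that id is already in use by the current process group, so overwriting the filesystem copy is safer than changing the node's identity mid-flight.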
At this point, the kong.node.get_id() PDK function just needs to be updated to not attempt to create the node_id itself:
function _NODE.get_id()
  if node_id then
    return node_id
  end

  local shm = ngx.shared.kong

- local ok, err = shm:safe_add(NODE_ID_KEY, utils.uuid())
- if not ok and err ~= "exists" then
-   error("failed to set 'node_id' in shm: " .. err)
- end
+ local err -- previously declared by the removed safe_add() line

  node_id, err = shm:get(NODE_ID_KEY)
  if err then
    error("failed to get 'node_id' in shm: " .. err)
  end

  if not node_id then
    error("no 'node_id' set in shm")
  end

  return node_id
end
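A side effect of this split worth noting: once get_id() no longer creates the id, any caller that runs before kong.pdk.private.node.init() has populated the shm will fail loudly with "no 'node_id' set in shm" instead of silently minting a fresh id, which makes any chicken/egg ordering problem visible rather than hidden.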
@flrgh This is a cleaner approach, but there may be an issue:

kong.init()
-> kong.global.init_pdk(self, kong_config)
-> kong.pdk.new(kong_config, self)
-> kong.pdk.node.new(self)
-> kong.pdk.private.node.init(self)

Kong also calls kong.global.init_pdk in the cmd stage, and the store_to_shm function won't work there, as the ngx.shared API is not available in that stage.
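A minimal sketch of one way around this, reusing the hypothetical store_to_shm()/store_to_filesystem() helpers from the proposal above: check for a real shared dict before writing, so the cmd stage only touches the filesystem copy and the shm copy is filled in later at init time.

```lua
-- Illustrative only: skip the shm write when no real shared dict is
-- available (e.g. in the CLI/cmd stage) and rely on the file instead.
local function store_id(id)
  store_to_filesystem(id)            -- plain file I/O, safe everywhere

  if ngx and ngx.shared and ngx.shared.kong then
    store_to_shm(id)                 -- only when a real shm exists
  end
end
```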
@flrgh Thank you for your help. The fix PR is here: #9790
I appreciate everyone who spent time on this PR.
Summary
Kong creates a new node_id every time it restarts. In hybrid mode, the CP inserts a new row into the clustering_data_planes table after a DP restart because the DP reports a new node_id. If DPs restart enough times within the window defined by cluster_data_plane_purge_delay, the clustering_data_planes rows accumulate significantly. This most likely also fills up the prometheus_metrics shared dict.
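A minimal sketch of the fix's core idea: persist the node id under the Kong prefix so restarts reuse it instead of generating a fresh UUID. The file name and helper are illustrative, not the exact PR code; utils.uuid() is the same generator referenced in the review discussion above.

```lua
local utils = require "kong.tools.utils"

-- Illustrative only: reuse a node id persisted under the prefix, or
-- generate and persist one on first start.
local function get_or_create_node_id(prefix)
  local path = prefix .. "/node_id"  -- illustrative file name

  local f = io.open(path, "r")
  if f then
    local id = f:read("*l")
    f:close()
    if id and #id == 36 then         -- crude UUID shape check; fine for a sketch
      return id                      -- reuse the persisted id
    end
  end

  local id = utils.uuid()            -- first start (or corrupt file)
  f = assert(io.open(path, "w"))
  f:write(id)
  f:close()
  return id
end
```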
Full changelog
Issue reference
FTI-4168