fix(clustering): data-plane node id keeps changing after restart in hybrid mode #9067

Merged

flrgh merged 22 commits into master from feat/node-id-persistence on Nov 18, 2022

Conversation

vm-001
Contributor

@vm-001 vm-001 commented Jul 8, 2022

Summary

Kong creates a new node_id every time it restarts. In hybrid mode, the CP inserts a new row into the clustering_data_planes table after a DP restarts, because the DP reports a new node_id. If DPs restart often enough within the window defined by cluster_data_plane_purge_delay, the clustering_data_planes rows pile up, which is likely to fill up the prometheus_metrics shared dict.

Full changelog

  • Avoid generating a different node_id on every restart by persisting the node_id to a file (see the sketch below)

Issue reference

FTI-4168
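
A rough sketch of the persistence idea (the helper names and the kong.id file name are assumptions for illustration, not the exact code in this PR):

local pl_file = require "pl.file"
local pl_path = require "pl.path"
local utils = require "kong.tools.utils"

-- Reuse the node id stored under the Kong prefix if it exists,
-- otherwise generate one and persist it for future restarts.
local function load_or_create_node_id(prefix)
  local path = pl_path.join(prefix, "kong.id")  -- illustrative location
  if pl_path.exists(path) then
    local id, err = pl_file.read(path)
    if not id then
      return nil, "failed to read node id file: " .. tostring(err)
    end
    return (id:gsub("%s+$", ""))  -- strip trailing whitespace/newline
  end

  local id = utils.uuid()
  local ok, err = pl_file.write(path, id)
  if not ok then
    return nil, "failed to write node id file: " .. tostring(err)
  end

  return id
end

Because the file lives under the Kong prefix, a plain restart reuses the same id, while a fresh prefix yields a new one.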

@vm-001 vm-001 requested a review from a team as a code owner July 8, 2022 06:47
Two review threads on kong/pdk/node.lua (outdated, resolved)
Contributor

@flrgh flrgh left a comment


Great idea @vm-001 😀 I am surprised this was not done before.

Mind adding an integration test for this?
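
A rough sketch of what such a test could look like, assuming spec.helpers provides restart_kong and that the Admin API root endpoint reports the node_id (this is not the exact test that was added):

local helpers = require "spec.helpers"
local cjson = require "cjson"

describe("node id persistence", function()
  lazy_setup(function()
    assert(helpers.start_kong({ database = "off" }))
  end)

  lazy_teardown(function()
    helpers.stop_kong()
  end)

  it("keeps the same node_id across a restart", function()
    -- read the node id before the restart
    local client = helpers.admin_client()
    local res = assert(client:get("/"))
    local id_before = cjson.decode(assert.res_status(200, res)).node_id
    client:close()

    -- restart Kong and read the node id again
    assert(helpers.restart_kong({ database = "off" }))

    client = helpers.admin_client()
    res = assert(client:get("/"))
    local id_after = cjson.decode(assert.res_status(200, res)).node_id
    client:close()

    assert.equals(id_before, id_after)
  end)
end)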

@vm-001 vm-001 marked this pull request as draft July 13, 2022 06:01
@pull-request-size pull-request-size bot added size/M and removed size/S labels Jul 14, 2022
@vm-001 vm-001 marked this pull request as ready for review July 14, 2022 07:39
@vm-001
Contributor Author

vm-001 commented Jul 14, 2022

Great idea @vm-001 😀 I am surprised this was not done before.

Mind adding an integration test for this?

Added.

@github-actions github-actions bot added the chore label (Not part of the core functionality of kong, but still needed) Jul 14, 2022
@vm-001 vm-001 requested review from flrgh and mayocream July 14, 2022 08:09
@github-actions github-actions bot removed the chore label (Not part of the core functionality of kong, but still needed) Aug 4, 2022
Review thread on kong/cluster_events/init.lua (outdated, resolved)
@vm-001
Contributor Author

vm-001 commented Sep 13, 2022

Sorry for leaving this PR idle for a while; I had some high-priority tasks to work on.

After some digging, I found that if Kong uses a single file to store the node_id and shares it between the http and stream modes, it causes a critical issue in cluster events: cluster events published by the http mode instance are ignored by the stream mode instance, because both read the same node id from the filesystem, which is wrong.

So I decided to use a different file to store the node id for each ngx mode, as sketched after this list:

  • ${prefix}/node.id/http for http mode.
  • ${prefix}/node.id/stream for stream mode.
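
A minimal sketch of how the per-mode path can be derived (the helper name is an assumption for illustration):

-- Each nginx subsystem ("http" or "stream") gets its own node id file
-- under the Kong prefix.
local function node_id_path(prefix)
  local subsystem = ngx.config.subsystem  -- "http" or "stream"
  return prefix .. "/node.id/" .. subsystem
end

-- With prefix = "/usr/local/kong" this yields:
--   http   -> /usr/local/kong/node.id/http
--   stream -> /usr/local/kong/node.id/stream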

@vm-001 vm-001 added this to the 3.1 milestone Sep 13, 2022
@flrgh
Contributor

flrgh commented Nov 18, 2022

  1. rebased onto master
  2. addressed most of the code review cleanup/consistency feedback
  3. added more test coverage

@flrgh
Contributor

flrgh commented Nov 18, 2022

I also addressed @dndx's requests to relocate each local knode = kong and kong.node or require "kong.pdk.node".new() expression (a short annotated version of that idiom is below). The tests are passing with this change in place, so I think it's okay, but @vm-001 will know better whether there are any issues to be aware of here.
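
For context, an annotated version of that idiom (the comments are editorial, not from the PR):

-- Use the node PDK module from the initialized global kong table when it
-- is available; otherwise construct a standalone instance, e.g. in early
-- init or CLI code paths where the PDK has not been set up yet.
local knode = kong and kong.node or require "kong.pdk.node".new()

local id = knode.get_id()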

@flrgh
Contributor

flrgh commented Nov 18, 2022

It has tests and is a simple enough change that I'm confident any potential issues can be corrected before code freeze. I will merge once the latest CI run has concluded.

@vm-001 thanks for enduring the long road to get this merged in. Can you please give all the review comments another read to see if there are any follow-up items that need another PR? Also, please look at the additional integration tests I've added to make sure that the correct behavior is reflected. I had to mark one test as pending because of some order-of-operation oddities, and I wasn't sure if I'd break something else by trying to fix it.

@@ -255,6 +257,23 @@ local function new(self)
return gsub(hostname, "\n$", "")
end


local prefix = self and self.configuration and self.configuration.prefix
Contributor

@flrgh flrgh Nov 18, 2022


I can't help but feel like all of the logic should go into kong.pdk.private.node instead. As it is, it's still hard to follow the order of operations to know if there's a chicken/egg situation.

You have spent more time working through the problem, so maybe I'm just missing something?

Here's how I picture the call stack:

kong.init()
 -> kong.global.init_pdk(self, kong_config)
  -> kong.pdk.new(kong_config, self)
    -> kong.pdk.node.new(self)
      -> kong.pdk.private.node.init(self)

...and kong.pdk.private.node.init() could look something like this:

-- sketch: helpers such as already_initialized(), shm, generate_uuid(),
-- store_to_shm() and friends are placeholders
local function init(self)
  -- init() is a no-op if called more than once
  if already_initialized() then
    return
  end

  -- check shm first, since this might have been a reload
  local shm_id = shm:get(MY_ID_KEY)

  -- the simple case
  if self.configuration.role ~= "data_plane" then
    if not shm_id then
      shm_id = generate_uuid()
      store_to_shm(shm_id)
    end
    return
  end

  local fs_id = read_id_from_filesystem()

  if shm_id and not fs_id then
    store_to_filesystem(shm_id)

  elseif fs_id and not shm_id then
    store_to_shm(fs_id)

  elseif not fs_id and not shm_id then
    local id = generate_uuid()
    store_to_filesystem(id)
    store_to_shm(id)

  elseif fs_id ~= shm_id then
    -- uh oh, need to pick a winner
    -- probably should keep shm_id in this case because if it exists,
    -- that means we're a reload, so it's already being used by the
    -- current process group
    log("uh oh this is weird...")
    store_to_filesystem(shm_id)

  else
    -- no action needed if this is the case
    assert(fs_id == shm_id)
  end
end

At this point the kong.node.get_id() PDK function just needs to be updated to not attempt to create the node_id itself:

  function _NODE.get_id()
    if node_id then
      return node_id
    end

    local shm = ngx.shared.kong

-    local ok, err = shm:safe_add(NODE_ID_KEY, utils.uuid())
-    if not ok and err ~= "exists" then
-      error("failed to set 'node_id' in shm: " .. err)
-    end

    node_id, err = shm:get(NODE_ID_KEY)
    if err then
      error("failed to get 'node_id' in shm: " .. err)
    end

    if not node_id then
      error("no 'node_id' set in shm")
    end

    return node_id
  end

@vm-001 (Contributor Author) replied:


@flrgh This is a cleaner approach, but there may be an issue:

kong.init()
 -> kong.global.init_pdk(self, kong_config)
  -> kong.pdk.new(kong_config, self)
    -> kong.pdk.node.new(self)
      -> kong.pdk.private.node.init(self)

Kong calls kong.global.init_pdk in the cmd (CLI) stage, and the store_to_shm function won't work there because the ngx.shared API is not available at that stage.
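
A minimal sketch of the kind of guard this implies (the function and key names are assumed for illustration, not code from the PR):

-- In the cmd/CLI stage ngx.shared.kong is not available, so shm access
-- has to be skipped (or deferred) in that context.
local function store_to_shm(id)
  local shm = ngx.shared and ngx.shared.kong
  if not shm then
    -- running under the CLI: no shared dicts configured, nothing to store
    return true
  end

  local ok, err = shm:safe_set("kong:node_id", id)
  if not ok then
    return nil, "failed to store node_id in shm: " .. err
  end

  return true
end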

@flrgh flrgh merged commit 630f58d into master Nov 18, 2022
@flrgh flrgh deleted the feat/node-id-persistence branch November 18, 2022 01:16
@vm-001
Contributor Author

vm-001 commented Nov 21, 2022

It has tests and is a simple enough change that I'm confident any potential issues can be corrected before code freeze. I will merge once the latest CI run has concluded.

@vm-001 thanks for enduring the long road to get this merged in. Can you please give all the review comments another read to see if there are any follow-up items that need another PR? Also, please look at the additional integration tests I've added to make sure that the correct behavior is reflected. I had to mark one test as pending because of some order-of-operation oddities, and I wasn't sure if I'd break something else by trying to fix it.

@flrgh Thank you for your help. The follow-up fix PR is here: #9790

@vm-001
Contributor Author

vm-001 commented Nov 21, 2022

I appreciate everyone who spent time on this PR.
