fix kube-ovn-cni crash for newly added nodes , due to old legacy event #4194

Merged
merged 1 commit into from
Jun 20, 2024

Conversation

changluyi
Collaborator

@changluyi changluyi commented Jun 20, 2024

Pull Request

What type of this PR

Examples of user facing changes:

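Symptom:

When a centralized gateway has only a single node, removing that node from the cluster and then adding it back occasionally leaves kube-ovn-cni unable to start.

Cause:

handleDeleteNode never completes successfully and keeps failing with: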

 failed to delete policy route for centralized subnet ovn-default, subnet ovn-default gws are not ready

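After each failure, the event is put back into deleteNodeQueue.

However, once a node with the same name later joins successfully, the delete event is still in deleteNodeQueue, so handleDeleteNode still runs. As a result, some node-related DB entries, such as the lsp and port group, get cleaned up.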
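The sequence described above can be reproduced in a small, self-contained model. This is only a sketch under stated assumptions: `fakeController`, `nodeDB`, `deleteQueue`, and the order of steps inside `handleDeleteNode` are hypothetical stand-ins, not the kube-ovn implementation. It shows how a re-queued delete event can wipe the entries of a node that was re-added under the same name.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeController is a hypothetical, heavily simplified model of the controller.
type fakeController struct {
	nodes       map[string]bool // nodes currently in the cluster
	nodeDB      map[string]bool // per-node DB entries (lsp, port group, ...)
	deleteQueue []string        // stand-in for deleteNodeQueue
}

func newFakeController() *fakeController {
	return &fakeController{nodes: map[string]bool{}, nodeDB: map[string]bool{}}
}

// gatewayReady models "the centralized subnet has at least one gateway node".
func (c *fakeController) gatewayReady() bool { return len(c.nodes) > 0 }

// addNode registers the node and creates its DB entries.
func (c *fakeController) addNode(name string) {
	c.nodes[name] = true
	c.nodeDB[name] = true
}

// removeNode drops the node from the cluster and queues a delete event.
func (c *fakeController) removeNode(name string) {
	delete(c.nodes, name)
	c.deleteQueue = append(c.deleteQueue, name)
}

// handleDeleteNode has no existence check (assumed ordering, for illustration):
// it removes the per-node entries, then fails while no gateway is ready.
func (c *fakeController) handleDeleteNode(name string) error {
	delete(c.nodeDB, name)
	if !c.gatewayReady() {
		return errors.New("subnet gws are not ready")
	}
	return nil
}

// processDeleteQueue handles every queued event once and re-queues failures.
func (c *fakeController) processDeleteQueue() {
	pending := c.deleteQueue
	c.deleteQueue = nil
	for _, name := range pending {
		if err := c.handleDeleteNode(name); err != nil {
			c.deleteQueue = append(c.deleteQueue, name)
		}
	}
}

func main() {
	c := newFakeController()
	c.addNode("node1")
	c.removeNode("node1")
	c.processDeleteQueue() // fails: no gateway left, so the event is re-queued
	c.addNode("node1")     // a node with the same name joins again
	c.processDeleteQueue() // the stale event now succeeds and wipes the new entries
	fmt.Println("in cluster:", c.nodes["node1"], "DB entries present:", c.nodeDB["node1"])
}
```

In this model the node ends up present in the cluster while its DB entries are gone, which matches the symptom reported for kube-ovn-cni on the re-added node.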

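Solution:

I haven't come up with a particularly good approach so far.
Option 1:
Forcibly remove the delete event from deleteNodeQueue. But then handleDeleteNode has not fully succeeded and some rules are left uncleaned. Could that cause problems the next time the node is added?

This concern is not limited to the centralized gateway case. handleDeleteNode returns err in quite a few places, so if we forcibly drop the delete event from the queue, we would probably need to check, for every point where handleDeleteNode can return err, whether it would cause the next node join to fail.
[Image: WeCom screenshot, 企业微信截图_17188503333392]

Option 2:
The one this PR uses: if addNode finds that a delete event is still in the queue, raise a warning and let the user assess the situation.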
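A rough sketch of what Option 2 could look like follows. Everything here is an assumption made for illustration: `pendingDeletes`, `checkAddNode`, and the warning text are hypothetical names, and the PR's actual code is not shown in this conversation. A separate record is used because a typical work queue does not offer a way to ask whether a given item is still queued.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingDeletes is a hypothetical record of node names whose delete event
// has not been handled successfully yet.
type pendingDeletes struct {
	mu    sync.Mutex
	names map[string]struct{}
}

func newPendingDeletes() *pendingDeletes {
	return &pendingDeletes{names: map[string]struct{}{}}
}

// mark records that a delete event for name has been queued.
func (p *pendingDeletes) mark(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.names[name] = struct{}{}
}

// done records that the delete event for name was handled successfully.
func (p *pendingDeletes) done(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.names, name)
}

// has reports whether a delete event for name is still outstanding.
func (p *pendingDeletes) has(name string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	_, ok := p.names[name]
	return ok
}

// checkAddNode returns a warning message when a node is added while a delete
// event for the same name is still outstanding, and "" otherwise.
func checkAddNode(p *pendingDeletes, name string) string {
	if p.has(name) {
		return fmt.Sprintf("node %s is being added while a delete event for the same name is still queued", name)
	}
	return ""
}

func main() {
	p := newPendingDeletes()
	p.mark("node1") // delete of node1 queued, not yet successful
	if msg := checkAddNode(p, "node1"); msg != "" {
		fmt.Println("warning:", msg)
	}
}
```

The trade-off is the one raised in the description: the warning only surfaces the conflict, and deciding what to do about leftover state is left to the operator.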

Which issue(s) this PR fixes

Fixes #(issue-number)

@changluyi changluyi requested a review from oilbeater June 20, 2024 03:03
@oilbeater
Collaborator

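What about checking in handleDeleteNode whether the node still exists in the cache, and returning early if it does? At that point there is no longer any need to do the delete handling.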

@changluyi
Collaborator Author

changluyi commented Jun 20, 2024

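What about checking in handleDeleteNode whether the node still exists in the cache, and returning early if it does? At that point there is no longer any need to do the delete handling.

Sure. I'll add a warning log on that return. If the delete never finished and some leftover garbage causes the next join to fail, the user will have to troubleshoot it themselves from kube-ovn-controller.log.

That said, this kind of case is fairly rare to begin with.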

@oilbeater
Collaborator

Then let's check for existence first and only then process, to stay consistent with how pods are handled.

@changluyi changluyi force-pushed the forbid_same_node_add_when_delete_failed branch from 3c1c409 to 26db725 Compare June 20, 2024 05:39
@changluyi
Collaborator Author

Then let's check for existence first and only then process, to stay consistent with how pods are handled.

OK, fixed.

@oilbeater
Collaborator

oilbeater commented Jun 20, 2024

Just call lister.Get and see whether the node can be found; there is no need to record it separately.

@changluyi changluyi force-pushed the forbid_same_node_add_when_delete_failed branch from 26db725 to acaa627 Compare June 20, 2024 05:55
@changluyi changluyi force-pushed the forbid_same_node_add_when_delete_failed branch from acaa627 to a5321ae Compare June 20, 2024 07:07
@changluyi
Collaborator Author

Just call lister.Get and see whether the node can be found; there is no need to record it separately.

Fixed, but handleDeleteNode still needs a map to hold the obj.
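The approach agreed on in this thread (look the node up through the lister first, and keep the deleted object in a map) might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `node`, `nodeLister`, `mapLister`, `deletingNodes`, and `cleaned` are hypothetical names, and the in-memory lister replaces a real informer-backed one. With client-go, the not-found check would presumably use `IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`. The map is assumed to be needed because the queue carries only a key, and the lister can no longer return an object that has been deleted.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for the Kubernetes Node object.
type node struct{ Name string }

var errNotFound = errors.New("node not found")

// nodeLister is a hypothetical stand-in for an informer-backed lister.
type nodeLister interface {
	Get(name string) (*node, error)
}

// mapLister is an in-memory lister used only to make the sketch runnable.
type mapLister map[string]*node

func (l mapLister) Get(name string) (*node, error) {
	if n, ok := l[name]; ok {
		return n, nil
	}
	return nil, errNotFound
}

type controller struct {
	lister        nodeLister
	deletingNodes sync.Map // node name -> *node captured at delete time
	cleaned       []string // names whose cleanup actually ran
}

// enqueueDeleteNode remembers the deleted object and returns the queue key.
func (c *controller) enqueueDeleteNode(n *node) string {
	c.deletingNodes.Store(n.Name, n)
	return n.Name
}

// handleDeleteNode skips the cleanup when a node with the same name exists
// again, and otherwise cleans up using the object stored at delete time.
func (c *controller) handleDeleteNode(key string) error {
	_, err := c.lister.Get(key)
	if err == nil {
		fmt.Printf("warning: node %s exists again, skipping stale delete event\n", key)
		c.deletingNodes.Delete(key)
		return nil
	}
	if !errors.Is(err, errNotFound) {
		return err // unexpected lookup failure: let the caller retry
	}
	obj, ok := c.deletingNodes.Load(key)
	if !ok {
		return nil // nothing recorded for this key
	}
	n := obj.(*node)
	c.cleaned = append(c.cleaned, n.Name) // the real cleanup would go here
	c.deletingNodes.Delete(key)
	return nil
}

func main() {
	lister := mapLister{}
	c := &controller{lister: lister}
	key := c.enqueueDeleteNode(&node{Name: "node1"})
	lister["node1"] = &node{Name: "node1"} // re-added before the event is handled
	if err := c.handleDeleteNode(key); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("cleaned:", c.cleaned)
}
```

As noted earlier in the conversation, skipping the delete means any leftovers from the earlier failed attempts stay in place, so the warning log is the main trace an operator has to work from.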

@changluyi changluyi changed the title from "forbid the same node add ,when delete node is failed" to "skip handleDeleteNode, when new node with the same name is added" Jun 20, 2024
@changluyi changluyi changed the title from "skip handleDeleteNode, when new node with the same name is added" to "skip handleDeleteNode, when new same name node is adding" Jun 20, 2024
@changluyi changluyi merged commit 05e2ccb into release-1.12 Jun 20, 2024
55 checks passed
@changluyi changluyi deleted the forbid_same_node_add_when_delete_failed branch June 20, 2024 08:11
@changluyi changluyi changed the title from "skip handleDeleteNode, when new same name node is adding" to "fix kube-ovn-cni crash for newly added nodes , due to old legacy event" Jun 21, 2024
@cmdy cmdy mentioned this pull request Nov 8, 2024
1 task