Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376
Hi @edwarnicke and @denis-tingaikin, could you please prioritise this ticket? It's blocking our progress as of now. Thanks!
@denis-tingaikin Could you have a look at this?
Hello @richardstone, @edwarnicke, I see two options for how to resolve it:
Thoughts?
Hi, we are on NSM 1.6.1 now and uplifting might be a bit of a pain. Is there any chance of back-porting the solution to the 1.6 track?
Theoretically it's possible to release a v1.6.2 that includes the needed fix. @edwarnicke WDYT?
@denis-tingaikin
@denis-tingaikin If it's not too heavy a lift, I'm fine with doing a v1.6.2.
Here is a way to reproduce the problem (so, the same as this: #158):
And now you can see that the NSE will never be removed (kubectl get nse -n nsm-system) even after 10 minutes (the default timeout). Here is one solution to this problem:
First of all, I've created a board for v1.6.2 to monitor the progress of this release.
Good that we have a workaround for the moment. I'm not sure that we should use jobs in the final solution (@edwarnicke, correct me if I'm wrong). @edwarnicke, @LionelJouin, @zolug I think we might use two tactics here to manage these cases:
Thoughts?
I agree with you that we should not have jobs in NSM. I think this is something that needs to be checked continuously, so it might be good to trigger the logic on a configurable timer. This way, if there was any unexpected change and dangling NSEs were left on the system, recovery would not require a registry restart.
Correct me if I'm wrong, but in case the NSC, NSMgr and NSE are all located on the same node, and the node is, for example, abruptly rebooted, then having timeouts on the manager side doesn't seem to make a difference. As Richard mentioned, if the restart of a registry concludes before the expiration time of a dangling NSE, then the associated Custom Resource of that NSE would still remain in the etcd db, wouldn't it (assuming we are talking about the node reboot case)? So I would prefer a GC-like solution running in the background.
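To make that GC idea concrete, here is a minimal sketch of such a background sweep using client-go's dynamic client. The GVR of the NSE CRD, the namespace, and especially the field path and RFC3339 encoding of the expiration timestamp are illustrative assumptions, not the project's actual layout:

```go
package gc

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// Assumed GVR for the NSE custom resource.
var nseGVR = schema.GroupVersionResource{
	Group:    "networkservicemesh.io",
	Version:  "v1",
	Resource: "networkserviceendpoints",
}

// sweepExpiredNSEs lists all NSE CRs and deletes the ones whose
// (assumed RFC3339-encoded) expiration time is already in the past.
func sweepExpiredNSEs(ctx context.Context, c dynamic.Interface, ns string) error {
	list, err := c.Resource(nseGVR).Namespace(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	now := time.Now()
	for i := range list.Items {
		item := &list.Items[i]
		// Hypothetical field path; the real CRD may encode this differently.
		raw, found, _ := unstructured.NestedString(item.Object, "spec", "expiration_time")
		if !found {
			continue
		}
		exp, err := time.Parse(time.RFC3339, raw)
		if err != nil || exp.After(now) {
			continue // unparsable or still valid: leave it alone
		}
		// Best-effort delete; a dangling NSE cannot refresh this CR anymore.
		_ = c.Resource(nseGVR).Namespace(ns).Delete(ctx, item.GetName(), metav1.DeleteOptions{})
	}
	return nil
}

// runGC triggers the sweep on a configurable timer, independent of
// registry restarts, as suggested in the comments above.
func runGC(ctx context.Context, c dynamic.Interface, ns string, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			_ = sweepExpiredNSEs(ctx, c, ns)
		}
	}
}
```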
That's a good catch. What if we then do the following:
We already have logic for expiration. So if the registry holds leaked resources that are not yet expired, they will be removed by the timer. If a resource turns out to be outdated at the fetching stage, we could remove it immediately. Thoughts?
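A tiny sketch of that fetch-time decision, under the assumption that each stored endpoint carries an absolute expiration timestamp: delete it immediately if the timestamp is already past, otherwise keep it only for whatever time is left rather than granting a fresh full timeout.

```go
package gc

import "time"

// decideOnFetch: given a stored endpoint's expiration time, report
// whether to keep it and, if so, for how much longer. A non-positive
// remaining TTL means the resource is outdated and can be removed
// right away instead of being re-registered.
func decideOnFetch(expiration, now time.Time) (keep bool, ttl time.Duration) {
	ttl = expiration.Sub(now)
	if ttl <= 0 {
		return false, 0 // already expired: remove immediately
	}
	return true, ttl // still valid: re-register with only the remaining time
}
```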
Just to make sure I follow: the NSEs would be fetched from etcd and, I guess, would get registered (unless already expired) with a timeout derived from the current time and the Custom Resource's expiration time. That way the registry would not prolong the lifetime of any possibly dangling NSE CR (which could get cleaned up by the registry's general timeout logic), yet would allow available NSEs to refresh their registration. And in the latter case it wouldn't matter if the recently started registry was not the one serving the registration refreshes, because if the CR was refreshed through a different registry in the meantime, the version mismatch would prohibit removal of an NSE CR still in use. If I got it right, then I actually like this proposal.
Yes, you're correct.
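That version-mismatch safety net maps naturally onto a Kubernetes delete precondition. A sketch, reusing the assumed nseGVR from the earlier snippet:

```go
package gc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/dynamic"
)

// deleteIfUnchanged removes an NSE CR only if its resourceVersion still
// matches the one we read. If another registry refreshed the CR in the
// meantime, the API server rejects the delete with a conflict, so an
// endpoint that is actually alive is never garbage-collected.
func deleteIfUnchanged(ctx context.Context, c dynamic.Interface, ns, name, seenRV string) error {
	return c.Resource(nseGVR).Namespace(ns).Delete(ctx, name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{ResourceVersion: &seenRV},
	})
}
```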
Thanks a lot @denis-tingaikin!
It did not seem to work according to plan: the fetched NSE CR items without NSEs kept being refreshed. I think it could be because the registryclient introduced into the registry has a built-in refresh chain. On the other hand, expired NSEs are registered by the registry after fetch.
@zolug Thank you for testing. That's why we should test fixes 😉 Nice catch, I'll take a look into these problems.
@zolug I've pushed a new image for testing. Could you please check it?
@denis-tingaikin Thanks, it seemed to work both for an expired NSE and for orphan NSEs as well. However, I'm concerned about the convergence time: it will mostly scale with the minimum of pod-eviction-timeout and the node restart time. So an orphan NSE might linger (irrespective of its expiration time) for minutes. This might not be acceptable considering the traffic disturbance it could cause.
I think we may have found a solution for this case. I'd like to reproduce it locally; could you please share the steps to reproduce?
Just for clarification, I've got a few questions:
How big of a change are we talking about when it comes to the fix for the case @zolug mentioned? Would it take more time to implement than what you have done already?
Currently, I am trying to understand how to reproduce the proposed case from comment #376 (comment) in the real world. Also, if the proposed case #376 (comment) is not critical and doesn't affect key scenarios, we could consider releasing 1.6.2 based on the current fix. So my questions are:
@denis-tingaikin, I'm trying to describe the reproduction steps, which are similar to those detailed by @zolug in the issue description, with minimal variance:
@ljkiraly many thanks for the clarification 🙂 Shall we also decide on the expected behavior?
As I can see, the actual behavior is: the NSC has a broken connection for the timeout period (does the timeout help?)
Based on the feedback, I've prepared two possible options to fix the issue:
Please let me know whether they solve the problem.
@denis-tingaikin Great, thanks a lot. We're going to verify them.
@zolug Also, I've got a question... Does NSM endpoint selection work correctly if we have leaked NSEs? In my mind, leaked NSEs should not affect the NSM flow related to endpoint selection. So, I want to be sure that we don't have other bugs :)
To my knowledge, NSM can't tell leaked NSEs and available NSEs apart. That's the main reason this problem has caused us trouble: Find() included NSEs that should have been expired.
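For illustration, this is the kind of filtering that would make Find() results exclude expired endpoints; it assumes the registry API's NetworkServiceEndpoint carries its expiration as a protobuf timestamp:

```go
package gc

import (
	"time"

	"github.com/networkservicemesh/api/pkg/api/registry"
)

// filterExpired drops endpoints whose expiration time has already
// passed, so that dangling NSEs are never offered for selection.
func filterExpired(nses []*registry.NetworkServiceEndpoint, now time.Time) []*registry.NetworkServiceEndpoint {
	out := make([]*registry.NetworkServiceEndpoint, 0, len(nses))
	for _, nse := range nses {
		if t := nse.GetExpirationTime(); t != nil && t.AsTime().Before(now) {
			continue // expired: treat as leaked, hide from Find() results
		}
		out = append(out, nse)
	}
	return out
}
```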
Got it. OK, we'll check that in the separate issue networkservicemesh/sdk#1438
Hello @denis-tingaikin, @zolug, I verified patch no. 3.
@LionelJouin perfect! Thank you for testing. @edwarnicke Could you have a look at networkservicemesh/sdk-k8s#431?
We got approval from Ed at the PR meeting. @ljkiraly, @richardstone, @zolug we'll start the release in a few hours, I think. Let me know if we need to do something else for
@zolug, @ljkiraly, @richardstone I'm in the middle of releasing v1.6.2. It's taking more time than I expected because our automation is not good at releasing based on old branches :) By the way, registry-k8s v1.6.2 is already available for use! cmd-registry-k8s:v1.6.2
Release v1.6.2 is out! Can we close the ticket?
Hi @denis-tingaikin
Closed by #376 (comment)
When an NSE and the k8s-registry handling its registration are both hosted on the same worker, the loss of that worker results in the NSE Custom Resource remaining in etcd.
The problem is similar to #158.
Reproduction with Kind (1 controller, 2 workers):
It would be nice to have some "garbage collection"-like feature in k8s-registry to handle dangling NSEs.