
Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376

Closed
zolug opened this issue Mar 7, 2023 · 39 comments
Labels
bug Something isn't working

Comments

@zolug

zolug commented Mar 7, 2023

When an NSE and the k8s-registry handling its registration are both hosted by the same worker, loss of that worker results in the NSE Custom Resource remaining in etcd.

The problem is similar to #158.

Reproduction with Kind (1 controller, 2 workers):

  • start up NSM with the k8s-registry
  • create an NSE on a worker that also hosts the k8s-registry
  • stop the docker container of the worker hosting the NSE and the registry
  • check the NSE custom resources
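
A minimal shell sketch of these steps against a Kind cluster (the node name below is an assumption; adjust to your setup):

# "kind-worker2" is assumed to be the worker hosting both the NSE and the k8s-registry
docker stop kind-worker2
# wait past the default 10-minute expiration, then check the NSE custom resources
sleep 600
kubectl get nse -n nsm-system
# the stale NSE custom resource is still listed even though its expiration time has passed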

It would be nice to have a "garbage collection"-like feature in k8s-registry to handle dangling NSEs.

@LionelJouin LionelJouin moved this to 📋 To Do in Meridio Mar 7, 2023
@LionelJouin LionelJouin moved this from 📋 To Do to 🏗 In progress in Meridio Mar 7, 2023
@denis-tingaikin denis-tingaikin added the bug Something isn't working label Mar 9, 2023
@richardstone

Hi @edwarnicke and @denis-tingaikin

Could you please prioritise this ticket, as it's blocking our progress right now?

Thanks!

@edwarnicke
Member

@denis-tingaikin Could you have a look at this?

@denis-tingaikin
Member

Hello @richardstone , @edwarnicke

I see two options for resolving it:

  1. Schedule a minor release v1.8.1 and fix the issue next week.
  2. Fix the issue within the planned release v1.9.0. (Currently, we plan to deliver v1.9.0 by 2023-05-02.)

Thoughts?

@richardstone

Hi,

We are currently on NSM 1.6.1 and uplifting might be a bit of a pain. Is there any chance of back-porting the solution to the 1.6 track?

@denis-tingaikin
Member

Theoretically, it's possible to release a v1.6.2 that includes the needed fix.

@edwarnicke WDYT?

@richardstone

@denis-tingaikin
That would be really great!

@edwarnicke
Member

@denis-tingaikin If it's not too heavy a lift, I'm fine with doing a v1.6.2.

@LionelJouin
Member

Here is a way to reproduce the problem (so, the same as #158):

kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/spire/single_cluster/?ref=v1.8.0
kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/basic?ref=v1.8.0
kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/use-cases/Kernel2Ethernet2Kernel?ref=v1.8.0

# (Up to this point, this is the same as this example: https://github.com/networkservicemesh/deployments-k8s/tree/v1.8.0/examples/use-cases/Kernel2Ethernet2Kernel)

kubectl delete ns ns-kernel2ethernet2kernel
registry=$(kubectl get pods --selector="app=registry" -n nsm-system --no-headers=true | awk '{print $1}')
kubectl delete pods -n nsm-system $registry

And now you can see that the NSE is never removed (kubectl get nse -n nsm-system), even after 10 minutes (the default timeout).

Here is one solution to this problem:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: registry-k8s-gc
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: registry-k8s-sa
          containers:
          - name: registry-k8s-gc
            image: bitnami/kubectl:latest
            command: ["/bin/bash"]
            args:
            - "-c"
            - |
              nses=$(kubectl get nse -n nsm-system --no-headers=true | awk '{print $1}')
              current=$(date +%s)
              while IFS= read -r nse; do
                  nse_expiration=$(kubectl get nse $nse -n nsm-system -o json | jq -r '.spec.expiration_time.seconds')
                  if (( current > nse_expiration )); then
                    kubectl delete nse -n nsm-system $nse
                  fi
              done <<< "$nses"
          restartPolicy: OnFailure
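
Note that the service account referenced above (registry-k8s-sa, an assumed name) needs RBAC permissions to get, list, and delete the NSE custom resources (networkserviceendpoints.networkservicemesh.io) in nsm-system. With that in place, the workaround can be applied as usual, e.g.:

# the file name here is hypothetical
kubectl apply -n nsm-system -f registry-k8s-gc.yaml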

@denis-tingaikin
Member

denis-tingaikin commented Mar 13, 2023

First of all, I've created a board for v1.6.2 to monitor the progress of this release.

@LionelJouin

Here is one solution to this problem:

Good that we have a workaround for the moment. I'm not sure that we should use jobs in the final solution (@edwarnicke correct me if I'm wrong).

@edwarnicke , @LionelJouin , @zolug

I think we might use two tactics here to manage these cases:

  1. We could cherry-pick the commit from v1.7.0 Add default expiration option for registry sdk#1406. That allows the managers to remove leaked endpoints when the registry is restarted.
  2. We could add logic to get the list and remove outdated endpoints on registry startup.

Thoughts?

@richardstone

I agree with you that we should not have jobs in NSM.

I think this is something that needs to be checked continuously, so it might be good to trigger the logic on a configurable timer. This way, if there is any unexpected change and dangling NSEs are left on the system, recovery does not require a registry restart.

@zolug
Author

zolug commented Mar 13, 2023

First of all, I've created a board for v1.6.2 to monitor the progress of this release.

@LionelJouin

Here is one solution to this problem:

Good that we have a workaround for the moment. I'm not sure that we should use jobs in the final solution (@edwarnicke correct me if I'm wrong).

@edwarnicke , @LionelJouin , @zolug

I think we might use two tactics here to manage these cases:

1. We could cherry-pick the commit from v1.7.0 [Add default expiration option for registry sdk#1406](https://github.com/networkservicemesh/sdk/pull/1406). That allows the managers to remove leaked endpoints when the registry is restarted.

2. We could add logic to get the list and remove outdated endpoints on registry startup.

Thoughts?

Correct me if I'm wrong, but in case the NSC, NSMgr and NSE are all located on the same node and, for example, the node is abruptly rebooted, then having timeouts on the manager side doesn't seem to make a difference.

As Richard mentioned, if the restart of a registry concludes before the expiration time of a dangling NSE, the associated Custom Resource of that NSE would still remain in the etcd db, wouldn't it (assuming we are talking about the node reboot case)? So I would prefer a GC-like solution running in the background.
(Also, I'm not sure if we should prepare for situations where a node is not drained properly before worker node scale-in, which would not involve a registry restart.)

@denis-tingaikin
Member

@zolug

That's a good catch. What if we then do the following:

2. We could add logic to get the list and remove outdated endpoints on registry startup. -->
2. We could add logic to get a list of endpoints/networkservices and initiate the register procedure with them.

We already have logic for expiration. So if the registry has leaked resources that are not yet expired, they will be removed by the timer. If a resource is already outdated at the fetching stage, we could remove it immediately.

Thoughts?

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.6.2 Mar 14, 2023
@zolug
Author

zolug commented Mar 14, 2023

@zolug

That's a good catch. What if we then do the following:

2. We could add logic to get the list and remove outdated endpoints on registry startup. --> 2. We could add logic to get a list of endpoints/networkservices and initiate the register procedure with them.

We already have logic for expiration. So if the registry has leaked resources that are not yet expired, they will be removed by the timer. If a resource is already outdated at the fetching stage, we could remove it immediately.

Thoughts?

Just to make sure I follow:
The proposed logic would still be triggered by registry start. Or would it be triggered on other occasions as well?

The NSEs would be fetched from etcd, and I guess they would get registered (unless already expired) with a timeout derived from the current time and the Custom Resource's expiration time.

That way the registry would not prolong the lifetime of any possibly dangling NSE CR (which could get cleaned up by the registry's general timeout logic), yet it would allow available NSEs to refresh their registration.

And in the latter case it wouldn't matter if the recently started registry was not serving the registration refreshes, because if the CR had been refreshed through a different registry in the meantime, the version mismatch would prevent removal of an NSE CR still in use.

If I got it right, then I actually like this proposal.
(Losing a node without it ever returning might not be a valid use case, but please correct me if I'm wrong.)

@denis-tingaikin
Member

@zolug

Yes, you're correct

@denis-tingaikin
Member

@zolug

I've prepared a PR with a fix: #377

Also, I've pushed an image with the fix into my docker repository.

Could you please test the image tinden/cmd-registry-k8s:v1.6.2?

Note: You just need to update the container image in the yaml file.
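
For example, assuming the registry runs as a Deployment named registry-k8s with a container of the same name in nsm-system (these names are assumptions and may differ in your manifests), the image can be swapped in place:

kubectl set image deployment/registry-k8s registry-k8s=tinden/cmd-registry-k8s:v1.6.2 -n nsm-system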

@zolug
Author

zolug commented Mar 14, 2023

Thanks a lot @denis-tingaikin!
I'll try to verify it asap.

@zolug
Author

zolug commented Mar 14, 2023

It did not seem to work as planned: the fetched NSE CR items without backing NSEs kept being refreshed. I think it could be because the registryclient introduced into the registry has a built-in refresh chain.

On the other hand, expired NSEs are registered by the registry after the fetch.

@denis-tingaikin
Member

denis-tingaikin commented Mar 14, 2023

@zolug Thank you for testing. That's why we should test fixes 😉️

Nice catch, I'll take a look into these problems.

@denis-tingaikin
Member

@zolug I've pushed a new image for testing. Could you please check tinden/cmd-registry-k8s:v1.6.2-fix.1?

@zolug
Author

zolug commented Mar 16, 2023

@denis-tingaikin Thanks, it seemed to work both for expired NSEs and for orphan NSEs.

However, I'm concerned about the convergence time: it will mostly scale with the minimum of the pod-eviction-timeout and the node restart time, so an orphan NSE might linger (irrespective of its expiration time) for minutes. This might not be acceptable considering the traffic disturbance it might cause.
Some logic triggering the "fetch based cleanup" could complement the current proposal (like a change in the endpoints of the related registry service). For a single registry it still wouldn't help, but that wouldn't be a problem for us (we run multiple instances of the k8s registry). Do you see any way to cover that?

@denis-tingaikin
Member

denis-tingaikin commented Mar 16, 2023

@zolug

I think we can find a solution for this case.

I'd like to reproduce it locally; could you please share the steps to reproduce?

@denis-tingaikin
Member

However, I'm concerned about the convergence time: it will mostly scale with the minimum of the pod-eviction-timeout and the node restart time, so an orphan NSE might linger (irrespective of its expiration time) for minutes. This might not be acceptable considering the traffic disturbance it might cause.
Some logic triggering the "fetch based cleanup" could complement the current proposal (like a change in the endpoints of the related registry service). For a single registry it still wouldn't help, but that wouldn't be a problem for us (we run multiple instances of the k8s registry). Do you see any way to cover that?

@zolug

Just for clarification, I have a few questions:

  1. Does the problem block you?
  2. Is it possible to handle the problem after release v1.6.2?

@richardstone

@denis-tingaikin

How big of a change are we talking about when it comes to the fix for the case @zolug mentioned? Would it take more time to implement than what you have done already?

@denis-tingaikin
Member

@richardstone

Currently, I am trying to understand how to reproduce the proposed case from comment #376 (comment) in the real world.

Also, if the proposed case #376 (comment) is not critical and doesn't affect key scenarios, we could consider releasing 1.6.2 based on the current fix.

So my questions are:

  1. Does the problem Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376 (comment) block any production use cases?
  2. How can I reproduce the problem Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376 (comment) locally? Any logs/steps to reproduce are welcome.

@ljkiraly

@denis-tingaikin, I'm trying to describe the reproduction steps, which are similar to those detailed by @zolug in the issue description, with minimal variance:

  • start Spire and the basic NSM config in Kind with 4 workers
  • deploy an NSE on the node where the registry-k8s is running
  • deploy an NSC on some other node (this setup can be achieved with a modified Kernel2Ethernet2Kernel example)
  • monitor the interface changes in the NSC
  • scale the registry-k8s to 2 replicas
  • stop the node where the NSE pod resides
  • monitor the NSE cleanup with: kubectl -n nsm-system get nse

The interface in the NSC is kept much longer than the expected 2 mins (from max-token-lifetime), and the NSE is only deleted when the new registry-k8s has started.
I hope this helps.
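
(For reference, a rough shell rendering of the failure-injection part of these steps; the deployment and node names are assumptions:)

# scale the registry to 2 replicas, then stop the node hosting the NSE pod
kubectl scale deployment registry-k8s -n nsm-system --replicas=2
docker stop kind-worker3   # assumed name of the node where the NSE pod resides
# watch how long the stale NSE custom resource survives
kubectl -n nsm-system get nse -w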

@denis-tingaikin
Member

@ljkiraly many thanks for the clarification 🙂

Could we also decide on the expected behavior?

@github-project-automation github-project-automation bot moved this from In Progress to Done in Release v1.6.2 Mar 17, 2023
@denis-tingaikin
Member

denis-tingaikin commented Mar 17, 2023

As far as I can see:

Actual: the NSC has a broken connection for the timeout period (does the timeout help?)
Expected: TODO

@denis-tingaikin
Member

denis-tingaikin commented Mar 18, 2023

@ljkiraly , @zolug

Based on the feedback, I've prepared two possible options to fix the issue:

  1. tinden/cmd-registry-k8s:v1.6.2-fix.2. It's based on the fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" #377. Note: you might also need to add a rule to the registry-k8s service account so it can work with nodes.
  2. tinden/cmd-registry-k8s:v1.6.2-fix.3. It's based on the alternative fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" sdk-k8s#431. Note: just replace the image with that version.

Please let me know whether it solves the problem.

@denis-tingaikin denis-tingaikin moved this from Done to In Progress in Release v1.6.2 Mar 20, 2023
@zolug
Author

zolug commented Mar 20, 2023

@ljkiraly , @zolug

Based on the feedback, I've prepared two possible options to fix the issue:

1. `tinden/cmd-registry-k8s:v1.6.2-fix.2`. It's based on the [fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" #377](https://github.com/networkservicemesh/cmd-registry-k8s/pull/377). Note: you might also need to add a rule to the registry-k8s service account so it can work with nodes.

2. `tinden/cmd-registry-k8s:v1.6.2-fix.3`. It's based on the [alternative fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" sdk-k8s#431](https://github.com/networkservicemesh/sdk-k8s/pull/431). Note: just replace the image with that version.

Please let me know whether it solves the problem.

@denis-tingaikin Great, thanks a lot. We're going to verify them.
I myself prefer the proposal where Find() simply ignores outdated NSEs, but we'll discuss it internally. Thanks again!

@denis-tingaikin
Member

denis-tingaikin commented Mar 20, 2023

@zolug Also, I have a question...

Does NSM endpoint selection work correctly if we have leaked NSEs?

In my mind, leaked NSEs should not affect the NSM flow related to endpoint selection.

So, I want to be sure that we don't have other bugs :)

@zolug
Author

zolug commented Mar 20, 2023

@zolug Also, I have a question...

Does NSM endpoint selection work correctly if we have leaked NSEs?

In my mind, leaked NSEs should not affect the NSM flow related to endpoint selection.

So, I want to be sure that we don't have other bugs :)

To my knowledge NSM can't tell apart leaked NSEs and available NSEs. That's the main reason this problem has caused us trouble, because Find() included NSEs that should have been expired.

@denis-tingaikin
Member

denis-tingaikin commented Mar 21, 2023

@zolug

To my knowledge NSM can't tell apart leaked NSEs and available NSEs. That's the main reason this problem has caused us trouble, because Find() included NSEs that should have been expired.

Got it. OK, we'll check that in a separate issue: networkservicemesh/sdk#1438

@ljkiraly

Hello @denis-tingaikin , @zolug ,

I verified patch no. 3 (tinden/cmd-registry-k8s:v1.6.2-fix.3) and it works fine. The interface in the NSC was gone almost immediately after the NSE's worker node was stopped.

@denis-tingaikin
Member

@LionelJouin perfect! Thank you for testing.

@edwarnicke Could you have a look at networkservicemesh/sdk-k8s#431?

@denis-tingaikin
Member

We got approval from Ed at the PR meeting.

@ljkiraly, @richardstone, @zolug we'll start the release in a few hours; I think v1.6.2 will be released tomorrow.

Let me know if we need to do something else for v1.6.2.

@denis-tingaikin denis-tingaikin moved this from In Progress to Done in Release v1.6.2 Mar 22, 2023
@denis-tingaikin
Member

denis-tingaikin commented Mar 22, 2023

@zolug, @ljkiraly, @richardstone I'm in the middle of releasing v1.6.2. It is taking more time than I expected because our automation is not good at releasing based on old branches :)

By the way, registry-k8s v1.6.2 is already available for use! cmd-registry-k8s:v1.6.2

@denis-tingaikin
Member

Release v1.6.2 is out!
https://github.com/networkservicemesh/deployments-k8s/releases/tag/v1.6.2

Can we close the ticket?

@zolug
Author

zolug commented Mar 24, 2023

Release v1.6.2 is out! https://github.com/networkservicemesh/deployments-k8s/releases/tag/v1.6.2

Can we close the ticket?

Hi @denis-tingaikin
Yes, you can close it. Thanks for taking care of this issue, especially on such short notice!

@denis-tingaikin
Member

Closed by #376 (comment)

@github-project-automation github-project-automation bot moved this from In Progress to Done in Release v1.9.0 Mar 24, 2023
@LionelJouin LionelJouin moved this from 🏗 In progress to ✅ Done in Meridio Apr 2, 2023