
Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376

Closed
zolug opened this issue Mar 7, 2023 · 39 comments
Labels
bug Something isn't working

Comments

@zolug

zolug commented Mar 7, 2023

When an NSE and the k8s-registry handling its registration are both hosted by the same worker, loss of that worker results in the NSE Custom Resource remaining in etcd.

The problem is similar to #158.

Reproduction with Kind (1 controller, 2 workers):

  • start up NSM with the k8s-registry
  • create an NSE on a worker that also hosts the k8s-registry
  • stop the docker container of the worker hosting the NSE and the registry
  • check the NSE custom resources
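
A minimal shell sketch of these steps against a Kind cluster (the node name below is an assumption; adjust to your setup):

# "kind-worker2" is assumed to be the worker hosting both the NSE and the k8s-registry
docker stop kind-worker2
# wait past the default 10-minute expiration, then check the NSE custom resources
sleep 600
kubectl get nse -n nsm-system
# the stale NSE custom resource is still listed even though its expiration time has passed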

It would be nice to have a "garbage collection"-like feature in k8s-registry to handle dangling NSEs.

@LionelJouin LionelJouin moved this to 📋 To Do in Meridio Mar 7, 2023
@LionelJouin LionelJouin moved this from 📋 To Do to 🏗 In progress in Meridio Mar 7, 2023
@denis-tingaikin denis-tingaikin added the bug Something isn't working label Mar 9, 2023
@richardstone

Hi @edwarnicke and @denis-tingaikin

Could you please prioritise this ticket, as it's blocking our progress right now?

Thanks!

@edwarnicke
Member

@denis-tingaikin Could you have a look at this?

@denis-tingaikin
Member

Hello @richardstone , @edwarnicke

I see two options for resolving it:

  1. Schedule a minor release v1.8.1 and fix the issue next week.
  2. Fix the issue within the planned release v1.9.0. (Currently, we plan to deliver v1.9.0 by 2023-05-02.)

Thoughts?

@richardstone

Hi,

We are currently on NSM 1.6.1 and uplifting might be a bit of a pain. Is there any chance of back-porting the solution to the 1.6 track?

@denis-tingaikin
Member

Theoretically, it's possible to release a v1.6.2 that includes the needed fix.

@edwarnicke WDYT?

@richardstone

@denis-tingaikin
That would be really great!

@edwarnicke
Member

@denis-tingaikin If it's not too heavy a lift, I'm fine with doing a v1.6.2.

@LionelJouin
Member

Here is a way to reproduce the problem (so, the same as #158):

kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/spire/single_cluster/?ref=v1.8.0
kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/basic?ref=v1.8.0
kubectl apply -k https://github.com/networkservicemesh/deployments-k8s/examples/use-cases/Kernel2Ethernet2Kernel?ref=v1.8.0

# (Up to this point, this is the same as this example: https://github.com/networkservicemesh/deployments-k8s/tree/v1.8.0/examples/use-cases/Kernel2Ethernet2Kernel)

kubectl delete ns ns-kernel2ethernet2kernel
registry=$(kubectl get pods --selector="app=registry" -n nsm-system --no-headers=true | awk '{print $1}')
kubectl delete pods -n nsm-system $registry

And now you can see that the NSE is never removed (kubectl get nse -n nsm-system), even after 10 minutes (the default timeout).

Here is one solution to this problem:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: registry-k8s-gc
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: registry-k8s-sa
          containers:
          - name: registry-k8s-gc
            image: bitnami/kubectl:latest
            command: ["/bin/bash"]
            args:
            - "-c"
            - |
              nses=$(kubectl get nse -n nsm-system --no-headers=true | awk '{print $1}')
              current=$(date +%s)
              while IFS= read -r nse; do
                  nse_expiration=$(kubectl get nse $nse -n nsm-system -o json | jq -r '.spec.expiration_time.seconds')
                  if (( current > nse_expiration )); then
                    kubectl delete nse -n nsm-system $nse
                  fi
              done <<< "$nses"
          restartPolicy: OnFailure
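
Note that the service account referenced above (registry-k8s-sa, an assumed name) needs RBAC permissions to get, list, and delete the NSE custom resources (networkserviceendpoints.networkservicemesh.io) in nsm-system. With that in place, the workaround can be applied as usual, e.g.:

# the file name here is hypothetical
kubectl apply -n nsm-system -f registry-k8s-gc.yaml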

@denis-tingaikin
Member

denis-tingaikin commented Mar 13, 2023

First of all, I've created a board for v1.6.2 to monitor the progress of this release.

@LionelJouin

Here is one solution to this problem:

Good that we have a workaround for the moment. I'm not sure that we should use jobs in the final solution (@edwarnicke correct me if I'm wrong).

@edwarnicke , @LionelJouin , @zolug

I think we might use two tactics here to manage these cases:

  1. We could cherry-pick the commit from v1.7.0 Add default expiration option for registry sdk#1406. That allows the managers to remove leaked endpoints when the registry is restarted.
  2. We could add logic to get the list and remove outdated endpoints on registry startup.

Thoughts?

@richardstone

I agree with you that we should not have jobs in NSM.

I think this is something that needs to be checked continuously, so it might be good to trigger the logic on a configurable timer. This way, if there is any unexpected change and dangling NSEs are left on the system, recovery does not require a registry restart.

@zolug
Author

zolug commented Mar 13, 2023

First of all, I've created a board for v1.6.2 to monitor the progress of this release.

@LionelJouin

Here is one solution to this problem:

Good that we have a workaround for the moment. I'm not sure that we should use jobs in the final solution (@edwarnicke correct me if I'm wrong).

@edwarnicke , @LionelJouin , @zolug

I think we might use two tactics here to manage these cases:

1. We could cherry-pick the commit from v1.7.0 [Add default expiration option for registry sdk#1406](https://github.com/networkservicemesh/sdk/pull/1406). That allows the managers to remove leaked endpoints when the registry is restarted.

2. We could add logic to get the list and remove outdated endpoints on registry startup.

Thoughts?

Correct me if I'm wrong, but in case the NSC, NSMgr and NSE are all located on the same node and, for example, the node is abruptly rebooted, then having timeouts on the manager side doesn't seem to make a difference.

As Richard mentioned, if the restart of a registry concludes before the expiration time of a dangling NSE, the associated Custom Resource of that NSE would still remain in the etcd db, wouldn't it (assuming we are talking about the node reboot case)? So I would prefer a GC-like solution running in the background.
(Also, I'm not sure if we should prepare for situations where a node is not drained properly before worker node scale-in, which would not involve a registry restart.)

@denis-tingaikin
Member

@zolug

That's a good catch. What if we then do the following:

2. We could add logic to get the list and remove outdated endpoints on registry startup. -->
2. We could add logic to get a list of endpoints/networkservices and initiate the register procedure with them.

We already have logic for expiration. So if the registry has leaked resources that are not yet expired, they will be removed by the timer. If a resource is already outdated at the fetching stage, we could remove it immediately.

Thoughts?

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.6.2 Mar 14, 2023
@zolug
Author

zolug commented Mar 14, 2023

@zolug

That's a good catch. What if we then do the following:

2. We could add logic to get the list and remove outdated endpoints on registry startup. --> 2. We could add logic to get a list of endpoints/networkservices and initiate the register procedure with them.

We already have logic for expiration. So if the registry has leaked resources that are not yet expired, they will be removed by the timer. If a resource is already outdated at the fetching stage, we could remove it immediately.

Thoughts?

Just to make sure I follow:
The proposed logic would still be triggered by registry start. Or would it be triggered on other occasions as well?

The NSEs would be fetched from etcd, and I guess they would get registered (unless already expired) with a timeout derived from the current time and the Custom Resource's expiration time.

That way the registry would not prolong the lifetime of any possibly dangling NSE CR (which could get cleaned up by the registry's general timeout logic), yet it would allow available NSEs to refresh their registration.

And in the latter case it wouldn't matter if the recently started registry was not serving the registration refreshes, because if the CR had been refreshed through a different registry in the meantime, the version mismatch would prevent removal of an NSE CR still in use.

If I got it right, then I actually like this proposal.
(Losing a node without it ever returning might not be a valid use case, but please correct me if I'm wrong.)

@denis-tingaikin
Member

@zolug

Yes, you're correct

@denis-tingaikin
Member

@zolug

I've prepared a PR with a fix: #377

Also, I've pushed an image with the fix into my docker repository.

Could you please test the image tinden/cmd-registry-k8s:v1.6.2?

Note: You just need to update the container image in the yaml file.
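
For example, assuming the registry runs as a Deployment named registry-k8s with a container of the same name in nsm-system (these names are assumptions and may differ in your manifests), the image can be swapped in place:

kubectl set image deployment/registry-k8s registry-k8s=tinden/cmd-registry-k8s:v1.6.2 -n nsm-system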

@zolug
Author

zolug commented Mar 14, 2023

Thanks a lot @denis-tingaikin!
I'll try to verify it asap.

@zolug
Author

zolug commented Mar 14, 2023

It did not seem to work as planned: the fetched NSE CR items without backing NSEs kept being refreshed. I think it could be because the registryclient introduced into the registry has a built-in refresh chain.

On the other hand, expired NSEs are registered by the registry after the fetch.

@denis-tingaikin
Member

denis-tingaikin commented Mar 14, 2023

@zolug Thank you for testing. That's why we should test fixes 😉️

Nice catch, I'll take a look into these problems.

@denis-tingaikin
Member

@zolug I've pushed a new image for testing. Could you please check tinden/cmd-registry-k8s:v1.6.2-fix.1?

@zolug
Author

zolug commented Mar 16, 2023

@denis-tingaikin Thanks, it seemed to work both for expired NSEs and for orphan NSEs.

However, I'm concerned about the convergence time: it will mostly scale with the minimum of the pod-eviction-timeout and the node restart time, so an orphan NSE might linger (irrespective of its expiration time) for minutes. This might not be acceptable considering the traffic disturbance it might cause.
Some logic triggering the "fetch based cleanup" could complement the current proposal (like a change in the endpoints of the related registry service). For a single registry it still wouldn't help, but that wouldn't be a problem for us (we run multiple instances of the k8s registry). Do you see any way to cover that?

@denis-tingaikin
Member

denis-tingaikin commented Mar 16, 2023

@zolug

I think we can find a solution for this case.

I'd like to reproduce it locally; could you please share the steps to reproduce?

@denis-tingaikin
Member

However, I'm concerned about the convergence time: it will mostly scale with the minimum of the pod-eviction-timeout and the node restart time, so an orphan NSE might linger (irrespective of its expiration time) for minutes. This might not be acceptable considering the traffic disturbance it might cause.
Some logic triggering the "fetch based cleanup" could complement the current proposal (like a change in the endpoints of the related registry service). For a single registry it still wouldn't help, but that wouldn't be a problem for us (we run multiple instances of the k8s registry). Do you see any way to cover that?

@zolug

Just for clarification, I have a few questions:

  1. Does the problem block you?
  2. Is it possible to handle the problem after release v1.6.2?

@richardstone

@denis-tingaikin

How big of a change are we talking about when it comes to the fix for the case @zolug mentioned? Would it take more time to implement than what you have done already?

@denis-tingaikin
Member

@richardstone

Currently, I am trying to understand how to reproduce the proposed case from comment #376 (comment) in the real world.

Also, if the proposed case #376 (comment) is not critical and doesn't affect key scenarios, we could consider releasing 1.6.2 based on the current fix.

So my questions are:

  1. Does the problem Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376 (comment) block any production use cases?
  2. How can I reproduce the problem Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry #376 (comment) locally? Any logs/steps to reproduce are welcome.

@ljkiraly

@denis-tingaikin, I'm trying to describe the reproduction steps, which are similar to those detailed by @zolug in the issue description, with minimal variance:

  • start Spire and the basic NSM config in Kind with 4 workers
  • deploy an NSE on the node where the registry-k8s is running
  • deploy an NSC on some other node (this setup can be achieved with a modified Kernel2Ethernet2Kernel example)
  • monitor the interface changes in the NSC
  • scale the registry-k8s to 2 replicas
  • stop the node where the NSE pod resides
  • monitor the NSE cleanup with: kubectl -n nsm-system get nse

The interface in the NSC is kept much longer than the expected 2 mins (from max-token-lifetime), and the NSE is only deleted when the new registry-k8s has started.
I hope this helps.
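
(For reference, a rough shell rendering of the failure-injection part of these steps; the deployment and node names are assumptions:)

# scale the registry to 2 replicas, then stop the node hosting the NSE pod
kubectl scale deployment registry-k8s -n nsm-system --replicas=2
docker stop kind-worker3   # assumed name of the node where the NSE pod resides
# watch how long the stale NSE custom resource survives
kubectl -n nsm-system get nse -w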

@denis-tingaikin
Member

@ljkiraly many thanks for the clarification 🙂

Could we also decide on the expected behavior?

@github-project-automation github-project-automation bot moved this from In Progress to Done in Release v1.6.2 Mar 17, 2023
@denis-tingaikin
Member

denis-tingaikin commented Mar 17, 2023

As far as I can see:

Actual: the NSC has a broken connection for the timeout period (does the timeout help?)
Expected: TODO

@denis-tingaikin
Member

denis-tingaikin commented Mar 18, 2023

@ljkiraly , @zolug

Based on the feedback, I've prepared two possible options to fix the issue:

  1. tinden/cmd-registry-k8s:v1.6.2-fix.2. It's based on the fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" #377. Note: you might also need to add a rule to the registry-k8s service account so it can work with nodes.
  2. tinden/cmd-registry-k8s:v1.6.2-fix.3. It's based on the alternative fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" sdk-k8s#431. Note: just replace the image with that version.

Please let me know whether it solves the problem.

@denis-tingaikin denis-tingaikin moved this from Done to In Progress in Release v1.6.2 Mar 20, 2023
@zolug
Author

zolug commented Mar 20, 2023

@ljkiraly , @zolug

Based on the feedback, I've prepared two possible options to fix the issue:

1. `tinden/cmd-registry-k8s:v1.6.2-fix.2`. It's based on the [fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" #377](https://github.com/networkservicemesh/cmd-registry-k8s/pull/377). Note: you might also need to add a rule to the registry-k8s service account so it can work with nodes.

2. `tinden/cmd-registry-k8s:v1.6.2-fix.3`. It's based on the [alternative fix: "Endpoint timeout lost on reboot of worker hosting NSE and k8s-registry" sdk-k8s#431](https://github.com/networkservicemesh/sdk-k8s/pull/431). Note: just replace the image with that version.

Please let me know whether it solves the problem.

@denis-tingaikin Great, thanks a lot. We're going to verify them.
I myself prefer the proposal where Find() simply ignores outdated NSEs, but we'll discuss it internally. Thanks again!

@denis-tingaikin
Member

denis-tingaikin commented Mar 20, 2023

@zolug Also, I have a question...

Does NSM endpoint selection work correctly if we have leaked NSEs?

In my mind, leaked NSEs should not affect the NSM flow related to endpoint selection.

So, I want to be sure that we don't have other bugs :)

@zolug
Author

zolug commented Mar 20, 2023

@zolug Also, I have a question...

Does NSM endpoint selection work correctly if we have leaked NSEs?

In my mind, leaked NSEs should not affect the NSM flow related to endpoint selection.

So, I want to be sure that we don't have other bugs :)

To my knowledge NSM can't tell apart leaked NSEs and available NSEs. That's the main reason this problem has caused us trouble, because Find() included NSEs that should have been expired.

@denis-tingaikin
Member

denis-tingaikin commented Mar 21, 2023

@zolug

To my knowledge NSM can't tell apart leaked NSEs and available NSEs. That's the main reason this problem has caused us trouble, because Find() included NSEs that should have been expired.

Got it. OK, we'll check that in a separate issue: networkservicemesh/sdk#1438

@ljkiraly

Hello @denis-tingaikin , @zolug ,

I verified patch no. 3 (tinden/cmd-registry-k8s:v1.6.2-fix.3) and it works fine. The interface in the NSC was gone almost immediately after the NSE's worker node was stopped.

@denis-tingaikin
Member

@LionelJouin perfect! Thank you for testing.

@edwarnicke Could you have a look at networkservicemesh/sdk-k8s#431?

@denis-tingaikin
Member

We got approval from Ed at the PR meeting.

@ljkiraly, @richardstone, @zolug we'll start the release in a few hours; I think v1.6.2 will be released tomorrow.

Let me know if we need to do something else for v1.6.2.

@denis-tingaikin denis-tingaikin moved this from In Progress to Done in Release v1.6.2 Mar 22, 2023
@denis-tingaikin
Member

denis-tingaikin commented Mar 22, 2023

@zolug, @ljkiraly, @richardstone I'm in the middle of releasing v1.6.2. It is taking more time than I expected because our automation is not good at releasing based on old branches :)

By the way, registry-k8s v1.6.2 is already available for use! cmd-registry-k8s:v1.6.2

@denis-tingaikin
Member

Release v1.6.2 is out!
https://github.com/networkservicemesh/deployments-k8s/releases/tag/v1.6.2

Can we close the ticket?

@zolug
Author

zolug commented Mar 24, 2023

Release v1.6.2 is out! https://github.com/networkservicemesh/deployments-k8s/releases/tag/v1.6.2

Can we close the ticket?

Hi @denis-tingaikin
Yes, you can close it. Thanks for taking care of this issue, especially on such short notice!

@denis-tingaikin
Member

Closed by #376 (comment)

@github-project-automation github-project-automation bot moved this from In Progress to Done in Release v1.9.0 Mar 24, 2023
@LionelJouin LionelJouin moved this from 🏗 In progress to ✅ Done in Meridio Apr 2, 2023