Handlers stop working after some time #585
We've seen this happen when our k8s masters get migrated: there's an exception in a watcher and then no more updates. In our case we somewhat resolved the issue by halting if we stop seeing expected resources coming through, but it's not the nicest fix.
Can you please try configuring the client/server timeouts for watching?

This was the case in the past in some setups: if the server is not told how long to wait, it waits forever. For reasons unknown, while the connection is alive, the server stops sending anything to it, and so the operator becomes "blind". It is generally a good idea to have a specific value there. There is no default value only because I am not sure what a good default value would be, and for how long the servers can guarantee the connection's liveness; i.e., if I set it to 10 mins (600 seconds), is that enough to solve the problem for sure?
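As a sketch of what that configuration could look like with Kopf's `settings.watching` options (the 10-minute values below are arbitrary examples, not recommendations):

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Ask the API server to close each watch stream after 10 minutes,
    # so that a silently dead stream gets re-established regularly.
    settings.watching.server_timeout = 10 * 60
    # Cut the connection from the client side a bit later, as a safety net
    # in case the server ignores the requested timeout.
    settings.watching.client_timeout = 10 * 60 + 60
```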
Ok, I've added those timeout settings and redeployed the snippet. I'll play with some permutations of that and see if it helps. One thing that I find very strange is why I don't see this problem when watching configmaps... But yeah... after several hours of just watching configmaps, I still see the handler logs appearing.

It's also worth mentioning that I'm testing in an AKS k8s cluster with only 3 nodes, and I'm the only person who has access to it. It's a very controlled environment, so there shouldn't be issues related to master migrations and the like. Anyway, I'll leave the operator running with the secret handlers for a while and see how that goes. Thanks for the suggestions 👍
I'm happy to report that after leaving the operator running all night with those settings, it's still responding this morning! I'm also not exactly sure what the correct value will be for our real cluster, which has hundreds of configmaps and secrets; I imagine it'd need to be higher? Guess we'll try a few things and see if the operator becomes "blind" or not. One thing that may be worth changing is this section in the documentation?

At least in my case, it definitely wasn't working without timeouts defined: handlers were not firing.
I cannot hint at what would be a good value; 10-60 mins sounds good enough. The framework uses the

Regarding the docs: yes, I agree, it is worth writing a note/warning there regarding these timeouts. I also thought about a warning/suggestion in the logs if nothing happens for some long period of time. Something like: "[WARNING] Nothing happened for 60 mins, which is unusual. Consider setting a server or client timeout to avoid server-side freezes. Read more at https://...docs...", and then starting a "Troubleshooting" page for such cases.

If I set the default timeout to these same 60 mins, it would cause the same weird behaviour: freezing for tens of minutes and then suddenly unfreezing, which is not what users expect.
I have not seen this issue ever since I added the client and server timeouts.
Troubleshooting nolar/kopf#585
Set a timeout on Kubernetes watches in Kopf rather than using an unlimited watch. nolar/kopf#585 says that watches without an expiration sometimes go silent and stop receiving events, which is consistent with some problems we've seen. Mark Kubernetes object parsing failures as permanent so that Kopf won't retry with the same version of the object. Mark Kubernetes API failures as temporary so that Kopf will use its normal delay.
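A hedged sketch of that permanent/temporary split in a Kopf handler; `parse_secret` and `sync_to_backend` are hypothetical placeholders for the project's own logic:

```python
import kopf

@kopf.on.create('', 'v1', 'secrets')
def secret_created(body, **_):
    try:
        parsed = parse_secret(body)        # hypothetical parsing helper
    except ValueError as exc:
        # The same object version would fail the same way on every retry,
        # so tell Kopf not to retry it at all.
        raise kopf.PermanentError(f"Cannot parse secret: {exc}") from exc

    try:
        sync_to_backend(parsed)            # hypothetical API call
    except ConnectionError as exc:
        # Transient API failures are retried after Kopf's normal delay.
        raise kopf.TemporaryError(f"API call failed: {exc}", delay=60) from exc
```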
Long story short
The handlers for `secret` resources seem to stop silently working after some time. At the start, I'm able to create `configmap`s and `secret`s with that `findme` label, and they both show events in the logs. However, after 30 mins - 1 hour... the `secret` handlers stop firing completely... (the finalizer no longer gets added to the resources either, of course). Meanwhile, the `configmap`s continue to work...

UPDATE: Tried with handlers ONLY for secrets and it still stopped working... amended the snippet to show this.
Description
The code snippet to reproduce the issue
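A minimal, illustrative sketch of an operator of the shape described above (not the original snippet), assuming a presence filter on the `findme` label via `kopf.PRESENT`:

```python
import kopf

@kopf.on.create('', 'v1', 'secrets', labels={'findme': kopf.PRESENT})
def secret_created(name, namespace, logger, **_):
    logger.info(f"Secret created: {namespace}/{name}")

@kopf.on.update('', 'v1', 'secrets', labels={'findme': kopf.PRESENT})
def secret_updated(name, namespace, logger, **_):
    logger.info(f"Secret updated: {namespace}/{name}")

@kopf.on.create('', 'v1', 'configmaps', labels={'findme': kopf.PRESENT})
def configmap_created(name, namespace, logger, **_):
    logger.info(f"ConfigMap created: {namespace}/{name}")
```

Such a file would typically be run with something like `kopf run operator.py --verbose`, after which labelled secrets and configmaps can be created or updated to check whether events keep arriving.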
Here is the `secret` I'm using for testing:

The exact command to reproduce the issue
Environment
(FROM python 3.7)
Python packages installed
(Ran directly on the container)