Handlers stop working after some time #585
We've seen this happen when our k8s masters get migrated: there's an exception in a watcher and then no more updates. In our case we somewhat resolved the issue by halting if we stop seeing expected resources coming through, but it's not the nicest fix.
Can you please try configuring the client/server timeouts for watching?

This was the case in the past in some setups: if the server is not told how long to wait, it waits forever. For reasons unknown, while the connection is alive, the server stops sending anything to it, and so the operator becomes "blind". It is generally a good idea to have a specific value there. There is no default value only because I am not sure what a good default value would be, and for how long the servers can guarantee the connection's liveness; i.e., if I set it to 10 mins (600 seconds), is that enough to solve the problem for sure?
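As a sketch of what that configuration could look like with Kopf's `settings.watching` options (the 10-minute values below are arbitrary examples, not recommendations):

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Ask the API server to close each watch stream after 10 minutes,
    # so that a silently dead stream gets re-established regularly.
    settings.watching.server_timeout = 10 * 60
    # Cut the connection from the client side a bit later, as a safety net
    # in case the server ignores the requested timeout.
    settings.watching.client_timeout = 10 * 60 + 60
```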
Ok, I've added those timeout settings and redeployed the snippet. I'll play with some permutations of that and see if it helps. One thing that I find very strange is why I don't see this problem when watching configmaps... But yeah... after several hours of just watching configmaps, I still see the handler logs appearing.

It's also worth mentioning that I'm testing in an AKS k8s cluster with only 3 nodes, and I'm the only person who has access to it. It's a very controlled environment, so there shouldn't be issues related to master migrations and the like. Anyway, I'll leave the operator running with the secret handlers for a while and see how that goes. Thanks for the suggestions 👍
I'm happy to report that after leaving the operator running all night with those settings, it's still responding this morning! I'm also not exactly sure what the correct value will be for our real cluster, which has hundreds of configmaps and secrets; I imagine it'd need to be higher? Guess we'll try a few things and see if the operator becomes "blind" or not. One thing that may be worth changing is this section in the documentation?

At least in my case, it definitely wasn't working without timeouts defined: handlers were not firing.
I cannot hint at what would be a good value; 10-60 mins sounds good enough. The framework uses the

Regarding the docs: yes, I agree, it is worth writing a note/warning there regarding these timeouts. I also thought about a warning/suggestion in the logs if nothing happens for some long period of time. Something like: "[WARNING] Nothing happened for 60 mins, which is unusual. Consider setting a server or client timeout to avoid server-side freezes. Read more at https://...docs...", and then starting a "Troubleshooting" page for such cases.

If I set the default timeout to these same 60 mins, it would cause the same weird behaviour: freezing for tens of minutes and then suddenly unfreezing, which is not what users expect.
I have not seen this issue ever since I added the client and server timeouts.
Troubleshooting nolar/kopf#585
Set a timeout on Kubernetes watches in Kopf rather than using an unlimited watch. nolar/kopf#585 says that watches without an expiration sometimes go silent and stop receiving events, which is consistent with some problems we've seen. Mark Kubernetes object parsing failures as permanent so that Kopf won't retry with the same version of the object. Mark Kubernetes API failures as temporary so that Kopf will use its normal delay.
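A hedged sketch of that permanent/temporary split in a Kopf handler; `parse_secret` and `sync_to_backend` are hypothetical placeholders for the project's own logic:

```python
import kopf

@kopf.on.create('', 'v1', 'secrets')
def secret_created(body, **_):
    try:
        parsed = parse_secret(body)        # hypothetical parsing helper
    except ValueError as exc:
        # The same object version would fail the same way on every retry,
        # so tell Kopf not to retry it at all.
        raise kopf.PermanentError(f"Cannot parse secret: {exc}") from exc

    try:
        sync_to_backend(parsed)            # hypothetical API call
    except ConnectionError as exc:
        # Transient API failures are retried after Kopf's normal delay.
        raise kopf.TemporaryError(f"API call failed: {exc}", delay=60) from exc
```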
Long story short
The handlers for `secret` resources seem to stop silently working after some time. At the start, I'm able to create `configmap`s and `secret`s with that `findme` label, and they both show events in the logs. However, after 30 mins - 1 hour... the `secret` handlers stop firing completely... (the finalizer no longer gets added to the resources either, of course). Meanwhile, the `configmap`s continue to work...

UPDATE: Tried with handlers ONLY for secrets and it still stopped working... amended the snippet to show this.
Description
The code snippet to reproduce the issue
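A minimal, illustrative sketch of an operator of the shape described above (not the original snippet), assuming a presence filter on the `findme` label via `kopf.PRESENT`:

```python
import kopf

@kopf.on.create('', 'v1', 'secrets', labels={'findme': kopf.PRESENT})
def secret_created(name, namespace, logger, **_):
    logger.info(f"Secret created: {namespace}/{name}")

@kopf.on.update('', 'v1', 'secrets', labels={'findme': kopf.PRESENT})
def secret_updated(name, namespace, logger, **_):
    logger.info(f"Secret updated: {namespace}/{name}")

@kopf.on.create('', 'v1', 'configmaps', labels={'findme': kopf.PRESENT})
def configmap_created(name, namespace, logger, **_):
    logger.info(f"ConfigMap created: {namespace}/{name}")
```

Such a file would typically be run with something like `kopf run operator.py --verbose`, after which labelled secrets and configmaps can be created or updated to check whether events keep arriving.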
Here is the `secret` I'm using for testing:

The exact command to reproduce the issue
Environment
(FROM python 3.7)
Python packages installed
(Ran directly on the container)