
Endless touch-dummy Annotation Patching On Pod Causes High API Load #686

Closed
paxbit opened this issue Feb 16, 2021 · 21 comments · Fixed by #731
Labels: bug (Something isn't working)

Comments


paxbit commented Feb 16, 2021

Long story short

While looking into #685, I noticed weird patching behavior from kopf regarding the touch-dummy annotation. As soon as a pod reaches the completed state, kopf constantly patches the touch-dummy annotation on the pod, creating significant load on the API server.

Description

The operator has a handler watching for completed pods. It is decorated like so:

    @kopf.on.resume("", "v1", "pods", labels=cfg.test_pod_selector_label_dict)
    @kopf.on.field(
        "",
        "v1",
        "pods",
        id="await_pod_term",
        field="status.phase",
        labels=cfg.test_pod_selector_label_dict,
    )

There is also a timer with a 10-second interval handling unschedulable pods.

    @kopf.on.timer(
        "",
        "v1",
        "pods",
        interval=cfg.watch_for_unschedulable_pod.interval,
        sharp=cfg.watch_for_unschedulable_pod.sharp,
        labels=cfg.test_pod_selector_label_dict,
        when=lambda namespace, name, status, **_: pods.get_pod_condition_or_none_from( 
            status, pods.PodCondTypeEnum.PodScheduled, pods.PodCondStatusEnum.false
        )
        is not None,
    )
I have the following pod, which has been sitting in the completed state for ~23h now. (Showing only the metadata, as that should be the relevant part, I guess.)
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kopf.zalando.org/await_pod_term.status.phase: '{"started":"2021-02-15T22:02:03.350703","stopped":"2021-02-15T22:02:03.353473","purpose":"update","retries":1,"success":true,"failure":false}'
    kopf.zalando.org/last-handled-configuration: |
      {"spec":{"volumes":[{"name":"test-config","configMap":{"name":"ts-job-1179-imuvqvkxczsjxvtpmau9pw","defaultMode":420}},{"name":"rasbuild","persistentVolumeClaim":{"claimName":"job-sesam-test-1179-rasbuild"}},{"name":"default-token-xlckc","secret":{"secretName":"default-token-xlckc","defaultMode":420}}],"conta
iners":[{"name":"ts-runner","image":"[redacted]","command":["/bin/sh"],"args":["-c","python [redacted]"],"env":[{"name":"TS_RABBITMQ_HOST","value":"[redacted]"},{"name":"TS_RABBITMQ_USER","value":"[redacted]"},{"name":"TS_RABBITMQ_PASSWORD","value":"[redacted]"}],"resources":{"limits":{"cpu":"2","memory":"2Gi"},"requests":{"cpu":"100m","memory":"1Gi"}},"volumeMounts":[{"name":"test-config","mountPath":"[redacted]"},{"name":"rasbuild","mountPath":"[redacted]"},{"name":"default-token-xlckc","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"lifecycle":{"preStop":{"exec":{"command":["echo","'bye'"]}}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always","securityContext":{"allowPrivilegeEscalation":false}},{"name":"ts-agent","image":"[redacted]","ports":[{"name":"agent-endpoint","containerPort":2233,"protocol":"TCP"}],"resources":
{"requests":{"cpu":"500m","memory":"128Mi"}},"volumeMounts":[{"name":"default-token-xlckc","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/service
account"}],"lifecycle":{"preStop":{"exec":{"command":["echo","'bye'"]}}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always","securityContext":{"allowPrivilegeEscalation":false}}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"Cluster
First","serviceAccountName":"default","serviceAccount":"default","nodeName":"[redacted]","securityContext":{},"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"feature.node.kubernetes.io/cpu-cpuid.VMX","operator":"In","values":["true"]}]}]}}},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}
,"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priorityClassName":"ts-default-priority-intel","priority":1,"enableServiceLinks":false},"metadata":{"labels":{"app":"ts-runner","ts.job.name":"sesam-test","ts.job.run-id":"1179"}},"status":{"phase":"Running"}}
    kopf.zalando.org/mark_containters_ready: '{"started":"2021-02-15T22:02:03.350719","purpose":"update","retries":0,"success":false,"failure":false}'
    kopf.zalando.org/register_pod_handling.__locals__.watch_-P7tEVg: '{"started":"2021-02-16T10:03:13.216603","stopped":"2021-02-16T10:03:13.218999","purpose":"update","retries":1,"success":true,"failure":false}'
    kopf.zalando.org/register_pod_handling.__locals__.watch_for_completed_pod: '{"started":"2021-02-16T10:03:13.216603","stopped":"2021-02-16T10:03:13.218999","purpose":"update","retries":1,"success":true,"failure":false}'
    kopf.zalando.org/touch-dummy: 2021-02-16T21:23:57.630443
  creationTimestamp: "2021-02-15T22:01:34Z"
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  labels:
    app: ts-runner
    ts.job.name: sesam-test
    ts.job.run-id: "1179"
  name: ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
  namespace: kopf-prototype-target
  ownerReferences:
  - apiVersion: [redacted]/v1
    blockOwnerDeletion: true
    controller: true
    kind: TsJob
    name: ts-job-1179-evuncgeuwcqrndyiqdsmk6
    uid: 007cc7a9-a1d7-4013-9579-e12edfe8988b
  resourceVersion: "5215406"
  selfLink: /api/v1/namespaces/kopf-prototype-target/pods/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
  uid: fba8f54e-2a75-4075-9420-734f92599477
...
[redacted]
...

As soon as kopf starts, the touch-dummy updates on the completed pod begin, multiple times per second.

kubectl -n kopf-prototype-target get pod --watch --template='{{.metadata.name}}: {{index .metadata.annotations "kopf.zalando.org/touch-dummy"}}{{"\n"}}'
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:18.714713
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:18.886211
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:19.068260
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:19.251591
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:19.425195
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:19.596808
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:19.764809
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:19.933829
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:20.124813
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:20.307736
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:20.485288
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:20.665172
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:20.846178
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:21.022081
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:21.196700
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:21.370981
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:21.550282
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:21.718212
ts-job-runner-1179-jcxbd3kbmruutrshxpky5k: 2021-02-16T17:23:21.889359
...

Do you have any idea why that is happening? When debugging this, it looked like every MODIFIED event on the pod triggers the patching. The patching, however, creates a new MODIFIED event, triggering another patch > a new event > rinse > repeat.

Environment

  • Kopf version: 1.29.2
  • Kubernetes version: 1.17
  • Python version: 3.9.1rc1
  • OS/platform: Linux
Python packages installed
aioconsole==0.3.1
aioelasticsearch==0.7.0
aiohttp==3.7.3
aiojobs==0.3.0
aiomonitor==0.4.5
alembic==1.4.3
amqp==5.0.2
appdirs==1.4.4
APScheduler==3.6.3
arrow==0.17.0
async-timeout==3.0.1
asynctest==0.13.0
asynqp==0.6
attrs==20.3.0
billiard==3.6.3.0
black==20.8b1
bleach==3.2.1
blinker==1.4
cachetools==4.1.1
celery==5.0.2
certifi==2020.11.8
cffi==1.14.4
chardet==3.0.4
click==5.1
click-didyoumean==0.0.3
click-repl==0.1.6
colorama==0.4.4
croniter==0.3.36
cryptography==3.2.1
distro==1.5.0
docker==4.4.0
docutils==0.16
elasticsearch==7.10.0
elasticsearch-dsl==7.3.0
fcache==0.4.7
flake8==3.8.4
flake8-codeclimate==0.3.0
flake8-comprehensions==3.3.0
Flask==1.1.2
freezegun==1.0.0
future==0.18.2
gitdb==4.0.5
GitPython==3.1.11
google-auth==1.23.0
httplib2==0.18.1
humanize==3.1.0
idna==2.10
iniconfig==1.1.1
iso8601==0.1.13
isort==5.6.4
itsdangerous==1.1.0
janus==0.6.1
jeepney==0.6.0
Jinja2==2.11.2
json-tricks==3.15.5
keyring==21.5.0
kombu==5.0.2
kopf==1.29.2
kubernetes==12.0.1
ldap3==2.8.1
lxml==4.6.2
lz4==3.1.1
Mako==1.1.3
MarkupSafe==1.1.1
matrix-client==0.3.2
mccabe==0.6.1
mmh3==2.5.1
multidict==5.1.0
mypy-extensions==0.4.3
natsort==7.1.0
oauth2client==4.1.3
oauthlib==3.1.0
opentelemetry-api==0.16b1
opentelemetry-exporter-jaeger==0.16b1
opentelemetry-instrumentation==0.16b1
opentelemetry-instrumentation-aiohttp-client==0.16b1
opentelemetry-instrumentation-sqlalchemy==0.16b1
opentelemetry-sdk==0.16b1
packaging==20.7
pathspec==0.8.1
pip==21.0.1
pkg-resources==0.0.0
pkginfo==1.6.1
pluggy==0.13.1
prometheus-client==0.9.0
prompt-toolkit==3.0.8
psutil==5.7.3
psycopg2-binary==2.8.6
py==1.9.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.6.0
pycparser==2.20
pydantic==1.7.3
pydevd-pycharm==203.5981.165
pyflakes==2.2.0
Pygments==2.7.2
pykube-ng==20.10.0
pyOpenSSL==20.0.0
pyparsing==2.4.7
pystack-debugger==0.9.0
pytest==6.1.2
pytest-aiohttp==0.3.0
pytest-asyncio==0.14.0
pytest-freezegun==0.4.2
pytest-mock==3.3.1
python-dateutil==2.8.1
python-editor==1.0.4
python-json-logger==2.0.1
pytz==2020.4
PyYAML==5.3.1
readme-renderer==28.0
regex==2020.11.13
requests==2.25.0
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rfc3986==1.4.0
rsa==4.6
sd-notify==0.1.0
SecretStorage==3.3.0
sentry-sdk==0.19.4
setproctitle==1.2
setuptools==44.0.0
setuptools-scm==4.1.2
shortuuid==1.0.1
singleton-decorator==1.0.0
six==1.15.0
smmap==3.0.4
SQLAlchemy==1.3.20
tabulate==0.8.7
termcolor==1.1.0
terminaltables==3.1.0
TestLink-API-Python-client==0.8.1
thrift==0.13.0
toml==0.10.2
tqdm==4.54.0
twine==3.2.0
typed-ast==1.4.1
typing-extensions==3.7.4.3
tzlocal==2.1
UnleashClient==3.5.0
urllib3==1.26.2
uvloop==0.14.0
vine==5.0.0
voluptuous==0.12.0
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==0.57.0
Werkzeug==1.0.1
wheel==0.36.0
wrapt==1.12.1
yarl==1.6.3
zstandard==0.14.0
zulip==0.7.1
paxbit added the bug label on Feb 16, 2021

nolar commented Feb 17, 2021

The touch-dummy patch is used to artificially trigger a change and get back to Kopf's unfinished tasks. But it is used only if there are unfinished tasks (handlers, in this case). Usually (but not always), this happens after sleeping for some time. The default delay is 60s, so this flooding looks unusual.

The only unfinished handler I see is mark_containters_ready, but I do not know what it is and what it does.

Can you please run the operator with --verbose and copy the logs here (maybe only the excerpts around the touch-dummies)? That might help — it would show if there are any attempts to sleep before that, for how long, why so, and if there are any "temporary errors" with delay=0 preceding that.


paxbit commented Feb 17, 2021

Thank you for your quick response!

Enabling debug and verbose generates the following log:

INFO:kopf.activities.authentication:Activity 'KopfRunner.__kopf_thread_setup.<locals>.login_fn' succeeded.
INFO:kopf.reactor.activities:Initial authentication has finished.
WARNING:asyncio:Executing <Task pending name='credentials retriever' coro=<guard() running at [redacted]/venv/lib/python3.9/site-packages/kopf/utilities/aiotasks.py:69> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5d882afe80>()] created at /usr/lib/python3.9/asyncio/locks.py:287> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.134 seconds
[2021-02-17 12:20:50,377] asyncio              [WARNING ] Executing <Task pending name='credentials retriever' coro=<guard() running at [redacted]/venv/lib/python3.9/site-packages/kopf/utilities/aiotasks.py:69> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5d882afe80>()] created at /usr/lib/python3.9/asyncio/locks.py:287> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.134 seconds
WARNING:kopf.reactor.running:OS signals are ignored: running not in the main thread.
DEBUG:asyncio:<uvloop.loop.SSLProtocol object at 0x7f5d885f4e00> starts SSL handshake
[2021-02-17 12:20:50,481] asyncio              [DEBUG   ] <uvloop.loop.SSLProtocol object at 0x7f5d885f4e00> starts SSL handshake
DEBUG:asyncio:<uvloop.loop.SSLProtocol object at 0x7f5d885f4720> starts SSL handshake
... lots of those handshake logs cut out ...
DEBUG:asyncio:<uvloop.loop.SSLProtocol object at 0x7f5d7878be00>: SSL handshake took 188.0 ms
[2021-02-17 12:20:51,061] asyncio              [DEBUG   ] <uvloop.loop.SSLProtocol object at 0x7f5d7878be00>: SSL handshake took 188.0 ms
INFO:ts_runner_operator.kopf_operator.handlers.job_handling:Waiting for quotas to become fully initialized before requesting jobs...
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:51.718131'}, None),)
INFO:kopf.objects:Resuming is processed: 1 succeeded; 0 failed.
WARNING:asyncio:Executing <Task pending name='Task-294' coro=<worker() running at [redacted]/venv/lib/python3.9/site-packages/kopf/reactor/queueing.py:254> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5d785f6670>()] created at /usr/lib/python3.9/asyncio/tasks.py:458> cb=[Job._done_callback()] created at [redacted]/venv/lib/python3.9/site-packages/aiojobs/_job.py:108> took 0.135 seconds
[2021-02-17 12:20:52,148] asyncio              [WARNING ] Executing <Task pending name='Task-294' coro=<worker() running at [redacted]/venv/lib/python3.9/site-packages/kopf/reactor/queueing.py:254> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5d785f6670>()] created at /usr/lib/python3.9/asyncio/tasks.py:458> cb=[Job._done_callback()] created at [redacted]/venv/lib/python3.9/site-packages/aiojobs/_job.py:108> took 0.135 seconds
WARNING:asyncio:Executing <Task pending name='runner of register_cleanup_handling.__locals__.cleanup_stale_ts_jobs' coro=<_runner() running at [redacted]/venv/lib/python3.9/site-packages/kopf/reactor/daemons.py:332> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5d7854a2e0>()] created at /usr/lib/python3.9/asyncio/tasks.py:458> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.185 seconds
[2021-02-17 12:20:52,353] asyncio              [WARNING ] Executing <Task pending name='runner of register_cleanup_handling.__locals__.cleanup_stale_ts_jobs' coro=<_runner() running at [redacted]/python3.9/site-packages/kopf/reactor/daemons.py:332> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f5d7854a2e0>()] created at /usr/lib/python3.9/asyncio/tasks.py:458> created at /usr/lib/python3.9/asyncio/tasks.py:361> took 0.185 seconds
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:52.417689'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:52.779093'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:53.157254'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:53.528505'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:54.031613'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:55.236738'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:55.933774'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:56.323987'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:56.721222'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:57.341201'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:57.855112'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:58.340119'}, None),)
Affected resource was: Pod/ts-job-runner-1179-jcxbd3kbmruutrshxpky5k
WARNING:kopf.objects:Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'dummy': '2021-02-17T11:20:59.124779'}, None),)
---
Here I manually stopped the operator.
---
INFO:ts_runner_operator.kopf_operator.kopf_runner:Signaling kopf termination...
INFO:kopf.reactor.running:Stop-flag is set to True. Operator is stopping.

The unfinished handler basically does this: it flags the resource owning the pod for which the handler was triggered with a "ready" state. It also handles the edge case of deleting the pod when, for whatever reason, its parent is missing.

def mark_containters_ready(namespace, name, status, body, meta, **_):
    # `k_api`, `ts_job_or_none_from`, `events`, `state` and `JobStateEnum`
    # come from the operator's own codebase; `pykube` is pykube-ng.
    pod = pykube.Pod(k_api, body)
    # `ts_job` is a custom resource owning the pod
    ts_job = ts_job_or_none_from(pod)

    if ts_job is None:
        events.push_operator_event(
            "Warning",
            pod.namespace,
            "TsRunnerCreation",
            "ParentJobMissing",
            f"Pod {pod.namespace}/{pod.name} became ready however its parent TsJob is missing. Deleting orphaned pod",
            pod,
        )
        pod.delete(force=True)
        return

    if ts_job.obj["status"]["state"] != JobStateEnum.Running.value:
        state.update_state_on(
            ts_job,
            JobStateEnum.Running,
            "PodReady",
        )


paxbit commented Feb 17, 2021

After configuring the progress storage to kopf.AnnotationsProgressStorage(...) (see #685), the constant touch-dummy patches stopped. What this single change also did, however, was allow the watch_for_completed_pod handler to fire on the completed pod shown above.

I'm a bit baffled though. I have been developing this operator for the last couple of weeks, always aware that at some point I would need to deal with this patch warning (#685). I treated it as a nuisance because within all that time the patch warning only appeared for completed or terminating pods. And since all the operator handlers were firing as expected, as was the watch_for_completed_pod handler, I did not see enough reason to dig deeper. From my perspective, it looks like the very same codebase was able to orchestrate thousands of those always-identical pods (we're still in the test phase) without issue and without dropping out on any other completed pod before, always calling the completion handler and never firing large amounts of touch-dummy patches.

I just went back trying to reproduce this by:

  1. Configuring status progress storage again
  2. Letting the operator spawn the pod
  3. Stopping the operator
  4. Wait for the pod to complete
  5. Start the operator again

I was not successful though. Even when using the status storage again, with the same test lifecycle that was triggering the constant patch condition before, the endless patching did not happen now, and the watch_for_completed_pod handler was called immediately as expected. It basically worked like it did all along :|

So with this in mind, I suspect something went "wrong" with this one particular pod, presumably something in the metadata. I will try to reproduce this by manually recreating a completed pod with the same metadata as the one above. This will take a moment though, and I hope this will not turn out to be a Heisenbug.


paxbit commented Feb 17, 2021

After some testing I found out the following things.

I started by letting a pod run to completion and then applying the following annotations from the problematic pod. Then I started the operator and observed the constant patch attempts again.

metadata:
  annotations:
    kopf.zalando.org/await_pod_term.status.phase: '{"started":"2021-02-15T22:02:03.350703","stopped":"2021-02-15T22:02:03.353473","purpose":"update","retries":1,"success":true,"failure":false}'
    kopf.zalando.org/last-handled-configuration: |
      {"spec":{"volumes":[{"name":"test-config","configMap":{"name":"ts-job-1179-imuvqvkxczsjxvtpmau9pw","defaultMode":420}},{"name":"rasbuild","persistentVolumeClaim":{"claimName":"job-sesam-test-1179-rasbuild"}},{"name":"default-token-xlckc","secret":{"secretName":"default-token-xlckc","defaultMode":420}}],"containers":[{"name":"ts-runner","image":"[redacted]/ts_kube_all:dev","command":["/bin/sh"],"args":["-c","python [redacted]"],"env":[{"name":"TS_RABBITMQ_HOST","value":"[redacted]"},{"name":"TS_RABBITMQ_USER","value":"[redacted]"},{"name":"TS_RABBITMQ_PASSWORD","value":"[redacted]"}],"resources":{"limits":{"cpu":"2","memory":"2Gi"},"requests":{"cpu":"100m","memory":"1Gi"}},"volumeMounts":[{"name":"test-config","mountPath":"/opt/ts/jobspec"},{"name":"rasbuild","mountPath":"/mnt/rasimages"},{"name":"default-token-xlckc","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"lifecycle":{"preStop":{"exec":{"command":["echo","'bye'"]}}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always","securityContext":{"allowPrivilegeEscalation":false}},{"name":"ts-agent","image":"[redacted]","ports":[{"name":"agent-endpoint","containerPort":[redacted],"protocol":"TCP"}],"resources":{"requests":{"cpu":"500m","memory":"128Mi"}},"volumeMounts":[{"name":"default-token-xlckc","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"lifecycle":{"preStop":{"exec":{"command":["echo","'bye'"]}}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always","securityContext":{"allowPrivilegeEscalation":false}}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"default","serviceAccount":"default","nodeName":"k-test-n1","securityContext":{},"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"feature.node.kubernetes.io/cpu-cpuid.VMX","operator":"In","values":["true"]}]}]}}},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priorityClassName":"ts-default-priority-intel","priority":1,"enableServiceLinks":false},"metadata":{"labels":{"app":"ts-runner","ts.job.name":"sesam-test","ts.job.run-id":"1179"}},"status":{"phase":"Running"}}
    kopf.zalando.org/mark_containters_ready: '{"started":"2021-02-15T22:02:03.350719","purpose":"update","retries":0,"success":false,"failure":false}'
    kopf.zalando.org/register_pod_handling.__locals__.watch_-P7tEVg: '{"started":"2021-02-16T10:03:13.216603","stopped":"2021-02-16T10:03:13.218999","purpose":"update","retries":1,"success":true,"failure":false}'
    kopf.zalando.org/register_pod_handling.__locals__.watch_for_completed_pod: '{"started":"2021-02-16T10:03:13.216603","stopped":"2021-02-16T10:03:13.218999","purpose":"update","retries":1,"success":true,"failure":false}'
    kopf.zalando.org/touch-dummy: 2021-02-16T21:23:57.630443

I can make the repeated patch attempts stop by doing either of the following:

  1. Deleting annotation kopf.zalando.org/last-handled-configuration
  2. Deleting annotation kopf.zalando.org/mark_containters_ready

Any idea why that is?

BTW, mark_containters_ready is decorated like so:

    @kopf.on.resume(
        "",
        "v1",
        "pods",
        labels=cfg.test_pod_selector_label_dict,
        when=lambda namespace, name, status, **_: status.get("phase")  # noqa
        == PodPhaseEnum.Running.value
        and pods.get_pod_condition_or_none_from(
            status,
            pods.PodCondTypeEnum.ContainersReady,
            pods.PodCondStatusEnum.true,
        ),
    )
    @kopf.on.update(
        "",
        "v1",
        "pods",
        id="mark_containters_ready",
        labels=cfg.test_pod_selector_label_dict,
        when=lambda namespace, name, status, **_: status.get("phase")  # noqa
        == PodPhaseEnum.Running.value
        and pods.get_pod_condition_or_none_from(
            status,
            pods.PodCondTypeEnum.ContainersReady,
            pods.PodCondStatusEnum.true,
        ),
    )


paxbit commented Mar 1, 2021

@nolar

I can reproduce the behavior with this:

Resources

touch-hammer.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    kopf.zalando.org/last-handled-configuration: |
      {"metadata":{"labels":{"app":"ts-runner"}}}
    anon.operator.ts/kopf-managed: "yes"
    anon.operator.ts/the_handler: '{"started":"2021-03-01T12:34:03.155586","purpose":"update","retries":0,"success":false,"failure":false}'
    anon.operator.ts/touch-dummy: 2021-03-01T20:02:40.186348
  finalizers:
  - kopf.zalando.org/KopfFinalizerMarker
  labels:
    app: foo-runner
  name: bug
  namespace: bug-target
data:
  phase: "Done"
touch-hammer.py
import logging
import asyncio
import threading
import contextlib

import kopf

logger = logging.getLogger(__name__)

selector = {"app": "foo-runner"}

class KopfRunner(object):
    def __init__(self) -> None:
        self.readyEvent = threading.Event()
        self.stopFlag = threading.Event()

        self.kopfThread = threading.Thread(
            name="kopf-main",
            target=self.__setup_kopf_event_loop,
            kwargs=dict(
                stop_flag=self.stopFlag,
                ready_flag=self.readyEvent,
            ),
        )

    @kopf.on.startup()
    def configure(settings: kopf.OperatorSettings, **_):
        settings.persistence.progress_storage = kopf.AnnotationsProgressStorage(
            prefix="anon.operator.ts"
        )

    @kopf.on.login(retries=1, backoff=3.0)
    def login_fn(**kwargs):
        return kopf.login_via_pykube(**kwargs)

    @kopf.on.resume(
        "",
        "v1",
        "configmaps",
        labels=selector,
    )
    @kopf.on.update(
        "",
        "v1",
        "configmaps",
        id="the_handler",
        labels=selector,
        when=lambda body, **_: body["data"].get("phase") == "Foobar",
    )
    def the_handler(**_):
        print("the_handler called")

    @staticmethod
    def __setup_kopf_event_loop(
        ready_flag: threading.Event, stop_flag: threading.Event
    ):
        kopf_loop = asyncio.new_event_loop()
        asyncio.set_event_loop(kopf_loop)

        with contextlib.closing(kopf_loop):
            kopf.configure(verbose=True, debug=True)
            kopf_loop.run_until_complete(
                kopf.operator(
                    namespace="bug-target",
                    ready_flag=ready_flag,
                    stop_flag=stop_flag,
                )
            )

    def start(self):
        logger.info("Starting kopf...")

        self.kopfThread.start()

        self.readyEvent.wait()
        logger.info("Kopf ready.")

        self.stopFlag.wait()


if __name__ == "__main__":

    runner = KopfRunner()
    runner.start()

Setup And Run

  1. kubectl create ns bug-target
  2. kubectl apply -f touch-hammer.yaml
  3. In secondary terminal run:
    kubectl -n bug-target get cm --watch --template='{{.metadata.name}}: {{index .metadata.annotations "anon.operator.ts/touch-dummy"}}{{"\n"}}'
  4. python touch-hammer.py

Result

When I do this I get:

kubectl -n bug-target get cm --watch --template='{{.metadata.name}}: {{index .metadata.annotations "anon.operator.ts/touch-dummy"}}{{"\n"}}'
bug: 2021-03-01T21:20:12.765007
bug: 2021-03-01T21:21:10.100777
bug: 2021-03-01T21:21:10.279276
bug: 2021-03-01T21:21:10.460133
bug: 2021-03-01T21:21:10.648754
bug: 2021-03-01T21:21:10.835736
bug: 2021-03-01T21:21:11.033883
bug: 2021-03-01T21:21:11.219593
bug: 2021-03-01T21:21:11.394481
bug: 2021-03-01T21:21:11.602270
bug: 2021-03-01T21:21:11.784301
bug: 2021-03-01T21:21:11.971768
bug: 2021-03-01T21:21:12.143277
bug: 2021-03-01T21:21:12.314016
bug: 2021-03-01T21:21:12.490774
bug: 2021-03-01T21:21:12.686455
bug: 2021-03-01T21:21:12.871613
bug: 2021-03-01T21:21:13.048998

... touch dummies in quick succession.

[email protected]

Background

When running the operator, after some time it ends up leaving Pods in a completed state without calling the appropriate handler finalizing those Pods. Instead, those Pods are touched by kopf with a delay dictated only by the HTTP latency to the API. The above watch output was from my local machine, which talks to the cluster via VPN. Running it in-cluster might cause 50-100 requests/s/pod. This happens after the operator handled hundreds or thousands of instances of the same Pod without issue.

I'd appreciate your feedback on this.
Thanks!


paxbit commented Mar 2, 2021

Analysis

Here's what I was able to find out about the behavior while debugging kopf using the above example.

kopf/reactor/processing.py:376
If a handler has no outcome, which is the case for the_handler from the example b/c its when= does not match, then in...

kopf/storage/states.py:230
... handler_state.with_outcome(...) is not called, which seems to be the only place where HandlerState.delayed is actually calculated. In my understanding, this is crucial for understanding the touch-dummy behavior, as...

kopf/reactor/processing.py:396
... calls kopf.storage.states.State.delays which bases its "delays" calculation on the aforementioned HandlerState.delayed. Since that is always None, the delay will always be zero.

The resulting zeroed resource_changing_delays (kopf/reactor/processing.py:220) are then handed down to kopf/reactor/processing.py:79 where they are passed to effects.apply(...) at kopf/reactor/processing.py:101. This is where the touch is actually triggered...

kopf/reactor/effects.py:66
Holds the following note:
# Note: a zero-second or negative sleep is still a sleep, it will trigger a dummy patch.
which is then duly followed through on at kopf/reactor/effects.py:91, explicitly for a zero delay.

At this point the circle closes as the touch triggers an UPDATE event which causes the same chain of events over and over again.
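
To make this concrete, here is a self-contained toy model of that delays computation (not kopf's actual classes — it only mimics the "if handler_state.delayed else 0" pattern of State.delays in kopf/storage/states.py, applied to the two handlers from the example):

from datetime import datetime, timezone
from typing import List, Optional

class ToyHandlerState:
    def __init__(self, finished: bool, delayed: Optional[datetime]) -> None:
        self.finished = finished
        self.delayed = delayed

def toy_delays(states: List[ToyHandlerState]) -> List[float]:
    now = datetime.now(timezone.utc)
    return [
        # Same pattern as in State.delays: a missing `delayed` becomes 0, not "no delay at all".
        max(0, (state.delayed - now).total_seconds()) if state.delayed else 0
        for state in states
        if not state.finished
    ]

# The example's situation: the resume handler is finished; the update handler is
# unfinished but filtered out by when=False, so its `delayed` was never set.
states = [
    ToyHandlerState(finished=True, delayed=None),   # on.resume: done
    ToyHandlerState(finished=False, delayed=None),  # on.update: never selected
]
print(toy_delays(states))  # -> [0]: a zero delay, which effects.apply(...) turns into a touch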

@nolar
I believe I do not yet understand enough about the rationale behind implementing it like this, but I think the contract of HandlerState.delayed (kopf/storage/states.py:55) is not implemented correctly in cases where a handler's when= condition is given but not matched. What do you think?


paxbit commented Mar 4, 2021

Hi @nolar,

it would be awesome if you could drop a quick note saying whether you acknowledge this as a bug, have had no time to look into it so far but may do so in the future, or think I'm mistaken or have misapplied some of kopf's concepts. I'm happy with any of those, or whatever else it might be. I just want to avoid working on a PR and then, at 80% into completing it, reading from you that you also implemented a change, or that you think some of my previous assumptions were wrong, or that maybe I should RTFM (I did, but maybe my glasses were dirty o_0).


nolar commented Mar 4, 2021

Hello @paxbit.

Sorry for the long silence. Yes, I saw the messages here but didn't yet have time to dive deep enough to understand what is happening and why.

Regarding your last comment — this is an excellent explanation of what is happening (or might be happening). I usually have trouble with those "delays" myself (see below).

The rationale behind treating 0 as a real sleep is that 0 is not far away from e.g. 0.1 (or -0.1, i.e. an overdue sleep-and-touch), which would cause the same API flooding problems. If the delays are to be prevented, they should be all None, not 0.

What's confusing me is that there is an outcome for a handler with a mismatching when= filter. There should be no outcome. It should not be even attempted (because it should not be even selected).

I can neither confirm nor deny that your suspicion is correct here — need to experiment with that case myself (maybe the next weekend).

But where I would blindly(*) put my suspicion is in effects.apply(). I always had a suspicion that it has some bug(s) with those numerous if's, and is not aligned with sleep_or_wait() in regards to None delays (which should be treated as inexistent) — but I wasn't able to imagine or simulate the case when this would happen. Maybe I can catch one with your snippets. Or maybe you can tell what's the value of resource_changing_delays in your case?

(*) blindly — as in "gut-feeling-based".


nolar commented Mar 4, 2021

@paxbit Regarding the 80% — I am now at 0% of this. So, if you have a fix, I would be happy not to dive deep into this bug, and to continue with another task (admission hooks). Just let's align on what exactly is happening and what is the proposed fix in general (verbally) — it might affect other aspects unintentionally.


paxbit commented Mar 4, 2021

Hi @nolar,

Thanks for the reply! I very well understand a tight time budget ;)

What's confusing me is that there is an outcome for a handler with a mismatching when= filter. There should be no outcome. It should not be even attempted (because it should not be even selected).

I probably could have written it more clearly. I actually did find what you say: there is no outcome, and I do not see an attempt to call the handler to get one. So this part seems correct to me.

But where I would blindly(*) put my suspicion is in effects.apply(). I always had a suspicion that it has some bug(s) with those numerous if's, and is not aligned with sleep_or_wait() in regards to None delays (which should be treated as inexistent) — but I wasn't able to imagine or simulate the case when this would happen. Maybe I can catch one with your snippets. Or maybe you can tell what's the value of resource_changing_delays in your case?

I agree the complexity of those ifs could be lower. I'm not sure I understand what you meant by saying "not aligned with sleep_or_wait() in regards to None delays". Do you mean those should also trigger a zero length sleep?

resource_changing_delays is [0] when I step through the example.


Today I turned to the kopf code base and thought about a fix, see below, but could you please verify the following of my assumptions.

  • Writing the expression HandlerState.delayed is not None is meant to be equivalent to saying "the handler threw something". Its value is established as a timestamp saying "retry no earlier than this point in time". The actual delay interval always comes from either the decorator's backoff=... or the kopf.*Error(delay=...) (see the snippet after this list).

  • The only purpose of the sleeping in effects.apply(...) is to actually back-off before retrying. If new events interrupt the sleeping then that new event will not trigger the delayed handler because of handlers_todo in kopf/reactor/handling.py:201. Also this is the one and only mechanism preventing a premature call to a delayed handler.

  • The actual way to ensure an interrupted sleep period is waited for in full is by relying on the update event generated by the handler status patch, which triggers a re-evaluation of its delayed value in the next iteration. This continues until delayed has arrived. The iteration where a handler raised an error will never sleep because of the status patch.
    (This means throwing a kopf.*Error will always immediately cause a kube API roundtrip. I understand now that this needs to be done for durability reasons, to uphold a requested backoff even after operator restarts. This Error > HTTP request relationship makes sense; however, it wasn't obvious to me until now, having read only the documentation.)
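
For illustration (a made-up handler, only to show the two sources of the delay interval):

import kopf

@kopf.on.create("", "v1", "configmaps", backoff=30)
def creating(**_):
    # An arbitrary exception here would be retried after the decorator's backoff=30.
    # A kopf.TemporaryError carries its own delay and is retried after that instead:
    raise kopf.TemporaryError("Try again later.", delay=60)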

If all of the above applies then I believe the actual bug is here:
kopf/storage/states.py:330

where today it says:

max(0, (handler_state.delayed - now).total_seconds()) if handler_state.delayed else 0
...

but should probably say:

[
    max(0, (handler_state.delayed - now).total_seconds())
    for handler_state in self._states.values()
    if handler_state.delayed
    if not handler_state.finished
    if self.purpose is None or handler_state.purpose is None
       or handler_state.purpose == self.purpose
]

this would allow effects.apply(...) to effectively behave the same, except for cases where there is no patch and a "non-delay" (0 or None) is present, and would allow reaching kopf/reactor/effects.py:92, which states # no patch/touch and no delay. The way I see it, as of now there is no actual sleep_or_wait() for the case of a zero delay: all branches leading to sleep_or_wait(...) are guarded by forms of delay > 0. If in your opinion this is a bug, then instead of the above proposed change to kopf/storage/states.py:330, I think two changes to effects.apply(...) are necessary.

  1. Also sleep for zero delays
  2. Guard the touch dummy branch to not run in cases of no patch and no delay.

As for effects.apply(...) I can make it more readable by unwrapping it and using early-returns.


nolar commented Mar 4, 2021

Please give me some time till tomorrow — I need to process this, and it is a bit late already. The ".delayed" fix looks promising at first glance. But the devil is in the edge cases, as usual: I probably meant something with "else 0".

Would it work for “raise TemporaryError(..., delay=0)” with the meaning of “retry asap”?


nolar commented Mar 6, 2021

  • Writing the expression HandlerState.delayed is not None is meant to be equivalent to saying "the handler threw something". Its value is established as a timestamp saying "retry no earlier than this point in time". The actual delay interval always comes from either the decorator's backoff=... or the kopf.*Error(delay=...).

Correct. Almost.

"The handler threw something" would be not HandlerState.finished. The delay is secondary here.

But it is correlated: the delay exists only if the handler threw something, so it can be interpreted your way too: if the delay is not None, the handler definitely threw something.

But not the other way: HandlerState.delayed is None doesn't mean that the handler didn't throw anything — it only means that there is (somehow) no delay requested, while an exception is still possible.

  • The only purpose of the sleeping in effects.apply(...) is to actually back-off before retrying. If new events interrupt the sleeping then that new event will not trigger the delayed handler because of handlers_todo in kopf/reactor/handling.py:201. Also this is the one and only mechanism preventing a premature call to a delayed handler.

Correct.

Just to verify, here is the same with other words:

Since Kubernetes has no such thing as delayed processing or events, the delay is done by the operator at the end of each processing cycle (when and if needed), before the next "delayed" cycle begins. The whole feature of creation/update/deletion handlers is Kopf's construct, which is absent in Kubernetes.

But the sleep is interrupted by any change on the resource ("[event] stream pressure", renamed from "replenished") — in that case, the new events are processed as usual, and the sleep is repeated with a recalculated delay (hence, "unslept time").

  • The actual way to ensure an interrupted sleep period is waited for in full is by relying on the update event generated by the handler status patch, which triggers a re-evaluation of its delayed value in the next iteration. This continues until delayed has arrived. The iteration where a handler raised an error will never sleep because of the status patch.

Correct.

Dummy touch-patches or regular patches are needed to get control back to the operator after the sleep to re-evaluate the handlers and execute those that were delayed.

Here is why it is done so:

Historically, touch-patches were implemented first in the initial prototype, and they slept uninterruptedly and then triggered the guaranteed resource event to get back to the operator with all new fresh state of the resource's body (spec, metadata, status, etc). All changes during the sleep time were ignored until the next awakening time.

Much later, those "stream pressure"/"replenished" events were added, so the sleep became interruptable by new changes.

Now, it might be so that the touch-patches are an unnecessary atavism and the whole cycle of sleeping and re-execution can be done fully in-memory without dummy touch-patches: the resource's freshness is guaranteed since we know when and if the new changes arrive via these interruptions — and if they didn't arrive during the sleep, there were no changes and we are free to go.

But this would be a much bigger refactoring, actually a redesign of the current flow of how Kopf works, for which I am not ready now, so I keep this improvement idea for later. Or, better say, it was estimated as a big redesign when I thought about it when the interruptions were introduced; it might be much easier with all the new changes (e.g. per-resource "memories"), but I didn't re-evaluate the complexity since then.

(This means throwing a kopf.*Error will always immediately cause a kube API roundtrip. I understand now that this needs to be done for durability reasons, to uphold a requested backoff even after operator restarts. This Error > HTTP request relationship makes sense; however, it wasn't obvious to me until now, having read only the documentation.)

Correct. The framework is designed with some resilience in mind: the operator can be SIGKILL'ed at any time (literally at any step or sleep), remain down for hours or days, and must continue from where it has left when finally restarted. Hence, all state is persisted immediately once it becomes known/renewed, with no postponing (typically in annotations, but there are options).

If all of the above applies then I believe the actual bug is here:

So, I have processed those ideas. The delays of [0] seems like a valid value for effects.apply(). It means, re-execute the processing with no delay. This is exactly what should be happening with explicit raise TemporaryError(..., delay=0) (which can be e.g. conditional for the first N retries, then with increased delays).
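
(For illustration only — the handler below is made up, not from this issue; it just shows the pattern of an intentional zero delay for the first retries:)

import kopf

@kopf.on.update("", "v1", "configmaps")
def eager_handler(retry, **_):
    # `retry` is the per-handler retry counter passed in by kopf.
    # Retry immediately for the first few attempts, then back off.
    raise kopf.TemporaryError("Not ready yet.", delay=0 if retry < 3 else 30)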

So, my suspicion on effects.apply() can be dismissed. It does exactly what it was told to do.

The question is how does it come to [0] without explicitly triggering this.

From the "Analysis":

Since that is always None, the delay will always be zero.

It would be 0, but only if the handler state is not handler_state.finished; the finished handlers (succeeded or permanently failed) do not get into .delays. Which means it raised a temporary/arbitrary error. But somehow, the delay/delayed was not set -- I do not get "how".

but should probably say:

    if handler_state.delayed
    if not handler_state.finished

That would exclude handlers that failed and want re-execution from being re-executed.


You know what is interesting… I tried to copy-paste the repro from the "Analysis" above, and it indeed touch-patches both with K8s 1.17 and 1.18 (in K3d/K3s, but it is the same). Thanks for that -- repros are highly useful for magic bugs and are difficult to build.

Then I started to clean things up, and only shifted the handlers to the left -- i.e., from the class to the root level. And the problem was gone. Shifted them back to the class level -- and the problem returned.

The only essential difference was this line (class-level vs. root-level):

[2021-03-06 09:05:59,885] kopf.objects         [DEBUG   ] Patching with: {'metadata': {'annotations': {'anon.operator.ts/KopfRunner.the_handler': '{"started":"2021-03-06T08:05:59.880107","stopped":"2021-03-06T08:05:59.884528","purpose":"update","retries":1,"success":true,"failure":false}'}}}

vs.

[2021-03-06 09:09:35,594] kopf.objects         [DEBUG   ] Patching with: {'metadata': {'annotations': {'anon.operator.ts/the_handler': None, 'kopf.zalando.org/last-handled-configuration': '{"data":{"phase":"Done"},"metadata":{"labels":{"app":"foo-runner"}}}\n'}}}

It succeeds with the KopfRunner.the_handler handler, but never resets the_handler handler.

Then I put id="the_handler" to the resuming handler. And the problem was gone again even at the class-level.

So, I guess, the problem has been introduced by #674, where the deduplication of handlers was changed from fn alone to fn+id (essentially, fn+id+field, since the field is a part of the id).

It means that there is still some bug with handler selection and their interpretation and merging with outcomes. But it is not related to sleeps or delays, but rather to handler ids.

I guess, this case can be reproduced with resume+update handlers with just different ids.

By randomly changing things, I was able to reduce the repro to this only:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    kopf.zalando.org/last-handled-configuration: |
      {"simulated-change":123}
    kopf.zalando.org/bbb: '{"started":"2021-03-01T12:34:03.155586","purpose":"update","retries":0,"success":false,"failure":false}'
  name: bug
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Status storage must not intervene with its data:
    settings.persistence.progress_storage = kopf.AnnotationsProgressStorage()

@kopf.on.resume("configmaps", id="aaa")
@kopf.on.update("configmaps", id="bbb",
    when=lambda **_: False,  # <<< comment it, and the bug is gone. WHY?!
)
def the_handler(**_):
    print("the_handler called")
kubectl delete -f _issue686.yaml
kubectl create -f _issue686.yaml
kopf run _issue686.py --verbose -n default

It also has something to do with the when= filter. An attempt to remove it unexpectedly fixed the issue.


nolar commented Mar 6, 2021

TL;DR: The quickest workaround for you would be to either put id="the_handler" to the resuming decorator or remove id=... from the updating decorator.
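
For example, based on the repro above (with selector as defined there), the first option would be just a changed pair of decorators:

@kopf.on.resume("", "v1", "configmaps", id="the_handler", labels=selector)
@kopf.on.update("", "v1", "configmaps", id="the_handler", labels=selector,
                when=lambda body, **_: body["data"].get("phase") == "Foobar")
def the_handler(**_):
    print("the_handler called")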

Please, keep this issue open. There is still some unknown bug left since such a "touch-hammer" scenario should not have happened in the first place -- even with this setup of handler ids. It is worth fixing it anyway, even later.


nolar commented Mar 6, 2021

A note to self: perhaps there should also be a configurable or hard-coded minimum sleep time (e.g. 10 seconds by default), as there is now a maximum sleep time (WAITING_KEEPALIVE_INTERVAL, 10 minutes) — to not kill or overload the cluster in such cases, regardless of the bug fix, even if the short sleep is induced by the operator (backoff= or delay=; unless explicitly reconfigured to 0 by the operator developers).
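
Roughly, the idea would be something like this (hypothetical — no such setting exists today):

# Clamp any computed sleep delay into a sane range before sleeping/touching:
MIN_SLEEP = 10.0    # the proposed configurable or hard-coded minimum
MAX_SLEEP = 600.0   # analogous to the existing WAITING_KEEPALIVE_INTERVAL (10 minutes)

def clamp_delay(delay: float) -> float:
    return min(MAX_SLEEP, max(MIN_SLEEP, delay))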


paxbit commented Mar 10, 2021

  • The only purpose of the sleeping in effects.apply(...) is to actually back-off before retrying. If new events interrupt the sleeping then that new event will not trigger the delayed handler because of handlers_todo in kopf/reactor/handling.py:201. Also this is the one and only mechanism preventing a premature call to a delayed handler.

Correct.

Just to verify, here is the same with other words:

Since Kubernetes has no such thing as delayed processing or events, the delay is done by the operator at the end of each processing cycle (when and if needed), before the next "delayed" cycle begins. The whole feature of creation/update/deletion handlers is Kopf's construct, which is absent in Kubernetes.

But the sleep is interrupted by any change on the resource ("[event] stream pressure", renamed from "replenished") — in that case, the new events are processed as usual, and the sleep is repeated with a recalculated delay (hence, "unslept time").

Yeah, I think we're on the same page here.

If all of the above applies then I believe the actual bug is here:

So, I have processed those ideas. The delays of [0] seems like a valid value for effects.apply(). It means, re-execute the processing with no delay. This is exactly what should be happening with explicit raise TemporaryError(..., delay=0) (which can be e.g. conditional for the first N retries, then with increased delays).

So, my suspicion on effects.apply() can be dismissed. It does exactly what it was told to do.

The question is how does it come to [0] without explicitly triggering this.

From the "Analysis":

Since that is always None, the delay will always be zero.

It would be 0, but only if the handler state is not handler_state.finished; the finished handlers (succeeded or permanently failed) do not get into .delays. Which means it raised a temporary/arbitrary error. But somehow, the delay/delayed was not set -- I do not get "how".

What I found is that HandlerState.delayed is None in cases where the handler had no outcome because its when= did not match. I guess I meant to say: "Since HandlerState.delayed is always None in cases where when=False, the delay as calculated by State.delays will always be zero, because State.delays implicitly bases its delay calculation on the availability of an outcome.delay via HandlerState.with_outcome(...), which is absent because when=False guarded the handler execution."

You said "the finished handlers (succeeded or permanently failed) do not get into .delays". I think that should include succeeded, permanently failed or guarded by a when=false.

Then I started to cleanup things, and only shifted the handlers to the left -- i.e. from the class to the root level. And the problem was gone. Shifted them back to the class level -- and the problem returned.

The only essential difference was this line (class-level vs. root-level):
It succeeds with the KopfRunner.the_handler handler, but never resets the_handler handler.
...
Then I put id="the_handler" to the resuming handler. And the problem was gone again even at the class-level.

Thanks for figuring that out. I tried it and it does indeed stop the behavior, although I believe it may only mask what I describe above which I still believe is not quite correct.

So, I guess, the problem has been introduced by #674, where the deduplication of handlers was changed from fn alone to fn+id (essentially, fn+id+field, since the field is a part of the id).
...
It also has something to do with the when= filter. An attempt to remove it unexpectedly fixed the issue.
See above. I think a when=False leads to an empty outcome which leads to absent outcome.delay which leads to a None HandlerState.delayed which leads to [0] State.delays for all unfinished handlers.

It's late now... will reply to the remaining topics tomorrow.


nolar commented Mar 11, 2021

Yes, I agree that there is a bug, and it was only hidden before. I am just not sure what the source area of that bug is. It is fuzzy at the moment.

You said "the finished handlers (succeeded or permanently failed) do not get into .delays". I think that should include succeeded, permanently failed or guarded by a when=false.

Yes, correct. Maybe I can extend my statement.

In general, when a resource is filtered out (e.g. by when=...False), the HandlerState is not in the State at all. And it yields no delay.

However, in this case, the handler state is in the State because it is restored from the persisted state in annotations — probably originally there from state = state.with_purpose(cause.reason).with_handlers(cause_handlers) of the first iteration, but restored by states.State.from_storage(...) on the 2nd iteration and later.

The state is not later updated or changed because the handler is not selected due to filters. And, indeed, the delayed field remains None forever.

The handler purposing was added in #606. There is a sentence in that PR (in the very top):

… those handlers should be taken into account when purging the state (#557), but not for .done & .delays — these ones must be calculated only for the handlers currently in scope.

A bell rings in my head on this sentence. From this issue's discussion, it looks like the intention is not followed (i.e., that PR is buggy).

Might it be so that the problem is that the handler_state.delayed of None is at all taken into account when calculating the state.delays? So, if the handler is filtered out, it must not yield 0 for delays at all.

So, we might assume that "purposes" are used improperly and with a too optimistic assumption that if a properly purposed handler's state is/was persisted, it is relevant and needed (in fact, it is not).

If that is true (I would need time to verify the hypothesis), as a chain reaction, .done/.delays should be changed from calculated properties of the state to functions depending on the current set of cause_handlers: state.get_delays(cause_handlers). Maybe the same for state.check_done(cause_handlers), though I am not that sure here. Which, in turn, renders the whole thing with "purposes" useless. Or maybe not useless, because they are used in the log counters (the handlers presumably were executed after all), but not needed for .delays.

So, back to your sentence:

You said "the finished handlers (succeeded or permanently failed) do not get into .delays". I think that should include succeeded, permanently failed or guarded by a when=false.

Yes, this might be one way. However, I am not sure how difficult it would be to select all handlers that are relevant to the current cause's type/reason but without filters — it goes deep to the registries of handlers and to the filtering logic and explodes in complexity for the whole framework (registries and handlers are used in many other places: activities, probes, daemons/times, indices, admission hooks, etc).

Another way to explore is to ignore the deselected handlers for delays/done computation. This seems easier — unless proven that this way is wrong.

What do you think?


paxbit commented Mar 11, 2021

What do you think?

;) Well, thanks for the elaborate reply, for sure. However, as the scope of the issue broadens, I'm beginning to struggle to find time to wrap my head around this codebase, which is still new to me, and its history. I had to stop working on it last night, picked it up this morning, and will continue today for a couple of hours using the information you provided. In case I do not have a eureka moment along the way, I will have to move on with staging our operator, as that is already late. The id assignment to the resume handler sufficiently suppresses the touch "DoS attack" it was previously mounting against the cluster API. Since the rest seems to be working, I'd like to begin feeding actual production workloads to it on integration. Once that is running, I will come back to this issue.


paxbit commented Mar 11, 2021

To summarize so far:

When not de-duplicated,

  • on each iteration only the on.resume handler is selected for cause_handlers; the on.update handler is filtered out in iter_handlers(...) because its when= evaluates to False
  • the first iteration creates a state for the resume handler via from_scratch(...); the handler function is called once and the on.resume handler state is marked finished, while the on.update handler state is left alone
  • every consecutive iteration again picks only the resume handler but never runs it, since handlers_todo is empty: the handler is not .awakened, it is already finished
  • this way there is never an outcome after the first iteration's resume, and delays is miscalculated because one of the two handler states is handler_state.finished==True (on.resume) while neither is delayed
  • there is also no patch because no state (or anything else) changed, and this, in conjunction with the not-None-but-0 delay, is what sends effects.apply(...) into the touching branch (see the sketch after this list)
  • de-duplication is broken because it relies on handler ids, while kopf's decorator API allows passing arbitrary ids for each decorator.
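
For illustration, a minimal standalone sketch of how the pre-fix comprehension with its "... else 0" branch ends up with [0] in exactly this situation (fake state objects, not kopf's classes; purpose filtering omitted for brevity):

import datetime

class FakeHandlerState:
    def __init__(self, finished, delayed):
        self.finished = finished
        self.delayed = delayed

now = datetime.datetime.utcnow()
states = {
    'on-resume': FakeHandlerState(finished=True, delayed=None),   # ran once, marked finished
    'on-update': FakeHandlerState(finished=False, delayed=None),  # filtered out by when=, never runs
}

# The pre-fix comprehension shape with its "... else 0" branch:
delays = [
    max(0, (s.delayed - now).total_seconds()) if s.delayed else 0
    for s in states.values()
    if not s.finished
]
print(delays)  # [0] -> "come back immediately" -> the endless touch-patching loop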

Might the problem be that a handler_state.delayed of None is taken into account at all when calculating state.delays? If so, a handler that is filtered out must not yield a 0 delay at all.

I think so too. This brings me back to my original point of suggesting that this seems off:
if handler_state.delayed else 0
Knowing how delays are treated later on, that statement seems incoherent to me: why does "handler not delayed" mean "delay it by 0" instead of "no delay at all (None)"?

You said above that this would block failed handlers from re-execution.

[
    max(0, (handler_state.delayed - now).total_seconds())
    for handler_state in self._states.values()
    if handler_state.delayed is not None
    if not handler_state.finished
    if self.purpose is None or handler_state.purpose is None
       or handler_state.purpose == self.purpose
]

... but what if HandlerOutcome.delay were defaulted to 0 instead of None? I didn't try it, but shouldn't that allow failed-handler re-execution even if the delay was omitted? Maybe doing it like this would be a band-aid anyway... though one that would improve the semantics.

I had a look and think I now understand this...

If that is true (I would need time to verify the hypothesis), as a chain reaction, .done/.delays should be changed from calculated properties of the state to functions depending on the current set of cause_handlers: state.get_delays(cause_handlers). Maybe the same for state.check_done(cause_handlers), though I am not that sure here.
...
Another way to explore is to ignore the deselected handlers for delays/done computation. This seems easier — unless proven that this way is wrong.
...
What do you think?

I agree. Deselected handlers should not be part of the delay calculation, and I believe a .delays → .get_delays(cause_handlers) refactoring would prevent this kind of failure.

About:

Another way to explore is to ignore the deselected handlers for delays/done computation...

I'm not sure I understand you correctly. Do you mean accessing the registry from within State.delays while leaving it a property instead of making it a getter?

@nolar
Copy link
Owner

nolar commented Apr 3, 2021

I have dug into the difference between [] and [0] for the delays, and investigated the suggested fix:

- max(0, (handler_state.delayed - now).total_seconds()) if handler_state.delayed else 0
- for handler_state in self._states.values()
- if not handler_state.finished
+ max(0, (handler_state.delayed - now).total_seconds())
+ for handler_state in self._states.values()
+ if not handler_state.finished
+ if handler_state.delayed

The conclusions are:

  1. Filtering out cases when handler_state.delayed is None from the delays works for most common cases, but not for all and therefore is not sufficient alone.
  2. Filtering handler states not only by purpose, but also by the current handlers in scope, is a must — regardless of the suggested fix.

Otherwise, in this example below, the operator falls into an infinite "patch-hammer" after 5 seconds — even with the suggested fix. No additional hacking of the initial state is needed, it happens in a simple realistic scenario:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    x: x
    z: z
  annotations:
    kopf.zalando.org/last-handled-configuration: |
      {"simulated-change":123}
  name: bug
import kopf


@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Status storage must not intervene with its data:
    settings.persistence.progress_storage = kopf.AnnotationsProgressStorage()


@kopf.on.update("configmaps", id="zzz", labels={'z': 'z'})
def called_once(**_):
    raise kopf.TemporaryError("boo", delay=5)


@kopf.on.resume("configmaps", id="aaa", labels={'x': 'x'})
def the_handler(patch, **_):
    patch.meta.labels['z'] = 'not-z'  # or any external circumstances changing this

This is caused by [0] in delays, though not by the right-side ... else 0 condition as before the suggested fix, but by the left-side max(0, ...).

It happens this way: Since the handler zzz falls out of the filtering criteria quite soon (either by external circumstances, or by internal ones as in this artificial example), this handler will never be re-executed again after the first failed try. It will remain both delayed and overdue forever, thus producing a delay of 0 (or a negative delay if not capped with max()), thus triggering the processing cycles non-stop.
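
As a tiny numeric illustration of that "overdue forever" case (standalone, not kopf code):

import datetime

# The zzz handler's state stays delayed but overdue forever; max() clamps the
# negative remainder to zero, so the processing cycle is triggered again immediately.
now = datetime.datetime.utcnow()
delayed = now - datetime.timedelta(hours=1)   # the retry moment passed long ago
delay = max(0, (delayed - now).total_seconds())
print(delay)  # 0 -> another processing cycle right away, forever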

In general, having never-finished "owned" handlers (those with the current purpose of handling) is a problem, not the delays per se. Zero-delays are a cascading derivative problem (a.k.a. consequences, a.k.a. a domino-effect).

The proper fix for the root cause would be to stop considering zzz's state from the moment zzz falls out of scope/filters/criteria. I.e., the proper fix would be to use cause_handlers in check_done(...) & get_delays(...) as suggested in the previous comments and in #728. Alternatively, remember the handlers as "active" once they are enforced via state.with_handlers(handlers), and process only them in .done/.delays, as suggested in #731.
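
A rough sketch of that latter direction, with assumed names (the real change in #731 may differ in detail): the ids enforced via with_handlers() are kept as an "active" set, and .done/.delays then consider only those handler states, with the delay arithmetic itself unchanged from the comprehension shown earlier.

import dataclasses
from typing import FrozenSet, Mapping

@dataclasses.dataclass(frozen=True)
class SketchedState:
    handler_states: Mapping                  # handler id -> handler state (with a .finished flag)
    active: FrozenSet[str] = frozenset()     # ids of handlers currently in scope

    def with_handlers(self, handlers) -> "SketchedState":
        # Remember which handlers were selected for the current cause.
        return dataclasses.replace(
            self, active=self.active | {handler.id for handler in handlers},
        )

    @property
    def done(self) -> bool:
        # Only in-scope handler states decide whether the handling is over;
        # stale states of deselected handlers are simply ignored.
        return all(
            state.finished
            for handler_id, state in self.handler_states.items()
            if handler_id in self.active
        )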


With the new overarching fix applied, the original fix becomes unnecessary: the [0] delays will not be a problem anymore since those delays will be for the relevant handlers, so we will have no need to avoid the potentially problematic [0]-delays in the first place.

In other words, it can be applied or it can be skipped — with no difference in behaviour.

I was not able to draft a clean example that would be affected by this additional change. It should be something with 2+ matching handlers and the 1st one producing no patch — which is currently unrealistic (might be realistic if or when handlers have an idle=... option the same as in timers).

However, filtering out the not-delayed states (via if handler_state.delayed) might change some other aspects of operators which people may rely on. The "not finished but not delayed" state can mean "scheduled but not yet started" in multi-handler operators, before the first attempt of that handler. With this weak but outweighing argument in favour of keeping the existing behaviour, let it stay as it is.

The overarching fix is applied in #731.


Side-notes:

  • Experiments with the right-side ... else 0 have shown that this 0 can be changed to anything else (e.g. 1-2 seconds), and that value becomes the latency of scheduled-but-not-yet-started handlers in multi-handler operators. This might be a way to reduce the K8s API load at operator restarts, e.g. in the case of multiple on-resume handlers.

  • Semantically, [0] means "come back as soon as possible (in 0 seconds)", while [] means "do not come back, I have nothing to do anymore".

  • State.done (or State.check_done()) seems unnecessary now; the delays alone can be sufficient to report whether the handling is over. However, for code readability in activities/daemons/timers, it is better to keep it named [check_]done.

@nolar
Copy link
Owner

nolar commented Apr 3, 2021

@paxbit Thank you very much for investigating and debugging this extremely complicated issue! That was really a challenge, and definitely a good bug-hunt.

The fix is released as kopf==1.31rc3.

@retr0h
Copy link

retr0h commented Mar 15, 2022

@paxbit are those functions you have written, e.g. get_pod_condition_or_none_from, custom?
I am looking to write a timer which runs when a given deployment is completed, and was looking to do something similar. This solution looks elegant, but I don't see any code anywhere which references it.

    @kopf.on.timer(
        "",
        "v1",
        "pods",
        interval=cfg.watch_for_unschedulable_pod.interval,
        sharp=cfg.watch_for_unschedulable_pod.sharp,
        labels=cfg.test_pod_selector_label_dict,
        when=lambda namespace, name, status, **_: pods.get_pod_condition_or_none_from( 
            status, pods.PodCondTypeEnum.PodScheduled, pods.PodCondStatusEnum.false
        )
        is not None,
    )
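
For context, get_pod_condition_or_none_from is not part of kopf; it appears to be a custom helper. A hypothetical sketch of what such a helper could look like (all names here, including the enums, are assumptions modelled on the snippet above and on the standard pod status.conditions fields):

import enum
from typing import Mapping, Optional

class PodCondTypeEnum(str, enum.Enum):
    PodScheduled = "PodScheduled"
    Ready = "Ready"

class PodCondStatusEnum(str, enum.Enum):
    true = "True"
    false = "False"
    unknown = "Unknown"

def get_pod_condition_or_none_from(
    status: Mapping,
    cond_type: PodCondTypeEnum,
    cond_status: PodCondStatusEnum,
) -> Optional[Mapping]:
    """Return the matching status.conditions entry, or None if no such condition."""
    for condition in (status or {}).get("conditions", []):
        if condition.get("type") == cond_type.value and condition.get("status") == cond_status.value:
            return condition
    return None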
