Endless touch-dummy Annotation Patching On Pod Causes High API Load #686
The touch-dummy patch is used to artificially trigger a change and get back to Kopf's unfinished tasks, but only if there are unfinished tasks (handlers, in this case). Usually (but not always), this happens after sleeping for some time. The default delay is 60s, so this flooding looks unusual. The only unfinished handler I see is mark_containters_ready. Can you please run the operator with debug and verbose logging enabled?
Thank you for your quick response! Enabling debug and verbose generates the following log:
The unfinished handler is basically doing this. It flags the resource owning the pod for which the handler was triggered with a "ready" state. It also deals with the edge case of deleting the pod when, for whatever reason, its parent is missing.

def mark_containters_ready(namespace, name, status, body, meta, **_):
pod = pykube.Pod(k_api, body)
# `ts_job` is a custom resource owning the pod
ts_job = ts_job_or_none_from(pod)
if ts_job is None:
events.push_operator_event(
"Warning",
pod.namespace,
"TsRunnerCreation",
"ParentJobMissing",
f"Pod {pod.namespace}/{pod.name} became ready however its parent TsJob is missing. Deleting orphaned pod",
pod,
)
pod.delete(force=True)
return
if ts_job.obj["status"]["state"] != JobStateEnum.Running.value:
state.update_state_on(
ts_job,
JobStateEnum.Running,
"PodReady",
)
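(For context, a purely hypothetical sketch of how such a handler might be registered — the actual decorators are not shown in this thread; the label selector and decorators below are my assumptions. The relevant point, which the later discussion turns on, is a resume/update pair sharing one function:)

import kopf

# Hypothetical registration, for illustration only -- not the author's real code.
# A @resume/@update pair on one function; how the two registrations are
# de-duplicated (by handler id) is exactly what the discussion below revolves around.
@kopf.on.resume("", "v1", "pods", labels={"app": "ts-runner"})
@kopf.on.update("", "v1", "pods", labels={"app": "ts-runner"})
def mark_containters_ready(namespace, name, status, body, meta, **_):
    ...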
After reconfiguring the progress storage, I'm a bit baffled though. I have been developing this operator for the last couple of weeks, always aware that at some point I would need to deal with this patch warning (#685). I treated it as a nuisance because, within all that time, the patch warning only appeared for completed or terminating pods. And since all the operator handlers were firing as expected, I just went back trying to reproduce this by:
I was not successful though. Even with the status storage in use again, and with the same test lifecycle that was triggering the constant patch condition before, the endless patching did not happen now. So with this in mind I suspect something went "wrong" with this one particular pod, presumably something in the metadata. I will try to reproduce this by manually recreating a completed pod with the same metadata as the one above. This will take a moment though, and I hope this will not turn out to be a Heisenbug.
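(For reference, the progress-storage switch mentioned at the start of this comment is done in the startup handler. A minimal sketch — the annotations-based variant appears verbatim in the repro further down; the StatusProgressStorage name for the status-based variant is my assumption:)

import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Keep handler progress in annotations (as in the repro below):
    settings.persistence.progress_storage = kopf.AnnotationsProgressStorage(
        prefix="anon.operator.ts",
    )
    # ...or keep it in the resource's status instead (assumed class name):
    # settings.persistence.progress_storage = kopf.StatusProgressStorage()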
After some testing I found out the following things. I started by letting a pod run to completion, then applying the annotations below, copied from the problematic pod. Then I started the operator and observed the constant patch attempts again.

metadata:
annotations:
kopf.zalando.org/await_pod_term.status.phase: '{"started":"2021-02-15T22:02:03.350703","stopped":"2021-02-15T22:02:03.353473","purpose":"update","retries":1,"success":true,"failure":false}'
kopf.zalando.org/last-handled-configuration: |
{"spec":{"volumes":[{"name":"test-config","configMap":{"name":"ts-job-1179-imuvqvkxczsjxvtpmau9pw","defaultMode":420}},{"name":"rasbuild","persistentVolumeClaim":{"claimName":"job-sesam-test-1179-rasbuild"}},{"name":"default-token-xlckc","secret":{"secretName":"default-token-xlckc","defaultMode":420}}],"containers":[{"name":"ts-runner","image":"[redacted]/ts_kube_all:dev","command":["/bin/sh"],"args":["-c","python [redacted]"],"env":[{"name":"TS_RABBITMQ_HOST","value":"[redacted]"},{"name":"TS_RABBITMQ_USER","value":"[redacted]"},{"name":"TS_RABBITMQ_PASSWORD","value":"[redacted]"}],"resources":{"limits":{"cpu":"2","memory":"2Gi"},"requests":{"cpu":"100m","memory":"1Gi"}},"volumeMounts":[{"name":"test-config","mountPath":"/opt/ts/jobspec"},{"name":"rasbuild","mountPath":"/mnt/rasimages"},{"name":"default-token-xlckc","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"lifecycle":{"preStop":{"exec":{"command":["echo","'bye'"]}}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always","securityContext":{"allowPrivilegeEscalation":false}},{"name":"ts-agent","image":"[redacted]","ports":[{"name":"agent-endpoint","containerPort":[redacted],"protocol":"TCP"}],"resources":{"requests":{"cpu":"500m","memory":"128Mi"}},"volumeMounts":[{"name":"default-token-xlckc","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"lifecycle":{"preStop":{"exec":{"command":["echo","'bye'"]}}},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"Always","securityContext":{"allowPrivilegeEscalation":false}}],"restartPolicy":"Never","terminationGracePeriodSeconds":30,"dnsPolicy":"ClusterFirst","serviceAccountName":"default","serviceAccount":"default","nodeName":"k-test-n1","securityContext":{},"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"feature.node.kubernetes.io/cpu-cpuid.VMX","operator":"In","values":["true"]}]}]}}},"schedulerName":"default-scheduler","tolerations":[{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":300},{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":300}],"priorityClassName":"ts-default-priority-intel","priority":1,"enableServiceLinks":false},"metadata":{"labels":{"app":"ts-runner","ts.job.name":"sesam-test","ts.job.run-id":"1179"}},"status":{"phase":"Running"}}
kopf.zalando.org/mark_containters_ready: '{"started":"2021-02-15T22:02:03.350719","purpose":"update","retries":0,"success":false,"failure":false}'
kopf.zalando.org/register_pod_handling.__locals__.watch_-P7tEVg: '{"started":"2021-02-16T10:03:13.216603","stopped":"2021-02-16T10:03:13.218999","purpose":"update","retries":1,"success":true,"failure":false}'
kopf.zalando.org/register_pod_handling.__locals__.watch_for_completed_pod: '{"started":"2021-02-16T10:03:13.216603","stopped":"2021-02-16T10:03:13.218999","purpose":"update","retries":1,"success":true,"failure":false}'
kopf.zalando.org/touch-dummy: 2021-02-16T21:23:57.630443

I can make the repeated patch attempts stop by doing either of the following:
Any idea why that is?
I can reproduce the behavior with this:

Resources

touch-hammer.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
annotations:
kopf.zalando.org/last-handled-configuration: |
{"metadata":{"labels":{"app":"ts-runner"}}}
anon.operator.ts/kopf-managed: "yes"
anon.operator.ts/the_handler: '{"started":"2021-03-01T12:34:03.155586","purpose":"update","retries":0,"success":false,"failure":false}'
anon.operator.ts/touch-dummy: 2021-03-01T20:02:40.186348
finalizers:
- kopf.zalando.org/KopfFinalizerMarker
labels:
app: foo-runner
name: bug
namespace: bug-target
data:
phase: "Done"

touch-hammer.py:

import logging
import asyncio
import threading
import contextlib
import kopf
logger = logging.getLogger(__name__)
selector = {"app": "foo-runner"}
class KopfRunner(object):
def __init__(self) -> None:
self.readyEvent = threading.Event()
self.stopFlag = threading.Event()
self.kopfThread = threading.Thread(
name="kopf-main",
target=self.__setup_kopf_event_loop,
kwargs=dict(
stop_flag=self.stopFlag,
ready_flag=self.readyEvent,
),
)
@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
settings.persistence.progress_storage = kopf.AnnotationsProgressStorage(
prefix="anon.operator.ts"
)
@kopf.on.login(retries=1, backoff=3.0)
def login_fn(**kwargs):
return kopf.login_via_pykube(**kwargs)
@kopf.on.resume(
"",
"v1",
"configmaps",
labels=selector,
)
@kopf.on.update(
"",
"v1",
"configmaps",
id="the_handler",
labels=selector,
when=lambda body, **_: body["data"].get("phase") == "Foobar",
)
def the_handler(**_):
print("the_handler called")
@staticmethod
def __setup_kopf_event_loop(
ready_flag: threading.Event, stop_flag: threading.Event
):
kopf_loop = asyncio.new_event_loop()
asyncio.set_event_loop(kopf_loop)
with contextlib.closing(kopf_loop):
kopf.configure(verbose=True, debug=True)
kopf_loop.run_until_complete(
kopf.operator(
namespace="bug-target",
ready_flag=ready_flag,
stop_flag=stop_flag,
)
)
def start(self):
logger.info("Starting kopf...")
self.kopfThread.start()
self.readyEvent.wait()
logger.info("Kopf ready.")
self.stopFlag.wait()
if __name__ == "__main__":
runner = KopfRunner()
runner.start()

Setup And Run
Result

When I do this I get:

kubectl -n bug-target get cm --watch --template='{{.metadata.name}}: {{index .metadata.annotations "anon.operator.ts/touch-dummy"}}{{"\n"}}'
bug: 2021-03-01T21:20:12.765007
bug: 2021-03-01T21:21:10.100777
bug: 2021-03-01T21:21:10.279276
bug: 2021-03-01T21:21:10.460133
bug: 2021-03-01T21:21:10.648754
bug: 2021-03-01T21:21:10.835736
bug: 2021-03-01T21:21:11.033883
bug: 2021-03-01T21:21:11.219593
bug: 2021-03-01T21:21:11.394481
bug: 2021-03-01T21:21:11.602270
bug: 2021-03-01T21:21:11.784301
bug: 2021-03-01T21:21:11.971768
bug: 2021-03-01T21:21:12.143277
bug: 2021-03-01T21:21:12.314016
bug: 2021-03-01T21:21:12.490774
bug: 2021-03-01T21:21:12.686455
bug: 2021-03-01T21:21:12.871613
bug: 2021-03-01T21:21:13.048998

... touch dummies in quick succession.

Background

When running the operator, after some time it ends up leaving pods in a completed state without calling the appropriate handler. I'd appreciate your feedback on this.
Analysis

Here's what I was able to find out about the behavior while debugging kopf using the above example.
The resulting zeroed delay then causes an immediate touch-dummy patch.
At this point the circle closes, as the touch triggers an UPDATE event, which causes the same chain of events over and over again. @nolar
Hi @nolar, it would be awesome if you could drop a quick note saying whether you acknowledge this as a bug, or have had no time to look into it so far but may do so in the future, or think I'm mistaken or misapplied some of kopf's concepts. I'm happy with any of those, or whatever else it might be. I just want to avoid working on a PR and then, 80% into completing it, reading from you that you also implemented a change, or that any of my previous assumptions were wrong, or that maybe I should RTFM (I did, but maybe my glasses were dirty o_0).
Hello @paxbit. Sorry for the long silence. Yes, I saw the messages here but didn't have time yet to dive deep and understand what is happening and why. Regarding your last comment — this is an excellent explanation of what is happening (or might be happening). I usually have trouble with those "delays" myself (see below). The rationale behind treating 0 as a real sleep is that 0 is not far away from e.g. 0.1 (or -0.1, i.e. an overdue sleep-and-touch), which would cause the same API flooding problems. If the delays are to be prevented, they should all be None. What's confusing me is that there is an outcome for a handler with a mismatching filter. I can neither confirm nor deny that your suspicion is correct here — I need to experiment with that case myself (maybe next weekend). But I would blindly(*) put my suspicion elsewhere. (*) blindly — as in "gut-feeling-based".
@paxbit Regarding the 80% — I am now at 0% of this. So, if you have a fix, I would be happy not to dive deep into this bug, and to continue with another task (admission hooks). Let's just align on what exactly is happening and what the proposed fix is in general (verbally) — it might affect other aspects unintentionally.
Hi @nolar, thanks for the reply! I understand a tight time budget very well ;)
I probably could have written it more clearly. I actually did find what you describe: there is no outcome, and, yes, I do not see an attempt to call the handler to get one. So this part seems correct to me.
I agree the complexity of those ifs could be lower. I'm not sure I understand what you meant by saying "not aligned with …", though.
Today I turned to the kopf code base and thought about a fix (see below), but could you please verify the following assumptions of mine:
If all of the above applies, then I believe the actual bug is here, where today it says:

max(0, (handler_state.delayed - now).total_seconds()) if handler_state.delayed else 0
... but should probably say: [
max(0, (handler_state.delayed - now).total_seconds())
for handler_state in self._states.values()
if handler_state.delayed
if not handler_state.finished
if self.purpose is None or handler_state.purpose is None
or handler_state.purpose == self.purpose
]

This would allow the delays to be computed only from unfinished handler states whose purpose matches the current cause.
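(For context, a sketch of how the surrounding delays property might look with this filter in place. The property name, self._states, and the purpose attributes are taken from the snippets quoted in this thread; the scaffolding around them is my reconstruction, not Kopf's verbatim code:)

import dataclasses
import datetime
from typing import Collection, Dict, Optional

@dataclasses.dataclass
class HandlerStateSketch:
    delayed: Optional[datetime.datetime] = None  # when the handler asked to be retried
    finished: bool = False                       # succeeded or permanently failed
    purpose: Optional[str] = None                # e.g. "create" / "update" / "resume"

@dataclasses.dataclass
class StateSketch:
    _states: Dict[str, HandlerStateSketch]       # handler id -> persisted handler state
    purpose: Optional[str] = None                # purpose of the current processing cycle

    @property
    def delays(self) -> Collection[float]:
        # Only unfinished handlers that are still relevant to the current purpose
        # contribute a wake-up delay; finished or foreign-purpose states are ignored.
        now = datetime.datetime.utcnow()
        return [
            max(0, (handler_state.delayed - now).total_seconds())
            for handler_state in self._states.values()
            if handler_state.delayed
            if not handler_state.finished
            if self.purpose is None or handler_state.purpose is None
            or handler_state.purpose == self.purpose
        ]

(As the follow-up comments show, this alone does not settle every edge case — e.g. raise kopf.TemporaryError(..., delay=0) still has to mean "retry as soon as possible".)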
Please give me some time till tomorrow — I need to process this, and it is a bit late already. The ".delayed" fix looks promising at first glance. But the devil is in the edge cases, as usual: I probably meant something with "else 0". Would it work for "raise TemporaryError(..., delay=0)" with the meaning of "retry asap"?
Correct. Almost. But it is correlated: the delay exists only if the handler threw something, so it can be interpreted your way too: if the delay is not None, the handler definitely threw something — but not the other way around.
Correct. Just to verify, here is the same in other words: since Kubernetes has no such thing as delayed processing or delayed events, the delaying is done by the operator at the end of each processing cycle (when and if needed), before the next "delayed" cycle begins. The whole feature of creation/update/deletion handlers is Kopf's construct, which is absent in Kubernetes. But the sleep is interrupted by any change on the resource ("[event] stream pressure", renamed from "replenished") — in that case, the new events are processed as usual, and the sleep is repeated with a recalculated delay (hence, "unslept time").
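(A toy illustration of that interruptible sleep — my own sketch, not Kopf's implementation: sleep up to the computed delay, but wake early when a new watch event arrives, and report the unslept remainder so the cycle can be repeated:)

import asyncio
from typing import Optional

async def sleep_or_wake(delay: float, stream_pressure: asyncio.Event) -> Optional[float]:
    # Sleep for `delay` seconds, but wake up early if a new watch event arrives.
    # Returns None if the delay was fully slept (time to touch-patch), or the
    # unslept remainder if the sleep was interrupted by a new event.
    loop = asyncio.get_running_loop()
    started = loop.time()
    try:
        await asyncio.wait_for(stream_pressure.wait(), timeout=delay)
    except asyncio.TimeoutError:
        return None
    return max(0.0, delay - (loop.time() - started))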
Correct. Dummy touch-patches or regular patches are needed to get control back to the operator after the sleep to re-evaluate the handlers and execute those that were delayed. Here is why it is done so: Historically, touch-patches were implemented first in the initial prototype, and they slept uninterruptedly and then triggered the guaranteed resource event to get back to the operator with all new fresh state of the resource's body (spec, metadata, status, etc). All changes during the sleep time were ignored until the next awakening time. Much later, those "stream pressure"/"replenished" events were added, so the sleep became interruptable by new changes. Now, it might be so that the touch-patches are an unnecessary atavism and the whole cycle of sleeping and re-execution can be done fully in-memory without dummy touch-patches: the resource's freshness is guaranteed since we know when and if the new changes arrive via these interruptions — and if they didn't arrive during the sleep, there were no changes and we are free to go. But this would be a much bigger refactoring, actually a redesign of the current flow of how Kopf works, for which I am not ready now, so I keep this improvement idea for later. Or, better say, it was estimated as a big redesign when I thought about it when the interruptions were introduced; it might be much easier with all the new changes (e.g. per-resource "memories"), but I didn't re-evaluate the complexity since then.
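(The "touch" itself is essentially a tiny patch that bumps a dummy annotation so that the API server emits a fresh MODIFIED event back to the operator. Roughly — the annotation key is the one visible in the pod metadata earlier in this thread; the exact payload shape is my assumption:)

import datetime

# Roughly what a touch-patch amounts to: bump a dummy annotation with the current
# timestamp; the resulting MODIFIED event brings control back to the operator.
touch_patch = {
    "metadata": {
        "annotations": {
            "kopf.zalando.org/touch-dummy": datetime.datetime.utcnow().isoformat(),
        },
    },
}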
Correct. The framework is designed with some resilience in mind: the operator can be SIGKILL'ed at any time (literally at any step or sleep), remain down for hours or days, and must continue from where it left off when finally restarted. Hence, all state is persisted immediately once it becomes known or renewed, with no postponing (typically in annotations, but there are options).
So, I have processed those ideas. The question is how it comes to this situation in the first place. From the "Analysis":
It would be problematic, though:
That would exclude handlers that failed and want re-execution from being re-executed.

You know what is interesting… I tried to copy-paste the repro from the "Analysis" above, and it indeed touch-patches both with K8s 1.17 and 1.18 (in K3d/K3s, but it is the same). Thanks for that -- repros are highly useful for magic bugs and are difficult to build. Then I started to clean things up, and only shifted the handlers to the left -- i.e. from the class to the root level. And the problem was gone. Shifted them back to the class level -- and the problem returned. The only essential difference was this line (class-level vs. root-level):
vs.
So, I guess, the problem was introduced by #674, where the deduplication of handlers was changed. It means that there is still some bug with handler selection and their interpretation and merging with outcomes. But it is not related to sleeps or delays, rather to handler ids. I guess this case can be reproduced with resume+update handlers with just different ids. By randomly changing things, I was able to reduce the repro to this only:

apiVersion: v1
kind: ConfigMap
metadata:
annotations:
kopf.zalando.org/last-handled-configuration: |
{"simulated-change":123}
kopf.zalando.org/bbb: '{"started":"2021-03-01T12:34:03.155586","purpose":"update","retries":0,"success":false,"failure":false}'
name: bug

import kopf
@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
# Status storage must not intervene with its data:
settings.persistence.progress_storage = kopf.AnnotationsProgressStorage()
@kopf.on.resume("configmaps", id="aaa")
@kopf.on.update("configmaps", id="bbb",
when=lambda **_: False, # <<< comment it, and the bug is gone. WHY?!
)
def the_handler(**_):
print("the_handler called")
It also has something to do with the when= filter (see the comment in the repro above).
TL;DR: The quickest workaround for you would be to assign an explicit id to the resume handler. Please keep this issue open, though. There is still some unknown bug left, since such a "touch-hammer" scenario should not have happened in the first place -- even with this setup of handler ids. It is worth fixing anyway, even if later.
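(A sketch of that id-based workaround as I read the follow-up comments. Whether the resume handler should reuse the update handler's id, as below, or just get any explicit id is my assumption — the thread only confirms that assigning an id to the resume handler stops the patching:)

import kopf

# Workaround sketch (one reading of the discussion, not a verbatim recommendation):
# an explicit id on the @resume handler, so the resume/update pair does not leave
# a stray, never-finished per-id state behind.
@kopf.on.resume("", "v1", "configmaps", id="the_handler")
@kopf.on.update("", "v1", "configmaps", id="the_handler")
def the_handler(**_):
    ...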
A note to self: perhaps there should also be a configurable or hard-coded minimum sleep time (e.g. 10 seconds by default), as there is now a maximum sleep time.
Yeah, I think we're on the same page here.
What I found is that … You said: "the finished handlers (succeeded or permanently failed) do not get into …".
Thanks for figuring that out. I tried it, and it does indeed stop the behavior, although I believe it may only mask what I describe above, which I still think is not quite correct.
It's late now... will reply to the remaining topics tomorrow.
Yes, I agree that there is a bug, and it was only hidden before. I am just not sure what the source area of that bug is. It is fuzzy at the moment.
Yes, correct. Maybe I can extend my statement. In general, when a resource is filtered out (e.g. by label or when= filters), the handler is simply not selected and produces no outcome. However, in this case, the handler state is in the State anyway, because it is restored from the persisted state in the annotations — probably originally put there by an earlier processing cycle. The state is not later updated or changed because the handler is not selected due to the filters. The handler purposing was added in #606. There is a sentence in that PR (at the very top):
A bell rings in my head on this sentence. From this issue's discussion, it looks like the intention is not followed (i.e., that PR is buggy). Might it be that the problem is exactly that? So, we might assume that "purposes" are used improperly, with the too-optimistic assumption that if a properly purposed handler's state is/was persisted, it is relevant and needed (in fact, it is not). If that is true (I would need time to verify the hypothesis), the rest follows as a chain reaction. So, back to your sentence:
Yes, this might be one way. However, I am not sure how difficult it would be to select all handlers that are relevant to the current cause's type/reason but without filters — it goes deep into the registries of handlers and into the filtering logic, and explodes in complexity for the whole framework (registries and handlers are used in many other places: activities, probes, daemons/timers, indices, admission hooks, etc.). Another way to explore is to ignore the deselected handlers for the delays/done computation. This seems easier — unless proven that this way is wrong. What do you think?
;) Well, thanks for the elaborate reply, for sure. However, as the scope of the issue broadens, I'm beginning to struggle to find time to wrap my head around this codebase, which is still new to me, and its history. I had to stop working on it last night, picked it up this morning, and will continue today for a couple of hours using the information you provided. In case I do not have a eureka moment on the way, I will have to move on with staging our operator, as that is already late. The id assignment to the resume handler sufficiently suppresses the touch "DoS attack" it was previously mounting against the cluster API. Since the rest seems to be working, I'd like to begin feeding actual production workloads to it on integration. Once that is running I will come back to this issue.
To summarize so far: when not de-duplicated, the resume and update handlers end up with separate per-id states.
I think so too. This brings me back to my original point, suggesting that this seems off. You said above that this would block failed handlers from re-execution:

[
max(0, (handler_state.delayed - now).total_seconds())
for handler_state in self._states.values()
if handler_state.delayed is not None
if not handler_state.finished
if self.purpose is None or handler_state.purpose is None
or handler_state.purpose == self.purpose
]

… but what if …? I had a look and think I now understand this…
I agree. Deselected handlers should not be part of the delay calculation. About:
I'm not sure I understand you correctly. Do you mean by accessing the registry from within …?
I have dug into the difference of:

- max(0, (handler_state.delayed - now).total_seconds()) if handler_state.delayed else 0
- for handler_state in self._states.values()
- if not handler_state.finished
+ max(0, (handler_state.delayed - now).total_seconds())
+ for handler_state in self._states.values()
+ if not handler_state.finished
+ if handler_state.delayed

The conclusions are:
Otherwise, in the example below, the operator falls into an infinite "patch-hammer" after 5 seconds — even with the suggested fix. No additional hacking of the initial state is needed; it happens in a simple, realistic scenario:

apiVersion: v1
kind: ConfigMap
metadata:
labels:
x: x
z: z
annotations:
kopf.zalando.org/last-handled-configuration: |
{"simulated-change":123}
name: bug

import kopf
@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
# Status storage must not intervene with its data:
settings.persistence.progress_storage = kopf.AnnotationsProgressStorage()
@kopf.on.update("configmaps", id="zzz", labels={'z': 'z'})
def called_once(**_):
raise kopf.TemporaryError("boo", delay=5)
@kopf.on.resume("configmaps", id="aaa", labels={'x': 'x'})
def the_handler(patch, **_):
    patch.meta.labels['z'] = 'not-z'  # or any external circumstances changing this

In general, having never-finished "owned" handlers (those with the current purpose of handling) is the problem, not the delays per se. Zero-delays are a cascading derivative problem (a.k.a. consequences, a.k.a. a domino effect). The proper fix for the root cause would be to not take such handlers into account at all. With the new overarching fix applied, the original ".delayed" fix becomes unnecessary. In other words, it can be applied or it can be skipped — with no difference in behaviour. I was not able to draft a clean example that would be affected by this additional change. It should be something with 2+ matching handlers and the 1st one producing no patch — which is currently unrealistic. The overarching fix is applied in #731. Side-notes:
@paxbit Thank you very much for investigating and debugging this extremely complicated issue! That was really a challenge, and definitely a good bug-hunt. The fix is released.
@paxbit Are those functions you have written, e.g.:

@kopf.on.timer(
"",
"v1",
"pods",
interval=cfg.watch_for_unschedulable_pod.interval,
sharp=cfg.watch_for_unschedulable_pod.sharp,
labels=cfg.test_pod_selector_label_dict,
when=lambda namespace, name, status, **_: pods.get_pod_condition_or_none_from(
status, pods.PodCondTypeEnum.PodScheduled, pods.PodCondStatusEnum.false
)
is not None,
)
Long story short
While looking into #685 I noticed weird patching behavior from kopf regarding the touch-dummy annotation. As soon as a pod reaches the completed state, kopf constantly patches the touch-dummy annotation on the pod, creating significant load on the API server.

Description
The operator has a handler watching for completed pods. It is decorated like so:
There is also a 10-second interval timer handling unschedulable pods.
I have the following pod sitting in the completed state for ~23h now. (Showing only the metadata, as that should be the relevant part, I guess.)
As soon as kopf starts, the touch-dummy updates begin on the completed pod, multiple times a second.

Do you have any idea why that is happening? When debugging this, it looked like every MODIFIED event on the pod triggers the patching. The patching, however, creates a new MODIFIED event, triggering another patch > a new event > rinse > repeat.

Environment
Python packages installed