
Using cache from informer for operator deleted #150

Merged

Conversation

ppatierno
Contributor

This PR refactors the operand-deleted handling to use the cache from the informer, and adds logging of the deletion state.
It also refactors the methods that get the ConfigMap and Secret related to Kafka.

Added logging on operator deletion status
@ppatierno ppatierno requested a review from rareddy March 4, 2021 21:37
@ppatierno
Contributor Author

@shawkins @rareddy

@rareddy rareddy left a comment

lgtm

if (openShiftClient != null) {
isDeleted = isDeleted && cachedRoute(managedKafka) == null;
}
log.info("Admin Server isDeleted = {}", isDeleted);
Contributor

Do you plan on going back and adjusting the log levels? Seems like the logs will fill up pretty quickly at an info level.

Contributor Author

Yes, it's planned and I would love to use the same logging library across operator and sync. We should make a final decision.

Contributor

Same library and/or same usage of levels? Didn't @kornys have a pr at one point to switch the operator to jboss logging? Any chance we could resurrect that along with a tweak of the logging levels?

Contributor Author

Just wondering why we should use the jboss one. Tbh I have never seen people using it. Just because it's the default for Quarkus?

Contributor

Yes that's the major reason.

I'm not too concerned if we use different logging frameworks - the only issue will be making sure we know which formatting to use.

But we should try to be consistent with the levels.

@shawkins shawkins left a comment

Just wanted to walk through a little bit more what is and is not making API calls. Currently when kafkaInstance.createOrUpdate or delete is called, that will eventually make an API call per resource, correct? I don't see anything in the nested createOrDelete logic that compares the desired state to the known state first.

With this change, when isDeleted is called it won't make API calls - but isDeleted is called with roughly the same frequency as createOrUpdate (less, since it first checks the isDeleted spec flag), correct?
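
For reference, a minimal sketch of the cache-backed lookup, assuming a fabric8 SharedIndexInformer is already running for the resource; the helper mirrors the cachedRoute call in the snippet above, but the class name and exact wiring are assumptions for illustration only.

import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.cache.Cache;
import io.fabric8.openshift.api.model.Route;

// Illustrative only: the lookup is served from the informer's local store,
// so checking for deletion does not hit the API server.
class RouteLookup {
    private final SharedIndexInformer<Route> routeInformer;

    RouteLookup(SharedIndexInformer<Route> routeInformer) {
        this.routeInformer = routeInformer;
    }

    Route cachedRoute(String namespace, String name) {
        // null means the Route is not in the cache, i.e. (as far as the
        // informer knows) it has been deleted
        return routeInformer.getIndexer().getByKey(Cache.namespaceKeyFunc(namespace, name));
    }
}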

@ppatierno
Contributor Author

I had this discussion multiple times with the Java Operator SDK team, and they don't consider it correct to compare old with new, but just to apply the resource as an idempotent operation. This is why I get the current resource every time and build the new one from it, filling/reverting the fields we need. The drawback is that events are raised not only when something changes but also due to the re-sync interval from the informer.
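
A minimal sketch of that apply-as-idempotent pattern, assuming a fabric8 KubernetesClient and using a metrics ConfigMap as a stand-in; the class name and the data being set are placeholders, not the actual operator code.

import java.util.Map;

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

// Illustrative only: fetch the current resource, build the desired state
// from it (so server-populated fields are carried over rather than diffed),
// and apply unconditionally.
class MetricsConfigMapReconciler {
    private final KubernetesClient client;

    MetricsConfigMapReconciler(KubernetesClient client) {
        this.client = client;
    }

    void createOrUpdate(String namespace, String name, Map<String, String> desiredData) {
        ConfigMap current = client.configMaps().inNamespace(namespace).withName(name).get();
        ConfigMapBuilder builder = current != null
                ? new ConfigMapBuilder(current)
                : new ConfigMapBuilder()
                        .withNewMetadata().withNamespace(namespace).withName(name).endMetadata();
        ConfigMap desired = builder.withData(desiredData).build();
        // No comparison against the current state: the apply is idempotent.
        client.configMaps().inNamespace(namespace).createOrReplace(desired);
    }
}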

@ppatierno ppatierno merged commit ec94c8f into bf2fc6cc711aee1a0c2a:main Mar 5, 2021
@ppatierno ppatierno deleted the refactoring-operand-deleted branch March 5, 2021 08:17
@shawkins
Contributor

shawkins commented Mar 5, 2021

but also due to the re-sync interval from the informer

To rephrase a little - either you can monitor the dependent resources and flip them back to the desired state almost immediately, or you could rely on the resync interval of ManagedKafka and just flip them back on the resync.

@rareddy
Contributor

rareddy commented Mar 5, 2021

they don't consider it correct to compare old with new, but just to apply the resource as an idempotent operation

Even if it is an idempotent operation, it is an extra API call and in the end, someone has to do that check. If there is a simple way to avoid it why not?

@ppatierno
Contributor Author

Comparing resources is not simple because Kubernetes adds a lot of fields that will make the comparison fail even if the resources are effectively the same. I raised this point a long time ago. Otherwise we would have to compare, field by field, only what is meaningful to us in order to say whether two resources are different or not.

@ppatierno
Contributor Author

Not sure if we can rely on the resourceVersion of the resource ... I had problems using it in the past.

@shawkins
Contributor

shawkins commented Mar 5, 2021

To weigh in more. There are two concerns here:

  1. knowing when a resource update is actually needed.
  2. how quickly should dependent / managed resources be returned to the "desired state"

On the second one there is some cross-talk with the other PR. I'll assume we'll have that conversation over there.

On the first there are a couple of considerations. For what Paolo is doing, he can't just look at the resourceVersion of dependent resources: the resourceVersion is updated even for status changes. Beyond that, what he really cares about is the comparison to his ideal state - not the before/after of the update event the informer is reporting. As for what should be compared, presumably it should just be the spec, correct? Are you seeing kube or dependent resources add stuff to specs? If yes, that makes this comparison quite difficult. If not, then we have the option of using lombok (for any API we can effectively update), or comparing based upon the JSON/YAML serialization (kube is probably doing something similar once it gets the update request). Ultimately whether we should be doing that depends on the number of update calls being made - if it's not that many, it's not worth worrying about this.
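
For illustration, a hypothetical sketch of the "compare the serialized spec" option, using Jackson to compare only the spec subtrees; this is not code from the operator, and the class and method names are made up.

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper: compares two spec objects by serializing them to
// JSON trees, which ignores field ordering but is still sensitive to the
// order of elements in lists.
final class SpecCompare {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static boolean specEquals(Object desiredSpec, Object actualSpec) {
        return MAPPER.valueToTree(desiredSpec).equals(MAPPER.valueToTree(actualSpec));
    }
}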

@rareddy
Contributor

rareddy commented Mar 5, 2021

Are you seeing kube or dependent resources add stuff to specs?

That should never happen, that is the contract.

then we have the option of using lombok (for any API we can effectively update)

Is there any reason why a lombok-based equality check on the Spec is not preferable?

@shawkins
Contributor

shawkins commented Mar 5, 2021

Is there any reason why a lombok-based equality check on the Spec is not preferable?

There seems to be some bias against lombok, at least from Max. You also need to make sure your objects provide the desired equality - use a Set when possible instead of a List if ordering doesn't matter. Beyond that, this is about dependent resources, so you'd be asking Strimzi to add that.
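
For illustration, a hypothetical spec class showing the lombok-based equality idea; the class and field names are made up, and a Set is used where element ordering does not matter so equals() is not tripped up by reordering.

import java.util.Set;

import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.Setter;

// Hypothetical example, not an actual CRD spec from this project.
@Getter
@Setter
@EqualsAndHashCode
public class ExampleSpec {
    private String version;
    // Set rather than List: ordering is irrelevant for equality here
    private Set<String> enabledListeners;
}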

@ppatierno
Contributor Author

It's not about Strimzi only. Kafka is not the only dependent resource; there are also native Kubernetes resources like ConfigMap(s) for metrics, Secret(s) for certificates, and Deployment(s) for the canary and admin server.
I also read that Max was against lombok.

@shawkins
Contributor

shawkins commented Mar 5, 2021

Yes, I realize that it's other things as well, which is why I brought up comparing the serialized document. But the bigger question is the frequency of these updates - if it's not that much we don't need to worry about this. As far as I know you have a resync interval of a minute - so at worst every minute you are trying to update every resource. On top of this you are updating for every event, including status changes - from what we see, Strimzi does its own heartbeat of status updates at a two-minute interval, and other than that there aren't a lot of rapid status updates. So can we tolerate a set of mostly no-op updates against all of the resources (1000s of ManagedKafkas x resource fanout) every minute? If no, then consider increasing the resync or we'll need to consider adding comparison logic. If yes, then we can drop this for now.

@ppatierno
Contributor Author

Just to add another problem for the comparison logic: the managed-by label is not part of the spec.
It could be removed and we would then lose all the subsequent events; this is just to say that comparing the spec might not be enough.
Anyway, I agree with your points. The way it works now is more defensive, but I would not expect things changing spec or status (other than the Strimzi operator on the Kafka resource).
I think we can live with what we have today and see if we have problems at scale.

@shawkins
Contributor

shawkins commented Mar 5, 2021

@rareddy what's being done here is a defined pattern - see "level based triggering" at https://www.openshift.com/blog/kubernetes-operators-best-practices

@shawkins
Contributor

shawkins commented Mar 5, 2021

@ppatierno @rareddy it's worth revisiting this another time as I didn't completely describe the number of updates performed. There are currently 10 resources created for each ManagedKafka - each of them produces an update event on the resync interval, which in turn causes 10 updates. That's actually 100 updates per ManagedKafka per resync. And since there's been no movement on fabric8io/kubernetes-client#2812 - it's likely double that. Given the potential for 1000s of ManagedKafkas, this does need more consideration.

At a minimum I propose filtering out events from the informer onUpdate where the resourceVersion has not changed - that will filter out the bulk of the resync events you don't care about.
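
A minimal sketch of that filter, assuming a fabric8 ResourceEventHandler; the handle method is a placeholder for whatever the event source does with the event today.

import io.fabric8.kubernetes.api.model.HasMetadata;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;

// Illustrative only: drop onUpdate events where the resourceVersion has not
// changed, which is what the informer replays on every resync.
class ResyncFilteringHandler<T extends HasMetadata> implements ResourceEventHandler<T> {

    @Override
    public void onAdd(T obj) {
        handle(obj);
    }

    @Override
    public void onUpdate(T oldObj, T newObj) {
        if (oldObj.getMetadata().getResourceVersion()
                .equals(newObj.getMetadata().getResourceVersion())) {
            return; // pure resync event, nothing actually changed
        }
        handle(newObj);
    }

    @Override
    public void onDelete(T obj, boolean deletedFinalStateUnknown) {
        handle(obj);
    }

    private void handle(T obj) {
        // placeholder: forward to the operator's event source / reconciler
    }
}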

@rareddy
Contributor

rareddy commented Mar 5, 2021

Also, maybe one can reduce the updates by checking the namespace of the resource against the namespace of the ManagedKafka; if it matches, we can reduce the calls back to 10 events per MK CR. That is still 100 updates, but capped at that number. Otherwise we need to multiply that by the number of MK CRs deployed.

Update: it looks like this filtering is already done during the EventSource conversion based on the Owner Reference, so ignore this.

@shawkins
Contributor

shawkins commented Mar 5, 2021

To clarify, I'm proposing #154.
