
Using cache from informer for operator deleted #150

Merged

Conversation

ppatierno
Contributor

This PR refactors the operand-deleted handling to use the cache from the informer, and adds logging of the deletion state.
It also refactors the methods that get the ConfigMap and Secret related to Kafka.

Added logging on operator deletion status
@ppatierno ppatierno requested a review from rareddy March 4, 2021 21:37
@ppatierno
Contributor Author

@shawkins @rareddy

@rareddy rareddy left a comment

lgtm

if (openShiftClient != null) {
isDeleted = isDeleted && cachedRoute(managedKafka) == null;
}
log.info("Admin Server isDeleted = {}", isDeleted);
Contributor

Do you plan on going back and adjusting the log levels? Seems like the logs will fill up pretty quickly at an info level.

Contributor Author

Yes, it's planned and I would love to use the same logging library across operator and sync. We should make a final decision.

Contributor

Same library and/or same usage of levels? Didn't @kornys have a pr at one point to switch the operator to jboss logging? Any chance we could resurrect that along with a tweak of the logging levels?

Contributor Author

Just wondering why we should use the jboss one. Tbh I have never seen people using it. Just because it's the default for Quarkus?

Contributor

Yes that's the major reason.

I'm not too concerned if we use different logging frameworks - the only issue will be making sure we know which formatting to use.

But we should try to be consistent with the levels.

@shawkins shawkins left a comment

Just wanted to walk through a little bit more what is and is not making API calls. Currently when kafkaInstance.createOrUpdate or delete is called, that will eventually make an API call per resource, correct? I don't see anything in the nested createOrDelete logic that compares the desired state to the known state first.

With this change, when isDeleted is called it won't make API calls - but isDeleted is called with roughly the same frequency as createOrUpdate (less, since it first checks the isDeleted spec flag), correct?
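
For reference, a minimal sketch of the cache-backed lookup, assuming a fabric8 SharedIndexInformer is already running for the resource; the helper mirrors the cachedRoute call in the snippet above, but the class name and exact wiring are assumptions for illustration only.

import io.fabric8.kubernetes.client.informers.SharedIndexInformer;
import io.fabric8.kubernetes.client.informers.cache.Cache;
import io.fabric8.openshift.api.model.Route;

// Illustrative only: the lookup is served from the informer's local store,
// so checking for deletion does not hit the API server.
class RouteLookup {
    private final SharedIndexInformer<Route> routeInformer;

    RouteLookup(SharedIndexInformer<Route> routeInformer) {
        this.routeInformer = routeInformer;
    }

    Route cachedRoute(String namespace, String name) {
        // null means the Route is not in the cache, i.e. (as far as the
        // informer knows) it has been deleted
        return routeInformer.getIndexer().getByKey(Cache.namespaceKeyFunc(namespace, name));
    }
}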

@ppatierno
Contributor Author

I had this discussion multiple times with the Java Operator SDK team, and they don't consider it correct to compare old with new, but just to apply the resource as an idempotent operation. This is why I get the current resource every time and build the new one from it, filling/reverting the fields we need. The drawback is that events are raised not only when something changes but also due to the re-sync interval from the informer.
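
A minimal sketch of that apply-as-idempotent pattern, assuming a fabric8 KubernetesClient and using a metrics ConfigMap as a stand-in; the class name and the data being set are placeholders, not the actual operator code.

import java.util.Map;

import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

// Illustrative only: fetch the current resource, build the desired state
// from it (so server-populated fields are carried over rather than diffed),
// and apply unconditionally.
class MetricsConfigMapReconciler {
    private final KubernetesClient client;

    MetricsConfigMapReconciler(KubernetesClient client) {
        this.client = client;
    }

    void createOrUpdate(String namespace, String name, Map<String, String> desiredData) {
        ConfigMap current = client.configMaps().inNamespace(namespace).withName(name).get();
        ConfigMapBuilder builder = current != null
                ? new ConfigMapBuilder(current)
                : new ConfigMapBuilder()
                        .withNewMetadata().withNamespace(namespace).withName(name).endMetadata();
        ConfigMap desired = builder.withData(desiredData).build();
        // No comparison against the current state: the apply is idempotent.
        client.configMaps().inNamespace(namespace).createOrReplace(desired);
    }
}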

@ppatierno ppatierno merged commit ec94c8f into bf2fc6cc711aee1a0c2a:main Mar 5, 2021
@ppatierno ppatierno deleted the refactoring-operand-deleted branch March 5, 2021 08:17
@shawkins
Contributor

shawkins commented Mar 5, 2021

but also due to the re-sync interval from the informer

To rephrase a little - either you can monitor the dependent resources and flip them back to the desired state almost immediately, or you could rely on the resync interval of ManagedKafka and just flip them back on the resync.

@rareddy
Contributor

rareddy commented Mar 5, 2021

they don't consider it correct to compare old with new, but just to apply the resource as an idempotent operation

Even if it is an idempotent operation, it is an extra API call and in the end, someone has to do that check. If there is a simple way to avoid it why not?

@ppatierno
Contributor Author

Comparing resources is not simple because Kubernetes adds a lot of fields that will make the comparison fail even if the resources are effectively the same. I raised this point a long time ago. Otherwise we would have to compare, field by field, only what is meaningful to us in order to say whether two resources are different or not.

@ppatierno
Contributor Author

Not sure if we can rely on the resourceVersion of the resource ... I had problems using it in the past.

@shawkins
Contributor

shawkins commented Mar 5, 2021

To weigh in more. There are two concerns here:

  1. knowing when a resource update is actually needed.
  2. how quickly should dependent / managed resources be returned to the "desired state"

On the second one there is some cross-talk with the other PR. I'll assume we'll have that conversation over there.

On the first there are a couple of considerations. For what Paolo is doing, he can't just look at the resourceVersion of dependent resources: the resourceVersion is updated even for status changes. Beyond that, what he really cares about is the comparison to his ideal state - not the before/after of the update event the informer is reporting. As for what should be compared, presumably it should just be the spec, correct? Are you seeing kube or dependent resources add stuff to specs? If yes, that makes this comparison quite difficult. If not, then we have the option of using lombok (for any API we can effectively update), or comparing based upon the JSON/YAML serialization (kube is probably doing something similar once it gets the update request). Ultimately whether we should be doing that depends on the number of update calls being made - if it's not that many, it's not worth worrying about this.
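
For illustration, a hypothetical sketch of the "compare the serialized spec" option, using Jackson to compare only the spec subtrees; this is not code from the operator, and the class and method names are made up.

import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical helper: compares two spec objects by serializing them to
// JSON trees, which ignores field ordering but is still sensitive to the
// order of elements in lists.
final class SpecCompare {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    static boolean specEquals(Object desiredSpec, Object actualSpec) {
        return MAPPER.valueToTree(desiredSpec).equals(MAPPER.valueToTree(actualSpec));
    }
}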

@rareddy
Contributor

rareddy commented Mar 5, 2021

Are you seeing kube or dependent resources add stuff to specs?

That should never happen, that is the contract.

then we have the option of using lombok (for any API we can effectively update)

Is there any reason why a lombok-based equality check on the Spec is not preferable?

@shawkins
Contributor

shawkins commented Mar 5, 2021

Is there any reason why a lombok-based equality check on the Spec is not preferable?

There seems to be some bias against lombok, at least from Max. You also need to make sure your objects provide the desired equality - use a Set when possible instead of a List if ordering doesn't matter. Beyond that, this is about dependent resources, so you'd be asking Strimzi to add that.
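
For illustration, a hypothetical spec class showing the lombok-based equality idea; the class and field names are made up, and a Set is used where element ordering does not matter so equals() is not tripped up by reordering.

import java.util.Set;

import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.Setter;

// Hypothetical example, not an actual CRD spec from this project.
@Getter
@Setter
@EqualsAndHashCode
public class ExampleSpec {
    private String version;
    // Set rather than List: ordering is irrelevant for equality here
    private Set<String> enabledListeners;
}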

@ppatierno
Contributor Author

It's not about Strimzi only. Kafka is not the only dependent resource; there are also native Kubernetes resources like ConfigMap(s) for metrics, Secret(s) for certificates, and Deployment(s) for the canary and admin server.
I also read that Max was against lombok.

@shawkins
Contributor

shawkins commented Mar 5, 2021

Yes, I realize that it's other things as well, which is why I brought up comparing the serialized document. But the bigger question is the frequency of these updates - if it's not that much we don't need to worry about this. As far as I know you have a resync interval of a minute - so at worst every minute you are trying to update every resource. On top of this you are updating for every event, including status changes - from what we see, Strimzi does its own heartbeat of status updates at a two-minute interval, and other than that there aren't a lot of rapid status updates. So can we tolerate a set of mostly no-op updates against all of the resources (1000s of ManagedKafkas x resource fanout) every minute? If no, then consider increasing the resync or we'll need to consider adding comparison logic. If yes, then we can drop this for now.

@ppatierno
Contributor Author

Just to add another problem for the comparison logic: the managed-by label is not part of the spec.
It could be removed and we would then lose all the subsequent events; this is just to say that comparing the spec might not be enough.
Anyway, I agree with your points. The way it works now is more defensive, but I would not expect things changing spec or status (other than the Strimzi operator on the Kafka resource).
I think we can live with what we have today and see if we have problems at scale.

@shawkins
Contributor

shawkins commented Mar 5, 2021

@rareddy what's being done here is a defined pattern - see "level based triggering" at https://www.openshift.com/blog/kubernetes-operators-best-practices

@shawkins
Contributor

shawkins commented Mar 5, 2021

@ppatierno @rareddy it's worth revisiting this another time as I didn't completely describe the number of updates performed. There are currently 10 resources created for each ManagedKafka - each of them produces an update event on the resync interval, which in turn causes 10 updates. That's actually 100 updates per ManagedKafka per resync. And since there's been no movement on fabric8io/kubernetes-client#2812 - it's likely double that. Given the potential for 1000s of ManagedKafkas, this does need more consideration.

At a minimum I propose filtering out events from the informer onUpdate where the resourceVersion has not changed - that will filter out the bulk of the resync events you don't care about.
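
A minimal sketch of that filter, assuming a fabric8 ResourceEventHandler; the handle method is a placeholder for whatever the event source does with the event today.

import io.fabric8.kubernetes.api.model.HasMetadata;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;

// Illustrative only: drop onUpdate events where the resourceVersion has not
// changed, which is what the informer replays on every resync.
class ResyncFilteringHandler<T extends HasMetadata> implements ResourceEventHandler<T> {

    @Override
    public void onAdd(T obj) {
        handle(obj);
    }

    @Override
    public void onUpdate(T oldObj, T newObj) {
        if (oldObj.getMetadata().getResourceVersion()
                .equals(newObj.getMetadata().getResourceVersion())) {
            return; // pure resync event, nothing actually changed
        }
        handle(newObj);
    }

    @Override
    public void onDelete(T obj, boolean deletedFinalStateUnknown) {
        handle(obj);
    }

    private void handle(T obj) {
        // placeholder: forward to the operator's event source / reconciler
    }
}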

@rareddy
Contributor

rareddy commented Mar 5, 2021

Also, maybe one can reduce the updates by checking the namespace of the resource against the namespace of the ManagedKafka; if it matches, we can reduce the calls back to 10 events per MK CR. That is still 100 updates, but capped at that number. Otherwise we need to multiply that by the number of MK CRs deployed.

Update: it looks like this filtering is already done during the EventSource conversion based on the Owner Reference, so ignore this.

@shawkins
Contributor

shawkins commented Mar 5, 2021

To clarify, I'm proposing #154.
