[Bug]: Wrong error message on resource conflicts due to same topicName #9270
Comments
@tombentley fyi
Can we sort first by […]?
Discussed on the Community call on 02.11.2023: @fvaleri is already working on it.
I've looked at this and #9303 in more detail. The current algorithm is certainly wrong. The question is really how best to fix it.

Fundamentally the problem revolves around the fact that Kubernetes doesn't provide us with an efficient way to query the KafkaTopics that share a given `topicName`, so the controller maintains its own map for that. That map isn't being maintained correctly. In particular, it is added to only as a side effect of a reconciliation. This means that when the UTO starts up (when there are pre-existing KafkaTopics), it has partial information for a time, but it will still act on this partial information as if it were complete.

I think the algorithm we want would prioritise the existing KafkaTopic which "owns" the topic in Kafka, even if it's not as old as some other KafkaTopic for the same topic. In other words, consistency of ownership is more important than age. It's only when there is no existing owner that we should consider the creation time. And when we do use the creation time it's only really best effort (i.e. we shouldn't assume that the modifications to […]).
Important properties of this algorithm are: […]
One necessary precondition of this algorithm is that we build that map completely before we start acting on it. This is of sufficient complexity that it's probably worth pulling the management of that map out into its own class.

Thoughts, @fvaleri?
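A minimal sketch of that selection rule, assuming a hypothetical `KubeRef` record and a `currentOwner` lookup (the names here are illustrative, not the actual UTO code):

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the operator's reference to a KafkaTopic CR.
record KubeRef(String namespace, String name, Instant creationTime) { }

class OwnerSelection {
    /**
     * Picks the KafkaTopic that should manage the Kafka topic.
     * Consistency of ownership beats age: an existing owner always wins,
     * even if another KafkaTopic for the same topicName is older.
     */
    static KubeRef selectManager(Optional<KubeRef> currentOwner, List<KubeRef> candidates) {
        if (currentOwner.isPresent()) {
            return currentOwner.get();
        }
        // No existing owner: fall back to creation time (best effort only),
        // breaking timestamp ties deterministically by namespace and name.
        return candidates.stream()
                .min(Comparator.comparing(KubeRef::creationTime)
                        .thenComparing(KubeRef::namespace)
                        .thenComparing(KubeRef::name))
                .orElseThrow();
    }
}
```

The tie-breaker on namespace and name only makes the fallback deterministic; it doesn't change which resource wins when a real owner already exists.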
TBH, I don't see any risk of ownership change. The problem is limited to the error message on the failed KafkaTopic CR, which sometimes reports the wrong KafkaTopic name as the one managing the Kafka topic. This is a corner case that can happen only when two KafkaTopics with the same `topicName` and the same `creationTime` are processed in the same batch.

As an example, let's say we have t1 and t2 with the same `topicName` that are created at about the same time (`t1.topicName == t2.topicName`, `t1.creationTime == t2.creationTime`). If the UTO processes them in the same batch, it can happen that t1 is reconciled while t2 fails with an error saying it is already managed by t2, when it should say t1. This is non-deterministic behavior caused by the wrong ordering in `validateSingleManagingResource`.

My last commit in #9303 addresses exactly this corner case by introducing the concept of a processing order. Do you see anything wrong with that logic?
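For readers following along, the gist of such a deterministic processing order (a minimal sketch reusing the hypothetical `KubeRef` type from above, not the actual #9303 patch) is to break `creationTime` ties by name, so the result no longer depends on batch arrival order:

```java
import java.util.Comparator;
import java.util.List;

class ProcessingOrder {
    // creationTime first, then namespace and name as tie-breakers, so two
    // KafkaTopics created in the same instant always compare the same way,
    // whatever order they arrived in within a batch.
    static final Comparator<KubeRef> ORDER =
            Comparator.comparing(KubeRef::creationTime)
                    .thenComparing(KubeRef::namespace)
                    .thenComparing(KubeRef::name);

    // The first resource in this order is the one that should be reported
    // as the manager in the ResourceConflict message of the others.
    static KubeRef first(List<KubeRef> sameTopicName) {
        return sameTopicName.stream().min(ORDER).orElseThrow();
    }
}
```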
Bug Description
This is related to the Unidirectional Topic Operator (UTO).

It may happen that two KafkaTopics with the same `topicName` are created at the exact same `creationTime`. Nevertheless, only the first one is reconciled, while the other reports a ResourceConflict in its status (single-resource principle: 1 KafkaTopic -> 1 topic -> 1 cluster).

The problem is that, when we create the condition message for the failing KafkaTopic's status, we use `creationTime` to determine the name of the reconciled KafkaTopic, so we can end up with the wrong name (i.e. the KubeRef of the failing KafkaTopic itself).
https://github.com/strimzi/strimzi-kafka-operator/blob/main/topic-operator/src/main/java/io/strimzi/operator/topic/v2/BatchingTopicController.java#L209
Example:
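A hedged sketch of the tie (hypothetical values; t1 and t2 stand for two KafkaTopics with the same `spec.topicName` and identical `metadata.creationTimestamp`, reusing the illustrative `KubeRef` record from the comment above):

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

public class SameCreationTime {
    public static void main(String[] args) {
        Instant sameInstant = Instant.parse("2023-11-02T10:00:00Z");
        KubeRef t1 = new KubeRef("ns", "t1", sameInstant);
        KubeRef t2 = new KubeRef("ns", "t2", sameInstant);

        // Comparing by creationTime alone cannot tell t1 and t2 apart:
        // min() keeps whichever element it met first, so the "managed by"
        // name in t2's ResourceConflict message can wrongly be t2 itself.
        KubeRef picked = List.of(t2, t1).stream()
                .min(Comparator.comparing(KubeRef::creationTime))
                .orElseThrow();
        System.out.println("picked: " + picked.name()); // prints t2 here
    }
}
```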
Steps to reproduce
Run the flaky test `TopicControllerIT.shouldFailIfNumPartitionsDivergedWithConfigChange` multiple times.

Expected behavior
No response
Strimzi version
0.38.0
Kubernetes version
1.27.3
Installation method
No response
Infrastructure
No response
Configuration files and logs
No response
Additional context
No response