[UTO] Fix wrong error message on resource conflicts #9303
Conversation
```diff
 KubeRef(KafkaTopic kt) {
-    this(kt.getMetadata().getNamespace(), kt.getMetadata().getName(), StatusUtils.isoUtcDatetime(kt.getMetadata().getCreationTimestamp()).toEpochMilli());
+    this(kt.getMetadata().getNamespace(), kt.getMetadata().getName(), System.nanoTime());
```
I don't see how this works. The old logic was to use the creation time of the `KafkaTopic`, which meant that comparing any two KTs would always result in the same ordering, even between two successive UTO processes (even if, annoyingly, sometimes two KTs would compare as equal).

Substituting `System.nanoTime` means that how two `KafkaTopics` compare now depends on encounter order. That:

- Might not be stable between two UTO processes.
- Still doesn't guarantee that no two KTs are equal, because `System.nanoTime` doesn't provide any guarantee about when that clock ticks, only that we can read that clock to nanosecond precision (see the sketch after this list).
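For illustration only — a minimal sketch (not part of the PR) of why `System.nanoTime()` readings are neither unique nor comparable across processes:

```java
// Minimal sketch, not from the PR: System.nanoTime() has an arbitrary origin per
// JVM and only promises nanosecond precision when read, not that the underlying
// clock ticks between two reads.
public class NanoTimeOrderingSketch {
    public static void main(String[] args) {
        long first = System.nanoTime();
        long second = System.nanoTime();

        // On a coarse clock the two reads can be identical, so two KafkaTopics
        // stamped this way could still compare as equal.
        System.out.println("reads can collide: " + (first == second));

        // The value is only meaningful within this JVM instance; a value taken
        // by one UTO process cannot be compared with one taken after a restart.
        System.out.println("raw reading (arbitrary origin): " + first);
    }
}
```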
Thinking again about how the `ResourceConflict` can have the wrong resource name, I'm unconvinced that there's anything wrong with the comparison code in `validateSingleManagingResource`:

1. `KafkaTopics` A and B are created within the same second (as measured on the apiserver).
2. UTO reconciles A and decides it's `Ready`.
3. UTO reconciles B.

It shouldn't matter whether A and B get reconciled in the same batch or not.

- If they're in different batches, then in step 3 when UTO compares A and B and sees them as equal and their `creationTime` as being equal, we will satisfy the condition in `validateSingleManagingResource` which is looking at whether one of the KTs is already `Ready`. Topic A has already been reconciled so it is `Ready`, so it's B that will be seen as the conflicting one. And this behaviour should be stable over time and between UTO processes.
- If they're in the same batch, then the logic should still work because we should set the status on A before we validate B.
The question is why the existing logic in `validateSingleManagingResource` doesn't work.

Looking at the code, I think it could be a mistake in how we're using `partitionedByManaged.get(true).stream().filter().filter().toList()` in `validateManagedTopics`. The call to `validate()` has the side effect of calling `rememberTopic`, which we depend on in `validateSingleManagingResource()` in the case where A and B are in the same reconciliation. We should probably be forcing that all topics in the batch are `validate()`ed before they're `validateSingleManagingResource()`ed, i.e. `partitionedByManaged.get(true).stream().filter().toList().stream().filter().toList()`.

I think that will make the existing logic in `validateSingleManagingResource()` work out correctly.
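A minimal, self-contained sketch of the evaluation-order difference described above (hypothetical names, not the actual UTO code): with a single lazy pipeline each element flows through both filters before the next element is processed, whereas an intermediate `toList()` forces all first-stage side effects to happen before any second-stage check runs.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins, not the UTO code: validate() "remembers" a topic as a
// side effect, and checkConflicts() depends on what has already been remembered.
public class StreamOrderingSketch {
    static final Set<String> remembered = new TreeSet<>();

    static boolean validate(String topic) {
        remembered.add(topic);          // plays the role of rememberTopic()
        return true;
    }

    static boolean checkConflicts(String topic) {
        // plays the role of validateSingleManagingResource()
        System.out.println("checking " + topic + ", remembered so far: " + remembered);
        return true;
    }

    public static void main(String[] args) {
        List<String> batch = List.of("A", "B");

        remembered.clear();
        System.out.println("single lazy pipeline:");
        // A is conflict-checked before B has even been validated.
        batch.stream()
             .filter(StreamOrderingSketch::validate)
             .filter(StreamOrderingSketch::checkConflicts)
             .toList();

        remembered.clear();
        System.out.println("with an intermediate toList():");
        // every validate() runs (and remembers its topic) before any conflict check.
        batch.stream()
             .filter(StreamOrderingSketch::validate)
             .toList()
             .stream()
             .filter(StreamOrderingSketch::checkConflicts)
             .toList();
    }
}
```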
Hi Tom, thanks for looking into this issue.

Right, `System.nanoTime` makes the issue less frequent, but it doesn't solve it, plus there are the issues you mentioned. I also tried to order by `creationTime` and `uid`, but that doesn't work either.

The problem, as I understand it, is that the `creationTime` information that we use to sort refs in `validateSingleManagingResource` comes from Kubernetes and has low resolution, so there is no guarantee that you will get the true oldest ref first when two or more refs have the same `creationTime`.

I changed the original PR introducing `KubeRef.processingOrder`, which should help to disambiguate two `KafkaTopic` resources with the same `creationTime` in the same batch. Wdyt?
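A hedged sketch of the idea described above — the `processingOrder` field name comes from the comment, but the record shape and comparator here are hypothetical, not the actual `KubeRef` implementation:

```java
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical shape, not the actual KubeRef: a per-process counter breaks ties
// between refs whose Kubernetes creationTime (1-second resolution) is equal.
record KubeRefSketch(String namespace, String name, long creationTime, long processingOrder) {

    private static final AtomicLong COUNTER = new AtomicLong();

    static KubeRefSketch of(String namespace, String name, long creationTime) {
        return new KubeRefSketch(namespace, name, creationTime, COUNTER.getAndIncrement());
    }

    // Oldest first; processingOrder only disambiguates equal creationTimes
    // encountered by the same UTO process, it says nothing across restarts.
    static final Comparator<KubeRefSketch> OLDEST_FIRST =
            Comparator.comparingLong(KubeRefSketch::creationTime)
                      .thenComparingLong(KubeRefSketch::processingOrder);
}
```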
Force-pushed from 8455b2b to bd3597f
Signed-off-by: Federico Valeri <[email protected]>
@fvaleri @tombentley Any plans to proceed with this?

Yes. It's on my todo list to look at it more closely.

Discussed on the community call on 30.11.2023: @tombentley Could you please find some time for this? Thanks.

Signed-off-by: Tom Bentley <[email protected]>

Closing in favor of #9617 which includes some of the test bits created here.
In case of `ResourceConflict` (two or more `KafkaTopic` resources with the same `topicName`) we build the error message using `creationTime` to determine the oldest resource, which is supposed to be ready. Due to the poor resolution of the timestamp provided by Kubernetes, we sometimes generate the wrong error message, as shown in the following example.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  creationTimestamp: "2023-10-19T14:46:47Z"
  name: bar
spec:
  topicName: foo
status:
  conditions:
  - lastTransitionTime: "2023-10-19T14:46:47.311487120Z"
    message: Managed by Ref{namespace='test', name='bar'}
    reason: ResourceConflict
    status: "False"
    type: Ready
```

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  creationTimestamp: "2023-10-19T14:46:47Z"
  name: foo
spec:
  topicName: foo
status:
  conditions:
  - lastTransitionTime: "2023-10-19T14:46:47.230152899Z"
    status: "True"
    type: Ready
```

In order to fix this, I simply propose to use `System.nanoTime`, without changing the original logic. I also created a new integration test that makes the concurrency issue more evident.

Should fix #9270.