
[UTO] Fix wrong error message on resource conflicts #9303

Closed
wants to merge 2 commits

Conversation

fvaleri
Contributor

@fvaleri fvaleri commented Oct 29, 2023

In case of a ResourceConflict (two or more KafkaTopic resources with the same topicName), we build the error message using creationTime to determine the oldest resource, which is the one that is supposed to be ready. Due to the poor resolution of the timestamp provided by Kubernetes, we sometimes generate the wrong error message, as shown in the following example.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  creationTimestamp: "2023-10-19T14:46:47Z"
  name: bar
spec:
  topicName: foo
status:
  conditions:
  - lastTransitionTime: "2023-10-19T14:46:47.311487120Z"
    message: Managed by Ref{namespace='test', name='bar'}
    reason: ResourceConflict
    status: "False"
    type: Ready

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  creationTimestamp: "2023-10-19T14:46:47Z"
  name: foo
spec:
  topicName: foo
status:
  conditions:
  - lastTransitionTime: "2023-10-19T14:46:47.230152899Z"
    status: "True"
    type: Ready

In order to fix this, I simply propose to use System.nanoTime, without changing the original logic. I also created a new integration test that makes the concurrency issue more evident.

Should fix #9270.

@fvaleri fvaleri added this to the 0.39.0 milestone Oct 29, 2023
KubeRef(KafkaTopic kt) {
-    this(kt.getMetadata().getNamespace(), kt.getMetadata().getName(), StatusUtils.isoUtcDatetime(kt.getMetadata().getCreationTimestamp()).toEpochMilli());
+    this(kt.getMetadata().getNamespace(), kt.getMetadata().getName(), System.nanoTime());
@tombentley
Member

I don't see how this works. The old logic was to use the creation time of the KafkaTopic, which meant that comparing any two KT would always result in the same ordering, even between two successive UTO processes (even if, annoyingly, sometimes two KT would compare as equal).

Substituting System.nanoTime means that how two KafkaTopics compare now depends on encounter order. That:

  1. Might not be stable between two UTO processes.
  2. Still doesn't guarantee that no two KT are equal, because System.nanoTime doesn't provide any guarantee about when that clock ticks, only that we can read that clock to nanosecond precision (see the sketch below).
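A minimal illustration of that second point (not code from this PR): two back-to-back reads of System.nanoTime() may return the same value, and the value only reflects when the resource happened to be encountered, not anything about the resource itself.

// Minimal sketch, not part of the PR: the JVM only promises nanosecond
// precision of the returned value, not that the underlying clock ticks
// between two reads, so equal values are possible.
public class NanoTimeResolution {
    public static void main(String[] args) {
        long first = System.nanoTime();
        long second = System.nanoTime();
        // Can print true on some platforms, so nanoTime cannot serve as a
        // unique, stable ordering key for two KafkaTopics.
        System.out.println("equal reads: " + (first == second));
    }
}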

Thinking again about how the ResourceConflict can have the wrong resource name, I'm unconvinced that there's anything wrong with the comparison code in validateSingleManagingResource:

  1. KafkaTopics A and B are created within the same second (as measured on the apiserver)
  2. UTO reconciles A and decides it's Ready
  3. UTO reconciles B.

It shouldn't matter whether A and B get reconciled in the same batch or not.

  • If they're in different batches, then in step 3, when UTO compares A and B and sees their creationTimes as equal, we will satisfy the condition in validateSingleManagingResource which looks at whether one of the KT is already Ready. Topic A has already been reconciled so it is Ready, so it's B that will be seen as the conflicting one (sketched after this list). And this behaviour should be stable over time and between UTO processes.
  • If they're in the same batch, then the logic should still work because we should set the status on A before we validate B.
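A hypothetical paraphrase of that tie-break (not the actual validateSingleManagingResource code), using a simplified Ref type and assuming readiness is already known:

// Hypothetical sketch only: which of two KafkaTopics claiming the same
// topicName should get the ResourceConflict condition.
class ConflictChoice {
    record Ref(String name, String creationTime, boolean ready) { }

    static Ref pickConflicting(Ref a, Ref b) {
        // ISO-8601 timestamps in the same "Z" format sort chronologically as strings.
        int byCreation = a.creationTime().compareTo(b.creationTime());
        if (byCreation != 0) {
            return byCreation < 0 ? b : a;  // the younger resource is the conflict
        }
        // Tie on creationTime: the resource that is already Ready keeps the
        // topic, so the other one is reported as conflicting.
        return a.ready() ? b : a;
    }
}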

The question is why the existing logic in validateSingleManagingResource doesn't work.

Looking at the code, I think it could be a mistake in how we're using partitionedByManaged.get(true).stream().filter().filter().toList() in validateManagedTopics. The call to validate() has the side effect of calling rememberTopic, which we depend on in validateSingleManagingResource() in the case where A and B are in the same reconciliation. We should probably be forcing that all topics in the batch are validate()ed before they're validateSingleManagingResource()ed, i.e. partitionedByManaged.get(true).stream().filter().toList().stream().filter().toList().

I think that will make the existing logic in validateSingleManagingResource() work out correctly.
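A generic illustration of that stream-laziness point (not the Strimzi code): in a single pipeline the two filter stages interleave element by element, so the second stage can run for one topic before the first stage has run for the rest of the batch; an intermediate toList() forces the first stage to finish for every topic before the second stage starts.

import java.util.List;

public class StreamOrdering {
    public static void main(String[] args) {
        List<String> batch = List.of("A", "B");

        // Lazy, single pipeline: prints validate A, check A, validate B, check B.
        batch.stream()
             .filter(t -> { System.out.println("validate " + t); return true; })
             .filter(t -> { System.out.println("check " + t); return true; })
             .toList();

        // Forced with an intermediate toList(): prints validate A, validate B,
        // check A, check B -- every topic is validated before any is checked.
        batch.stream()
             .filter(t -> { System.out.println("validate " + t); return true; })
             .toList()
             .stream()
             .filter(t -> { System.out.println("check " + t); return true; })
             .toList();
    }
}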

Contributor Author

@fvaleri fvaleri Nov 2, 2023


Hi Tom, thanks for looking into this issue.

Right, System.nanoTime makes the issue less frequent, but it doesn't solve it, and it has the issues you mentioned. I also tried ordering by creationTime and uid, but that doesn't work either.

The problem, as I understand it, is that the creationTime we use to sort refs in validateSingleManagingResource comes from Kubernetes and has low resolution, so there is no guarantee that you will get the true oldest ref first when two or more refs have the same creationTime.

I changed the original PR to introduce KubeRef.processingOrder, which should help disambiguate two KafkaTopic resources with the same creationTime in the same batch. Wdyt?
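Roughly what the processingOrder tie-break could look like (a hypothetical sketch of the idea, not the code in this PR): creationTime stays the primary sort key, and a monotonically increasing counter assigned when the resource is first seen breaks ties between resources created within the same second.

import java.util.Comparator;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the processingOrder idea, not the PR's actual KubeRef.
record KubeRefSketch(String namespace, String name, long creationTimeMillis, long processingOrder) {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Assigned when the operator first encounters the resource.
    static long nextProcessingOrder() {
        return COUNTER.getAndIncrement();
    }

    // Oldest creationTime first; processingOrder only decides ties.
    static final Comparator<KubeRefSketch> OLDEST_FIRST =
        Comparator.comparingLong(KubeRefSketch::creationTimeMillis)
                  .thenComparingLong(KubeRefSketch::processingOrder);
}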

@scholzj
Member

scholzj commented Nov 10, 2023

@fvaleri @tombentley Any plans to proceed with this?

@tombentley
Member

Yes. It's on my todo list to look at it more closely.

@scholzj
Member

scholzj commented Nov 30, 2023

Discussed on the community call on 30.11.2023: @tombentley Could you please find some time for this? Thanks.

@scholzj scholzj modified the milestones: 0.39.0, 0.40.0 Dec 13, 2023
tombentley added a commit to tombentley/barnabas that referenced this pull request Jan 30, 2024
Signed-off-by: Tom Bentley <[email protected]>
@fvaleri
Contributor Author

fvaleri commented Jan 31, 2024

Closing in favor of #9617 which includes some of the test bits created here.

@fvaleri fvaleri closed this Jan 31, 2024
@fvaleri fvaleri deleted the fix-uto-conflict-msg branch January 31, 2024 07:32