
Cluster observedGeneration is updated without changing conditions on topology upgrades #11292

Closed
dkoshkin opened this issue Oct 15, 2024 · 5 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@dkoshkin
Contributor

What steps did you take and what happened?

Followed the Quick Start for Docker with Kubernetes v1.31.0, then updated the Cluster object and changed the version to v1.31.1.

The observedGeneration got updated without any of the conditions changing.

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1beta1","kind":"Cluster","metadata":{"annotations":{},"name":"capi-quickstart","namespace":"default"},"spec":{"clusterNetwork":{"pods":{"cidrBlocks":["192.168.0.0/16"]},"serviceDomain":"k8s.test","services":{"cidrBlocks":["10.96.0.0/12"]}},"topology":{"class":"quick-start","controlPlane":{"metadata":{},"replicas":1},"variables":[{"name":"imageRepository","value":""},{"name":"etcdImageTag","value":""},{"name":"coreDNSImageTag","value":""},{"name":"podSecurityStandard","value":{"audit":"restricted","enabled":false,"enforce":"baseline","warn":"restricted"}}],"version":"v1.31.0","workers":{"machineDeployments":[{"class":"default-worker","name":"md-0","replicas":1}],"machinePools":[{"class":"default-worker","name":"mp-0","replicas":1}]}}}}
  creationTimestamp: "2024-10-15T19:42:12Z"
  finalizers:
  - cluster.cluster.x-k8s.io
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: capi-quickstart
    topology.cluster.x-k8s.io/owned: ""
  name: capi-quickstart
  namespace: default
  resourceVersion: "1936"
  uid: d5602238-28ad-4b63-b785-d9766362e561
status:
  conditions:
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneReady
  - lastTransitionTime: "2024-10-15T19:42:13Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-15T19:42:15Z"
    status: "True"
    type: TopologyReconciled
  infrastructureReady: true
  observedGeneration: 3
  phase: Provisioned

Notice that observedGeneration is now 4, without any conditions or other status fields changing.

status:
  conditions:
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneReady
  - lastTransitionTime: "2024-10-15T19:42:13Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-15T19:42:15Z"
    status: "True"
    type: TopologyReconciled
  infrastructureReady: true
  observedGeneration: 4
  phase: Provisioned

What did you expect to happen?

We rely on a combination of watching for observedGeneration to equal generation, and then for the conditions on the Cluster to be True, to determine when a cluster has completed an upgrade.

In this case, though, observedGeneration is updated without any change in the conditions, which causes a race: the wait returns prematurely, before the upgrade has even started.

I would expect some status field to change along with observedGeneration, indicating that the spec has changed and still needs to be processed.
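
To illustrate, here is a minimal sketch of the kind of check we rely on (clusterCaughtUp is an illustrative name, not a Cluster API function; it assumes the v1beta1 API types and the condition helpers from sigs.k8s.io/cluster-api/util/conditions):

package wait

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// clusterCaughtUp reports whether the Cluster controller has both observed
// the latest spec and reported the cluster healthy.
func clusterCaughtUp(cluster *clusterv1.Cluster) bool {
	// Step 1: the controller has seen the latest generation of the spec.
	if cluster.Status.ObservedGeneration < cluster.Generation {
		return false
	}
	// Step 2: the cluster reports healthy. This is where the race bites:
	// after a topology change, observedGeneration can be bumped while every
	// condition below is still True from before the upgrade, so this
	// returns true before the upgrade has actually started.
	return conditions.IsTrue(cluster, clusterv1.InfrastructureReadyCondition) &&
		conditions.IsTrue(cluster, clusterv1.ControlPlaneReadyCondition) &&
		conditions.IsTrue(cluster, clusterv1.TopologyReconciledCondition) &&
		conditions.IsTrue(cluster, clusterv1.ReadyCondition)
}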

Cluster API version

$ clusterctl version
clusterctl version: &version.Info{Major:"", Minor:"", GitVersion:"1.8.4", GitCommit:"brew", GitTreeState:"clean", BuildDate:"2024-10-08T05:24:23Z", GoVersion:"go1.23.2", Compiler:"gc", Platform:"darwin/arm64"}

Kubernetes version

v1.31.0 → v1.31.1

Anything else you would like to add?

We've currently hacked around this by sleeping between changing the resource and starting the wait (as sketched below), but would appreciate guidance from others who have seen this and have other ideas.
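
Concretely, the hack looks something like this (waitForClusterUpgraded is an illustrative name, not anything from Cluster API; it reuses clusterCaughtUp from the earlier sketch, and all durations are arbitrary):

package wait

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/types"
	apiwait "k8s.io/apimachinery/pkg/util/wait"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForClusterUpgraded sleeps before polling so the topology controller has
// a chance to start reconciling the new spec; without the sleep, the first
// poll can observe pre-upgrade conditions and report success immediately.
func waitForClusterUpgraded(ctx context.Context, c client.Client, key types.NamespacedName) error {
	time.Sleep(30 * time.Second) // the hack: an arbitrary grace period
	return apiwait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			cluster := &clusterv1.Cluster{}
			if err := c.Get(ctx, key, cluster); err != nil {
				return false, err
			}
			return clusterCaughtUp(cluster), nil // see the earlier sketch
		})
}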

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 15, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@sbueringer
Member

sbueringer commented Oct 16, 2024

Which of the conditions would you have expected to change?

That being said, we're working on new conditions for the next release (#11291). I think this will probably be solved through new additional conditions that also contain observedGeneration.

@dkoshkin
Contributor Author

dkoshkin commented Oct 22, 2024

Which of the conditions would you have expected to change?

That being said, we're working on new conditions for the next release (#11291). I think this will probably be solved through new additional conditions that also contain observedGeneration.

@sbueringer it doesn't matter too much; our code can be extended to handle any new conditions. The current wait code checks the following to know when a Cluster has finished upgrading:

// InfrastructureReady and ControlPlaneReady will be the first two conditions set as the cluster is being created or upgraded.
// TopologyReconciled will be true once Cluster.spec.topology has been applied to the objects in the Cluster,
// but this does not imply those objects have already been reconciled to the provided spec.
// The final Ready condition will be true once the Cluster objects have finished reconciling,
// but it will be true before the MachineDeployments are ready.
return infrastructureReady && controlPlaneReady && topologyReconciled && ready

When I initially wrote this I made the incorrect assumption that InfrastructureReady and ControlPlaneReady would become false once observedGeneration changed, but missed the race.

We use a similar strategy of relying on an observedGeneration change along with different conditions for both MachineDeployments and KubeadmControlPlane, and haven't seen any race conditions there.

@fabriziopandini
Member

TBH, I don't see a correlation between observedGeneration and upgrades being completed, or even conditions changing.

If you want to check whether an upgrade has completed, you should check the Machines' versions, e.g. something like the sketch below.
New conditions will also help, but I do not consider this a bug in CAPI.
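
(A minimal sketch of that check, assuming a controller-runtime client; allMachinesAtVersion is an illustrative name, and a real check would likely distinguish control-plane from worker Machines:)

package wait

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// allMachinesAtVersion lists the Machines belonging to the cluster and
// reports whether every one of them has the target version in its spec.
// Note: spec.version is the desired version; a stricter check might also
// look at status.nodeInfo.kubeletVersion to confirm the node-level rollout.
func allMachinesAtVersion(ctx context.Context, c client.Client, namespace, clusterName, version string) (bool, error) {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(namespace),
		client.MatchingLabels{clusterv1.ClusterNameLabel: clusterName},
	); err != nil {
		return false, err
	}
	for _, m := range machines.Items {
		if m.Spec.Version == nil || *m.Spec.Version != version {
			return false, nil
		}
	}
	// Require at least one Machine so an empty list doesn't count as "done".
	return len(machines.Items) > 0, nil
}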

/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

