
Cluster observedGeneration is updated without changing conditions on topology upgrades #11292

Closed
dkoshkin opened this issue Oct 15, 2024 · 5 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@dkoshkin
Contributor

What steps did you take and what happened?

Followed the Quick Start for Docker with Kubernetes v1.31.0, then updated the Cluster object and changed the version to v1.31.1.

The observedGeneration got updated without any of the conditions changing.

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1beta1","kind":"Cluster","metadata":{"annotations":{},"name":"capi-quickstart","namespace":"default"},"spec":{"clusterNetwork":{"pods":{"cidrBlocks":["192.168.0.0/16"]},"serviceDomain":"k8s.test","services":{"cidrBlocks":["10.96.0.0/12"]}},"topology":{"class":"quick-start","controlPlane":{"metadata":{},"replicas":1},"variables":[{"name":"imageRepository","value":""},{"name":"etcdImageTag","value":""},{"name":"coreDNSImageTag","value":""},{"name":"podSecurityStandard","value":{"audit":"restricted","enabled":false,"enforce":"baseline","warn":"restricted"}}],"version":"v1.31.0","workers":{"machineDeployments":[{"class":"default-worker","name":"md-0","replicas":1}],"machinePools":[{"class":"default-worker","name":"mp-0","replicas":1}]}}}}
  creationTimestamp: "2024-10-15T19:42:12Z"
  finalizers:
  - cluster.cluster.x-k8s.io
  generation: 4
  labels:
    cluster.x-k8s.io/cluster-name: capi-quickstart
    topology.cluster.x-k8s.io/owned: ""
  name: capi-quickstart
  namespace: default
  resourceVersion: "1936"
  uid: d5602238-28ad-4b63-b785-d9766362e561
status:
  conditions:
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneReady
  - lastTransitionTime: "2024-10-15T19:42:13Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-15T19:42:15Z"
    status: "True"
    type: TopologyReconciled
  infrastructureReady: true
  observedGeneration: 3
  phase: Provisioned

Notice that observedGeneration is now 4, without any conditions or other status fields changing.

status:
  conditions:
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneInitialized
  - lastTransitionTime: "2024-10-15T19:42:45Z"
    status: "True"
    type: ControlPlaneReady
  - lastTransitionTime: "2024-10-15T19:42:13Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-15T19:42:15Z"
    status: "True"
    type: TopologyReconciled
  infrastructureReady: true
  observedGeneration: 4
  phase: Provisioned

What did you expect to happen?

We rely on a combination of watching for observedGeneration to equal generation, and then for the conditions on the Cluster to be True, to determine when a cluster has completed an upgrade.

In this case, though, observedGeneration is updated without any change in the conditions, which causes a race: the wait returns prematurely, before the upgrade has even started.

I would expect some status field to change along with observedGeneration, indicating that the spec has changed and still needs to be processed.
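
To illustrate, here is a minimal sketch of the kind of check we rely on (clusterCaughtUp is an illustrative name, not a Cluster API function; it assumes the v1beta1 API types and the condition helpers from sigs.k8s.io/cluster-api/util/conditions):

package wait

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// clusterCaughtUp reports whether the Cluster controller has both observed
// the latest spec and reported the cluster healthy.
func clusterCaughtUp(cluster *clusterv1.Cluster) bool {
	// Step 1: the controller has seen the latest generation of the spec.
	if cluster.Status.ObservedGeneration < cluster.Generation {
		return false
	}
	// Step 2: the cluster reports healthy. This is where the race bites:
	// after a topology change, observedGeneration can be bumped while every
	// condition below is still True from before the upgrade, so this
	// returns true before the upgrade has actually started.
	return conditions.IsTrue(cluster, clusterv1.InfrastructureReadyCondition) &&
		conditions.IsTrue(cluster, clusterv1.ControlPlaneReadyCondition) &&
		conditions.IsTrue(cluster, clusterv1.TopologyReconciledCondition) &&
		conditions.IsTrue(cluster, clusterv1.ReadyCondition)
}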

Cluster API version

$ clusterctl version
clusterctl version: &version.Info{Major:"", Minor:"", GitVersion:"1.8.4", GitCommit:"brew", GitTreeState:"clean", BuildDate:"2024-10-08T05:24:23Z", GoVersion:"go1.23.2", Compiler:"gc", Platform:"darwin/arm64"}

Kubernetes version

v1.31.0 → v1.31.1

Anything else you would like to add?

We've currently hacked around this by sleeping between changing the resource and starting the wait (as sketched below), but would appreciate guidance from others who have seen this and have other ideas.
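
Concretely, the hack looks something like this (waitForClusterUpgraded is an illustrative name, not anything from Cluster API; it reuses clusterCaughtUp from the earlier sketch, and all durations are arbitrary):

package wait

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/types"
	apiwait "k8s.io/apimachinery/pkg/util/wait"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForClusterUpgraded sleeps before polling so the topology controller has
// a chance to start reconciling the new spec; without the sleep, the first
// poll can observe pre-upgrade conditions and report success immediately.
func waitForClusterUpgraded(ctx context.Context, c client.Client, key types.NamespacedName) error {
	time.Sleep(30 * time.Second) // the hack: an arbitrary grace period
	return apiwait.PollUntilContextTimeout(ctx, 10*time.Second, 30*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			cluster := &clusterv1.Cluster{}
			if err := c.Get(ctx, key, cluster); err != nil {
				return false, err
			}
			return clusterCaughtUp(cluster), nil // see the earlier sketch
		})
}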

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 15, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@sbueringer
Member

sbueringer commented Oct 16, 2024

Which of the conditions would you have expected to change?

That being said, we're working on new conditions for the next release (#11291). I think this will probably be solved through new additional conditions that also contain observedGeneration.

@dkoshkin
Contributor Author

dkoshkin commented Oct 22, 2024

Which of the conditions would you have expected to change?

That being said, we're working on new conditions for the next release (#11291). I think this will probably be solved through new additional conditions that also contain observedGeneration.

@sbueringer it doesn't matter too much; our code can be extended to handle any new conditions. The current wait code checks the following to know when a Cluster has finished upgrading:

// InfrastructureReady and ControlPlaneReady will be the first two conditions set as the cluster is being created or upgraded.
// TopologyReconciled will be true once Cluster.spec.topology has been applied to the objects in the Cluster,
// but this does not imply those objects have already been reconciled to the provided spec.
// The final Ready condition will be true once the Cluster objects have finished reconciling,
// but it will be true before the MachineDeployments are ready.
return infrastructureReady && controlPlaneReady && topologyReconciled && ready

When I initially wrote this I made the incorrect assumption that InfrastructureReady and ControlPlaneReady would become false once observedGeneration changed, but missed the race.

We use a similar strategy of relying on an observedGeneration change along with different conditions for both MachineDeployments and KubeadmControlPlane, and haven't seen any race conditions there.

@fabriziopandini
Member

TBH, I don't see a correlation between observedGeneration and upgrades being completed, or even conditions changing.

If you want to check whether an upgrade has completed, you should check the Machines' versions, e.g. something like the sketch below.
New conditions will also help, but I do not consider this a bug in CAPI.
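
(A minimal sketch of that check, assuming a controller-runtime client; allMachinesAtVersion is an illustrative name, and a real check would likely distinguish control-plane from worker Machines:)

package wait

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// allMachinesAtVersion lists the Machines belonging to the cluster and
// reports whether every one of them has the target version in its spec.
// Note: spec.version is the desired version; a stricter check might also
// look at status.nodeInfo.kubeletVersion to confirm the node-level rollout.
func allMachinesAtVersion(ctx context.Context, c client.Client, namespace, clusterName, version string) (bool, error) {
	machines := &clusterv1.MachineList{}
	if err := c.List(ctx, machines,
		client.InNamespace(namespace),
		client.MatchingLabels{clusterv1.ClusterNameLabel: clusterName},
	); err != nil {
		return false, err
	}
	for _, m := range machines.Items {
		if m.Spec.Version == nil || *m.Spec.Version != version {
			return false, nil
		}
	}
	// Require at least one Machine so an empty list doesn't count as "done".
	return len(machines.Items) > 0, nil
}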

/close

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

