Shoot cluster nodes are not able to join cluster #68

Closed
akin-alalade opened this issue Jun 21, 2022 · 2 comments
Labels
status/closed Issue is closed (either delivered or triaged)

Comments

@akin-alalade

Hello,

We are seeing more and more nodes fail to join the shoot cluster within a reasonable time (the default timeout is 20 minutes). Only SUSE-based operating systems are affected.

Sample error:
task "Waiting until shoot worker nodes have been reconciled" failed: Error while waiting for Worker shoot--hc-can-ns2--407-e3-hdl/407-e3-hdl to become ready: error during reconciliation: Error reconciling worker: Failed while waiting for all machine deployments to be ready: 'machine(s) failed: 1 error occurred: "shoot--hc-can-ns2--407-e3-hdl-iq-large-v2-z1-6d59f-76dbq": Machine shoot--hc-can-ns2--407-e3-hdl-iq-large-v2-z1-6d59f-76dbq failed to join the cluster in 20m0s minutes.'

Logs from an affected node:
2022-06-07T17:56:01.328267+00:00 localhost cloud-init[1521]: Cloud-init v. 21.2-8.51.1 running 'modules:final' at Tue, 07 Jun 2022 17:55:37 +0000. Up 31.54 seconds.
2022-06-07T17:56:01.328446+00:00 localhost cloud-init[1521]: Cloud-init v. 21.2-8.51.1 finished at Tue, 07 Jun 2022 17:56:01 +0000. Datasource DataSourceEc2Local. Up 55.11 seconds
2022-06-07T17:56:01.340849+00:00 localhost download-cloud-config.sh[2243]: Could not retrieve the shoot access secret with name cloud-config-downloader with bootstrap token
2022-06-07T17:56:01.341162+00:00 localhost systemd[1]: cloud-config-downloader.service: Main process exited, code=exited, status=1/FAILURE
2022-06-07T17:56:01.341792+00:00 localhost systemd[1]: cloud-config-downloader.service: Unit entered failed state.
2022-06-07T17:56:01.341959+00:00 localhost systemd[1]: cloud-config-downloader.service: Failed with result 'exit-code'.
2022-06-07T17:56:01.363888+00:00 localhost systemd[1]: Started Execute cloud user/final scripts.
2022-06-07T17:56:01.364931+00:00 localhost systemd[1]: Reached target Cloud-init target.

The symptom is that the affected nodes cannot retrieve the shoot access secret named cloud-config-downloader using the bootstrap token, which the “download-cloud-config.sh” bootstrap script requires. On one of the bad nodes, the file that is supposed to contain the actual token (/var/lib/cloud-config-downloader/credentials/bootstrap-token) contained the literal text “<<BOOTSTRAP_TOKEN>>” instead.

Userdata (snippet) from good node:
mkdir -p '/var/lib/cloud-config-downloader/credentials'
cat << EOF > '/var/lib/cloud-config-downloader/credentials/bootstrap-token'
383273.kfxv0fjyl4l67vrz
EOF

Userdata (snippet) from bad node:
mkdir -p '/var/lib/cloud-config-downloader/credentials'
cat << EOF | base64 -d > '/var/lib/cloud-config-downloader/credentials/bootstrap-token'
PDxCT09UU1RSQVBfVE9LRU4+Pg==
EOF
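
Decoding the base64 payload from the bad node's userdata shows that it is the unreplaced placeholder rather than a token. A minimal Python sketch of that check follows; the helper name `classify_token` is hypothetical, and the regular expression encodes the standard Kubernetes bootstrap token format (six characters, a dot, sixteen characters, all lowercase alphanumerics).

```python
import base64
import re

# Kubernetes bootstrap tokens have the form "<6 chars>.<16 chars>",
# using lowercase letters and digits only.
TOKEN_RE = re.compile(r"[a-z0-9]{6}\.[a-z0-9]{16}")
PLACEHOLDER = "<<BOOTSTRAP_TOKEN>>"


def classify_token(content: str) -> str:
    """Hypothetical helper: classify the content of the bootstrap-token file."""
    content = content.strip()
    if content == PLACEHOLDER:
        return "placeholder"
    if TOKEN_RE.fullmatch(content):
        return "token"
    return "unknown"


# Values copied from the two userdata snippets in this issue.
good = "383273.kfxv0fjyl4l67vrz"
bad = base64.b64decode("PDxCT09UU1RSQVBfVE9LRU4+Pg==").decode("utf-8")

print(classify_token(good))
print(classify_token(bad))
```

The same classification could be run against the content of /var/lib/cloud-config-downloader/credentials/bootstrap-token on a node that fails to join.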

gardener/machine-controller-manager#351
gardener/gardener#3898

“In the proposed new flow the worker controller (here MCM) creates a temporary bootstrap-token for each created vm.
If the featureFlag BootstrapTokenForVMs is enabled a file with the content "<<BOOTSTRAP_TOKEN>>" is added to the operatingsystem-config. It is passed to the worker controller. The worker controller generates a temporary token for every new worker instance(node). It replaces the "<<BOOTSTRAP_TOKEN>>" string with the created token and adds it to the user-data placed on the vm on startup.
The cloud-config-downloader(original user-data) will then refer to the new temporary bootstrap-token in the kubelet-bootstrap script.”
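
The substitution step described in the quote can be sketched as follows. This is an illustration in Python, not MCM's actual code; `generate_token` and `inject_token` are hypothetical names. It also illustrates one possible explanation for the bad node, which is an assumption and not confirmed in this thread: a plain-text search for the placeholder cannot match when the file content in the userdata is base64-encoded.

```python
import base64
import secrets
import string

PLACEHOLDER = "<<BOOTSTRAP_TOKEN>>"
ALPHABET = string.ascii_lowercase + string.digits


def generate_token() -> str:
    """Hypothetical: random token in the "<6 chars>.<16 chars>" format."""
    token_id = "".join(secrets.choice(ALPHABET) for _ in range(6))
    token_secret = "".join(secrets.choice(ALPHABET) for _ in range(16))
    return f"{token_id}.{token_secret}"


def inject_token(user_data: str, token: str) -> str:
    """Hypothetical: replace the plain-text placeholder in the userdata."""
    return user_data.replace(PLACEHOLDER, token)


# Userdata with the placeholder in plain text (shape of the good node).
plain = f"cat << EOF > 'bootstrap-token'\n{PLACEHOLDER}\nEOF\n"

# Userdata with the placeholder base64-encoded (shape of the bad node).
encoded_placeholder = base64.b64encode(PLACEHOLDER.encode()).decode()
encoded = f"cat << EOF | base64 -d > 'bootstrap-token'\n{encoded_placeholder}\nEOF\n"

token = generate_token()
# The plain variant ends up with the token in place of the placeholder.
print(token in inject_token(plain, token))
# The encoded variant is returned unchanged, so the node would decode the placeholder.
print(inject_token(encoded, token) == encoded)
```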

(I see a note in #41 about a requirement that could not be tested successfully on SUSE specifically. I am not sure whether this is related, but most occurrences are on SUSE.)

According to the diagram in one of the issues linked above, MCM is supposed to write the token directly to the file system, which would mean the userdata always contains the placeholder text (<<BOOTSTRAP_TOKEN>>). That is not what we observe: the userdata on a good node contains the actual token.

Please clarify, thanks.

@vpnachev
Member

Replacing the <<BOOTSTRAP_TOKEN>> placeholder with a valid token is the responsibility of the MCM, not of the OS extension.
@rfranzke, may I ask you to move this issue to https://github.com/gardener/machine-controller-manager?

@vpnachev
Member

/close

in favor of gardener/machine-controller-manager#731

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Jun 29, 2022