Hello,
We are seeing more and more nodes failing to join the shoot cluster in a reasonable time (under the 20-minute default)... only SUSE-based OSes are affected.
Sample error:
task "Waiting until shoot worker nodes have been reconciled" failed: Error while waiting for Worker shoot--hc-can-ns2--407-e3-hdl/407-e3-hdl to become ready: error during reconciliation: Error reconciling worker: Failed while waiting for all machine deployments to be ready: 'machine(s) failed: 1 error occurred: "shoot--hc-can-ns2--407-e3-hdl-iq-large-v2-z1-6d59f-76dbq": Machine shoot--hc-can-ns2--407-e3-hdl-iq-large-v2-z1-6d59f-76dbq failed to join the cluster in 20m0s minutes.'
Logs from an affected node:
2022-06-07T17:56:01.328267+00:00 localhost cloud-init[1521]: Cloud-init v. 21.2-8.51.1 running 'modules:final' at Tue, 07 Jun 2022 17:55:37 +0000. Up 31.54 seconds.
2022-06-07T17:56:01.328446+00:00 localhost cloud-init[1521]: Cloud-init v. 21.2-8.51.1 finished at Tue, 07 Jun 2022 17:56:01 +0000. Datasource DataSourceEc2Local. Up 55.11 seconds
2022-06-07T17:56:01.340849+00:00 localhost download-cloud-config.sh[2243]: Could not retrieve the shoot access secret with name cloud-config-downloader with bootstrap token
2022-06-07T17:56:01.341162+00:00 localhost systemd[1]: cloud-config-downloader.service: Main process exited, code=exited, status=1/FAILURE
2022-06-07T17:56:01.341792+00:00 localhost systemd[1]: cloud-config-downloader.service: Unit entered failed state.
2022-06-07T17:56:01.341959+00:00 localhost systemd[1]: cloud-config-downloader.service: Failed with result 'exit-code'.
2022-06-07T17:56:01.363888+00:00 localhost systemd[1]: Started Execute cloud user/final scripts.
2022-06-07T17:56:01.364931+00:00 localhost systemd[1]: Reached target Cloud-init target.
The symptom is that the affected nodes are unable to retrieve the shoot access secret named cloud-config-downloader using the bootstrap token. This step is required by the “download-cloud-config.sh” bootstrap script. One of the bad nodes had the literal text “<<BOOTSTRAP_TOKEN>>” inside the file that is supposed to contain the actual token (/var/lib/cloud-config-downloader/credentials/bootstrap-token).
Userdata (snippet) from good node:
mkdir -p '/var/lib/cloud-config-downloader/credentials'
cat << EOF > '/var/lib/cloud-config-downloader/credentials/bootstrap-token'
383273.kfxv0fjyl4l67vrz
EOF
Userdata (snippet) from bad node:
mkdir -p '/var/lib/cloud-config-downloader/credentials'
cat << EOF | base64 -d > '/var/lib/cloud-config-downloader/credentials/bootstrap-token'
PDxCT09UU1RSQVBfVE9LRU4+Pg==
EOF
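For reference, decoding the base64 payload from the bad node's user-data confirms that the file ends up containing only the literal placeholder, never a real token:

```shell
# Decode the exact payload seen in the bad node's user-data.
echo 'PDxCT09UU1RSQVBfVE9LRU4+Pg==' | base64 -d
# prints: <<BOOTSTRAP_TOKEN>>
```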
gardener/machine-controller-manager#351
gardener/gardener#3898
“In the proposed new flow the worker controller (here MCM) creates a temporary bootstrap-token for each created vm.
If the featureFlag BootstrapTokenForVMs is enabled a file with the content "<<BOOTSTRAP_TOKEN>>" is added to the operatingsystem-config. It is passed to the worker controller. The worker controller generates a temporary token for every new worker instance(node). It replaces the "<<BOOTSTRAP_TOKEN>>" string with the created token and adds it to the user-data placed on the vm on startup.
The cloud-config-downloader(original user-data) will then refer to the new temporary bootstrap-token in the kubelet-bootstrap script.”
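To make the quoted flow concrete, here is a minimal sketch (not Gardener's actual implementation) of the substitution step it describes, using the token value from the good node above as an example:

```shell
# Sketch only: per the quoted design, the worker controller replaces the
# <<BOOTSTRAP_TOKEN>> placeholder in the user-data with a freshly generated
# per-VM token before the user-data is placed on the VM. Roughly:
token='383273.kfxv0fjyl4l67vrz'        # example per-VM token (from the good node above)
userdata_template='<<BOOTSTRAP_TOKEN>>'
printf '%s\n' "$userdata_template" | sed "s/<<BOOTSTRAP_TOKEN>>/$token/"
# prints: 383273.kfxv0fjyl4l67vrz
```

If that substitution is skipped (or happens after the user-data is already encoded), the VM boots with the placeholder instead of a token, which matches the bad-node snippet above.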
(I see a note in #41 about not being able to successfully test a requirement relating specifically to SUSE OS; I'm not sure whether this is related, but the majority of occurrences are on SUSE OS.)
According to the diagram in one of the links above, MCM is supposed to write the token directly to the filesystem, meaning the user-data would always contain the placeholder text (<<BOOTSTRAP_TOKEN>>). But this is not the case; the user-data contains the actual token.
Please clarify, thanks.