-
Are the nodes up? If the master is up, it has a local kubeconfig at
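For illustration, checking cluster state from the master with that local kubeconfig could look roughly like this (the placeholder path below is hypothetical; substitute the actual file on the master):

```sh
# On the master node: point oc at the node-local kubeconfig
# (replace <local-kubeconfig-path> with the path mentioned above).
export KUBECONFIG=<local-kubeconfig-path>

# Then check node and operator status from there.
oc get nodes -o wide
oc get clusteroperators
```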
-
Hello, we found this SELinux
This was on the failed worker node.
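As a sketch of how such SELinux denials can be inspected on the node (standard audit/journal tooling, not taken from the report above):

```sh
# On the affected node (e.g. via the hardware console):
# list recent AVC denials from the audit log.
sudo ausearch -m avc -ts recent

# Confirm whether SELinux is currently enforcing.
getenforce

# Watch for new denials live while reproducing the problem.
sudo journalctl -f | grep -i avc
```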
-
Hello there,
Describe the bug
We've been running an OKD 4.7 cluster for a few months, carrying out a few updates (stable releases) successfully.
However, when upgrading from 4.7.0-0.okd-2021-05-22-050008 to 4.7.0-0.okd-2021-06-04-191031 a few days ago, the process failed.
The operator upgrades all succeeded, but the node upgrades failed, and we ended up stuck in the middle of the process with one master and one worker node unavailable:
We tried to SSH into the affected nodes to see what happened, but the SSH server was not available. Using the hardware console we saw that the machines went through an FCOS upgrade to FCOS 34.20210518.3.0; after the node boots, it immediately loses network connectivity and its hostname gets set to localhost (visible on the prompt). The issue seems to occur as soon as the FCOS 34 boot process finishes.
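As a sketch of what can be checked from the hardware console in that state (standard FCOS/NetworkManager tooling; nothing here is taken from our logs):

```sh
# Which deployment actually booted (FCOS 33 vs 34)?
rpm-ostree status

# Hostname state, and whether it fell back to the transient "localhost".
hostnamectl

# Network interface state and NetworkManager logs from this boot.
nmcli device status
journalctl -b -u NetworkManager --no-pager | tail -n 100
```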
From there, we don't really know what else to try, and any help would be greatly appreciated.
Version
UPI bare-metal cluster with 3 master and 3 worker nodes, running 4.7.0-0.okd-2021-05-22-050008.
How reproducible
The issue occurs every time. We also tried reinstalling the nodes (PXE installs) directly from the FCOS 34 image, but the node ends up downgraded to FCOS 33.20210426.3.0, then upgraded back to 34.20210518.3.0, and we lose connectivity again.
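A possible workaround sketch from the console, untested here and using standard rpm-ostree commands; note that the machine-config operator will likely try to reconcile the node back to the desired OS afterwards:

```sh
# List the deployments kept on disk (the previous FCOS 33 one should still be there).
rpm-ostree status

# Make the previous deployment the default again and reboot into it,
# just to regain networking on the node while debugging.
sudo rpm-ostree rollback
sudo systemctl reboot
```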
Log bundle
We cannot complete the `oc adm must-gather` process since it can't reach some of the nodes. I'll try to gather a console boot log from our remote console.
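As a sketch of how logs can still be captured from the remote console when SSH and must-gather cannot reach the node (file names are just examples):

```sh
# From the hardware console of the affected node: dump the journal
# for the current and previous boots to files we can copy off later.
sudo journalctl -b 0 --no-pager > /var/tmp/journal-current-boot.log
sudo journalctl -b -1 --no-pager > /var/tmp/journal-previous-boot.log

# Kernel messages for the current boot as well.
sudo dmesg > /var/tmp/dmesg.log
```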
Thanks a lot! :)