Agent Installer installation "loses" the dns config at some point and need a manual reboot for rendez-vous host #1906

Closed
titou10titou10 opened this issue Mar 21, 2024 · 2 comments


titou10titou10 commented Mar 21, 2024

OKD version: 4.15.0-0.okd-2024-03-10-010116

Summary

I tried to install OKD on bare metal with the agent installer, as described here. Overall I succeeded, but I encountered two problems:

  • at some point, the bootstrap/rendezvous node (i.e. the "BS" node) "lost" the DNS configuration declared in the agent-config.yaml file, and it had to be re-entered manually
  • the installation could not complete because the BS node never appeared as a node (in oc get nodes). All other nodes were present, but not the BS one. Manually rebooting the node forced it to finish its initialization and appear in the list of nodes

Topology

  • 3 masters : okd5-master[1-3], 192.168.5.[63-65]
  • 2 workers: okd5-worker[1-2], 192.168.5.[66-67]
  • 1 load balancer in front of the cluster, with HAProxy (well..) configured
  • DNS and DHCP are set up and working, as are all other prerequisites
  • okd5-master1 is designated as the rendez-vous / bootstrap node (ip: 192.168.5.63)
  • 192.168.5.18 is the "private" dns server
  • domain name: "denis.prive", cluster name: okd5

Part of the agent-config.yaml:

rendezvousIP: 192.168.5.63
additionalNTPSources:
- 0.pool.ntp.org
- 1.pool.ntp.org
hosts:
  - hostname: okd5-master1
    role: master
    interfaces:
      - name: ens18
        macAddress: aa:bb:cc:dd:ee:63
    rootDeviceHints: 
      deviceName: /dev/sda
    networkConfig: 
      interfaces:
        - name: ens18
          type: ethernet
          state: up
          mac-address: aa:bb:cc:dd:ee:63
          ipv4:
            enabled: true
            dhcp: true
            auto-dns: false
            auto-gateway: true
            auto-routes: true
          ipv6:
            enabled: false
      dns-resolver:
        config:
          search:
            - denis.prive
          server:
            - 192.168.5.18
            - 8.8.8.8

The 4 other nodes follow the same pattern

Installation

First problem

After creating the ISO image, all 5 nodes are started at the same time and the installation begins.
Progress is monitored with:

./openshift-install --dir install agent wait-for install-complete

...
INFO Host okd5-master2: updated status from insufficient to known (Host is ready to be installed)
INFO Cluster is ready for install
INFO Cluster validation: All hosts in the cluster are ready to install.
INFO Preparing cluster for installation
INFO Host okd5-master2: updated status from known to preparing-for-installation (Host finished successfully to prepare for installation)
INFO Host okd5-master3 validation: Host NTP is synced
INFO Host okd5-master2 validation: Host NTP is synced
INFO Host okd5-worker2 validation: Host NTP is synced
INFO Host okd5-worker2: validation 'ntp-synced' is now fixed
INFO Host okd5-worker1 validation: Host NTP is synced
INFO Host okd5-master1 validation: Host NTP is synced
INFO Host okd5-worker1: validation 'ntp-synced' is now fixed
INFO Host okd5-master1: New image status quay.io/openshift/okd-content@sha256:786a746a4cdce34c925e0cf10082a2b9caa27edd9c0bc037272cd8a85f79f922. result: success. time: 4.04 seconds; size: 509.25 Megabytes; download rate: 132.32 MBps
INFO Host okd5-worker1: updated status from preparing-for-installation to preparing-successful (Host finished successfully to prepare for installation)
INFO Cluster installation in progress
INFO Host: okd5-master1, reached installation stage Writing image to disk
INFO Host: okd5-master2, reached installation stage Rebooting
INFO Host: okd5-master1, reached installation stage Waiting for control plane: Waiting for bootstrap node preparation
INFO Host: okd5-master1, reached installation stage Waiting for control plane: Waiting for masters to join bootstrap control plane

Then everything stops. The console of okd5-master1 shows that something is looping:

[Screenshot "Sans titre2" (untitled): okd5-master1 console, image not preserved]

I then sshed to the node:

  [root@okd5-master1 ~]# podman ps -a
  CONTAINER ID  IMAGE                                                                                                  COMMAND               CREATED        STATUS                    PORTS       NAMES
  a86556f2908e  localhost/podman-pause:4.7.0-1695838680                                                                                      8 minutes ago  Up 7 minutes                          11e0716db4f5-infra
  6eda9b76734b  quay.io/openshift/okd-content@sha256:ae9c813b78902dc4fc99cafd7b8f3d76b06aa11b4205d18f931cf62200a2c6d5  /bin/bash start_d...  7 minutes ago  Up 7 minutes                          assisted-db
  e33f5947e76e  quay.io/openshift/okd-content@sha256:ae9c813b78902dc4fc99cafd7b8f3d76b06aa11b4205d18f931cf62200a2c6d5  /assisted-service     7 minutes ago  Up 7 minutes                          service
  ebf19a760d6d  quay.io/openshift/okd-content@sha256:ae9c813b78902dc4fc99cafd7b8f3d76b06aa11b4205d18f931cf62200a2c6d5  /usr/local/bin/ag...  7 minutes ago  Exited (0) 7 minutes ago              apply-host-config
  85c950aa98b6  quay.io/openshift/okd-content@sha256:57109646c2e66aee05c7003d0e0b7f1538f37a01c2f633fad8e962b3e1727335  next_step_runner ...  7 minutes ago  Up 7 minutes                          next-step-runner
  7d601e5cca4f  quay.io/openshift/okd-content@sha256:786a746a4cdce34c925e0cf10082a2b9caa27edd9c0bc037272cd8a85f79f922  --role bootstrap ...  4 minutes ago  Up 4 minutes                          assisted-installer
  d3e4d0f0bb0c  quay.io/openshift/okd-content@sha256:b4aa05ed09915158bbf554dff010f1a5adde269a8c9a207fae85a8739b627583  start --node-name...  4 minutes ago  Exited (0) 3 minutes ago              suspicious_chandrasekhar

  [root@okd5-master1 ~]# journalctl -xn -u crio | less

  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.388636025Z" level=info msg="Registered SIGHUP reload watcher"
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.389892926Z" level=info msg="Starting seccomp notifier watcher"
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.390031988Z" level=info msg="Create NRI interface"
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.390052759Z" level=info msg="NRI interface is disabled in the configuration."
  Mar 21 01:49:39 okd5-master1 crio[6495]: time="2024-03-21 01:49:39.391515863Z" level=info msg="Serving metrics on :9537 via HTTP"
  Mar 21 01:49:39 okd5-master1 systemd[1]: Started crio.service - Container Runtime Interface for OCI (CRI-O).
  ¦¦ Subject: A start job for unit crio.service has finished successfully
  ¦¦ Defined-By: systemd
  ¦¦ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
  ¦¦
  ¦¦ A start job for unit crio.service has finished successfully.
  ¦¦
  ¦¦ The job identifier is 1618.
  Mar 21 01:49:41 okd5-master1 crio[6495]: time="2024-03-21 01:49:41.209741849Z" level=info msg="Checking image status: quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2" id=4f6e2aaa-4c1b-4252-81d2-851c74658612 name=/runtime.v1.ImageService/ImageStatus
  Mar 21 01:49:41 okd5-master1 crio[6495]: time="2024-03-21 01:49:41.210172943Z" level=info msg="Image quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2 not found" id=4f6e2aaa-4c1b-4252-81d2-851c74658612 name=/runtime.v1.ImageService/ImageStatus
  Mar 21 01:54:41 okd5-master1 crio[6495]: time="2024-03-21 01:54:41.327860027Z" level=info msg="Checking image status: quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2" id=e5855e4d-94d7-4e45-b4e2-aa9bc6ba86d4 name=/runtime.v1.ImageService/ImageStatus
  Mar 21 01:54:41 okd5-master1 crio[6495]: time="2024-03-21 01:54:41.328261482Z" level=info msg="Image quay.io/openshift/okd-content@sha256:6308b9e9ba777ea62ad55ea4ea6a9a06aa770ad40f11fc310fc915fdaf48ddb2 not found" id=e5855e4d-94d7-4e45-b4e2-aa9bc6ba86d4 name=/runtime.v1.ImageService/ImageStatus

  [root@okd5-master1 ~]# ping quay.io
  ping: quay.io: Temporary failure in name resolution

  [root@okd5-master1 ~]# more /etc/resolv.conf
  [root@okd5-master1 ~]#

So the BS node could not continue: it was unable to download images from quay.io because resolv.conf is empty at this stage ("Image quay.io/openshift/okd-content@sha256:... not found").
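The check that led to this diagnosis can be sketched as a small shell helper. This is only an illustration: the function name count_nameservers is hypothetical and is not part of OKD or the agent installer. It counts the nameserver entries in a resolv.conf-style file, so a result of 0 on the rendezvous host would match the symptom above.

```shell
#!/bin/sh
# Hypothetical helper (not part of OKD): count the "nameserver" entries
# in a resolv.conf-style file. Defaults to /etc/resolv.conf.
count_nameservers() {
    local file="${1:-/etc/resolv.conf}"
    # A missing or unreadable file counts as "no resolver configured".
    if [ ! -r "$file" ]; then
        echo 0
        return 0
    fi
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the printed count without failing the caller.
    grep -c '^[[:space:]]*nameserver[[:space:]]' "$file" || true
}

# Usage on the node:
#   if [ "$(count_nameservers)" -eq 0 ]; then
#       echo "no nameserver configured in /etc/resolv.conf"
#   fi
```

An empty file, as seen in the session above, yields 0.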

I added the lines from agent-config.yaml to /etc/resolv.conf, and the installation immediately stopped looping and went on:

    search denis.prive
    nameserver 192.168.5.18
    nameserver 8.8.8.8

The installation of the 4 other nodes then continued and succeeded.
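For reference, the manual workaround can be sketched as a shell function that prints the same resolv.conf content from the values in agent-config.yaml. The function name render_resolv_conf is hypothetical. It also assumes that /etc/resolv.conf can be written directly on the node; on some systems the file is managed by NetworkManager or systemd-resolved, and a manual change may be overwritten later.

```shell
#!/bin/sh
# Hypothetical helper: print resolv.conf content for one search domain
# followed by any number of DNS servers.
render_resolv_conf() {
    local search_domain="$1"
    shift
    printf 'search %s\n' "$search_domain"
    local server
    for server in "$@"; do
        printf 'nameserver %s\n' "$server"
    done
}

# Usage on the rendezvous host (as root), with the values from this report:
#   render_resolv_conf denis.prive 192.168.5.18 8.8.8.8 > /etc/resolv.conf
```

Printing to stdout instead of writing the file directly makes it possible to inspect the result before redirecting it.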

Second problem

Then the installation stopped again and never finished. After waiting a long time (with all nodes at about 5% CPU), I managed to open an oc session to okd-master1.

oc get nodes listed all the nodes as "Ready" except the BS node (okd5-master1), which was not in the list at all. As expected, oc get co and oc get clusterversion indicated that many operators were broken because one of the three masters was missing.

[root@kutils okd5]# oc get nodes -o wide
NAME           STATUS   ROLES                  AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                        KERNEL-VERSION          CONTAINER-RUNTIME
okd5-master2   Ready    control-plane,master   29m   v1.28.7+6e2789b   192.168.5.64   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
okd5-master3   Ready    control-plane,master   29m   v1.28.7+6e2789b   192.168.5.65   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
okd5-worker1   Ready    worker                 15m   v1.28.7+6e2789b   192.168.5.66   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2
okd5-worker2   Ready    worker                 15m   v1.28.7+6e2789b   192.168.5.67   <none>        Fedora CoreOS 39.20240210.3.0   6.7.4-200.fc39.x86_64   cri-o://1.28.2

At this point the status is this:

INFO Bootstrap Kube API Initialized
INFO Bootstrap configMap status is complete
INFO cluster bootstrap is complete

So I connected to okd5-master1 over SSH again and forced a reboot with shutdown -r now. The installation of the BS node then finished, and the cluster installation finally ran to completion with all 5 nodes known to the cluster and "Ready".
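Spotting the missing rendezvous node can be sketched as follows. The helper name missing_nodes is hypothetical. It reads the output of oc get nodes -o name (one node/<name> entry per line) on standard input and prints every expected node that is absent.

```shell
#!/bin/sh
# Hypothetical helper: given "oc get nodes -o name" output on stdin and the
# expected node names as arguments, print the expected nodes that are missing.
missing_nodes() {
    local actual node
    actual="$(cat)"
    for node in "$@"; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        if ! printf '%s\n' "$actual" | grep -xF -- "node/${node}" >/dev/null; then
            printf '%s\n' "$node"
        fi
    done
}

# Usage with the node names from this report:
#   oc get nodes -o name | missing_nodes okd5-master1 okd5-master2 okd5-master3 okd5-worker1 okd5-worker2
```

With the node list shown above, this would print only okd5-master1.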

Author

titou10titou10 commented Mar 23, 2024

"must-gather" taken directly from okd5-master1 while the installation loops, before editing the empty /etc/resolv.conf file:

ssh core@okd5-master1 sudo /usr/local/bin/agent-gather -O > okd5-master1_agent-gather.tar.gz

okd5-master1_agent-gather.tar.gz

"must-gather" taken before rebooting, when all nodes are present except the BS node:

export KUBECONFIG=...
oc login ...
oc adm must-gather 

okd5-master1-before-reboot_must-gather.tar.gz

titou10titou10 changed the title on Mar 25, 2024, from 'Agent Installer installation "loose" the dns config at some point and need a manual reboot for rendez-vous host' to 'Agent Installer installation "loses" the dns config at some point and need a manual reboot for rendez-vous host'
@JaimeMagiera
Contributor

Hi,

We are not working on FCOS builds of OKD any more. Please see these documents...

https://okd.io/blog/2024/06/01/okd-future-statement
https://okd.io/blog/2024/07/30/okd-pre-release-testing

Please test with the OKD SCOS nightlies and file a new issue as needed.

Many thanks,

Jaime
