The MCO manages the Red Hat Enterprise Linux CoreOS (RHCOS) operating system. Further,
the operating system itself is just another part of the release image, called machine-os-content.
In other words, the cluster controls the operating system.
We will use the term "bootimage" to mean an initial RHCOS disk image, such
as an AMI, bare metal raw disk image, VMware VMDK, OpenStack qcow2, etc.
These bootimages are built using coreos-assembler.
Today, the installer pins the "bootimages"
it uses, and released installers also pin the release image. As noted above,
release images contain machine-os-content, which can be a different RHCOS version. You can find the installer-pinned bootimage in e.g. this file.
A pending enhancement describes generating and inspecting bootimage data from the release image (not yet implemented).
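In the meantime, you can at least see which machine-os-content a given release image carries via oc adm release info; the release pullspec below is illustrative:

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 \
    --image-for=machine-os-content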
It's essential to understand that the bootimage and the machine-os-content container are both essentially wrappers for an OSTree commit.
The OSTree format is an image format designed for in-place operating system updates; it operates at the filesystem level (like container images) but, unlike container runtimes, has tooling to manage things such as the bootloader and the persistence of /etc and /var.
The reason we wrap an OSTree commit inside a container image is so that the release image encapsulates basically everything about a cluster (except the bootimage). This makes it easy to mirror updates offline.
We do not want to require that a new bootimage is released for every update, and in general it can be hard to require that in every environment (for example, bare metal PXE setups).
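As a concrete example of the offline-mirroring point, a single oc command mirrors a release image, machine-os-content included, into a disconnected registry (the release pullspec and mirror registry hostname here are illustrative, and you need pull credentials for both sides):

$ oc adm release mirror \
    --from=quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 \
    --to=mirror.example.com:5000/ocp4/openshift4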
As of today, when a node boots, the MCO serves it Ignition for configuration, including a systemd unit called machine-config-daemon-firstboot.service, which pulls code onto the host and runs Before=kubelet.service to perform an OS update and reboot.
One important property of this is that it means OS updates are applied
before any potentially untrusted workloads land on the node. Because we just use podman, we also don't have to worry about having old kubelet versions join a cluster.
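You can see this ordering on a running node by inspecting the unit directly; the node name below is illustrative, and the exact unit contents vary by release:

$ oc debug node/ip-10-0-1-1.ec2.internal -- chroot /host \
    systemctl cat machine-config-daemon-firstboot.service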
First, be sure you've read and understood openshift/installer overview. It's also important to understand the cluster-version-operator and release image.
In this example we'll discuss AWS, but the process is similar for other environments, such as bare metal machines booted via PXE, Azure, or a private OpenStack instance.
The openshift-installer starts and uses the AMI it has pinned for the bootstrap node as well as for the control plane.
The bootstrap node's bootkube.sh service pulls the release image, which contains a reference to the MCO (machine-config-operator) and also a reference to a newer machine-os-content. The bootkube.sh service runs the MCO in "bootstrap" mode to generate and serve Ignition to the master machines.
The control plane nodes wait in the initramfs, retrying until they are able to fetch the Ignition config from the bootstrap node.
When that succeeds, the machine-config-daemon-firstboot.service process described above runs, extracting the OS update from the machine-os-content container image, and the control plane nodes each reboot (before kubelet.service has started).
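As a rough sketch of what the control plane nodes are fetching, the Ignition config is served over HTTPS on port 22623, keyed by role; a request like the following (hostname illustrative; the Accept header tells the server which Ignition spec version to return) retrieves the master config:

$ curl -k -H 'Accept: application/vnd.coreos.ignition+json;version=3.2.0' \
    https://api-int.mycluster.example.com:22623/config/master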
When the control plane nodes reboot and form a cluster, the bootstrap node is torn down.
At this point, Ignition has been executed, and that only runs once.
The machine-config-daemon-firstboot.service is no longer used for OS updates.
The master machines use the machine-api-operator to boot the workers. Each worker pulls Ignition configs from the MCS running on the control plane. The exact same process of performing an upgrade and reboot happens for each worker.
After a node (whether control plane or worker) has joined the cluster, the MCO takes over. Previously, each individual node was running systemd units; now changes are coordinated via the MCO.
When the administrator starts an oc adm upgrade, if a new machine-os-content is provided in the release image, it will be rolled out to the control plane and workers.
Every change now will be managed by a machineconfigpool, ensuring that only one machine at a time is changed (via the default of maxUnavailable: 1).
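You can watch that rollout and check the per-pool setting with standard commands; if spec.maxUnavailable is unset, the default of 1 applies:

$ oc get machineconfigpools
$ oc get machineconfigpool worker -o jsonpath='{.spec.maxUnavailable}{"\n"}'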
Today, mostly for SELinux reasons, the MCD copies itself to the host and runs as part of the host context. It then provides the updated content to rpm-ostreed.service, which is a daemon already running on the host.
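You can see the resulting deployments (the booted OS version and any pending one) from rpm-ostree on the host; the node name below is illustrative:

$ oc debug node/ip-10-0-1-1.ec2.internal -- chroot /host rpm-ostree status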
The MCD tries to watch the systemd journal for relevant services and proxy their logs, so you should be able to use oc -n openshift-machine-config-operator logs -c machine-config-daemon pod/machine-config-daemon-... to debug.
See this pull request.
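To find the MCD pod running on a particular node, list the pods with -o wide, which includes the node column:

$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon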
When a node boots for the first time, it's the main CoreOS Ignition process which handles the config provided. This Ignition runs in the initramfs and performs any repartitioning and filesystem layout, etc.
However, MachineConfig objects also have higher-level extensions like kernelArguments that aren't handled by Ignition.
The invention of /etc/ignition-machine-config-encapsulated.json is a way to bridge these two worlds; it's the target MachineConfig object in JSON form. The MCS injects this file into the Ignition it serves to the node. Then, when the node boots into e.g. rendered-worker-0x1234, the machine-config-daemon-firstboot.service (also written by Ignition) reads that file and handles the bits (kernel arguments, extensions, etc.) that weren't handled by the "main Ignition" process.
These OS-level changes are done along with the OS update process described above. This ensures that if one specifies, for example, nosmt as a kernel argument to turn off hyperthreading and more strongly isolate workloads, that kernel argument will be applied before kubelet.service starts.
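As a minimal sketch, a MachineConfig that adds nosmt to the worker pool could look like the following (the object name 99-worker-nosmt is arbitrary); the MCO renders it into the pool and reboots the affected nodes one at a time:

$ cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-nosmt
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - nosmt
EOF

Once the worker pool has finished updating, the argument appears in /proc/cmdline on each worker.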
Q: I upgraded OpenShift and noticed that my AMI hasn't changed, is this normal?
Yes, see openshift/enhancements#201 (as well as the rest of this document): we do in-place updates without changing the bootimage.
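To confirm this, compare the AMI your machines boot from with the OS version the nodes are actually running; the OS IMAGE column reflects the in-place-updated RHCOS version, not the bootimage:

$ oc get nodes -o wide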
Q: Is the integrity of operating system upgrades verified?
The overall approach here is that the operating system is just one part of the cluster.
Integrity of the OpenShift platform is handled to start by the
cluster version operator.
Today the CVO will by default GPG verify the integrity of the release image before applying it. The release image contains a sha256 digest of machine-os-content, which is used by the MCO for updates. On the host, the container runtime podman verifies the integrity of that sha256 when pulling the image, before the MCO reads its content. Hence, there is end-to-end GPG-verified integrity for the operating system updates (as well as the rest of the cluster components, which run as regular containers).
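For example, you can print the digest-pinned machine-os-content pullspec from a release image and pull it by that digest, which makes podman verify the content hash (release pullspec illustrative, and you need credentials for the registry):

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 \
    --image-for=machine-os-content
$ podman pull "$(oc adm release info quay.io/openshift-release-dev/ocp-release:4.6.1-x86_64 --image-for=machine-os-content)"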
Q: Why do you do this weird "ostree repository in container" thing? Why ostree?
We're using a system that works; ostree is a well tested "image like" update system that has been in use for many years by multiple distributions. It handles SELinux and bootloaders, etc. We're just "encapsulating" that system inside a container image for all of the above reasons (management, etc.).
At some point in the future, though, it's likely that we will try to change the machine-os-content container to look more like an unpacked container image.
Q: How do I look at the content in the ostree repository inside the container?
You can get the ostree tool from many distributions; for example, try yum -y install ostree inside a RHEL UBI container. From there, probably the simplest thing is to use oc image extract to unpack the container image. Something like this:
$ mkdir machine-os-content
$ oc image extract quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:02d810d3eb284e684bd20d342af3a800e955cccf0bb55e23ee0b434956221bdd --path /:machine-os-content
$ find machine-os-content/srv/repo/ -name '*.commit'
machine-os-content/srv/repo/objects/33/dd81479490fbb61a58af8525a71934e7545b9ed72d846d3e32a3f33f6fac9d.commit
$ ostree --repo=machine-os-content/srv/repo ls 33dd81479490fbb61a58af8525a71934e7545b9ed72d846d3e32a3f33f6fac9d
d00755 0 0 0 /
l00777 0 0 0 /bin -> usr/bin
l00777 0 0 0 /home -> var/home
l00777 0 0 0 /lib -> usr/lib
l00777 0 0 0 /lib64 -> usr/lib64
l00777 0 0 0 /media -> run/media
l00777 0 0 0 /mnt -> var/mnt
l00777 0 0 0 /opt -> var/opt
l00777 0 0 0 /ostree -> sysroot/ostree
l00777 0 0 0 /root -> var/roothome
l00777 0 0 0 /sbin -> usr/sbin
l00777 0 0 0 /srv -> var/srv
d00755 0 0 0 /boot
d00755 0 0 0 /dev
d00755 0 0 0 /proc
d00755 0 0 0 /run
d00755 0 0 0 /sys
d00755 0 0 0 /sysroot
d01777 0 0 0 /tmp
d00755 0 0 0 /usr
d00755 0 0 0 /var
$