Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vm/vm-under-test: Move first boot to new service #56

Merged
merged 6 commits into from
Jan 10, 2024

Conversation

RamLavi
Copy link
Collaborator

@RamLavi RamLavi commented Jan 9, 2024

When creating the vm-under-test container-disk image, the first-boot script running by the virt-builder command is run on a service created by virt-builder.

This service is guaranteed to run in the final stages of the boot process [0].
This behavior may create a race with the checkup, that is waiting on the
agentConnected condition [1] being added in order to run the checkup's
executor package.
The race is happening since the guest-agent service also runs during the final stages of the boot process.

This PR is solving this race by moving the content of the first-boot script to a new service that is guaranteed to run before the guest-agent service.

[0] https://www.libguestfs.org/virt-builder.1.html
[1]

if condition.Type == kvcorev1.VirtualMachineInstanceAgentConnected && condition.Status == corev1.ConditionTrue {

Moving the snippet disabling the services to a function.

Signed-off-by: Ram Lavi <[email protected]>
Moving the snippet installing packages to a function.

Signed-off-by: Ram Lavi <[email protected]>
Moving the snippet disabling swap to a function.

Signed-off-by: Ram Lavi <[email protected]>
@RamLavi RamLavi force-pushed the move_first_boot_to_new_service branch from f073dd8 to ea27364 Compare January 9, 2024 14:23
@RamLavi RamLavi requested a review from orelmisan January 9, 2024 14:25
@RamLavi
Copy link
Collaborator Author

RamLavi commented Jan 9, 2024

passes e2e test on a CNV4.15 cluster:


@orelmisan do we want to add a verbose print of the /proc/cmdline to the checkup's vm image so that we would be able to see if the cmdline ever breaks for any reason?

@RamLavi RamLavi force-pushed the move_first_boot_to_new_service branch from 854ab74 to 8bae30c Compare January 9, 2024 18:53
Copy link
Member

@orelmisan orelmisan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @RamLavi
Looks very promising.

Comment on lines 22 to 24
if systemctl --type swap list-units | grep -q '.swap'; then
systemctl mask "$(systemctl --type swap list-units | grep '.swap' | awk '{print $1}')"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please give a few words in the commit message about what these commands do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

Makefile Outdated
@@ -38,7 +38,7 @@ build:
-v $(PWD)/_go-cache:/root/.cache/go-build:Z \
--workdir $(PROJECT_WORKING_DIR) \
$(GO_IMAGE_NAME):$(GO_IMAGE_TAG) \
go build -v -o ./bin/kubevirt-realtime-checkup ./cmd/
go build -v -buildvcs=false -o ./bin/kubevirt-realtime-checkup ./cmd/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

The script is turning off swap space services by searching ans masking
them. Currently if the "systemctl --type swap list-units" fails because there
are no units with swap type then the entire script fails.
Fix this by checking if there are any Units before masking them.

Signed-off-by: Ram Lavi <[email protected]>
The first-boot script running by the virt-builder command is run on a
service created by virt-builder. This service is guaranteed to run in
the final stages of the boot process [0].
This may create a race with the checkup, that is waiting on the
agentConnected condition [1] being added in order to run the checkup's
executor package. The race is happening since the guest-agent service
also runs during the final stages of the boot process.

In order to eliminate this race, removing the first-boot script in favor
of the new service, and moving the first boot script content
into a new service. This service is
- manually created on the customize-vm script.
- explicitly scheduled to run before the guest-agent service runs.
- set to run once, using ConditionPathExists unit option [2]. Before the
script reboots it creates an empty file, so that the service will not
run again.

[0] https://www.libguestfs.org/virt-builder.1.html
[1]
https://github.com/kiagnose/kubevirt-realtime-checkup/blob/97b5fbe47fc5df6513671db6a40e8e31b0815580/pkg/internal/checkup/checkup.go#L144
[2]
https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#ConditionPathExists=

Signed-off-by: Ram Lavi <[email protected]>
It makes sense to log the guest's kernel args for debugging purposes.

Signed-off-by: Ram Lavi <[email protected]>
@RamLavi RamLavi force-pushed the move_first_boot_to_new_service branch from 8bae30c to ecec1c7 Compare January 9, 2024 19:21
@RamLavi
Copy link
Collaborator Author

RamLavi commented Jan 9, 2024

Change: Fix comments.

@RamLavi
Copy link
Collaborator Author

RamLavi commented Jan 9, 2024

passes e2e on CNV4.15 cluster:

make e2e-test
mkdir -p /home/ralavi/go/src/github.com/kiagnose/kubevirt-realtime-checkup/_go-cache
podman run --rm \
	-v /home/ralavi/go/src/github.com/kiagnose/kubevirt-realtime-checkup:/go/src/github.com/kiagnose/kubevirt-realtime-checkup:Z \
	-v /home/ralavi/go/src/github.com/kiagnose/kubevirt-realtime-checkup/_go-cache:/root/.cache/go-build:Z \
	-v /home/ralavi/.kube/sno03-cnvqe2-rdu2:/root/.kube:Z,ro \
	--workdir /go/src/github.com/kiagnose/kubevirt-realtime-checkup \
	-e KUBECONFIG=/root/.kube/kubeconfig \
	-e TEST_NAMESPACE=realtime-checkup-1 \
	-e TEST_CHECKUP_IMAGE=quay.io/ramlavi/kubevirt-realtime-checkup:devel \
	-e VM_UNDER_TEST_CONTAINER_DISK_IMAGE=quay.io/ramlavi/kubevirt-realtime-checkup-vm:latest \
	docker.io/library/golang:1.20.12 \
	go test -v ./tests/... -test.v -test.timeout=1h -ginkgo.v -ginkgo.timeout=1h 
=== RUN   TestTests
Running Suite: Tests Suite - /go/src/github.com/kiagnose/kubevirt-realtime-checkup/tests
========================================================================================
Random Seed: 1704828323

Will run 1 of 1 specs
------------------------------
[BeforeSuite] 
/go/src/github.com/kiagnose/kubevirt-realtime-checkup/tests/tests_suite_test.go:39
[BeforeSuite] PASSED [0.002 seconds]
------------------------------
Checkup execution should complete successfully
/go/src/github.com/kiagnose/kubevirt-realtime-checkup/tests/checkup_test.go:80
• [732.305 seconds]
------------------------------

Ran 1 of 1 Specs in 732.307 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestTests (732.31s)
PASS
ok  	github.com/kiagnose/kubevirt-realtime-checkup/tests	732.319s

logs show kernel is correctly updated:

...
2024/01/09 19:27:21 VMI under test guest kernel Args: cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-348.rt7.130.el8.x86_64 root=UUID=a9332d7d-1762-41cd-a702-6b2cc556c248 ro console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=d4ac9572-e828-4c87-9b76-e59c9fa6e426 console=ttyS0,115200 skew_tick=1 tsc=reliable rcupdate.rcu_normal_after_boot=1 isolcpus=managed_irq,domain,2-3 intel_pstate=disable nosoftlockup nohz=on nohz_full=2-3 rcu_nocbs=2-3 irqaffinity=0,1
[root@realtime-vmi-under-test-jqtz4 cloud-user]# 

@orelmisan
Copy link
Member

@RamLavi Please check it again after you merge the PR with quay.io/kiagnose/kubevirt-realtime-checkup-vm:main as the container disk image.

@RamLavi RamLavi merged commit b5db4d2 into kiagnose:main Jan 10, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants