Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add security-opt, device, priviledged and bindmount to the buildah task #1530

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

shi2wei3
Copy link

Resolve https://issues.redhat.com/browse/BIFROST-475

To build bootc case image with buildah, we need additional options to be supported.

xref: https://gitlab.com/fedora/bootc/base-images/-/blob/main/Containerfile?ref_type=heads#L6

@shi2wei3
Copy link
Author

I have no idea about how to test my changes by myself because I don't know how to build a buildah task container image, it uses the released one in the task definition.

@chmeliik
Copy link
Contributor

/ok-to-test

@chmeliik
Copy link
Contributor

I have no idea about how to test my changes by myself because I don't know how to build a buildah task container image, it uses the released one in the task definition.

You can use the tkn CLI tool:

tkn bundle push quay.io/acmiel-test/tekton-catalog/task-buildah:test -f task/buildah/0.2/buildah.yaml

Push it to your own quay.io namespace, make the repo public, use the task in your pipeline

@chmeliik
Copy link
Contributor

Would this change potentially allow users to do insecure things / something that would bypass the hermeticity of the build / break SBOM completeness?

@brianwcook
Copy link
Contributor

brianwcook commented Oct 25, 2024

Would this change potentially allow users to do insecure things / something that would bypass the hermeticity of the build / break SBOM completeness?

well...will there be enough information about what options were used that we can control it via enterprise contract policies?

@shi2wei3
Copy link
Author

Those are all the options we need for bootc image right now

buildah build --security-opt=label=disable --cap-add=all --device /dev/fuse ...

@chmeliik
Copy link
Contributor

Would this change potentially allow users to do insecure things / something that would bypass the hermeticity of the build / break SBOM completeness?

well...will there be enough information about what options were used that we can control it via enterprise contract policies?

I believe so. Should we have the policies ready before we allow production use?

@brianwcook
Copy link
Contributor

Would this change potentially allow users to do insecure things / something that would bypass the hermeticity of the build / break SBOM completeness?

well...will there be enough information about what options were used that we can control it via enterprise contract policies?

I believe so. Should we have the policies ready before we allow production use?

Yes, or else this will probably weaken the security of all other pipelines. Meaning, I think we should ensure the existing policies require sane values for new options before releasing the ability to configure those same new options.

@brianwcook
Copy link
Contributor

@shi2wei3 Is this going to eventually replace the ostree builder task we have now?

@shi2wei3
Copy link
Author

@shi2wei3 Is this going to eventually replace the ostree builder task we have now?

For bootc image, I think we plan to use buildah task to replace rpm-ostree task, but I don't know if there are other projects whick still using it.

xref: https://issues.redhat.com/browse/BIFROST-408

Cc: @cgwalters , please correct me if I'm wrong.

@cgwalters
Copy link
Contributor

Yes the goal is to drop the rpm-ostree task

@cgwalters
Copy link
Contributor

Would this change potentially allow users to do insecure things

Yeah but I think we established that's true already in
https://issues.redhat.com/browse/KFLUXBUGS-1499

It'd make sense for this to be opt-in...and perhaps the high level flag should just be --enable-nested-containers (which would map to these underlying options). For example, --device /dev/fuse is quite safe to add, but other devices less so.

@brianwcook
Copy link
Contributor

brianwcook commented Oct 27, 2024 via email

@syedriko
Copy link

syedriko commented Oct 27, 2024

@syedriko wdyt, could this feature work in instead of the specific The GPU one?

It absolutely could.
And while we're adding knobs to the buildah task, can we expose TMPDIR, as well? With hermetic build, I'm running out of space under /var/tmp and I'm not seeing a way to point TMPDIR someplace else.

@shi2wei3 shi2wei3 marked this pull request as draft October 28, 2024 01:36
@shi2wei3
Copy link
Author

I need to re-work on this PR since it not works as expected during my test.

@shi2wei3
Copy link
Author

@cgwalters , sorry to bother, but the workflow in this konflux buildah task is different than directly run buildah build --security-opt=label=disable --cap-add=all --device /dev/fuse ... locally, I don't think I have enough experience to sort it out for now, could you help me to take a look?

What I did in this PR will add --security-opt=label=disable and --device /dev/fuse to the podman run cmd which runs remotely, and also pass them to the generated scripts/script-build.sh, and finally add those two options to the buildah_cmd in the generated build script and it will be triggered in a new unshare ns.

With this PR, our bootc image build will return the following error message:

[build] Adding the entitlement to the build
[build] [1/2] STEP 1/3: FROM registry.stage.redhat.io/rhel9/bootc-image-builder:9.6 AS builder
[build] [1/2] STEP 2/3: ARG MANIFEST=rhel-bootc.yaml
[build] [1/2] STEP 3/3: RUN --mount=type=cache,target=/workdir --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared . /cachi2/cachi2.env &&     rpm-ostree compose image --cachedir=/workdir --format=ociarchive --initialize /buildcontext/${MANIFEST} /buildcontext/out.ociarchive
[build] bwrap: Can't mount proc on /newroot/proc: Operation not permitted
[build] error: bwrap test failed, see <https://github.com/coreos/rpm-ostree/pull/429>: bwrap(true): Child process killed by signal 1
[build] error: compose tree failed: ExitStatus(unix_wait_status(256))
[build] subprocess exited with status 1
[build] subprocess exited with status 1
[build] Error: building at STEP "RUN --mount=type=cache,target=/workdir --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared . /cachi2/cachi2.env &&     rpm-ostree compose image --cachedir=/workdir --format=ociarchive --initialize /buildcontext/${MANIFEST} /buildcontext/out.ociarchive": exit status 1

I checked from the log, these options has been passed to the podman run command which runs over ssh:

[build] + ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 -o ServerAliveCountMax=10 <user>@<ip> podman run --cap-add=all --security-opt=label=disable --device=/dev/fuse ...

@cgwalters
Copy link
Contributor

That podman run invocation is just going to end up invoking buildah right? Are the options being passed to buildah too?

Remember there's actually three levels of containerization going on AIUI:

  • podman
  • buildah
  • rpm-ostree invoking bwrap

And it's the 3rd level we're trying to enable.

Isn't the outer podman run invocation already passing some of these caps?

@shi2wei3 shi2wei3 force-pushed the buildah-nested branch 2 times, most recently from d27f871 to 5cad861 Compare November 4, 2024 06:39
@shi2wei3
Copy link
Author

shi2wei3 commented Nov 4, 2024

That podman run invocation is just going to end up invoking buildah right? Are the options being passed to buildah too?

Right, and it invokes buildah inside of unshare, see

unshare -Uf "${UNSHARE_ARGS[@]}" --keep-caps -r --map-users 1,1,65536 --map-groups 1,1,65536 -w "${SOURCE_CODE_DIR}/$CONTEXT" -- sh -c "$command"

And I've already passed those options to buildah cmd, from the debug log, I can find those options and please help me to check whether the caps in buildah debug log is sufficient for bootc image building.

# unshare -Uf --net --keep-caps -r --map-users 1,1,65536 --map-groups 1,1,65536 -w source/. -- sh -c ip link set lo up && buildah build --volume /tmp/cachi2:/cachi2 --volume /var/workdir/fetched.repos.d:/etc/yum.repos.d --volume /tmp/entitlement:/etc/pki/entitlement --log-level=debug --pull=never --security-opt=label=disable --device=/dev/fuse --cap-add=all --label build-date=2024-11-04T07:51:51 --label architecture=aarch64 --label vcs-type=git --label vcs-ref=37fb332eaf93837afb28dd0dffa01fa456b1cc4f --label quay.expires-after=5d --tls-verify=true --no-cache --ulimit nofile=4096:4096 -f /tmp/Containerfile.BhWzeP -t <repo:tag> .

[build] time="2024-11-04T07:51:52Z" level=debug msg="effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true]"
[build] time="2024-11-04T07:51:52Z" level=debug msg="Pull Policy for pull [never]"
...

Remember there's actually three levels of containerization going on AIUI:

  • podman
  • buildah
  • rpm-ostree invoking bwrap

And it's the 3rd level we're trying to enable.

Sorry, I didn't get what you mean here, is the development work is not completed yet to support building bootc image from buildah with Containerfile?

Isn't the outer podman run invocation already passing some of these caps?

The outer podman run runs over ssh with a non-root user, this remote user will be delete after the task is completed. I'm not sure what caps it has, but I'm sure I've added these options to the outer podman run command, I can find them in the debug output:

ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 -o ServerAliveCountMax=10 <non-root-user>@<ip> podman run --cap-add=all --security-opt=label=disable --device=/dev/fuse <other options>  --user=0 --rm quay.io/konflux-ci/buildah-task:latest@sha256:b2d6c32d1e05e91920cd4475b2761d58bb7ee11ad5dff3ecb59831c7572b4d0c /scripts/script-build.sh

I tried to reproduce this failure by simulate what konflux buildah task did in my local env, but without any luck, I got a different error, maybe that's because there are still some differences I didn't find out between my local environment and the konflux environment.

# cat /scripts/script-build.sh
#!/bin/bash

# ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=60 -o ServerAliveCountMax=10 <user>@<local_vm> \
#   podman run --rm --user=0 \
#   --cap-add=all --security-opt=label=disable --device=/dev/fuse \
#   -v "/rhel-bootc:/rhel-bootc:Z" \
#   -v "/scripts:/scripts:Z" \
#   quay.io/konflux-ci/buildah-task:latest /scripts/script-build.sh

chown root:root /var/lib/containers
echo 'root:1:4294967294' | tee -a /etc/subuid >>/etc/subgid
COMMAND="buildah build --security-opt=label=disable --cap-add=all --device /dev/fuse -t localhost/rhel-bootc -f Containerfile"
unshare -Ufp --keep-caps -r --map-users 1,1,65536 --map-groups 1,1,65536 -- buildah --log-level=debug pull <bib stage image>
unshare -Uf --net --keep-caps -r --map-users 1,1,65536 --map-groups 1,1,65536 -w /rhel-bootc -- sh -c "$COMMAND"


# podman run --rm --user=0 --cap-add=all --security-opt=label=disable --device=/dev/fuse -v "/rhel-bootc:/rhel-bootc:Z" -v "/scripts:/scripts:Z" quay.io/konflux-ci/buildah-task:latest /scripts/script-build.sh
Trying to pull quay.io/konflux-ci/buildah-task:latest...
Getting image source signatures
Copying blob 3a8929253bff done   |
Copying blob 885f6bb378fc done   |
Copying blob 01f40b4c926e done   |
Copying blob 17b076046ff4 done   |
Copying blob 3a2b607ffaba done   |
Copying config f4c834bdec done   |
Writing manifest to image destination
time="2024-11-04T08:55:19Z" level=debug msg="effective capabilities: [audit_control=true audit_read=true audit_write=true block_suspend=true bpf=true checkpoint_restore=true chown=true dac_override=true dac_read_search=true fowner=true fsetid=true ipc_lock=true ipc_owner=true kill=true lease=true linux_immutable=true mac_admin=true mac_override=true mknod=true net_admin=true net_bind_service=true net_broadcast=true net_raw=true perfmon=true setfcap=true setgid=true setpcap=true setuid=true sys_admin=true sys_boot=true sys_chroot=true sys_module=true sys_nice=true sys_pacct=true sys_ptrace=true sys_rawio=true sys_resource=true sys_time=true sys_tty_config=true syslog=true wake_alarm=true]"
time="2024-11-04T08:55:19Z" level=debug msg="[graphdriver] trying provided driver \"overlay\""
time="2024-11-04T08:55:19Z" level=debug msg="overlay: imagestore=/var/lib/shared"
time="2024-11-04T08:55:19Z" level=debug msg="overlay: imagestore=/usr/lib/containers/storage"
time="2024-11-04T08:55:19Z" level=debug msg="overlay: mount_program=/usr/bin/fuse-overlayfs"
Error: mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted
time="2024-11-04T08:55:19Z" level=debug msg="[graphdriver] trying provided driver \"overlay\""
time="2024-11-04T08:55:19Z" level=debug msg="overlay: imagestore=/var/lib/shared"
time="2024-11-04T08:55:19Z" level=debug msg="overlay: imagestore=/usr/lib/containers/storage"
time="2024-11-04T08:55:19Z" level=debug msg="overlay: mount_program=/usr/bin/fuse-overlayfs"
time="2024-11-04T08:55:19Z" level=warning msg="failed to shutdown storage: \"mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted\""
Error: mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted
time="2024-11-04T08:55:19Z" level=warning msg="failed to shutdown storage: \"mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted\""


# unshare -Ufp --keep-caps -r --map-users 1,1,65536 --map-groups 1,1,65536 -- buildah info
Error: mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted
WARN[0000] failed to shutdown storage: "mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted"
# cat /etc/redhat-release
Fedora release 40 (Forty)
# rpm -q buildah
buildah-1.37.3-1.fc40.x86_64

If you want to check the full debug output of my failed test, please find the link in the last pipeline job on MR 1446 in the downsteam rhel-bootc repo.

@cgwalters
Copy link
Contributor

Sorry, I didn't get what you mean here, is the development work is not completed yet to support building bootc image from buildah with Containerfile?

This one definitely works, I use it often (usually with podman, not buildah but it should be the same in general). We can add a CI check that it works but hopefully of course this gets finished and every PR is our CI.

@cgwalters
Copy link
Contributor

time="2024-11-04T08:55:19Z" level=debug msg="[graphdriver] trying provided driver "overlay""

I thought konflux was forcing on vfs? That may be a difference.

@shi2wei3
Copy link
Author

shi2wei3 commented Nov 5, 2024

@cgwalters I can reproduce the failure locally now with the following one-line command:

podman run --rm --user=0 --cap-add=all --security-opt=label=disable --device=/dev/fuse -v "/rhel-bootc:/rhel-bootc:Z" -v "/home/wshi/.docker/:/root/.docker:Z" -e STORAGE_DRIVER=vfs quay.io/konflux-ci/buildah-task:latest unshare -Uf --keep-caps -r --map-users 1,1,65536 --map-groups 1,1,65536 -w /rhel-bootc -- sh -c "buildah build --security-opt=label=disable --cap-add=all --device /dev/fuse -t localhost/rhel-bootc -f Containerfile"

I tried to add --privileged to podman run, then the bwrap error is gone, but I hit a new error now...

Downloading from 'appstream'...done
Downloading from 'baseos'...done
Importing packages...done: 441
Skipping file /usr/lib/systemd/system/sysinit.target.wants/systemd-firstboot.service from checkout
Skipping file /usr/lib/systemd/system-generators/systemd-gpt-auto-generator from checkout
Skipping file /usr/etc/grub.d/08_fallback_counting from checkout
Skipping file /usr/etc/grub.d/10_reset_boot_success from checkout
Skipping file /usr/etc/grub.d/12_menu_auto_hide from checkout
Skipping file /usr/lib/systemd/ from checkout
Checking out packages...done
Running pre scripts...20 done
Running post scripts...done
error: Installing packages: While applying overrides for pkg shadow-utils: Copyup usr/bin/chage: Setting xattrs: fsetxattr(security.selinux): Operation not supported
error: compose tree failed: ExitStatus(unix_wait_status(256))
subprocess exited with status 1
subprocess exited with status 1
Error: building at STEP "RUN --mount=type=cache,target=/workdir --mount=type=bind,rw=true,src=.,dst=/buildcontext,bind-propagation=shared rpm-ostree compose image --cachedir=/workdir --format=ociarchive --initialize /buildcontext/${MANIFEST} /buildcontext/out.ociarchive": exit status 1

And I'm not sure if adding --privileged to this rootless container can get approved, I saw we use it in rpm-ostree task.

@cgwalters
Copy link
Contributor

cgwalters commented Nov 5, 2024

And I'm not sure if adding --privileged to this rootless container can get approved,

I think this is not really different from --cap-add=all. In the end, what we need to do is aim to drop some of the magic in this container build (exactly things like attempting to write security.selinux to the filesystem), we can also drop the FUSE requirement etc. In the end some nested containerization support is going to be needed, but that's relatively straightforward and safe with user namespaces and at that point I think we can safely allow "nested containers" to be an opt-in flag.

Which BTW ultimately we do want to productize in kube, so the flag eventually here would look like "the workspace can enable hostUsers: false in generated pod workloads" or so (ref https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#alpha-level )

error: Installing packages: While applying overrides for pkg shadow-utils: Copyup usr/bin/chage: Setting xattrs: fsetxattr(security.selinux): Operation not supported

The way this is working today with a default podman build is because of the RUN --mount=type=cache,target=/workdir - that cache directory will be a temporary dir that's not on overlayfs so we can set security.selinux on it. Is that somehow being lost here maybe? Or maybe when using VFS it has different semantics?

We may also be able to fix this by changing that inner containerfile to use RUN --mount=type=tmpfs,target=/workdir or so?

@brianwcook
Copy link
Contributor

brianwcook commented Nov 5, 2024

And I'm not sure if adding --privileged to this rootless container can get approved

as long as it is off by default and controlled by a parameter then we can allow it only in certain circumstances where it is needed and it should be acceptable.

Anything that needs more permissions than the current OpenShift security context allows will need to be scheduled to a VM but bootc builds already are anyway.

@cgwalters
Copy link
Contributor

Anything that needs more permissions than the current OpenShift security context allows will need to be scheduled to a VM but bootc builds already are anyway.

Yes. This relates to a larger picture point around why we haven't prioritized trying to lessen the privileges that the build requires - it's basically because anyone who is making bootc container images is going to want to boot it for testing - and that means a VM or bare metal.

So in the end, that infrastructure (containers and VMs/metal) needs to be coupled and available. The reality is of course today the build and test infrastructure is wildly distinct; to the best of my knowledge the Konflux hardware provisioning and management is distinct from Testing Farm for example which is where a lot of the bootc tests run.

@chmeliik
Copy link
Contributor

chmeliik commented Nov 6, 2024

Just FYI, the Checkton errors should go away if you rebase (it diffs the errors between main and your branch and tries to report only the new ones, which doesn't seem to work correctly if your branch is behind)

@shi2wei3
Copy link
Author

shi2wei3 commented Nov 6, 2024

We may also be able to fix this by changing that inner containerfile to use RUN --mount=type=tmpfs,target=/workdir or so?

I google it and found you answered it when Brian hit the similar issue, and it works when I test it locally, I've update the buildah task and test it in MR1446.

coreos/rpm-ostree#4648

@shi2wei3
Copy link
Author

shi2wei3 commented Nov 6, 2024

Just FYI, the Checkton errors should go away if you rebase (it diffs the errors between main and your branch and tries to report only the new ones, which doesn't seem to work correctly if your branch is behind)

Thanks for the remind, I'll rebase and update the changes when my test is pass.

@cgwalters
Copy link
Contributor

I google it and found you answered it when Brian hit the similar issue,

Heh yeah, totally forgot about that...

@shi2wei3 shi2wei3 marked this pull request as ready for review November 6, 2024 16:47
@openshift-ci openshift-ci bot requested a review from chmeliik November 6, 2024 16:47
@shi2wei3 shi2wei3 changed the title Add security-opt and device to the buildah task Add security-opt, device, priviledged and bindmount to the buildah task Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants