Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buildah-remote task: concurrency problems with source workspace #1275

Open
owtaylor opened this issue Aug 9, 2024 · 23 comments
Open

buildah-remote task: concurrency problems with source workspace #1275

owtaylor opened this issue Aug 9, 2024 · 23 comments

Comments

@owtaylor
Copy link
Contributor

owtaylor commented Aug 9, 2024

With the buildah-remote task, there are problems if several are run at the same time with the same workspace. For example, I saw:

  • arch1 completes, $(workspaces.source.path)/rhtap-final-image is rsynced back from the arch1 remote
  • arch2 starts, rhtap-final-image is rsynced to the arch2 remote
  • The arch2 image is added to rhtap-final-image
  • arch1 completes, rhtap-final-image is rsynced back from the arch2 remote - with both architectures
  • buildah pull oci:rhtap-final-image fails because the directory now has two images and buildah doesn't know which to use

But there are all sorts of other failures and corruptions that can occur. The workaround people have used for this is to use separate workspaces and duplicate the prefetch-dependencies steps. That works, with the downsides:

  • You have to know to use that workaround and why
  • A more complex, harder to maintain pipeline
  • Right now, cachi2 can't prefetch for only one architecture, making this approach much less efficient.

As far as I can tell, this could be fixed pretty simply by making the buildah/buildah-remote tasks align more closely with the buildah-remote-oci-tasks - use an emptydir workdir for temporary files rather than doing that in the workspace.

@brianwcook
Copy link
Contributor

is this somehow specific to using prefetch? I haven't had any of these concurrency problems in my multi-arch pipelines, but I usually do not have prefetch turned there.

@owtaylor
Copy link
Contributor Author

owtaylor commented Aug 9, 2024

is this somehow specific to using prefetch? I haven't had any of these concurrency problems in my multi-arch pipelines, but I usually do not have prefetch turned there.

My guess is that your multi-arch pipelines use separate workspaces. I only hit this because I wasn't following a well tested example. The downside of using separate workspaces (other than complexity) is prefetch specific, and especially when the prefetch is large.

@lcarva
Copy link
Contributor

lcarva commented Aug 9, 2024

Maybe the fix is to move to using Trusted Artifacts?

@owtaylor
Copy link
Contributor Author

owtaylor commented Aug 9, 2024

Maybe the fix is to move to using Trusted Artifacts?

From discussion with @arewm and @brianwcook yesterday, it sounded like they thought keeping the non-oci-ta variants around was going to be necessary for projects with "very large" artifacts.

Certainly for the flatpak-runtime / flatpak-sdk images I'm working on, I have some doubts about the performance and storage usage of putting the prefetched RPMs (> 10GB for all architectures + source) into a registry artifact, though I haven't tried it yet.

@brianwcook
Copy link
Contributor

Ah, I am pretty much exclusively using trusted artifacts so that makes sense.

@brianwcook
Copy link
Contributor

Maybe the fix is to move to using Trusted Artifacts?

From discussion with @arewm and @brianwcook yesterday, it sounded like they thought keeping the non-oci-ta variants around was going to be necessary for projects with "very large" artifacts.

Certainly for the flatpak-runtime / flatpak-sdk images I'm working on, I have some doubts about the performance and storage usage of putting the prefetched RPMs (> 10GB for all architectures + source) into a registry artifact, though I haven't tried it yet.

so 10GB / 4 Arch = 2.5GB per arch ... I think there is a reasonable chance you will have a better experience using trusteed artifacts instead of PVC workspaces.

@brianwcook
Copy link
Contributor

@arewm will the Matrix feature be able to reduce the complexity of using multiple prefetch tasks & workspaces?

@owtaylor
Copy link
Contributor Author

owtaylor commented Aug 9, 2024

so 10GB / 4 Arch = 2.5GB per arch ... I think there is a reasonable chance you will have a better experience using trusteed artifacts instead of PVC workspaces.

If we can't split the prefetch1 by arch and source then a single PVC workspace is clearly way better than downloading 10GB 4 times to use 2.5G of it. If we can split it, then both the Trusted Artifact and PVC versions benefit from that - the PVC version will still avoid pushing and pulling 10GB to the registry and leaving it there indefinitely.

Anyways, I filed an issue here to publicly record the issue and a possible fix. Could also do something like:

diff --git a/task/buildah-remote/0.2/buildah-remote.yaml b/task/buildah-remote/0.2/buildah-remote.yaml
index 2c5ae78..7c793c5 100644
--- a/task/buildah-remote/0.2/buildah-remote.yaml
+++ b/task/buildah-remote/0.2/buildah-remote.yaml
@@ -447,6 +447,10 @@ spec:
        -v "$BUILD_DIR/tekton-results/:/tekton/results:Z" \
        -v $BUILD_DIR/scripts:/script:Z \
       --user=0  --rm  "$BUILDER_IMAGE" /script/script-build.sh
+      if [ -d rhtap-final-image ] ; then
+          echo "rhtap-final-image already exists, multiple buildah-remote tasks must use separate workspaces"
+          exit 1
+      fi
       rsync -ra "$SSH_HOST:$BUILD_DIR/workspaces/source/" "$(workspaces.source.path)/"
       rsync -ra "$SSH_HOST:$BUILD_DIR/volumes/shared/" /shared/
       rsync -ra "$SSH_HOST:$BUILD_DIR/tekton-results/" "/tekton/results/"

Footnotes

  1. Splitting a RPM prefetch should be doable without any cachi2 changes, though clumsy, by using separate rpms.lock.yaml in separate subdirectories of the source. Haven't tested that yet.

@arewm
Copy link
Member

arewm commented Aug 10, 2024

Matrix builds are defined by setting parameters for task runs so I don't think that it would support associating specific workspaces for each of the runs. The challenge here is that the workspace exists outside of just the taskrun as you need to specify each of the shared workspaces for the entire pipeline run.

It seems like it could be reasonable to use a matrix for prefetching as well as builds using TA. If the OCI trusted artifact is pushed with some architecture specification in the prefetch then it should be possible for the consumer to identify the proper arch match for its needs.

@arewm
Copy link
Member

arewm commented Aug 10, 2024

I have heard that it was an intentional decision to have a single prefetch task in the pipeline and not farm it out to multiple arch-specific ones. I am not familiar with the specifics of that decision, however.

@brunoapimentel, are you aware or do you know who would be able to shed light on the current implementation?

@brianwcook
Copy link
Contributor

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

brianwcook/cachi2@707a7d0

@chmeliik
Copy link
Contributor

chmeliik commented Aug 12, 2024

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

brianwcook/cachi2@707a7d0

Probably just source containers (they won't break, but they'll potentially be missing sources)

And for everything except RPMs, running prefetch more than once is a waste

@owtaylor
Copy link
Contributor Author

I don't know why it is the way it is either, but I am really curious to see how much stuff breaks if we did do per-arch prefetch so I added it to a fork of cachi2

brianwcook/cachi2@707a7d0

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting @arch@ with $arch".

Probably just source containers (they won't break, but they'll potentially be missing sources)

I think you'd do the source prefetch separately - either by having cachi2 merge all arches[].source or by changing the rpms.lock.yaml format so that source packages were listed separately from the architectures.

And for everything except RPMs, running prefetch more than once is a waste

There are probably sticky cases where you want to download a lot of RPMs, and a lot of something else, but the worst case for RPM prefetch is something like a base image where you have gigabytes of RPMs and nothing else. I'll collect stats for size and performance when I finish Flatpak runtime and SDK and post that somewhere - may give some idea of when split prefetch might be useful, if at all.

@brianwcook
Copy link
Contributor

brianwcook commented Aug 12, 2024 via email

@owtaylor
Copy link
Contributor Author

Here's some timings for the two Flatpak images we produce the runtime (common dependencies) and SDK (extended runtime with compilers, headers, etc.)

  Size (MiB) Prefetch (s) Create/Upload (s) Restore (s) rsync (s)
flatpak-runtime - amd64 295 40 35 9 4
flatpak-runtime - arm64 271 40 30 7 2
flatpak-runtime - ppc64le 281 44 31 7 6
flatpak-runtime - s390x 285 33 24 9 6
flatpak-runtime - source 2719 130 257 42 ---
 Total: 3851 287 377 74 18
           
flatpak-sdk - amd64 572 545 61 18 4
flatpak-sdk - arm64 536 528 50 20 3
flatpak-sdk - ppc64le 546 528 52 25 11
flatpak-sdk - s390x 544 530 48 13 11
flatpak-sdk - source 3552 888 336 159 ---
Total:  5750 3019 547 235 29

The timings are a bit all over the place - especially for the prefetch step. (The network connection between the Konflux cluster and the internal repositories prefetch is pulling from seems to be the bottleneck.) But it gives a a general idea.

My observations:

  • For these images, splitting the prefetch is quite useful, because prefetch and creating the artifacts is slow. Restoring the RPMs from the registry on the other hand is pretty fast - it took almost 3 minute for the SDK sources, but that was a bit of an outlier.
  • For these images, duplicated prefetch like is done by other RHEL containers wouldn't work at all.
  • For images with less RPM content, splitting the prefetch doesn't matter as much - wouldn't be bad to save a few minutes per build, though.

@owtaylor
Copy link
Contributor Author

Cool. I think it might make more sense to do it as a command line option, so the input definition can be shared among tasks. When I prototyped implementing split prefetch using separate rpms.lock.in in subdirs, it was quite awkward to use a different input per prefetch - I could no longer just define the prefetch input for the entire pipeline - there's no way I know to say "Use params.prefetch-input subsituting https://github.com/arch@ with $arch".
I did add it as a command line option (cachi2 taks a json payload on the cli).

Well, OK :-) , but what I mean is a command line option that's separate from the input specification, since the input specification is something that the component should be able to control without modifying individual pipeline taskruns.

@arewm
Copy link
Member

arewm commented Aug 13, 2024

I'm not sure if it makes more sense to have 4 prefetch tasks, or to have 1
prefetch task that produces a cache which is well separated so that it can
be transferred separately (maybe this wouldn't break source containers, but
it is definitely much more complicated to implement).

There are two goals that are at odds with each other:

  1. Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)
  2. Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

I haven't looked into how the directory is structured when prefetching content, but a potential high-level overview of the required changes is:

  • Define a parameter for prefetch which would separate out the platform-specific dependencies into their own subdirectories
  • Teach the TA pushing to create a composite artifact (i.e. an image manifest with one layer for each architecture) upon request or based on the directory structure in the file system. If there is no arch-specific content, a single layer would be pushed.
  • Teach the TA pulling to be able to inspect the image manifest and only pull a specific layer if all content isn't needed. If there isn't any arch specific content, the single layer would be pulled.
  • Enable the buildah task to modify the TA pull to provide an architecture (i.e. the last slug of the PLATFORM), pulling only that content (we will need to ensure that each of the configured platforms support this mode of operation)
  • Tasks like the source build wouldn't know/care about the arch-specific content, so they would just pull all layers. These tasks would, however, need to be able to handle the modified directory structure (if present).
    • NOTE: An alternative to this could just be to generate the source containers at the same time as the prefetch. The challenge here is that we wouldn't be able to use the referrer's API directly because the subject doesn't exist. There might need to be a process in the build task (or elsewhere in the pipeline) that attaches artifacts which are unattached.

@mmorhun @zregvart , fyi as well to bring your attention to this discussion.

@zregvart
Copy link
Member

  1. Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)

    1. Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

Not sure why we could not generate a prefetch trusted artifact per-platform? A single source and platform specific trusted artifact can be used in per-platform build step, and restoring the single source and multiple per-platform trusted artifacts when building the source container. This all relies on destinations for per-platform prefetch trusted artifacts not overwriting each other, i.e. being created and restored to platform-specific directories.

@chmeliik
Copy link
Contributor

chmeliik commented Aug 14, 2024

  1. Reducing the amount of content that is pulled by the build container if prefetches are arch-specific (i.e. multiple artifacts)

    1. Ensuring that all prefetched content is present for generating the source container (i.e. a single artifact)

Not sure why we could not generate a prefetch trusted artifact per-platform? A single source and platform specific trusted artifact can be used in per-platform build step, and restoring the single source and multiple per-platform trusted artifacts when building the source container. This all relies on destinations for per-platform prefetch trusted artifacts not overwriting each other, i.e. being created and restored to platform-specific directories.

That does sound like a better idea. But if it also needs to work for the non-TA pipeline, it complicates things.

Most multiarch pipelines I've seen use one workspace per arch and run one prefetch per arch (I don't fully understand why, but I think it was due to the volumes being ReadWriteOnce - not possible to run builds in parallel with a single workspace). Which would mean the source-container task needs to mount a variable number of workspaces.

In practice, the way to achieve this would be to declare a hardcoded number of workspaces in the source task (let's say 10) and Konflux users would have to pass their per-arch workspaces explicitly, something like this

- name: workspace-0
  workspace: workspace-x86_64
- name: workspace-1
  workspace: workspace-aarch64
- name: workspace-2
  workspace: workspace-ppc64le
- name: workspace-3
  workspace: workspace-s390x

Side note: the only reason why this is currently not necessary is that it's almost impossible to make the prefetch task behave differently per arch, except for Owen's workaround:

Splitting a RPM prefetch should be doable without any cachi2 changes, though clumsy, by using separate rpms.lock.yaml in separate subdirectories of the source.

@arewm
Copy link
Member

arewm commented Aug 14, 2024

Most of the multi-arch pipelines have a single PVC per workspace to prevent the current issue ... preventing the unnecessary copying/syncing of data to remote workers for the other architectures.

In my proposed pipeline for multi-arch with matrix (and OCI trusted artifacts), none of these workspaces exist: #1236.

While the requested feature to split the prefetch can be done when the data is stored in PVC, there is a more pressing need for it just in the OCI trusted artifacts because of the additional data transfer to/from the registry (this cost is not realized with the PVC except for when we rsync data to the remote host).

@chmeliik
Copy link
Contributor

I kind of lost track of what we're discussing in this issue

@lkolacek
Copy link

lkolacek commented Nov 4, 2024

I also lost track of the issue we're discussing here and the discussion itself seems to be quiet for a long time. Is it possible that the issue is not relevant anymore with moving to Trusted artifacts pipeline? I'll close it in a week if there won't be any response.

@arewm
Copy link
Member

arewm commented Nov 4, 2024

I think that the original issue that was described (i.e. not having uniqueness in the images when synced back to a shared PVC) might have been incidentally/unintentionally resolved with #1216.

We now push and pull the image which automatically has the platform appended to it.

The resulting conversation is unresolved ... i.e. how should we properly handle/design prefetching large payloads for multiple architectures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants