Investigate how we ended up with a test extension binary in the CAPD image #9752
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I was running the provided script against the following tags:
v1.6.0-x: [v1.6.0-rc.0, v1.6.0-beta.1, v1.6.0-beta.0]
v1.5.x: [v1.5.3, v1.5.2, v1.5.1]
v1.4.x: [v1.4.8, v1.4.7, v1.4.6]
Here are the results of the script runs:
Tags with failure:
v1.6.0-beta.1
v1.5.3
Tags with success:
v1.6.0-rc.0
v1.6.0-beta.0
v1.5.2
v1.5.1
v1.4.8 (ignore the error; capim was not introduced back then, but the script still checks for it)
v1.4.7 (ignore the error; capim was not introduced back then, but the script still checks for it)
v1.4.6 (ignore the error; capim was not introduced back then, but the script still checks for it)
cc @chrischdi |
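For context, here is a minimal sketch of the kind of check such a verification script can perform: export the filesystem of a CAPD image for a given tag and look for an unexpected test-extension binary. The registry/image name, the grep pattern, and the overall structure are assumptions for illustration; the actual script is the snippet shared in the Slack thread referenced further down.

```bash
#!/usr/bin/env bash
# Sketch only: check whether a CAPD image contains an unexpected binary.
# IMAGE, TAG, and the "test-extension" pattern are illustrative assumptions.
set -euo pipefail

TAG="${1:-v1.6.0-rc.0}"
IMAGE="gcr.io/k8s-staging-cluster-api/capd-manager:${TAG}"

docker pull "${IMAGE}"

# Create (but don't run) a container so we can export its filesystem.
cid="$(docker create "${IMAGE}")"
trap 'docker rm -f "${cid}" >/dev/null' EXIT

# List the image filesystem and flag anything that looks like the test extension.
if docker export "${cid}" | tar -t | grep -i 'test-extension'; then
  echo "FAILURE: ${IMAGE} contains a test-extension binary"
  exit 1
else
  echo "SUCCESS: ${IMAGE} looks clean"
fi
```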
Thank you very much!!! |
I would say let's keep an eye on it for upcoming releases over the next 1-2 weeks and then we can close this issue. It would be interesting to know what exactly was broken (assuming it's solved now), but that's mostly interesting if we want to improve the performance again in the same way. |
Further testing:
Tags with failure:
v20231122-v1.5.3-28-g8653d4de4 - image tag on the v1.5 branch before the revert PR
Tags with success:
v20231122-v1.5.3-34-gfd9f33369 - image tag on the v1.5 branch after the revert PR
This pretty much signals that we see successful image builds right after the revert, and it is highly likely that reverting solved this issue. The only thing I am still unsure about is why we have not seen it in the v1.4.8 tag image, since the culprit change was also backported to the release-1.4 branch and included in the v1.4.8 release 🤔 |
Small update: |
Further testing 2 (tests after the revert PR in all branches):
Tags with failure:
None
Tags with success:
v20231128-v1.5.3-36-g1eb9be941
v20231203-v1.6.0-rc.0-42-g6cba8c212
v20231130-v1.6.0-rc.0-39-g0240b58a0
v20231128-v1.6.0-rc.0-31-g0970c323d
v20231129-v1.6.0-rc.0-35-g6b516980b
v20231129-v1.6.0-rc.0-37-g0cd7db232
|
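As an aside, these image tags embed git describe output (vYYYYMMDD-<nearest tag>-<commits since tag>-g<short sha>), so each tag can be traced back to the commit it was built from. A small sketch of doing that, assuming a checkout with the relevant remote (the remote name and example tag are placeholders):

```bash
# Sketch: map a staging image tag back to the commit it was built from.
# The tag layout (vYYYYMMDD-$(git describe --tags)) is an assumption based on
# the tags listed above; adjust the remote name to your checkout.

TAG="v20231122-v1.5.3-28-g8653d4de4"

# Strip the date prefix to recover the git describe output, then take the
# trailing g<sha> component and drop the leading "g".
SHA="$(echo "${TAG}" | sed -E 's/^v[0-9]{8}-//' | grep -oE 'g[0-9a-f]+$' | cut -c2-)"

# Inspect the commit the image was built from.
git fetch upstream
git log -1 "${SHA}"
```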
@sbueringer 👋 this seems like a good candidate for the release team to pick back up, do you mind if I add this issue to the improvement tasks board? |
took another look here -- i think we may have run into this bug: moby/buildkit#1368
(affected tags: 1.5.3, 1.6.0-beta.1)
|
That would be good! |
/assign |
@sbueringer 👋 i'm pretty confident i found the issue #9752 (comment). pushed up some resources that will hopefully help get additional eyes/validation for my explanation: cahillsf#2. once this explanation has been reviewed/confirmed, i'm thinking we should implement one of the workarounds suggested in the issue, because if i'm right, it was really just luck that we didn't hit this bug even with the serial build. we are working on this issue (#9758) to add a test in CI/CD, and after that is implemented along with the workaround, maybe we can explore reintroducing the parallelization to speed up the job. wdyt? cc @kubernetes-sigs/cluster-api-release-team |
@cahillsf Thx for looking into this! I agree with your analysis and I think your proposed resolution makes sense. I was wondering if it would make sense to try to bump the Docker version used in the build environment. So I guess, as we can't really enforce a minimum Docker version, we have to implement one of the workarounds. |
I believe the Docker version (client) running inside the container is pretty current. |
we could explore attempting to run the daemon inside the container and forcing the build to use that server, but from what I've read so far it seems like this might introduce a whole slew of other issues: https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/ i imagine our limited access to the host machine in a "serverless" build service like Cloud Build would also add some complexity |
Ah interesting. I guessed they might be using docker-in-docker like in the kubekins image that we use for our regular tests in Prow. In case we're not sure, maybe we should ask some test-infra folks how it works. But in any case, I think even if we fix it in this image / environment, we should implement something for the older Docker version anyway. |
not getting much traction in the #sig-testing channel, will bump the thread in a few days if no one follows up. have also opened a ticket with GCP support to see if they can shed some light on the expected behavior here |
Sounds good. I think in any case we'll need to implement a workaround for other environments. |
just bumped the message in #sig-testing, though the information I got from the GCP support team seems to confirm we are using their Docker server to run the build:
and they did acknowledge the discrepancy in the docs (docs say that CloudBuild runs |
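One way to double-check which daemon a build step is actually talking to is to print the client and server versions from inside the build container; a minimal sketch (assuming the docker CLI is available in the step):

```bash
# Sketch: print the Docker client and server versions from inside the build
# step, to confirm which daemon (and version) the build is actually using.
docker version --format 'client: {{.Client.Version}}  server: {{.Server.Version}}'

# More detail about the daemon (storage driver, builder, etc.):
docker info
```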
We just found out today that the images we publish post-submit on the main branch have the test extension binary in the CAPD image. It's unclear at this point if other providers and architectures are affected.
Our theory is that this was introduced when adding parallelism to our cloudbuild job (i.e. running make release-staging with -j 8 -O) (xref: #9536). We are trying to fix this by dropping the cd's in our make targets: #9744.
But it's unclear at this point if this will fix the issue. Christian provided a snippet to verify if all our images contain the right binaries: https://kubernetes.slack.com/archives/C8TSNPY4T/p1700585328755669?thread_ts=1700574862.617579&cid=C8TSNPY4T
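For illustration, a hedged sketch of what "dropping the cd's" can look like, written as the shell commands a make recipe would run; the paths, image name, and exact form of the real targets in #9744 may differ:

```bash
# Sketch only (assumed paths and image name, not the actual make targets).

# Before: the recipe changes into the CAPD directory and builds with "." as
# the context, so the build depends on the shell's working directory.
cd test/infrastructure/docker && docker build -t capd-manager:dev .

# After: the recipe stays in the repo root and passes the Dockerfile and
# build context explicitly, so no cd is needed in the recipe at all.
docker build \
  -f test/infrastructure/docker/Dockerfile \
  -t capd-manager:dev \
  test/infrastructure/docker
```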
We should do at least the following: