Container image builds now randomly fail when triggered through CloudFormation (due to race condition in Java SDK?) #4
Comments
I have uncovered further evidence corroborating that this bug originates from a race condition of the type described in the aforementioned Go SDK issue. Namely, if one replaces the component in the above CloudFormation/CDK scripts with the following:
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: bash-script
        action: ExecuteBash
        inputs:
          commands:
            - range=($(seq 0 1 100))
            - for i in ${range[@]}; do aws sts get-caller-identity &> /dev/null; done
then the failure rate hits 100%. The fact that this component does nothing but make roughly 100 SDK calls strongly suggests to me that this is indeed a randomly occurring bug (arising from a race condition) which is leading to this unfortunate behavior.
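The jump from occasional failures to a 100% failure rate is what one would expect if each call has a small, independent chance of hitting the race. The sketch below illustrates that arithmetic only; the independence assumption, the function names, and the example numbers (a 30% build failure rate and 5 calls in an ordinary build) are hypothetical and not measured from Image Builder or TOE.

```python
def build_failure_probability(p_call: float, n_calls: int) -> float:
    """Probability that at least one of n_calls independent calls fails.

    Assumption: every call fails independently with probability p_call,
    and a single failed call fails the whole build.
    """
    if not 0.0 <= p_call <= 1.0:
        raise ValueError("p_call must be between 0 and 1")
    if n_calls < 0:
        raise ValueError("n_calls must be non-negative")
    return 1.0 - (1.0 - p_call) ** n_calls


def per_call_failure_probability(p_build: float, n_calls: int) -> float:
    """Invert the formula above: per-call rate implied by a build failure rate."""
    if not 0.0 <= p_build <= 1.0:
        raise ValueError("p_build must be between 0 and 1")
    if n_calls <= 0:
        raise ValueError("n_calls must be positive")
    return 1.0 - (1.0 - p_build) ** (1.0 / n_calls)


if __name__ == "__main__":
    # Hypothetical numbers: 3 of 10 ordinary builds fail, and an ordinary
    # build is assumed to make 5 calls that can hit the race.
    p = per_call_failure_probability(0.3, 5)
    # The component above adds 101 more calls (seq 0 1 100 yields 101 values).
    print(f"implied per-call failure rate: {p:.4f}")
    print(f"failure rate with 101 extra calls: {build_failure_probability(p, 5 + 101):.4f}")
```

Under these assumptions, even a per-call failure rate of a few percent pushes the build failure rate very close to 1 once about a hundred extra calls are added, which is consistent with the behavior reported here.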
Hi @OrEisenberg, both error logs you posted are from the TOE binary; that's something I can confirm. This makes me feel this may still have something to do with the Go SDK V2 issue: https://githubmemory.com/repo/aws/aws-sdk-go-v2/issues/1253
@ytsssun I do agree that these errors are being raised from the TOE binary. Is TOE written in Go? Whether this is an issue on the TOE end or the CloudFormation end is not entirely clear to me -- all I know is that triggering container builds from the command line works, but that triggering the build of those exact same images from CloudFormation fails nondeterministically with these errors. This suggests to me that one of two possible things is going on:
If I knew Java I might be able to dig into the CloudFormation side of things and help determine which of these two situations is actually occurring, but unfortunately I don't. Obviously, with TOE not being open source, that's also a dead end for me. As before, please do let me know if there's any way I can help resolve this -- my team and I would really love to be able to build container images again!
@OrEisenberg TOE is written in Go, and I can confirm the logs are coming from TOE. We have internally asked the Go SDK team to help with the issue. I will keep you updated with their response/solution and unblock you ASAP. One question I do have: how long have you been using Image Builder to build container images, and when did you start experiencing this issue?
@ytsssun That's fantastic news! I appreciate your responsiveness. We've been using ImageBuilder + CloudFormation to build container images for 4 or 5 months now and started experiencing this problem around 2 weeks ago, give or take a few days.
@OrEisenberg The timeline makes sense to me -- we recently migrated TOE to Go SDK V2 (roughly 2-3 weeks ago), and your experience matches that timeline as well. One update: I did receive a response from the Go SDK team. They acknowledged the issue and I think you have correctly located it:
I will let you know the ETA once I get it from them. We will need to decide whether to implement the workaround ourselves or wait for their update -- whichever rolls out the fix faster will be our choice. Thanks again for pointing out the issue.
@ytsssun Terrific. Thanks again so much for getting on this so quickly!
Hi @OrEisenberg, I wanted to follow up and see whether this issue is resolved for you, since the fix from the Go SDK has now been rolled out to all regions.
@ytsssun My unit test for building container images is now passing! Thanks so much again for all your help on this!
TL;DR: In the last week or so, I have begun encountering a non-deterministic bug when trying to trigger container image builds using CloudFormation. As a result, it seems that it has sadly become impossible to reliably build container images using CloudFormation's EC2 ImageBuilder functionality! I have come to this conclusion by carrying out the procedure described below.
You can find below two CloudFormation scripts which were generated by CDK. I apologize for their length, but I am far from fluent in CloudFormation (you may also find beneath them the two Python CDK scripts which were respectively used to generate them). CloudFormation script number one builds a container recipe and an infrastructure configuration (along with all the resources on which these depend, including an image component which does nothing, a VPC, a subnet, an ECR repository, an IAM role, an instance profile, and an S3 bucket for logging). If, after this stack successfully launches, one manually triggers any number of container image builds from the AWS CLI using the command
one finds that they always build successfully.
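The CLI command itself is not reproduced in this copy of the thread, and the following is not a reconstruction of it. As a rough illustration of triggering several builds by hand, here is a minimal Python sketch that assumes a boto3 `imagebuilder` client and calls its `create_image` operation; the function name and its arguments are hypothetical, and the ARNs must come from the deployed stack.

```python
import uuid
from typing import Any, List


def trigger_container_builds(
    client: Any,
    container_recipe_arn: str,
    infrastructure_configuration_arn: str,
    count: int,
) -> List[str]:
    """Start `count` container image builds and return their build version ARNs.

    `client` is assumed to behave like boto3.client("imagebuilder").
    """
    build_arns = []
    for _ in range(count):
        response = client.create_image(
            containerRecipeArn=container_recipe_arn,
            infrastructureConfigurationArn=infrastructure_configuration_arn,
            # A fresh idempotency token per call, so every call starts a new build.
            clientToken=str(uuid.uuid4()),
        )
        build_arns.append(response["imageBuildVersionArn"])
    return build_arns
```

With boto3 installed and credentials configured, one would pass `boto3.client("imagebuilder")` as `client`, together with the recipe and infrastructure configuration ARNs created by the first stack.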
If, however, one then uploads the second CloudFormation script as a change set to the original stack, the resulting difference simply changes some CDK metadata and launches 10 identical copies of the image, with the same container recipe and infrastructure configuration as those just built from the command line.
The catch is that among those 10 image builds just triggered by this change set, some number will fail (usually between two and four in my experience). Checking the logs which are generated in the S3 logging bucket built in the stack yields one of the following two related error messages for those failed builds:
or
Searching for the former error message on the web, one finds this relevant thread from an issue in the Go SDK. My suspicion is that this codebase is relying on a Java SDK that suffers from a race condition similar to the one described in that thread, but I don't know Java and therefore can neither confirm nor deny this suspicion.
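To put a number on how many of the builds fail without reading the S3 logs one by one, the build states can be tallied programmatically. This is a minimal sketch, assuming the image build version ARNs of the builds are already known and that `client` behaves like a boto3 `imagebuilder` client; the function name is hypothetical.

```python
from collections import Counter
from typing import Any, Iterable


def tally_build_states(client: Any, image_build_version_arns: Iterable[str]) -> Counter:
    """Count builds by reported state (for example AVAILABLE or FAILED).

    `client` is assumed to behave like boto3.client("imagebuilder").
    """
    states: Counter = Counter()
    for arn in image_build_version_arns:
        response = client.get_image(imageBuildVersionArn=arn)
        states[response["image"]["state"]["status"]] += 1
    return states
```

Running this repeatedly after each deployment of the change set would give a rough failure rate, which is more useful for a nondeterministic bug than any single run.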
I hope that this account of the bug is detailed enough for you to reproduce it. If, however, there is any more information I can provide that might help you reproduce or investigate it, please do not hesitate to ask.
CloudFormation Script 1:
CloudFormation Script 2:
CDK Script 1:
CDK Script 2: