Unpack TF plugins in a more atomic way. #33479
The original implementation of plugin cache handling unpacks plugin archives in a way that can result in race conditions when multiple terraform processes are accessing the same plugin cache. Since the archives are being decompressed in chunks and output files are written directly to the cache, we observed the following manifestations of the issue:

- `text file busy` errors if other terraform processes are already running the plugin and another process tries to replace it.
- various plugin checksum errors, likely triggered by simultaneous checksumming and rewriting of the file.

This PR changes how zip archives with plugins are handled:

1. Instead of writing directly to the target directory, `installFromLocalArchive` now writes to a temporary staging directory prefixed with `.temp` that resides in the plugin cache dir, to ensure it is on the same filesystem.
2. After unpacking, the directory structure of the staging directory is replicated in the `targetDir`. The directories are created as needed and any files are moved using `os.Rename`. After this, the staging directory is removed.
3. Since the temporary staging directories can be picked up by `SearchLocalDirectory` and make it fail when they're removed during directory traversal, the function has been modified to skip any entry that starts with a dot.

Signed-off-by: Milan Plzik <[email protected]>
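The staging-and-rename flow from steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `installViaStaging` and `demoInstall` are hypothetical names, and an in-memory map of relative paths to contents stands in for the decompressed zip archive.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// installViaStaging is a hypothetical helper (not the PR's code). The files
// map (relative path -> contents) is an assumed stand-in for the archive.
func installViaStaging(files map[string][]byte, targetDir string) error {
	parent := filepath.Dir(targetDir)
	if err := os.MkdirAll(parent, 0o755); err != nil {
		return err
	}
	// The staging directory is a sibling of targetDir, so the later
	// os.Rename calls stay on one filesystem.
	stagingDir, err := os.MkdirTemp(parent, ".temp")
	if err != nil {
		return err
	}
	defer os.RemoveAll(stagingDir) // remove staging on success and on error

	// Step 1: write every file into the staging directory only.
	for rel, data := range files {
		dst := filepath.Join(stagingDir, rel)
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		if err := os.WriteFile(dst, data, 0o755); err != nil {
			return err
		}
	}

	// Step 2: replicate the staging tree in targetDir, creating
	// directories as needed and moving each file with os.Rename.
	return filepath.WalkDir(stagingDir, func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		rel, err := filepath.Rel(stagingDir, p)
		if err != nil {
			return err
		}
		dst := filepath.Join(targetDir, rel)
		if d.IsDir() {
			return os.MkdirAll(dst, 0o755)
		}
		return os.Rename(p, dst)
	})
}

// demoInstall installs one fake plugin file and reads it back.
func demoInstall() (string, error) {
	cacheDir, err := os.MkdirTemp("", "plugin-cache")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(cacheDir)

	targetDir := filepath.Join(cacheDir, "example-provider")
	files := map[string][]byte{"bin/terraform-provider-example": []byte("fake plugin")}
	if err := installViaStaging(files, targetDir); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(targetDir, "bin", "terraform-provider-example"))
	return string(data), err
}

func main() {
	content, err := demoInstall()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(content)
}
```

In this sketch each file appears in the target directory through a single rename rather than through chunked writes, but the set of files as a whole is still not published atomically, which matches the "more atomic" wording of the title.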
Thanks for this submission, I will raise it in triage.
Oh please this would make me so happy, thanks
Thanks for the reminder, and thanks @mplzik again for the submission. The review from triage is that this change would require more due diligence. One issue with this implementation is that it relies on We also noticed that the documentation should call out more clearly that the plugin cache is not concurrency-safe. I'll leave this PR open for discussion for the moment, but will likely close it after a time as the feeling was it would be better to redesign a concurrency-safe plugin cache, which would require more discussion and prioritization. Thanks again for this submission!
@crw I definitely agree that this is not a solution to all the problems -- hence my comment Let me also make a bit of an argument in favor of merging this PR from the risk assessment side. The golang documentation is not going too far on describing
I see two main differences:
Hi. Just some 0.02€ from userland. We (and many others using terraform directly or via wrappers - in my case terragrunt) are having a terrible and very painful time due to the concurrency issues, which are stopping us from moving ahead and causing uncertainty in processes that should be quite trivial.
I'm failing to understand the rationale behind blocking a small, simple change that would fix the problem for the vast majority of users in favor of holding out for an unknown, unplanned bulletproof solution that (likely) no one is working on. I'm finding it difficult to believe this could have been fixed for months already and yet it just sits here getting stale. You can always revert the PR when you implement the really badass perfect plugin cache that you're now committed to implementing. And if you're not committed to it, why block this? Can we at least get a timeline of when you expect to release the real fix?
First, apologies to @mplzik for the late response on this, the team appreciates your thoughtful response on this PR. The team is still concerned about edge cases with this solution. Speaking based on my own observations of how the maintainer team considers PRs, I think this is a difficult section of code to change from outside due to plugin cache management being in the critical path for every user. @brandon-fryslie, to answer your question, the rationale is that the primary path to run Terraform is in sequence, not in parallel. The plugin cache is not concurrency safe. You are correct to assume no one is working on a more robust solution, as making the plugin cache safe for concurrent execution of Terraform is not currently a priority for the team. I'll re-raise this PR with our product manager to see if concurrency-safety for the plugin cache is something we can prioritize. Given that Terraform, as a whole, is not meant to be run in parallel, this likely has considerations beyond the plugin cache. However, it is always possible we will need to prioritize this as we add new features going forward. Thanks for your feedback and continued interest in this feature.
Just to chime in - running Terraform in parallel (for different projects that use the same modules) is a huge part of our CI/CD workflow.
		return authResult, fmt.Errorf("failed to create new directory: %w", err)
	}

	stagingDir, err := os.MkdirTemp(path.Dir(targetDir), ".temp")
Presumably we could get rid of the need to refactor the `SearchLocalDirectory` function if we wrote the files into a directory created by `os.TempDir()` instead of inside our cache? Any reason not to do that?
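For context, the PR description characterizes the `SearchLocalDirectory` change as skipping any entry that starts with a dot. A minimal sketch of such a filter is shown below, assuming a flat `os.ReadDir` scan; the real function's traversal is not shown in this thread, and `visibleEntries` and `demoScan` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// visibleEntries is a hypothetical stand-in for one level of a cache scan.
// It returns the names in dir, skipping any entry that starts with a dot,
// so that ".temp*" staging directories are never visited.
func visibleEntries(dir string) ([]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if strings.HasPrefix(e.Name(), ".") {
			continue // staging directory or other hidden entry
		}
		names = append(names, e.Name())
	}
	return names, nil
}

// demoScan builds a fake cache containing one regular directory and one
// staging directory, then scans it.
func demoScan() ([]string, error) {
	cacheDir, err := os.MkdirTemp("", "plugin-cache")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(cacheDir)

	for _, name := range []string{"registry.terraform.io", ".temp123456"} {
		if err := os.Mkdir(filepath.Join(cacheDir, name), 0o755); err != nil {
			return nil, err
		}
	}
	return visibleEntries(cacheDir)
}

func main() {
	names, err := demoScan()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(names)
}
```

On the trade-off raised in this comment: the PR description places the staging directory inside the plugin cache dir to ensure it is on the same filesystem as the target, whereas a directory under `os.TempDir()` may live on a different filesystem, in which case `os.Rename` can fail.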
Fixes #31964
Target Release
1.5.x
Draft CHANGELOG entry
BUG FIXES