Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpack TF plugins in a more atomic way. #33479

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mplzik
Copy link

@mplzik mplzik commented Jul 6, 2023

The original implementation of plugin cache handling unpacks plugin
archives in a way that can result in race conditions when multiple
terraform processes are accessing the same plugin cache. Since the
archives are being decompressed in chunks and output files are written
directly to the cache, we observed following manifestations of the issue:

  • text file busy errors if other terraform processes are already
    running the plugin and another process tries to replace it.
  • various plugin checksum errors triggered likely by simultaneous checksumming
    and rewriting the file.

This PR changes the zip archives with plugins are handled:

  1. Instead of writing directly to the target directory,
    installFromLocalArchive is now writing to a temporary staging
    directory prefixed with.temp that resides in the plugin cache dir
    to ensure this is on the same filesystem.
  2. After unpacking, the directory structure of the staging directory is
    replicated in the targetDir. The directories are created as needed
    and any files are moved using os.Rename. After this, the staging
    directory is removed.
  3. Since the temporary staging directories can be picked up by
    SearchLocalDirectory and make it fail in the situation when they're
    removed during directory traversal, the function has been modified to
    skip any entry that starts with dot.

Signed-off-by: Milan Plzik [email protected]

Fixes #31964

Target Release

1.5.x

Draft CHANGELOG entry

BUG FIXES

  • Concurrent write access to plugin cache directory has been fixed.

@hashicorp-cla
Copy link

hashicorp-cla commented Jul 6, 2023

CLA assistant check
All committers have signed the CLA.

@mplzik mplzik force-pushed the mplzik/tf-more-atomic-plugin-cache branch from 4c546ba to 6f7e411 Compare July 6, 2023 10:41
@mplzik mplzik marked this pull request as draft July 6, 2023 12:09
The original implementation of plugin cache handling unpacks plugin
archives in a way that can result in race conditions when multiple
terraform processes are accessing the same plugin cache. Since the
archives are being decompressed in chunks and output files are written
directly to the cache, we observed following manifestations of the issue:

- `text file busy` errors if other terraform processes are already
  running the plugin and another process tries to replace it.
- various plugin checksum errors triggered likely by simultaneous checksumming
  and rewriting the file.

This PR changes the zip archives with plugins are handled:

1. Instead of writing directly to the target directory,
   `installFromLocalArchive` is now writing to a temporary staging
   directory prefixed with`.temp` that resides in the plugin cache dir
   to ensure this is on the same filesystem.
2. After unpacking, the directory structure of the staging directory is
   replicated in the `targetDir`. The directories are created as needed
   and any files are moved using `os.Rename`. After this, the staging
   directory is removed.
3. Since the temporary staging directories can be picked up by
   `SearchLocalDirectory` and make it fail in the situation when they're
   removed during directory traversal, the function has been modified to
   skip any entry that starts with dot.

Signed-off-by: Milan Plzik <[email protected]>
@mplzik mplzik force-pushed the mplzik/tf-more-atomic-plugin-cache branch from 6f7e411 to dabf85a Compare July 6, 2023 13:25
@mplzik mplzik marked this pull request as ready for review July 6, 2023 14:25
@crw crw added the enhancement label Jul 6, 2023
@crw
Copy link
Collaborator

crw commented Jul 6, 2023

Thanks for this submission, I will raise it in triage.

@brandon-fryslie
Copy link

Oh please this would make me so happy, thanks

@crw
Copy link
Collaborator

crw commented Jul 31, 2023

Thanks for the reminder, and thanks @mplzik again for the submission.

The review from triage is that this change would require more due diligence. One issue with this implementation is that it relies on os.rename being atomic, which it is not. This may make the concurrency issue worse on different platforms.

We also noticed that the documentation should call out more clearly that the plugin cache is not concurrency-safe.

I'll leave this PR open for discussion for the moment, but will likely close it after a time as the feeling was it would be better to redesign a concurrency-safe plugin cache, which would require more discussion and prioritization.

Thanks again for this submission!

@crw crw added the waiting-response An issue/pull request is waiting for a response from the community label Jul 31, 2023
@mplzik
Copy link
Author

mplzik commented Aug 4, 2023

@crw I definitely agree that this is not solution to all the problems -- hence my comment // On supported platforms, this should perform atomic replacement of the file., with supported meaning platforms where this is supported for os.Rename. Speaking of this, would hiding this change behind a (default off) flag would make this PR more acceptable? I understand that implementing a really atomic cache would be preferred, but likely also entails much bigger engineering time investment (and it hasn't been implemented for a longer time, suggesting it might not be trivial).

Let me also do a bit of an argument in favor of merging this PR from the risk assessment side.

The golang documentation is not going too far on describing os.Rename non-atomicity, but https://stackoverflow.com/questions/30385225/is-there-an-os-independent-way-to-atomically-overwrite-a-file goes a bit farther in describing the platform-specific behavior. Specifically (assuming files on the same volume):

I see two main differences:

  • The renaming approach uses more disk space temporarily, so some real-world setups that are on almost full disks might be tipped to exceeding available disk space. This might, however, happen also with increasing size of the providers.
  • In the windows case, the file might be deleted (and thus stop existing) for a short period of time if atomic rename is not supported. I believe this will trigger mostly the same issues that are manifesting now (although I can't provide more data to prove that -- our infrastructure is not using any Windows machines to run Terraform).

@joaocc
Copy link
Contributor

joaocc commented Aug 24, 2023

Hi. Just some 0.02€ from userland. We (and many others using terraform directly or via wrappers - in my case terragrunt) are having a terrible and very painfull time due to the concurrency issues, stopping us from moving ahead, and causing uncertainty in processes which should be quite trivial.
I would tend to agree with @mplzik in that, even if this is not 100% bulletprof, the fact that it is supposed to work much better on linux and (from what I understood) on Windows, means that a huge number of users should have their experience improved, and get back to using terraform in a productive and streamlined manner.
If in the future there is an "even better way" (TM) to improve the cache in more complex scenarios (like network file systems), then it would be also very welcomed, of course. But in the meantime, the benefits would already be rolled out.
Thanks

@crw crw removed the waiting-response An issue/pull request is waiting for a response from the community label Aug 24, 2023
@brandon-fryslie
Copy link

I'm failing to understand the rationale behind blocking a small, simple change that would fix the problem for the vast majority of users in favor of holding out for an unknown, unplanned bulletproof solution that (likely) no one is working on. I'm finding it difficult to believe this could have been fixed for months already and yet it just sits here getting stale.

You can always revert the PR when you implement the really badass perfect plugin cache that you're now committed to implementing. And if you're not committed to it, why block this? Can we at least get a timeline of when you expect to release the real fix?

@crw
Copy link
Collaborator

crw commented Dec 12, 2023

First, apologies to @mplzik for the late response on this, the team appreciates your thoughtful response on this PR. The team is still concerned about edge cases with this solution. Speaking based on my own observations of how the maintainer team considers PRs, I think this is a difficult section of code to change from outside due to plugin cache management being in the critical path for every user.

@brandon-fryslie, to answer your question, the rationale is that the primary path to run Terraform is in sequence, not in parallel. The plugin cache is not concurrency safe. You are correct to assume no one is working on a more robust solution, as making the plugin cache safe for concurrent execution of Terraform is not currently a priority for the team. I'll re-raise this PR with our product manager to see if concurrency-safety for the plugin cache is something we can prioritize. Given that Terraform, as a whole, is not meant to be run in parallel, this likely has considerations beyond the plugin cache. However, it is always possible we will need to prioritize this as we add new features going forward.

Thanks for your feedback and continued interest in this feature.

@expnch
Copy link

expnch commented Dec 12, 2023

the primary path to run Terraform is in sequence, not in parallel.

Just to chime in - running Terraform in parallel (for different projects that use the same modules) is a huge part of our CI/CD workflow.

return authResult, fmt.Errorf("failed to create new directory: %w", err)
}

stagingDir, err := os.MkdirTemp(path.Dir(targetDir), ".temp")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we could get rid of the need to refactor the SearchLocalDirectory function if we wrote the files into a directory created by os.TempDir() instead of inside our cache? Any reason not to do that?

@crw crw added the waiting-response An issue/pull request is waiting for a response from the community label Jul 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement waiting-response An issue/pull request is waiting for a response from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow multiple Terraform instances to write to plugin_cache_dir concurrently
8 participants