Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition between post-workflow hooks and automerge pull cleanup #3336

Open
ilyaluk opened this issue Apr 18, 2023 · 4 comments
Open

Race condition between post-workflow hooks and automerge pull cleanup #3336

ilyaluk opened this issue Apr 18, 2023 · 4 comments
Labels
bug Something isn't working Stale

Comments

@ilyaluk
Copy link
Contributor

ilyaluk commented Apr 18, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

Sometimes stack locks are left intact after atlantis automerging PR.

Reproduction Steps

The conditions for reproducing this race are following:

  • Enable automerge (GitHub provider, but that's probably not that relevant)
  • Enable some post-workflow hooks

The issue can't be reproduced reliably, but basically you need to create a PR with any number of affected stacks, and trigger apply all. Then rarely following happens:

  • Thread 1:
    • Apply command calls autoMerger.automerge (calling GH API, which triggers merge webhook), returns to RunCommentCommand
    • RunCommentCommand calls RunPostHooks
    • RunPostHooks WorkingDir.Clone, which calls some git commands which try to fetch remote and create some files in .git directory
  • Thread 2:
    • On receiving GH webhook event ClosedPullEvent we call PullCleaner.CleanUpPull
    • CleanUpPull calls WorkingDir.Delete, which calls os.RemoveAll(w.repoPullDir(r, p))

As os.RemoveAll traverses over workspace and removes files in second thread, git creates some in first thread, which yields following error:
error during cleanup of pull data%!(EXTRA *errors.withStack=cleaning workspace: unlinkat /home/atlantis/data/repos/.../123/default/.git/logs/refs/remotes: directory not empty)

As CleanUpPull fails after merge, we are left with merged PR and locks still intact, that's not great as we need to remove them manually.

Logs

Logs
automerging pull request
PUT /repos/.../pulls/123/merge
post-hooks configured, running...
got workspace lock
clone directory \"/home/atlantis/data/repos/.../123/default\" already exists, checking if it's at the right commit
repo is at correct commit \"dead...beef\" so will not re-clone
Initiating cleanup of pull data.
getting remote update failed: Fetching origin\nFrom https://github.com/...\n   abcabc..defdef  master     -> origin/master\nFetching head\nfatal: 'head' does not appear to be a git repository\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\nerror: could not fetch head\n
error during cleanup of pull data%!(EXTRA *errors.withStack=cleaning workspace: unlinkat /home/atlantis/data/repos/.../123/default/.git/logs/refs/remotes: directory not empty)
error handling gh post code: 403 err: cleaning workspace: unlinkat /home/atlantis/data/repos/.../123/default/.git/logs/refs/remotes: directory not empty
error: exit status 1: running \"/usr/local/bin/some-post-workflow-hook\" in \"/home/atlantis/data/repos/.../123/default\": \nSome error about missing atlantis.yaml after workspace delete\n
Error running post-workflow hooks exit status 1: running \"/usr/local/bin/some-post-workflow-hook\" in \"/home/atlantis/data/repos/.../123/default\": \nSome error about missing atlantis.yaml after workspace delete\n.

Environment details

  • Atlantis version: v0.23.3 (commit: 3214b8c) (build date: 2023-03-20T23:29:09.859Z)
  • Deployment method: docker on ec2
  • If not running the latest Atlantis version have you tried to reproduce this issue on the latest version: No, but I checked commits for last two releases and there is nothing that might fix this behaviour.

Flags and configs are not provided, as we have quite complex config and simplifying it for this issue would take too long. I think there is enough information in the description to identify the issue. If not, I'll create a small configuration for reproduction.

Additional Context

I think that some of the possible solutions are:

  • Don't run post-workflow hooks if PR was automerged in command execution
  • Merge only after running post-workflow hooks. That might be tricky, as automerge is part of apply command, and post-workflow hooks are running in any case.
  • Somehow take lock on workspace between commands, i.e. take lock while apply command is running, and wait on that lock in thread which processes merge events, this would eliminate the race, as cleanup will be run after apply command and post-hooks are finished

The first one looks easiest, I'll be happy to send a PR if this is fine behaviour.

Relevant issues:

@ilyaluk ilyaluk added the bug Something isn't working label Apr 18, 2023
@jamengual
Copy link
Contributor

Don't run post-workflow hooks if PR was automerged in command execution I believe that should be fine but it should still work for people that do not use automerge.

@nitrocode @GenPage , thoughts?

@ilyaluk
Copy link
Contributor Author

ilyaluk commented May 8, 2023

Hey @jamengual @GenPage, we are still experiencing the issue quite often in our repo (with dozens of PRs per day this happens at least weekly).

Would disabling post-workflow hooks after automerge be acceptable solution while there is no accepted ADR for locks? I could send a PR this week.

@jamengual
Copy link
Contributor

@nitrocode

@nitrocode
Copy link
Member

The post-workflow would still need to run in some cases. I think it would be better to clean up the working directory after the post-workflow-hook completes its run, no?

@dosubot dosubot bot added the Stale label Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Stale
Projects
None yet
Development

No branches or pull requests

3 participants