docs(adr): Project Locks 0002 #3345
Conversation
Force-pushed from c82657e to 5d5c7c6.
Force-pushed from 2084532 to e681c32.
### Problem

There is a long-standing regression introduced by a PR that allowed parallel plans for projects within the same repository that also belong to the same workspace. The error shown to users when attempting to plan is:
After commit 5288389 this is hard to trigger. In practice it requires TryLock to be called while TryLockPull is held, and that only happens for a reasonably short time in buildAllProjectCommandsByPlan.
If TryLockPull is removed and its caller rewritten to actually lock the affected directories, this error will no longer be possible.
There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.
*TODO:* dig into locking more thoroughly.
The WorkingDirLocker locks would be a lot more end-user friendly if they were blocking locks instead of the "try or instantly fail" locks they are now.
I agree with the blocking locks. Block initially and then time out after a certain limit.
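A minimal sketch of that idea, assuming a channel-backed lock; the names (`timedLock`, `Acquire`) are illustrative and not the existing WorkingDirLocker API:

```go
package locking

import (
	"errors"
	"time"
)

// timedLock is a blocking lock that gives up after a timeout instead of
// failing instantly like the current try-style calls.
type timedLock struct {
	ch chan struct{}
}

func newTimedLock() *timedLock {
	l := &timedLock{ch: make(chan struct{}, 1)}
	l.ch <- struct{}{} // the lock starts out free
	return l
}

// Acquire blocks until the lock is free or the timeout elapses, returning
// an unlock function the caller should defer.
func (l *timedLock) Acquire(timeout time.Duration) (func(), error) {
	select {
	case <-l.ch:
		return func() { l.ch <- struct{}{} }, nil
	case <-time.After(timeout):
		return nil, errors.New("timed out waiting for the working directory lock")
	}
}
```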
a) Error when attempting fetch + pull to update an existing local git repo
b) File operation on a path that doesn't exist (data loss/accidental deletes)

In all other situations, we should be utilizing Git for its delta compression and performing less intensive fetch + pull operations.
If you are using the merge checkout strategy, the time between the initial checkout and the plan execution that actually runs to completion without hitting any lock problems can be hours/days/weeks, depending on what other PRs are doing.
So, before those steps, you will at least have to verify (again!) that you are still up to date before proceeding, and merge (again) if upstream has been modified and you are using the merge strategy.
### Locking

There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.
The lock for git operations should cover the "lifetime" of the git operation and the tests performed to see if they are necessary (see the sketch below):
- lock
- check for conditions ("does the directory exist", "are we at the right commit", "are we in sync", ...)
- fix the condition (make dir, check out, fetch, merge, ...)
- unlock
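A rough sketch of that lifetime, assuming a hypothetical `GitOps` interface rather than the current Atlantis WorkingDir code:

```go
package locking

import "sync"

// GitOps abstracts the checks and fixes; the method names are
// hypothetical stand-ins, not the real Atlantis interface.
type GitOps interface {
	DirExists(dir string) bool
	AtCommit(dir, sha string) (bool, error)
	Clone(dir string) error
	FetchAndMerge(dir, sha string) error
}

// syncWorkingDir holds the lock across both the condition checks and the
// remediation, releasing it only when the whole git operation is done.
func syncWorkingDir(mu *sync.Mutex, git GitOps, dir, headSHA string) error {
	mu.Lock()         // lock
	defer mu.Unlock() // unlock only after check + fix are both complete

	// check for conditions
	if !git.DirExists(dir) {
		return git.Clone(dir) // fix: directory missing, fresh clone
	}
	inSync, err := git.AtCommit(dir, headSHA)
	if err != nil || inSync {
		return err // already at the right commit, or the check itself failed
	}
	// fix: bring the existing checkout up to date
	return git.FetchAndMerge(dir, headSHA)
}
```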
6. PlanCommandRunner cleans up previous plans and Project locks
7. PlanCommandRunner passes ProjectCommandRunner and a list of projects to ProjectCommandPoolExecutor, which executes `ProjectCommandRunner.doPlan`
8. ProjectCommandRunner.doPlan
   1. acquires Project lock - `p.Locker.TryLock(ctx.Log, ctx.Pull, ctx.User, ctx.Workspace, models.NewProject(ctx.Pull.BaseRepo.FullName, ctx.RepoRelDir), ctx.RepoLocking)`
Moving this project lock up into projectCmdBuilder before step 5.3 (workingdirlocker locks) would mean we could give instant feedback about the lock problem instead of possibly hours later when we get around to the plan step that fails to get the lock, and we could drop the clone here in doPlan step 3.
Do we really want to lock on the whole PR so early? There's ongoing work in #3879 to add the ability to move it even later, to the apply step instead of the first plan.
There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.

*TODO:* dig into locking more thoroughly.
I suggest a locking scheme something like this (a rough sketch follows the list):

- An internal, in-memory blocking lock for file operations that cannot safely run in parallel. The lock key should be the directory affected, the output file generated, or the TF_DATA_DIR.
  - grab lock
  - test conditions (directory exists, files are modified, commit is right, synced with upstream, ...)
  - remedy conditions (make dir, regenerate files, clone/fetch/merge/checkout/...)
  - unlock
- An external lock shared between PRs/Atlantis instances (the current project lock). To be safe, the lock lifetime should be:
  - grab project lock
  - test conditions (up to date with HEAD/upstream ...?)
  - update if necessary
  - plan
  - apply
  - merge PR to upstream
  - unlock project lock
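A minimal sketch of the internal, in-memory part of that scheme, keyed by the affected path; the type and method names are illustrative only, and the external/shared project lock is left out:

```go
package locking

import "sync"

// keyedLocker hands out one mutex per key (affected directory, output
// file, or TF_DATA_DIR) so unrelated paths never block each other while
// operations on the same path serialize.
type keyedLocker struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedLocker() *keyedLocker {
	return &keyedLocker{locks: make(map[string]*sync.Mutex)}
}

// Lock blocks until the lock for key is available and returns the
// matching unlock function.
func (k *keyedLocker) Lock(key string) (unlock func()) {
	k.mu.Lock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	k.mu.Unlock()

	l.Lock() // blocks rather than "try or instantly fail"
	return l.Unlock
}
```

Usage would follow the test/remedy pattern above, e.g. `unlock := locker.Lock(cloneDir); defer unlock()` around the checks and fixes, while the project lock wraps the longer plan → apply → merge sequence.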
Can I suggest separate locks for
@brandon-fryslie This is something that someone has proposed in #3879, which should address your concerns.
Force-pushed from 034aa8b to 34da5a3.
Hi @GenPage, I was working on #2921 before but life got in the way. Good news is that I have some bandwidth to start working on this again; I can't make any promises other than that I have some free time and I'm willing to give this a shot again. It would also help me remember how the code works, and maybe I can help with the discussion. Should I start working on that?
So hey... I've started looking at the code and remembered why this was so hard... I've tested a few different approaches, but I feel like fixing the locking issue will require a lot of work. I took a step back, tried to see things from the user's perspective, and noticed that the suffering comes not from the locking issue itself, but from the interruption that happens when a command tries to acquire a lock that is already in use. With that in mind I've created a draft PR here trying to mitigate this by adding retry logic when we try to acquire a lock (it's what users have to do manually anyway). Doing this can buy us some time to rethink the design without users suffering while they wait for us.
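A minimal sketch of that retry idea, assuming the existing try-style call is wrapped as a plain function; the names here are hypothetical, not the draft PR's actual code:

```go
package locking

import (
	"errors"
	"time"
)

// acquireWithRetry keeps retrying an existing try-style lock until it
// succeeds or the deadline passes, waiting a short interval between
// attempts. tryLock stands in for the current "try or fail" call.
func acquireWithRetry(tryLock func() (func(), error), deadline, interval time.Duration) (func(), error) {
	giveUp := time.After(deadline)
	for {
		unlock, err := tryLock()
		if err == nil {
			return unlock, nil
		}
		select {
		case <-giveUp:
			return nil, errors.New("gave up waiting for the working directory lock")
		case <-time.After(interval):
			// the lock is still held elsewhere; try again
		}
	}
}
```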
You have the green light, @Fabianoshz. Thanks for taking the time to look at this.
Welcome back @Fabianoshz, yes it's a very complex issue, hence the ADR to try to understand it and break it down into multiple pieces that we can step through. Having a timeout on the working dir lock is an excellent first step; as for other ideas, I've been a bit removed from the code as of late and defer to others' judgment.
what
This is the first Architecture Decision Record for Atlantis, addressing the problem we have with locking. It addresses a key function of the system and has a broad impact on how Atlantis operates. Discussion of the ADR can take place in this PR with the community, and it will not be merged until accepted.
why
There have been many bug reports, issues, and pull requests over multiple release cycles trying to address this problem. The ADR format provides the perfect way to get the community on the same page.
references