
docs(adr): Project Locks 0002 #3345

Open · wants to merge 3 commits into main
Conversation

Member

@GenPage GenPage commented Apr 21, 2023

what

This is the first Architecture Decision Record for Atlantis, addressing the problem we have with locking. It addresses a key function of the system and has a broad impact on how Atlantis operates. Discussion of the ADR can take place in this PR with the community; it will not be merged until accepted.

why

There have been many bug reports, issues, and pull requests over multiple release cycles trying to address this problem. The ADR format provides the perfect way to get the community on the same page.

references

@GenPage GenPage requested a review from a team as a code owner April 21, 2023 19:03
@GenPage GenPage force-pushed the adr/project-locks branch 2 times, most recently from c82657e to 5d5c7c6 Compare April 21, 2023 19:14
@nitrocode nitrocode changed the title from docs(adr): 001 - Project Locks to docs(adr): adrs 0001 and Project Locks 0002 on May 10, 2023
### Problem

There is a long-standing regression, introduced by a PR that allowed parallel plans for projects within the same repository that also belong in the same workspace. Users attempting to plan are prompted with an error.
Contributor

@finnag finnag Aug 21, 2023

After commit 5288389 this is hard to trigger. In practice it requires a TryLock to be called while TryLockPull is held, and that only happens for a reasonably short time in buildAllProjectCommandsByPlan.

If TryLockPull is removed and its caller rewritten to actually lock the affected directories, this error will no longer be possible.


There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.

*TODO:* dig into locking more thoroughly.
Contributor

The WorkingDirLocker locks would be a lot more end-user-friendly if they were blocking locks, instead of "try or instantly fail" like they are now.

Member Author

I agree with the blocking locks. Block initially and then time out after a certain limit.
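For illustration, a minimal sketch of what "block, then time out" could look like, assuming a channel-backed lock keyed by working directory; `DirLocker`, the key format, and the timeout value are hypothetical, not the existing WorkingDirLocker API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DirLocker serializes access to working directories. Unlike a
// try-or-instantly-fail lock, Lock blocks until the directory is free
// or the timeout expires.
type DirLocker struct {
	mu    sync.Mutex
	locks map[string]chan struct{} // one single-slot channel per directory key
}

func NewDirLocker() *DirLocker {
	return &DirLocker{locks: map[string]chan struct{}{}}
}

// slot returns the buffered channel that acts as the lock for key.
func (d *DirLocker) slot(key string) chan struct{} {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.locks[key]; !ok {
		ch := make(chan struct{}, 1)
		ch <- struct{}{} // lock starts out free
		d.locks[key] = ch
	}
	return d.locks[key]
}

// Lock blocks until the key is acquired or the timeout elapses.
// It returns an unlock func on success.
func (d *DirLocker) Lock(key string, timeout time.Duration) (func(), error) {
	ch := d.slot(key)
	select {
	case <-ch:
		return func() { ch <- struct{}{} }, nil
	case <-time.After(timeout):
		return nil, fmt.Errorf("timed out waiting for lock on %q", key)
	}
}

func main() {
	locker := NewDirLocker()
	unlock, err := locker.Lock("org/repo/pr-42/default", 30*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer unlock()
	fmt.Println("lock held; run git/plan operations here")
}
```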

a) Error when attempting fetch + pull to update existing local git repo
b) File operation on a path that doesn't exist (data loss/accidental deletes)

In all other situations, we should be utilizing Git for its delta compression and performing less intensive fetch + pull operations.
Contributor

@finnag finnag Aug 21, 2023

If you are using the merge checkout strategy, the time between the initial checkout and the actual plan execution that runs successfully to completion without hitting any lock problems can be hours, days, or weeks, depending on what other PRs are doing.

So, before the steps, you will at least have to verify (again!) that you are still up to date before proceeding. You will have to merge (again) if upstream has been modified and you are using the merge strategy.


### Locking

There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.
Contributor

The lock for git operations should cover the "lifetime" of the git operation and the checks performed to see whether it is necessary (a minimal sketch follows this list):

  • lock
  • check conditions ("does the directory exist", "are we at the right commit", "are we in sync", ...)
  • fix conditions (make dir, check out, fetch, checkout, merge, ...)
  • unlock
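A minimal sketch of that lifetime, assuming a single mutex guarding one working directory and shelling out to git; the directory layout, URL, and helper names are illustrative, not Atlantis's actual WorkingDir code:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

// dirLock guards one working directory; a real implementation would key
// a lock per directory/workspace rather than use a single global mutex.
var dirLock sync.Mutex

// git runs a git command inside dir and returns its combined output.
func git(dir string, args ...string) (string, error) {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	return string(out), err
}

// ensureWorkspace holds the lock for the whole check-then-fix sequence,
// so nothing else can touch the directory between the check and the fix.
func ensureWorkspace(dir, cloneURL, branch string) error {
	dirLock.Lock()         // lock
	defer dirLock.Unlock() // unlock only after check + fix are done

	// check condition: does the directory exist?
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		// fix condition: clone it
		_, cloneErr := git(".", "clone", "--branch", branch, cloneURL, dir)
		return cloneErr
	}
	// check/fix condition: fetch upstream and reset onto the fetched branch
	if _, err := git(dir, "fetch", "origin", branch); err != nil {
		return err
	}
	_, err := git(dir, "checkout", "-B", branch, "origin/"+branch)
	return err
}

func main() {
	err := ensureWorkspace("/tmp/atlantis-example", "https://github.com/runatlantis/atlantis.git", "main")
	if err != nil {
		fmt.Println("workspace setup failed:", err)
	}
}
```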

6. PlanCommandRunner cleans up previous plans and Project locks
7. PlanCommandRunner passes ProjectCommandRunner and a list of projects to ProjectCommandPoolExecutor, which executes `ProjectCommandRunner.doPlan`
8. ProjectCommandRunner.doPlan
   1. acquires Project lock - `p.Locker.TryLock(ctx.Log, ctx.Pull, ctx.User, ctx.Workspace, models.NewProject(ctx.Pull.BaseRepo.FullName, ctx.RepoRelDir), ctx.RepoLocking)`
Contributor

Moving this project lock up into projectCmdBuilder, before step 5.3 (the WorkingDirLocker locks), would mean we could give instant feedback about the lock problem instead of possibly hours later, when we get around to the plan step that fails to get the lock. We could also drop the clone here in doPlan step 3.

Member Author

Do we really want to lock the whole PR so early? There's ongoing work in #3879 to add the ability to move it even deeper, to the apply step instead of the first plan.

There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.

*TODO:* dig into locking more thoroughly.

Contributor

I suggest a locking scheme something like this (a rough sketch of the in-memory lock follows this list):

  • An internal, in-memory blocking lock for file operations that cannot safely run in parallel. The lock key should be the directory affected, the output file generated, or the TF_DATA_DIR.

    • grab lock
    • test conditions (directory exists, files are modified, commit is right, synced with upstream, ...)
    • remedy conditions (make dir, regenerate files, clone/fetch/merge/checkout/...)
    • unlock
  • An external lock shared between PRs/Atlantis instances (the current project lock).
    To be safe, the lock lifetime should be:

    • grab project lock
    • test conditions (up to date with HEAD/upstream ...?)
    • update if necessary
    • plan
    • apply
    • merge PR to upstream
    • unlock project lock
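A rough sketch of the internal, in-memory per-key lock described above, assuming one mutex per key; the key strings are illustrative, not an existing Atlantis type:

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocks hands out one mutex per key (directory, plan file, or
// TF_DATA_DIR), so unrelated projects never block each other while
// operations on the same path are serialized.
type keyedLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (k *keyedLocks) get(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = map[string]*sync.Mutex{}
	}
	if _, ok := k.locks[key]; !ok {
		k.locks[key] = &sync.Mutex{}
	}
	return k.locks[key]
}

func main() {
	var kl keyedLocks
	l := kl.get("repos/org/repo/pr-7/default/.terraform") // e.g. the TF_DATA_DIR
	l.Lock()
	// test conditions and remedy them here, then release
	l.Unlock()
	fmt.Println("done")
}
```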

@brandon-fryslie

Can I suggest separate locks for plan vs apply? We've disabled Atlantis locking entirely because the current locking system in Atlantis is not scalable for repos with many modules and many concurrent users.

@GenPage GenPage mentioned this pull request Dec 11, 2023
Member Author

GenPage commented Dec 13, 2023

> Can I suggest separate locks for plan vs apply? We've disabled Atlantis locking entirely because the current locking system in Atlantis is not scalable for repos with many modules and many concurrent users.

@brandon-fryslie This is something that someone has proposed in #3879, which should address your concerns.

@Fabianoshz
Contributor

Hi @GenPage, I was working on #2921 before, but life got in the way. The good news is that I have some bandwidth to start working on this again. I can't make any promises beyond having some free time and being willing to give this another shot. That would also help me remember how the code works, and maybe I can help with the discussion.

Should I start working on that?

@jamengual jamengual requested a review from a team as a code owner October 10, 2024 02:42
@jamengual jamengual requested review from nitrocode and X-Guardian and removed request for a team October 10, 2024 02:42
@Fabianoshz
Contributor

So hey... I've started looking at the code and remembered why this was so hard... I've tested a few different approaches, but I feel like fixing the locking issue is going to require a lot of work.

I took a step back and tried to see things from the user's perspective, and I've noticed that the pain comes not from the locking itself, but from the interruption that happens when a command tries to acquire a lock that is already held.

With that in mind, I've created a draft PR here trying to mitigate this by adding retry logic when we try to acquire a lock (it's what users have to do manually anyway). Doing this can buy us some time to rethink the design without users suffering while they wait for us.
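For illustration, a minimal sketch of the retry idea, assuming a try-style lock that reports whether it succeeded; `tryLock`, the attempt count, and the delay are stand-ins, not the draft PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// acquireWithRetry keeps retrying a try-style lock until it succeeds,
// the context is cancelled, or the retry budget is exhausted, instead
// of failing the whole command on the first collision.
func acquireWithRetry(ctx context.Context, tryLock func() (bool, error), attempts int, delay time.Duration) error {
	for i := 0; i < attempts; i++ {
		ok, err := tryLock()
		if err != nil {
			return err
		}
		if ok {
			return nil // lock acquired
		}
		select {
		case <-time.After(delay): // wait before the next attempt
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return errors.New("lock is still held after retrying")
}

func main() {
	// Toy lock that becomes free on the third attempt.
	calls := 0
	tryLock := func() (bool, error) {
		calls++
		return calls >= 3, nil
	}
	err := acquireWithRetry(context.Background(), tryLock, 5, 100*time.Millisecond)
	fmt.Println("acquired:", err == nil)
}
```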

@jamengual
Contributor

You have the green light, @Fabianoshz. Thanks for taking the time to look at this.
@lukemassa is another maintainer to whom you can bounce ideas.
I can be a guinea pig for testing if you need me to, and we can definitely ask the community to try it.

Member Author

GenPage commented Oct 21, 2024

> So hey... I've started looking at the code and remembered why this was so hard... I've tested a few different approaches, but I feel like fixing the locking issue is going to require a lot of work.
>
> I took a step back and tried to see things from the user's perspective, and I've noticed that the pain comes not from the locking itself, but from the interruption that happens when a command tries to acquire a lock that is already held.
>
> With that in mind, I've created a draft PR here trying to mitigate this by adding retry logic when we try to acquire a lock (it's what users have to do manually anyway). Doing this can buy us some time to rethink the design without users suffering while they wait for us.

Welcome back @Fabianoshz. Yes, it's a very complex issue, hence the ADR to try to understand it and break it down into multiple pieces that we can step through.

Having a timeout on the working dir lock is an excellent first step. As for other ideas, I've been a bit removed from the code as of late and defer to others' judgment.

Labels: adr Architecture Decision Record