
docs(adr): Project Locks 0002 #3345

Open · wants to merge 3 commits into main
Conversation

Member

@GenPage GenPage commented Apr 21, 2023

what

This is the first Architecture Decision Record for Atlantis, addressing the problem we have with locking. It addresses a key function of the system and has a broad impact on how Atlantis operates. Discussion of the ADR can take place in this PR with the community; it will not be merged until accepted.

why

There have been many bug reports, issues, and pull requests over multiple release cycles trying to address this problem. The ADR format provides the perfect way to get the community on the same page.

references

@GenPage GenPage requested a review from a team as a code owner April 21, 2023 19:03
@GenPage GenPage force-pushed the adr/project-locks branch 2 times, most recently from c82657e to 5d5c7c6 Compare April 21, 2023 19:14
@nitrocode nitrocode changed the title from docs(adr): 001 - Project Locks to docs(adr): adrs 0001 and Project Locks 0002 on May 10, 2023
### Problem

There is a long-standing regression, introduced by a PR that allowed parallel plans for projects within the same repository that also belong in the same workspace. Users attempting to plan are prompted with an error.
Contributor

@finnag finnag Aug 21, 2023

After commit 5288389 this is hard to trigger. In practice it requires a TryLock to be called while TryLockPull is held, and that only happens for a reasonably short time in buildAllProjectCommandsByPlan.

If TryLockPull is removed and its caller rewritten to actually lock the affected directories, this error will no longer be possible.


There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.

*TODO:* dig into locking more thoroughly.
Contributor

The WorkingDirLocker locks would be a lot more end-user-friendly if they were blocking locks, instead of "try or instantly fail" like they are now.

Member Author

I agree with the blocking locks. Block initially and then time out after a certain limit.
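For illustration, a minimal sketch of what "block, then time out" could look like, assuming a channel-backed lock keyed by working directory; `DirLocker`, the key format, and the timeout value are hypothetical, not the existing WorkingDirLocker API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DirLocker serializes access to working directories. Unlike a
// try-or-instantly-fail lock, Lock blocks until the directory is free
// or the timeout expires.
type DirLocker struct {
	mu    sync.Mutex
	locks map[string]chan struct{} // one single-slot channel per directory key
}

func NewDirLocker() *DirLocker {
	return &DirLocker{locks: map[string]chan struct{}{}}
}

// slot returns the buffered channel that acts as the lock for key.
func (d *DirLocker) slot(key string) chan struct{} {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.locks[key]; !ok {
		ch := make(chan struct{}, 1)
		ch <- struct{}{} // lock starts out free
		d.locks[key] = ch
	}
	return d.locks[key]
}

// Lock blocks until the key is acquired or the timeout elapses.
// It returns an unlock func on success.
func (d *DirLocker) Lock(key string, timeout time.Duration) (func(), error) {
	ch := d.slot(key)
	select {
	case <-ch:
		return func() { ch <- struct{}{} }, nil
	case <-time.After(timeout):
		return nil, fmt.Errorf("timed out waiting for lock on %q", key)
	}
}

func main() {
	locker := NewDirLocker()
	unlock, err := locker.Lock("org/repo/pr-42/default", 30*time.Second)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer unlock()
	fmt.Println("lock held; run git/plan operations here")
}
```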

a) Error when attempting fetch + pull to update existing local git repo
b) File operation on a path that doesn't exist (data loss/accidental deletes)

In all other situations, we should be utilizing Git for its delta compression and performing less intensive fetch + pull operations.
Contributor

@finnag finnag Aug 21, 2023

If you are using the merge checkout strategy, the time between the initial checkout and the actual plan execution that runs successfully to completion without hitting any lock problems can be hours, days, or weeks, depending on what other PRs are doing.

So, before the steps, you will at least have to verify (again!) that you are still up to date before proceeding. You will have to merge (again) if upstream has been modified and you are using the merge strategy.


### Locking

There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.
Contributor

The lock for git operations should cover the "lifetime" of the git operation and the checks performed to see whether it is necessary (a minimal sketch follows this list):

  • lock
  • check conditions ("does the directory exist", "are we at the right commit", "are we in sync", ...)
  • fix conditions (make dir, check out, fetch, checkout, merge, ...)
  • unlock
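A minimal sketch of that lifetime, assuming a single mutex guarding one working directory and shelling out to git; the directory layout, URL, and helper names are illustrative, not Atlantis's actual WorkingDir code:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

// dirLock guards one working directory; a real implementation would key
// a lock per directory/workspace rather than use a single global mutex.
var dirLock sync.Mutex

// git runs a git command inside dir and returns its combined output.
func git(dir string, args ...string) (string, error) {
	cmd := exec.Command("git", args...)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	return string(out), err
}

// ensureWorkspace holds the lock for the whole check-then-fix sequence,
// so nothing else can touch the directory between the check and the fix.
func ensureWorkspace(dir, cloneURL, branch string) error {
	dirLock.Lock()         // lock
	defer dirLock.Unlock() // unlock only after check + fix are done

	// check condition: does the directory exist?
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		// fix condition: clone it
		_, cloneErr := git(".", "clone", "--branch", branch, cloneURL, dir)
		return cloneErr
	}
	// check/fix condition: fetch upstream and reset onto the fetched branch
	if _, err := git(dir, "fetch", "origin", branch); err != nil {
		return err
	}
	_, err := git(dir, "checkout", "-B", branch, "origin/"+branch)
	return err
}

func main() {
	err := ensureWorkspace("/tmp/atlantis-example", "https://github.com/runatlantis/atlantis.git", "main")
	if err != nil {
		fmt.Println("workspace setup failed:", err)
	}
}
```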

6. PlanCommandRunner cleans up previous plans and Project locks
7. PlanCommandRunner passes ProjectCommandRunner and a list of projects to ProjectCommandPoolExecutor, which executes `ProjectCommandRunner.doPlan`
8. ProjectCommandRunner.doPlan
   1. acquires Project lock - `p.Locker.TryLock(ctx.Log, ctx.Pull, ctx.User, ctx.Workspace, models.NewProject(ctx.Pull.BaseRepo.FullName, ctx.RepoRelDir), ctx.RepoLocking)`
Contributor

Moving this project lock up into projectCmdBuilder, before step 5.3 (the WorkingDirLocker locks), would mean we could give instant feedback about the lock problem instead of possibly hours later, when we get around to the plan step that fails to get the lock. We could also drop the clone here in doPlan step 3.

Member Author

Do we really want to lock the whole PR so early? There's ongoing work in #3879 to add the ability to move it even deeper, to the apply step instead of the first plan.

There has also been a history of issues with colliding locks that disrupt parallel commands. In this case, locking happens to avoid collisions on file operations. We should be locking during Git operations but might not need to during workspace/plan file creation.

*TODO:* dig into locking more thoroughly.

Contributor

I suggest a locking scheme something like this (a rough sketch of the in-memory lock follows this list):

  • An internal, in-memory blocking lock for file operations that cannot safely run in parallel. The lock key should be the directory affected, the output file generated, or the TF_DATA_DIR.

    • grab lock
    • test conditions (directory exists, files are modified, commit is right, synced with upstream, ...)
    • remedy conditions (make dir, regenerate files, clone/fetch/merge/checkout/...)
    • unlock
  • An external lock shared between PRs/Atlantis instances (the current project lock).
    To be safe, the lock lifetime should be:

    • grab project lock
    • test conditions (up to date with HEAD/upstream ...?)
    • update if necessary
    • plan
    • apply
    • merge PR to upstream
    • unlock project lock
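A rough sketch of the internal, in-memory per-key lock described above, assuming one mutex per key; the key strings are illustrative, not an existing Atlantis type:

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocks hands out one mutex per key (directory, plan file, or
// TF_DATA_DIR), so unrelated projects never block each other while
// operations on the same path are serialized.
type keyedLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (k *keyedLocks) get(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.locks == nil {
		k.locks = map[string]*sync.Mutex{}
	}
	if _, ok := k.locks[key]; !ok {
		k.locks[key] = &sync.Mutex{}
	}
	return k.locks[key]
}

func main() {
	var kl keyedLocks
	l := kl.get("repos/org/repo/pr-7/default/.terraform") // e.g. the TF_DATA_DIR
	l.Lock()
	// test conditions and remedy them here, then release
	l.Unlock()
	fmt.Println("done")
}
```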

@brandon-fryslie

Can I suggest separate locks for plan vs apply? We've disabled Atlantis locking entirely because the current locking system in Atlantis is not scalable for repos with many modules and many concurrent users.

@GenPage GenPage mentioned this pull request Dec 11, 2023
Member Author

GenPage commented Dec 13, 2023

> Can I suggest separate locks for plan vs apply? We've disabled Atlantis locking entirely because the current locking system in Atlantis is not scalable for repos with many modules and many concurrent users.

@brandon-fryslie This is something that someone has proposed in #3879, which should address your concerns.

@Fabianoshz
Contributor

Hi @GenPage, I was working on #2921 before, but life got in the way. The good news is that I have some bandwidth to start working on this again. I can't make any promises beyond having some free time and being willing to give this another shot. That would also help me remember how the code works, and maybe I can help with the discussion.

Should I start working on that?

@jamengual jamengual requested a review from a team as a code owner October 10, 2024 02:42
@jamengual jamengual requested review from nitrocode and X-Guardian and removed request for a team October 10, 2024 02:42
@Fabianoshz
Contributor

So hey... I've started looking at the code and remembered why this was so hard... I've tested a few different approaches, but I feel like fixing the locking issue is going to require a lot of work.

I took a step back and tried to see things from the user's perspective, and I've noticed that the pain comes not from the locking itself, but from the interruption that happens when a command tries to acquire a lock that is already held.

With that in mind, I've created a draft PR here trying to mitigate this by adding retry logic when we try to acquire a lock (it's what users have to do manually anyway). Doing this can buy us some time to rethink the design without users suffering while they wait for us.
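For illustration, a minimal sketch of the retry idea, assuming a try-style lock that reports whether it succeeded; `tryLock`, the attempt count, and the delay are stand-ins, not the draft PR's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// acquireWithRetry keeps retrying a try-style lock until it succeeds,
// the context is cancelled, or the retry budget is exhausted, instead
// of failing the whole command on the first collision.
func acquireWithRetry(ctx context.Context, tryLock func() (bool, error), attempts int, delay time.Duration) error {
	for i := 0; i < attempts; i++ {
		ok, err := tryLock()
		if err != nil {
			return err
		}
		if ok {
			return nil // lock acquired
		}
		select {
		case <-time.After(delay): // wait before the next attempt
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return errors.New("lock is still held after retrying")
}

func main() {
	// Toy lock that becomes free on the third attempt.
	calls := 0
	tryLock := func() (bool, error) {
		calls++
		return calls >= 3, nil
	}
	err := acquireWithRetry(context.Background(), tryLock, 5, 100*time.Millisecond)
	fmt.Println("acquired:", err == nil)
}
```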

@jamengual
Contributor

You have the green light, @Fabianoshz. Thanks for taking the time to look at this.
@lukemassa is another maintainer to whom you can bounce ideas.
I can be a guinea pig for testing if you need me to, and we can definitely ask the community to try it.

Member Author

GenPage commented Oct 21, 2024

> So hey... I've started looking at the code and remembered why this was so hard... I've tested a few different approaches, but I feel like fixing the locking issue is going to require a lot of work.
>
> I took a step back and tried to see things from the user's perspective, and I've noticed that the pain comes not from the locking itself, but from the interruption that happens when a command tries to acquire a lock that is already held.
>
> With that in mind, I've created a draft PR here trying to mitigate this by adding retry logic when we try to acquire a lock (it's what users have to do manually anyway). Doing this can buy us some time to rethink the design without users suffering while they wait for us.

Welcome back @Fabianoshz. Yes, it's a very complex issue, hence the ADR to try to understand it and break it down into multiple pieces that we can step through.

Having a timeout on the working dir lock is an excellent first step. As for other ideas, I've been a bit removed from the code as of late and defer to others' judgment.

Labels: adr Architecture Decision Record