Avoid race conditions when installing multiple dependencies from the same git repository. #9658
base: main
Conversation
Multiple installs from the same git repository cause a race condition when the repository is cloned. This commit only allows one operation per repository to be executed by the parallel workers. Any extra operation will be performed serially. Since the git repository is cached after the first operation, this is blazingly fast. Resolves: python-poetry#6958
Thank you so much
LGTM. Just a minor nitpick.
I thought a bit about how to write a unit test, but even with mocking I could not come up with a good test. Maybe we can just add a kind of smoke test like `test_executor_should_append_subdirectory_for_git`, but with two packages from the same git repo, and check that both are still installed. At least the order should be fixed now, since they are installed serially.
A workaround for now that works for 1.8.3
src/poetry/installation/executor.py (outdated)
```python
if operation.package.source_type == "git" and not operation.skipped:
    # If this is the first operation cloning this repository,
    # allow it to be executed in parallel
    if operation.package.source_url not in git_repositories_to_clone:
```
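For context, the gating in the quoted snippet boils down to first-seen-wins bookkeeping per repository url. A standalone sketch of that pattern, with plain dicts standing in for poetry's operation objects (the names here are illustrative, not poetry's actual API surface):

```python
# Hypothetical sketch: the first operation per repository url may run in
# parallel; later operations on the same url are deferred to serial execution.
operations = [
    {"name": "pkg-a", "source_url": "https://example.com/repo.git"},
    {"name": "pkg-b", "source_url": "https://example.com/repo.git"},
    {"name": "pkg-c", "source_url": "https://example.com/other.git"},
]

git_repositories_to_clone: set = set()
parallel_ops, serial_ops = [], []

for op in operations:
    url = op["source_url"]
    if url in git_repositories_to_clone:
        # Another operation already clones this repository: run serially.
        serial_ops.append(op)
        continue
    git_repositories_to_clone.add(url)
    parallel_ops.append(op)

assert [op["name"] for op in parallel_ops] == ["pkg-a", "pkg-c"]
assert [op["name"] for op in serial_ops] == ["pkg-b"]
```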
Does this work on forked vs non-forked?
https://github.com/repo1/code
https://github.com/repo2/code
Good catch, it does not! The race condition stems from having multiple installs that try to clone to the same local directory. The name for this directory is created here.
In your example, both repositories would get the same `target`: `<source_root>/code`.
But as you point out, I would not catch this in my checks for identical source_urls, since the forks have different urls.
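To make the collision concrete: the directory name is derived from the last path segment of the source url, so two different repositories can map to the same clone target. A toy stand-in for that derivation (the real logic is poetry's `Git.get_name_from_source_url`; this function only approximates it for illustration):

```python
# Illustrative stand-in for how poetry derives the checkout directory name
# from a source url (last path segment, ".git" suffix stripped).
def name_from_url(url: str) -> str:
    return url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git")

fork_a = "https://github.com/repo1/code"
fork_b = "https://github.com/repo2/code"

# Different repositories, identical directory name -> identical clone target.
assert name_from_url(fork_a) == name_from_url(fork_b) == "code"
```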
I see two ways this can be resolved:
- In the `Git.clone` method, making sure that a larger portion of the source url is used to craft the target directory.
- In the executor logic, ensuring that packages that will have the same `target` directory will be considered parallel unsafe.
Does anyone have an opinion on what route to take?
Or 3. fix the race condition?

From the little I understand, it appears that the root cause is concurrent execution of `poetry.vcs.git.Git.clone()` in the same physical directory. Neither the `git` executable nor dulwich supports concurrent execution in the same directory. Does this sound right?

What if we prevent concurrent clone execution in the same directory? We could use `threading.Lock`.
In `vcs/git/backend.py` this would look like:
```python
from collections import defaultdict
from threading import Lock

_clone_lock: dict[Path, Lock] = defaultdict(Lock)


@classmethod
def clone(
    cls,
    url: str,
    name: str | None = None,
    branch: str | None = None,
    tag: str | None = None,
    revision: str | None = None,
    source_root: Path | None = None,
    clean: bool = False,
) -> Repo:
    source_root = source_root or cls.get_default_source_root()
    source_root.mkdir(parents=True, exist_ok=True)

    name = name or cls.get_name_from_source_url(url=url)
    target = source_root / name
    refspec = GitRefSpec(branch=branch, revision=revision, tag=tag)

    # Prohibit concurrent clones in the same target Path
    with _clone_lock[target]:
        # rest of clone follows, only the indentation changes
        ...
```
This will deadlock if `clone()` calls back into `clone()` using the same directory. And although recursive calls do happen via submodules, it doesn't look like they use the same directory.
I tried this out on the sample repo in #6958 and it seems to fix the issue, but I'm not familiar enough with the poetry code to know if this would be a viable approach.
Nice! Fixing the problem at its root would of course be great.
An unfortunate functional drawback would be that when multiple installs from the same repository are executed in parallel, they would all execute past the cache lookup, and then proceed to wait right before the clone.
I have not dug deep into this, but it seems like this would cause the repository to be downloaded multiple times instead of just once. With the solution in this PR, it's only downloaded once.
Good point. My suggestion in #9658 (comment) could potentially work, but would cause a performance hit for sure.

So how about this: git operations with the same url name (from `Git.get_name_from_source_url()`) might conflict, because they're cloned into `<source_root>/src/<url name>`. So we serialize all git operations which share a url name.
For example, on top of poetry main:
```diff
diff --git a/src/poetry/installation/executor.py b/src/poetry/installation/executor.py
index 196cdd5a..ff439d50 100644
--- a/src/poetry/installation/executor.py
+++ b/src/poetry/installation/executor.py
@@ -32,6 +32,7 @@ from poetry.utils.helpers import pluralize
 from poetry.utils.helpers import remove_directory
 from poetry.utils.isolated_build import IsolatedBuildError
 from poetry.utils.isolated_build import IsolatedBuildInstallError
+from poetry.vcs.git import Git


 if TYPE_CHECKING:
@@ -144,6 +145,7 @@ class Executor:
         for _, group in groups:
             tasks = []
             serial_operations = []
+            git_operations = []
             for operation in group:
                 if self._shutdown:
                     break
@@ -162,8 +164,32 @@
                     serial_operations.append(operation)
                     continue

+                # Defer git operations, they might need to be serialized
+                if operation.package.source_type == "git" and not operation.skipped:
+                    git_operations.append(operation)
+                    continue
+
                 tasks.append(self._executor.submit(self._execute_operation, operation))

+            def url_name_from_operation(operation: Operation) -> str | None:
+                if url := operation.package.repository_url:
+                    return Git.get_name_from_source_url(url)
+                return None
+
+            # git operations on the same url name end up being cloned in the
+            # same directory, so these must execute serially or they'll
+            # interfere with each other.
+            for _, git_ops_group in itertools.groupby(
+                git_operations,
+                key=url_name_from_operation,
+            ):
+                ops = list(git_ops_group)
+
+                def _serialize(ops: list[Operation] = ops) -> None:
+                    for op in ops:
+                        self._execute_operation(op)
+
+                tasks.append(self._executor.submit(_serialize))
+
             try:
                 wait(tasks)
```
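One possible caveat with a `groupby`-based approach like this: `itertools.groupby` only merges *consecutive* items with equal keys, so unless the deferred operations happen to arrive ordered by url name, the same repository could still end up in two parallel tasks. Sorting by the key first would guarantee one group per name, for example:

```python
from itertools import groupby

# Url names of deferred git operations, in arrival order.
ops = ["repo-a", "repo-b", "repo-a"]

# Unsorted input: "repo-a" is split across two groups (two parallel tasks).
unsorted_groups = [(key, list(grp)) for key, grp in groupby(ops)]
assert unsorted_groups == [
    ("repo-a", ["repo-a"]),
    ("repo-b", ["repo-b"]),
    ("repo-a", ["repo-a"]),
]

# Sorting first yields exactly one group (one serial task) per repository name.
sorted_groups = [(key, list(grp)) for key, grp in groupby(sorted(ops))]
assert sorted_groups == [("repo-a", ["repo-a", "repo-a"]), ("repo-b", ["repo-b"])]
```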
I'm leaning towards ensuring that forked repositories (actually, any repositories that happen to have the same name) are not cloned into the same target directory. I'd have to examine whether there are any implicit dependencies on the name of the target folder, though.
If that route seems acceptable, I would also consider implementing it in a separate PR, since it could be viewed as a separate fix.
What do you think?
A separate PR sounds right to me.
I am wondering what the use case behind this issue is. If you have two forks with the same name, the packages built from the forks are likely to have the same name, right? And if you have two packages with the same name from different repos, you will get a conflict, because you can only install one package with a specific name into an environment.
The problem occurs when the repositories have the same name, not the package. So you would encounter this problem with dependencies like this:
```toml
pkg_1 = {git = "git@github.com:foo/my_packages.git", subdirectory = "pkg_1"}
pkg_2 = {git = "git@github.com:bar/my_packages.git", subdirectory = "pkg_2"}
```
It doesn't matter if one repository is a fork of the other or not, having the same name should be enough to cause the conflict.
I'm not sure how common this case is though, so I would be happy to defer it to a separate fix.
Operations for a single repository are grouped and executed serially by a single worker.
@gustavgransbo @TheSven73 Just to be sure before I take another detailed look and potentially merge it: if we outsource #9658 (comment) into a separate issue/PR, this PR is ready from your end, isn't it?
Yes, from my perspective it's ready to be merged!
```python
self._executor.submit(
    functools.partial(
        _serialize,
        repository_serial_operations=repository_git_operations,
    )
)
```
Is `functools.partial` an unnecessary complication here? Isn't this code just equivalent to `tasks.append(self._executor.submit(_serialize, repository_serial_operations=repository_git_operations))`?
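For what it's worth, the two forms should indeed behave the same: `Executor.submit(fn, *args, **kwargs)` simply calls `fn(*args, **kwargs)` on a worker, so `functools.partial` only buys something when the callable has to be stored first or a loop variable frozen. A standalone sketch with a dummy `_serialize` (not poetry's real helper):

```python
import functools
from concurrent.futures import ThreadPoolExecutor

def _serialize(ops: list) -> list:
    # Dummy stand-in for the serial-execution helper discussed above.
    return [op.upper() for op in ops]

with ThreadPoolExecutor(max_workers=2) as executor:
    via_partial = executor.submit(functools.partial(_serialize, ["a", "b"]))
    via_args = executor.submit(_serialize, ["a", "b"])

# Both submission styles produce the same result.
assert via_partial.result() == via_args.result() == ["A", "B"]
```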
Multiple installs from the same git repository cause a race condition when the repository is cloned.
This change only allows one operation per repository to be executed by the parallel workers.
Any extra operation on the same git repository will be performed serially.
Since the git repository is cached after the first operation, this is fast.
Building on the setup of @JonathanRayner, I was able to consistently reproduce the error in this repository.
Thank you @JonathanRayner for the hard work!
Pull Request Check List
Resolves: #6958
The race condition occurs inside `poetry.vcs.git.Git.clone`; this method is mocked in all tests (conftest.py), which makes the actual race condition hard to test.
Also, since the race condition required quite a bit of external setup, I think it could be challenging to add a good test for this.
This change should not require any new documentation.