Library leaves dangling processes after use #1333

Open
zhiltsov-max opened this issue Sep 2, 2021 · 2 comments
zhiltsov-max commented Sep 2, 2021

Hi, I've faced a problem on Windows with file/directory removal via shutil.rmtree(). The problem by itself is not new and already has known solutions, including git.util.rmtree(). However, just using this function was not enough in my case, and I started to dig deeper - there was an error about an open file handle. I discovered there were extra child git processes hanging in the process tree, and, unexpectedly, they were still there after all the library objects had been removed:
[screenshot: process tree with leftover git child processes]
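
For reference, the read-only-file part of the Windows removal problem is what helpers like git.util.rmtree() address: an onerror hook that makes the offending file writable and retries the operation. A minimal sketch of that pattern (an illustration, not GitPython's exact code):

import os
import shutil
import stat

def onerror_make_writable(func, path, exc_info):
    # Windows keeps some files under .git (e.g. pack files) read-only;
    # clear the read-only flag and retry the failed operation once.
    os.chmod(path, stat.S_IWRITE)
    func(path)

shutil.rmtree('test_repo', onerror=onerror_make_writable)

This handles the read-only case, but not an open file handle held by a still-running child process - which is what the rest of this issue is about.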

Then I tried to catch and trace Popen calls from the library and found this function:

GitPython/git/cmd.py

Lines 1190 to 1199 in 5b3669e

def get_object_header(self, ref: str) -> Tuple[str, str, int]:
""" Use this method to quickly examine the type and size of the object behind
the given ref.
:note: The method will only suffer from the costs of command invocation
once and reuses the command in subsequent calls.
:return: (hexsha, type_string, size_as_int)"""
cmd = self._get_persistent_cmd("cat_file_header", "cat_file", batch_check=True)
return self.__get_object_header(cmd, ref)
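
For context, the persistent command cached here is git cat-file --batch-check: a long-lived child process that reads object names on stdin and answers with "<hexsha> <type> <size>" lines. Roughly, it is the equivalent of the following standalone sketch (an illustration, not the library's code):

import subprocess

# Standalone equivalent of the persistent "cat_file_header" helper.
proc = subprocess.Popen(
    ['git', 'cat-file', '--batch-check'],
    cwd='test_repo',
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
proc.stdin.write(b'HEAD\n')
proc.stdin.flush()
hexsha, type_string, size = proc.stdout.readline().split()
print(hexsha.decode(), type_string.decode(), int(size))

# The child stays alive for further queries; unless someone closes its stdin
# and waits for it, it lingers - exactly the processes seen in the screenshot.
proc.stdin.close()
proc.wait()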

From the comments and names, it looks like this persistent behavior is intended. There is also the AutoInterrupt class here:

GitPython/git/cmd.py

Lines 367 to 373 in 5b3669e

class AutoInterrupt(object):
"""Kill/Interrupt the stored process instance once this instance goes out of scope. It is
used to prevent processes piling up in case iterators stop reading.
Besides all attributes are wired through to the contained process object.
The wait method was overridden to perform automatic status code checking
and possibly raise."""

which is returned from:

return self.AutoInterrupt(proc, command)

The class comment indicates that the process should be killed when the object goes out of scope. To me, it looks like an attempt to imitate the C++ RAII idiom. Unfortunately, in Python it does not work this way because of the garbage collector and the scoping rules: an object that has lost its last reference may be collected at any time afterwards, not at a fixed, predictable point. The "pythonic" way to get deterministic cleanup is a context manager (which this class is not).
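
A minimal sketch of what the context-manager approach gives you (hypothetical names, not the library's API): the cleanup runs at the end of the with-block, deterministically, rather than whenever the garbage collector finalizes the wrapper:

import contextlib
import subprocess

@contextlib.contextmanager
def scoped_process(args, **kwargs):
    # The finally-block runs when the with-block exits, even on exceptions,
    # so the child process can never outlive the scope that created it.
    proc = subprocess.Popen(args, **kwargs)
    try:
        yield proc
    finally:
        if proc.poll() is None:
            proc.terminate()
        proc.wait()

with scoped_process(['git', 'cat-file', '--batch-check'], cwd='test_repo',
                    stdin=subprocess.PIPE, stdout=subprocess.PIPE) as proc:
    proc.stdin.write(b'HEAD\n')
    proc.stdin.flush()
    print(proc.stdout.readline())
# the child process is guaranteed to be gone at this point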

My suggestion is to collect these processes and manage them at the library/repo level, or to wrap this in a context manager (which is better than calling wait() manually, because it also handles the case of exceptions). BTW, the created process is probably not even reused after the first call.

import git
r = git.Repo.init('test_repo') 
r.index.commit('aa')   
del r

Upd: I see there is an undocumented repo.close(), which can help, but it can have undesired side-effects because of the forced gc call it makes.
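
For illustration, using it would look roughly like this (assuming repo.close() does terminate the cached persistent git commands, as its name suggests):

import git
from git.util import rmtree

repo = git.Repo.init('test_repo')
try:
    repo.index.commit('aa')
finally:
    # terminates the cached persistent git processes,
    # but also forces a gc.collect() as a side-effect
    repo.close()

rmtree('test_repo')  # should now succeed on Windows: no child process holds a handle into .git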


Byron (Member) commented Sep 5, 2021

Thanks for all the research you put into investigating this; here are my thoughts.

The class comment indicates that the process should be killed when the object goes out of scope. To me, it looks like an attempt to imitate the RAII C++ idiom. Unfortunately, in Python it does not work this way because of garbage collector and the scoping rules

Indeed, this comes from a time long gone when Python relied on ref counting and the deterministic destruction of objects that comes with it. Since the adoption of the GC this behaviour is no more, and GitPython is somewhat broken because of it. On unix systems it's not noticeable; on Windows, however, it is. Generally, using GitPython in long-running processes should be avoided, and if it really has to run as part of a long-running process, that part should probably be well isolated.
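
A sketch of what such isolation could look like in practice (one possible interpretation, not an official recommendation): run the GitPython work in a short-lived worker process, so that when the worker exits its pipes are closed and the persistent git cat-file children, having lost their stdin, exit shortly after:

import multiprocessing

def do_git_work(path, message):
    # Everything GitPython spawns is a child of this worker process.
    import git
    repo = git.Repo.init(path)
    repo.index.commit(message)

if __name__ == '__main__':
    worker = multiprocessing.Process(target=do_git_work, args=('test_repo', 'aa'))
    worker.start()
    worker.join()
    # The worker is gone here, and with it the file handles that were
    # keeping shutil.rmtree() from removing the repository on Windows.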

My suggestion is to collect these processes and manage them on a library/repo level, or wrap this in a context manager (which is better than just calling wait() manually in case of exceptions).

I welcome any fix and would hope it can be done in a non-breaking, additive fashion. For example, those who write new code would use the context managers and get perfect resource cleanup, while existing code keeps working with the current behaviour.
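
One additive shape such a fix could take (purely illustrative; the class below is a stand-in, not GitPython's actual AutoInterrupt): add __enter__/__exit__ so that new callers can opt into deterministic cleanup with a with-statement, while old callers keep relying on the destructor exactly as before:

import subprocess

class AutoInterruptSketch(object):
    """Illustrative stand-in for an AutoInterrupt-like wrapper with an
    added context-manager protocol; not the library's actual class."""

    def __init__(self, proc, args):
        self.proc = proc
        self.args = args

    def _kill(self):
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            self.proc.wait()
        self.proc = None

    # existing behaviour: best-effort cleanup whenever the GC gets to it
    def __del__(self):
        self._kill()

    # additive behaviour: deterministic cleanup for callers that opt in
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._kill()
        return False  # do not swallow exceptions

# new code opts in and gets guaranteed cleanup:
with AutoInterruptSketch(subprocess.Popen(['git', '--version']), ('git', '--version')) as p:
    p.proc.wait()
# old code keeps working unchanged, relying on __del__ as before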

BTW, probably, the process created is not reused after the first call.

If that were the case, it would probably show in all kinds of places. But assuming it's true, it should be possible to remove the persistent process (or offer a flag not to use it), so avoiding the persistent process altogether could be another avenue for a fix.

Upd: I see there is an undocumented repo.close(), which can help, but can have undesired side-effects because of a forced gc call.

If better control over resources is a requirement, GitPython might not be adequate for the job. If I had to use Python and wanted to use git, I'd probably settle for libgit2 and its Python bindings, while separating everything that uses it into a separate process which can be shut down after use to free all system resources (I wouldn't trust libgit2 with that either).

@Jorricks

FWIW, I tested out repo.close() and it also did not close the two zombie processes for me on Mac.
When I ran ps -ef, I still got:

1001 33792 24077   0 10:57AM ttys002    0:03.99 python3 use_of_gitpython.py
1001 33852 33792   0 10:57AM ttys002    0:00.01 git cat-file --batch
1001 34024 33792   0 10:57AM ttys002    0:00.00 git cat-file --batch
