issue with diff method #501

Open
valeriocos opened this issue Aug 21, 2016 · 4 comments

@valeriocos

valeriocos commented Aug 21, 2016

I'm using GitPython to collect diff information between a commit and its parent.
The following code generally works fine when the number of diffs to retrieve is small:
diffs = c.parents[0].diff(c, create_patch=True)
However, when the number of diffs is huge (https://git.eclipse.org/c/papyrus/org.eclipse.papyrus.git/commit/?id=f5f817279baa2008450aa32b18e576c2fcda02bb), the same call produces no output even after at least 24 hours.
Is there another way to retrieve the diff information between two commits?

Below you can find the code to replicate this behaviour:

from git import *

REPO_PATH = "C:/Users/.../org.eclipse.papyrus"  # you can clone it from here: https://git.eclipse.org/c/papyrus/org.eclipse.papyrus.git/

BRANCH = "2.0.0"

def main():
    repo = Repo(REPO_PATH, odbt=GitCmdObjectDB)
    reference = [r for r in repo.references if r.name == BRANCH][0]
    for c in repo.iter_commits(rev=reference):
        if c.hexsha == 'f5f817279baa2008450aa32b18e576c2fcda02bb':
            diffs = c.parents[0].diff(c, create_patch=True)
            print(len(diffs))
            break

if __name__ == "__main__":
    main()
@Byron
Member

Byron commented Aug 21, 2016

Unfortunately, I cannot reproduce the issue, despite the excellent reproduction script. This is what I did:

  • git clone http://git.eclipse.org/gitroot/papyrus/org.eclipse.papyrus.git
  • time python reproduce.py

The latter produced this output:

➜  GitPython git:(master) ✗ time python reproduce.py
7241
python reproduce.py  4.97s user 0.66s system 99% cpu 5.670 total

It appears that something else is going on. Maybe you are not using the latest version? Or it may be something specific to Windows. In any case, we will have to dig deeper to find a solution for this one.
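
To rule out the version question, the installed release can be read from the package metadata. This is a minimal sketch, assuming Python 3.8 or newer and that the distribution is registered under the name "GitPython"; the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    print("GitPython:", installed_version("GitPython"))
```

Running it on both machines would show whether the two setups actually differ in version before looking at platform-specific causes.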

The actual script I ended up using is below.

from git import *

REPO_PATH = "./org.eclipse.papyrus"

BRANCH = "2.0.0"

def main():
    repo = Repo(REPO_PATH, odbt=GitCmdObjectDB)
    reference = [r for r in repo.references if r.name == BRANCH][0]
    for c in repo.iter_commits(rev=reference):
        if c.hexsha == 'f5f817279baa2008450aa32b18e576c2fcda02bb':
            diffs = c.parents[0].diff(c, create_patch=True)
            print(len(diffs))
            break

if __name__ == "__main__":
    main()

For completeness, here is the memory usage when trying to show the diff in the web GUI, which also took a long time to load.
[Screenshot of memory usage, 2016-08-21 20:27:57]

@valeriocos
Author

valeriocos commented Aug 22, 2016

I've updated GitPython to the latest version (2.0.8), but the problem is still there. As you said, it may be Windows-related.
I found a workaround that seems to work fine; the code is below.

from git import *

REPO_PATH = "./org.eclipse.papyrus"

BRANCH = "2.0.0"

def main():
    diffs = []
    repo = Repo(REPO_PATH, odbt=GitCmdObjectDB)
    reference = [r for r in repo.references if r.name == BRANCH][0]
    for c in repo.iter_commits(rev=reference):
        if c.hexsha == 'f5f817279baa2008450aa32b18e576c2fcda02bb':
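            # List the changed files first, then diff them one path at a time.
            files = repo.git.execute(["git", "diff", "--name-only", c.parents[0].hexsha, c.hexsha]).split('\n')
            for f in files:
                diff = c.parents[0].diff(c, paths=f, create_patch=True)
                diffs.extend(diff)
            break  # the target commit has been handled; stop walking history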

if __name__ == "__main__":
    main()
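
The workaround above starts one diff per changed file, which means thousands of calls for a commit of this size. A possible middle ground is to pass several paths per call. The sketch below assumes the `paths` argument of `diff` also accepts a list of paths; the helper names `chunked` and `diff_in_batches`, the `diff_fn` callable, and the batch size of 200 are all made up for illustration. Whether batching still avoids the hang on Windows is untested.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def diff_in_batches(paths, diff_fn, batch_size=200):
    """Call diff_fn once per batch of paths and concatenate the results.

    diff_fn is a hypothetical callable supplied by the caller, for example:
        lambda batch: parent.diff(commit, paths=batch, create_patch=True)
    """
    # Drop empty names that can appear when splitting command output.
    paths = [p for p in paths if p]
    results = []
    for batch in chunked(paths, batch_size):
        results.extend(diff_fn(batch))
    return results
```

Keeping the diff call behind a callable means the batching logic can be tried with a stub before it is pointed at a real repository, and the batch size can be lowered if a single call still produces too much output.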

@Byron
Member

Byron commented Aug 23, 2016

Thanks for the feedback, and for posting the workaround!
Given that the project is no longer tested on Windows and supports Windows only on a 'best-effort' basis, I believe nothing can be done here to fix this particular case.
Thus I am closing this issue. If you disagree or would like to contribute some sort of fix, please let me know in the comments.

@ankostis
Contributor

I can definitely reproduce this. The git.diff code was retrofitted in #519 to use threads when reading the stream, but I have still seen a case where it blocked on particularly big streams.
Additionally using queues might solve the problem for good. See http://eyalarubas.com/python-subproc-nonblock.html and http://stackoverflow.com/a/4896288/548792
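
For reference, the thread-plus-queue pattern from those links looks roughly like the following. It is a standalone sketch using only the standard library, not GitPython's actual code; the names `_pump` and `run_and_drain` and the 64 KiB read size are made up for illustration. Each pipe gets its own reader thread, so a child process that writes a lot to one stream cannot stall because the other pipe's buffer is full.

```python
import queue
import subprocess
import sys
import threading


def _pump(stream, tag, out_queue):
    """Read a pipe to EOF on a worker thread, pushing (tag, chunk) items."""
    try:
        for chunk in iter(lambda: stream.read(65536), b""):
            out_queue.put((tag, chunk))
    finally:
        stream.close()
        out_queue.put((tag, None))  # sentinel: this stream is finished


def run_and_drain(cmd):
    """Run cmd, draining stdout and stderr concurrently so neither pipe fills up."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    q = queue.Queue()
    for stream, tag in ((proc.stdout, "out"), (proc.stderr, "err")):
        threading.Thread(target=_pump, args=(stream, tag, q), daemon=True).start()

    collected = {"out": [], "err": []}
    open_streams = 2
    while open_streams:
        tag, chunk = q.get()
        if chunk is None:
            open_streams -= 1  # one of the two pipes reached EOF
        else:
            collected[tag].append(chunk)
    return proc.wait(), b"".join(collected["out"]), b"".join(collected["err"])


if __name__ == "__main__":
    # Hypothetical usage: a child process that writes heavily to both streams.
    child = "import sys; sys.stdout.write('x' * 300000); sys.stderr.write('y' * 300000)"
    code, out, err = run_and_drain([sys.executable, "-c", child])
    print(code, len(out), len(err))
```

The main thread only ever blocks on the queue, never on a pipe, which is the property the linked articles rely on.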

ankostis reopened this Oct 11, 2016