Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: optimize Git.is_tracked() #3053

Merged
merged 1 commit into from
Jan 10, 2020
Merged

dvc: optimize Git.is_tracked() #3053

merged 1 commit into from
Jan 10, 2020

Conversation

Suor
Copy link
Contributor

@Suor Suor commented Jan 4, 2020

Closes #3000.

Comment on lines +232 to +233
# There are 4 stages, see BaseIndexEntry.stage
return any((path, i) in self.repo.index.entries for i in (0, 1, 2, 3))
Copy link
Contributor

@efiop efiop Jan 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a new index version released at some point and this might break in a non-obvious way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you you have any benchmark-ish results on how much problems this is causing us right now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not index versions. From BaseIndexEntry.stage docstring:

Stage of the entry, either:

    * 0 = default stage
    * 1 = stage before a merge or common ancestor entry in case of a 3 way merge
    * 2 = stage of entries from the 'left' side of the merge
    * 3 = stage of entries from the right side of the merge

:note: For more information, see http://www.kernel.org/pub/software/scm/git/docs/git-read-tree.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor How much real-world problems is it causing us? Does it justify this hack at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tbh, glimpsing at it I don't quite understand what is going on.

Copy link
Contributor Author

@Suor Suor Jan 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benched it - doesn't help much, the reason is .index is not cached it's reread on each use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Listing 10k files in 10k git repo took 6:19 with old code and 5:16 with new one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor When do we do that? When we have 10K stage outputs? We don't list files inside dirs, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We list when we do dvc add -r dir.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor But we don't recomend doing that on thousands of files, and consider that a misuse. Hence why I'm questioning if we even need this change.

@Suor
Copy link
Contributor Author

Suor commented Jan 6, 2020

Benchmarked old and new code to walk 10k files in 10k repo, with and without index entries caching:

setting seconds
old code 379
new code 316
old code + caching 24
new code + caching 19

@Suor
Copy link
Contributor Author

Suor commented Jan 6, 2020

For 100k files in repo:

100k over 100k seconds
old code + caching 620
new code + caching 24

All these tests are about running _find_all_targets(dvc, 'dir', True).

@efiop
Copy link
Contributor

efiop commented Jan 6, 2020

@Suor what is index entries caching? Don't see it mentioned/explained anywhere around here.

@Suor
Copy link
Contributor Author

Suor commented Jan 7, 2020

This is how git.Repo.index property is implemented:

    @property
    def index(self):
        """:return: IndexFile representing this repository's index.
        :note: This property can be expensive, as the returned ``IndexFile`` will be
         reinitialized. It's recommended to re-use the object."""
        return IndexFile(self)

Then IndexFile lazily loads entries and stores them in the corresponding attribute, which doesn't matter since IndexFile is recreated on each access. This looks like a bug in gitpython.

For benching I cached it on our Git class:

    @cached_property
    def index(self):
        return self.repo.index

and then used self.index.entries instead of self.repo.index.entries in Git.is_tracked() - these are lines, which say '... + caching'.

@Suor
Copy link
Contributor Author

Suor commented Jan 7, 2020

Summary:

  • for 10k file repos it's ok to only fix index caching (probably need to fix in gitpython)
  • for bigger ones we will need both fixes: the one implemented here and index caching.

@efiop efiop merged commit 90152ea into iterative:master Jan 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf: _find_all_targets() has quadratic complexity because of Git.is_tracked()
3 participants