Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

external_repo doesn't respect parent cache and progress bar #3611

Closed
efiop opened this issue Apr 8, 2020 · 7 comments · Fixed by #3841
Closed

external_repo doesn't respect parent cache and progress bar #3611

efiop opened this issue Apr 8, 2020 · 7 comments · Fixed by #3841
Assignees
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@efiop
Copy link
Contributor

efiop commented Apr 8, 2020

Somewhere during the external_repo refactor we've lost the logic that was using parent's cache dir to store cache. Currently, we are using tempdir (yes, in /tmp) to pull data too, which is really bad, as it might not have enough space available. Plus, we don't respect link type, trying to always copy stuff, which is also really bad. And on top of that we've lost our progress bar for when we are checking out the data after downloading during import/get.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Apr 8, 2020
@efiop efiop added bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. labels Apr 8, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Apr 8, 2020
@efiop efiop self-assigned this Apr 8, 2020
@shcheklein
Copy link
Member

lack of tests cc @Suor

@efiop
Copy link
Contributor Author

efiop commented Apr 8, 2020

I guess it is lack of tests before the refactoring as well. Then refactor broke it without us noticing. Will add some.

@efiop
Copy link
Contributor Author

efiop commented Apr 8, 2020

Ok, update: we do respect link type by chance, because of self.commit() in stage.py after we download it. Still the fact that we download to /tmp, then copy! it to local workspace and only then we move it to cache and create a link is really bad.

@Suor
Copy link
Contributor

Suor commented Apr 9, 2020

This was intentional. This allows sharing external repo caches outside of repo, which might not even exist, like in dvcx.

@Suor
Copy link
Contributor

Suor commented Apr 9, 2020

Also, I see polluting current repo cache is bad too.

@Suor
Copy link
Contributor

Suor commented Apr 9, 2020

The thing is external repos don't have parent, they are completely independent.

@Suor
Copy link
Contributor

Suor commented Apr 9, 2020

Looked at how import is implemented, looks like it downloads directly to targets repo cache, via the hack:

if hasattr(repo, "cache"):
    repo.cache.local.cache_dir = self.repo.cache.local.cache_dir

Then Output.checkout() is called with path set to tmp path, then thing is moved from tmp path to needed location, tmp is removed. The tmp path/move part was introduced for partial get/imports and is not in /tmp, but next to dest. Copying in ExternalRepo.pull_to() is only used for git files not cached ones.

Since we call .checkout() on out from external repo it will probably not use cache type of the parent repo. So the issue appeared when we moved from copying dest out to using its path_info directly. I.e. moved from:

out = self.fetch()
to.info = copy.copy(out.info)
to.checkout()

to

out = self.fetch()
out.path_info = to.path_info
out.checkout()

This was part of making ExternalRepo.

So I suggest:

  • changing DependencyRepo.download() hack to smth like: repo.cache.local = self.repo.cache.local, to make it exatly the same, i.e. use cache type and everything
  • changing DependencyRepo.download() to accept PathInfo instead of to output, this is confusing and wrong
  • look at how we do partial pull/checkout and use it directly in ExternalRepo._pull_cached() instead of making tmp copy

We will still use caches in /tmp on every external repo use but imports though. Not sure if this is the issue or anything should be done with it.

On progress bar, not sure what you mean. Does cache have some associated pbar with it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP.
Projects
None yet
4 participants