
Caching for local packages #261

Open
tzakharko opened this issue Dec 1, 2021 · 13 comments
Labels
feature: a feature request or enhancement

Comments

@tzakharko

I noticed that local packages are currently not cached. If this feature is considered to be of interest, I can try to implement caching. From a cursory glance, I suppose this involves manipulating satisfy_remote_local() and/or installedok_remote_local()? Any pointers on where to start?

@gaborcsardi
Member

The main issue for local packages is that it is harder to decide if the cache is valid or not.

@tzakharko
Author

What about something like this: one silently builds the local package, takes a checksum, and checks it against the cache. If building the package fails, we throw an error.

Or, a pragmatic approach: simply checksum the contents of the local directory. This might result in many false positives, but it's probably good enough for the use case.
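
A minimal sketch of that pragmatic approach in base R, for concreteness. local_tree_hash() is a hypothetical helper, not anything in pkgdepends: it hashes every file in the source tree and folds the per-file hashes into one key, so the key changes exactly when a file's content (or the file list) changes.

```r
# Hypothetical helper (not pkgdepends API): fingerprint a local package
# directory by content, ignoring mtimes entirely.
local_tree_hash <- function(path) {
  rel <- sort(list.files(path, recursive = TRUE, all.files = TRUE))
  hashes <- tools::md5sum(file.path(path, rel))
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  # one line per file: relative path + content hash; then hash the listing
  writeLines(paste(rel, hashes), tmp)
  unname(tools::md5sum(tmp))
}

local_tree_hash("path/to/mypkg")  # a single 32-character key for the whole tree
```

A cache hit would then simply mean "the key stored for this path equals the current key".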

@gaborcsardi
Member

> What about something like this: one silently builds the local package, takes a checksum, and checks it against the cache. If building the package fails, we throw an error.

If you already build the (binary?) package, then you might as well install it, that is fast.

> Or, a pragmatic approach: simply checksum the contents of the local directory. This might result in many false positives, but it's probably good enough for the use case.

Yeah, something like that would work. What is a false positive here? A cache miss that should not happen? Or a cache hit that should not happen?

@gaborcsardi
Member

gaborcsardi commented Dec 1, 2021

The tricky stuff is that if you build-time depend on some package, and that has changed, that should ideally trigger a re-install. Similarly, if the system dependencies have changed, that should also trigger a re-install. It does not seem possible to solve this 100%, and it is hard to give a solution that won't break some people's workflows.

@tzakharko
Author

> The tricky stuff is that if you build-time depend on some package, and that has changed, that should ideally trigger a re-install. Similarly, if the system dependencies have changed, that should also trigger a re-install. It does not seem possible to solve this 100%, and it is hard to give a solution that won't break some people's workflows.

Yeah, I see what you mean. I think the problem is that there are really two use cases: local package installation for development purposes, where the current behavior of always reinstalling makes sense (probably how this is used in most cases), and just installing a package that has for some reason not been published yet.

The use case that motivated this inquiry is a very big data export/validation/transformation pipeline. To prevent its dependencies from interfering with the usual R system, it installs packages into a private library; there is basically a pkg::pkg_install(deps) step on every pipeline invocation, which works great. But the internal packages (which are provided as source trees) get reinstalled every single time, which is not a big deal honestly, but it does bug me a little bit :)
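
As a stopgap for exactly this kind of pipeline, a user-land wrapper can skip the reinstall when nothing changed. A sketch reusing the hypothetical local_tree_hash() from above; install.packages() stands in for whatever installer the pipeline uses, and a real version would verify that the install succeeded before recording the key:

```r
# Hypothetical wrapper: reinstall a local source package only when its
# content fingerprint differs from the one recorded at the last install.
install_local_cached <- function(path, lib, key_file) {
  key <- local_tree_hash(path)
  old <- if (file.exists(key_file)) readLines(key_file, n = 1L) else ""
  changed <- !identical(key, old)
  if (changed) {
    install.packages(path, repos = NULL, type = "source", lib = lib)
    writeLines(key, key_file)  # record the key for the next invocation
  }
  invisible(changed)
}
```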

@gaborcsardi
Member

In theory we could also try to use the mtime of the files/directories, but that is unreliable on some systems, so better to stay away from it, I think.

I think hashing the files in the local dir could work reasonably well. There is already a ?nocache option to opt out of the cache, so we would observe that and skip the hashing + caching altogether. We could also have a ?cache option to force the cache for a local package, maybe that's useful, but I am not convinced.
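
For reference, this is the parameter syntax in question, shown via the pak frontend for illustration and assuming the usual pkgdepends ref parameters; ?nocache exists today, while ?cache is the hypothetical opt-in discussed here:

```r
pak::pkg_install("local::./mypkg?nocache")  # existing: skip hashing + caching
pak::pkg_install("local::./mypkg?cache")    # hypothetical: force-cache a local package
```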

@tzakharko
Author

Relying on an extra option would introduce additional complexity and arguably make the user experience worse, so I'm not really a fan. Maybe one could have two different types of local packages, a local one that always builds and another that is cached, but I think this quickly becomes a bit ridiculous. I would be happy with simply supporting git remotes (#14), as that would solve all my woes and is probably a more useful approach anyway.

@tzakharko
Author

@gaborcsardi If you don't have any other comments, I would close this in favor of #14.

@gaborcsardi
Member

gaborcsardi commented Dec 1, 2021

Oh, I do want to have caching support for local packages, so let's keep this open. :) I think more people would use this than git remotes, actually. But I understand if you want to focus on #14 instead of this.

@gaborcsardi reopened this Dec 1, 2021
@gaborcsardi
Member

Btw. this is a good summary of the issues with mtimes: https://apenwarr.ca/log/20181113
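
A tiny base-R illustration of the failure mode (the exact timestamp behavior varies by filesystem): pinning an old mtime after an edit fools a timestamp check, while a content hash still catches it.

```r
f <- tempfile()
writeLines("version 1", f)
t0 <- file.mtime(f)
h0 <- tools::md5sum(f)
writeLines("version 2", f)    # change the content ...
Sys.setFileTime(f, t0)        # ... but restore the old timestamp
file.mtime(f) == t0           # TRUE: an mtime-based cache sees "unchanged"
tools::md5sum(f) == h0        # FALSE: a content hash catches the edit
unlink(f)
```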

@tzakharko
Author

I think the ultimate question is what caching should accomplish: avoiding doing work (e.g. building the package), or avoiding installing the package if it has not changed? If it is the first, then yes, it is going to be trickier (I don't know if pkgdepends already looks at build-time dependencies, for example, or whether that is something that should be added). If you want to be paranoid about these things, rebuilding the package (but staying silent unless it actually changed) would work, but it will take extra time and work.

My approach would be to checksum the directory contents (I wouldn't bother with mtimes, TBH, too much hassle); that's performant enough on modern machines and with usual package sizes. One would need to take care of build-time dependencies, though (see the sketch after this comment). As to system configuration changes... for that one might need a "force rebuild" option of some sort.

> But I understand if you want to focus on #14 instead of this.

Well, both would solve my current "problem", and I think both should ultimately be done, so I am happy to contribute. Caching local files is probably the simpler task (all the tricky corner cases aside).
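
If build-time dependencies do need handling later, one way is to fold their installed versions into the same key, so updating a dependency invalidates the cache. A sketch under the same assumptions as before: local_tree_hash() is the hypothetical helper above, and deps would come from whatever dependency resolution is in place.

```r
# Hypothetical: a cache key that also changes when a build-time
# dependency is updated, not only when the source tree changes.
local_cache_key <- function(path, deps = character()) {
  dep_vers <- vapply(deps, function(d) {
    tryCatch(as.character(utils::packageVersion(d)),
             error = function(e) NA_character_)  # NA if not installed
  }, character(1))
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeLines(c(local_tree_hash(path), paste(deps, dep_vers)), tmp)
  unname(tools::md5sum(tmp))
}
```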

@gaborcsardi
Member

Avoiding the install is less important, I think. Installing the binary package is fast, basically an unzip/untargz.

I don't want to be paranoid, hence my suggestion re the checksum. I would not even worry about build-time dependencies yet. The existing ?nocache or ?reinstall options already force a rebuild + reinstall.

FWIW, here is a quick experiment on the BH package, which is relatively big and has many files.

❯ system.time(d <- dir(recursive=TRUE))
   user  system elapsed
  0.059   0.039   0.097

❯ system.time(mt <- file.mtime(d))
   user  system elapsed
  0.003   0.017   0.020

❯ system.time(fs <- file.size(d))
   user  system elapsed
  0.003   0.017   0.020

❯ system.time(md5 <- tools::md5sum(d))
   user  system elapsed
  0.254   0.124   0.378

❯ length(d)
[1] 11691

❯ system("du -hs .")
144M	.

This is a pretty fast machine, so most other setups are probably slower, especially on Windows. But this is encouraging nevertheless.

@tzakharko
Author

Looks great! So even with paranoid checksumming we can expect the check to run well under a second for an average machine and package size.

I will probably have some time next week to look into implementing this. Are there any docs on what satisfy() and installedok() are supposed to do?
