Caching for local packages #261
The main issue for local packages is that it is harder to decide if the cache is valid or not.
What about something like this: one silently builds the local package, takes a checksum and checks it against the cache. If building the package has failed, we throw an error. Or, a pragmatic approach: simply checksum the contents of the local directory. This might result in many false positives, but it's probably good enough for the use case.
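For illustration, a minimal sketch of the checksum idea in base R (the function name `local_pkg_digest()` is hypothetical, not part of pak):

```r
## Hypothetical sketch: derive one cache key from the contents of a local
## package directory, using only base R and tools::md5sum().
local_pkg_digest <- function(path) {
  files <- sort(list.files(path, recursive = TRUE))
  md5s <- tools::md5sum(file.path(path, files))
  ## Hash the per-file hashes together with the relative paths, so that
  ## renames and moves also invalidate the cache. tools::md5sum() only
  ## hashes files, so write the manifest to a temp file first.
  manifest <- paste(files, md5s, sep = "  ", collapse = "\n")
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(manifest, tmp)
  unname(tools::md5sum(tmp))
}
```

A cache hit would then just be `identical(local_pkg_digest(path), stored_key)`; a failed build still surfaces as an error, because nothing gets cached under the new key.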
If you have already built the (binary?) package, then you might as well install it; that is fast.
Yeah, something like that would work. What is a false positive here? A cache miss that should not happen? Or a cache hit that should not happen?
The tricky stuff is that if you build-time depend on some package, and that has changed, that should ideally trigger a re-install. Similarly, if the system dependencies have changed, that should also trigger a re-install. It does not seem possible to solve this 100%, and it is hard to give a solution that won't break some people's workflow.
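One way to approximate the build-time dependency part (purely a sketch; pak may track dependencies differently) is to fold the installed versions of those dependencies into the cache key, so that upgrading any of them forces a rebuild:

```r
## Hypothetical sketch: extend the cache key with the installed versions of
## the package's build-time dependencies (e.g. Imports/LinkingTo entries),
## so upgrading any of them invalidates the cached build.
dep_versions_key <- function(deps, lib = .libPaths()) {
  vers <- vapply(
    deps,
    function(p) tryCatch(
      as.character(utils::packageVersion(p, lib.loc = lib)),
      error = function(e) "missing"
    ),
    character(1)
  )
  paste(deps, vers, sep = "=", collapse = ";")
}
```

The final key could then be the directory digest concatenated with this string. System dependencies stay out of reach of this scheme, which is the part that seems impossible to solve fully.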
Yeah, I see what you mean. I think the problem is that there are really two use cases: local package installation for development purposes, where the current behavior of always reinstalling makes sense (probably how this is used in most cases), and just installing a package that has for some reason not been published yet. My use case that motivated this inquiry is that I have a very big data export/validation/transformation pipeline. To prevent its dependencies from interfering with the usual R system, it installs packages into a private library; there is basically a …
In theory we could also try to use the mtime of the files/directories, but that is unreliable on some systems, so better to stay away from it. I think hashing the files in the local dir could work reasonably well. There is already a …
Relying on an extra option would introduce additional complexity and arguably make the user experience worse, so I'm not really a fan. Maybe one can have two different types of local packages: …
@gaborcsardi If you don't have any other comments I would close it in favor of #14.
Oh, I do want to have caching support for local packages, so let's keep this open. :) I think more people would use this than git remotes, actually. But I understand if you want to focus on #14 instead of this.
Btw. this is a good summary of the issues with mtimes: https://apenwarr.ca/log/20181113 |
I think the ultimate question is what caching should accomplish: avoiding doing the work (e.g. building the package), or avoiding installing the package if it has not changed? If it is the first, then yes, it is going to be more tricky (I don't know if …).

My approach would be to checksum the directory contents (I wouldn't bother with mtimes TBH, too much hassle); that's performant enough on modern machines and with usual package sizes. One would need to take care of build-time dependencies, though. As to system configuration changes... for that one might need a "force rebuild" option of some sort.
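Concretely, the lookup could be as simple as the following sketch. All the names here (`build_binary()`, `local_pkg_digest()`, the cache layout) are assumptions for illustration, not pak's real internals:

```r
## Hypothetical sketch: install a local package via a content-addressed
## cache of built packages, with a `force` escape hatch for system-level
## changes that the digest cannot see.
install_local_cached <- function(path, lib, cache_dir, force = FALSE) {
  key <- local_pkg_digest(path)                 # digest sketched earlier
  cached <- file.path(cache_dir, paste0(key, ".tar.gz"))
  if (force || !file.exists(cached)) {
    built <- build_binary(path)                 # assumed build helper
    dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
    file.copy(built, cached, overwrite = TRUE)
  }
  ## Installing a cached build is cheap, essentially an unpack.
  install.packages(cached, lib = lib, repos = NULL)
  invisible(key)
}
```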
Well, both would solve my current "problem", and I think that both should be done ultimately, so I am happy to contribute. Caching local files is probably the simpler thing (all the tricky corner cases aside).
Avoiding the install is less important, I think. Installing the binary package is fast, basically an unzip/untargz. I don't want to be paranoid, hence my opinion re the checksum. I would not even worry about build-time dependencies yet. The existing …

FWIW, a quick experiment on the BH package, which is relatively big and has many files:

```
❯ system.time(d <- dir(recursive = TRUE))
   user  system elapsed
  0.059   0.039   0.097
❯ system.time(mt <- file.mtime(d))
   user  system elapsed
  0.003   0.017   0.020
❯ system.time(fs <- file.size(d))
   user  system elapsed
  0.003   0.017   0.020
❯ system.time(md5 <- tools::md5sum(d))
   user  system elapsed
  0.254   0.124   0.378
❯ length(d)
[1] 11691
❯ system("du -hs .")
144M    .
```

This is a pretty fast machine, so anywhere else it is probably slower, especially on Windows. But nevertheless this is encouraging.
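To gauge the end-to-end cost, one could time the hypothetical `local_pkg_digest()` sketched earlier on the same tree:

```r
## Assumes the local_pkg_digest() sketch from earlier in the thread.
system.time(key <- local_pkg_digest("."))
```

Given the numbers above, the per-file hashing should dominate, so something on the order of half a second for a package the size of BH.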
Looks great! So even with paranoid checksumming we can expect the check to complete in well under a second for an average machine and package size. I will probably have some time next week to look into implementing this. Are there any docs on what …
I noticed local packages are currently not cached. If this feature is considered to be of interest, I can try to implement caching. From a cursory glance I suppose this involves manipulation of `satisfy_remote_local()` and/or `installedok_remote_local()`? Any pointers on where to start?
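For orientation, an installedok-style check for local packages could conceivably compare a digest recorded at install time against a freshly computed one. A hypothetical sketch (the `LocalDigest` field and the function name are made up, and this is not the real pkgdepends interface):

```r
## Hypothetical sketch: treat an installed local package as "ok" only if a
## digest stored in its DESCRIPTION still matches the source directory,
## using the local_pkg_digest() sketched earlier.
installedok_local_sketch <- function(pkg, lib, source_path) {
  recorded <- utils::packageDescription(pkg, lib.loc = lib,
                                        fields = "LocalDigest")
  !is.na(recorded) &&
    identical(recorded, unname(local_pkg_digest(source_path)))
}
```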