Caching for local packages #261
The main issue for local packages is that it is harder to decide if the cache is valid or not.
What about something like this: one silently builds the local package, takes a checksum and checks it against the cache. If building the package has failed, we throw an error. Or, a pragmatic approach: simply checksum the contents of the local directory. This might result in many false positives, but it's probably good enough for the use case.
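For illustration, a minimal sketch of the checksum idea in base R (the function name `local_pkg_digest()` is hypothetical, not part of pak):

```r
## Hypothetical sketch: derive one cache key from the contents of a local
## package directory, using only base R and tools::md5sum().
local_pkg_digest <- function(path) {
  files <- sort(list.files(path, recursive = TRUE))
  md5s <- tools::md5sum(file.path(path, files))
  ## Hash the per-file hashes together with the relative paths, so that
  ## renames and moves also invalidate the cache. tools::md5sum() only
  ## hashes files, so write the manifest to a temp file first.
  manifest <- paste(files, md5s, sep = "  ", collapse = "\n")
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(manifest, tmp)
  unname(tools::md5sum(tmp))
}
```

A cache hit would then just be `identical(local_pkg_digest(path), stored_key)`; a failed build still surfaces as an error, because nothing gets cached under the new key.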
If you have already built the (binary?) package, then you might as well install it; that is fast.
Yeah, something like that would work. What is a false positive here? A cache miss that should not happen? Or a cache hit that should not happen?
The tricky stuff is that if you build-time depend on some package, and that has changed, that should ideally trigger a re-install. Similarly, if the system dependencies have changed, that should also trigger a re-install. It does not seem possible to solve this 100%, and it is hard to give a solution that won't break some people's workflow.
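One way to approximate the build-time dependency part (purely a sketch; pak may track dependencies differently) is to fold the installed versions of those dependencies into the cache key, so that upgrading any of them forces a rebuild:

```r
## Hypothetical sketch: extend the cache key with the installed versions of
## the package's build-time dependencies (e.g. Imports/LinkingTo entries),
## so upgrading any of them invalidates the cached build.
dep_versions_key <- function(deps, lib = .libPaths()) {
  vers <- vapply(
    deps,
    function(p) tryCatch(
      as.character(utils::packageVersion(p, lib.loc = lib)),
      error = function(e) "missing"
    ),
    character(1)
  )
  paste(deps, vers, sep = "=", collapse = ";")
}
```

The final key could then be the directory digest concatenated with this string. System dependencies stay out of reach of this scheme, which is the part that seems impossible to solve fully.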
Yeah, I see what you mean. I think the problem is that there are really two use cases: local package installation for development purposes, where the current behavior of always reinstalling makes sense (probably how this is used in most cases), and just installing a package that has for some reason not been published yet. My use case that motivated this inquiry is that I have a very big data export/validation/transformation pipeline. To prevent its dependencies from interfering with the usual R system, it installs packages into a private library; there is basically a …
In theory we could also try to use the mtime of the files/directories, but that is unreliable on some systems, so better to stay away from it. I think hashing the files in the local dir could work reasonably well. There is already a …
Relying on an extra option would introduce additional complexity and arguably make the user experience worse, so I'm not really a fan. Maybe one can have two different types of local packages: …
@gaborcsardi If you don't have any other comments I would close it in favor of #14.
Oh, I do want to have caching support for local packages, so let's keep this open. :) I think more people would use this than git remotes, actually. But I understand if you want to focus on #14 instead of this.
Btw. this is a good summary of the issues with mtimes: https://apenwarr.ca/log/20181113 |
I think the ultimate question is what caching should accomplish: avoiding doing the work (e.g. building the package), or avoiding installing the package if it has not changed? If it is the first, then yes, it is going to be more tricky (I don't know if …).

My approach would be to checksum the directory contents (I wouldn't bother with mtimes TBH, too much hassle); that's performant enough on modern machines and with usual package sizes. One would need to take care of build-time dependencies, though. As to system configuration changes... for that one might need a "force rebuild" option of some sort.
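Concretely, the lookup could be as simple as the following sketch. All the names here (`build_binary()`, `local_pkg_digest()`, the cache layout) are assumptions for illustration, not pak's real internals:

```r
## Hypothetical sketch: install a local package via a content-addressed
## cache of built packages, with a `force` escape hatch for system-level
## changes that the digest cannot see.
install_local_cached <- function(path, lib, cache_dir, force = FALSE) {
  key <- local_pkg_digest(path)                 # digest sketched earlier
  cached <- file.path(cache_dir, paste0(key, ".tar.gz"))
  if (force || !file.exists(cached)) {
    built <- build_binary(path)                 # assumed build helper
    dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
    file.copy(built, cached, overwrite = TRUE)
  }
  ## Installing a cached build is cheap, essentially an unpack.
  install.packages(cached, lib = lib, repos = NULL)
  invisible(key)
}
```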
Well, both would solve my current "problem", and I think that both should be done ultimately, so I am happy to contribute. Caching local files is probably the simpler thing (all the tricky corner cases aside).
Avoiding the install is less important, I think. Installing the binary package is fast, basically an unzip/untargz. I don't want to be paranoid, hence my opinion re the checksum. I would not even worry about build-time dependencies yet. The existing …

FWIW, a quick experiment on the BH package, which is relatively big and has many files:

```
❯ system.time(d <- dir(recursive = TRUE))
   user  system elapsed
  0.059   0.039   0.097
❯ system.time(mt <- file.mtime(d))
   user  system elapsed
  0.003   0.017   0.020
❯ system.time(fs <- file.size(d))
   user  system elapsed
  0.003   0.017   0.020
❯ system.time(md5 <- tools::md5sum(d))
   user  system elapsed
  0.254   0.124   0.378
❯ length(d)
[1] 11691
❯ system("du -hs .")
144M    .
```

This is a pretty fast machine, so anywhere else it is probably slower, especially on Windows. But nevertheless this is encouraging.
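To gauge the end-to-end cost, one could time the hypothetical `local_pkg_digest()` sketched earlier on the same tree:

```r
## Assumes the local_pkg_digest() sketch from earlier in the thread.
system.time(key <- local_pkg_digest("."))
```

Given the numbers above, the per-file hashing should dominate, so something on the order of half a second for a package the size of BH.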
Looks great! So even with paranoid checksumming we can expect the check to complete in well under a second for an average machine and package size. I will probably have some time next week to look into implementing this. Are there any docs on what …
I noticed local packages are currently not cached. If this feature is considered to be of interest, I can try to implement caching. From a cursory glance I suppose this involves manipulation of `satisfy_remote_local()` and/or `installedok_remote_local()`? Any pointers on where to start?
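For orientation, an installedok-style check for local packages could conceivably compare a digest recorded at install time against a freshly computed one. A hypothetical sketch (the `LocalDigest` field and the function name are made up, and this is not the real pkgdepends interface):

```r
## Hypothetical sketch: treat an installed local package as "ok" only if a
## digest stored in its DESCRIPTION still matches the source directory,
## using the local_pkg_digest() sketched earlier.
installedok_local_sketch <- function(pkg, lib, source_path) {
  recorded <- utils::packageDescription(pkg, lib.loc = lib,
                                        fields = "LocalDigest")
  !is.na(recorded) &&
    identical(recorded, unname(local_pkg_digest(source_path)))
}
```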