Build artifacts caching support for non-relocatable binaries #480

Open
yunxing opened this issue Sep 29, 2016 · 16 comments

@yunxing
Contributor

yunxing commented Sep 29, 2016

The current caching mechanism assumes all binaries are relocatable. However, that's not always the case. As an example, the ocaml compiler itself has hardcoded paths inside the binary.

The problem today is that if we build ocaml in project A and then use the cache when building project B, the artifacts in project B will point to project A, where they were originally built. This means that if we remove project A, B will stop working.

One possible solution to this problem is to always run the build in a configurable directory, instead of in node_modules in the current project. After the build we can then copy the build artifacts back to the destination (either project A or B). This way, the non-relocatable artifacts will only depend on a directory that users are aware of.
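A minimal sketch of the "build in a fixed directory, then copy" idea, assuming Node.js. The names `copyDir` and `buildThenCopy`, and the `build` callback standing in for a package's real install script, are hypothetical and not part of yarn.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively copy srcDir into destDir (plain files and directories only).
function copyDir(srcDir, destDir) {
  fs.mkdirSync(destDir, { recursive: true });
  for (const entry of fs.readdirSync(srcDir, { withFileTypes: true })) {
    const src = path.join(srcDir, entry.name);
    const dest = path.join(destDir, entry.name);
    if (entry.isDirectory()) {
      copyDir(src, dest);
    } else {
      fs.copyFileSync(src, dest);
    }
  }
}

// Hypothetical helper: run `build` once inside a fixed, user-configured
// directory, then copy the result into the project's node_modules.
function buildThenCopy(buildRoot, projectDir, pkgName, build) {
  const buildDir = path.join(buildRoot, pkgName);
  if (!fs.existsSync(buildDir)) {
    fs.mkdirSync(buildDir, { recursive: true });
    build(buildDir); // any hardcoded paths now point into buildRoot
  }
  const dest = path.join(projectDir, 'node_modules', pkgName);
  copyDir(buildDir, dest);
  return dest;
}
```

With this shape, projects A and B both receive artifacts whose hardcoded paths refer to `buildRoot`, so removing project A would not affect B. Removing `buildRoot` still would.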

To repro:

mkdir A; cd A
yarn add @opam-alpha/ocaml
cd ..; mkdir B; cd B
yarn add @opam-alpha/ocaml # This should be built instantly from cache
rm -rf ../A
node_modules/ocaml/bin/ocaml # ocaml in project B won't work anymore

@bestander @jordwalke @dxu

@yunxing
Contributor Author

yunxing commented Sep 29, 2016

This is not blocking us, but we have to apply a workaround: we always remove the cache before installation.

@dxu
Contributor

dxu commented Sep 29, 2016

Looking over the package install scripts, it looks like yarn copies the build artifacts to the cache post install. One way to address this might be to do the build in the cache (as the configurable directory you're referring to).

Related, and this might have a really obvious solution, but why do package managers like npm/bower/yarn all install local copies of all of the packages as opposed to having a central store of packages installed, and then symlinking them, similar to how npm link works with local packages? This would solve this issue, and maintain a single copy of (a version of) a dependency on the system.

@yunxing
Contributor Author

yunxing commented Sep 30, 2016

@dxu Building this in cache should work.

I think the reason that npm or other package managers don't do this is that yarn tries to solve both of:

  1. compile things
  2. local sandbox

npm doesn't bother caching build artifacts because few packages have install fields. (npm doesn't even have parallel installation support.)

@dxu
Contributor

dxu commented Sep 30, 2016

Yeah, I can understand npm not doing this: there also isn't any notion of a central place to store all packages apart from global packages or an offline cache. I guess I was wondering if you guys have thought about doing it, and what the disadvantages of doing so would be.

My confusion is mainly about why it's important to have both the cache in yarn and the local node_modules copy (esp. if you start building pkgs in the cache), as opposed to just keeping everything in ~/.yarn and symlinking the node_modules folder to the dependency fetched/built in ~/.yarn, which is basically how npm link works for another local package. As a dev, I generally don't really care/think about the files in the node_modules folder. I don't know if this would break in cases where the developer is extremely concerned with a strict sandbox, but I feel like that doesn't need to be the default. You'd be saving time (symlink vs copying files in the install process) and space (single source of truth for packages). Would love to be enlightened on this!
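To make the central-store idea concrete, here is a rough sketch assuming Node.js. `linkFromStore` and its directory layout are hypothetical and do not describe how yarn or npm link actually work internally.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: expose a package from a central store inside a
// project's node_modules via a symlink, instead of copying it.
function linkFromStore(storeDir, projectDir, pkgName) {
  const target = path.resolve(storeDir, pkgName);
  const linkPath = path.join(projectDir, 'node_modules', pkgName);
  // dirname also covers scoped names such as @scope/name.
  fs.mkdirSync(path.dirname(linkPath), { recursive: true });
  // The 'junction' type only matters on Windows; other platforms ignore it.
  fs.symlinkSync(target, linkPath, 'junction');
  return linkPath;
}
```

The call throws if the link path already exists, so a real installer would need to handle reinstalls explicitly.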

@dxu
Contributor

dxu commented Sep 30, 2016

On second thought, you'd still run into problems with this if you build in the cache and then copy over, if you clear the cache (or any configurable directory).

You'll still need to build it per project, or the user would have to know to reinstall to rebuild it. Not sure how you can address this without forcing the build each time, and not sure how to set up something to address that on a case by case basis.

Although, a single source of truth would help in this case as well.

@yunxing
Contributor Author

yunxing commented Sep 30, 2016

@dxu Good points. I'm convinced that maybe symlinking is a better idea here.

As for your concern about clearing the cache -- yes, we will still have the same problem. But if we are explicit about the cache location configuration (ie, telling the user that this is the source of truth that you shouldn't remove), then users will understand the behavior.

@jordwalke

I wish the binaries were relocatable, but sometimes binaries can be large and copying can actually take a while too (12 seconds would be common), so symlinking seems like it would work pretty well aside from the cache clearing issue. It might be a worthwhile tradeoff though, especially if kpm commands in your project (one whose node_modules has dead symlinks) print a clear message that the symlinks now point to a cleared cache and explain what you can do about it. Still, I could see both modes being useful (symlink vs. copying), and maybe it can be a configurable option: each package could have a "relocatable": true/false field, and kpm install could accept a flag --copyRelocatableCachedArtifacts, defaulted to true. The logic would then be shouldCopyInsteadOfSymlink = packageIsRelocatable && copyRelocatableCachedArtifacts === true.
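The rule in this comment could be sketched as a small function. The "relocatable" field and the copyRelocatableCachedArtifacts flag are proposals from the comment, not existing options, and treating a missing "relocatable" field as true is an assumption made here.

```javascript
// Decide how to materialize a cached package, following the proposed rule.
// `pkg` is a parsed package.json; `flags` holds parsed CLI options.
function shouldCopyInsteadOfSymlink(pkg, flags = {}) {
  // Assumption: a package that does not declare "relocatable" is treated as relocatable.
  const packageIsRelocatable = pkg.relocatable !== false;
  // The proposed flag defaults to true when not given.
  const copyRelocatableCachedArtifacts = flags.copyRelocatableCachedArtifacts !== false;
  return packageIsRelocatable && copyRelocatableCachedArtifacts;
}
```

Under this rule a package marked "relocatable": false is always symlinked, and relocatable packages are symlinked only when the flag is turned off.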

@dxu
Contributor

dxu commented Sep 30, 2016

Just to clarify, doesn't the cache clearing issue only apply to the copy case? As you mentioned, using symlinks actually allows for easy checking of cache misses (the path no longer exists).

As I see it, with the current setup where node_modules contains copies of the cache:

  1. You can build the items in cache, and then copy it over to the node_modules directory.
  2. You can build the items in cache, and then create a symlink for the packages that need it.

Problems with these two:

  1. @yunxing, this is also in response to your point: given that it's a cache, I don't think it's uncommon to want to clear the cache and think nothing of doing so. If you ever clear the cache or have a cache miss, ocaml (e.g.) will fail with a yarn-unrelated error. At this point, it might not be obvious to the user that they have to re-run yarn install to solve the issue. I don't think you can tell users it's a cache, and then tell them that they shouldn't ever remove it because it's also a source of truth for certain packages.
  2. Correct me if I'm wrong here, but if we're trying to increase compatibility, we wouldn't really be able to expect all packages to include the new configurable options, right? Any old or npm-specific packages with post install scripts that rely on, or create artifacts that rely on, hardcoded paths would still fail. I remember running into similar issues with this package a year or two back (I think). I feel like the only way symlinks could work across the board and reap the benefits is if everything were symlinked, so that you wouldn't have to treat packages any differently.

@jordwalke

Just to clarify, doesn't the cache clearing issue only apply to the copy case?

No, there's an issue with symlinks too (though I think it's worth it). You have a symlink to the cache, then you clear the cache, any project that has a symlink to the cache in its node_modules will start failing. That isn't the behavior you'd expect from a "cache", though I think it's still useful and I'd opt into this behavior for faster installs.
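A rough sketch of how a tool could detect this situation and report it clearly, assuming Node.js. `findDeadLinks` is a hypothetical name, and the function only looks at direct children of node_modules (it does not descend into scoped directories).

```javascript
const fs = require('fs');
const path = require('path');

// List entries directly under node_modules that are symlinks whose target
// no longer exists, for example because the cache was cleared.
function findDeadLinks(nodeModulesDir) {
  const dead = [];
  for (const name of fs.readdirSync(nodeModulesDir)) {
    const entry = path.join(nodeModulesDir, name);
    // lstatSync inspects the link itself; existsSync follows it to the target.
    if (fs.lstatSync(entry).isSymbolicLink() && !fs.existsSync(entry)) {
      dead.push(name);
    }
  }
  return dead;
}
```

A command could call this first and, if the list is non-empty, tell the user to reinstall instead of failing with a package-specific error.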

@jordwalke

And perhaps we should see if the newest versions of the ocaml compiler are relocatable, which would make the copy option viable/preferred. However, aren't there some other popular programs that are also not relocatable?

@dxu
Contributor

dxu commented Oct 1, 2016

Oh, I see what you meant, and yeah, you're right. I was differentiating the symlink and copy cases in my head. My understanding is that after clearing the cache in the copy case, you'd get an ocaml-package-specific file error due to the relocated/removed build artifacts, whereas with the symlink you'd just get the generic "command ocaml not found" (since the link points to nothing). That should be more digestible and understandable for users, since it'd be the same error as when they didn't run npm install in the past.

I can't confirm what the current error with ocaml says. I can't seem to reproduce this on my laptop right now; yarn seems to be rebuilding each time I run yarn add.

@Daniel15
Member

Daniel15 commented Oct 2, 2016

No, there's an issue with symlinks too (though I think it's worth it). You have a symlink to the cache, then you clear the cache, any project that has a symlink to the cache in its node_modules will start failing.

Make it a hardlink then, not a symlink. Then all the links would point to the same inode, and clearing the cache will not break it (the files won't actually be deleted until all links are deleted).

(As a bonus, you can create hardlinks and junctions without admin permissions on Windows, whereas symlinks require admin permissions.)
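A small sketch of the hardlink variant, assuming Node.js. `hardLinkFile` is a hypothetical helper. Hard links apply to individual files and both paths must live on the same filesystem, so a real implementation would have to walk the package directory and link file by file.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: hard-link a single cached file into a project.
function hardLinkFile(cachedFile, destFile) {
  fs.mkdirSync(path.dirname(destFile), { recursive: true });
  // Both names now refer to the same inode; the data is only freed
  // once every link to it has been removed.
  fs.linkSync(cachedFile, destFile);
}
```

After linking, deleting the cached file leaves the project's copy readable, which is the property described in the comment.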

@dxu
Contributor

dxu commented Oct 5, 2016

An alternative, if links (hard/soft) aren't possible (see #499), is fetching to the cache directly prior to running postinstall scripts, and always running postinstall scripts after copying from cache -> node_modules, regardless of cache hits in future projects.
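This alternative might look roughly like the following, assuming Node.js. `installPackage` and the `fetch` and `postinstall` callbacks are hypothetical stand-ins for the real download and script steps, and the copy is flat (files only) to keep the sketch short.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical install flow: the cache holds only the fetched (unbuilt)
// package, and postinstall runs in every project after the copy, even on
// a cache hit.
function installPackage(cacheDir, projectDir, pkgName, fetch, postinstall) {
  const cached = path.join(cacheDir, pkgName);
  if (!fs.existsSync(cached)) {
    fs.mkdirSync(cached, { recursive: true });
    fetch(cached); // cache miss: download once
  }
  const dest = path.join(projectDir, 'node_modules', pkgName);
  fs.mkdirSync(dest, { recursive: true });
  for (const name of fs.readdirSync(cached)) {
    // Flat copy for brevity; a real version would recurse into directories.
    fs.copyFileSync(path.join(cached, name), path.join(dest, name));
  }
  postinstall(dest); // build in place, so hardcoded paths point at this project
  return dest;
}
```

The tradeoff is that the download is cached but the build is not, so every project pays the build cost again.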

However, I'm not able to reproduce the original problem with the above example, @yunxing. It seems to be rebuilding each time. Are you able to reproduce it?

@yunxing
Contributor Author

yunxing commented Oct 5, 2016

@dxu After some testing, I think the problem is not the binaries being non-relocatable. The problem is that some installations are not idempotent (caching the installed scripts and then reinstalling essentially means installing the same thing twice).

@sammdec

sammdec commented Feb 23, 2017

Is there any update on this? I am getting the same issue with sharp lovell/sharp#722

@jordwalke

jordwalke commented Feb 25, 2017

We have a potential solution for binary relocation from cache. We've tried it with esy and we can bring the solution to Yarn, perhaps.

Binary relocation doesn't work in all cases, and there are some limitations.
