[RFC 0084] Input-aware fetchers #84
Conversation
I like this A Lot but would not favor it being dependent upon Nix version.
> On Wed, Dec 30, 2020, Ryan Mulligan approved this pull request.
Isn't it simpler to essentially (finally!) revert …? I think in practice the basename of the URL is almost always good enough. It is also readable, unlike the proposal, and can be slightly tuned to be less dependent on the fetching method. Of course, when something better is needed, an explicit name evaluated to include the version also works…
Great idea, here is my 2 cents:
@7c6f434c Yeah that would be another way. But as you mentioned, it only works when the basename of the URL is the part that changes, which doesn't have to be the case. This means that in some cases, people will still have to change the hash to something invalid to force a fetch. The goal of this RFC is that people can expect to never have to do that. A minor problem is also that the basename of the URL can include query parameters, which have to be escaped or removed properly (see NixOS/nixpkgs#107515). Not impossible, but another non-trivial case. By using a fixed-size hash we can be sure that no matter how many inputs can influence the result, any change of any of them leads to a rebuild.
[alternatives]: #alternatives

- The trivial alternative of keeping the status quo. This leads to the common beginner problem described in the motivation section.
- Adding a special `outputHashInputs` attribute to the Nix `derivation` primitive, which can be set to any attributes. These attributes then influence Nix's store path hash directly, without the need for using the derivation name as a hash influencer. This could be a much nicer solution, but is a much more in-depth change to how Nix works.
Would like to highlight this alternative, which I haven't considered a whole lot. Implementing this is a bit harder, but doing that would allow us to have the best of both worlds: arbitrary inputs that cause a rebuild, but no increase of store path sizes. (<- @holymonson)
I feel like this could be a very good alternative. This would also create a more obvious “disable if Nix is too old” situation for nixpkgs.
> @7c6f434c Yeah that would be another way. But as you mentioned, it only works when the basename of the URL is the part that changes, which doesn't have to be the case. This means that in some cases, people will still have to change the hash to something invalid to force a fetch. The goal of this RFC is that people can expect to never have to do that.
In some cases people just need to specify `name` because upstream uses URL schemes too strange to be tolerable. In return we actually get readability.
> By using a fixed-size hash we can be sure that no matter how many inputs can influence the result, any change of any of them leads to a rebuild.
Frankly, I think it is a rare chance to make the «source» thing look readable in comparison. It's so much of a just-to-have-something solution that I am not sure just adding a «name» attribute as things get changed is not a better idea.
By the way, I think in some cases even the name doesn't save you from tarballs.nixos.org, does it? (If the answer is to change all the hashes, then you can alter the source by including a file with metadata instead of altering Nix)
| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
| `pkgs.fetchzip`, `<nix/fetchurl.nix>` (with `unpack = true`), `builtins.fetchTarball`, `nix-prefetch-url --unpack` | "fetchurl-unpack-" + URL |
| `pkgs.fetchgit`, `builtins.fetchGit` and co. | "fetchgit-" + url + "-" + rev |
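To make the mapping concrete, here is a small Python sketch (hypothetical helper names, not actual Nix code) of how the per-fetcher input strings in the table could be constructed. Any change to a hashed input changes the string, which, once folded into the derivation name, forces a refetch:

```python
def fetchurl_input(urls):
    # Only the first URL is used, so extra mirrors don't change the string.
    return "fetchurl-" + urls[0]

def fetchzip_input(url):
    return "fetchurl-unpack-" + url

def fetchgit_input(url, rev):
    return "fetchgit-" + url + "-" + rev

# Changing the rev (or the URL) changes the input string:
print(fetchgit_input("https://example.com/repo.git", "v1.0"))
print(fetchgit_input("https://example.com/repo.git", "v1.1"))
```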
`pkgs.fetchgit` would also need to have `deepClone`, `leaveDotGit` and `fetchSubmodules` included, wouldn't it?
Oh yeah, though we'll have to make sure that it harmonizes with `builtins.fetchGit`, which doesn't support `deepClone` and `leaveDotGit`. So it would probably have to be something like

```nix
"fetchgit-${optionalString deepClone "deepclone-"}${optionalString leaveDotGit "leavedotgit-"}${optionalString fetchSubmodules "fetchsubmodules-"}..."
```
Another option, though it requires mass changes in nixpkgs and doesn't address all cases, would be mandating a `name` for `fetchurl`, and introducing the convention of setting it to …
```
[nix-shell:~]$ echo -n 'example string' | openssl dgst -sha256 -binary | openssl base64 -A | cut -b1-42 | tr +/ -_
rt-5KzBTohoRT08wGgKjxq1d_1BNEk3CzuYRdiPuxw
```
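For illustration, the same digest can be reproduced outside the shell; a short Python equivalent of the pipeline above (raw SHA-256, base64 with `+/` mapped to `-_`, truncated to 42 characters):

```python
import base64
import hashlib

def input_hash(s: str) -> str:
    # Equivalent of: sha256 -binary | base64 | cut -b1-42 | tr +/ -_
    digest = hashlib.sha256(s.encode()).digest()
    return base64.urlsafe_b64encode(digest).decode()[:42]

print(input_hash("example string"))
```

The truncation keeps the derivation name bounded regardless of how many inputs feed the hash.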
For most sources, the `pname` and `version` attributes are probably enough to invalidate the inputs. One of the benefits is that those could be included in the name, making the store path more readable.

Another benefit of using `pname` and `version` is that it allows inverting the control. Historically the `pname` is first set on the package derivation, and then inherited in the source. In cases where we have a package generator function, like with most languages, the `pname` could be set on the source, and then inherited by the generator function.
I think this RFC is mainly for the minority of cases, such as a git tarball with a revision, or a patch with patchutils arguments, with no `pname` nor `version`. As for normal sources, a new `pname` and `version` will generate a new `name`, which can already trigger a refetch now. (`Store::makeFixedOutputPath` takes `name` as a parameter and generates a different derivation path.)

We may want something other than `hash` or `name` that could affect the fixed output path.
@holymonson if I'm interpreting zimbatm's comment right, he's suggesting that name/pname/version be moved into the source rather than the derivation that builds the source.
| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
| `pkgs.fetchzip`, `<nix/fetchurl.nix>` (with `unpack = true`), `builtins.fetchTarball`, `nix-prefetch-url --unpack` | "fetchurl-unpack-" + URL |
| `pkgs.fetchgit`, `builtins.fetchGit` and co. | "fetchgit-" + url + "-" + rev |
Is it worth talking about fetchers that download a bunch of things, like the Rust cargoSha256 and Go vendorSha256? They are a bit different in nature than the above-mentioned fetchers.
One trick we did with Cargo is to add the Cargo.lock in the derivation itself, and then compare it in the main derivation to make sure they are the same. This was done as a workaround for the issue mentioned in this RFC. So if a good solution is found, we could remove that hack.
Not in favor of this. It goes the wrong way, namely *away* from content-addressability. (In CA Nix, ideally we wouldn't have a `name` attribute...)

It duplicates source tarballs that could otherwise be deduplicated. This hurts long-term reproducibility from source, since it creates the possibility that a source tarball exists in the binary cache, but not under the right name.

Also, including a hash of the URL in the `name` negates the point of fixed-output derivations, which is to ensure that you can update a URL without causing a global rebuild.
> Not in favor of this. It goes the wrong way, namely *away* from content-addressability. (In CA Nix, ideally we wouldn't have a `name` attribute...)
The latest designs for CA Nix I have seen have two layers: intensional derivations, and extensional (content-addressed) output paths with a mapping from derivations to output paths. Nothing forces us to make mistakes like making derivation files impossible to instantiate in storage, or dropping names from derivation files / derivations. The extensional side will indeed lack names.
In CA terms, this proposal would mean that a change in pname/version means the derivation-output mapping is not automatically inherited. The CA output will of course eventually be the same. I agree that the maximalist (full-URL / fetcher type) approach is counterproductive, but your arguments are not as strong about a more balanced version (to which this proposal should indeed be changed).
> It duplicates source tarballs that could otherwise be deduplicated. This hurts long-term reproducibility from source, since it creates the possibility that a source tarball exists in the binary cache, but not under the right name.
Maybe availability of a listing (or something in the style of tarballs.nixos.org but redirecting to cache and more up to date?) would be a better balance?
> Also, including a hash of the URL in the `name` negates the point of fixed-output derivations, which is to ensure that you can update a URL without causing a global rebuild.
Agreed, mirror lists should be updateable without rebuilds. As is, it risks (depending on implementation details) causing a full rebuild on a mirror://gnu/ mirror list update or something.
I feel like the problem is that there are situations in which you want Nix to be input-aware and other situations when you specifically don't want it. I.e., when working in your local nixpkgs checkout you definitely want to rebuild fixed-output derivations with changed inputs, but when rebuilding your system or building packages on Hydra you don't want it at all. I'm not sure whether there is a proper solution which preserves both use cases though.
I find the reason for this RFC (and the problem it tries to solve) to be spot on. A lot of times I have seen newcomers and even more experienced Nix users spend hours debugging because they forgot to update the sha as well as the url. I understand why this is done the way it is, and it even becomes more obvious with CA, but I wonder what a solution to this UX problem would look like, whether there even is a solution that would make everybody happy, or how to think about the trade-off we want to make. I don't really have a proper solution; I only want to voice the opinion that this is a common problem which most likely every newcomer will experience.
I've been bitten by this too, but changing the output path for fetchers seems like an odd workaround that only works for some instances of this problem. The first time I encountered a changed URL with the same expected SHA not being re-fetched, I too was confused at first, but as soon as I figured out what was going on, it seemed obvious that it was not a bug in Nix but in my understanding. Of course, if we already have a file with the same SHA, there's no reason to re-download it for a normal build. Normally we do want to use things that are already in the Nix store, but there are times when we don't. So the more general problem is that we sometimes do want to force something to be rebuilt. Some examples where we don't want to use the existing output that's already in the store: …
We do need to keep in mind though: changing the URL and expecting the hash to not change is a very rare use case. I don't think we should let such a proposal be rejected because of this. My guesstimate is that in at least 99.9999% of cases, changing the URL is also expected to change the hash. By making this common case have a much better user experience with such a proposal, we can save a lot of time and frustration, especially for beginners. In my opinion the benefits of this vastly overshadow the disadvantages (which can be worked around, see above).
> Agreed, mirror list should be updateable without rebuilds. As it it risks (depending on implementation details) causing a full rebuild on mirror://gnu/ mirror list update or something.
Actually that's no problem, because the mirrors for `mirror://` urls won't be hashed. It's only the `mirror://` url itself that is hashed.
That's somewhat better, sorry for misunderstanding.
Also, with such a feature we could easily implement a `fetchurl` like
```nix
fetchurl {
url = "https://example.com/original-source";
alternateUrls = [ "https://foo.com/other-source" ];
}
```
And we'd only let `url` influence the hash, but not `alternateUrls`. In the implementation, curl would try the `url` first, but fall back to `alternateUrls` if that failed. While this isn't perfect, it's an easy workaround for preventing global rebuilds if necessary.
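A hypothetical sketch of those semantics (Python for illustration; the names `fetch_plan` and `alternate_urls` are assumptions, not real Nix API): only `url` feeds the hash, while all URLs remain download candidates in order:

```python
import hashlib

def fetch_plan(url, alternate_urls=()):
    """Return (input_hash, candidate_urls). The hash that decides whether a
    refetch is needed depends on `url` alone; the candidates are what a
    downloader like curl would try in order, falling back on failure."""
    input_hash = hashlib.sha256(("fetchurl-" + url).encode()).hexdigest()
    return input_hash, [url, *alternate_urls]

h1, _ = fetch_plan("https://example.com/original-source")
h2, _ = fetch_plan("https://example.com/original-source",
                   ["https://foo.com/other-source"])
print(h1 == h2)  # adding alternate URLs does not change the hash
```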
That looks messier than plain `urls` list…
| Fetchers | Input string |
| --- | --- |
| `pkgs.fetchurl`, `<nix/fetchurl.nix>`, `builtins.fetchurl`, `nix-prefetch-url` | "fetchurl-" + first URL |
@7c6f434c Oh, and actually I even specified this here already: only the first URL of multiple should be used in the hash. So yeah, no need for `additionalUrls`; the existing `urls` can do that already.
Agree with @edolstra: this is tempting, and yes, I've been bitten too, but this is absolutely the wrong direction. We must find a better solution. As they say in Haskell land, "avoid success-at-all-costs" (added the dashes to disambiguate; don't need the joke here). In NixOS/nix#3635 I make it so we can just use the git hash and don't need a NAR hash. Doesn't that solve the problem nicely without these downsides?
@Ericson2314 That's awesome, but it only works for content-addressed data, so not with most …
@infinisil it works fine for the … There are a number of reasons why post-tarball and post-compression hashing is bad, too. See https://www.tweag.io/blog/2020-06-18-software-heritage/ for example. Keeping that case worse is not ideal, but it has the silver lining of pushing people in a good direction.
Another idea is that if the narinfo for the build we found has a …
Recently on IRC I discussed Nix hashing with @adisbladis, where we concluded that if we want package managers to interact nicely with Nix, we need a standard directory hashing scheme. Researching that I stumbled upon dirhash, which is exactly that. While writing this RFC I didn't make the connection, but now I can't deny it. Notably this dirhash standard allows you to customize the hash function: it lets you configure which files should be ignored, how to treat recursive symlinks, and more. It also points out that you can extend the standard with more such properties. In our case, we could see the URL and any other relevant inputs as exactly that. If Nix supported such a hashing standard, we would e.g. specify output hashes like

```nix
{
  outputHash = {
    mode = "dirhash";
    version = "0.1.0";
    algorithm = "sha256";
    hash = "...";
    protocol.nixInputs.url = url;
    # ...
  };
}
```

Now when Nix wants to check whether a certain hash exists already, it can: …

The store path should notably only be derived from the … I can't see anything obviously wrong with this, and this doesn't have any of the previously discussed downsides. This would also pave the way for better package manager integration with a directory hashing standard. Ping @regnat as well.
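As a toy illustration (a made-up scheme, not the actual dirhash specification), a directory hash whose extra protocol inputs, such as the URL, also invalidate it might look like this in Python:

```python
import hashlib
import os
import tempfile

def toy_dirhash(root, extra_inputs):
    """Hash file names and contents in sorted order, then fold in the
    extra inputs so that e.g. a changed URL yields a different hash."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode() + b"\0")
            with open(path, "rb") as f:
                h.update(f.read())
    for key in sorted(extra_inputs):
        h.update(f"{key}={extra_inputs[key]}".encode() + b"\0")
    return h.hexdigest()

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "default.nix"), "w") as f:
        f.write("{ }")
    a = toy_dirhash(d, {"url": "https://example.com/v1.tar.gz"})
    b = toy_dirhash(d, {"url": "https://example.com/v2.tar.gz"})
    print(a != b)  # same content, different URL => different hash
```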
@infinisil I would consider the rest of the properties a separate sort of derivation, in a way. Instead of saying exactly "how to build" something, they are making a claim about what the data represents. From a substitutee's standpoint, this is no worse, because we are choosing to trust the substitutor anyways; so whether the claim is "how" or "why" doesn't super matter. I think the concept of "alt derivations" is good in general.

Another option is "Merkle inclusion proofs". Since the world at large prefers tags and commit hashes to tree hashes, it's nice to not specify the tree hash (which may not be as well known) directly. But if we specify something like "give me the tree from this commit", or even "give me the tree from this ancestor of this commit", rather than downloading all the history, we can just download a "spine" of nodes and siblings which proves that the tree hash at the bottom is included in the original hash. This trustlessly lets us use fewer hashes, which is very nice.
How would this affect someone using a GitHub copy-on-read mirror via Artifactory? At my company we need to be able to guarantee working builds even if GitHub or something similar goes down, or potentially if the sources get renamed or removed (e.g. I ran into this when using …).

What would happen in this case? Would all the sources be downloaded again? Would it re-trigger builds?
@terlar Yes! This is a great demo of why content addressing is good, and why making the source authoritative when there is also an authoritative content address is bad. The RFC as written would get stuck, but @infinisil's last comment does look more promising. (Relatedly, I think it might be good to allow not only a single fixed-output derivation, but a set of them providing the same path to be jointly depended upon, to basically make mirroring/fallbacks a first-class concept in Nix at the store level.)
I guess for … But yeah, I think I'll either change this RFC to the proposal in #84 (comment), or create a new one for that. I think that's a much better approach than the one described here.
/nix/store/l98gjfznp8lpxi0hvj4i0rw34xnnqma8-source
```

This can't be right! Of course, people with Nix knowledge will tell you that you need to change the hash to something invalid first in order for it to attempt a refetch. This is because Nix already has a path with that exact output `sha256` hash, so it just reuses that. Indeed, changing the output hash to something invalid works:
> Of course, people with Nix knowledge will tell you that you need to change the hash to something invalid first in order for it to attempt a refetch.
I'm sorry to hear that this is encouraged as a good practice. Shouldn't people be told instead to use prefetchers whenever updating a source, rather than wastefully downloading the source twice just to get a hash mismatch the first time?
(I concede that I do the invalid-hash procedure myself for Rust packages, for which using a prefetcher would have been beyond my understanding of the Rust packaging functions, but I always use prefetchers for URL sources and Git sources.)
It won't download it twice, since it stores the result under the store path for the calculated hash the first time.
We have it, namely the NAR hash.
Another (future) possible solution to the issue is to treat fixed-output derivations as a special case of content-addressed derivations (as defined in #66) where we assert the expected output hash. The key difference with the way we handle them right now is that a change in the derivation would cause it to be rebuilt (like we do for input-addressed derivations), but we would stop the rebuild right after that (or error out if the new build doesn't have the expected hash). So we'd have slightly more rebuilds than currently (because we would rebuild FO derivations when their definition changes), but not too many (because we still wouldn't rebuild whatever depends on them); and that would solve the problem at hand, as we would be able to notice every hash mismatch.
Closing this as I don't think the proposed solution here is a good idea. If I have enough time I'll soon write a new RFC with the idea from #84 (comment) (and actually, it doesn't have to involve dirhash at all).
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/tweag-nix-dev-update-6/11195/1
@infinisil I just had a thought that might just re-use this approach without (I think) its drawbacks: what if we just add an …
Right now, all I do when I update a fixed output is to remove the sha1. It will then download it and tell me the new hash.
@Ekleog the problem with that approach is that if the original URL goes offline it will be impossible to provide an alternate URL of the same data. I guess we could add an optional …
No, it is not. I want to turn off fixed-output derivations on a developer machine to know when download links break or I forgot to change a hash.
… that's one thing I really do not want happening on any machine I use. Maybe what is needed is an overlay that hashes all the parameters to fetchers and appends the hash to the name, then copies it into a normal fixed-output derivation with the compatible hash, and uses the tests-required-but-do-not-affect-hash idea by @Profpatsch to make sure the non-fixed version works before you are allowed to use the properly fixed version?
@kevincox I'm not sure my suggestion was clear? What I'm suggesting is: …

I don't think this would be particularly confusing, as … Does this idea make sense?
That makes sense. I think the requirement to set the old URL is a bit weird, but avoiding accidentally building on an old hash seems worth it 👍
Change the name of the main fixed-output derivations to contain a hash of all inputs that could affect the output hash, such that if any of them change, a previously fetched path with the same hash won't be reused. This resolves the long-standing confusion about Nix using an old version of a source even though the URL it should be fetched from was updated.
Rendered