[RFC 0062] Content-addressed paths #62

Merged on Jan 12, 2022 (34 commits)

Commits
6d001f3
CAP RFC: First draft
thufschmitt Sep 19, 2019
435fc42
typo
thufschmitt Dec 11, 2019
7b26144
Apply @grahamc's suggestions
regnat Dec 11, 2019
81099b2
nix code -> Nix expression
thufschmitt Dec 11, 2019
4277386
Break-up the big introduction paragraph
thufschmitt Dec 11, 2019
7af7d2c
Rename to match the PR number
thufschmitt Dec 12, 2019
5fec861
Rename the drv attribute to __contentAddressed
thufschmitt Dec 12, 2019
9edc11f
Mention the GC issue
thufschmitt Jan 8, 2020
5717351
Remove the ambiguity on what an `output` is
thufschmitt Jan 8, 2020
1a844cc
Replace aliases paths by a pathOf mapping
thufschmitt Jan 15, 2020
26ae77e
Move the example after the design description
thufschmitt Jan 15, 2020
bbdca7e
Rephrase the design
thufschmitt Jan 15, 2020
63f3eca
Add shepherd team
thufschmitt Jan 16, 2020
a6d2f38
Rewrite the RFC to account for the RFC meeting comments
thufschmitt Feb 17, 2020
140e093
Add a section about leaking output paths
thufschmitt Feb 17, 2020
288dcb4
Merge remote-tracking branch 'upstream/master' into cas-rfc
Ericson2314 Mar 14, 2020
60e7da3
Merge pull request #5 from Ericson2314/cas-rfc-new-template
regnat Mar 18, 2020
1115a0d
Refine the design summary
thufschmitt Mar 18, 2020
13938de
Rename dependency-addressed into input-addressed
thufschmitt Mar 18, 2020
3a25f7f
minor fixup after comments
thufschmitt Mar 25, 2020
3a18867
Apply suggestions from code review
regnat Jun 19, 2020
fa16e86
Update rfcs/0062-content-addressed-paths.md
Mic92 Oct 22, 2020
94b65bd
Update the terminology to match the in the implementation
thufschmitt Apr 14, 2021
7ed4481
Reword the detailed design presentation
thufschmitt Apr 14, 2021
fb4c61d
Quote some strings in the yaml frontmatter
thufschmitt Apr 14, 2021
841fe3f
Add a design paragraph about the remote caching
thufschmitt Apr 14, 2021
27bd048
Lift the determinism requirement
thufschmitt Apr 14, 2021
1e8fab7
Typo
edolstra May 31, 2021
9772625
Apply suggestions from code review
edolstra May 31, 2021
02ae2b5
Rewrite the RFC
thufschmitt Jun 2, 2021
2d74fed
Make the python samples a bit more pythonic
regnat Jun 2, 2021
168a149
Explicit that unresolved dependencies are eval-time
thufschmitt Jun 2, 2021
427abed
Prettify
thufschmitt Jun 2, 2021
f275669
Make the end-goal an experiment
regnat Dec 10, 2021
305 changes: 305 additions & 0 deletions rfcs/0062-content-addressed-paths.md
@@ -0,0 +1,305 @@
---
feature: Simple content-addressed store paths
start-date: 2019-08-14
author: Théophane Hufschmitt
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: @layus, @edolstra and @Ericson2314
shepherd-leader: (name to be appointed by RFC steering committee)
related-issues: (will contain links to implementation PRs)
---

# Summary

[summary]: #summary

Add basic support for content-addressed store paths to Nix.

We plan here to give the possibility to mark certain store paths as
content-addressed (ca), while keeping the others dependency-addressed as they are
Member

small nit: hard to read the text because ca is too close to ca. (abbreviation of circa). I would uppercase CA.

now (modulo some mandatory drv rewriting before the build, see below)

By making this opt-in, we can impose arbitrary limitations on the paths that
are allowed to be ca to avoid some tricky issues that can arise with
content-addressability.

In particular, we restrict ourselves to paths that are:

- without any non-textual self-reference (_e.g._ a self-reference hidden inside a zip file)
- known to be deterministic (for caching reasons, see [caching]).

That way we don't have to worry about the fact that hash-rewriting is only an
approximation, nor about the semantics of the distribution of non-deterministic
paths.

We also leave the option to lift these restrictions later.

This RFC already has a (somewhat working) POC at
<https://github.com/NixOS/nix/pull/3262>.

# Motivation

[motivation]: #motivation

Having a content-addressed store with Nix (aka the "Intensional store") is a
long-time dream of the community; a design for that already took up a whole
chapter in [Eelco's PhD thesis][nixphd].

This was never done because it represents quite a big change in Nix's model,
with some implications that are not fully solved (regarding the trust model in
particular).
Even without going all the way down to a fully intensional model, we can
make specific paths content-addressed, which can give some important benefits of
the intensional store at a much lower price. In particular, setting some
critical derivations as content-addressed can lead to some substantial build
cutoffs.

# Detailed design

[design]: #detailed-design

In all that follows, we pretend that each derivation has only one output.
This doesn't change the reasoning but makes things easier to state.

The gist of the design is that:

- Some derivations can be marked as content-addressed (ca), in which case their
output will be moved to a path `ca` determined only by its content after the
build
- We introduce the notion of a `resolved derivation` which is a derivation that
doesn't refer to any other derivation but only to concrete store paths.
Member

Unless we commit to breaking self cycles in the first version, we cannot actually do this because there is no hash of normal derivations with cyclic data.

Derivations that don't transitively depend on any ca derivation are “equivalent” to their associated resolved derivation in that they refer to the same inputs and have the same output hash.

This quote in particular is in conflict with the above.

I think you wish to say that a resolved derivation depends on no content-addressed or fixed output derivations, but only concrete store paths and normal derivations.

Member

What kind of derivation are we talking about here, derivation as in the builtin in the expression language or .drv files?

Member

Also, what’s a concrete store path?

Member

What kind of derivation are we talking about here, derivation as in the builtin in the expression language or .drv files?

Neither? but closer to the files. The concept (node in a merkle dag with certain data) exists ontologically prior to being a primop in some language or having some serialization format.

Member

Also, what’s a concrete store path?

Good question. A content address store path, as exists today (fixed output, add-to-store, etc).

To prevent ambiguities, we might speak of a `symbolic derivation` to
Contributor

Suggested change
To prevent ambiguities, we might speak of a `symbolic derivation` to
To prevent ambiguities, we will speak of a `symbolic derivation` to

designate a derivation that's not necessarily resolved.
Member

Suggested change
designate a derivation that's not necessarily resolved.
designate a derivation that might or might not be resolved.

This looks like it would be the set of all derivations? Is there a LEM-gotcha here?

Member Author

It is indeed the set of all derivations. But we don't use it anywhere so I'll drop it

We also define a `resolving` function that given a symbolic derivation
returns a new resolved derivation with the same semantics.
- When asked to build a derivation, Nix will first resolve it, build the
resolved derivation and link back the symbolic one to the out path of the
resolved one.
Member

Suggested change
resolved derivation and link back the symbolic one to the out path of the
resolved one.
resolved derivation and create a link from the path of the symbolic derivation, pointing to the path of the
resolved derivation

Or is “out path” intentional here, that is are we talking about the output out specifically?

Member Author

Nah, out isn't intentional, thanks


## Nix-build process

### Output mappings

A major consequence of allowing content-addressed derivations is that the
actual output path of a derivation might not match its output hash anymore.

To express this, we introduce a new mapping `pathOf` that associates the hash
of every live derivation to its store path.
By extension, we also define `pathOf(drv) = pathOf(hash(drv))`

### Building a ca derivation

ca derivations are derivations with the `__contentAddressed` argument set to
`true`.

The process for building a content-addressed derivation is the following:

- We build it like a normal derivation (see below) to get an output path `$out`.
- We compute a cryptographic hash `$chash` of `$out`[^modulo-hashing]
- We move `$out` to `/nix/store/$chash-$name`
- We create a mapping from `$dhash` (the hash computed at eval-time) to
`/nix/store/$chash-$name`
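
The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `hash_path` is a hypothetical stand-in for Nix's real path hashing (it ignores permissions and symlinks), outputs are assumed to be directories, `move_to_ca_path` and the `path_of` dictionary are names invented for this sketch, and a temporary directory plays the role of `/nix/store`. Discarding a freshly built output whose content-addressed path already exists follows the behaviour described in the example further down.

```python
import hashlib
import os
import shutil
import tempfile


def hash_path(path: str) -> str:
    """Deterministically hash a file or directory tree (stand-in for Nix's hashing)."""
    h = hashlib.sha256()
    if os.path.isfile(path):
        h.update(b"file\0")
        with open(path, "rb") as f:
            h.update(f.read())
        return h.hexdigest()
    h.update(b"dir\0")
    for entry in sorted(os.listdir(path)):  # sorted for a stable result
        h.update(entry.encode() + b"\0")
        h.update(hash_path(os.path.join(path, entry)).encode())
    return h.hexdigest()


def move_to_ca_path(store: str, out: str, name: str, dhash: str, path_of: dict) -> str:
    """Move the build output `out` to `<store>/<chash>-<name>` and record the mapping."""
    chash = hash_path(out)[:32]
    ca_path = os.path.join(store, f"{chash}-{name}")
    if os.path.exists(ca_path):
        shutil.rmtree(out)         # same content already in the store: discard
    else:
        shutil.move(out, ca_path)
    path_of[dhash] = ca_path       # mapping from $dhash to the content-addressed path
    return ca_path


# Two builds with different eval-time hashes but identical output content.
store = tempfile.mkdtemp()
path_of = {}
results = []
for dhash in ("dhash-v1", "dhash-v2"):
    out = tempfile.mkdtemp(dir=store)
    with open(os.path.join(out, "hello.txt"), "w") as f:
        f.write("hello\n")
    results.append(move_to_ca_path(store, out, "example", dhash, path_of))

print(results[0] == results[1])
```

Both eval-time hashes end up mapped to the same store path, which is what later allows dependent derivations to be shared.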

[^modulo-hashing]:

We can possibly normalize all the self-references before
computing the hash and rewrite them when moving the path to handle paths with
self-references, but this isn't strictly required for a first iteration
Member

@Ericson2314 Jan 17, 2020

I would very much like to not do this the first iteration, for the record

Nix already supports this, behind a flag.

Member

It does, but I'd still like to be a bit cautious. When the mounted store path is the content addressed thing, we open the door to all sorts of interesting ways to distribute and deploy nix stores. By adding this post-processing step we complicate those at the very least.


### Building a normal derivation

#### Resolved derivations

We define a `resolved derivation` as a derivation that has no reference to any
other derivation (but can refer to store paths).

For a derivation `drv` whose input derivations have all been realised, we define
its `associated resolved derivation` (`resolved(drv)`) as
`drv` in which every input derivation `inDrv` is replaced by
`pathOf(inDrv)` (with the output hash updated accordingly).

`resolved` is (intentionally) not injective: If `drv` and `drv'` only differ because one depends on `dep` and the other on `dep'`, but `dep` and `dep'` are content-addressed and have the same output hash, then `resolved(drv)` and `resolved(drv')` will be equal.

Derivations that don't transitively depend on any ca derivation are “equivalent” to their associated resolved derivation in that they refer to the same inputs and have the same output hash.

#### Build process

When asked to build a derivation `drv`, we instead:

1. Try to substitute and build `resolved(drv)`. Possibly this is a no-op because it may be that `resolved(drv)` has already been built.
2. Add a new mapping `pathOf(hash(drv)) = out(resolved(drv))`
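
The resolution step and the two-step build process can be modelled with a small Python sketch. Everything here is an assumption made for illustration: the `Drv` dataclass, its `recipe` and `note` fields (`note` changes `hash(drv)` without changing the build output), the fake builder, and the use of a single `path_of` dictionary both as the `pathOf` mapping and as the record of which resolved derivations were already built. None of it reflects the actual Nix implementation.

```python
import hashlib
import json
from dataclasses import dataclass


def short_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]


@dataclass(frozen=True)
class Drv:
    name: str
    recipe: str                      # determines the build output
    note: str = ""                   # changes hash(drv) but not the output
    input_drvs: tuple = ()           # unresolved inputs (other Drv values)
    input_paths: tuple = ()          # resolved inputs (store paths)
    content_addressed: bool = False


def drv_hash(drv: Drv) -> str:
    fields = [drv.name, drv.recipe, drv.note,
              [drv_hash(d) for d in drv.input_drvs],
              list(drv.input_paths), drv.content_addressed]
    return short_hash(json.dumps(fields))


path_of = {}    # the pathOf mapping: hash(drv) -> store path
build_log = []  # names of derivations whose builder actually ran


def resolved(drv: Drv) -> Drv:
    """Replace every input derivation inDrv by pathOf(inDrv)."""
    paths = [path_of[drv_hash(d)] for d in drv.input_drvs]
    all_paths = tuple(sorted(set(drv.input_paths) | set(paths)))
    return Drv(drv.name, drv.recipe, drv.note, (), all_paths, drv.content_addressed)


def build(drv: Drv) -> str:
    for dep in drv.input_drvs:       # inputs must be realised first
        build(dep)
    rdrv = resolved(drv)
    rhash = drv_hash(rdrv)
    if rhash not in path_of:         # otherwise step 1 is a no-op
        build_log.append(drv.name)
        content = f"{rdrv.recipe}({', '.join(rdrv.input_paths)})"  # fake build
        out_hash = short_hash(content) if rdrv.content_addressed else rhash
        path_of[rhash] = f"/nix/store/{out_hash}-{drv.name}"
    path_of[drv_hash(drv)] = path_of[rhash]   # step 2
    return path_of[rhash]


def make_graph(note: str) -> Drv:
    ca = Drv("contentAddressed", "make ca", note, content_addressed=True)
    dep = Drv("dependent", "make dep", input_drvs=(ca,))
    return Drv("transitivelyDependent", "make top", input_drvs=(dep,))


first = build(make_graph("v1"))
second = build(make_graph("v2"))     # different derivations, same ca output
print(build_log)
```

In this model the second call re-runs only the builder of `contentAddressed`: its output lands on the same path as before, so the resolved forms of the two downstream derivations are unchanged and nothing else is rebuilt.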

## Example

In this example, we have the following Nix expression:

```nix
rec {
  contentAddressed = mkDerivation {
    name = "contentAddressed";
    __contentAddressed = true;
    … # Some extra arguments
  };
  dependent = mkDerivation {
    name = "dependent";
    buildInputs = [ contentAddressed ];
    … # Some extra arguments
  };
  transitivelyDependent = mkDerivation {
    name = "transitivelyDependent";
    buildInputs = [ dependent ];
    … # Some extra arguments
  };
}
```

Suppose that we want to build `transitivelyDependent`.
What will happen is the following:

- We instantiate the Nix expression, which gives us three drv files:
  `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
- We build `contentAddressed.drv`.
  - We first compute `resolved(contentAddressed.drv)` to replace its
    inputs by their real output path. Since there are none, we
    have here `resolved(contentAddressed.drv) == contentAddressed.drv`
  - We realise `resolved(contentAddressed.drv)`. This gives us an output path
    `out(resolved(contentAddressed.drv))`
  - We move `out(resolved(contentAddressed.drv))` to its content-addressed path
    `ca(contentAddressed.drv)` which derives from
    `sha256(out(resolved(contentAddressed.drv)))`
- We build `dependent.drv`
  - We first compute `resolved(dependent.drv)` to replace its
    inputs by their real output path.
    In that case, we replace `contentAddressed.drv!out` by
    `ca(contentAddressed.drv)`
  - We realise `resolved(dependent.drv)`. This gives us an output path
    `out(resolved(dependent.drv))`
- We build `transitivelyDependent.drv`
  - We first compute `resolved(transitivelyDependent.drv)` to replace its
    inputs by their real output path.
    In that case, that means replacing `dependent.drv!out` by
    `out(resolved(dependent.drv))`
  - We realise `resolved(transitivelyDependent.drv)`. This gives us an output path
    `out(resolved(transitivelyDependent.drv))`

Now suppose that we slightly change the definition of `contentAddressed` in such
a way that `contentAddressed.drv` will be modified, but its output will be the
same. We try to rebuild the new `transitivelyDependent`. What happens is the
following:

- We instantiate the Nix expression, which gives us three new drv files:
  `contentAddressed.drv`, `dependent.drv` and `transitivelyDependent.drv`
- We build `contentAddressed.drv`.
  - We first compute `resolved(contentAddressed.drv)` to replace its
    inputs by their real output path. Since there are none, we
    have here `resolved(contentAddressed.drv) == contentAddressed.drv`
  - We realise `resolved(contentAddressed.drv)`. This gives us an output path
    `out(resolved(contentAddressed.drv))`
  - We compute `ca(contentAddressed.drv)` and notice that the
    path already exists (since it's the same as the one we built previously),
    so we discard the result.
- We build `dependent.drv`
  - We first compute `resolved(dependent.drv)` to replace its
    inputs by their real output path.
    In that case, we replace `contentAddressed.drv!out` by
    `ca(contentAddressed.drv)`
  - We notice that `resolved(dependent.drv)` is the same as before (since
    `ca(contentAddressed.drv)` is the same as before), so we
    just return the already existing path
- We build `transitivelyDependent.drv`
  - We first compute `resolved(transitivelyDependent.drv)` to replace its
    inputs by their real output path.
    In that case, that means replacing `dependent.drv!out` by
    `out(resolved(dependent.drv))`
  - Here again, we notice that `resolved(transitivelyDependent.drv)` is the same as before,
    so we don't build anything

## Wrapping it up

# Drawbacks

[drawbacks]: #drawbacks

- Obviously, this makes the Nix model more complicated than what it is now. In
particular, the caching model needs some modifications (see [caching]);

- We specify that only a sub-category of derivations can safely be marked as
`contentAddressed`, but there's no way to enforce these restrictions;
Member

It's a bigger problem than it might look like, as it means that trivial updates can break the CA marking for reasons not worth mentioning in the upstream changelog.

Member Author

Definitely :)

Maybe that could be clearly stated, but the original scope of this work was to be able to mark very specific derivations that were clearly guaranteed to be deterministic, in which case the problem was less important

Member

Well, the question «why not just propagate CA» shows that writing more is a good idea.

I do think that stressing the limitations in a few key places is also a nice thing to do (people should be able to apply RFC as passed, not what was intended and not what was discussed, after all… we should not treat ourselves worse than we treat computers!)


- This will probably be a breaking-change for some tooling since the output path
that's stored in the `.drv` files doesn't correspond to the actual on-disk
path the output will be stored in (because it might just be an alias for the
other path)

# Alternatives

[alternatives]: #alternatives

[RFC 0017][] is another proposal with the
same end-goal. The big difference between these two is in the scope they cover:
RFC 0017 is about fundamentally changing the base model of Nix, while this
proposal suggests making only the minimal amount of changes to the current
model to allow the content-addressed model to live in parallel (which would open
the way to a fully content-addressed store as in RFC 0017, but in a much more
incremental way).

Eventually this RFC should be subsumed by RFC0017.

# Unresolved questions

[unresolved]: #unresolved-questions
Member

Should we have functionality that allows to build a CA package twice with different apparent output paths, and optionally with different parallelism settings? The build of the package obviously fails if the CA unification doesn't lead to the same result.

Should we mandate that Hydra uses this functionality? Should it be on by default?

Member

Per https://github.com/NixOS/rfcs/pull/62/files#r357243841 I think we can deal with non-deterministic derivation just fine.

Member

Well, for binary cache transparency it is much better if you can build something locally, then regain connectivity and fetch stuff from a cache, then fetch stuff from a different cache, then build some more locally, etc.

Member

So in my mind that's equally risky with and without content addressable derivations. The only difference is one lets you know if something goes wrong, and one doesn't.

Member

No. Build nondeterminism doesn't introduce significant behaviour changes, so as long as the expectations are not broken (yeah, we install you into this output path and your dependencies into those paths, and that is not going to change), it will be mostly usable. There are a few CPU-dependent optimisations from time to time, they are annoying.

With CA things are actually moved around, so even though everything would still work when assembled together, the assembling part will be failing. It is Nix, not the code that is built by Nix, that would fail to do things because of nondeterminism.

Member

@Ericson2314 Dec 13, 2019

I think trying to keep going despite non-determinism incoherence is a misfeature. You can always evict your own CA mappings (can keep the builds themselves for easy "rollback") and align with cache.reflex-frp.org and keep going.

Member

I think if things are marked CA, of course it is a good idea to catch failures. But what you propose will not catch much, because a typical derivation is only built once (ever) by Hydra, later Hydra will use the binary cache. Also my proposal includes feeding different «apparent» output paths to the same build with the same dependencies, which has a better chance of discovering compressed self-references.

Member

I am OK with running --check and trying different temporary output paths. Catching non-determinism I don't think is important, because it's really clashes that we care about. However, catching self-references is important as we have to be able to move the thing.

Member

Well, varying the output paths is something that doesn't follow from anything Nix does, so it has to be spelled explicitly.


## Caching

[caching]: #caching

The big unresolved question is about the caching of content-addressed paths.
As [Eelco's PhD thesis][nixphd] states, caching ca paths raises a number of
questions when building that path is non-deterministic (because two different
stores can have two different outputs for the same path, which might lead to
some dependencies being duplicated in the closure of a dependency).
There exist some solutions to this problem (including one presented in Eelco's
thesis), but for the sake of simplicity, this RFC simply forbids marking a
derivation as ca if its build is not deterministic (although there's no real
way to check that, so it's up to the author of the derivation to ensure that it
is the case).
Member

I don't think we can skip this. If we track all the evaluation steps, we have all the information to ensure a binary cache isn't given anything that clashes with ourselves. Maybe the first prototype will discover these errors lazily, but it should discover them.


## Client support

The bulk of the job here is done by the Nix daemon.

Depending on the details of the current Nix implementation, there might or
might not be a need for the client to also support it (which would require the
Contributor

Suggested change
might not be a need for the client to also support it (which would require the
be a need for the client to also support it (which would require the

daemon and the client to be updated synchronously).

## Old Nix versions and caching

What happens (and should happen) if a Nix not supporting the cas model queries
Contributor

cas hasn't been defined yet.

a cache with cas paths in it is not clear yet.

## Garbage collection

Another major open issue is garbage collection of the aliases table. It's not
clear when entries should be deleted. The paths in the domain are "fake" so we
can't use them for expiration. The paths in the codomain could be used (i.e. if
a path is GC'ed, we delete the alias entries that map to it) but it's not clear
whether that's desirable since you may want to bring back the path via
substitution in the future.
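
As an illustration of the codomain-based option mentioned above (and only of that option, since the question is left open), the following Python sketch drops every alias entry whose target path has been garbage-collected. The function name `prune_aliases`, the dictionary representation of the aliases table, and the example hashes and paths are all assumptions made for this sketch.

```python
def prune_aliases(path_of: dict, live_paths: set) -> dict:
    """Keep only the alias entries whose target path survived garbage collection."""
    return {dhash: path for dhash, path in path_of.items() if path in live_paths}


# Hypothetical aliases table: derivation hash -> content-addressed path.
path_of = {
    "dhash-a": "/nix/store/chash1-foo",
    "dhash-b": "/nix/store/chash1-foo",   # two derivations, same output
    "dhash-c": "/nix/store/chash2-bar",
}

# Suppose the collector kept only chash1-foo.
remaining = prune_aliases(path_of, live_paths={"/nix/store/chash1-foo"})
print(remaining)
```

The drawback noted in the text is visible here: once the `dhash-c` entry is gone, the knowledge that this derivation produced `chash2-bar` is lost, even though the path could have been brought back by substitution.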
Member

I recommend we might just store the "generating" mappings from almost-resolved ca input paths (all deps resolved) to output paths, as this will require far less space. OTOH it makes garbage collection trickier as now all mappings in the build closure are needed to recover a maximum-unresolved input path to map.

Member

OTOH we can just do that with tracing GC. We just read the table backwards, saying each derivation in the codomain references everything in the domain that maps to it, and then look those up in turn.


# Future work

[future]: #future-work

This RFC tries as much as possible to provide a solid foundation for building
ca paths with Nix, leaving as much room as possible for future extensions.
Contributor

Suggested change
ca paths with Nix, leaving as much room as possible for future extensions.
CA paths with Nix, leaving as much room as possible for future extensions.

In particular:

- Add some path-rewriting to allow derivations with self-references to be built
as ca
- Consolidate the caching model to allow non-deterministic derivations to be
built as ca
- (hopefully, one day) make the CA model the default one in Nix
- Investigate the consequences in terms of privilege requirements
- Build a trust model on top of the content-addressed model to share store paths
Member

reference the reserved truster field from here


[rfc 0017]: https://github.com/NixOS/rfcs/pull/17
[nixphd]: https://nixos.org/~eelco/pubs/phd-thesis.pdf