Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC 0017] Intensional Store #17

Draft
wants to merge 11 commits into
base: master
Choose a base branch
from
110 changes: 110 additions & 0 deletions rfcs/0017-intensional-store.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
---
feature: intensional_store
start-date: 2017-08-11
author: [email protected]
co-authors: (find a buddy later to help our with the RFC)
wmertens marked this conversation as resolved.
Show resolved Hide resolved
related-issues: (will contain links to implementation PRs)
wmertens marked this conversation as resolved.
Show resolved Hide resolved
---

# Summary
[summary]: #summary

One paragraph explanation of the feature.
wmertens marked this conversation as resolved.
Show resolved Hide resolved

# Motivation
[motivation]: #motivation
wmertens marked this conversation as resolved.
Show resolved Hide resolved

* Better re-use of inputs between compiles
* Faster updates via nix-channel
* Less compiling
* More benefit from reproducible compiles, so more reason to work on that

# Detailed design
[design]: #detailed-design

Terms used:
* derivation: a `nix-build` output product depending on some inputs and resulting in a file or directory under `/nix/store`
* dependent derivation: a derivation built using the currently considered derivation
* `$out`: name of the location where a derivation is installed first, e.g., `zyb6qaasr5yhh2r4484x01cy87xzddn7-unit-script-1.12`
* calculated based on the hashes of all the inputs, including build tools
* `$cas`: output hash, the total hash of all the files under $out, with the derivation name appended, e.g., `qqzyb6bsr5yhh2r5624x01cy87xzn7aa-unit-script-1.12`

## Concept

The basic concept is aliasing equivalent input derivations in such a way that dependent derivations won't need to change if only `$out` changes but not the input derivation contents.

After building a derivation, `$cas` is calculated, and `$out` is renamed to `$cas`. Then, if another build requires the input `$out`, it gets `$cas` instead, and all references to that build input will be `$cas` instead of `$out`. That dependent derivation will also have its input hash calculated with the `$cas` instead of the `$out`.

This means that if 2 different derivations of the same input have a different `$out` but the same `$cas`, any dependent builds will not need to rebuild due to the inputs being different. For example, the 12MB input `poppler-data` is often the same across multiple different input derivations, so many `$out`s for `poppler-data` all result in the same `$cas`. Similarly, a compiler flag change might leave most derivations unchanged.

In order to know which `$out`s refer to a particular `$cas`, symlinks can be used (`$out` pointing to `$cas`), or that data can be stored in the store database. The database can help with doing reverse lookups from `$cas` to all the `$out`s. Using symlinks will have a benefit of handling the case when self-references that are not discoverable via grep (e.g. filtered by ```xxd(1)```).
wmertens marked this conversation as resolved.
Show resolved Hide resolved

## Calculating `$cas`

There is one important corner case that needs special handling: if a derivation refers to itself, it will be referring to `$out`, because `$cas` is not known at the time of the build. This means that each `$out` of a furthermore equivalent build would have a different hash, due to the different `$out`s.
wmertens marked this conversation as resolved.
Show resolved Hide resolved

To fix this, the `$cas` calculation has to replace all occurrences of `$out` with an equal-length string of (for example) NULL bytes. After that, `$out` is renamed to `$cas` and all occurrences of `$out` are replaced with `$cas`.
wmertens marked this conversation as resolved.
Show resolved Hide resolved
wmertens marked this conversation as resolved.
Show resolved Hide resolved

This also means that `$out` and `$cas` should have the same length. The easiest way to achieve that is to use the same hash function for the output hash as used for the input hashes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work if the references are are hidden for example an executable compressed with PEX or a java jar file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those edge cases symlinks would be nice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'd prefer that things fail so we can find the hidden reference and find a permanent fix for that particular problem.

As for IPFS: ah yes, looks like there would have to be a second translation from $cas to $ipfs, but that would be trustless because you can check the downloaded $cas.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically I just assume that for any build, we'll find a way to make it build in a different location than the final install location.


To calculate `$cas` we need to include all the data that uniquely defines a derivation: the file contents, case-sensitive names, and the permission bits, traversed in a fixed order, no matter what the filesystem or platform. Not to be included are the owning `uid` of the store and timestamps.

## Distributing derivations

Since `$cas` is only known when `$out` is built, binary caches would need to retain that information. When you look up `$out` to see if it was built already, the response should be _"Yes, this is available as `$cas`"_.
wmertens marked this conversation as resolved.
Show resolved Hide resolved

## Maintaining the Nix store

When garbage collecting, the Nix store should also remove `$out` references (be they symlinks or db entries) when removing a `$cas`.

## Micro-optimizations not worth considering
wmertens marked this conversation as resolved.
Show resolved Hide resolved

* By stripping the version from `$cas`, it could be the same for multiple versions of the same derivation.
* However, increased version numbers mean the derivation actually changed, so there is no point in doing that.
* By stripping the name and the version from `$cas`, it could be the same for multiple different derivations.
* However, this makes it hard to find out what derivation a certain `$cas` is
wmertens marked this conversation as resolved.
Show resolved Hide resolved
* Furthermore, different inputs with the same contents are very unlikely, and there is no reduction in builds that need to be done.
wmertens marked this conversation as resolved.
Show resolved Hide resolved

Finally, `nix-store` supports hardlinking duplicate files, so the above optimizations are useless.
wmertens marked this conversation as resolved.
Show resolved Hide resolved

# Drawbacks
[drawbacks]: #drawbacks
wmertens marked this conversation as resolved.
Show resolved Hide resolved

* Extra code to maintain
* Slightly more processing after a build
wmertens marked this conversation as resolved.
Show resolved Hide resolved

# Alternatives
[alternatives]: #alternatives

* No change: This is only an optimization, it won't change the fundamental working of Nix in any way
wmertens marked this conversation as resolved.
Show resolved Hide resolved

# Unresolved questions
[unresolved]: #unresolved-questions

* Whether to store mappings as symlinks or db entries
* Exactly how the Hydra protocol needs to be changed

wmertens marked this conversation as resolved.
Show resolved Hide resolved
# Future work
[future]: #future-work

## Input-agnostic derivations

If a derivation with a new input is the same except that it has a changed reference to that input (e.g., a script referring to its interpreter, or a binary using a new library version), we call this an input-agnostic derivation for those two input versions (old and new input).

* To detect this, calculate the hash over the derivation, replacing *all* input references with NULL bytes. If that resulting hash is the same as a previous derivation, it is input-agnostic for those versions.
* This means that instead of downloading for installing it, it could be patched together from the previous version, by patching the old input `$cas`s with the new `$cas`s.
* This could keep storage and network traffic for Hydra down, by storing the previous `$cas` and the strings that need to be patched.

### …and beyond:

Knowing this also could enable a building shortcut: If a dependent derivation needs rebuilding, and a previous version is available depending on an input-agnostic derivation, it could be generated by patching in the new `$cas`.

This will not always work, i.e., when the input-agnostic derivation is used to copy data from the input it is agnostic over, it results in a change besides the input reference.

Therefore, this optimization should be optional, defaulting to off.

## Reproducible builds

If two derivations are the same except for some irrelevant build-environment changes, they won't get the same `$cas`. Since this impacts rebuilds, there is more incentive to have fully reproducible builds.

Hopefully this means we'll have it at some point, so we can crowd-source `$out` to `$cas` mappings by trusting many systems that get the same result.
wmertens marked this conversation as resolved.
Show resolved Hide resolved