diff --git a/docs/adr/0010-repository-library-design.md b/docs/adr/0010-repository-library-design.md
new file mode 100644
index 0000000000..0673063e89
--- /dev/null
+++ b/docs/adr/0010-repository-library-design.md
@@ -0,0 +1,136 @@
+# Repository library design built on top of Metadata API
+
+
+## Context and Problem Statement
+
+The Metadata API provides a modern Python API for accessing individual pieces
+of metadata. It does not provide any wider guidance to someone looking to
+implement a TUF repository.
+
+The legacy python-tuf implementation offers tools for this but suffers from
+some issues (as do many other implementations):
+* There is a _very_ large amount of code to maintain: repo.py,
+  repository_tool.py and repository_lib.py alone are almost 7000 lines of code.
+* The "library like" parts of the implementation do not form a good coherent
+  API: methods routinely have a large number of arguments, code still depends
+  on globals in a major way and the application (repo.py) still implements a
+  lot of "repository code" itself.
+* The "library like" parts of the implementation make decisions that look like
+  application decisions. As an example, repository_tool loads _every_ metadata
+  file in the repository: this is fine for a CLI that operates on a small
+  repository but is unlikely to be a good choice for a large scale server.
+
+
+## Decision Drivers
+
+* There is a consensus on removing the legacy code from python-tuf due to
+  maintainability issues
+* Metadata API makes modifying metadata far easier than the legacy code base:
+  this makes significantly different designs possible
+* Not providing a "repository library" (and leaving implementers on their own)
+  may be a short term solution because of the previous point, but to make
+  adoption easier and to help adopters create safe implementations the project
+  would benefit from some shared repository code and a shared repository design
+* Maintainability of new library code must be a top concern
+* Allowing a wide range of repository implementations (from CLI tools to
+  minimal in-memory implementations to large scale application servers)
+  would be good: unfortunately these can have wildly differing requirements
+
+
+## Considered Options
+
+1. No repository packages
+2. repository_tool -like API
+3. Minimal repository abstraction
+
+
+## Decision Outcome
+
+Option 3: Minimal repository abstraction
+
+While option 1 might be used temporarily, the goal should be to implement a
+minimal repository abstraction as soon as possible: this should give the
+project a path forward where the maintenance burden is reasonable and results
+should be usable very soon. The python-tuf repository functionality can be
+later extended as ideas are experimented with in upstream projects and in
+python-tuf example code.
+
+The concept is still unproven but validating the design should be
+straightforward: the decision could be re-evaluated in a few months if not in weeks.
+
+
+## Pros and Cons of the Options
+
+### No repository packages
+
+Metadata API makes editing the repository content vastly simpler. There are
+already repository implementations built with it[^1] so clearly a repository
+library is not an absolute requirement.
+
+Not providing repository packages in python-tuf does mean that external
+projects could experiment and create implementations without adding to the
+maintenance burden of python-tuf. This would be the easiest way to iterate many
+different designs and hopefully find good ones in the end.
+
+That said, there are some tricky parts of repository maintenance (e.g.
+initialization, snapshot update, hashed bin management) that would benefit from
+having a canonical implementation, both for easier adoption of python-tuf and
+as a reference for other implementations. Likewise, a well designed library
+could make some repeated actions (e.g. version bumps, expiry updates, signing)
+much easier to manage.
+
+### repository_tool -like API
+
+It won't be possible to support the repository_tool API as it is but a similar
+one would certainly be an option.
+
+This would likely be the easiest upgrade path for any repository_tool users out
+there. The implementation would not be a huge amount of work as Metadata API
+makes many things easier.
+
+However, repository_tool (and parts of repo.py) are not a great API. It is
+likely that a similar API suffers from some of the same issues: it might end up
+being a substantial amount of code that is only a good fit for one application.
+
+### Minimal repository abstraction
+
+python-tuf could define a tiny repository API that
+* provides carefully selected core functionality (like core snapshot update)
+* does not implement all repository actions itself, instead it makes it easy
+  for the application code to do them
+* leaves application details to specific implementations (examples of decisions
+  a library should not always make: "are targets stored with the repo?",
+  "which versions of metadata are stored?", "when to load metadata?", "when to
+  unload metadata?", "when to bump metadata version?", "what is the new expiry
+  date?", "which targets versions should be part of new snapshot?")
+
+python-tuf could also provide one or more implementations of this abstraction
+as examples -- this could include a _repo.py_- or _repository_tool_-like
+implementation.
+
+This could be a compromise that allows:
+* low maintenance burden on python-tuf: initial library could be tiny
+* sharing the important, canonical parts of a TUF repository implementation
+* ergonomic repository modification, meaning most actions do not have to be in
+  the core code
+* very different repository implementations using the same core code and the
+  same abstract API
+
+The approach does have some downsides:
+* it's not a drop-in replacement for repository_tool or repo.py
+* a prototype has been implemented (see Links below) but the concept is still
+  unproven
+
+More details in [Design document](../repository-library-design.md).
+
+## Links
+* [Design document for minimal repository abstraction](../repository-library-design.md)
+* [Prototype implementation of minimal repository abstraction](https://github.com/vmware-labs/repository-editor-for-tuf/)
+
+
+[^1]:
+    [RepositorySimulator](https://github.com/theupdateframework/python-tuf/blob/develop/tests/repository_simulator.py)
+    in python-tuf tests is an in-memory implementation, while
+    [repository-editor-for-tuf](https://github.com/vmware-labs/repository-editor-for-tuf)
+    is an external command line repository maintenance tool.
+
diff --git a/docs/adr/index.md b/docs/adr/index.md
index 54a9be0861..46d9d84b5d 100644
--- a/docs/adr/index.md
+++ b/docs/adr/index.md
@@ -14,6 +14,7 @@ This log lists the architectural decisions for tuf.
 - [ADR-0008](0008-accept-unrecognised-fields.md) - Accept metadata that includes unrecognized fields
 - [ADR-0009](0009-what-is-a-reference-implementation.md) - Primary purpose of the reference implementation
+- [ADR-0010](0010-repository-library-design.md) - Repository library design built on top of Metadata API
diff --git a/docs/repository-library-design-ownership.jpg b/docs/repository-library-design-ownership.jpg
new file mode 100644
index 0000000000..68eaafc8e4
Binary files /dev/null and b/docs/repository-library-design-ownership.jpg differ
diff --git a/docs/repository-library-design-usage.jpg b/docs/repository-library-design-usage.jpg
new file mode 100644
index 0000000000..9eb7ca711b
Binary files /dev/null and b/docs/repository-library-design-usage.jpg differ
diff --git a/docs/repository-library-design.md b/docs/repository-library-design.md
new file mode 100644
index 0000000000..5a9b0fde48
--- /dev/null
+++ b/docs/repository-library-design.md
@@ -0,0 +1,226 @@
+# Python-tuf repository API proposal: _minimal repository abstraction_
+
+This is an attachment to ADR 10: _Repository library design built on top of
+Metadata API_, and documents the design proposal in Dec 2021.
+
+## Design principles
+
+Primary goals of this repository library design are
+1. Support the full range of repository implementations: from command line
+   “repository editing” tools to production repositories like PyPI
+2. Provide canonical solutions for the difficult repository problems but avoid
+   making implementation decisions
+3. Keep python-tuf maintenance burden in mind: less is more
+
+Why does this design look so different from both legacy python-tuf code and
+other implementations?
+* Most existing implementations are focused on a specific use case (typically a
+  command line application): this is a valid design choice but severely limits
+  goal #1
+* The problem space contains many application decisions. Many implementations
+  solve this by creating functions with 15 arguments: this design tries to find
+  another way (#2)
+* The Metadata API makes modifying individual pieces of metadata simpler. This,
+  combined with good repository API design, should enable more variance in
+  where things are implemented: The repository library does not have to
+  implement every little detail as we can safely let specific implementations
+  handle things, see goal #3
+* This variance means we can start by implementing a minimal design: as
+  experience from implementations is collected, we can then move implementation
+  details into the library (goals #2, #3)
+
+## Design
+
+### Application and library components
+
+![Design: Application and library components](repository-library-design-ownership.jpg)
+
+The design expects a fully functional repository application to contain code at
+three levels:
+* Repository library (abstract classes that are part of python-tuf)
+  * The Repository abstract class provides an ergonomic abstract metadata
+    editing API for all code levels to use. It also provides implementations
+    for some core edit actions like _snapshot update_.
+  * A small amount of related functionality is also provided (private key
+    management API, maybe repository validation).
+  * The library is very small: possibly a few hundred lines of code.
+* Concrete Repository implementation (typically part of application code,
+  implements interfaces provided by the repository API in python-tuf)
+  * Contains the “application level” decisions that the Repository abstraction
+    requires to operate: examples of application decisions include
+    * _When should “targets” metadata next expire when it is edited?_
+    * _What is the current “targets” metadata version? Where do we load it
+      from?_
+    * _Where to store current “targets” after editing? Should the previous
+      version be deleted from storage?_
+* Actual application
+  * Uses the Repository API to do the repository actions it needs to do
+
+For context here’s a trivial example showing what “ergonomic editing” means --
+this key-adding code could be in the application (or later, if common patterns
+are found, in the python-tuf library):
+
+```python
+with repository.edit("targets") as targets:
+    # adds a key for role1 (as an example, arbitrary edits are allowed)
+    targets.add_key("role1", key)
+```
+
+This code loads current targets metadata for editing, adds the key to a role,
+and handles version and expiry bumps before persisting the new targets version.
+The reason for the context manager style is that it manages two things
+simultaneously:
+* Hides the complexity of loading and persisting metadata, and updating expiry
+  and versions from the editing code (by putting it in the repository
+  implementation that is defined in python-tuf but implemented by the
+  application)
+* Still allows completely arbitrary edits on the metadata in question: now the
+  library does not need to anticipate what the application wants to do and, on
+  the other hand, the library can still provide e.g. snapshot functionality
+  without knowing about the application decisions mentioned in previous point.
+
+Other designs do not seem to manage both of these.
+
+### How the components are used
+
+![Design: How components are used](repository-library-design-usage.jpg)
+
+The core idea here is that because editing is ergonomic enough, when new
+functionality (like “developer uploads new targets”) is added, _it can be added
+at any level_: the application might add a `handle_new_target_files()` method
+that adds a bunch of targets into the metadata, but one of the previous layers
+could offer that as a helper function as well: code in both cases would look
+similar as it would use the common editing interface.
+
+The proposed design is purposefully spartan in that the library provides
+very few high-level actions (the prototype only provided _sign_ and
+_snapshot_): everything else is left to the implementer at this point. As we
+gain experience of common usage patterns we can start providing other features
+as well.
+
+There are a few additional items worth mentioning:
+* Private key management: the Repository API should come with a “keyring
+  abstraction” -- a way for the application to provide roles’ private keys for
+  the Repository to use. Some implementations could be provided as well.
+* Validating repository state: the design is very much focused on enabling
+  efficient editing of individual metadata. Implementations are also likely to
+  be interested in validating (after some edits) that the repository is correct
+  according to client workflow and that it contains the expected changes. The
+  Repository API should provide some validation, but we should recognise that
+  validation may be implementation specific.
+* Improved metadata editing: There are a small number of improvements that
+  could be made to metadata editing. These do not necessarily need to be part
+  of the repository API: they could be part of Metadata API as well
+
+It would make sense for python-tuf to ship with at least one concrete
+Repository implementation: possibly a repo.py look-alike.
+This implementation should not be part of the library but an example.
+
+## Details
+
+This section includes links to a Proof of Concept implementation in
+[repository-editor-for-tuf](https://github.com/vmware-labs/repository-editor-for-tuf/):
+it should not be seen as the exact proposed API but a prototype of the ideas.
+
+The ideas in this document map to POC components like this:
+
+| Concept | repository-editor-for-tuf implementation |
+|-|-|
+| Repository API | [librepo/repo.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/librepo/repo.py), [librepo/keys.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/librepo/keys.py) |
+| Example of repository implementation | [git_repo.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/git_repo.py) |
+| Application code | [cli.py (command line app)](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/cli.py), [keys_impl.py (keyring implementation)](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/keys_impl.py) |
+| Repository validation | [verifier.py (very rough, not intended for python-tuf)](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/verifier.py) |
+| Improved Metadata editing | [helpers.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/helpers.py) |
+
+
+### Repository API
+
+Repository itself is a minimal abstract class: The value of this class is in
+defining the abstract method signatures (most importantly `_load()`, `_save()`,
+`edit()`) that enable ergonomic metadata editing. The Repository class in this
+proposal includes concrete implementations only for the following:
+* `sign()` -- signing without editing metadata payload
+* `snapshot()` -- updates snapshot and timestamp metadata based on given input.
+  Note that a concrete Repository implementation could provide an easier to use
+  snapshot that does not require input (see example in git_repo.py)
+
+More concrete method implementations (see cli.py for examples) could be added
+to Repository itself but none seem essential at this point.
+
+The current prototype API defines five abstract methods that take care of
+access to metadata storage, expiry updates, version updates and signing. These
+must be implemented in the concrete implementation:
+
+* **keyring()**: A property that returns the private key mapping that should be
+  used for signing.
+
+* **_load()**: Loads metadata from storage or cache. Is used by edit() and
+  sign().
+
+* **_save()**: Signs and persists metadata in cache/storage. Is used by edit()
+  and sign().
+
+* **edit()**: The ContextManager that enables ergonomic metadata
+  editing by handling expiry and version number management.
+
+* **init_role()**: Initializes new metadata, handling expiry and version number.
+  (_init_role is in a way a special case of edit and should potentially be
+  integrated there_).
+
+The API requires a “Keyring” abstraction that the repository code can use to
+look up a set of signers for a specific role. Specific implementations of
+Keyring could include a file-based keyring for testing, env-var keyring for CI
+use, etc. Some implementations should be provided in the python-tuf code base
+and more could be implemented in applications.
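To make the split between library and application concrete, here is a minimal sketch of the abstraction described above. It is not the prototype's code and does not use the real Metadata API: `FakeMetadata`, `InMemoryRepository` and the integer "expiry" are hypothetical stand-ins, and the keyring is reduced to a mapping from role name to key names. Only the method names `keyring`, `_load()`, `_save()` and `edit()` are taken from this document.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class FakeMetadata:
    """Hypothetical stand-in for a Metadata API object."""
    version: int = 0
    expires: int = 0  # placeholder for a real expiry date
    keys: Dict[str, List[str]] = field(default_factory=dict)
    signed_by: List[str] = field(default_factory=list)

    def add_key(self, role: str, key: str) -> None:
        self.keys.setdefault(role, []).append(key)


class Repository(ABC):
    """Library side: defines the editing interface, decides nothing."""

    @property
    @abstractmethod
    def keyring(self) -> Dict[str, List[str]]:
        """Mapping of role name to the signers available for it."""

    @abstractmethod
    def _load(self, role: str) -> FakeMetadata:
        """Load the current metadata for role from storage or cache."""

    @abstractmethod
    def _save(self, role: str, md: FakeMetadata) -> None:
        """Sign and persist metadata for role."""

    @abstractmethod
    def edit(self, role: str):
        """Context manager that yields metadata for arbitrary edits."""


class InMemoryRepository(Repository):
    """Application side: makes the storage, version and expiry decisions."""

    def __init__(self, keyring: Dict[str, List[str]]) -> None:
        self._keyring = keyring
        self._store: Dict[str, FakeMetadata] = {}

    @property
    def keyring(self) -> Dict[str, List[str]]:
        return self._keyring

    def _load(self, role: str) -> FakeMetadata:
        # Hand out a copy so an aborted edit never touches stored metadata.
        return deepcopy(self._store.get(role, FakeMetadata()))

    def _save(self, role: str, md: FakeMetadata) -> None:
        # "Signing" here only records which keyring entries were used.
        md.signed_by = list(self.keyring.get(role, []))
        self._store[role] = md

    @contextmanager
    def edit(self, role: str) -> Iterator[FakeMetadata]:
        md = self._load(role)
        yield md  # the caller makes arbitrary edits here
        # Application decisions: bump version by one, extend expiry by 7 units.
        md.version += 1
        md.expires += 7
        self._save(role, md)


repo = InMemoryRepository(keyring={"targets": ["targets-key"]})
with repo.edit("targets") as targets:
    targets.add_key("role1", "key-abc")

saved = repo._load("targets")
print(saved.version, saved.keys, saved.signed_by)
```

Because the version bump and `_save()` call sit after the `yield`, an exception raised inside the `with` block skips them, so a failed edit leaves the stored metadata unchanged. A real implementation would make its own choice here.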
+
+_Prototype status: Prototype Repository and Keyring abstractions exist in
+librepo/repo.py._
+
+### Example concrete Repository implementation
+
+The design decisions that the included example `GitRepository` makes are not
+important but provide an example of what is possible:
+* Metadata versions are stored in files in git, with filenames that allow
+  serving the metadata directory as is over HTTP
+* Version bumps are made based on git status (so edits in staging area only
+  bump version once)
+* “Current version” when loading metadata is decided based on filenames on disk
+* Files are removed once they are no longer part of the snapshot (to keep
+  directory uncluttered)
+* Expiry times are decided based on an application specific metadata field
+* Private keys can be stored in a file or in environment variables (for CI use)
+
+Note that GitRepository implementation is significantly larger than the
+Repository interface -- but all of the complexity in GitRepository is really
+related to the design decisions made there.
+
+_Prototype status: The GitRepository example exists in git_repo.py._
+
+### Validating repository state
+
+This is mostly undesigned but something built on top of TrustedMetadataSet
+(currently ngclient component) might work as a way to easily check specific
+aspects like:
+* Is top-level metadata valid according to client workflow
+* Is a role included in the snapshot and the delegation tree
+
+It’s likely that different implementations will have different needs though: a
+command line app for small repos might want to validate loading all metadata
+into memory, but a server application hosting tens of thousands of pieces of
+metadata is unlikely to do so.
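As an illustration of the second check listed above, the following sketch walks a delegation tree and reports roles that the snapshot does not list. It does not use TrustedMetadataSet: the plain dictionaries and the `roles_missing_from_snapshot` helper are hypothetical, and the sketch assumes snapshot entries are named `<role>.json` as in the TUF specification.

```python
from typing import Dict, List, Set


def roles_missing_from_snapshot(
    delegations: Dict[str, List[str]],
    snapshot_meta: Set[str],
    top_role: str = "targets",
) -> List[str]:
    """Return roles reachable from top_role that the snapshot does not list.

    delegations maps a delegating role name to the roles it delegates to.
    snapshot_meta is the set of file names in the snapshot's "meta" field.
    """
    missing: List[str] = []
    seen: Set[str] = set()
    to_visit = [top_role]
    while to_visit:
        role = to_visit.pop()
        if role in seen:
            continue  # delegation graphs may contain cycles
        seen.add(role)
        if f"{role}.json" not in snapshot_meta:
            missing.append(role)
        to_visit.extend(delegations.get(role, []))
    return sorted(missing)


delegations = {"targets": ["role1", "role2"], "role1": ["role3"]}
snapshot_meta = {"targets.json", "role1.json", "role3.json"}
print(roles_missing_from_snapshot(delegations, snapshot_meta))
```

A small command line tool could run a check like this over all metadata after every edit, while a large server would more likely restrict it to the roles touched by a given change.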
+
+_Prototype status: A very rough implementation exists in verifier.py: this is
+unlikely to be very useful._
+
+### Improved metadata editing
+
+Currently the identified improvement areas are:
+* Metadata initialization: this could potentially be improved by adding
+  default argument values to Metadata API constructors
+* Modifying and looking up data about roles in delegating metadata
+  (root/targets): they do similar things but root and targets do not have
+  an identical API. This may be a very specific use case and not interesting
+  for some applications
+
+_Prototype status: Some potential improvements have been collected in
+helpers.py._
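The second improvement area can be illustrated with a small helper that hides the structural difference between root and targets. This is only a sketch over plain dictionaries in the TUF metadata JSON layout (root keeps roles in a mapping, targets keeps delegated roles in a list); `get_role_keyids` is a hypothetical name and does not reflect the contents of helpers.py.

```python
from typing import Any, Dict, List


def get_role_keyids(signed: Dict[str, Any], role: str) -> List[str]:
    """Return the keyids of a delegated role from root or targets metadata."""
    if signed["_type"] == "root":
        # root: "roles" is a mapping of role name to role info
        return list(signed["roles"][role]["keyids"])
    # targets: "delegations" -> "roles" is a list of role objects
    for delegated in signed.get("delegations", {}).get("roles", []):
        if delegated["name"] == role:
            return list(delegated["keyids"])
    raise KeyError(f"{role} is not delegated by this metadata")


root_signed = {
    "_type": "root",
    "roles": {"snapshot": {"keyids": ["abc"], "threshold": 1}},
}
targets_signed = {
    "_type": "targets",
    "delegations": {
        "keys": {},
        "roles": [{"name": "role1", "keyids": ["def"], "threshold": 1}],
    },
}
print(get_role_keyids(root_signed, "snapshot"))
print(get_role_keyids(targets_signed, "role1"))
```

With a helper of this shape, calling code does not need to branch on the type of the delegating metadata.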