Merge pull request #1693 from jku/add-repo-lib-design-adr

ADR: Add New repository library design
theupdateframework · Dec 8, 2021 · b2d8572
2 parents 942e6d2 + f6ede42

Showing 5 changed files with 363 additions and 0 deletions.
diff --git a/docs/adr/0010-repository-library-design.md b/docs/adr/0010-repository-library-design.md
@@ -0,0 +1,136 @@
+# Repository library design built on top of Metadata API
+
+
+## Context and Problem Statement
+
+The Metadata API provides a modern Python API for accessing individual pieces
+of metadata. It does not provide any wider context to help someone looking to
+implement a TUF repository.
+
+The legacy python-tuf implementation offers tools for this but suffers from
+some issues (as do many other implementations):
+* There is a _very_ large amount of code to maintain: repo.py,
+  repository_tool.py and repository_lib.py alone are almost 7000 lines of code.
+* The "library like" parts of the implementation do not form a good coherent
+  API: methods routinely have a large number of arguments, code still depends
+  on globals in a major way, and the application (repo.py) still implements a
+  lot of "repository code" itself.
+* The "library like" parts of the implementation make decisions that look like
+  application decisions. As an example, repository_tool loads _every_ metadata
+  file in the repository: this is fine for a CLI that operates on a small
+  repository but is unlikely to be a good choice for a large scale server.
+
+
+## Decision Drivers
+
+* There is a consensus on removing the legacy code from python-tuf due to
+  maintainability issues
+* Metadata API makes modifying metadata far easier than the legacy code base:
+  this makes significantly different designs possible
+* Not providing a "repository library" (and leaving implementers on their own)
+  may be a short term solution because of the previous point, but to make
+  adoption easier and to help adopters create safe implementations the project
+  would benefit from some shared repository code and a shared repository design 
+* Maintainability of new library code must be a top concern
+* Allowing a wide range of repository implementations (from CLI tools to
+  minimal in-memory implementations to large scale application servers)
+  would be good: unfortunately these can have wildly differing requirements
+
+
+## Considered Options
+
+1. No repository packages
+2. repository_tool -like API
+3. Minimal repository abstraction
+
+
+## Decision Outcome
+
+Option 3: Minimal repository abstraction
+
+While option 1 might be used temporarily, the goal should be to implement a
+minimal repository abstraction as soon as possible: this should give the
+project a path forward where the maintenance burden is reasonable and results
+should be usable very soon. The python-tuf repository functionality can be
+later extended as ideas are experimented with in upstream projects and in
+python-tuf example code.
+
+The concept is still unproven but validating the design should be
+straightforward: the decision could be re-evaluated in a few months, if not in
+weeks.
+
+
+## Pros and Cons of the Options
+
+### No repository packages
+
+Metadata API makes editing the repository content vastly simpler. There are
+already repository implementations built with it[^1] so clearly a repository
+library is not an absolute requirement.
+
+Not providing repository packages in python-tuf does mean that external
+projects could experiment and create implementations without adding to the
+maintenance burden of python-tuf. This would be the easiest way to iterate many
+different designs and hopefully find good ones in the end.
+
+That said, there are some tricky parts of repository maintenance (e.g.
+initialization, snapshot update, hashed bin management) that would benefit from
+having a canonical implementation, both for easier adoption of python-tuf and
+as a reference for other implementations. Likewise, a well designed library
+could make some repeated actions (e.g. version bumps, expiry updates, signing)
+much easier to manage.
+
+### repository_tool -like API
+
+It won't be possible to support the repository_tool API as it is but a similar
+one would certainly be an option.
+
+This would likely be the easiest upgrade path for any repository_tool users out
+there. The implementation would not be a huge amount of work as Metadata API
+makes many things easier.
+
+However, repository_tool (and parts of repo.py) are not a great API. It is
+likely that a similar API suffers from some of the same issues: it might end up
+being a substantial amount of code that is only a good fit for one application.
+
+### Minimal repository abstraction
+
+python-tuf could define a tiny repository API that
+* provides carefully selected core functionality (like core snapshot update)
+* does not implement all repository actions itself, instead it makes it easy
+  for the application code to do them
+* leaves application details to specific implementations (examples of decisions
+  a library should not always decide: "are targets stored with the repo?",
+  "which versions of metadata are stored?", "when to load metadata?", "when to
+  unload metadata?", "when to bump metadata version?", "what is the new expiry
+  date?", "which targets versions should be part of new snapshot?")
+
+python-tuf could also provide one or more implementations of this abstraction
+as examples -- this could include a _repo.py_- or _repository_tool_-like
+implementation.
+
+This could be a compromise that allows:
+* low maintenance burden on python-tuf: initial library could be tiny
+* sharing the important, canonical parts of a TUF repository implementation
+* ergonomic repository modification, meaning most actions do not have to be in
+  the core code
+* very different repository implementations using the same core code and the
+  same abstract API
+
+The approach does have some downsides:
+* It is not a drop-in replacement for repository_tool or repo.py
+* A prototype has been implemented (see Links below) but the concept is still
+  unproven
+
+More details in [Design document](../repository-library-design.md).
+
+## Links
+* [Design document for minimal repository abstraction](../repository-library-design.md)
+* [Prototype implementation of minimal repository abstraction](https://github.com/vmware-labs/repository-editor-for-tuf/)
+
+
+[^1]:
+    [RepositorySimulator](https://github.com/theupdateframework/python-tuf/blob/develop/tests/repository_simulator.py)
+    in python-tuf tests is an in-memory implementation, while
+    [repository-editor-for-tuf](https://github.com/vmware-labs/repository-editor-for-tuf)
+    is an external command line repository maintenance tool.
+
diff --git a/docs/adr/index.md b/docs/adr/index.md
@@ -14,6 +14,7 @@ This log lists the architectural decisions for tuf.
 
 - [ADR-0008](0008-accept-unrecognised-fields.md) - Accept metadata that includes unrecognized fields
 - [ADR-0009](0009-what-is-a-reference-implementation.md) - Primary purpose of the reference implementation
+- [ADR-0010](0010-repository-library-design.md) - Repository library design built on top of Metadata API
 
 <!-- adrlogstop -->
 

diff --git a/docs/repository-library-design-ownership.jpg b/docs/repository-library-design-ownership.jpg
diff --git a/docs/repository-library-design-usage.jpg b/docs/repository-library-design-usage.jpg
diff --git a/docs/repository-library-design.md b/docs/repository-library-design.md
@@ -0,0 +1,226 @@
+# Python-tuf repository API proposal: _minimal repository abstraction_
+
+This is an attachment to ADR 10: _Repository library design built on top of
+Metadata API_, and documents the design proposal in Dec 2021.
+
+## Design principles
+
+The primary goals of this repository library design are:
+1. Support the full range of repository implementations: from command line
+   “repository editing” tools to production repositories like PyPI
+2. Provide canonical solutions for the difficult repository problems but avoid
+   making implementation decisions
+3. Keep python-tuf maintenance burden in mind: less is more
+
+Why does this design look so different from both legacy python-tuf code and
+other implementations?
+* Most existing implementations are focused on a specific use case (typically a
+  command line application): this is a valid design choice but severely limits
+  goal #1
+* The problem space contains many application decisions. Many implementations
+  solve this by creating functions with 15 arguments: this design tries to find
+  another way (#2)
+* The Metadata API makes modifying individual pieces of metadata simpler. This,
+  combined with good repository API design, should enable more variance in
+  where things are implemented: The repository library does not have to
+  implement every little detail as we can safely let specific implementations
+  handle things, see goal #3
+* This variance means we can start by implementing a minimal design: as
+  experience from implementations is collected, we can then move implementation
+  details into the library (goals #2, #3)
+
+## Design
+
+### Application and library components
+
+![Design: Application and library components](repository-library-design-ownership.jpg)
+
+The design expects a fully functional repository application to contain code at
+three levels:
+* Repository library (abstract classes that are part of python-tuf)
+  * The Repository abstract class provides an ergonomic abstract metadata
+    editing API for all code levels to use. It also provides implementations
+    for some core edit actions like _snapshot update_.
+  * A small amount of related functionality is also provided (private key
+    management API, maybe repository validation).
+  * The library is very small: possibly a few hundred lines of code.
+* Concrete Repository implementation (typically part of application code,
+  implements interfaces provided by the repository API in python-tuf)
+  * Contains the “application level” decisions that the Repository abstraction
+    requires to operate: examples of application decisions include
+    * _When should “targets” metadata next expire when it is edited?_
+    * _What is the current “targets” metadata version? Where do we load it 
+      from?_
+    * _Where to store current “targets” after editing? Should the previous
+      version be deleted from storage?_
+* Actual application
+  * Uses the Repository API to do the repository actions it needs to do
+
+For context here’s a trivial example showing what “ergonomic editing” means --
+this key-adding code could be in the application (or later, if common patterns
+are found, in the python-tuf library):
+
+```python
+with repository.edit("targets") as targets:
+    # adds a key for role1 (as an example, arbitrary edits are allowed)
+    targets.add_key("role1", key)
+```
+
+This code loads current targets metadata for editing, adds the key to a role,
+and handles version and expiry bumps before persisting the new targets version.
+The reason for the context manager style is that it manages two things
+simultaneously:
+* Hides the complexity of loading and persisting metadata, and updating expiry
+  and versions from the editing code (by putting it in the repository
+  implementation that is defined in python-tuf but implemented by the
+  application)
+* Still allows completely arbitrary edits on the metadata in question: the
+  library does not need to anticipate what the application wants to do, and
+  the library can still provide e.g. snapshot functionality without knowing
+  about the application decisions mentioned in the previous point.
+
+Other designs do not seem to manage both of these.
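To make the context manager idea concrete, here is a minimal in-memory sketch. It is not the prototype's code: `ToyMetadata` and `ToyRepository` are hypothetical stand-ins invented for this example, and a real implementation would load, sign and persist actual Metadata objects and would also update expiry.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class ToyMetadata:
    """Hypothetical stand-in for one piece of TUF metadata."""

    version: int = 0
    keys: Dict[str, List[str]] = field(default_factory=dict)

    def add_key(self, role: str, key: str) -> None:
        self.keys.setdefault(role, []).append(key)


class ToyRepository:
    """Hypothetical in-memory repository, for illustration only."""

    def __init__(self) -> None:
        self._store: Dict[str, ToyMetadata] = {"targets": ToyMetadata(version=1)}

    @contextmanager
    def edit(self, role: str) -> Iterator[ToyMetadata]:
        current = self._store[role]
        # Edit a copy so that a failed edit leaves the stored metadata untouched
        draft = ToyMetadata(
            current.version, {r: list(k) for r, k in current.keys.items()}
        )
        yield draft
        # Reached only if the with-block raised no exception:
        # bump the version and persist the edited copy
        draft.version += 1
        self._store[role] = draft
```

With this sketch, `with repo.edit("targets") as targets: targets.add_key("role1", key)` stores a new version only if the block finishes without an exception; the editing code itself never touches versions or storage.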
+
+### How the components are used
+
+![Design: How components are used](repository-library-design-usage.jpg)
+
+The core idea here is that because editing is ergonomic enough, when new
+functionality (like “developer uploads new targets”) is added, _it can be added
+at any level_: the application might add a `handle_new_target_files()` method
+that adds a bunch of targets into the metadata, but one of the previous layers
+could offer that as a helper function as well: code in both cases would look
+similar as it would use the common editing interface.
+
+The proposed design is purposefully spartan in that the library provides
+very few high-level actions (the prototype only provided _sign_ and
+_snapshot_): everything else is left to the implementer at this point. As we gain
+experience of common usage patterns we can start providing other features as
+well.
+
+There are a few additional items worth mentioning:
+* Private key management: the Repository API should come with a “keyring
+  abstraction” -- a way for the application to provide roles’ private keys for
+  the Repository to use. Some implementations could be provided as well.
+* Validating repository state: the design is very much focused on enabling
+  efficient editing of individual metadata. Implementations are also likely to
+  be interested in validating (after some edits) that the repository is correct
+  according to client workflow and that it contains the expected changes. The
+  Repository API should provide some validation, but we should recognise that
+  validation may be implementation specific.
+* Improved metadata editing: There are a small number of improvements that
+  could be made to metadata editing. These do not necessarily need to be part
+  of the repository API: they could be part of Metadata API as well
+
+It would make sense for python-tuf to ship with at least one concrete
+Repository implementation: possibly a repo.py look alike. This implementation
+should not be part of the library but an example.
+
+## Details
+
+This section includes links to a Proof of Concept implementation in
+[repository-editor-for-tuf](https://github.com/vmware-labs/repository-editor-for-tuf/):
+it should not be seen as the exact proposed API but a prototype of the ideas.
+
+The ideas in this document map to POC components like this:
+
+| Concept | repository-editor-for-tuf implementation |
+|-|-|
+| Repository API | [librepo/repo.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/librepo/repo.py), [librepo/keys.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/librepo/repo.py) |
+| Example of repository implementation | [git_repo.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/git_repo.py) |
+| Application code | [cli.py (command line app)](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/cli.py), [keys_impl.py (keyring implementation)](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/keys_impl.py) |
+| Repository validation | [verifier.py (very rough, not intended for python-tuf)](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/verifier.py) |
+| Improved Metadata editing | [helpers.py](https://github.com/vmware-labs/repository-editor-for-tuf/blob/main/tufrepo/helpers.py) |
+
+
+### Repository API
+
+Repository itself is a minimal abstract class: the value of this class is in
+defining the abstract method signatures (most importantly `_load()`, `_save()`,
+`edit()`) that enable ergonomic metadata editing. The Repository class in this
+proposal includes concrete implementations only for the following:
+* `sign()` -- signing without editing metadata payload
+* `snapshot()` -- updates snapshot and timestamp metadata based on given input.
+  Note that a concrete Repository implementation could provide an easier to use
+  snapshot that does not require input (see example in git_repo.py)
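As a rough illustration of a snapshot update "based on given input", the sketch below models snapshot and timestamp as plain dictionaries and takes the current targets metadata versions as input. The function name `update_snapshot` and the dictionary layout are assumptions made for this example, not the prototype's API.

```python
from typing import Dict


def update_snapshot(
    snapshot: dict, timestamp: dict, current_versions: Dict[str, int]
) -> bool:
    """Update toy snapshot/timestamp dicts; return True if anything changed."""
    if snapshot["meta"] == current_versions:
        # Nothing new to record: leave both roles untouched
        return False

    # Record the new targets metadata versions in a new snapshot version
    snapshot["meta"] = dict(current_versions)
    snapshot["version"] += 1

    # Timestamp must point at the new snapshot version
    timestamp["snapshot_version"] = snapshot["version"]
    timestamp["version"] += 1
    return True
```

A concrete Repository could compute `current_versions` itself (for example from its storage) and wrap a function like this, which is one way to offer a snapshot that needs no input from the application.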
+
+More concrete method implementations (see cli.py for examples) could be added
+to Repository itself but none seem essential at this point.
+
+The current prototype API defines five abstract methods that take care of
+access to metadata storage, expiry updates, version updates and signing. These
+must be implemented in the concrete implementation:
+
+* **keyring()**: A property that returns the private key mapping that should be
+  used for signing.
+
+* **_load()**: Loads metadata from storage or cache. Is used by edit() and
+  sign().
+
+* **_save()**: Signs and persists metadata in cache/storage. Is used by edit()
+  and sign().
+
+* **edit()**: The ContextManager that enables ergonomic metadata
+  editing by handling expiry and version number management.
+
+* **init_role()**: Initializes new metadata, handling expiry and version
+  number. (_init_role is in a way a special case of edit and should
+  potentially be integrated there_).
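The method list above could translate into an abstract base class roughly like the following. The method names come from this document; the argument lists, the type annotations and the way `sign()` is built from `_load()` and `_save()` are assumptions made for illustration and may differ from the prototype.

```python
from abc import ABC, abstractmethod
from contextlib import AbstractContextManager
from typing import Any, Dict, List


class Repository(ABC):
    """Sketch of the abstract interface; signatures are assumptions."""

    @property
    @abstractmethod
    def keyring(self) -> Dict[str, List[Any]]:
        """Return the private key (signer) mapping, keyed by role name."""

    @abstractmethod
    def _load(self, role: str) -> Any:
        """Load metadata for role from storage or cache."""

    @abstractmethod
    def _save(self, role: str, md: Any) -> None:
        """Sign and persist metadata for role."""

    @abstractmethod
    def edit(self, role: str) -> AbstractContextManager:
        """Return a context manager that yields editable metadata."""

    @abstractmethod
    def init_role(self, role: str) -> None:
        """Create initial metadata for role."""

    def sign(self, role: str) -> None:
        """Re-sign without editing the payload: load, then save."""
        self._save(role, self._load(role))
```

Because every storage, expiry and version decision sits behind the abstract members, a concrete method such as `sign()` can live in the library without knowing how the application stores metadata.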
+
+The API requires a “Keyring” abstraction that the repository code can use to
+look up a set of signers for a specific role. Specific implementations of
+Keyring could include a file-based keyring for testing, env-var keyring for CI
+use, etc. Some implementations should be provided in the python-tuf code base
+and more could be implemented in applications.
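An env-var keyring for CI could look something like the sketch below. Everything specific here is hypothetical: the `TUF_PRIVATE_KEY_` prefix, the upper-casing of role names and the comma-separated value format are invented for the example, and the values are returned as plain strings where a real keyring would return signer objects.

```python
import os
from collections.abc import Mapping
from typing import Iterator, List


class EnvVarKeyring(Mapping):
    """Read-only mapping of role name -> list of key strings."""

    PREFIX = "TUF_PRIVATE_KEY_"  # hypothetical naming convention

    def __init__(self, environ=None) -> None:
        # Allow injecting a dict for tests; default to the process environment
        self._environ = os.environ if environ is None else environ

    def __getitem__(self, role: str) -> List[str]:
        # Raises KeyError if no variable is defined for the role
        value = self._environ[self.PREFIX + role.upper()]
        return value.split(",")

    def __iter__(self) -> Iterator[str]:
        for name in self._environ:
            if name.startswith(self.PREFIX):
                yield name[len(self.PREFIX):].lower()

    def __len__(self) -> int:
        return sum(1 for _ in self)
```

Implementing the standard `Mapping` interface means repository code can treat any keyring (file-based, env-var, in-memory) like a read-only dictionary keyed by role name.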
+
+_Prototype status: Prototype Repository and Keyring abstractions exist in
+librepo/repo.py._
+
+### Example concrete Repository implementation
+
+The design decisions that the included example `GitRepository` makes are not
+important but provide an example of what is possible:
+* Metadata versions are stored in files in git, with filenames that allow
+  serving the metadata directory as is over HTTP
+* Version bumps are made based on git status (so edits in staging area only
+  bump version once)
+* “Current version” when loading metadata is decided based on filenames on disk
+* Files are removed once they are no longer part of the snapshot (to keep
+  directory uncluttered)
+* Expiry times are decided based on an application specific metadata field
+* Private keys can be stored in a file or in environment variables (for CI use)
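Deciding the current version from filenames can be sketched as follows. The `VERSION.ROLE.json` naming used here is an assumption (it matches the version-prefixed metadata filenames of TUF consistent snapshots), and `current_version` is a hypothetical helper rather than a function taken from git_repo.py.

```python
import re
from typing import Iterable, Optional


def current_version(filenames: Iterable[str], role: str) -> Optional[int]:
    """Return the highest version found for role, or None if there is none."""
    # Matches names such as "3.targets.json" for role "targets"
    pattern = re.compile(rf"^(\d+)\.{re.escape(role)}\.json$")
    versions = []
    for name in filenames:
        match = pattern.match(name)
        if match:
            versions.append(int(match.group(1)))
    return max(versions) if versions else None
```

For example, given `["1.targets.json", "2.targets.json", "timestamp.json"]` and the role `targets`, the helper returns `2`; versions are compared as integers, so `10` sorts after `9`.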
+
+Note that the GitRepository implementation is significantly larger than the
+Repository interface -- but all of the complexity in GitRepository is really
+related to the design decisions made there.
+
+_Prototype status: The GitRepository example exists in git_repo.py._
+
+### Validating repository state
+
+This is mostly undesigned but something built on top of TrustedMetadataSet
+(currently an ngclient component) might work as a way to easily check specific
+aspects like:
+* Is top-level metadata valid according to client workflow?
+* Is a role included in the snapshot and the delegation tree?
+
+It’s likely that different implementations will have different needs though: a
+command line app for small repos might want to validate by loading all
+metadata into memory, but a server application hosting tens of thousands of
+pieces of metadata is unlikely to do so.
+
+_Prototype status: A very rough implementation exists in verifier.py: this is
+unlikely to be very useful._
+
+### Improved metadata editing
+
+Currently the identified improvement areas are:
+* Metadata initialization: this could potentially be improved by adding
+  default argument values to Metadata API constructors 
+* Modifying and looking up data about roles in delegating metadata
+  (root/targets): root and targets do similar things here but do not have an
+  identical API. This may be a very specific use case and not interesting
+  for some applications
+
+_Prototype status: Some potential improvements have been collected in
+helpers.py_