Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record (TOC digest → DiffID) mapping in BlobInfoCache #2321

Merged
merged 11 commits into from
Jul 30, 2024

Conversation

mtrmac
Copy link
Collaborator

@mtrmac mtrmac commented Feb 29, 2024

A single DiffID may map to multiple TOC digest values. Record that in BlobInfoCache, and use it for layer reuse.

Also prefer reusing even TOC-matched layers by DiffID, when available.

@giuseppe I’d appreciate a preliminary review of the new logic; see individual commits.

Draft: The BlobInfoCache implementations don’t actually store/record any data yet — so this is obviously completely untested.

// UncompressedDigest returns an uncompressed digest corresponding to anyDigest.
// Returns "" if the uncompressed digest is unknown.
// FIXME: Does this need to record TOC/compression type?
UncompressedDigestForTOC(tocDigest digest.Digest) digest.Digest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TOC digest is the checksum of the uncompressed JSON document, so I think the compression should not matter in this case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we probably don’t need that right now (with GetTOCDigest refusing to work on manifests which contain multiple TOC digest annotations, and presumably with the zstd / estargz code being unable to decompress the other one).

This comment is a looking a bit more into the future, for lookups in the other direction, where we will want to look up (UncompressedDigest → (compressed digest, TOC digest, algorithm)) and match that against “the user wants the destination to contain zstd:chunked” (i.e. reject estargz matches).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for lookups in the other direction,

That will be done in a separate data structure (an extension of RecordDigestCompressorName: We need the full set of annotations for reuse of a TOC-compressed blob, so this simple mapping is not sufficient anyway. And the other structure does record the algorithm.

Copy link
Collaborator Author

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: This is code-complete but I want to test it in practice.

storage/storage_dest.go Outdated Show resolved Hide resolved
@mtrmac mtrmac force-pushed the chunked-bic branch 2 times, most recently from 2a542f7 to 9e3cace Compare April 24, 2024 18:26
@mtrmac
Copy link
Collaborator Author

mtrmac commented Apr 25, 2024

To test:

Before:

# podman rmi alpine level1 level9
# rm -f /var/lib/containers/cache/blob-info-cache-v1.sqlite 
# podman pull quay.io/libpod/alpine
# podman --log-level=debug push --compression-format zstd:chunked --compression-level 1 --force-compression quay.io/libpod/alpine localhost:50000/level1
## Even better would be to use two different destination registries, to be 100% certain the blobs are not reused
## (right now they are not reused, but we’ll fix that):
# podman --log-level=debug push --compression-format zstd:chunked --compression-level 9 --force-compression quay.io/libpod/alpine localhost:50000/level9
## Note the compressed digest, and TOC digest, values:
# skopeo inspect --raw docker://localhost:50000/level1 | jq .
# skopeo inspect --raw docker://localhost:50000/level9 | jq .
## No DigestTOCUncompressedPairs entries:
# sqlite3 /var/lib/containers/cache/blob-info-cache-v1.sqlite .dump 
# podman rmi alpine level1 level9
## Triggers a partial pull: "Applying differ in …":
# podman --log-level=debug pull localhost:50000/level1
## Triggers a partial pull: "Applying differ in …"
# podman --log-level=debug pull localhost:50000/level9 
## level1 and level9 have different image IDs:
# podman images 
## Contains two copies of the layer, with the same expected-layer-diffid
# jq . < /var/lib/containers/storage/overlay-layers/layers.json

After:

  • DigestTOCUncompressedPairs contains 2 records
  • Pull of level1 triggers a partial pull (creating a layer with known TOC digest and uncompressed digest)
  • Pull of level9 reuses the layer (by BIC compressed -> uncompressed mapping)
  • Both level1 and level9 images use the same image ID (which matches the image ID used by the original non-chunked alpine image)

@mtrmac mtrmac marked this pull request as ready for review July 18, 2024 21:38
@mtrmac mtrmac changed the title WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache Record (TOC digest → DiffID) mapping in BlobInfoCache Jul 18, 2024
@mtrmac mtrmac marked this pull request as draft July 18, 2024 23:24
@mtrmac mtrmac changed the title Record (TOC digest → DiffID) mapping in BlobInfoCache WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache Jul 18, 2024
@mtrmac mtrmac force-pushed the chunked-bic branch 2 times, most recently from 72db3f5 to 714108a Compare July 19, 2024 20:57
@mtrmac
Copy link
Collaborator Author

mtrmac commented Jul 23, 2024

Podman tests are now passing in containers/podman#23348 (with containers/podman#23379 for Podman-side updates).

@mtrmac mtrmac marked this pull request as ready for review July 23, 2024 20:34
@mtrmac mtrmac changed the title WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache Record (TOC digest → DiffID) mapping in BlobInfoCache Jul 23, 2024
@mtrmac mtrmac force-pushed the chunked-bic branch 2 times, most recently from 2dcf236 to 1a14d38 Compare July 24, 2024 20:31
@giuseppe
Copy link
Member

@mtrmac is it ready to review?

@mtrmac
Copy link
Collaborator Author

mtrmac commented Jul 29, 2024

@giuseppe yes, please review. containers/podman#23348 shows Podman tests passing.

This should generally make the layer reuse on pulls comparable with non-chunked. See also the specific test case above.

mtrmac added a commit to mtrmac/libpod that referenced this pull request Jul 29, 2024
Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit (probably not even worth a repush), otherwise LGTM

great work!

// Externally, a layer is identified either by (compressed) digest, or by TOC digest
// (and we assume the TOC digest also uniquely identifies the contents, i.e. there aren’t two
// different formats/ways to parse a single TOC); internally, we use uncompressed digest (“DiffID”) or a TOC digest.
// We may or may not know the relantionships between these three values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in "relantionships"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed (and rebased).

mtrmac added 11 commits July 30, 2024 18:53
We are already calling m.LayerInfos() anyway, so there is ~no
extra cost. And using LayerInfos means we don't need to worry
about reversing the order of layers, and we will have access
to the layer index, allowing us to acccess the indexTo* fields
in the future.

Should not change behavior.

Signed-off-by: Miloslav Trmač <[email protected]>
- Don't claim that we only use compressed digests.
- Explicitly document that we assume TOC digests to be unambiguous
- Actually define the term "DiffID".
- Be more precise in computeID about the criteria being layer identity,
  not where we pull the layer from.

Should not change behavior.

Signed-off-by: Miloslav Trmač <[email protected]>
Some errors are severe enough that just logging and continuing is
not really worthwhile.

Signed-off-by: Miloslav Trmač <[email protected]>
…tyDataLocked

Currrently we "only" have indexToTOCDigest and blobDiffIDs, but we will make this
more complex.

Centralizing the consumption of these fields into trustedLayerIdentityDataLocked
ensure that all consumers interpret the data exactly consistently (and it also
allows us to use a single "trusted" variable instead of 2/3 individual ones).

Should not change behavior.

Signed-off-by: Miloslav Trmač <[email protected]>
The new code is not called, so it should not change behavior
(apart from extending the BoltDB/SQLite schema).

Signed-off-by: Miloslav Trmač <[email protected]>
…storage by DiffID

If we can, prefer identifying layers by DiffID, because multiple TOCs can map to the
same DiffID; and because it maximizes reuse with non-TOC layers.

For now, the new situation is unreachable.

Signed-off-by: Miloslav Trmač <[email protected]>
We will add one more instance of this, so share the code.

Should not change behavior (it does remove one unreachable code path).

Signed-off-by: Miloslav Trmač <[email protected]>
… is known

- Multiple TOC values might correspond to a single DiffID (e.g. if different
  compression levels are used); try to share them all, identified by DiffID
  (so that we also reuse with non-TOC pulls).
  - LayersByTOCDigest only uses a single TOC digest per layer; BlobInfoCache
    allows multiple matches, matches layers which have been since deleted,
    and potentially matches TOC digests which we have created by pushing
    but haven't pulled yet.
- On reuse, we can now use DiffID-based layer identities even if the reuse
  was TOC~driven.

Signed-off-by: Miloslav Trmač <[email protected]>
…ayers

- Rely on it instead of triggering the "untrusted DiffID" logic
- Also propagate it to storage

Signed-off-by: Miloslav Trmač <[email protected]>
@rhatdan
Copy link
Member

rhatdan commented Jul 30, 2024

LGTM

@rhatdan rhatdan merged commit 0b130b8 into containers:main Jul 30, 2024
10 checks passed
@mtrmac mtrmac deleted the chunked-bic branch July 30, 2024 18:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature A request for, or a PR adding, new functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

zstd:chunked and layer encryption don’t make sense together
3 participants