Record (TOC digest → DiffID) mapping in BlobInfoCache #2321

mtrmac · 2024-02-29T15:01:04Z

A single DiffID may map to multiple TOC digest values. Record that in BlobInfoCache, and use it for layer reuse.

Also prefer reusing even TOC-matched layers by DiffID, when available.
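
As a rough illustration of the many-to-one relationship described above, here is a minimal in-memory sketch in Go. The `tocCache` type and its method names are hypothetical, plain strings stand in for `digest.Digest`, and the digest values are placeholders. It is not the BlobInfoCache implementation from this PR, only a picture of the mapping it records.

```go
package main

import "fmt"

// tocCache is a hypothetical, in-memory stand-in for the part of a
// BlobInfoCache that remembers which DiffID a TOC digest corresponds to.
type tocCache struct {
	uncompressedByTOC map[string]string // TOC digest -> DiffID
}

func newTOCCache() *tocCache {
	return &tocCache{uncompressedByTOC: map[string]string{}}
}

// recordTOCUncompressedPair records that tocDigest describes a layer whose
// uncompressed digest (DiffID) is uncompressed.
func (c *tocCache) recordTOCUncompressedPair(tocDigest, uncompressed string) {
	c.uncompressedByTOC[tocDigest] = uncompressed
}

// uncompressedDigestForTOC returns the DiffID recorded for tocDigest,
// or "" if none is known.
func (c *tocCache) uncompressedDigestForTOC(tocDigest string) string {
	return c.uncompressedByTOC[tocDigest]
}

func main() {
	c := newTOCCache()
	// The same layer compressed at two levels may produce two TOC digests
	// that both map to one DiffID (placeholder values, not real digests).
	c.recordTOCUncompressedPair("sha256:toc-level1", "sha256:diffid-alpine")
	c.recordTOCUncompressedPair("sha256:toc-level9", "sha256:diffid-alpine")

	same := c.uncompressedDigestForTOC("sha256:toc-level1") ==
		c.uncompressedDigestForTOC("sha256:toc-level9")
	fmt.Println("both TOC digests resolve to one DiffID:", same)
	fmt.Println("unknown TOC digest is empty:", c.uncompressedDigestForTOC("sha256:toc-unknown") == "")
}
```

With such a mapping, a lookup by either TOC digest leads to the same DiffID, which is what allows a single stored layer to be reused.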

@giuseppe I’d appreciate a preliminary review of the new logic; see individual commits.

~~Draft: The BlobInfoCache implementations don’t actually store/record any data yet — so this is obviously completely untested.~~

giuseppe · 2024-02-29T15:30:20Z

internal/blobinfocache/types.go

+	// UncompressedDigest returns an uncompressed digest corresponding to anyDigest.
+	// Returns "" if the uncompressed digest is unknown.
+	// FIXME: Does this need to record TOC/compression type?
+	UncompressedDigestForTOC(tocDigest digest.Digest) digest.Digest


The TOC digest is the checksum of the uncompressed JSON document, so I think the compression should not matter in this case

I agree we probably don’t need that right now (with GetTOCDigest refusing to work on manifests which contain multiple TOC digest annotations, and presumably with the zstd / estargz code being unable to decompress the other one).

This comment is looking a bit more into the future, for lookups in the other direction, where we will want to look up (UncompressedDigest → (compressed digest, TOC digest, algorithm)) and match that against “the user wants the destination to contain zstd:chunked” (i.e. reject estargz matches).

for lookups in the other direction,

That will be done in a separate data structure (an extension of RecordDigestCompressorName): we need the full set of annotations for reuse of a TOC-compressed blob, so this simple mapping is not sufficient anyway. And the other structure does record the algorithm.
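
To make the "other direction" concrete, the following Go sketch shows what filtering reuse candidates by compression algorithm could look like. Everything in it (`candidate`, `pickCandidate`, the field names, the digest values) is hypothetical. It is not the data structure used by this PR or by RecordDigestCompressorName, only an illustration of rejecting estargz matches when zstd:chunked is wanted.

```go
package main

import "fmt"

// candidate is a hypothetical record of one known compressed form of a layer.
type candidate struct {
	compressedDigest string
	tocDigest        string // "" if the blob has no TOC
	algorithm        string // e.g. "zstd:chunked" or "estargz"
}

// pickCandidate returns the first known compressed form of the layer with the
// given uncompressed digest that uses wantAlgorithm, and whether one was found.
func pickCandidate(known map[string][]candidate, uncompressed, wantAlgorithm string) (candidate, bool) {
	for _, c := range known[uncompressed] {
		if c.algorithm == wantAlgorithm {
			return c, true
		}
	}
	return candidate{}, false
}

func main() {
	// Placeholder values; a real cache would hold actual digests.
	known := map[string][]candidate{
		"sha256:diffid": {
			{compressedDigest: "sha256:blob-a", tocDigest: "sha256:toc-a", algorithm: "estargz"},
			{compressedDigest: "sha256:blob-b", tocDigest: "sha256:toc-b", algorithm: "zstd:chunked"},
		},
	}
	c, ok := pickCandidate(known, "sha256:diffid", "zstd:chunked")
	fmt.Println(ok, c.compressedDigest, c.tocDigest)
}
```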

mtrmac

Note to self: This is code-complete but I want to test it in practice.

storage/storage_dest.go

mtrmac · 2024-04-25T21:30:00Z

To test:

Before:

# podman rmi alpine level1 level9
# rm -f /var/lib/containers/cache/blob-info-cache-v1.sqlite 
# podman pull quay.io/libpod/alpine
# podman --log-level=debug push --compression-format zstd:chunked --compression-level 1 --force-compression quay.io/libpod/alpine localhost:50000/level1
## Even better would be to use two different destination registries, to be 100% certain the blobs are not reused
## (right now they are not reused, but we’ll fix that):
# podman --log-level=debug push --compression-format zstd:chunked --compression-level 9 --force-compression quay.io/libpod/alpine localhost:50000/level9
## Note the compressed digest, and TOC digest, values:
# skopeo inspect --raw docker://localhost:50000/level1 | jq .
# skopeo inspect --raw docker://localhost:50000/level9 | jq .
## No DigestTOCUncompressedPairs entries:
# sqlite3 /var/lib/containers/cache/blob-info-cache-v1.sqlite .dump 
# podman rmi alpine level1 level9
## Triggers a partial pull: "Applying differ in …":
# podman --log-level=debug pull localhost:50000/level1
## Triggers a partial pull: "Applying differ in …"
# podman --log-level=debug pull localhost:50000/level9 
## level1 and level9 have different image IDs:
# podman images 
## Contains two copies of the layer, with the same expected-layer-diffid
# jq . < /var/lib/containers/storage/overlay-layers/layers.json

After:

DigestTOCUncompressedPairs contains 2 records
Pull of level1 triggers a partial pull (creating a layer with known TOC digest and uncompressed digest)
Pull of level9 reuses the layer (by BIC compressed -> uncompressed mapping)
Both level1 and level9 images use the same image ID (which matches the image ID used by the original non-chunked alpine image)

mtrmac · 2024-07-23T20:34:33Z

Podman tests are now passing in containers/podman#23348 (with containers/podman#23379 for Podman-side updates).

giuseppe · 2024-07-29T08:00:04Z

@mtrmac is it ready to review?

mtrmac · 2024-07-29T14:44:52Z

@giuseppe yes, please review. containers/podman#23348 shows Podman tests passing.

This should generally make layer reuse on chunked pulls comparable with that of non-chunked pulls. See also the specific test case above.

Signed-off-by: Miloslav Trmač <[email protected]>

giuseppe

just a nit (probably not even worth a repush), otherwise LGTM

great work!

giuseppe · 2024-07-30T08:15:00Z

storage/storage_dest.go

+	// Externally, a layer is identified either by (compressed) digest, or by TOC digest
+	// (and we assume the TOC digest also uniquely identifies the contents, i.e. there aren’t two
+	// different formats/ways to parse a single TOC); internally, we use uncompressed digest (“DiffID”) or a TOC digest.
+	// We may or may not know the relantionships between these three values.


typo in "relantionships"

Thanks! Fixed (and rebased).
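
The identification rule in the quoted comment, combined with the preference for DiffID stated in this PR, could be sketched in Go as follows. `layerIdentity` and `chooseIdentity` are hypothetical names chosen for this illustration; the PR's own helper is trustedLayerIdentityDataLocked, whose signature and logic are not reproduced here.

```go
package main

import "fmt"

// layerIdentity is a hypothetical summary of what is known about one layer.
type layerIdentity struct {
	diffID    string // uncompressed digest; "" if unknown or not trusted
	tocDigest string // TOC digest; "" if the layer has none
}

// chooseIdentity picks the value used to identify a layer internally.
// It prefers the DiffID, because several TOC digests can map to one DiffID
// and because a DiffID also matches layers pulled without a TOC.
// ok is false if neither value is known.
func chooseIdentity(l layerIdentity) (kind, value string, ok bool) {
	switch {
	case l.diffID != "":
		return "diffid", l.diffID, true
	case l.tocDigest != "":
		return "toc", l.tocDigest, true
	default:
		return "", "", false
	}
}

func main() {
	// Placeholder values, not real digests.
	layers := []layerIdentity{
		{diffID: "sha256:diffid", tocDigest: "sha256:toc"},
		{tocDigest: "sha256:toc"},
		{},
	}
	for _, l := range layers {
		kind, value, ok := chooseIdentity(l)
		fmt.Println(kind, value, ok)
	}
}
```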

We are already calling m.LayerInfos() anyway, so there is ~no extra cost. And using LayerInfos means we don't need to worry about reversing the order of layers, and we will have access to the layer index, allowing us to access the indexTo* fields in the future. Should not change behavior. Signed-off-by: Miloslav Trmač <[email protected]>

- Don't claim that we only use compressed digests.
- Explicitly document that we assume TOC digests to be unambiguous.
- Actually define the term "DiffID".
- Be more precise in computeID about the criteria being layer identity, not where we pull the layer from.
Should not change behavior.
Signed-off-by: Miloslav Trmač <[email protected]>

Some errors are severe enough that just logging and continuing is not really worthwhile. Signed-off-by: Miloslav Trmač <[email protected]>

…tyDataLocked Currently we "only" have indexToTOCDigest and blobDiffIDs, but we will make this more complex. Centralizing the consumption of these fields into trustedLayerIdentityDataLocked ensures that all consumers interpret the data exactly consistently (and it also allows us to use a single "trusted" variable instead of 2/3 individual ones). Should not change behavior. Signed-off-by: Miloslav Trmač <[email protected]>

The new code is not called, so it should not change behavior (apart from extending the BoltDB/SQLite schema). Signed-off-by: Miloslav Trmač <[email protected]>

…storage by DiffID If we can, prefer identifying layers by DiffID, because multiple TOCs can map to the same DiffID; and because it maximizes reuse with non-TOC layers. For now, the new situation is unreachable. Signed-off-by: Miloslav Trmač <[email protected]>

We will add one more instance of this, so share the code. Should not change behavior (it does remove one unreachable code path). Signed-off-by: Miloslav Trmač <[email protected]>

… is known
- Multiple TOC values might correspond to a single DiffID (e.g. if different compression levels are used); try to share them all, identified by DiffID (so that we also reuse with non-TOC pulls).
- LayersByTOCDigest only uses a single TOC digest per layer; BlobInfoCache allows multiple matches, matches layers which have since been deleted, and potentially matches TOC digests which we have created by pushing but haven't pulled yet.
- On reuse, we can now use DiffID-based layer identities even if the reuse was TOC-driven.
Signed-off-by: Miloslav Trmač <[email protected]>

…yers Signed-off-by: Miloslav Trmač <[email protected]>

…ayers
- Rely on it instead of triggering the "untrusted DiffID" logic
- Also propagate it to storage
Signed-off-by: Miloslav Trmač <[email protected]>

Signed-off-by: Miloslav Trmač <[email protected]>

rhatdan · 2024-07-30T18:25:48Z

LGTM

giuseppe reviewed Feb 29, 2024

mtrmac force-pushed the chunked-bic branch from 3560394 to 5f98f2b Compare March 4, 2024 14:29

mtrmac force-pushed the chunked-bic branch 3 times, most recently from be098a2 to b14f00b Compare March 14, 2024 22:14

mtrmac force-pushed the chunked-bic branch from b14f00b to 506bacc Compare March 25, 2024 17:31

mtrmac added the kind/feature label Apr 5, 2024

mtrmac force-pushed the chunked-bic branch 4 times, most recently from d238714 to 6dae67d Compare April 11, 2024 22:35

mtrmac mentioned this pull request Apr 13, 2024

zstd:chunked blocker: TarSplitChecksumKey not used in a layer ID containers/storage#1888

Closed

mtrmac force-pushed the chunked-bic branch from 6dae67d to e0e53b6 Compare April 13, 2024 15:51

mtrmac commented Apr 13, 2024

mtrmac force-pushed the chunked-bic branch 2 times, most recently from 2a542f7 to 9e3cace Compare April 24, 2024 18:26

mtrmac force-pushed the chunked-bic branch from 9e3cace to a9266ff Compare April 30, 2024 20:30

mtrmac force-pushed the chunked-bic branch from a9266ff to 800614d Compare May 30, 2024 19:51

mtrmac mentioned this pull request Jul 8, 2024

chunked: store compressed digest if validated containers/storage#2001

Merged

mtrmac force-pushed the chunked-bic branch from 800614d to 2d45cdc Compare July 9, 2024 18:36

mtrmac mentioned this pull request Jul 10, 2024

Allow matching of compressed blobs converted on the fly to zstd:chunked #2478

Merged

mtrmac force-pushed the chunked-bic branch from 2d45cdc to bac8947 Compare July 11, 2024 18:15

mtrmac marked this pull request as ready for review July 18, 2024 21:38

mtrmac changed the title ~~WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache~~ Record (TOC digest → DiffID) mapping in BlobInfoCache Jul 18, 2024

mtrmac marked this pull request as draft July 18, 2024 23:24

mtrmac changed the title ~~Record (TOC digest → DiffID) mapping in BlobInfoCache~~ WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache Jul 18, 2024

mtrmac force-pushed the chunked-bic branch 2 times, most recently from 72db3f5 to 714108a Compare July 19, 2024 20:57

mtrmac marked this pull request as ready for review July 23, 2024 20:34

mtrmac changed the title ~~WIP: Record (TOC digest → DiffID) mapping in BlobInfoCache~~ Record (TOC digest → DiffID) mapping in BlobInfoCache Jul 23, 2024

mtrmac force-pushed the chunked-bic branch 2 times, most recently from 2dcf236 to 1a14d38 Compare July 24, 2024 20:31

mtrmac force-pushed the chunked-bic branch from 1a14d38 to 2b3d34d Compare July 27, 2024 15:25

mtrmac added a commit to mtrmac/libpod that referenced this pull request Jul 29, 2024

DO NOT MERGE: Vendor UNMERGED containers/image#2321

855cc8f

Signed-off-by: Miloslav Trmač <[email protected]>

giuseppe approved these changes Jul 30, 2024

mtrmac added 11 commits July 30, 2024 18:53

Allow returning (and reporting) unexpected errors from computeID

e7a01b8

Some errors are severe enough that just logging and continuing is not really worthwhile. Signed-off-by: Miloslav Trmač <[email protected]>

Add TOC digest <-> uncompressed digest mapping to BIC

757d726

The new code is not called, so it should not change behavior (apart from extending the BoltDB/SQLite schema). Signed-off-by: Miloslav Trmač <[email protected]>

Split reusedBlobFromLayerLookup from tryReusingBlobAsPending

924d853

We will add one more instance of this, so share the code. Should not change behavior (it does remove one unreachable code path). Signed-off-by: Miloslav Trmač <[email protected]>

Record the (TOC digest, uncompressed digest) data when we compress la…

acdd064

…yers Signed-off-by: Miloslav Trmač <[email protected]>

Use the uncompressed digest we got from a BlobInfoCache for chunked l…

403d0a2

…ayers
- Rely on it instead of triggering the "untrusted DiffID" logic
- Also propagate it to storage
Signed-off-by: Miloslav Trmač <[email protected]>

HACK: Don't compress with zstd:chunked when encrypting

f49cb62

Signed-off-by: Miloslav Trmač <[email protected]>

mtrmac force-pushed the chunked-bic branch from 2b3d34d to f49cb62 Compare July 30, 2024 16:59

rhatdan merged commit 0b130b8 into containers:main Jul 30, 2024
10 checks passed

mtrmac deleted the chunked-bic branch July 30, 2024 18:29

mtrmac mentioned this pull request Jul 30, 2024

DO NOT MERGE: Testing https://github.com/containers/image/pull/2321 containers/podman#23348

Closed

This was referenced Aug 7, 2024

release-5.32 Zstd backports #2507

Closed

[release-5.32] Zstd backports #2508

Merged
