This doc describes the extension to gzip layers of container images (application/vnd.oci.image.layer.v1.tar+gzip of OCI Image Specification and application/vnd.docker.image.rootfs.diff.tar.gzip of Docker Image Specification) for lazy pulling.
The extension is called eStargz.
eStargz is a backward-compatible extension, which means that images can be pushed to extension-agnostic registries and can run on extension-agnostic runtimes.
This extension is based on stargz (short for seekable tar.gz) proposed by the Google CRFS project (initially discussed in the Go community). eStargz extends stargz for chunk-level verification and runtime performance optimization.
Notational convention follows OCI Image Specification.
Lazy pulling is a technique for pulling container images that aims at a faster cold start. It allows a container to start up without waiting for the entire image layer contents to be locally available. Instead, necessary files (or chunks for large files) in the layer are fetched on demand while the container is running.
To achieve this, runtimes need to fetch and extract each file in a layer independently. However, a layer without the eStargz extension doesn't allow this for the following reasons:
- The entire layer blob needs to be extracted even for getting a single file entry.
- Digests aren't provided for each file so it cannot be verified independently.
eStargz solves these issues and enables lazy pulling.
Additionally, it supports prefetching of files. This can be used to mitigate runtime performance drawbacks caused by the on-demand fetching of each file.
This extension is backward-compatible, so an eStargz-formatted image can be pushed to the registry and can run even on eStargz-agnostic runtimes.
eStargz is a gzip-compressed tar archive of files and a metadata component called TOC (described in the later section). In an eStargz-formatted blob, each non-empty regular file and each metadata component MUST be separately compressed as gzip. This structure is inherited from stargz.
Therefore, gzip headers MUST be located at the following locations.
- The top of the blob
- The top of the payload of each non-empty regular file tar entry except TOC
- The top of TOC tar header
- The top of footer (described in the later section)
Large regular files in an eStargz blob MAY be chunked into several smaller gzip members. Each chunked member is called chunk in this doc.
Therefore, gzip headers MAY be located at the following locations.
- Arbitrary location within the payload of non-empty regular file entry
An eStargz-formatted blob is the concatenation of these gzip members, which is still a valid gzip blob.
eStargz contains a regular file called TOC which records metadata (e.g. name, file type, owners, offset etc) of all file entries in eStargz, except TOC itself. Container runtimes MAY use TOC to mount the container's filesystem without downloading the entire layer contents.
TOC MUST be a JSON file contained as the last tar entry and MUST be named stargz.index.json.
The following fields contain the primary properties that constitute a TOC.
- version (int): This REQUIRED property contains the version of the TOC. This value MUST be 1.
- entries (array of objects): This property MUST contain an array of TOCEntry of all tar entries and chunks in the blob, except stargz.index.json.
TOCEntry consists of metadata of a file or chunk in eStargz. If metadata in a TOCEntry of a file differs from the corresponding tar entry, TOCEntry SHOULD be respected.
The following fields contain the primary properties that constitute a TOCEntry.
Properties other than chunkDigest are inherited from stargz.
- name (string): This REQUIRED property contains the name of the tar entry. This MUST be the complete path stored in the tar file.
- type (string): This REQUIRED property contains the type of tar entry. This MUST be one of the following.
  - dir: directory
  - reg: regular file
  - symlink: symbolic link
  - hardlink: hard link
  - char: character device
  - block: block device
  - fifo: fifo
  - chunk: a chunk of regular file data. As described in the above section, a regular file can be divided into several chunks. TOCEntry MUST be created for each chunk. TOCEntry of the first chunk of that file MUST be typed as reg. TOCEntry of each chunk from the second onward MUST be typed as chunk. chunk TOCEntry MUST set offset, chunkOffset and chunkSize properties.
- size (uint64): This OPTIONAL property contains the uncompressed size of the regular file. Non-empty reg file MUST set this property.
- modtime (string): This OPTIONAL property contains the modification time of the tar entry. Empty means zero or unknown. Otherwise, the value is in UTC RFC3339 format.
- linkName (string): This OPTIONAL property contains the link target. symlink and hardlink MUST set this property.
- mode (int64): This REQUIRED property contains the permission and mode bits.
- uid (uint): This REQUIRED property contains the user ID of the owner of this file.
- gid (uint): This REQUIRED property contains the group ID of the owner of this file.
- userName (string): This OPTIONAL property contains the username of the owner.
- groupName (string): This OPTIONAL property contains the groupname of the owner.
- devMajor (int): This OPTIONAL property contains the major device number of device files. char and block files MUST set this property.
- devMinor (int): This OPTIONAL property contains the minor device number of device files. char and block files MUST set this property.
- xattrs (string-bytes map): This OPTIONAL property contains the extended attribute for the tar entry.
- digest (string): This OPTIONAL property contains the digest of the regular file contents.
- offset (int64): This OPTIONAL property contains the offset of the gzip header of the regular file or chunk in the blob. TOCEntries of non-empty reg and chunk MUST set this property.
- chunkOffset (int64): This OPTIONAL property contains the offset of this chunk in the decompressed regular file payload. TOCEntries of chunk type MUST set this property.
- chunkSize (int64): This OPTIONAL property contains the decompressed size of this chunk. The last chunk in a reg file or reg file that isn't chunked MUST set this property to zero. Other reg and chunk MUST set this property.
- chunkDigest (string): This OPTIONAL property contains a digest of this chunk. TOCEntries of non-empty reg and chunk MUST set this property. This MAY be used for verifying the data of the chunk.
- innerOffset (int64): This OPTIONAL property indicates the uncompressed offset of the "reg" or "chunk" entry payload in a stream that starts from the offset field. innerOffset makes it possible to put multiple "reg" or "chunk" payloads in one gzip stream that starts from offset.
This field allows the following structure.
One use case of this field is the --estargz-min-chunk-size flag of ctr-remote.
The value of this flag is the minimum number of bytes of data that must be written in one gzip stream.
If it is greater than 0, multiple files and chunks can be written into one gzip stream.
As a result, fewer gzip headers and a smaller resulting blob can be expected.
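The TOC and TOCEntry properties listed above map naturally onto JSON-tagged structs. The following Go sketch decodes a subset of them; the JSON keys come from this document, while the Go type names, field names and the parseTOC helper are illustrative assumptions rather than an existing API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TOC mirrors the top-level properties of stargz.index.json.
type TOC struct {
	Version int        `json:"version"`
	Entries []TOCEntry `json:"entries"`
}

// TOCEntry mirrors a subset of the TOCEntry properties described above.
type TOCEntry struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	Size        uint64 `json:"size,omitempty"`
	Offset      int64  `json:"offset,omitempty"`
	ChunkOffset int64  `json:"chunkOffset,omitempty"`
	ChunkSize   int64  `json:"chunkSize,omitempty"`
	ChunkDigest string `json:"chunkDigest,omitempty"`
	InnerOffset int64  `json:"innerOffset,omitempty"`
}

// parseTOC decodes TOC JSON and rejects versions other than 1,
// which is the only value this document allows.
func parseTOC(data []byte) (*TOC, error) {
	var toc TOC
	if err := json.Unmarshal(data, &toc); err != nil {
		return nil, err
	}
	if toc.Version != 1 {
		return nil, fmt.Errorf("unsupported TOC version %d", toc.Version)
	}
	return &toc, nil
}

func main() {
	// Values taken from the example TOC JSON in this document.
	sample := []byte(`{
  "version": 1,
  "entries": [
    {"name": "bin/", "type": "dir"},
    {"name": "bin/busybox", "type": "reg", "size": 833104, "offset": 126}
  ]
}`)
	toc, err := parseTOC(sample)
	if err != nil {
		panic(err)
	}
	for _, e := range toc.Entries {
		fmt.Printf("%s type=%s size=%d offset=%d\n", e.Name, e.Type, e.Size, e.Offset)
	}
}
```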
At the end of the blob, a footer MUST be appended. This MUST be an empty gzip member whose Extra field contains the offset of TOC in the blob. The footer MUST be the following 51 bytes (1 byte = 8 bits in gzip).
- 10 bytes gzip header
- 2 bytes XLEN (length of Extra field) = 26 (4 bytes header + 16 hex digits + len("STARGZ"))
- 2 bytes Extra: SI1 = 'S', SI2 = 'G'
- 2 bytes Extra: LEN = 22 (16 hex digits + len("STARGZ"))
- 22 bytes Extra: subfield = fmt.Sprintf("%016xSTARGZ", offsetOfTOC)
- 5 bytes flate header: BFINAL = 1(last block), BTYPE = 0(non-compressed block), LEN = 0
- 8 bytes gzip footer
(End of eStargz)
Runtimes MAY first read and parse the footer to get the offset of TOC.
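As a minimal sketch of the footer layout above, the following Go code builds the 51-byte footer with the standard compress/gzip package and parses the TOC offset back out of the Extra field. The function names buildFooter and parseFooter are hypothetical, and the sketch assumes that an empty gzip stream written with gzip.NoCompression produces the 5-byte stored block listed above.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"strconv"
)

const footerSize = 51

// buildFooter returns an empty gzip member whose Extra field carries tocOffset.
func buildFooter(tocOffset int64) ([]byte, error) {
	var buf bytes.Buffer
	// NoCompression keeps the empty deflate stream to a single stored block.
	zw, err := gzip.NewWriterLevel(&buf, gzip.NoCompression)
	if err != nil {
		return nil, err
	}
	subfield := fmt.Sprintf("%016xSTARGZ", tocOffset)
	extra := []byte{'S', 'G', 0, 0} // SI1, SI2, then LEN
	binary.LittleEndian.PutUint16(extra[2:4], uint16(len(subfield)))
	zw.Header.Extra = append(extra, subfield...)
	if err := zw.Close(); err != nil {
		return nil, err
	}
	if buf.Len() != footerSize {
		return nil, fmt.Errorf("footer is %d bytes, want %d", buf.Len(), footerSize)
	}
	return buf.Bytes(), nil
}

// parseFooter extracts the TOC offset from a footer produced by buildFooter.
func parseFooter(p []byte) (int64, error) {
	zr, err := gzip.NewReader(bytes.NewReader(p))
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	extra := zr.Header.Extra
	if len(extra) != 4+16+len("STARGZ") {
		return 0, fmt.Errorf("unexpected Extra length %d", len(extra))
	}
	if extra[0] != 'S' || extra[1] != 'G' {
		return 0, fmt.Errorf("unexpected subfield ID %q", extra[:2])
	}
	if n := binary.LittleEndian.Uint16(extra[2:4]); int(n) != 16+len("STARGZ") {
		return 0, fmt.Errorf("unexpected subfield length %d", n)
	}
	subfield := extra[4:]
	if string(subfield[16:]) != "STARGZ" {
		return 0, fmt.Errorf("missing STARGZ marker")
	}
	return strconv.ParseInt(string(subfield[:16]), 16, 64)
}

func main() {
	footer, err := buildFooter(886633)
	if err != nil {
		panic(err)
	}
	off, err := parseFooter(footer)
	if err != nil {
		panic(err)
	}
	fmt.Printf("footer=%d bytes, TOC offset=%d\n", len(footer), off)
}
```

Because the footer has a fixed size, a runtime can fetch just the last 51 bytes of the blob to locate the TOC.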
Each file's metadata is recorded in the TOC, so runtimes don't need to extract other parts of the archive as long as they only use file metadata. If a runtime needs to get a regular file's content, it can get the size and offset of that content from the TOC and extract that range without scanning the entire blob. By combining this with the HTTP Range Request supported by OCI Distribution Spec, runtimes can selectively download file entries from the registry.
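A selective download of this kind boils down to a GET request with a Range header. The Go sketch below builds such a request; newRangeRequest and the registry URL are hypothetical, and the compressed length is assumed here to be the distance from the entry's offset to the next recorded offset in the TOC, which is an upper bound rather than an exact size.

```go
package main

import (
	"fmt"
	"net/http"
)

// newRangeRequest builds a GET request for the byte range
// [offset, offset+length) of the blob at blobURL.
func newRangeRequest(blobURL string, offset, length int64) (*http.Request, error) {
	if offset < 0 || length <= 0 {
		return nil, fmt.Errorf("invalid range: offset=%d length=%d", offset, length)
	}
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return nil, err
	}
	// HTTP byte ranges are inclusive on both ends.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+length-1))
	return req, nil
}

func main() {
	// Placeholder URL; offsets are taken from the example TOC in this document
	// (bin/busybox at 126, the next recorded offset at 512427).
	url := "https://registry.example.com/v2/app/blobs/sha256:0000"
	req, err := newRangeRequest(url, 126, 512427-126)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range"))
}
```

A registry that honors the request answers with 206 Partial Content and only the requested bytes.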
eStargz is designed for compatibility with gzip layers. To achieve this, eStargz's footer structure is incompatible with that of stargz: eStargz adds SI1, SI2 and LEN fields to the footer to make it compliant with the Extra field definition in RFC 1952. TOC, TOCEntry and the position of gzip headers are still compatible with stargz.
Lazy pulling can cause runtime performance overhead by on-demand fetching of each file. eStargz mitigates this by supporting prefetching of important files called prioritized files.
eStargz encodes the information about prioritized files in the order of file entries, together with some landmark file entries.
File entries in eStargz are grouped into the following groups:
- A. prioritized files
- B. non prioritized files
If no files belong to A, a landmark file called the no-prefetch landmark MUST be contained in the archive.
If one or more files belong to A, eStargz MUST consist of two separate areas corresponding to these groups, and a landmark file called the prefetch landmark MUST be contained at the boundary between these two areas.
The landmark file MUST be a regular file entry in eStargz whose content is 0xf (the example TOC in this document records this file with a size of 1 byte).
It MUST be recorded in the TOC as a TOCEntry. The prefetch landmark MUST be named .prefetch.landmark. The no-prefetch landmark MUST be named .no.prefetch.landmark.
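The landmark rules above let a reader find the boundary of the prioritized area by scanning TOC entries in order. The Go sketch below shows one way to do that; the entry type and prefetchEnd are hypothetical, and treating the landmark's own offset as the end of the prioritized area is a simplifying assumption, not a statement about how any particular runtime sizes its prefetch.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	prefetchLandmark   = ".prefetch.landmark"
	noPrefetchLandmark = ".no.prefetch.landmark"
)

// entry holds the two TOCEntry properties this sketch needs.
type entry struct {
	Name   string
	Offset int64
}

// prefetchEnd scans entries in TOC order and returns the blob offset at which
// the prioritized area ends. ok is false when the no-prefetch landmark is
// found, meaning there is nothing to prefetch.
func prefetchEnd(entries []entry) (end int64, ok bool, err error) {
	for _, e := range entries {
		switch e.Name {
		case prefetchLandmark:
			return e.Offset, true, nil
		case noPrefetchLandmark:
			return 0, false, nil
		}
	}
	return 0, false, errors.New("no landmark entry found")
}

func main() {
	// Offsets taken from the example TOC in this document.
	entries := []entry{
		{Name: "bin/busybox", Offset: 126},
		{Name: "lib/ld-musl-x86_64.so.1", Offset: 512427},
		{Name: prefetchLandmark, Offset: 886633},
	}
	end, ok, err := prefetchEnd(entries)
	if err != nil {
		panic(err)
	}
	fmt.Printf("prefetch=%v end=%d\n", ok, end)
}
```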
Stargz Snapshotter makes use of eStargz's prioritized files for workload-based optimization to mitigate the overhead of reading files. The workload of the image is the runtime configuration defined in the Dockerfile, including entrypoint command, environment variables and user.
Stargz snapshotter provides an image converter command ctr-remote images optimize to create optimized eStargz images.
When converting the image, this command runs the specified workload in a sandboxed environment and profiles all file accesses.
This command treats all accessed files as prioritized files.
Then it constructs eStargz by
- putting prioritized files from the top of the archive, sorting them by the accessed order,
- putting prefetch landmark file entry at the end of this range, and
- putting all other files (non-prioritized files) after the prefetch landmark.
Before running the container, stargz snapshotter prefetches and pre-caches the range where prioritized files are contained, by a single HTTP Range Request supported by the registry. This can increase the cache hit rate for the specified workload and can mitigate runtime overheads.
The goal of the content verification in eStargz is to ensure the downloaded metadata and contents of all files are the expected ones, based on the calculated digests. The verification of other components in the image including image manifests is out-of-scope of eStargz. On the verification step of an eStargz layer, we assume that the manifest that references this eStargz layer is already verified (using digest tag, etc).
A non-eStargz layer can be verified by recalculating the digest and comparing it with the one written in the layer descriptor referencing that layer in the verified manifest. However, an eStargz layer is lazily pulled from the registry at file granularity (or chunk granularity if the file is large), so each piece needs to be independently verified every time it is fetched.
The following describes how the verification of eStargz is done using the verified manifest.
eStargz consists of the following components to be verified:
- TOC (a set of metadata of all files contained in the layer)
- chunks of contents of each regular file
TOC contains metadata (name, type, mode, etc.) of all files and chunks in the blob.
On mounting eStargz, the filesystem fetches the TOC from the registry.
For making the TOC verifiable using the verified manifest, we define an annotation containerd.io/snapshot/stargz/toc.digest.
The value of this annotation is the digest of the TOC and this MUST be contained in the descriptor that references this eStargz layer.
Using this annotation, the filesystem can verify the TOC by recalculating the digest and comparing it to the annotation value.
Each file's metadata is encoded to a TOCEntry in the TOC.
A TOCEntry is also created for each chunk of regular files.
For making the contents of each file and chunk verifiable using the verified manifest, TOCEntry has a property chunkDigest.
chunkDigest contains the digest of the content of the reg or chunk entry.
As mentioned above, the TOC is verifiable using the special annotation.
Using chunkDigest fields written in the verified TOC, each file and chunk can be independently verified by recalculating the digest and comparing it to the property.
In conclusion, eStargz MUST contain the following metadata:
- containerd.io/snapshot/stargz/toc.digest annotation in the descriptor that references the eStargz layer: The value is the digest of the TOC.
- chunkDigest properties of non-empty reg or chunk TOCEntry: The value is the digest of the contents of the file or chunk.
Stargz Snapshotter verifies eStargz layers leveraging the above metadata. As mentioned above, the verification of other image components including the manifests is out-of-scope of the snapshotter. When this snapshotter mounts an eStargz layer, the manifest that references this layer must be verified in advance and the TOC digest annotation written in the verified manifest must be passed down to this snapshotter.
On mounting a layer, stargz snapshotter fetches the TOC from the registry. Then it verifies the TOC by recalculating the digest and comparing it with the one written in the manifest. After the TOC is verified, the snapshotter mounts this layer using the metadata recorded in the TOC.
While the container is running, this snapshotter fetches chunks of regular file contents lazily. Before providing a chunk to the filesystem user, the snapshotter recalculates the digest and checks that it matches the one recorded in the corresponding TOCEntry.
This OPTIONAL feature allows separating the TOC into another image called the TOC image.
This type of eStargz is the same as the normal eStargz but doesn't contain the TOC JSON file (stargz.index.json) in the layer blob and has a special footer.
This feature enables creating a smaller eStargz blob by not including the TOC JSON file in that blob.
Footer has the following structure:
// The footer is an empty gzip stream with no compression and an Extra header.
//
// 46 comes from:
//
// 10 bytes gzip header
// 2 bytes XLEN (length of Extra field) = 21 (4 bytes header + len("STARGZEXTERNALTOC"))
// 2 bytes Extra: SI1 = 'S', SI2 = 'G'
// 2 bytes Extra: LEN = 17 (len("STARGZEXTERNALTOC"))
// 17 bytes Extra: subfield = "STARGZEXTERNALTOC"
// 5 bytes flate header
// 8 bytes gzip footer
// (End of the eStargz blob)
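Since both footer variants share the 'S', 'G' subfield header, a reader can tell them apart from the subfield contents alone. The Go sketch below classifies a footer's Extra field; tocLocation and its "embedded"/"external" return values are hypothetical names chosen for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"strings"
)

// tocLocation inspects the gzip Extra field of a footer and reports whether
// the TOC is embedded in the blob or lives in an external TOC image.
func tocLocation(extra []byte) (string, error) {
	if len(extra) < 4 || extra[0] != 'S' || extra[1] != 'G' {
		return "", errors.New("not an eStargz footer")
	}
	n := int(binary.LittleEndian.Uint16(extra[2:4])) // LEN of the subfield
	if len(extra)-4 != n {
		return "", fmt.Errorf("subfield length is %d, header says %d", len(extra)-4, n)
	}
	subfield := string(extra[4:])
	switch {
	case subfield == "STARGZEXTERNALTOC":
		return "external", nil
	case len(subfield) == 16+len("STARGZ") && strings.HasSuffix(subfield, "STARGZ"):
		return "embedded", nil
	}
	return "", fmt.Errorf("unknown subfield %q", subfield)
}

func main() {
	external := append([]byte{'S', 'G', 17, 0}, "STARGZEXTERNALTOC"...)
	embedded := append([]byte{'S', 'G', 22, 0}, "00000000000d8769STARGZ"...)
	for _, extra := range [][]byte{external, embedded} {
		loc, err := tocLocation(extra)
		fmt.Println(loc, err)
	}
}
```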
The TOC image is an OCI image containing the TOC.
Each layer contains a TOC JSON file (stargz.index.json) in the root directory.
Layer descriptors in the manifest must contain an annotation containerd.io/snapshot/stargz/layer.digest.
The value of this annotation is the digest of the eStargz layer blob corresponding to that TOC.
The following is an example layer descriptor in the TOC image.
This layer (sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70) contains the TOC JSON file (stargz.index.json) in the root directory and can be used for the eStargz layer blob that has the digest sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f.
{
  "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
  "digest": "sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70",
  "size": 154425,
  "annotations": {
    "containerd.io/snapshot/stargz/layer.digest": "sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f"
  }
}
Stargz Snapshotter supports eStargz with external TOC. If an eStargz blob's footer indicates that it requires the TOC image, stargz snapshotter also pulls it from the registry.
Stargz snapshotter assumes the TOC image has the same reference name as the eStargz image with the -esgztoc suffix.
For example, if an eStargz image is named ghcr.io/stargz-containers/ubuntu:22.04-esgz, stargz snapshotter acquires the TOC image from ghcr.io/stargz-containers/ubuntu:22.04-esgz-esgztoc.
Note that future versions of stargz snapshotter will support more ways to search the TOC image (e.g. allowing custom suffix, using OCI Reference Type, etc.)
Once stargz snapshotter acquires the TOC image, it tries to find the TOC corresponding to the eStargz blob being mounted by looking at containerd.io/snapshot/stargz/layer.digest annotations.
As described above, the acquired TOC JSON is validated using the containerd.io/snapshot/stargz/toc.digest annotation.
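The lookup described above can be sketched as two small steps: derive the TOC image reference, then pick the layer whose annotation names the eStargz blob being mounted. In the Go sketch below, descriptor, tocImageRef and findTOCLayer are hypothetical names, and the reference is assumed to be a plain tagged reference.

```go
package main

import "fmt"

const layerDigestAnnotation = "containerd.io/snapshot/stargz/layer.digest"

// descriptor holds the layer descriptor fields this sketch needs.
type descriptor struct {
	MediaType   string
	Digest      string
	Size        int64
	Annotations map[string]string
}

// tocImageRef derives the TOC image reference from a tagged eStargz image
// reference by appending the -esgztoc suffix.
func tocImageRef(ref string) string {
	return ref + "-esgztoc"
}

// findTOCLayer returns the TOC image layer whose layer.digest annotation
// equals the digest of the eStargz blob being mounted.
func findTOCLayer(layers []descriptor, layerDigest string) (descriptor, bool) {
	for _, l := range layers {
		if l.Annotations[layerDigestAnnotation] == layerDigest {
			return l, true
		}
	}
	return descriptor{}, false
}

func main() {
	fmt.Println(tocImageRef("ghcr.io/stargz-containers/ubuntu:22.04-esgz"))

	// Values taken from the example layer descriptor in this document.
	esgz := "sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f"
	layers := []descriptor{{
		MediaType:   "application/vnd.oci.image.layer.v1.tar+gzip",
		Digest:      "sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70",
		Size:        154425,
		Annotations: map[string]string{layerDigestAnnotation: esgz},
	}}
	if l, ok := findTOCLayer(layers, esgz); ok {
		fmt.Println("TOC layer:", l.Digest)
	}
}
```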
Here is an example TOC JSON:
{
  "version": 1,
  "entries": [
    {
      "name": "bin/",
      "type": "dir",
      "modtime": "2019-08-20T10:30:43Z",
      "mode": 16877,
      "NumLink": 0
    },
    {
      "name": "bin/busybox",
      "type": "reg",
      "size": 833104,
      "modtime": "2019-06-12T17:52:45Z",
      "mode": 33261,
      "offset": 126,
      "NumLink": 0,
      "digest": "sha256:8b7c559b8cccca0d30d01bc4b5dc944766208a53d18a03aa8afe97252207521f",
      "chunkDigest": "sha256:8b7c559b8cccca0d30d01bc4b5dc944766208a53d18a03aa8afe97252207521f"
    },
    {
      "name": "lib/",
      "type": "dir",
      "modtime": "2019-08-20T10:30:43Z",
      "mode": 16877,
      "NumLink": 0
    },
    {
      "name": "lib/ld-musl-x86_64.so.1",
      "type": "reg",
      "size": 580144,
      "modtime": "2019-08-07T07:15:30Z",
      "mode": 33261,
      "offset": 512427,
      "NumLink": 0,
      "digest": "sha256:45c6ee3bd1862697eab8058ec0e462f5a760927331c709d7d233da8ffee40e9e",
      "chunkDigest": "sha256:45c6ee3bd1862697eab8058ec0e462f5a760927331c709d7d233da8ffee40e9e"
    },
    {
      "name": ".prefetch.landmark",
      "type": "reg",
      "size": 1,
      "offset": 886633,
      "NumLink": 0,
      "digest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8",
      "chunkDigest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8"
    },
    ... (omit) ...