zstdchunked error deleting temp layers #3623

Open
GrigoryEvko opened this issue Oct 31, 2024 · 12 comments · May be fixed by containerd/stargz-snapshotter#1847
Labels
bug Something isn't working

Comments

@GrigoryEvko

GrigoryEvko commented Oct 31, 2024

Description

The issue seems to be related to the code imported from stargz-snapshotter, because it persists across both the nerdctl and stargz-snapshotter packages (reported here: containerd/stargz-snapshotter#1842).

Using both packages (nerdctl with nerdctl image convert --zstdchunked --oci src target) results in similar errors:

WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:2e69fe729d5788239a3713310c27ed5af34147e2b4a1df6f25ddb9dd440ba66a 11264 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:9073450b516f979f3ae63598ee8e12fd8ca6460e02fda53da70b2871641b2b4d 42496 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:69002f0165290f934255c49e3b5d58a26088445340f2843ee1959723b6ea6fea 340992 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:78d295a0c8e004e84b10c4dfafe83385e57de34965443211c2ad0faabf216c47 99840 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:43318f0006477ee6a02be460ab98c75e45bc23e2643b9f877b041b88a7cea175 26112 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:a62c8c75f8c8b0841061d8b0d3589f77756cef54f4bb6e529a83cc1b0412ee25 17408 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:14efc3c96ab2c5c0920c20a484d9755ea28e0b230310e7f64b7c7a0d31d99842 3072 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:1656497b35dd71fe1bd45aa1cdf38a6e49d035ab8efdf30b2124563a2f3faec3 3072 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:5115d7b09e2af6f16a520081cdbb07a01970bed93b71992d0da5018b53555ca5 8704 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:51c96fce3f2585d9ff3bab85bc3d8186b27610cd9807b4f490ebc4184aeb8055 30208 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:0f5d1fc9b9ac6c0541d19729fd70c74bf96fb0e21cb1b512179627ca43ecc660 3584 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:f97588faa6c468724eda134f66eff4dca51a5f55bf5a4606fba8193ecdd474ca 3584 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:0e256f201c5067cc84d79e2f9cc03d30937728fe791d43ecfa831cc00c7b7fb4 2048 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:edeb577b0f8827769d6a5a69adae1a235354afff23e4948ce0c6c2df2fa04743 151040 [] map[] [] <nil> }"
WARN[0287] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.oci.image.layer.v1.tar sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024 [] map[] [] <nil> }"

Also on some steps I ended up with containerd bug present in latest 1.7.22 version, but not in 2.0 rc it seems, it was reported multiple times recently.
Is it just me? I'm using Ubuntu 22.04 with containerd installed from the apt repo (I also tried several tarball releases, nerdctl from the latest release, and stargz-snapshotter both from a release and built from the latest git source).

Also it seems related to the recent fix in nerdctl #3079 , because zstdchunked converter looks almost identical to pre pull request nerdctl code.

Thanks in advance! Maybe I need to use a dockerized environment for the conversion? Maybe it's my giant (30 GB) image? I use a heavy ML Docker image with many Python packages and some models in its layers, because it is the image for a Kubernetes autoscaling deployment with registry storage on S3.

Steps to reproduce the issue

No response

Describe the results you received and expected

Zstdchunked converter working as expected

What version of nerdctl are you using?

v1.7.7

Are you using a variant of nerdctl? (e.g., Rancher Desktop)

None

Host information

No response

@GrigoryEvko GrigoryEvko added the kind/unconfirmed-bug-claim Unconfirmed bug claim label Oct 31, 2024
@apostasie
Contributor

Also on some steps I ended up with containerd bug present in latest 1.7.22 version, but not in 2.0 rc it seems, it was reported multiple times recently.

Do you have a link for that one?

@apostasie
Contributor

Also it seems related to the recent fix in nerdctl #3079 , because zstdchunked converter looks almost identical to pre pull request nerdctl code.

Interesting.

You are using nerdctl v1.7 (which does not have the fix mentioned).
Can you try with nerdctl v2.rc and see if the problem is still there?

@GrigoryEvko
Author

Also on some steps I ended up with containerd bug present in latest 1.7.22 version, but not in 2.0 rc it seems, it was reported multiple times recently.

Do you have a link for that one?

It's one of your issues actually (#3509 and a few similar ones).
Unfortunately I don't have the logs at hand at the moment and would need to reproduce it all, but at some point during nerdctl image convert I started receiving msg="content digest sha256:..... not found" on top of the error messages, followed by the context canceled errors. The estargz converter and ctr-remote optimize work perfectly fine with gzip compression.

Interesting.

You are using nerdctl v1.7 (which does not have the fix mentioned). Can you try with nerdctl v2.rc and see if the problem is still there?

Oh, I thought it was already in the release. But looking at the code, I think it wouldn't help, because the zstd:chunked converter uses the same problematic module from the snapshotter repo:

zstdchunkedconvert "github.com/containerd/stargz-snapshotter/nativeconverter/zstdchunked"

I need zstd:chunked lazy loading specifically, because with gzip, decompressing my 30 GB of layers uses 100% of the CPU and barely reaches the disk speed limit for a very long time. So with the regular zstd image, container startup takes approximately 8-10 minutes, depending on S3 throughput, and with estargz it's barely better: 5 minutes in total for decompressing and writing to disk. Since I need most of my 50 GB uncompressed image on the machine for ML inference anyway, it would still be disk-bound in any case, but I'd like to optimize it as much as possible :/
Thanks!

@apostasie
Contributor

@GrigoryEvko

I see.

So:

This is definitely a stargz issue (since it happens with ctr as well)

Maybe we can fix it here (just rewrite the func in nerdctl and bypass the stargz converter). So, a few comments:

  • it seems from the logs like the context is getting cancelled - so maybe it is just a timeout (with 30G+ layers, yeah, I can see this timing out)
  • avoiding the use of a temp file, like I did in the other patch, would certainly improve performance a bit, but would probably just move the goalposts
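
As a side note on the "context canceled" warnings: one plausible reading (an assumption, not something confirmed by the logs) is that the cleanup of the temporary layer runs on a context that has already been canceled, so the warnings would be a symptom rather than the cause. A minimal Go sketch of that effect follows. removeTmp, cleanupAfterCancel and cleanupCanceled are hypothetical names, not nerdctl or stargz-snapshotter functions; context.WithoutCancel is from the standard library and needs Go 1.21 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// removeTmp is a hypothetical stand-in for a context-aware delete call.
// Like most such APIs, it gives up as soon as its context is done.
func removeTmp(ctx context.Context, digest string) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("remove %s: %w", digest, err)
	}
	return nil
}

// cleanupAfterCancel simulates a conversion whose shared context is canceled
// before the deferred cleanup runs, and returns the cleanup error.
func cleanupAfterCancel(detach bool) (cleanupErr error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer func() {
		cleanupCtx := ctx
		if detach {
			// Keeps the values stored in ctx but drops the cancellation (Go 1.21+).
			cleanupCtx = context.WithoutCancel(ctx)
		}
		cleanupErr = removeTmp(cleanupCtx, "sha256:example")
	}()
	cancel() // e.g. a sibling goroutine failed and canceled the shared context
	return nil
}

// cleanupCanceled reports whether the cleanup failed with context.Canceled.
func cleanupCanceled(detach bool) bool {
	return errors.Is(cleanupAfterCancel(detach), context.Canceled)
}

func main() {
	fmt.Println("shared ctx, cleanup canceled:", cleanupCanceled(false))
	fmt.Println("detached ctx, cleanup canceled:", cleanupCanceled(true))
}
```

With the shared context the cleanup reports context.Canceled; with the detached one it goes through. Detaching only silences the warning, it does not explain why the context was canceled in the first place.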

I need tests for this - at least a reproducer.

Is there a chance you could share one of these images? Or provide a way for me to build one, that would be close enough to what you have to trigger the issue?

@AkihiroSuda
Member

Cc @ktock for stargz

@AkihiroSuda AkihiroSuda added bug Something isn't working and removed kind/unconfirmed-bug-claim Unconfirmed bug claim labels Oct 31, 2024
@GrigoryEvko
Author

GrigoryEvko commented Oct 31, 2024

I need tests for this - at least a reproducer.

Absolutely, you can use the NVIDIA Triton Inference Server image; it reproduces the error. I kinda realized why it was not reported previously - it works as intended on smaller images. Also I've got a bug in nerdctl v2.0.0-rc3 image convert not being able to find images pulled from dockerhub (even with a URL), only from other registries. Here's my log:

(base) ubuntu@ip-10-0-17-13:~$ sudo nerdctl --version
nerdctl version 2.0.0-rc.3
(base) ubuntu@ip-10-0-17-13:~$ sudo nerdctl image ls -a
REPOSITORY                               TAG                      IMAGE ID        CREATED           PLATFORM       SIZE       BLOB SIZE
ghcr.io/containerd/stargz-snapshotter    0.15.1-kind-zstd         af433c8521cb    51 seconds ago    linux/amd64    0B         487.9MB
ghcr.io/containerd/stargz-snapshotter    0.15.1-kind              77742284151e    3 minutes ago     linux/amd64    1.121GB    494.9MB
vitess/lite                              latest                   e3a3ff311b0a    7 minutes ago     linux/amd64    2.352GB    734.1MB
ubuntu                                   latest                   99c35190e22d    10 minutes ago    linux/amd64    87.56MB    29.75MB
ubuntu                                   jammy                    0e5e4a57c249    12 minutes ago    linux/amd64    87.51MB    29.54MB
nvcr.io/nvidia/tritonserver              24.10-vllm-python-py3    6c9dcf2dbe0d    21 minutes ago    linux/amd64    21.51GB    13.43GB
(base) ubuntu@ip-10-0-17-13:~$ sudo nerdctl image convert --oci --zstdchunked vitess/lite:latest vitess/lite:zstdchunked
FATA[0000] image "vitess/lite:latest": not found        
(base) ubuntu@ip-10-0-17-13:~$ sudo nerdctl image convert --oci --zstdchunked ghcr.io/containerd/stargz-snapshotter:0.15.1-kind ghcr.io/containerd/stargz-snapshotter:0.15.1-kind-zstd
sha256:af433c8521cbb3a38a2b7ccd2b11fc09b1451bbe28fb3bbfb1a3217cc42950df
(base) ubuntu@ip-10-0-17-13:~$ sudo nerdctl image ls -a
REPOSITORY                               TAG                      IMAGE ID        CREATED           PLATFORM       SIZE       BLOB SIZE
ghcr.io/containerd/stargz-snapshotter    0.15.1-kind-zstd         af433c8521cb    3 seconds ago     linux/amd64    0B         487.9MB
ghcr.io/containerd/stargz-snapshotter    0.15.1-kind              77742284151e    5 minutes ago     linux/amd64    1.121GB    494.9MB
vitess/lite                              latest                   e3a3ff311b0a    9 minutes ago     linux/amd64    2.352GB    734.1MB
ubuntu                                   latest                   99c35190e22d    12 minutes ago    linux/amd64    87.56MB    29.75MB
ubuntu                                   jammy                    0e5e4a57c249    14 minutes ago    linux/amd64    87.51MB    29.54MB
nvcr.io/nvidia/tritonserver              24.10-vllm-python-py3    6c9dcf2dbe0d    23 minutes ago    linux/amd64    21.51GB    13.43GB
(base) ubuntu@ip-10-0-17-13:~$ sudo nerdctl image convert --oci --zstdchunked nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3 nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3-zstdchunked
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:f5f79ac10bb874bdbe60f05aefdf89d24c8f07b24910dbd787b9ee4cfd390565 17408 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:176c746bdb5ad24a387e0d855c44bd57391d7c33a2bad8e19d4aced54bea5a00 71680 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:66450b4ef0ee9891dd2b44a9c947bee0db15b50863cc69a80a98a8c74ba7abf8 8704 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:87e1348a15f93372a287356a2c98836d061f33b6bd6d768ef42360e6b5f62630 341504 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:0b6a520db613be9ef2d808547aefba361788a92f82ccaa532fa3b2895f94debc 151040 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:98734bf94d2bbd3b2d3b3032ab41bea0ee1ad76db24ca128ed1d866f3df6ff8b 2560 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:6c75d6484379aa51f50d3e6a3c1f0b7acc2364aed0b9fe643224ce3134c970f3 26112 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:94236b11b2863870accf74d3d01d44e6095f7550e18451e75a1e6ca26642355a 62976 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:b8ff3f71e1363ccb2bf7e69e1fcafc48ea021e939037d2ef6782c0729e114fd1 11264 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:b05947e518f595236486c35a836848c01ccd7c06539a481d2084667e8288ddf4 3584 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:caf07e7743c0eb80a8a7ac78b631cd93b73f96e2d1a1dabe4d9ae7a9b922d24b 3072 [] map[] [] <nil> }"
FATA[0000] ref default/1/convert-zstdchunked-from-sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 locked for 103.633334ms (since 2024-10-31 18:47:12.269280827 +0000 UTC m=+486.719742327): unavailable 

To reproduce, run sudo nerdctl pull nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3 and try to convert the image.
So it looks like the issue is just some kind of size limit, because the conversion gets canceled almost instantly, about 100 ms after launching, without anything visibly breaking.

I hope it helps!

@apostasie
Contributor

apostasie commented Oct 31, 2024

@GrigoryEvko on my side, I did a quick read of the source:

  • it looks like Build does require a SectionReader - https://github.com/containerd/stargz-snapshotter/blob/a6b9bdb5a9e113277fa213e002e65bf1a761509c/estargz/build.go#L153 - although it is not clear why that is useful compared to just a Reader, especially if the content is not compressed
  • furthermore, there is a fair amount of filesystem back and forth - https://github.com/containerd/stargz-snapshotter/blob/a6b9bdb5a9e113277fa213e002e65bf1a761509c/estargz/build.go#L652
  • finally, it seems like after compression we decompress the result (same here, not sure why - it appears we want the size of the uncompressed content - https://github.com/containerd/stargz-snapshotter/blob/a6b9bdb5a9e113277fa213e002e65bf1a761509c/nativeconverter/zstdchunked/zstdchunked.go#L166)

Anyhow, there is possibly room for improvement here.

Addendum: estargz.Build does need a ReaderAt for good reasons, although it is not clear how the current implementation could scale to that size (30G+).
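
For readers unfamiliar with the distinction, here is a generic Go sketch of what a ReaderAt (wrapped in an io.SectionReader) offers that a plain Reader does not: offset-based access to any part of a blob without consuming what comes before it. It only uses the standard library and does not reproduce what estargz.Build actually does with its input; tailOfString is a hypothetical helper for illustration.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// tail reads the last n bytes of a blob of the given size through random
// access, without reading the bytes that precede them.
func tail(ra io.ReaderAt, size, n int64) ([]byte, error) {
	if n > size {
		n = size
	}
	// A SectionReader is an independent view over [size-n, size).
	sr := io.NewSectionReader(ra, size-n, n)
	return io.ReadAll(sr)
}

// tailOfString is a small convenience wrapper for demonstration purposes.
func tailOfString(s string, n int64) (string, error) {
	r := strings.NewReader(s) // *strings.Reader implements io.ReaderAt
	b, err := tail(r, r.Size(), n)
	return string(b), err
}

func main() {
	got, err := tailOfString("header|payload|footer", 6)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

Several such sections over the same ReaderAt can also be read concurrently, which a single sequential Reader cannot offer.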

@apostasie
Contributor

Also I've got a bug in nerdctl v2.0.0-rc3 image convert not being able to find images pulled from dockerhub (even with a URL), only from other registries.

I will look into this one.

@apostasie
Contributor

Also I've got a bug in nerdctl v2.0.0-rc3 image convert not being able to find images pulled from dockerhub (even with a URL), only from other registries.

I will look into this one.

PR incoming for this specifically: #3626

apostasie added a commit to apostasie/nerdctl that referenced this issue Oct 31, 2024
apostasie added a commit to apostasie/nerdctl that referenced this issue Oct 31, 2024
@apostasie
Contributor

So:

go build -o /tmp/nerdctl_s ./cmd/nerdctl/ && /tmp/nerdctl_s --debug-full image convert --oci --zstdchunked nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3 nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3-zstdchunked
DEBU[0000] stateDir: /run/user/501/containerd-rootless
DEBU[0000] RootlessKit detach-netns mode: true
DEBU[0000] rootless parent main: executing "/usr/bin/nsenter" with [-r/ -w/Users/dmp/Projects/go/nerd/nerdctl --preserve-credentials -m -U -t 1665 -F /tmp/nerdctl_s --debug-full image convert --oci --zstdchunked nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3 nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3-zstdchunked]
DEBU[0000] using igzip for decompression
DEBU[0000] zstdchunked: uncompressed sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 into sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef
DEBU[0000] zstdchunked: uncompressed sha256:6da051397311aff5f9d1d8b5c75afa073318d65f9251b7665e909a1066c809f9 into sha256:caf07e7743c0eb80a8a7ac78b631cd93b73f96e2d1a1dabe4d9ae7a9b922d24b
DEBU[0000] zstdchunked: uncompressed sha256:e6f5e18001c21008ddf1f80699663abbce296ae8311b9bd76e39f63ad746ec43 into sha256:f5f79ac10bb874bdbe60f05aefdf89d24c8f07b24910dbd787b9ee4cfd390565
DEBU[0000] zstdchunked: uncompressed sha256:5738d44ce3f25fe9275b54fd5e24d0b26d6b404ffa57de69735d51778224afe8 into sha256:0b6a520db613be9ef2d808547aefba361788a92f82ccaa532fa3b2895f94debc
DEBU[0000] zstdchunked: uncompressed sha256:86abba0172c5ca2b6660f29e0a9e9602bfe45b42ec16e9e2d29d516f6ab20373 into sha256:176c746bdb5ad24a387e0d855c44bd57391d7c33a2bad8e19d4aced54bea5a00
DEBU[0000] zstdchunked: uncompressed sha256:c2ad6da399bae2b3351c82d04f0d0ef4139a834390e551e516cb1fba74f97df2 into sha256:6c75d6484379aa51f50d3e6a3c1f0b7acc2364aed0b9fe643224ce3134c970f3
DEBU[0000] zstdchunked: uncompressed sha256:9ded5c3415695be1951b6f058e20d3363003b53a2658340c5f15d81856ad0e98 into sha256:b05947e518f595236486c35a836848c01ccd7c06539a481d2084667e8288ddf4
DEBU[0000] zstdchunked: uncompressed sha256:b9b0caed1c8c12f12dfad67a5ea1d7432c4271672cb1de4270c4867347a937f8 into sha256:94236b11b2863870accf74d3d01d44e6095f7550e18451e75a1e6ca26642355a
DEBU[0000] zstdchunked: uncompressed sha256:c9cc852679cb7fe38fcf5665287edaa9d99f14b1e1cec65f67cdefd53ff2f9e0 into sha256:87e1348a15f93372a287356a2c98836d061f33b6bd6d768ef42360e6b5f62630
DEBU[0000] zstdchunked: uncompressed sha256:4790d1bdaaa8b59802f26526c2adac543f9fa7bf765f340e7e981e0a4f845d54 into sha256:98734bf94d2bbd3b2d3b3032ab41bea0ee1ad76db24ca128ed1d866f3df6ff8b
DEBU[0000] zstdchunked: uncompressed sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 into sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef
DEBU[0000] zstdchunked: uncompressed sha256:8c975606e87f18f1f237f9bfe68b58786e20d518a493d96780e73a4dcc408a21 into sha256:66450b4ef0ee9891dd2b44a9c947bee0db15b50863cc69a80a98a8c74ba7abf8
DEBU[0000] zstdchunked: uncompressed sha256:34c2dbdcbc81ceac35887d59172aa654fd08bae28a3c41ad238152170c73ae91 into sha256:b8ff3f71e1363ccb2bf7e69e1fcafc48ea021e939037d2ef6782c0729e114fd1
DEBU[0000] zstdchunked: uncompressed sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 into sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:b8ff3f71e1363ccb2bf7e69e1fcafc48ea021e939037d2ef6782c0729e114fd1 11264 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:66450b4ef0ee9891dd2b44a9c947bee0db15b50863cc69a80a98a8c74ba7abf8 8704 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:caf07e7743c0eb80a8a7ac78b631cd93b73f96e2d1a1dabe4d9ae7a9b922d24b 3072 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:176c746bdb5ad24a387e0d855c44bd57391d7c33a2bad8e19d4aced54bea5a00 71680 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:98734bf94d2bbd3b2d3b3032ab41bea0ee1ad76db24ca128ed1d866f3df6ff8b 2560 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:6c75d6484379aa51f50d3e6a3c1f0b7acc2364aed0b9fe643224ce3134c970f3 26112 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:87e1348a15f93372a287356a2c98836d061f33b6bd6d768ef42360e6b5f62630 341504 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:b05947e518f595236486c35a836848c01ccd7c06539a481d2084667e8288ddf4 3584 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:f5f79ac10bb874bdbe60f05aefdf89d24c8f07b24910dbd787b9ee4cfd390565 17408 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:94236b11b2863870accf74d3d01d44e6095f7550e18451e75a1e6ca26642355a 62976 [] map[] [] <nil> }"
WARN[0000] failed to remove tmp uncompressed layer       error="context canceled" uncompressedDesc="&{application/vnd.docker.image.rootfs.diff.tar sha256:0b6a520db613be9ef2d808547aefba361788a92f82ccaa532fa3b2895f94debc 151040 [] map[] [] <nil> }"
FATA[0000] ref default/1/convert-zstdchunked-from-sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 locked for 23.340521ms (since 2024-10-31 22:29:51.884491233 +0100 CET m=+28191.745890060): unavailable

4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 is being uncompressed multiple times, concurrently:

https://github.com/containerd/stargz-snapshotter/blob/a6b9bdb5a9e113277fa213e002e65bf1a761509c/nativeconverter/zstdchunked/zstdchunked.go#L103-L118

I will assume this is safe to do (albeit it seems wasteful), but more importantly, we defer deletion of the uncompressed layer inside the converter function, so the deletion may well happen before other calls for the same desc are done.

This looks like the culprit ^.

Using mutexes to prevent the same desc from being processed in parallel does the trick - although this is ugly.
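
A minimal sketch of that idea, assuming one mutex per digest so that different layers still convert in parallel. This is not the code from the PR; keyedMutex and convertAll are hypothetical names, and the actual uncompress, convert and delete steps are left as a comment.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedMutex hands out one mutex per key (here: a layer digest).
type keyedMutex struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock blocks until the caller owns the mutex for key and returns its unlock func.
func (k *keyedMutex) lock(key string) (unlock func()) {
	k.mu.Lock()
	if k.locks == nil {
		k.locks = make(map[string]*sync.Mutex)
	}
	m, ok := k.locks[key]
	if !ok {
		m = &sync.Mutex{}
		k.locks[key] = m
	}
	k.mu.Unlock()
	m.Lock()
	return m.Unlock
}

// convertAll processes every digest in its own goroutine and returns the
// highest number of goroutines seen working on the same digest at once.
func convertAll(digests []string) int {
	var (
		km      keyedMutex
		wg      sync.WaitGroup
		statsMu sync.Mutex
		active  = map[string]int{}
		maxSeen int
	)
	for _, d := range digests {
		wg.Add(1)
		go func(d string) {
			defer wg.Done()
			unlock := km.lock(d)
			defer unlock()

			statsMu.Lock()
			active[d]++
			if active[d] > maxSeen {
				maxSeen = active[d]
			}
			statsMu.Unlock()

			// Uncompress, convert and delete the tmp layer would happen here.

			statsMu.Lock()
			active[d]--
			statsMu.Unlock()
		}(d)
	}
	wg.Wait()
	return maxSeen
}

func main() {
	// The same digest appears several times, as 4f4fb700... does in the log above.
	fmt.Println("max concurrent per digest:", convertAll([]string{"a", "b", "a", "a"}))
}
```

Duplicate digests are still uncompressed once per occurrence, only one after another, which is why this avoids the failure without removing the wasted work.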

nerdctl images
REPOSITORY                     TAG                                  IMAGE ID        CREATED           PLATFORM       SIZE       BLOB SIZE
nvcr.io/nvidia/tritonserver    24.10-vllm-python-py3-zstdchunked    ba4109a4c485    22 seconds ago    linux/amd64    0B         12.26GB
nvcr.io/nvidia/tritonserver    24.10-vllm-python-py3                6c9dcf2dbe0d    2 hours ago       linux/amd64    21.51GB    13.43GB

I sent a PR here on nerdctl, but I am not convinced this is enough and there may be more issues at play here (containerd gc-ing layers?).
Furthermore, it should probably go to stargz instead.

@ktock feel free to carry the PR over or use the info here to write a better patch on stargz.

@GrigoryEvko
Author

Ohh, great!! Thank you so much!

Can we instead decompress all layers only once? I think it's how it was supposed to be implemented and some concurrency leaked into it.

Anyway really impressed by the quick fix! Thanks again😄

@apostasie
Contributor

Ohh, great!! Thank you so much!

Can we instead decompress all layers only once? I think it's how it was supposed to be implemented and some concurrency leaked into it.

Anyway really impressed by the quick fix! Thanks again😄

I tried a few approaches - notably a ref counter, with the uncompressed desc stored in the map.
The extra complexity is not worth it IMHO.
Furthermore, we are constrained by the design of containerd's methods.
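
To give an idea of what a ref-counted variant might look like, here is a rough Go sketch. It is not the code that was tried; layerCache, acquire, release and run are hypothetical names, and the uncompress and delete steps are replaced by counters.

```go
package main

import (
	"fmt"
	"sync"
)

// entry tracks one uncompressed layer shared by several callers.
type entry struct {
	once         sync.Once
	refs         int
	uncompressed string
}

// layerCache uncompresses each digest once and deletes the result when the
// last user releases it.
type layerCache struct {
	mu              sync.Mutex
	entries         map[string]*entry
	uncompressCalls int
	deleteCalls     int
}

func newLayerCache() *layerCache {
	return &layerCache{entries: make(map[string]*entry)}
}

// acquire registers a user of digest and returns its uncompressed form.
func (c *layerCache) acquire(digest string) string {
	c.mu.Lock()
	e, ok := c.entries[digest]
	if !ok {
		e = &entry{}
		c.entries[digest] = e
	}
	e.refs++
	c.mu.Unlock()

	// Only the first caller does the work; the others wait for it to finish.
	e.once.Do(func() {
		c.mu.Lock()
		c.uncompressCalls++
		c.mu.Unlock()
		e.uncompressed = "uncompressed-" + digest
	})
	return e.uncompressed
}

// release drops one reference and deletes the layer when none are left.
func (c *layerCache) release(digest string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[digest]
	if !ok {
		return
	}
	e.refs--
	if e.refs == 0 {
		delete(c.entries, digest)
		c.deleteCalls++
	}
}

// run has n goroutines share one digest; all acquire before any releases.
func run(n int) (uncompressCalls, deleteCalls int) {
	c := newLayerCache()
	var acquired, done sync.WaitGroup
	acquired.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			c.acquire("sha256:example")
			acquired.Done()
			acquired.Wait()
			c.release("sha256:example")
		}()
	}
	done.Wait()
	return c.uncompressCalls, c.deleteCalls
}

func main() {
	u, d := run(3)
	fmt.Println("uncompress calls:", u, "delete calls:", d)
}
```

The barrier in run is what makes the counts deterministic here. Without it, a caller arriving after the count has dropped to zero would uncompress the layer again, which hints at the extra bookkeeping mentioned above.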

Anyhow, peeps at stargz will probably have better ideas than me on this.

3 participants