RFE: Retry individual layer copies #1145
Comments
After debugging a bit more, I see the same errors with docker too. But in contrast to podman, docker retries failed downloads so this error was hidden. |
Podman is supposed to retry 3 times on network failures now. @QiWang19 Correct? |
Suspicion: Docker retries individual failed layers; we only retry the entire pull, which continues to consistently fail (which is itself interesting).
|
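To illustrate the suspicion above, here is a minimal Go sketch, not the actual podman/c/image code, contrasting a whole-pull retry with retrying each layer copy independently. The `copyLayer` function and the digests are hypothetical placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// copyLayer is a hypothetical stand-in for copying one layer blob.
func copyLayer(digest string) error {
	return errors.New("connection reset by peer") // simulate a flaky network
}

// pullWholeImage retries the entire pull: one flaky layer forces every
// layer to be transferred again on each attempt.
func pullWholeImage(layers []string, attempts int) (err error) {
	for i := 0; i < attempts; i++ {
		err = nil
		for _, l := range layers {
			if err = copyLayer(l); err != nil {
				break // abandon this attempt; restart the whole image
			}
		}
		if err == nil {
			return nil
		}
	}
	return fmt.Errorf("pull failed after %d attempts: %w", attempts, err)
}

// pullPerLayer retries each layer copy independently, so layers that
// already succeeded are never transferred again.
func pullPerLayer(layers []string, attempts int) error {
	for _, l := range layers {
		var err error
		for i := 0; i < attempts; i++ {
			if err = copyLayer(l); err == nil {
				break
			}
		}
		if err != nil {
			return fmt.Errorf("layer %s: giving up after %d attempts: %w", l, attempts, err)
		}
	}
	return nil
}

func main() {
	layers := []string{"sha256:aaaa", "sha256:bbbb"}
	fmt.Println(pullWholeImage(layers, 3))
	fmt.Println(pullPerLayer(layers, 3))
}
```

With per-layer retries, a single flaky layer no longer forces already-downloaded layers to be fetched again.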
The root cause is somewhere between Vodafone DE and Docker Hub. Nevertheless, podman could handle these issues better than it currently does. |
@mtrmac @vrothberg Could we do better? |
Auto-retries seem like a recurring issue. Maybe that is something c/image could handle. Many have reported that issue XYZ does not occur on Docker but does on Podman, so it seems that Docker may have found a sweet spot of retries, making it slightly more robust to transient network failures. @mtrmac what do you think? |
I guess? It’s useful, probably possible, OTOH not quite trivial:
|
Retry logic is pretty easy to implement in a safe way: just retry if any data has been received. |
That fails if we run out of disk space while storing the pulled data (but aborting causes the temporary copy to be deleted), a case we have already encountered. |
I haven't dug into podman's source yet. I'll look into it soon. |
Another way to say this: non-targeted heuristics like this are exactly the kind of rule I think should be avoided; target only “specific clearly identified errors”. |
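As a rough sketch of what “target only specific clearly identified errors” could look like, the Go snippet below retries only on a small, illustrative set of transient network errors and explicitly refuses to retry on disk-full. The error set here is an assumption for illustration, not c/image's actual policy.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"syscall"
)

// shouldRetry returns true only for a small set of clearly transient
// network errors. A blanket "retry if any data was received" rule would
// also retry on e.g. ENOSPC (disk full), where retrying cannot help.
func shouldRetry(err error) bool {
	switch {
	case errors.Is(err, io.ErrUnexpectedEOF):
		return true // server closed the connection mid-body
	case errors.Is(err, syscall.ECONNRESET):
		return true // "connection reset by peer"
	case errors.Is(err, syscall.ENOSPC):
		return false // out of disk space: retrying only wastes bandwidth
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true // plain timeouts are usually worth another attempt
	}
	return false
}

func main() {
	fmt.Println(shouldRetry(io.ErrUnexpectedEOF))         // true
	fmt.Println(shouldRetry(syscall.ENOSPC))              // false
	fmt.Println(shouldRetry(errors.New("403 Forbidden"))) // false
}
```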
A friendly reminder that this issue had no activity for 30 days. |
Moved the issue over to containers/image, where the work needs to happen. |
I have exactly the same issue: from the same host, with both docker and podman available and using the same proxy settings (company proxy), pulling for example the image mariadb:10.6.5 succeeds with docker and fails with podman: Docker:
Podman:
I noticed that when I launch the podman pull several times consecutively, it is always the copy of the same layer 7b1a6ab2e44d which is interrupted and never finishes ... !? |
I've tried to pull the image using skopeo specifying a format (OCI or v2s2) ... same issue ... so this is not linked to this:
When pulling the image using docker in debug mode, we can see that docker is resuming (so even better than just retrying) the copy of layer 7b1a6ab2e44d:
So there is certainly an issue at the proxy level (cache?), but resuming the download of the impacted layer allows the image pull to complete. |
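Resuming rather than restarting is plain HTTP: re-request the blob with a Range header starting at the byte offset already on disk. Below is a hedged Go sketch; the blob URL is a hypothetical placeholder, registry authentication is omitted, and it assumes the server (or proxy) honours Range requests for blobs.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// resumeDownload appends the remainder of blobURL to dest, asking the
// server to start at the byte offset we already have on disk.
func resumeDownload(blobURL, dest string) error {
	f, err := os.OpenFile(dest, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	have, err := f.Seek(0, io.SeekEnd) // bytes already written
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-", have))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// 206 means the server honoured the Range; anything else means it is
	// sending the whole blob again, which must not be appended to a partial file.
	if resp.StatusCode != http.StatusPartialContent {
		return fmt.Errorf("server did not resume (status %s)", resp.Status)
	}
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Hypothetical URL; a real registry pull also needs a Bearer token,
	// and the blob digest must be verified once the download completes.
	url := "https://registry.example.com/v2/library/mariadb/blobs/sha256:REPLACE_WITH_DIGEST"
	if err := resumeDownload(url, "layer.partial"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```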
Never seen this issue on Fedora, but I'm seeing it a lot in WSL 2. My podman info: podman info --debug: Update: This does happen on Fedora too. Very irritating. It gets almost all the way through downloading several images and then starts again from zero percent. |
I just wanted to note that an RFE bugzilla has been created for this issue too. https://bugzilla.redhat.com/show_bug.cgi?id=2009877 |
I'm seeing a lot of these errors too. Docker is more robust when pulling the same images: it completes without errors, resuming layers (as mentioned above by @grunlab) when it encounters network issues. Not being able to reliably pull large images is a serious deal-breaker and may affect the choice between Podman and Docker. |
I recently encountered the same problem with podman pulling from a private registry. I also tried downloading the layer with
If I retry, the connection doesn't close at the same byte.
|
We are using Sonatype Nexus as a private container registry. By default, Nexus times out connections after 30 seconds. When pulling containers with large layers, this timeout sometimes expires and the connection is closed. Docker is smart enough to resume the download, so after a retry or two it completes the download without the user noticing any problem. But Podman restarts the download instead of resuming, so naturally Podman hits the same timeout on each retry and eventually fails with error happened during read: unexpected EOF or a similar error. The workaround for us was to increase the Nexus timeout by setting nexus.httpclient.connectionpool.idleTime=600s in /opt/sonatype-work/nexus3/etc/nexus.properties and restarting Nexus. Hopefully this could be useful for someone in the same situation. |
Thank you Amir for the insight. I always get this problem when I am running an active VPN connection on the host, no matter whether it's Linux, Mac, or Windows.
|
With #1816 and #1847, c/image will resume layer pulls after “unexpected EOF” and “connection reset by peer” errors, limited to cases where there has been enough progress, or enough time, since the previous attempt. This should hopefully alleviate the most frequent cases without interfering too much with any existing higher-level retry logic. |
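For illustration only, here is a small Go sketch of the kind of gating heuristic described above. The structure and thresholds are assumptions, not the code from #1816/#1847: resume only if the previous attempt made meaningful progress or enough time has passed since it.

```go
package main

import (
	"fmt"
	"time"
)

// resumeGate decides whether another resume attempt is worthwhile.
// The thresholds are illustrative, not the values used by c/image.
type resumeGate struct {
	minProgress int64         // bytes that must have arrived since the last attempt
	minInterval time.Duration // or: minimum time since the last attempt
	lastOffset  int64
	lastTime    time.Time
}

func (g *resumeGate) allowResume(currentOffset int64) bool {
	now := time.Now()
	progressed := currentOffset-g.lastOffset >= g.minProgress
	waited := g.lastTime.IsZero() || now.Sub(g.lastTime) >= g.minInterval
	if progressed || waited {
		g.lastOffset, g.lastTime = currentOffset, now
		return true
	}
	return false // the connection keeps dying without progress: give up
}

func main() {
	g := &resumeGate{minProgress: 1 << 20 /* 1 MiB */, minInterval: 10 * time.Second}
	fmt.Println(g.allowResume(5 << 20)) // true: first failure, plenty of progress
	fmt.Println(g.allowResume(5 << 20)) // false: no new bytes, no time elapsed
}
```

The idea is to avoid hammering a connection that consistently dies without progress, while still letting genuinely partial downloads pick up where they left off.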
Hopefully the resume on EOFs gives us enough resiliency as mentioned in containers/image#1145 (comment). Signed-off-by: Alex Kalenyuk <[email protected]>
/kind bug
I'm unable to pull certain images from dockerhub
Steps to reproduce the issue:
1. podman pull octoprint/octoprint
2. Wait
3. Get the following error:
Describe the results you received:
After a while I get a TCP connection reset error. This is the complete output of the run:
Describe the results you expected:
The image should be pulled successfully
Additional information you deem important (e.g. issue happens only occasionally):
Smaller images may take an unusual amount of time, but they're pulled successfully:
Output of podman version:
Output of podman info --debug:
Package info (e.g. output of rpm -q podman or apt list podman):
Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?
Yes
Additional environment details (AWS, VirtualBox, physical, etc.):
Local box with voidlinux x86_64 glibc