Large files get truncated on dvc push to HTTP remote #8100
We have been getting reports that the timeout on sock_read was raising a timeout error even for chunked uploads, and sometimes even when uploading zero-byte files. See: https://github.com/iterative/dvc/issues/8065 and iterative/dvc#8100. This kind of logic doesn't belong here, and should be upstreamed (e.g. RetryClient/ClientTimeout, etc.). We added the timeout in iterative/dvc#7460 because of freezes in iterative/dvc#7414. I think we can roll this back for now given that there are lots of reports of failures/issues with this line, and if we get any new reports of hangs, we'll investigate them separately.
Seems related to iterative/dvc-http#27, although for me, with GitLab's generic packages repository as the storage backend, the upload fails after ~5 minutes without truncation. It would be interesting to understand why the behavior differs for Artifactory and GitLab.
Hey, I created a consistent reproduction of this issue in this repo: The root cause is that the HTTP retries don't send from the start of the pushed file. After the first attempt fails, the retry pushes from the middle of the file. To be clear, this means that any user with an HTTP remote might have a large number of corrupted files stored in that remote without knowing anything is wrong. It doesn't even require a sketchy network connection to reproduce reliably: any failed push request will cause corrupted data if a retry succeeds. You can contact me on Discord if you want help faster: Tolstoyevsky#7927
I'm working on debugging the client in DVC to find the root cause and fix it ASAP, but no luck so far.
I narrowed it down further: the problem is the aiohttp_retry RetryClient, which gets passed a data chunk generator in DVC's implementation of HTTPFileSystem. The generator doesn't seek back to the start of the file when the request is retried, and retries should be totally unsupported when the data isn't seekable. I think we have to seek in the file on retries using TraceConfig: https://docs.aiohttp.org/en/stable/tracing_reference.html But it still feels dangerous, since this doesn't seem like something that was well thought out in the retry client, just a workaround.
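To make the failure mode described above concrete, here is a minimal standalone sketch. It does not use aiohttp, aiohttp_retry or DVC at all: `read_chunks`, `FlakyServer` and `upload_with_retry` are hypothetical names invented for this illustration, and the "server" is just an in-memory object that fails the first request partway through. It only shows why reusing a partially consumed chunk generator on retry stores a truncated body, and how rewinding the file before each attempt avoids that.

```python
import io

CHUNK_SIZE = 4  # tiny chunk size so the effect is easy to see


def read_chunks(fobj, chunk_size=CHUNK_SIZE):
    """Yield chunks starting from the file's *current* position."""
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


class FlakyServer:
    """Hypothetical server that fails the first request after two chunks."""

    def __init__(self):
        self.calls = 0
        self.stored = b""

    def put(self, body):
        self.calls += 1
        received = b""
        for index, chunk in enumerate(body):
            received += chunk
            if self.calls == 1 and index == 1:
                raise ConnectionError("simulated failure mid-upload")
        self.stored = received


def upload_with_retry(server, make_body, attempts=2):
    """Call server.put(make_body()) and retry on ConnectionError."""
    for attempt in range(attempts):
        try:
            server.put(make_body())
            return
        except ConnectionError:
            if attempt == attempts - 1:
                raise


payload = b"0123456789abcdef"

# Buggy variant: every attempt receives the same, already advanced generator.
shared_body = read_chunks(io.BytesIO(payload))
buggy = FlakyServer()
upload_with_retry(buggy, lambda: shared_body)

# Rewinding variant: seek to the start and build a fresh generator per attempt.
fobj = io.BytesIO(payload)


def fresh_body():
    fobj.seek(0)
    return read_chunks(fobj)


fixed = FlakyServer()
upload_with_retry(fixed, fresh_body)

print(buggy.stored)  # only the tail of the payload
print(fixed.stored)  # the full payload
```

In the buggy variant the retry "succeeds" from the server's point of view, which matches the report that no error is surfaced to the user. The rewinding variant only works because the underlying object is seekable; for a non-seekable stream there is nothing to rewind.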
For reference, @guysmoilov the dagshub repo in your example doesn't work for me, when I clone it there is no default remote configured so you can't
But I am able to reproduce the issue using my own test DVC repo and the POC Go server.
So the issue is that RetryClient defaults to retrying on all 5xx server errors. @skshetry has identified that this can be disabled by setting However, what we probably want to do is just disable retries for write operations (PUT and POST). |
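As a rough sketch of the "no retries for write operations" idea, the decision can be expressed as a small predicate. `should_retry` and `WRITE_METHODS` are hypothetical names made up here; this is not the aiohttp_retry API and not necessarily how the fix was implemented in dvc-objects. It assumes, as stated above, that retries are triggered by 5xx responses.

```python
# Methods whose request body may be a one-shot stream (assumption for this sketch).
WRITE_METHODS = frozenset({"PUT", "POST"})


def should_retry(method: str, status: int) -> bool:
    """Retry only 5xx responses, and never for write requests.

    A write request may carry a partially consumed streaming body, so
    replaying it could upload only the remainder of the file.
    """
    is_server_error = 500 <= status < 600
    return is_server_error and method.upper() not in WRITE_METHODS


print(should_retry("GET", 503))  # True: safe to repeat a read
print(should_retry("PUT", 503))  # False: surface the error instead
print(should_retry("GET", 404))  # False: not a server error
```

With a policy like this, a failed upload is reported to the user as an error rather than being silently retried with a truncated body.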
This is fixed in the latest dvc-data/dvc-objects releases and will be available in the next DVC release. In the meantime, you can manually install the latest dvc-data to get the fix.
I think you mean to say:
? |
@guysmoilov you are correct, it should be |
Bug Report
Description
When using an HTTP or HTTPS remote (e.g. Artifactory) and dvc push-ing a large file (one that takes more than 1 minute to upload), the uploaded file is silently truncated.
Reproduce
I believe this affects all HTTP remotes, but I experienced this using an Artifactory server, so I'll describe that.
Create a dvc-tracked repository, with the following .dvc/config:
Create a large file large_file (large enough that it will take more than 1 minute to upload to the remote). Then do
What happens:
dvc push does not report an error, but the uploaded version of the file is truncated. When somebody else tries to download it with dvc pull, they get a truncated version of the file.
Expected
The expected behavior is that either the file gets uploaded correctly, or at the very least dvc reports an error when pushing.
Environment information
Output of dvc doctor:
More details:
After seeing this error, I modified my dvc install a little to use aiohttp's tracing functionality to trace the calls to aiohttp. I then logged everything that happened when I ran the dvc push large_file.dvc command on a 2.2 GB large_file. Here is the log (slightly anonymized): aiohttp_log_anonymized.txt
What happens is:
Transfer-Encoding: chunked
.ServerTimeoutError('Timeout on reading data from socket')
I believe the timeout behavior comes from this line in dvc_objects. If I change that to sock_read=None, then the ServerTimeoutError doesn't happen, and everything works.
That line was changed to the current behavior in this pull request.
I think the problem is that aiohttp's sock_read timeout timer starts ticking at the beginning of the request, which means that it will trigger when you try to upload large files.