
Large files get truncated on dvc push to HTTP remote #8100

Closed
zhansliu opened this issue Aug 5, 2022 · 9 comments · Fixed by #8128

zhansliu commented Aug 5, 2022

Bug Report

Description

When using an HTTP or HTTPS remote (e.g. Artifactory) and dvc push-ing a large file (one that takes more than 1 minute to upload), the file is silently truncated on the remote.

Reproduce

I believe this affects all HTTP remotes, but I experienced this using an Artifactory server, so I'll describe that.

Create a dvc-tracked repository, with the following .dvc/config:

[core]
    remote = artifactory
    analytics = false
    check_update = false
['remote "artifactory"']
    url = https://<artifactory host>/artifactory/datasets/
    method = PUT
    ask_password = true
    auth = basic

Create a large file large_file (large enough that it will take more than 1 minute to upload to the remote). Then do

dvc add large_file
dvc push large_file.dvc
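
If no suitably large file is at hand, here is a quick way to make one for the step above (an assumption on my part that any content works, since the bug depends only on upload duration, not file contents):

# create_large_file.py -- writes a ~2 GiB file named large_file
with open("large_file", "wb") as f:
    f.seek(2 * 1024 * 1024 * 1024 - 1)  # jump to just before the 2 GiB mark
    f.write(b"\0")                      # one byte at the end fixes the size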

What happens: dvc push does not report an error, but the uploaded version of the file is truncated. When somebody else tries to download it with dvc pull, they get a truncated version of the file.

Expected

The expected behavior is that either the file gets uploaded correctly or, at the very least, dvc reports an error when pushing.

Environment information

Output of dvc doctor:

Supports:
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.8),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.8)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: https
Workspace directory: NTFS on D:\
Repo: dvc, git

More details:

After seeing this error, I modified my DVC install slightly to use aiohttp's tracing functionality and logged every aiohttp call made during the dvc push large_file.dvc command, on a 2.2 GB large_file. Here is the log (slightly anonymized):

aiohttp_log_anonymized.txt
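
For anyone who wants to replicate the tracing, a minimal sketch of the kind of hooks involved (the handlers below are mine, not DVC's; DVC's actual session setup lives in dvc_objects):

import asyncio
import aiohttp

async def on_request_chunk_sent(session, ctx, params):
    # fires once per outgoing body chunk, so you can see exactly which chunks went out
    print(f"chunk sent: {len(params.chunk)} bytes")

async def on_request_exception(session, ctx, params):
    print(f"request failed: {params.exception!r}")

async def main():
    trace_config = aiohttp.TraceConfig()
    trace_config.on_request_chunk_sent.append(on_request_chunk_sent)
    trace_config.on_request_exception.append(on_request_exception)
    async with aiohttp.ClientSession(trace_configs=[trace_config]) as session:
        ...  # issue requests here; each one is traced

asyncio.run(main())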

What happens is:

  • The file is broken down into chunks of 5 MiB, and uploaded using Transfer-Encoding: chunked.
  • This all goes well until one minute in, when a ServerTimeoutError('Timeout on reading data from socket') is raised.
  • After this happens, a retry is triggered and the transfer is restarted.
  • BUT: the already-transferred chunks are not included in the retry; it just continues with the chunks that it hasn't transferred yet.
  • The result is that the first batch of chunks is lost, and the uploaded file is truncated.

I believe the timeout behavior comes from this line in dvc_objects. If I change that to sock_read=None, then the ServerTimeoutError doesn't happen, and everything works.

That line was changed to the current behavior in this pull request.

I think the problem is that aiohttp's sock_read timeout timer starts ticking at the beginning of the request, so it triggers whenever uploading a large file takes longer than the timeout (here, one minute).
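
For context, a minimal sketch of how such a timeout is wired up in aiohttp (the 60-second value is illustrative, not necessarily what dvc_objects sets):

import asyncio
import aiohttp

async def main():
    # sock_read bounds each read from the socket; during a long upload nothing
    # is read back for minutes, so a short value fires mid-transfer
    strict = aiohttp.ClientTimeout(total=None, sock_read=60)
    # sock_read=None disables that per-read timer -- the change described
    # above as making the ServerTimeoutError go away
    relaxed = aiohttp.ClientTimeout(total=None, sock_read=None)
    async with aiohttp.ClientSession(timeout=relaxed) as session:
        ...

asyncio.run(main())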

skshetry added a commit to iterative/dvc-objects that referenced this issue Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising a
timeout error even for chunked uploads, and sometimes even when uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

This kind of logic doesn't belong here and should be upstreamed (e.g. to RetryClient/ClientTimeout).
We added the timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can roll this back for now, given that there are lots of reports of failures/issues with this line;
if we get any new reports of hangs, we'll investigate them separately.
efiop pushed a commit to iterative/dvc-objects that referenced this issue Aug 5, 2022

sisp commented Aug 6, 2022

Seems related to iterative/dvc-http#27, although for me, with GitLab's generic packages repository as the storage backend, the upload fails after ~5 minutes without truncation.

It would be interesting to understand why the behavior differs for Artifactory and GitLab.

@pared pared added the bug label Aug 9, 2022

guysmoilov commented Aug 11, 2022

Hey, I created a consistent reproduction of this issue in this repo:
https://github.com/guysmoilov/golang-throttled-fileserver

The root cause is that the HTTP retries don't resend from the start of the pushed file; after the first try, they push from the middle of the file.

To be clear, this means that any user with an HTTP remote might have a large number of corrupted files stored in it without knowing anything is wrong.

It doesn't even require sketchy network connections to reproduce reliably: any failed push request will produce corrupted data if a retry succeeds.

You can contact me on Discord if you want help faster: Tolstoyevsky#7927

guysmoilov commented Aug 11, 2022

I'm working on debugging the client in DVC to find the root cause and fix it ASAP, but no luck so far.


guysmoilov commented Aug 11, 2022

I narrowed it down further: the problem is the aiohttp_retry RetryClient, which gets passed a data chunk generator in DVC's implementation of HTTPFileSystem. The generator doesn't seek back to the start of the file on a request retry, and retries shouldn't be supported at all when the data isn't seekable.
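
A minimal illustration of that failure mode (the generator below is my own stand-in, not DVC's actual code):

def file_chunks(path, chunk_size=5 * 1024 * 1024):
    # yield the file as 5 MiB chunks, mirroring the chunked upload traced above
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

gen = file_chunks("large_file")
attempt_1 = [next(gen) for _ in range(3)]  # first attempt sends three chunks, then fails
# a retry that reuses `gen` resumes at chunk 4 -- the first three chunks are
# never resent, so the stored object is missing its beginning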

I think we have to seek in the file on retries using TraceConfig: https://docs.aiohttp.org/en/stable/tracing_reference.html
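
Something along these lines, i.e. rewinding a seekable source whenever an attempt starts (untested; the hook wiring is my guess at how it could look):

import aiohttp

def rewind_on_attempt(fileobj):
    trace_config = aiohttp.TraceConfig()

    async def on_request_start(session, ctx, params):
        # each attempt (including retries) goes through on_request_start,
        # so the body restarts from byte 0 every time
        fileobj.seek(0)

    trace_config.on_request_start.append(on_request_start)
    return trace_config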

But it still feels dangerous since this doesn't seem like something that was well thought out in the retry client, just a workaround.


pmrowla commented Aug 12, 2022

For reference, @guysmoilov, the dagshub repo in your example doesn't work for me; when I clone it, there is no default remote configured, so you can't dvc pull any data:

Clone the following repo and dvc pull its data: https://dagshub.com/nirbarazida/hello-world

But I am able to reproduce the issue using my own test DVC repo and the PoC Go server.


pmrowla commented Aug 12, 2022

So the issue is that RetryClient defaults to retrying on all 5xx server errors. @skshetry has identified that this can be disabled by setting retry_all_server_errors=False.

However, what we probably want to do is just disable retries for write operations (PUT and POST).
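
For reference, a sketch of the knob in question (the attempt count is illustrative; only retry_all_server_errors comes from the discussion above):

from aiohttp_retry import ExponentialRetry, RetryClient

# by default RetryClient retries every 5xx response; turning that off means a
# failed PUT surfaces as an error instead of being retried mid-stream
retry_options = ExponentialRetry(attempts=3, retry_all_server_errors=False)
client = RetryClient(retry_options=retry_options)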

@pmrowla pmrowla self-assigned this Aug 12, 2022
@pmrowla pmrowla added the fs: http Related to the HTTP filesystem label Aug 12, 2022
@pmrowla pmrowla added this to DVC Aug 12, 2022
@pmrowla pmrowla moved this to Backlog in DVC Aug 12, 2022
@pmrowla pmrowla moved this from Backlog to In Progress in DVC Aug 12, 2022
Repository owner moved this from In Progress to Done in DVC Aug 12, 2022

pmrowla commented Aug 12, 2022

This is fixed in the latest dvc-data/dvc-objects releases and will be available in the next DVC release. In the meantime, you can manually install the latest dvc-data to get the fix.

pip install dvc-data==0.1.15

guysmoilov commented:

This is fixed in the latest dvc-data/dvc-objects releases and will be available in the next DVC release. In the meantime, you can manually install the latest dvc-data to get the fix.

pip install dvc-data==0.1.14

I think you mean to say:

pip install dvc-data==0.1.15

? From looking at the changelogs.


pmrowla commented Aug 15, 2022

@guysmoilov you are correct, it should be 0.1.15
