Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pull: freezes on large datasets #7414

Closed
dtrifiro opened this issue Feb 22, 2022 · 3 comments · Fixed by #7460
Closed

pull: freezes on large datasets #7414

dtrifiro opened this issue Feb 22, 2022 · 3 comments · Fixed by #7460
Assignees
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? fs: http Related to the HTTP filesystem p0-critical Critical issue. Needs to be fixed ASAP.

Comments

@dtrifiro
Copy link
Contributor

dtrifiro commented Feb 22, 2022

When running dvc pull on large datasets, dvc sometimes hangs and never returns, here's an example of traceback (after interrupting twice with KeyboardInterrupt)

2022-02-22 19:56:00,887 ERROR: interrupted by the user
------------------------------------------------------------
Traceback (most recent call last):
  File "dvc/dvc/data/transfer.py", line 176, in transfer
    _do_transfer(
  File "dvc/dvc/data/transfer.py", line 112, in _do_transfer
    total_fails += sum(processor(all_file_ids))
  File "dvc/dvc/utils/threadpool.py", line 29, in imap_unordered
    done, tasks = futures.wait(
  File "versions/3.9.10/lib/python3.9/concurrent/futures/_base.py", line 307, in wait
    waiter.event.wait(timeout)
  File "versions/3.9.10/lib/python3.9/threading.py", line 574, in wait
    signaled = self._cond.wait(timeout)
  File "versions/3.9.10/lib/python3.9/threading.py", line 312, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dvc/dvc/cli/__init__.py", line 78, in main
    ret = cmd.do_run()
  File "dvc/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "dvc/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "dvc/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "dvc/dvc/repo/pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "dvc/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "dvc/dvc/repo/fetch.py", line 86, in fetch
    d, f = _fetch(
  File "dvc/dvc/repo/fetch.py", line 106, in _fetch
    downloaded += repo.cloud.pull(obj_ids, **kwargs)
  File "dvc/dvc/data_cloud.py", line 114, in pull
    return transfer(
  File "dvc/dvc/data/transfer.py", line 176, in transfer
    _do_transfer(
  File "versions/3.9.10/lib/python3.9/concurrent/futures/_base.py", line 637, in __exit__
    self.shutdown(wait=True)
  File "versions/3.9.10/lib/python3.9/concurrent/futures/thread.py", line 235, in shutdown
    t.join()
  File "versions/3.9.10/lib/python3.9/threading.py", line 1053, in join
    self._wait_for_tstate_lock()
  File "versions/3.9.10/lib/python3.9/threading.py", line 1073, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt

$ dvc doctor

DVC version: 2.9.5 (pip)

Platform: Python 3.9.10 on macOS-12.1-arm64-arm-64bit
Supports:
azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.0),
gs (gcsfs = 2022.1.0),
hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
ssh (sshfs = 2021.11.2),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3, https
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

To reproduce:

mkdir pull_freeze
pushd pull_freeze

git clone https://github.com/iterative/dvc-bench
cd dvc-bench
python -m venv .venv
./venv/bin/pip install -r requirements.txt dvc
dvc pull # hangs forever

See also iterative/dvc-bench#319

@dtrifiro
Copy link
Contributor Author

dtrifiro commented Feb 22, 2022

EDIT: also getting hangs with no reported errors

By looking at the logs, when this happens one of the threads is erroring out:

ERROR: failed to transfer 'md5: a581ff211ae0f9e33b5df67a805f1dbc' - 520, message='', url=URL('https://remote.dvc.org/dvc-bench/a5/81ff211ae0f9e33b5df67a805f1dbc')

A full traceback:
Traceback (most recent call last):
  File ".venv/lib/python3.9/site-packages/dvc/objects/transfer.py", line 25, in wrapper
    func(fs_path, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/dvc/objects/transfer.py", line 159, in func
    return dest.add(
  File ".venv/lib/python3.9/site-packages/dvc/objects/db/base.py", line 115, in add
    self._add_file(
  File ".venv/lib/python3.9/site-packages/dvc/objects/db/base.py", line 87, in _add_file
    return fs.utils.transfer(
  File ".venv/lib/python3.9/site-packages/dvc/fs/utils.py", line 96, in transfer
    _try_links(links, from_fs, from_path, to_fs, to_path)
  File ".venv/lib/python3.9/site-packages/dvc/fs/utils.py", line 66, in _try_links
    return _copy(from_fs, from_path, to_fs, to_path)
  File ".venv/lib/python3.9/site-packages/dvc/fs/utils.py", line 47, in _copy
    return from_fs.download_file(from_path, to_path)
  File ".venv/lib/python3.9/site-packages/dvc/fs/base.py", line 306, in download
    return self._download_file(from_info, to_info, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc/fs/base.py", line 368, in _download_file
    self.get_file(from_info, tmp_file, callback=callback)
  File ".venv/lib/python3.9/site-packages/dvc/fs/fsspec_wrapper.py", line 138, in get_file
    self.fs.get_file(from_info, to_info, callback=callback, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 91, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 71, in sync
    raise return_result
  File ".venv/lib/python3.9/site-packages/fsspec/asyn.py", line 25, in _runner
    result[0] = await coro
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 243, in _get_file
    self._raise_not_found_for_status(r, rpath)
  File ".venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 208, in _raise_not_found_for_status
    response.raise_for_status()
  File ".venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1004, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 520, message='', url=URL('https://remote.dvc.org/dvc-bench/b9/0b4073ceb97afa27437e5c8ecb149f')

Apparently 520 is a "catch-all" cloudflare error (see this)

@dtrifiro dtrifiro self-assigned this Feb 22, 2022
@dtrifiro dtrifiro added bug Did we break something? fs: http Related to the HTTP filesystem p0-critical Critical issue. Needs to be fixed ASAP. labels Feb 22, 2022
@daavoo daavoo added the A: data-sync Related to dvc get/fetch/import/pull/push label Feb 22, 2022
@dtrifiro dtrifiro moved this from Todo to In Progress in DVC Feb 22, 2022
dtrifiro added a commit to iterative/dvc-bench that referenced this issue Mar 2, 2022
Reducing the dataset size should make it possible to run the benchmarks again while we wait for iterative/dvc#7414 to fix #319
@dtrifiro
Copy link
Contributor Author

dtrifiro commented Mar 4, 2022

This also happens when calling dvc pull -j 1, this is the stack for all threads while hung:

(gdb) info threads
  Id   Target Id                                Frame
* 1    Thread 0x7f0ba8829740 (LWP 952005) "dvc" 0x00007f0ba88b815a in __futex_abstimed_wait_common () from /usr/lib/libc.so.6
  2    Thread 0x7f0ba5dff640 (LWP 952012) "dvc" 0x00007f0ba88b815a in __futex_abstimed_wait_common () from /usr/lib/libc.so.6
  3    Thread 0x7f0b9e2eb640 (LWP 952131) "dvc" 0x00007f0ba89409ae in epoll_wait () from /usr/lib/libc.so.6
  4    Thread 0x7f0b9d9ea640 (LWP 952132) "dvc" 0x00007f0ba88b815a in __futex_abstimed_wait_common () from /usr/lib/libc.so.6
  5    Thread 0x7f0b9ebec640 (LWP 952240) "dvc" 0x00007f0ba88b815a in __futex_abstimed_wait_common () from /usr/lib/libc.so.6
(gdb) thread apply all py-bt

Thread 5 (Thread 0x7f0b9ebec640 (LWP 952240) "dvc"):
Traceback (most recent call first):
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 600, in wait
    signaled = self._cond.wait(timeout)
  File "/home/brethil/projects/work/filesystem_spec/fsspec/asyn.py", line 53, in sync
    if event.wait(1):
  File "/home/brethil/projects/work/filesystem_spec/fsspec/asyn.py", line 85, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/brethil/projects/work/dvc/dvc/fs/fsspec_wrapper.py", line 128, in get_file
    self.fs.get_file(from_info, to_info, callback=callback, **kwargs)
  File "/home/brethil/projects/work/dvc/dvc/fs/base.py", line 354, in _download_file
    self.get_file(from_info, tmp_file, callback=callback)
  File "/home/brethil/projects/work/dvc/dvc/fs/base.py", line 292, in download
    return self._download_file(from_info, to_info, callback=callback)
  File "/home/brethil/projects/work/dvc/dvc/fs/utils.py", line 47, in _copy
    return from_fs.download_file(from_path, to_path)
  File "/home/brethil/projects/work/dvc/dvc/fs/utils.py", line 66, in _try_links
    return _copy(from_fs, from_path, to_fs, to_path)
  File "/home/brethil/projects/work/dvc/dvc/fs/utils.py", line 96, in transfer
    _try_links(links, from_fs, from_path, to_fs, to_path)
  File "/home/brethil/projects/work/dvc/dvc/objects/db.py", line 87, in _add_file
    return fs.utils.transfer(
  File "/home/brethil/projects/work/dvc/dvc/objects/db.py", line 115, in add
    self._add_file(
  File "/home/brethil/projects/work/dvc/dvc/data/transfer.py", line 162, in func
    return dest.add(
  File "/home/brethil/projects/work/dvc/dvc/data/transfer.py", line 25, in wrapper
    func(fs_path, *args, **kwargs)
  File "/home/brethil/projects/work/dvc/dvc/progress.py", line 135, in wrapped
    res = fn(*args, **kwargs)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
    work_item.run()
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 966, in _bootstrap
    self._bootstrap_inner()

Thread 4 (Thread 0x7f0b9d9ea640 (LWP 952132) "dvc"):
Traceback (most recent call first):
  <builtin_method at remote 0x7f0b9cb832a0>
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/concurrent/futures/thread.py", line 81, in _worker
    work_item = work_queue.get(block=True)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 966, in _bootstrap
    self._bootstrap_inner()

Thread 3 (Thread 0x7f0b9e2eb640 (LWP 952131) "dvc"):
Traceback (most recent call first):
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/asyncio/base_events.py", line 1845, in _run_once
    event_list = self._selector.select(timeout)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/asyncio/base_events.py", line 595, in run_forever
    self._run_once()
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 966, in _bootstrap
    self._bootstrap_inner()

Thread 2 (Thread 0x7f0ba5dff640 (LWP 952012) "dvc"):
Traceback (most recent call first):
--Type <RET> for more, q to quit, c to continue without paging--thread
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 600, in wait
    signaled = self._cond.wait(timeout)
  File "/home/brethil/projects/work/dvc-bench/.venv/lib/python3.10/site-packages/tqdm/_monitor.py", line 60, in run
    self.was_killed.wait(self.sleep_interval)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 966, in _bootstrap
    self._bootstrap_inner()

Thread 1 (Thread 0x7f0ba8829740 (LWP 952005) "dvc"):
Traceback (most recent call first):
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/threading.py", line 600, in wait
    signaled = self._cond.wait(timeout)
  File "/home/brethil/.pyenv/versions/3.10.2-debug/lib/python3.10/concurrent/futures/_base.py", line 307, in wait
    waiter.event.wait(timeout)
  File "/home/brethil/projects/work/dvc/dvc/utils/threadpool.py", line 30, in imap_unordered
    done, tasks = futures.wait(
  <built-in method sum of module object at remote 0x7f0ba84e3d70>
  File "/home/brethil/projects/work/dvc/dvc/data/transfer.py", line 112, in _do_transfer
    total_fails += sum(processor(all_file_ids))
  File "/home/brethil/projects/work/dvc/dvc/data/transfer.py", line 176, in transfer
    _do_transfer(
  File "/home/brethil/projects/work/dvc/dvc/data_cloud.py", line 114, in pull
    return transfer(
  File "/home/brethil/projects/work/dvc/dvc/repo/fetch.py", line 106, in _fetch
    downloaded += repo.cloud.pull(obj_ids, **kwargs)
  File "/home/brethil/projects/work/dvc/dvc/repo/fetch.py", line 86, in fetch
    d, f = _fetch(
  File "/home/brethil/projects/work/dvc/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/brethil/projects/work/dvc/dvc/repo/pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "/home/brethil/projects/work/dvc/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/brethil/projects/work/dvc/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/home/brethil/projects/work/dvc/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/home/brethil/projects/work/dvc/dvc/cli/__init__.py", line 78, in main
    ret = cmd.do_run()
  File "/home/brethil/projects/work/dvc-bench/.venv/bin/dvc", line 33, in <module>
    sys.exit(load_entry_point('dvc', 'console_scripts', 'dvc')())

@dtrifiro
Copy link
Contributor Author

dtrifiro commented Mar 9, 2022

Small update: by forcing a timeout value (60) to fsspec's sync wrapper in asyn.py:

diff --git a/fsspec/asyn.py b/fsspec/asyn.py
index 18cbeb0..a013cf0 100644
--- a/fsspec/asyn.py
+++ b/fsspec/asyn.py
@@ -29,7 +29,7 @@ async def _runner(event, coro, result, timeout=None):
         event.set()
 
 
-def sync(loop, func, *args, timeout=None, **kwargs):
+def sync(loop, func, *args, timeout=60, **kwargs):
     """
     Make loop run coroutine until it returns. Runs in other thread
     """

one gets the following traceback when the hang happens:

2022-03-09 17:09:50,095 ERROR: failed to transfer 'md5: eff5931c3cc9ade6d326a6234781d4cb' - : :
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dtrifiro/projects/work/filesystem_spec/fsspec/implementations/http.py", line 261, in _get_file
    async with session.get(rpath, **self.kwargs) as r:
  File "/home/dtrifiro/projects/work/aiohttp_retry/aiohttp_retry/client.py", line 99, in __aenter__
    return await self._do_request()
  File "/home/dtrifiro/projects/work/aiohttp_retry/aiohttp_retry/client.py", line 72, in _do_request
    response: ClientResponse = await self._request(
  File "/home/dtrifiro/projects/work/dvc-bench/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 559, in _request
    await resp.start(conn)
  File "/home/dtrifiro/projects/work/dvc-bench/.venv/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 898, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/home/dtrifiro/projects/work/dvc-bench/.venv/lib/python3.10/site-packages/aiohttp/streams.py", line 616, in read
    await self._waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:
[see attached file for full traceback]

full traceback

@dtrifiro dtrifiro added the upstream Issues which need to be resolved in an upstream dependency label Mar 9, 2022
@dtrifiro dtrifiro removed the upstream Issues which need to be resolved in an upstream dependency label Mar 14, 2022
@dtrifiro dtrifiro moved this from In Progress to Review In Progress in DVC Mar 15, 2022
Repository owner moved this from Review In Progress to Done in DVC Mar 28, 2022
skshetry added a commit to iterative/dvc-objects that referenced this issue Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising
timeout error even for chunked uploads, and sometimes even uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc).
We added timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can rollback this for now given that there are lots of report of failures/issues with this line,
and if we get any new reports of hangs, we'll investigate it separately.
efiop pushed a commit to iterative/dvc-objects that referenced this issue Aug 5, 2022
We have been getting reports that the timeout on sock_read was raising
timeout error even for chunked uploads, and sometimes even uploading
zero-byte files.

See: https://github.com/iterative/dvc/issues/8065
and iterative/dvc#8100.

These kinds of logics don't belong here, and should be upstreamed (eg: RetryClient/ClientTimeout, etc).
We added timeout in iterative/dvc#7460 because of freezes in
iterative/dvc#7414.

I think we can rollback this for now given that there are lots of report of failures/issues with this line,
and if we get any new reports of hangs, we'll investigate it separately.
dtrifiro added a commit to iterative/dvc-http that referenced this issue Jan 4, 2023
simplify _prepare credentials, revert some of the
changes related to iterative/dvc#7414
dtrifiro added a commit to iterative/dvc-http that referenced this issue Jan 5, 2023
simplify _prepare credentials, revert some of the
changes related to iterative/dvc#7414
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? fs: http Related to the HTTP filesystem p0-critical Critical issue. Needs to be fixed ASAP.
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants