fsspec.exceptions.FSTimeoutError when downloading dataset #7164

Open
timonmerk opened this issue Sep 24, 2024 · 5 comments

@timonmerk

Describe the bug

I am trying to download the librispeech_asr clean dataset, which results in a FSTimeoutError exception after downloading around 61% of the data.

Steps to reproduce the bug

import datasets
datasets.load_dataset("librispeech_asr", "clean")

The output is as follows:

Downloading data: 61%|██████████████▋ | 3.92G/6.39G [05:00<03:06, 13.2MB/s]Traceback (most recent call last):
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/implementations/http.py", line 262, in _get_file
chunk = await r.content.read(chunk_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/aiohttp/streams.py", line 393, in read
await self._wait("read")
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/aiohttp/streams.py", line 311, in _wait
with self._timer:
^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/aiohttp/helpers.py", line 713, in __exit__
raise asyncio.TimeoutError from None
TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/load_dataset.py", line 3, in <module>
datasets.load_dataset("librispeech_asr", "clean")
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/load.py", line 2096, in load_dataset
builder_instance.download_and_prepare(
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/builder.py", line 924, in download_and_prepare
self._download_and_prepare(
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/builder.py", line 1647, in _download_and_prepare
super()._download_and_prepare(
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/builder.py", line 977, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Timon/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/2712a8f82f0d20807a56faadcd08734f9bdd24c850bb118ba21ff33ebff0432f/librispeech_asr.py", line 115, in _split_generators
archive_path = dl_manager.download(_DL_URLS[self.config.name])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/download/download_manager.py", line 159, in download
downloaded_path_or_paths = map_nested(
^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 512, in map_nested
_single_map_nested((function, obj, batched, batch_size, types, None, True, None))
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/py_utils.py", line 380, in _single_map_nested
return [mapped_item for batch in iter_batched(data_struct, batch_size) for mapped_item in function(batch)]
^^^^^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/download/download_manager.py", line 216, in _download_batched
self._download_single(url_or_filename, download_config=download_config)
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/download/download_manager.py", line 225, in _download_single
out = cached_path(url_or_filename, download_config=download_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 205, in cached_path
output_path = get_from_cache(
^^^^^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 415, in get_from_cache
fsspec_get(url, temp_file, storage_options=storage_options, desc=download_desc, disable_tqdm=disable_tqdm)
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/datasets/utils/file_utils.py", line 334, in fsspec_get
fs.get_file(path, temp_file.name, callback=callback)
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Timon/Documents/iEEG_deeplearning/wav2vec_pretrain/.venv/lib/python3.12/site-packages/fsspec/asyn.py", line 101, in sync
raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
Downloading data: 61%|██████████████▋ | 3.92G/6.39G [05:00<03:09, 13.0MB/s]

Expected behavior

The download should complete without timing out.

Environment info

Python version 3.12.6

Dependencies:

dependencies = [
"accelerate>=0.34.2",
"datasets[audio]>=3.0.0",
"ipython>=8.18.1",
"librosa>=0.10.2.post1",
"torch>=2.4.1",
"torchaudio>=2.4.1",
"transformers>=4.44.2",
]

macOS 14.6.1 (23G93)

@lhoestq
Member

lhoestq commented Sep 24, 2024

Hi! If you check the dataset loading script here, you'll see that it downloads the data from OpenSLR, and apparently their storage has timeout issues. It would be great to ultimately host the dataset on Hugging Face instead.

In the meantime, I can only recommend trying again later :/

@timonmerk
Author

Ok, still many thanks!

@Epiphero

I'm getting the same error with CSTR-Edinburgh/vctk, and it also times out at exactly 5 minutes, so I don't think the remote host is the one timing out. It looks like a universal fsspec timeout is being hit starting in v3.
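
The 5-minute figure is consistent with the progress bar in the original report. As a rough check, assuming the bar uses decimal units (1 G = 10^9 bytes) and that the displayed 13.2 MB/s rate held throughout, the transfer had been running for roughly 297 s when it failed, and the full 6.39 GB archive would need roughly 484 s. A fixed 300 s limit, which also appears to be aiohttp's documented default total timeout, would therefore be hit no matter which host serves the file. The helper name below is made up for illustration:

```python
# Back-of-the-envelope check using the numbers from the progress bar.
# Assumptions: decimal units (1 GB = 1e9 bytes) and a constant transfer rate.
GB = 1e9
MB = 1e6
LIMIT_SECONDS = 300  # the 5-minute limit observed in this thread


def seconds_to_download(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time needed to transfer size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


elapsed_at_failure = seconds_to_download(3.92 * GB, 13.2 * MB)
full_download = seconds_to_download(6.39 * GB, 13.2 * MB)

print(f"elapsed when the error was raised: ~{elapsed_at_failure:.0f} s")
print(f"time needed for the whole archive: ~{full_download:.0f} s")
print(f"fits within {LIMIT_SECONDS} s: {full_download <= LIMIT_SECONDS}")
```

Under these assumptions, any total timeout below about 500 s would interrupt this download at this connection speed.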

@lhoestq
Member

lhoestq commented Oct 24, 2024

In v3 we cleaned up the download parts of the library to make them more robust for HF downloads and to simplify support for script-based datasets. As a side effect, downloads from other hosts no longer use the same code, so timeout handling may have changed. In any case, it should be possible to tweak fsspec to use retries.

For example, maybe with aiohttp_retry (I haven't tried it):

import fsspec
from aiohttp_retry import RetryClient

fsspec.filesystem("http")._session = RetryClient()

Related topic: #7175
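
A coarser alternative to retrying inside the HTTP session is to retry the whole call. The sketch below is a generic, standard-library-only helper; `retry_call` and its parameters are hypothetical names, not part of datasets or fsspec. It only helps with transient failures: if every attempt hits the same fixed time limit, retrying will fail the same way each time.

```python
import time


def retry_call(fn, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call fn() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Hypothetical usage (needs `datasets` and `fsspec` installed, not run here):
# import datasets
# from fsspec.exceptions import FSTimeoutError
# ds = retry_call(
#     lambda: datasets.load_dataset("librispeech_asr", "clean"),
#     exceptions=(FSTimeoutError,),
# )
```

The `sleep` parameter is injectable so the backoff can be exercised without actually waiting.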

@JonasLoos
Contributor

JonasLoos commented Oct 26, 2024

Adding a timeout argument to the fs.get_file call in fsspec_get in datasets/utils/file_utils.py might fix this (source code):

fs.get_file(path, temp_file.name, callback=callback, timeout=3600)

Setting timeout=1 fails after about one second, so setting it to 3600 should give us 1 h. I haven't really tested this, though. I'm also not sure what implications this has or whether it causes errors for other fs implementations/configurations.

This is using datasets==3.0.1 and Python 3.11.6.


Edit: This doesn't seem to change the existing timeout; it adds a second timeout counter (probably in sync in fsspec/asyn.py). So the time limit for downloading can be shortened this way, but not extended.
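
The behaviour described in this edit is what one would expect from two nested timeouts, where whichever limit is shorter fires first. The following is a minimal standard-library sketch of that idea, not fsspec's or aiohttp's actual code: `download` stands in for the HTTP transfer with its own inner limit, and `get_file` stands in for an outer wrapper that can add a second limit. All names and durations are illustrative assumptions.

```python
import asyncio


async def download(seconds_needed: float, inner_timeout: float) -> str:
    # Stand-in for the HTTP transfer: the "session" enforces its own limit.
    await asyncio.wait_for(asyncio.sleep(seconds_needed), timeout=inner_timeout)
    return "done"


async def get_file(seconds_needed, inner_timeout, outer_timeout=None):
    # Stand-in for the outer wrapper: an optional second limit around the call.
    coro = download(seconds_needed, inner_timeout)
    try:
        if outer_timeout is not None:
            return await asyncio.wait_for(coro, timeout=outer_timeout)
        return await coro
    except asyncio.TimeoutError:
        return "timeout"


def run(seconds_needed, inner_timeout, outer_timeout=None):
    return asyncio.run(get_file(seconds_needed, inner_timeout, outer_timeout))


# A generous outer limit cannot rescue a transfer that exceeds the inner one.
print(run(seconds_needed=0.2, inner_timeout=0.05, outer_timeout=10))
# A short outer limit cuts the transfer off even if the inner limit is generous.
print(run(seconds_needed=0.2, inner_timeout=1.0, outer_timeout=0.05))
# With enough room in both limits, the transfer completes.
print(run(seconds_needed=0.01, inner_timeout=1.0, outer_timeout=1.0))
```

In this model, the only way to allow a longer download is to raise the inner limit, which is what passing a larger timeout to the HTTP client does.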


Edit 2: fs is of type fsspec.implementations.http.HTTPFileSystem, which initializes an aiohttp.ClientSession using client_kwargs. We can pass these when calling load_dataset.

TL;DR: this fixes it:

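import aiohttp
import datasets

# client_kwargs are forwarded to aiohttp.ClientSession; this raises the total timeout to 1 h.
dataset = datasets.load_dataset(
    dataset_name,  # placeholder, e.g. "librispeech_asr"
    storage_options={"client_kwargs": {"timeout": aiohttp.ClientTimeout(total=3600)}},
)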
