RuntimeError: Can't start new thread #280

Open
cgebbe opened this issue Jul 31, 2024 · 10 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


cgebbe commented Jul 31, 2024

🐛 Bug

I got the following error after training for ~2h in one run and ~11h in another:

Traceback (most recent call last): 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 55, in <module> 
    main() 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 35, in main 
    _cli = LightningCLI( 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 394, in __init__ 
    self._run_subcommand(self.subcommand) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 701, in _run_subcommand 
    fn(**fn_kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit 
    call._call_and_handle_interrupt( 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt 
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch 
    return function(*args, **kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl 
    self._run(model, ckpt_path=ckpt_path) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run 
    results = self._run_stage() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage 
    self.fit_loop.run() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run 
    self.advance() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance 
    self.epoch_loop.run(self._data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run 
    self.advance(data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance 
    batch, _, __ = next(data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__ 
    batch = super().__next__() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__ 
    batch = next(self.iterator) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__ 
    out = next(self._iterator) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__ 
    out[i] = next(self.iterators[i]) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__ 
    data = self._next_data() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data 
    return self._process_data(data) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data 
    data.reraise() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise 
    raise exception 
RuntimeError: Caught RuntimeError in DataLoader worker process 15. 

Original Traceback (most recent call last): 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop 
    data = fetcher.fetch(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch 
    data.append(next(self.dataset_iter)) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 231, in __next__ 
    return self._get_sample(dataset_index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 255, in _get_sample 
    sample = next(self._dataset_iters[dataset_index]) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 365, in __next__ 
    data = self.__getitem__( 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/datasets/litdata/lit_dataset.py", line 52, in __getitem__ 
    dct = super().__getitem__(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 335, in __getitem__ 
    return self.cache[index] 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/cache.py", line 140, in __getitem__ 
    return self._reader.read(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/reader.py", line 269, in read 
    self._prepare_thread.start() 
  File "/usr/lib/python3.10/threading.py", line 935, in start 
    _start_new_thread(self._bootstrap, ()) 
RuntimeError: can't start new thread 
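(Side note: Python raises "can't start new thread" when the OS refuses to create another thread, typically because a process/thread limit or memory is exhausted. A rough diagnostic sketch, assuming such a limit is the suspect, that could be dropped into a dataloader worker:)

import resource
import threading

# How many Python threads are alive in this process right now?
print("active Python threads:", threading.active_count())
# Soft/hard limit on processes+threads for this user (Linux).
print("RLIMIT_NPROC (soft, hard):", resource.getrlimit(resource.RLIMIT_NPROC))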

Code sample

Unfortunately I can't provide a minimal code sample, but the main points are listed below (a rough sketch follows the list):

  • each dataset item is a dictionary containing numpy arrays
  • we use a CombinedStreamingDataset wrapping ~7000 small StreamingDatasets. The reason is that we need to train on several different subsets of these 7000 datasets, and this was the easiest way to do it (happy to learn about alternatives). While I know this is not optimal, it seemed to work fine at first (and also maxed out GPU utilization)
  • the dataset is wrapped in a simple torch.utils.data.DataLoader
  • the training loop is triggered using lightning.pytorch.cli.LightningCLI
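A rough sketch of that setup (illustrative only: the directory paths, dataset count, and loader arguments are placeholders, not taken from our code):

import litdata
from torch.utils.data import DataLoader

# ~7000 small, pre-optimized datasets, one directory each (paths are examples).
dataset_dirs = [f"/data/optimized/{i}" for i in range(7000)]

datasets = [litdata.StreamingDataset(input_dir=d) for d in dataset_dirs]
combined = litdata.CombinedStreamingDataset(datasets)

# Each item yielded by the datasets is a dictionary of numpy arrays.
loader = DataLoader(combined, batch_size=2, num_workers=20)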

Environment

  • litdata version: 0.2.18
  • PyTorch Version (e.g., 1.0): 2.3.1 (torch: 2.4.0+cu121)
  • OS (e.g., Linux): ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): uv pip
  • Build command you used (if compiling from source):
  • Python version: 3.10
  • CUDA/cuDNN version: 12.2
  • GPU models and configuration: A10G
  • Any other relevant information:

Additional Info

cgebbe added the bug and help wanted labels on Jul 31, 2024

tchaton commented Jul 31, 2024

Hey cgebbe,

Do you think you could provide an example with a dummy model & synthetic data? The model doesn't have to train at all.

Also, you should try to use the StreamingDataLoader. It handles a lot of things needed for correctness.
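For example (a sketch only; the dataset construction and the numbers are placeholders):

from litdata import CombinedStreamingDataset, StreamingDataset, StreamingDataLoader

combined = CombinedStreamingDataset(
    [StreamingDataset(input_dir=f"/data/optimized/{i}") for i in range(7000)]
)
# Same call signature as torch.utils.data.DataLoader, but it also tracks the
# distributed/worker state so sharding and resumption stay consistent.
loader = StreamingDataLoader(combined, batch_size=2, num_workers=20)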

Best,
T.C


cgebbe commented Jul 31, 2024

@tchaton: Thanks for the super quick reply and for pointing me to the StreamingDataLoader. Somehow I missed this; maybe it will already resolve the issue.

If not, I will try to provide a minimal reproducible example. The issue has rather high priority on our side, so I might even try a fix to limit the number of threads later.


tchaton commented Jul 31, 2024

Hey @cgebbe. Sounds great. Let me know. We can also do a pair-debugging session if you are open to it, in case you can't create a reproducible script for me to investigate on my own.


tchaton commented Jul 31, 2024

Hey @cgebbe. Any updates?


cgebbe commented Aug 1, 2024

I started using the StreamingDataLoader and faced some small obstacles:

  1. Inside the LightningDataModule, I moved the litdata dataset creation from __init__ to e.g. train_dataloader, based on "Resolve same global rank in DDP with Lightning Trainer" #250. Otherwise, data items were used more than once within one epoch.

  2. StreamingDataset uses drop_last=True by default in a distributed environment. While I see its purpose, with many small datasets of around 50 items each, num_workers * num_gpus * batch_size > num_items_per_dataset, so the dataloader gets 0 items AFAIK (see the sketch after this list). So I set drop_last=False.

  3. Somehow my small but realistic example now hangs with fast_dev_run=100 after the first validation item when using multiple GPUs. It runs fine with fast_dev_run=2 or with only 1 GPU instead of 3. Can you recommend any env flags to increase verbosity?
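To make point 2 concrete, a back-of-the-envelope calculation with my numbers (a simplification of litdata's actual sharding, just to show the order of magnitude):

num_items_per_dataset = 49
num_gpus, num_workers, batch_size = 3, 20, 2

consumers = num_gpus * num_workers                      # 60 parallel consumers per dataset
items_per_consumer = num_items_per_dataset // consumers
print(items_per_consumer)                               # 0 -> nobody gets a full batch
# With drop_last=True the partial batches are discarded, so the combined
# dataloader ends up reporting 0 items; hence the switch to drop_last=False.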

Finally, there's a small example below which does NOT hang but is otherwise pretty close to the original code. A pair-debugging session would be amazing, @tchaton; thanks a ton for the offer. I'm just waiting for the okay from my supervisor.

P.S.: I also successfully ran the system_check.py from Lightning-AI/pytorch-lightning#19609.

"""Script sketching out current training pipeline.

FILENAME=200_thread_problem.py
python $FILENAME fit --data=DataModule --model=TaskModule --trainer.fast_dev_run=100 --trainer.devices='[0,1,2]'
"""

import shutil
import litdata
import lightning as L
from lightning.pytorch.cli import LightningCLI
import torch
from torch.utils import data
import numpy as np
import PIL.Image
from pathlib import Path
import tqdm
from torch import nn
import logging


logger = logging.getLogger(__name__)

NUM_ITEMS_PER_DATASET = 49
NUM_DATASETS = 4
CACHE_DIR = Path("/scratch/dummy")
CACHE_DIR.mkdir(exist_ok=True, parents=True)


def _create_dataset(output_dir: Path):
    def _random_images(index):
        fake_image = torch.rand((3, 32, 32), dtype=torch.float32)
        dct = {"index": index, "image": fake_image}
        return dct

    litdata.optimize(
        fn=_random_images,
        inputs=list(range(NUM_ITEMS_PER_DATASET)),
        output_dir=str(output_dir),
        num_workers=4,
        chunk_bytes="64MB",
    )


def _get_dataset_dirpaths(prefix: str):
    for idx in tqdm.trange(NUM_DATASETS):
        yield CACHE_DIR / prefix / str(idx)


def create_datasets():
    for prefix in ["train", "val"]:
        for dirpath in _get_dataset_dirpaths(prefix):
            if dirpath.exists():
                shutil.rmtree(dirpath)
            _create_dataset(dirpath)


class CombinedDs(litdata.CombinedStreamingDataset):
    def __init__(self, prefix: str):
        lst = [
            litdata.StreamingDataset(input_dir=str(dirpath), drop_last=False)
            for dirpath in _get_dataset_dirpaths(prefix)
        ]
        length_per_dataset = [len(ds) for ds in lst]
        logger.info(f"{prefix=} has {length_per_dataset=}")
        super().__init__(lst)


class DataModule(L.LightningDataModule):
    def __init__(self):
        super().__init__()
        self.kwargs = dict(num_workers=20, pin_memory=True, batch_size=2)

    def train_dataloader(self):
        # created lazily here instead of __init__ (see #250); otherwise items were repeated within an epoch
        self.train_ds = CombinedDs("train")
        return self._get_dataloader(self.train_ds)

    def val_dataloader(self):
        self.val_ds = CombinedDs("val")
        return self._get_dataloader(self.val_ds)

    def _get_dataloader(self, ds):
        assert ds is not None
        dl = litdata.StreamingDataLoader(ds, **self.kwargs)
        # dl= data.DataLoader(ds, **self.kwargs)
        logger.info(f"{len(ds)=}, {len(dl)=}")
        return dl


class TaskModule(L.LightningModule):
    # from https://lightning.ai/docs/pytorch/stable/starter/introduction.html
    def __init__(self):
        super().__init__()
        self.model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def training_step(self, batch, batch_idx):
        return self._calc_loss(batch)

    def validation_step(self, batch, batch_idx):
        return self._calc_loss(batch)

    def _calc_loss(self, batch):
        x = batch["image"]
        y = self.model(x)
        loss = torch.nn.functional.mse_loss(y, x)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    logging.basicConfig(level=logging.INFO)
    if 0:  # flip to 1 to (re)create the synthetic datasets before training
        create_datasets()
    else:
        LightningCLI()


if __name__ == "__main__":
    main()


tchaton commented Aug 1, 2024

Hey @cgebbe,

  1. That's perfect.

  2. "So I set drop_last=False": this would lead to the training hanging due to gradient synchronization. I don't recommend doing so.

  3. It might hang due to drop_last=False.
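To illustrate the synchronization point: in DDP every rank must join every gradient all-reduce, so if one rank runs fewer batches, the other ranks wait forever. A standalone sketch (not litdata code, and it is expected to hang when launched, e.g. with torchrun --nproc_per_node=2):

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # CPU backend is enough for the demo
rank = dist.get_rank()

# Pretend rank 1 received one batch fewer, as can happen when tiny datasets
# shard unevenly across ranks with drop_last=False.
num_batches = 3 if rank == 0 else 2

for step in range(num_batches):
    fake_grad = torch.ones(1)
    dist.all_reduce(fake_grad)  # DDP does the equivalent inside backward()
    print(f"rank {rank} finished step {step}")
    # rank 0 blocks forever in its 3rd all_reduce; rank 1 never joins it.

dist.destroy_process_group()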


tchaton commented Aug 1, 2024

Hey @cgebbe. I made a new release. Could you try the latest release, 0.2.20? Use pip install -U litdata to upgrade.


cgebbe commented Aug 2, 2024

Thanks again for the rapid answer @tchaton.

You recommend against setting drop_last=False, i.e. keeping the default drop_last=True. However, in that case the (combined) dataloader has zero length, since num_gpus * batch_size * num_workers > each_dataset_size = 49. Am I missing something obvious?

I tried litdata==0.2.20. With DataLoader it hangs after the 10th of 16 training steps, and the loader has length 16, 16, 17 per GPU. With StreamingDataLoader, all 10 training steps pass and it hangs after the first validation step; the loader has length 10, 10, 29?! (the dataset for each of the 3 GPUs is 4*16 or 4*17, batch_size=4). When I wait for 10+ minutes, I get the error below.

[rank1]:[E801 14:37:42.731552721 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=284, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800006 milliseconds before timing out. 
[rank1]:[E801 14:37:42.731834446 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 284, last enqueued NCCL work: 284, last completed NCCL work: 283.

I'll try to reproduce the hanging with the minimal example.


Side questions

  • You mentioned it hangs due to drop_last=False. Do you have an idea why the minimal example nevertheless runs fine with the same (unbalanced) dataset sizes?
  • Are there ways to increase verbosity to narrow down where exactly it hangs? (The generic PyTorch switches I'm aware of are sketched after this list; maybe there are litdata-specific ones?)
  • Can I also move the dataset creation from e.g. train_dataloader to prepare_data or setup?
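For reference, those generic switches (standard PyTorch/NCCL environment variables, nothing litdata-specific, and I haven't verified they surface this particular hang; they need to be set before the process group is initialized, e.g. at the very top of the training script):

import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL-level logging
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra collective consistency checks
os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")        # c10d / backend C++ logs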


tchaton commented Aug 8, 2024

Hey @cgebbe. That's exactly what I meant: it would hang if you set drop_last=False. Right now, the solution is for you to add more training samples to each dataset.


tchaton commented Aug 8, 2024

We could also explore padding with duplicated data, but this would increase litdata's complexity quite a lot.
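At the user level, a sketch of that idea (not a litdata feature; the counts, path, and payload are placeholders based on the repro script above) would be to duplicate inputs at optimize() time so every small dataset has enough items for each rank/worker to form a full batch:

import litdata
import torch

NUM_ITEMS = 49
TARGET = 3 * 20 * 2 * 2  # gpus * workers * batch_size * safety factor (example numbers)

def make_item(index):
    # Illustrative payload, mirroring the repro script above.
    return {"index": index, "image": torch.rand(3, 32, 32)}

# Repeat the original indices until at least TARGET items are written.
inputs = (list(range(NUM_ITEMS)) * (TARGET // NUM_ITEMS + 1))[:TARGET]

litdata.optimize(
    fn=make_item,
    inputs=inputs,
    output_dir="/scratch/dummy/train/0",  # example path from the repro script
    num_workers=4,
    chunk_bytes="64MB",
)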
