RuntimeError: Can't start new thread #280

Open
cgebbe opened this issue Jul 31, 2024 · 10 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


cgebbe commented Jul 31, 2024

🐛 Bug

I got the following error after training for ~2h in one run and ~11h in another:

Traceback (most recent call last): 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 55, in <module> 
    main() 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/scripts/lightning/main.py", line 35, in main 
    _cli = LightningCLI( 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 394, in __init__ 
    self._run_subcommand(self.subcommand) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/cli.py", line 701, in _run_subcommand 
    fn(**fn_kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit 
    call._call_and_handle_interrupt( 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt 
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch 
    return function(*args, **kwargs) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl 
    self._run(model, ckpt_path=ckpt_path) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run 
    results = self._run_stage() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage 
    self.fit_loop.run() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run 
    self.advance() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance 
    self.epoch_loop.run(self._data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run 
    self.advance(data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance 
    batch, _, __ = next(data_fetcher) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__ 
    batch = super().__next__() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__ 
    batch = next(self.iterator) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__ 
    out = next(self._iterator) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__ 
    out[i] = next(self.iterators[i]) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__ 
    data = self._next_data() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data 
    return self._process_data(data) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data 
    data.reraise() 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise 
    raise exception 
RuntimeError: Caught RuntimeError in DataLoader worker process 15. 

Original Traceback (most recent call last): 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop 
    data = fetcher.fetch(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch 
    data.append(next(self.dataset_iter)) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 231, in __next__ 
    return self._get_sample(dataset_index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/combined.py", line 255, in _get_sample 
    sample = next(self._dataset_iters[dataset_index]) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 365, in __next__ 
    data = self.__getitem__( 
  File "/home/ubuntu/deeplearning/semantic-segmentation/downstream/datasets/litdata/lit_dataset.py", line 52, in __getitem__ 
    dct = super().__getitem__(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/dataset.py", line 335, in __getitem__ 
    return self.cache[index] 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/cache.py", line 140, in __getitem__ 
    return self._reader.read(index) 
  File "/home/ubuntu/envs/pytorch/lib/python3.10/site-packages/litdata/streaming/reader.py", line 269, in read 
    self._prepare_thread.start() 
  File "/usr/lib/python3.10/threading.py", line 935, in start 
    _start_new_thread(self._bootstrap, ()) 
RuntimeError: can't start new thread 
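(Side note: Python raises "can't start new thread" when the OS refuses to create another thread, typically because a process/thread limit or memory is exhausted. A rough diagnostic sketch, assuming such a limit is the suspect, that could be dropped into a dataloader worker:)

import resource
import threading

# How many Python threads are alive in this process right now?
print("active Python threads:", threading.active_count())
# Soft/hard limit on processes+threads for this user (Linux).
print("RLIMIT_NPROC (soft, hard):", resource.getrlimit(resource.RLIMIT_NPROC))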

Code sample

Unfortunately I can't provide a minimal code sample, but the main points are listed below (a rough sketch follows the list):

  • each dataset item is a dictionary containing numpy arrays
  • we use a CombinedStreamingDataset wrapping ~7000 small StreamingDatasets. The reason is that we need to train on several different subsets of these 7000 datasets, and this was the easiest way to do it (happy to learn about alternatives). While I know this is not optimal, it seemed to work fine at first (and also maxed out GPU utilization)
  • the dataset is wrapped in a simple torch.utils.data.DataLoader
  • the training loop is triggered using lightning.pytorch.cli.LightningCLI
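A rough sketch of that setup (illustrative only: the directory paths, dataset count, and loader arguments are placeholders, not taken from our code):

import litdata
from torch.utils.data import DataLoader

# ~7000 small, pre-optimized datasets, one directory each (paths are examples).
dataset_dirs = [f"/data/optimized/{i}" for i in range(7000)]

datasets = [litdata.StreamingDataset(input_dir=d) for d in dataset_dirs]
combined = litdata.CombinedStreamingDataset(datasets)

# Each item yielded by the datasets is a dictionary of numpy arrays.
loader = DataLoader(combined, batch_size=2, num_workers=20)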

Environment

  • litdata version: 0.2.18
  • PyTorch Version (e.g., 1.0): 2.3.1 (torch: 2.4.0+cu121)
  • OS (e.g., Linux): ubuntu 22.04
  • How you installed PyTorch (conda, pip, source): uv pip
  • Build command you used (if compiling from source):
  • Python version: 3.10
  • CUDA/cuDNN version: 12.2
  • GPU models and configuration: A10G
  • Any other relevant information:

Additional Info

cgebbe added the bug and help wanted labels on Jul 31, 2024

tchaton commented Jul 31, 2024

Hey cgebbe,

Do you think you could provide an example with a dummy model & synthetic data? The model doesn't have to train at all.

Also, you should try to use the StreamingDataLoader. It handles a lot of things needed for correctness.
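For example (a sketch only; the dataset construction and the numbers are placeholders):

from litdata import CombinedStreamingDataset, StreamingDataset, StreamingDataLoader

combined = CombinedStreamingDataset(
    [StreamingDataset(input_dir=f"/data/optimized/{i}") for i in range(7000)]
)
# Same call signature as torch.utils.data.DataLoader, but it also tracks the
# distributed/worker state so sharding and resumption stay consistent.
loader = StreamingDataLoader(combined, batch_size=2, num_workers=20)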

Best,
T.C


cgebbe commented Jul 31, 2024

@tchaton: Thanks for the super quick reply and for pointing me to the StreamingDataLoader. Somehow I missed this; maybe it will already resolve the issue.

If not, I will try to provide a minimal reproducible example. The issue has rather high priority on our side, so I might even try a fix to limit the number of threads later.


tchaton commented Jul 31, 2024

Hey @cgebbe. Sounds great. Let me know. We can also do a pair-debugging session if you are open to it, in case you can't create a reproducible script for me to investigate on my own.


tchaton commented Jul 31, 2024

Hey @cgebbe. Any updates?


cgebbe commented Aug 1, 2024

I started using the StreamingDataLoader and faced some small obstacles:

  1. Inside the LightningDataModule, I moved the litdata dataset creation from __init__ to e.g. train_dataloader, based on "Resolve same global rank in DDP with Lightning Trainer" #250. Otherwise, data items were used more than once within one epoch.

  2. StreamingDataset uses drop_last=True by default in a distributed environment. While I see its purpose, with many small datasets of around 50 items each, num_workers * num_gpus * batch_size > num_items_per_dataset, so the dataloader gets 0 items AFAIK (see the sketch after this list). So I set drop_last=False.

  3. Somehow my small but realistic example now hangs with fast_dev_run=100 after the first validation item when using multiple GPUs. It runs fine with fast_dev_run=2 or with only 1 GPU instead of 3. Can you recommend any env flags to increase verbosity?
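To make point 2 concrete, a back-of-the-envelope calculation with my numbers (a simplification of litdata's actual sharding, just to show the order of magnitude):

num_items_per_dataset = 49
num_gpus, num_workers, batch_size = 3, 20, 2

consumers = num_gpus * num_workers                      # 60 parallel consumers per dataset
items_per_consumer = num_items_per_dataset // consumers
print(items_per_consumer)                               # 0 -> nobody gets a full batch
# With drop_last=True the partial batches are discarded, so the combined
# dataloader ends up reporting 0 items; hence the switch to drop_last=False.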

Finally, there's a small example below which does NOT hang but is otherwise pretty close to the original code. A pair-debugging session would be amazing, @tchaton; thanks a ton for the offer. I'm just waiting for the okay from my supervisor.

P.S.: I also successfully ran the system_check.py from Lightning-AI/pytorch-lightning#19609.

"""Script sketching out current training pipeline.

FILENAME=200_thread_problem.py
python $FILENAME fit --data=DataModule --model=TaskModule --trainer.fast_dev_run=100 --trainer.devices='[0,1,2]'
"""

import shutil
import litdata
import lightning as L
from lightning.pytorch.cli import LightningCLI
import torch
from torch.utils import data
import numpy as np
import PIL.Image
from pathlib import Path
import tqdm
from torch import nn
import logging


logger = logging.getLogger(__name__)

NUM_ITEMS_PER_DATASET = 49
NUM_DATASETS = 4
CACHE_DIR = Path("/scratch/dummy")
CACHE_DIR.mkdir(exist_ok=True, parents=True)


def _create_dataset(output_dir: Path):
    def _random_images(index):
        fake_image = torch.rand((3, 32, 32), dtype=torch.float32)
        dct = {"index": index, "image": fake_image}
        return dct

    litdata.optimize(
        fn=_random_images,
        inputs=list(range(NUM_ITEMS_PER_DATASET)),
        output_dir=str(output_dir),
        num_workers=4,
        chunk_bytes="64MB",
    )


def _get_dataset_dirpaths(prefix: str):
    for idx in tqdm.trange(NUM_DATASETS):
        yield CACHE_DIR / prefix / str(idx)


def create_datasets():
    for prefix in ["train", "val"]:
        for dirpath in _get_dataset_dirpaths(prefix):
            if dirpath.exists():
                shutil.rmtree(dirpath)
            _create_dataset(dirpath)


class CombinedDs(litdata.CombinedStreamingDataset):
    def __init__(self, prefix: str):
        lst = [
            litdata.StreamingDataset(input_dir=str(dirpath), drop_last=False)
            for dirpath in _get_dataset_dirpaths(prefix)
        ]
        length_per_dataset = [len(ds) for ds in lst]
        logger.info(f"{prefix=} has {length_per_dataset=}")
        super().__init__(lst)


class DataModule(L.LightningDataModule):
    def __init__(self):
        super().__init__()
        self.kwargs = dict(num_workers=20, pin_memory=True, batch_size=2)

    def train_dataloader(self):
        # created lazily here instead of __init__ (see #250); otherwise items were repeated within an epoch
        self.train_ds = CombinedDs("train")
        return self._get_dataloader(self.train_ds)

    def val_dataloader(self):
        self.val_ds = CombinedDs("val")
        return self._get_dataloader(self.val_ds)

    def _get_dataloader(self, ds):
        assert ds is not None
        dl = litdata.StreamingDataLoader(ds, **self.kwargs)
        # dl= data.DataLoader(ds, **self.kwargs)
        logger.info(f"{len(ds)=}, {len(dl)=}")
        return dl


class TaskModule(L.LightningModule):
    # from https://lightning.ai/docs/pytorch/stable/starter/introduction.html
    def __init__(self):
        super().__init__()
        self.model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def training_step(self, batch, batch_idx):
        return self._calc_loss(batch)

    def validation_step(self, batch, batch_idx):
        return self._calc_loss(batch)

    def _calc_loss(self, batch):
        x = batch["image"]
        y = self.model(x)
        loss = torch.nn.functional.mse_loss(y, x)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def main():
    logging.basicConfig(level=logging.INFO)
    if 0:  # flip to 1 to (re)create the synthetic datasets before training
        create_datasets()
    else:
        LightningCLI()


if __name__ == "__main__":
    main()


tchaton commented Aug 1, 2024

Hey @cgebbe,

  1. That's perfect.

  2. "So I set drop_last=False": this would lead to the training hanging due to gradient synchronization. I don't recommend doing so.

  3. It might hang due to drop_last=False.
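To illustrate the synchronization point: in DDP every rank must join every gradient all-reduce, so if one rank runs fewer batches, the other ranks wait forever. A standalone sketch (not litdata code, and it is expected to hang when launched, e.g. with torchrun --nproc_per_node=2):

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # CPU backend is enough for the demo
rank = dist.get_rank()

# Pretend rank 1 received one batch fewer, as can happen when tiny datasets
# shard unevenly across ranks with drop_last=False.
num_batches = 3 if rank == 0 else 2

for step in range(num_batches):
    fake_grad = torch.ones(1)
    dist.all_reduce(fake_grad)  # DDP does the equivalent inside backward()
    print(f"rank {rank} finished step {step}")
    # rank 0 blocks forever in its 3rd all_reduce; rank 1 never joins it.

dist.destroy_process_group()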


tchaton commented Aug 1, 2024

Hey @cgebbe. I made a new release. Could you try the latest release, 0.2.20? Use pip install -U litdata to upgrade.


cgebbe commented Aug 2, 2024

Thanks again for the rapid answer @tchaton.

You recommend against setting drop_last=False, i.e. keeping the default drop_last=True. However, in that case the (combined) dataloader has zero length, since num_gpus * batch_size * num_workers > each_dataset_size = 49. Am I missing something obvious?

I tried litdata==0.2.20. With DataLoader it hangs after the 10th of 16 training steps, and the loader has length 16, 16, 17 per GPU. With StreamingDataLoader, all 10 training steps pass and it hangs after the first validation step; the loader has length 10, 10, 29?! (the dataset for each of the 3 GPUs is 4*16 or 4*17, batch_size=4). When I wait for 10+ minutes, I get the error below.

[rank1]:[E801 14:37:42.731552721 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=284, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800006 milliseconds before timing out. 
[rank1]:[E801 14:37:42.731834446 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 284, last enqueued NCCL work: 284, last completed NCCL work: 283.

I'll try to reproduce the hanging with the minimal example.


Side questions

  • You mentioned it hangs due to drop_last=False. Do you have an idea why the minimal example nevertheless runs fine with the same (unbalanced) dataset sizes?
  • Are there ways to increase verbosity to narrow down where exactly it hangs? (The generic PyTorch switches I'm aware of are sketched after this list; maybe there are litdata-specific ones?)
  • Can I also move the dataset creation from e.g. train_dataloader to prepare_data or setup?
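For reference, those generic switches (standard PyTorch/NCCL environment variables, nothing litdata-specific, and I haven't verified they surface this particular hang; they need to be set before the process group is initialized, e.g. at the very top of the training script):

import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL-level logging
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra collective consistency checks
os.environ.setdefault("TORCH_CPP_LOG_LEVEL", "INFO")        # c10d / backend C++ logs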


tchaton commented Aug 8, 2024

Hey @cgebbe. That's exactly what I meant: it would hang if you set drop_last=False. Right now, the solution is for you to add more training samples to each dataset.


tchaton commented Aug 8, 2024

We could also explore padding with duplicated data, but this would increase litdata's complexity quite a lot.
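At the user level, a sketch of that idea (not a litdata feature; the counts, path, and payload are placeholders based on the repro script above) would be to duplicate inputs at optimize() time so every small dataset has enough items for each rank/worker to form a full batch:

import litdata
import torch

NUM_ITEMS = 49
TARGET = 3 * 20 * 2 * 2  # gpus * workers * batch_size * safety factor (example numbers)

def make_item(index):
    # Illustrative payload, mirroring the repro script above.
    return {"index": index, "image": torch.rand(3, 32, 32)}

# Repeat the original indices until at least TARGET items are written.
inputs = (list(range(NUM_ITEMS)) * (TARGET // NUM_ITEMS + 1))[:TARGET]

litdata.optimize(
    fn=make_item,
    inputs=inputs,
    output_dir="/scratch/dummy/train/0",  # example path from the repro script
    num_workers=4,
    chunk_bytes="64MB",
)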
