
lmdb stuck/hangs when used with multiprocessing (PyTorch DataLoader) #350

Sander-houqi opened this issue Nov 6, 2023 · 2 comments

Sander-houqi commented Nov 6, 2023

Affected Operating Systems

  • Linux

Affected py-lmdb Version

lmdb=1.4.1

py-lmdb Installation Method

pip install lmdb

Using bundled or distribution-provided LMDB library?

Bundled

Distribution name and LMDB library version

(0, 9, 29)

Machine "free -m" output

$ free -m

              total        used        free      shared  buff/cache   available
Mem:         515461      181423        8654        3241      325382      329357
Swap:             0           0           0

Other important machine info

Linux kernel:
Linux version 3.10.0-1127.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020

OS:
Ubuntu 18.04.6 LTS

Python version: 3.10.11
PyTorch version: 1.13.0+cu116

Describe Your Problem

I train with a PyTorch DataLoader that reads from an LMDB file of about 60 million images and captions (roughly 2.5 TB). With num_workers > 0 (e.g. 4), training gets stuck at the start or at some intermediate step (e.g. step 1500). My code follows https://github.com/Lyken17/Efficient-PyTorch. The full training code is complex, but I can reliably reproduce the problem with the simplified code below; opening the db lazily in _init_db only delays the hang (to around step 5000) compared with opening it in __init__.

P.S. Calling _init_db from __init__ (opening the db eagerly) hits the same hang.


import lmdb
import six
import time
import logging
import traceback
import pyarrow as pa
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate
from torchvision import transforms


def mycollate_fn(batch):
    # Drop samples that failed to load (__getitem__ returns None on error).
    batch = [x for x in batch if x is not None]
    return default_collate(batch)


class Laion_load(Dataset):
    def __init__(self, ann_paths):
        self.env = None  # opened lazily, once per DataLoader worker
        self.db_path = ann_paths[0]
        self.length = 62363814

        self.totensor = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
            ]
        )

    def _init_db(self):
        # Open the LMDB environment inside the worker process, after fork.
        st = time.time()
        self.env = lmdb.open(self.db_path, subdir=False,
                             readonly=True, lock=False,
                             readahead=False, meminit=False, max_readers=128)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            # self.keys = pa.deserialize(txn.get(b'__keys__'))
        logging.info("load time: {}".format(time.time() - st))

    def __getitem__(self, index):
        try:
            if self.env is None:
                self._init_db()

            with self.env.begin(write=False) as txn:
                byteflow = txn.get(str(index).encode('ascii'))
                # byteflow = txn.get(self.keys[index])

            imagebuf, org_cap, gen_cap = pa.deserialize(byteflow)
            del byteflow

            buf = six.BytesIO()
            buf.write(imagebuf)
            buf.seek(0)
            img = Image.open(buf).convert('RGB')
            img = self.totensor(img)

            return dict(input=img, org_cap=org_cap, gen_cap=gen_cap)

        except Exception as e:
            logging.error('index:{} Exception {}'.format(index, e))
            logging.error("error detail: {}".format(traceback.format_exc()))
            return None

    def __len__(self):
        return self.length

    def __repr__(self):
        return self.__class__.__name__ + ' (' + self.db_path + ')'


if __name__ == "__main__":
    test_data = Laion_load(ann_paths=["./test.lmdb"])

    # Pass mycollate_fn so the None samples returned on error are filtered
    # out instead of crashing default_collate.
    data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                             shuffle=True, collate_fn=mycollate_fn)
    for index, item in enumerate(data_loader):
        print('aa:', index)

    print("done")

Launched with multiple processes via DeepSpeed or torchrun, e.g.:
deepspeed --num_gpus=8 --master_port 6666 build_lmdb_datasets.py --deepspeed ./config.json
or
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 build_lmdb_datasets.py
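
For reference, an equivalent way to get one environment per worker is to open it in a DataLoader worker_init_fn rather than lazily in __getitem__. This is only a minimal sketch (not part of my training code), assuming the Laion_load dataset and mycollate_fn defined above:

import os
import logging
from torch.utils.data import DataLoader, get_worker_info

def open_env_per_worker(worker_id):
    # Runs inside each freshly started worker process, after fork, so the
    # LMDB environment is never shared across the fork boundary.
    dataset = get_worker_info().dataset  # this worker's copy of the Dataset
    dataset._init_db()
    logging.info("worker %d opened LMDB in pid %d", worker_id, os.getpid())

data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                         shuffle=True, collate_fn=mycollate_fn,
                         worker_init_fn=open_env_per_worker)

Either way the constraint is the same: the lmdb.Environment must be created after the worker process exists, never inherited from the parent.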

Errors/exceptions Encountered

It hangs with no error output; even Ctrl+C produces no traceback.

Describe What You Expected To Happen

I expected the read transactions to complete and training to continue.

Describe What Happened Instead

The Python processes hang: GPU utilization drops to 0, GPU memory stays allocated, and training stops.

Sander-houqi (Author) commented

Can anyone help? @jnwatson @dw Thanks.


orena1 commented Dec 11, 2023

Maybe remove the try/except. Also, do you see the "load time" log line?
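
E.g., a minimal sketch of that diagnostic, assuming the Laion_load class from the issue (LaionDebug is a hypothetical name for this sketch): log the pid when the env is opened, and delete the try/except in __getitem__ so any lmdb/pyarrow exception is re-raised by the DataLoader instead of silently becoming a None sample:

import os
import logging

logging.basicConfig(level=logging.INFO)

class LaionDebug(Laion_load):
    # Hypothetical subclass for debugging only.
    def _init_db(self):
        # Confirms each DataLoader worker opens its own environment.
        logging.info("opening LMDB env in pid %d", os.getpid())
        super()._init_db()

With the try/except removed, a worker that dies on a bad record crashes the loader loudly, which distinguishes a genuine LMDB hang from a swallowed exception.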
