
lmdb stuck/hangs when used with multiprocessing (PyTorch DataLoader) #350

Sander-houqi opened this issue Nov 6, 2023 · 2 comments

Sander-houqi commented Nov 6, 2023

Affected Operating Systems

  • Linux

Affected py-lmdb Version

lmdb=1.4.1

py-lmdb Installation Method

pip install lmdb

Using bundled or distribution-provided LMDB library?

Bundled

Distribution name and LMDB library version

(0, 9, 29)

Machine "free -m" output

$ free -m

              total        used        free      shared  buff/cache   available
Mem:         515461      181423        8654        3241      325382      329357
Swap:             0           0           0

Other important machine info

Linux kernel:
Linux version 3.10.0-1127.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020

OS:
Ubuntu 18.04.6 LTS

Python version: 3.10.11
PyTorch version: 1.13.0+cu116

Describe Your Problem

I train with a PyTorch DataLoader that reads from an LMDB file of about 60 million images and captions (roughly 2.5 TB). With num_workers > 0 (e.g. 4), training gets stuck at the start or at some intermediate step (e.g. step 1500). My code follows https://github.com/Lyken17/Efficient-PyTorch. The full training code is complex, but I can reliably reproduce the problem with the simplified code below; opening the db lazily in _init_db only delays the hang (to around step 5000) compared with opening it in __init__.

P.S. Calling _init_db from __init__ (opening the db eagerly) hits the same hang.


import lmdb
import six
import time
import logging
import traceback
import pyarrow as pa
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.dataloader import default_collate
from torchvision import transforms


def mycollate_fn(batch):
    # Drop samples that failed to load (__getitem__ returns None on error).
    batch = [x for x in batch if x is not None]
    return default_collate(batch)


class Laion_load(Dataset):
    def __init__(self, ann_paths):
        self.env = None  # opened lazily, once per DataLoader worker
        self.db_path = ann_paths[0]
        self.length = 62363814

        self.totensor = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
            ]
        )

    def _init_db(self):
        # Open the LMDB environment inside the worker process, after fork.
        st = time.time()
        self.env = lmdb.open(self.db_path, subdir=False,
                             readonly=True, lock=False,
                             readahead=False, meminit=False, max_readers=128)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            # self.keys = pa.deserialize(txn.get(b'__keys__'))
        logging.info("load time: {}".format(time.time() - st))

    def __getitem__(self, index):
        try:
            if self.env is None:
                self._init_db()

            with self.env.begin(write=False) as txn:
                byteflow = txn.get(str(index).encode('ascii'))
                # byteflow = txn.get(self.keys[index])

            imagebuf, org_cap, gen_cap = pa.deserialize(byteflow)
            del byteflow

            buf = six.BytesIO()
            buf.write(imagebuf)
            buf.seek(0)
            img = Image.open(buf).convert('RGB')
            img = self.totensor(img)

            return dict(input=img, org_cap=org_cap, gen_cap=gen_cap)

        except Exception as e:
            logging.error('index:{} Exception {}'.format(index, e))
            logging.error("error detail: {}".format(traceback.format_exc()))
            return None

    def __len__(self):
        return self.length

    def __repr__(self):
        return self.__class__.__name__ + ' (' + self.db_path + ')'


if __name__ == "__main__":
    test_data = Laion_load(ann_paths=["./test.lmdb"])

    # Pass mycollate_fn so the None samples returned on error are filtered
    # out instead of crashing default_collate.
    data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                             shuffle=True, collate_fn=mycollate_fn)
    for index, item in enumerate(data_loader):
        print('aa:', index)

    print("done")

Launched with multiple processes via DeepSpeed or torchrun, e.g.:
deepspeed --num_gpus=8 --master_port 6666 build_lmdb_datasets.py --deepspeed ./config.json
or
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 build_lmdb_datasets.py
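
For reference, an equivalent way to get one environment per worker is to open it in a DataLoader worker_init_fn rather than lazily in __getitem__. This is only a minimal sketch (not part of my training code), assuming the Laion_load dataset and mycollate_fn defined above:

import os
import logging
from torch.utils.data import DataLoader, get_worker_info

def open_env_per_worker(worker_id):
    # Runs inside each freshly started worker process, after fork, so the
    # LMDB environment is never shared across the fork boundary.
    dataset = get_worker_info().dataset  # this worker's copy of the Dataset
    dataset._init_db()
    logging.info("worker %d opened LMDB in pid %d", worker_id, os.getpid())

data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                         shuffle=True, collate_fn=mycollate_fn,
                         worker_init_fn=open_env_per_worker)

Either way the constraint is the same: the lmdb.Environment must be created after the worker process exists, never inherited from the parent.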

Errors/exceptions Encountered

It hangs with no error output; even Ctrl+C produces no traceback.

Describe What You Expected To Happen

I expected the read transactions to complete and training to continue.

Describe What Happened Instead

The Python processes hang: GPU utilization drops to 0, GPU memory stays allocated, and training stops.

Sander-houqi (Author) commented

Can anyone help? @jnwatson @dw Thanks.


orena1 commented Dec 11, 2023

Maybe remove the try/except. Also, do you see the "load time" log line?
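
E.g., a minimal sketch of that diagnostic, assuming the Laion_load class from the issue (LaionDebug is a hypothetical name for this sketch): log the pid when the env is opened, and delete the try/except in __getitem__ so any lmdb/pyarrow exception is re-raised by the DataLoader instead of silently becoming a None sample:

import os
import logging

logging.basicConfig(level=logging.INFO)

class LaionDebug(Laion_load):
    # Hypothetical subclass for debugging only.
    def _init_db(self):
        # Confirms each DataLoader worker opens its own environment.
        logging.info("opening LMDB env in pid %d", os.getpid())
        super()._init_db()

With the try/except removed, a worker that dies on a bad record crashes the loader loudly, which distinguishes a genuine LMDB hang from a swallowed exception.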
