Affected Operating Systems
Linux
Affected py-lmdb Version
lmdb=1.4.1
py-lmdb Installation Method
pip install lmdb
Using bundled or distribution-provided LMDB library?
Bundled
Distribution name and LMDB library version
(0, 9, 29)
Machine "free -m" output
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         515461      181423        8654        3241      325382      329357
Swap:             0           0           0
Other important machine info
Linux kernel version:
Linux version 3.10.0-1127.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Tue Mar 31 23:36:51 UTC 2020
OS: Ubuntu 18.04.6 LTS
Python version: 3.10.11
PyTorch version: 1.13.0+cu116
Describe Your Problem
I train with a PyTorch DataLoader, loading data from an LMDB file of about 60 million images and captions (roughly 2.5 TB). When I train with num_workers > 0 (e.g. 4), training gets stuck either at the start or at an intermediate step (e.g. step 1500). My code follows https://github.com/Lyken17/Efficient-PyTorch. The full training code is complex, but I can reliably reproduce the problem with the simplified code below; opening the environment lazily in each worker only delays the hang to around step 5000 compared with opening one environment in __init__.
PS: calling _init_db from __init__ hits the same hang.
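For reference, the eager variant mentioned above would look roughly like the sketch below (not the exact code used; the class name LaionLoadEager and the trimmed __getitem__ are made up here for illustration). It opens the environment in __init__, i.e. in the parent process before the DataLoader forks its workers, and it shows the same hang:

import lmdb
import pyarrow as pa
from torch.utils.data import Dataset

class LaionLoadEager(Dataset):
    # Hypothetical eager variant: the LMDB environment is opened in
    # __init__ (in the parent process), so the handle is inherited by
    # the forked DataLoader workers instead of being opened per worker.
    def __init__(self, ann_paths):
        self.db_path = ann_paths[0]
        self.env = lmdb.open(self.db_path, subdir=False,
                             readonly=True, lock=False,
                             readahead=False, meminit=False, max_readers=128)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))

    def __getitem__(self, index):
        # Returns the raw serialized record; decoding is omitted in this sketch.
        with self.env.begin(write=False) as txn:
            return txn.get(str(index).encode('ascii'))

    def __len__(self):
        return self.length

The simplified reproduction code is: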
import lmdb
import six, random
import time
import logging
import traceback

import pyarrow as pa
from PIL import Image

import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate
from torchvision import transforms


def mycollate_fn(batch):
    # Drop samples that failed to load (__getitem__ returned None) before collating.
    batch = list(filter(lambda x: not isinstance(x, type(None)), batch))
    return default_collate(batch)


class Laion_load(Dataset):
    def __init__(self, ann_paths):
        self.env = None
        self.length = 62363814
        self.ann_paths = ann_paths
        self.totensor = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
            ]
        )

    def _init_db(self):
        # Open the LMDB environment lazily, once per worker process.
        self.db_path = self.ann_paths[0]
        st = time.time()
        self.env = lmdb.open(self.db_path, subdir=False,
                             readonly=True, lock=False,
                             readahead=False, meminit=False, max_readers=128)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            # self.keys = pa.deserialize(txn.get(b'__keys__'))
        end = time.time()
        logging.info("load time: {} ".format(end - st))

    def __getitem__(self, index):
        try:
            if self.env is None:
                self._init_db()
            encode_index = str(index).encode('ascii')
            with self.env.begin(write=False) as txn:
                byteflow = txn.get(encode_index)
                # byteflow = txn.get(self.keys[index])
            imagebuf, org_cap, gen_cap = pa.deserialize(byteflow)
            del byteflow
            buf = six.BytesIO()
            buf.write(imagebuf)
            buf.seek(0)
            img = Image.open(buf).convert('RGB')
            img = self.totensor(img)
            return dict(input=img,
                        org_cap=org_cap,
                        gen_cap=gen_cap)
        except Exception as e:
            logging.error('index:{} Exception {}'.format(index, e))
            logging.error("error detail: {}".format(traceback.format_exc()))
            return None

    def __len__(self):
        return self.length

    def __repr__(self):
        return self.__class__.__name__ + ' (' + self.db_path + ')'


if __name__ == "__main__":
    test_data = Laion_load(ann_paths=["./test.lmdb"])
    data_loader = DataLoader(test_data, batch_size=100, num_workers=8, shuffle=True)
    for index, item in enumerate(data_loader):
        print('aa:', index)
    print("done")
Multi-process training is launched with deepspeed or torchrun, e.g.:
deepspeed --num_gpus=8 --master_port 6666 build_lmdb_datasets.py --deepspeed ./config.json
or
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 build_lmdb_datasets.py
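The launchers above start one process per GPU. The training script itself is not shown; a typical per-rank data-loading setup under torchrun or deepspeed would look roughly like this (an assumption for illustration, not the original code):

# Hypothetical per-rank setup: each of the 8 processes builds its own
# DataLoader over a DistributedSampler shard of the LMDB-backed dataset.
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")
test_data = Laion_load(ann_paths=["./test.lmdb"])
sampler = DistributedSampler(test_data, shuffle=True)
data_loader = DataLoader(test_data, batch_size=100, num_workers=8,
                         sampler=sampler, collate_fn=mycollate_fn)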
Errors/exceptions Encountered
It gets stuck with no error; even CTRL+C produces no error output.
Describe What You Expected To Happen
I expected the read transactions to complete successfully and training to continue.
Describe What Happened Instead
The Python process hangs, GPU utilization drops to 0, GPU memory is still held, and training stops.