Batch size beginning to vary halfway through epoch #179
Hi! Thanks for your contribution, great first issue!
Hey @MarcoForte. Fascinating, I have never seen this ;) Can you share a reproducible script with fake data? Does this issue still happen if you use a single StreamingDataset?
Cheers @tchaton, yeah it was a bit surprising 👀. I only noticed it because I was running in torch.compile mode and the changing batch size was triggering recompilations, causing a big slowdown. Otherwise it could easily go unnoticed... If I find a moment I'll try to put together a reproducible script, thanks.
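For context, a minimal sketch of how such recompilations can be surfaced, assuming PyTorch 2.1+; the tensors and the `step` function here are placeholders, not the actual training code:

```python
import torch

# Equivalent to running with TORCH_LOGS=recompiles: log every recompilation
# together with the guard that failed (here, the batch dimension).
torch._logging.set_logs(recompiles=True)

@torch.compile
def step(batch):
    return (batch * 2).sum()

step(torch.randn(64, 8))  # first call: initial compilation
step(torch.randn(37, 8))  # smaller batch breaks the static-shape guard -> recompile is logged
```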
Thanks a lot @MarcoForte. Looking forward to the code so we can debug it.
Hey @MarcoForte, any chance you could provide a reproducible script?
Hey @MarcoForte Unfortunately, I can't reproduce this issue on my end.

```python
import os

from lightning_cloud.utils.data_connection import add_s3_connection
from lightning.data import StreamingDataset, StreamingDataLoader
from lightning.data.streaming.serializers import JPEGSerializer
import torchvision.transforms.v2 as T
import open_clip
from tqdm import tqdm

# 1. Add the prepared dataset to your teamspace
add_s3_connection("laoin-400m")


# 2. Create the streaming dataset
class LAIONStreamingDataset(StreamingDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = open_clip.get_tokenizer('ViT-B-32', context_length=512)  # You can use any tokenizer
        self.serializer = JPEGSerializer()
        self.preprocess = T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))

    def __getitem__(self, index):
        _, image, text, _, _, _ = super().__getitem__(index)
        image = self.serializer.deserialize(image).float()
        return self.preprocess(image)


dataset = LAIONStreamingDataset(input_dir="/teamspace/s3_connections/laoin-400m")
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=os.cpu_count())

batch_size = 64
for batch in tqdm(dataloader):
    assert batch.shape[0] == batch_size
```
Hey @MarcoForte. Any updates?
🐛 Bug
Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.
To Reproduce
I logged whenever the batch size deviated from 64. It happens in every epoch, and also when training on a single GPU.
Code sample
Unfortunately I can't share the code, but I will share as much as I can, and I can run many experiments.
I'm launching the training with

```
torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py
```

I build the data pipeline with

```python
sets = [StreamingDataset(a), StreamingDataset(b)]
DataLoader(CombinedStreamingDataset(datasets=sets))
```

and launch the training through `trainer.fit`, with `drop_last=True`. A rough sketch approximating this setup on fake data is included below.
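The sketch is single-process only, so it mirrors just the dataset/dataloader side; the litdata `optimize` parameters, directory names, and sample shape are illustrative assumptions rather than the real pipeline:

```python
# Hypothetical standalone repro: two optimized datasets combined, drop_last=True,
# checking that every batch keeps the requested size.
import torch
from litdata import optimize, StreamingDataset, StreamingDataLoader, CombinedStreamingDataset


def make_sample(index):
    # Fixed-shape fake sample, so any batch-size change must come from the loader, not the data.
    return torch.randn(3, 32, 32)


if __name__ == "__main__":
    # Write two small optimized datasets to local temp directories (placeholder paths).
    for name in ("dataset_a", "dataset_b"):
        optimize(fn=make_sample, inputs=list(range(2_000)), output_dir=f"/tmp/{name}", chunk_bytes="64MB")

    sets = [StreamingDataset(f"/tmp/{name}") for name in ("dataset_a", "dataset_b")]
    combined = CombinedStreamingDataset(datasets=sets)
    dataloader = StreamingDataLoader(combined, batch_size=64, drop_last=True, num_workers=4)

    # Log any step where the batch size deviates from 64.
    for epoch in range(2):
        for step, batch in enumerate(dataloader):
            if batch.shape[0] != 64:
                print(f"epoch {epoch}, step {step}: batch size deviated to {batch.shape[0]}")
```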
Expected behavior
Fixed batch size throughout the epoch.
Environment
- NGC 23.05 container (Ubuntu 22.04, Python 3.10)
- NVIDIA CUDA 12.4.1
- NVIDIA cuBLAS 12.4.5.8
- NVIDIA cuDNN 9.1.0.70
- NVIDIA NCCL 2.21.5
- lightning==2.3.0
- litdata==0.2.12
- 8 x H100