Batch size beginning to vary halfway through epoch #179
Hi! Thanks for your contribution, great first issue!
Hey @MarcoForte. Fascinating, I have never seen this ;) Can you share a reproducible script with fake data? Does this issue still happen if you use a single StreamingDataset?
Cheers @tchaton, yeah it was a bit surprising 👀. I only noticed it because I was running in torch.compile mode and the changing batch size was triggering recompilations, causing a big slowdown. Otherwise it could easily go unnoticed... If I find a moment I'll try to put together a reproducible script, thanks.
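For context, a minimal sketch of how such recompilations can be surfaced, assuming PyTorch 2.1+; the tensors and the `step` function here are placeholders, not the actual training code:

```python
import torch

# Equivalent to running with TORCH_LOGS=recompiles: log every recompilation
# together with the guard that failed (here, the batch dimension).
torch._logging.set_logs(recompiles=True)

@torch.compile
def step(batch):
    return (batch * 2).sum()

step(torch.randn(64, 8))  # first call: initial compilation
step(torch.randn(37, 8))  # smaller batch breaks the static-shape guard -> recompile is logged
```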
Thanks a lot @MarcoForte. Looking forward to the code so we can debug it.
Hey @MarcoForte, any chance you could provide a reproducible script?
Hey @MarcoForte Unfortunately, I can't reproduce this issue on my end.

```python
import os

from lightning_cloud.utils.data_connection import add_s3_connection
from lightning.data import StreamingDataset, StreamingDataLoader
from lightning.data.streaming.serializers import JPEGSerializer
import torchvision.transforms.v2 as T
import open_clip
from tqdm import tqdm

# 1. Add the prepared dataset to your teamspace
add_s3_connection("laoin-400m")


# 2. Create the streaming dataset
class LAIONStreamingDataset(StreamingDataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.tokenizer = open_clip.get_tokenizer('ViT-B-32', context_length=512)  # You can use any tokenizer
        self.serializer = JPEGSerializer()
        self.preprocess = T.Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))

    def __getitem__(self, index):
        _, image, text, _, _, _ = super().__getitem__(index)
        image = self.serializer.deserialize(image).float()
        return self.preprocess(image)


dataset = LAIONStreamingDataset(input_dir="/teamspace/s3_connections/laoin-400m")
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=os.cpu_count())

batch_size = 64
for batch in tqdm(dataloader):
    assert batch.shape[0] == batch_size
```
Hey @MarcoForte. Any updates?
🐛 Bug
Hello, I'm running into an issue where my batch size begins to vary halfway through an epoch.
To Reproduce
I logged whenever the batch size deviated from 64. It happens in every epoch, and also when training on a single GPU.
Code sample
Unfortunately I can't share the code, but I will share as much as I can, and I can run many experiments.
I'm launching the training with

```
torchrun --standalone --nnodes=1 --nproc-per-node=8 main.py
```

I build the data pipeline with

```python
sets = [StreamingDataset(a), StreamingDataset(b)]
DataLoader(CombinedStreamingDataset(datasets=sets))
```

and launch the training through `trainer.fit`, with `drop_last=True`. A rough sketch approximating this setup on fake data is included below.
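The sketch is single-process only, so it mirrors just the dataset/dataloader side; the litdata `optimize` parameters, directory names, and sample shape are illustrative assumptions rather than the real pipeline:

```python
# Hypothetical standalone repro: two optimized datasets combined, drop_last=True,
# checking that every batch keeps the requested size.
import torch
from litdata import optimize, StreamingDataset, StreamingDataLoader, CombinedStreamingDataset


def make_sample(index):
    # Fixed-shape fake sample, so any batch-size change must come from the loader, not the data.
    return torch.randn(3, 32, 32)


if __name__ == "__main__":
    # Write two small optimized datasets to local temp directories (placeholder paths).
    for name in ("dataset_a", "dataset_b"):
        optimize(fn=make_sample, inputs=list(range(2_000)), output_dir=f"/tmp/{name}", chunk_bytes="64MB")

    sets = [StreamingDataset(f"/tmp/{name}") for name in ("dataset_a", "dataset_b")]
    combined = CombinedStreamingDataset(datasets=sets)
    dataloader = StreamingDataLoader(combined, batch_size=64, drop_last=True, num_workers=4)

    # Log any step where the batch size deviates from 64.
    for epoch in range(2):
        for step, batch in enumerate(dataloader):
            if batch.shape[0] != 64:
                print(f"epoch {epoch}, step {step}: batch size deviated to {batch.shape[0]}")
```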
Expected behavior
Fixed batch size throughout the epoch.
Environment
- NGC 23.05 container (Ubuntu 22.04, Python 3.10)
- NVIDIA CUDA 12.4.1
- NVIDIA cuBLAS 12.4.5.8
- NVIDIA cuDNN 9.1.0.70
- NVIDIA NCCL 2.21.5
- lightning==2.3.0
- litdata==0.2.12
- 8 x H100