
Absurdly slow on iteration #5841

Closed · fecet opened this issue May 11, 2023 · 4 comments

fecet commented May 11, 2023

Describe the bug

I am attempting to iterate through an image dataset, but iteration is significantly slower than expected. To investigate, I ran the following experiment:

import torch
from datasets import Dataset
from tqdm import tqdm

# Build a dataset of 10,000 identical (100, 224) float tensors
a = torch.randn(100, 224)
a = torch.stack([a] * 10000)

ds = Dataset.from_dict({"tensor": a})

# Iterating in numpy format is much faster than in torch format
for i in tqdm(ds.with_format("numpy")):
    pass

for i in tqdm(ds.with_format("torch")):
    pass

I noticed that the dataset in numpy format iterates significantly faster than the one in torch format. My hypothesis is that, in the background, each item undergoes a torch -> python -> numpy (torch) conversion, which causes the slowdown. Is there any way to speed this up by bypassing those conversions?
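
One possible workaround, sketched here as an assumption rather than a confirmed fix: iterate in numpy format and convert each item yourself with torch.from_numpy, which shares memory with the numpy array instead of copying it, so the per-item conversion cost stays small.

import torch
from tqdm import tqdm

# Iterate with the faster numpy formatting, then wrap each array in a
# torch tensor manually (torch.from_numpy is zero-copy)
for item in tqdm(ds.with_format("numpy")):
    tensor = torch.from_numpy(item["tensor"])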

Furthermore, if I increase the size of a to an image shape, like:

a=torch.randn(3,224,224)

the iteration speed becomes absurdly slow, around 100 iterations per second, whereas the numpy format reaches approximately 250 iterations per second. That would be unacceptable for large image datasets: iterating through a single epoch could take several hours.

Steps to reproduce the bug

Run the snippet from the description above.

Expected behavior

Iteration in torch format should be roughly as fast as in numpy format.

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0

lhoestq (Member) commented May 15, 2023

Hi! You can try using the Image type, which decodes images on the fly into pytorch tensors :)

from datasets import Dataset, Features, Image

ds = Dataset.from_dict({"tensor": a}).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 5.04 s, sys: 96.5 ms, total: 5.14 s
# Wall time: 5.14 s
# 10000

# Same data, stored with the Image feature type instead
features = Features({"tensor": Image()})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 1.86 s, sys: 49 ms, total: 1.91 s
# Wall time: 1.9 s
# 10000

-> Speed x2.7

And if you want to keep using plain arrays, consider the Array2D or Array3D types, which are even faster (they skip image decoding entirely):

from datasets import Array2D

features = Features({"tensor": Array2D(shape=(100, 224), dtype="float32")})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 828 ms, sys: 68.4 ms, total: 896 ms
# Wall time: 897 ms
# 10000

-> Speed x5.7

Batching also speeds things up a lot:

from torch.utils.data import DataLoader
dl = DataLoader(ds, batch_size=100)
%time sum(1 for _ in dl)
# CPU times: user 564 ms, sys: 83.5 ms, total: 648 ms
# Wall time: 579 ms
# 100

-> Speed x8.9

%time sum(1 for _ in ds.iter(batch_size=100))
# CPU times: user 119 ms, sys: 96.8 ms, total: 215 ms
# Wall time: 117 ms
# 100

-> Speed x46

lhoestq (Member) commented May 15, 2023

Anyway, regarding the speed difference between numpy and pytorch: I think the issue is that we first convert the numpy sub-arrays to pytorch tensors and then consolidate them into one tensor, while we should do the opposite. Converting a numpy array to pytorch has a fixed per-call cost, which seems to cause the slowdown. The current pipeline is

arrow -> nested numpy arrays -> lists of torch tensors -> one torch tensor

and we should do

arrow -> nested numpy arrays -> one numpy array -> one torch tensor
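
A minimal sketch (not from the original comment) illustrating that fixed per-call cost: converting 10,000 sub-arrays to tensors one by one and then stacking is much slower than consolidating in numpy first and converting once.

import time
import numpy as np
import torch

rows = [np.random.randn(224).astype(np.float32) for _ in range(10000)]

# Current pipeline: one torch conversion per sub-array, then one stack
start = time.perf_counter()
t1 = torch.stack([torch.from_numpy(r) for r in rows])
print(f"convert then stack: {time.perf_counter() - start:.4f}s")

# Proposed pipeline: consolidate in numpy first, convert to torch once
start = time.perf_counter()
t2 = torch.from_numpy(np.stack(rows))
print(f"stack then convert: {time.perf_counter() - start:.4f}s")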

crisostomi commented May 15, 2023

I have a similar issue: iterating over a dataset takes ~5 s without applying any transform, but ~30 s after applying one.
Here is minimal code to reproduce the problem:

import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, Array3D, Image, Features
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision 
from torchvision.transforms import ToTensor, Normalize


#################################
# Without transform
#################################
    
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])

train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)

for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass


#################################
# With transform
#################################

transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])


train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)


for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass 

I have also tried converting the Image column to an Array3D:

img_shape = train_dataset[0]["img"].shape

features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")

train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
# Note: cast_column returns a new dataset, so the result must be reassigned
train_dataset = train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])

but to no avail. Any clue?

fecet (Author) commented May 15, 2023

Thanks! I converted my dataset feature to Array3D and the speed became awesome!
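
For reference, a minimal sketch of that conversion for image-shaped data, following lhoestq's Array2D example above (the shape and column name here are illustrative, not taken from fecet's actual dataset):

from datasets import Array3D, Dataset, Features

# Declaring the column as a fixed-shape Array3D lets iteration skip the
# slow per-item decoding path
features = Features({"tensor": Array3D(shape=(3, 224, 224), dtype="float32")})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")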
