Slow iteration over Torch tensors #5864

crisostomi · 2023-05-15T16:43:58Z

Describe the bug

I have a problem related to this issue: iterating with a Torch DataLoader is much slower when I first apply a ToTensor transform to the input than when I use the vanilla NumPy arrays. In particular, it takes 5 seconds to iterate over the vanilla input and ~30s after the transformation.

Steps to reproduce the bug

Here is the minimal code to reproduce the problem:

import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, Array3D, Image, Features
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision 
from torchvision.transforms import ToTensor, Normalize


#################################
# Without transform
#################################
    
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])

train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)

for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass


#################################
# With transform
#################################

transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])


train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)


for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass

I have also tried converting the Image column to an Array3D:

img_shape = train_dataset[0]["img"].shape

features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")

train_dataset = train_dataset.map(
    desc=f"Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])

but to no avail. Any clue?

Expected behavior

The iteration should take approximately the same time with or without the transformation, as it doesn't change the shape of the input. What may be the issue here?
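One thing that does change even though the number of values per image stays the same is the dtype: `ToTensor` produces float32 values in channel-first layout, whereas a decoded image is uint8. The arithmetic below is only a rough sketch of the raw per-sample size under that assumption (32x32 RGB images, 50,000 training samples); whether this explains the slowdown is not confirmed anywhere in this thread.

```python
# Rough size arithmetic. Assumptions (not measured in this thread):
# 32x32 RGB images, 50,000 training samples, 1 byte per uint8 value
# and 4 bytes per float32 value.
HEIGHT, WIDTH, CHANNELS = 32, 32, 3
NUM_TRAIN = 50_000

values_per_image = HEIGHT * WIDTH * CHANNELS

uint8_bytes = values_per_image * 1      # decoded image as uint8
float32_bytes = values_per_image * 4    # same values after ToTensor (float32)
ratio = float32_bytes // uint8_bytes

print(f"uint8 per image:   {uint8_bytes} bytes")
print(f"float32 per image: {float32_bytes} bytes")
print(f"ratio:             {ratio}x")
print(f"raw train split:   {NUM_TRAIN * uint8_bytes / 1e6:.1f} MB vs "
      f"{NUM_TRAIN * float32_bytes / 1e6:.1f} MB")
```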

Environment info

- `datasets` version: 2.12.0
- Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- PyArrow version: 12.0.0
- Pandas version: 2.0.1

fecet · 2023-05-15T18:13:56Z

I am very interested in the performance of datasets, so I ran your example as a curious user.

train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))

returns a new dataset (it does not modify train_dataset in place), and "x" is a new column, so it should be

ds = train_dataset.cast_column("img", Array3D(shape=(3,32,32), dtype="float32"))

I rewrote your example as

train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
ds = train_dataset.cast_column("img", Array3D(shape=(3,32,32), dtype="float32"))
for i in tqdm(ds):
    pass

which requires ~11s in my environment, while

ds = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)

for i in tqdm(ds):
    pass

only needs ~6s. (So I guess it's still undesirable.)
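For comparing the two loops under the same conditions, a small timing helper may be more convenient than reading the numbers off tqdm. This is only a minimal sketch: `time_iteration` is a hypothetical helper (not part of `datasets` or `torch`), and the `range` below is stand-in data so the snippet runs without downloading anything.

```python
import time
from typing import Iterable


def time_iteration(iterable: Iterable, label: str = "iteration") -> tuple[int, float]:
    """Consume `iterable` once and return (item_count, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
    elapsed = time.perf_counter() - start
    print(f"{label}: {count} items in {elapsed:.3f}s")
    return count, elapsed


# Stand-in data (an assumption for illustration); swap in the dataset
# or DataLoader objects from the examples above to reproduce the comparison.
baseline = range(1000)
count, elapsed = time_iteration(baseline, label="baseline")
```

Calling it once on the untransformed dataset and once on `ds` gives two directly comparable numbers.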

alex-hh · 2024-10-08T10:21:47Z

perhaps related to #6833
