Slow iteration over Torch tensors #5864

Open
crisostomi opened this issue May 15, 2023 · 2 comments


Describe the bug

I have a problem related to this issue: iterating with a Torch DataLoader is much slower when I first apply a ToTensor transform to the input than when I use the vanilla NumPy arrays. In particular, it takes 5 seconds to iterate over the vanilla input and ~30s after the transformation.

Steps to reproduce the bug

Here is a minimal example that reproduces the problem:

import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, Array3D, Image, Features
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision 
from torchvision.transforms import ToTensor, Normalize


#################################
# Without transform
#################################
    
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])

train_loader= DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)

for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass


#################################
# With transform
#################################

transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
    
train_dataset = train_dataset.map(
    desc=f"Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])


train_loader= DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)


for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass 

I have also tried converting the Image column to an Array3D:

img_shape = train_dataset[0]["img"].shape

features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")

train_dataset = train_dataset.map(
    desc=f"Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])

but to no avail. Any clue?

Expected behavior

The iteration should take approximately the same time with or without the transformation, as it doesn't change the shape of the input. What may be the issue here?
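For comparing the two cases on equal footing, a small timing helper along these lines may be useful. This is only a sketch: `time_iteration` is a hypothetical name (not part of `datasets` or `torch`), and the `range` object stands in for the `DataLoader` or `Dataset` used in the report.

```python
import time
from typing import Iterable, Tuple


def time_iteration(iterable: Iterable) -> Tuple[int, float]:
    """Consume `iterable` and return (number of items, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Stand-in iterable; in the report this would be train_loader or the dataset.
    n_items, seconds = time_iteration(range(1000))
    print(f"{n_items} items in {seconds:.4f}s")
```

Calling the helper once on the loader built before the transform and once on the loader built after it gives two numbers that can be compared directly, without the progress-bar display in the loop.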

Environment info

- `datasets` version: 2.12.0
- Platform: Linux-5.4.0-137-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.14.1
- PyArrow version: 12.0.0
- Pandas version: 2.0.1

fecet commented May 15, 2023

I am highly interested in the performance of datasets, so I ran your example as a curious user.

train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))

returns a new dataset rather than modifying train_dataset in place, and "x" is a new column, so it should be

ds=train_dataset.cast_column("img", Array3D(shape=(3,32,32), dtype="float32"))

I rewrote your example as:

train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
    
train_dataset = train_dataset.map(
    desc=f"Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
ds=train_dataset.cast_column("img", Array3D(shape=(3,32,32), dtype="float32"))
for i in tqdm(ds):
    pass

which takes ~11s in my environment, while

ds = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)

for i in tqdm(ds):
    pass

only needs ~6s. (So I guess it's still undesirable.)

alex-hh (Contributor) commented Oct 8, 2024

Perhaps related to #6833.
