
Absurdly slow on iteration #5841

Closed · fecet opened this issue May 11, 2023 · 4 comments

fecet commented May 11, 2023

Describe the bug

I am attempting to iterate through an image dataset, but iteration is significantly slower than expected. To investigate, I ran the following experiment:

import torch
from datasets import Dataset
from tqdm import tqdm

# Build a dataset of 10,000 identical (100, 224) float tensors
a = torch.randn(100, 224)
a = torch.stack([a] * 10000)

ds = Dataset.from_dict({"tensor": a})

# Iterating in numpy format is much faster than in torch format
for i in tqdm(ds.with_format("numpy")):
    pass

for i in tqdm(ds.with_format("torch")):
    pass

I noticed that the dataset in numpy format iterates significantly faster than the one in torch format. My hypothesis is that, in the background, each item undergoes a torch -> python -> numpy (torch) conversion, which causes the slowdown. Is there any way to speed this up by bypassing those conversions?
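
One possible workaround, sketched here as an assumption rather than a confirmed fix: iterate in numpy format and convert each item yourself with torch.from_numpy, which shares memory with the numpy array instead of copying it, so the per-item conversion cost stays small.

import torch
from tqdm import tqdm

# Iterate with the faster numpy formatting, then wrap each array in a
# torch tensor manually (torch.from_numpy is zero-copy)
for item in tqdm(ds.with_format("numpy")):
    tensor = torch.from_numpy(item["tensor"])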

Furthermore, if I increase the size of a to an image shape, like:

a=torch.randn(3,224,224)

the iteration speed becomes absurdly slow, around 100 iterations per second, whereas the numpy format reaches approximately 250 iterations per second. That would be unacceptable for large image datasets: iterating through a single epoch could take several hours.

Steps to reproduce the bug

Run the snippet from the description above.

Expected behavior

Iteration in torch format should be roughly as fast as in numpy format.

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0

lhoestq (Member) commented May 15, 2023

Hi! You can try using the Image type, which decodes images on the fly into pytorch tensors :)

from datasets import Dataset, Features, Image

ds = Dataset.from_dict({"tensor": a}).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 5.04 s, sys: 96.5 ms, total: 5.14 s
# Wall time: 5.14 s
# 10000

# Same data, stored with the Image feature type instead
features = Features({"tensor": Image()})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 1.86 s, sys: 49 ms, total: 1.91 s
# Wall time: 1.9 s
# 10000

-> Speed x2.7

And if you want to keep using plain arrays, consider the Array2D or Array3D types, which are even faster (they skip image decoding entirely):

from datasets import Array2D

features = Features({"tensor": Array2D(shape=(100, 224), dtype="float32")})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 828 ms, sys: 68.4 ms, total: 896 ms
# Wall time: 897 ms
# 10000

-> Speed x5.7

Batching also speeds things up a lot:

from torch.utils.data import DataLoader
dl = DataLoader(ds, batch_size=100)
%time sum(1 for _ in dl)
# CPU times: user 564 ms, sys: 83.5 ms, total: 648 ms
# Wall time: 579 ms
# 100

-> Speed x8.9

%time sum(1 for _ in ds.iter(batch_size=100))
# CPU times: user 119 ms, sys: 96.8 ms, total: 215 ms
# Wall time: 117 ms
# 100

-> Speed x46

lhoestq (Member) commented May 15, 2023

Anyway, regarding the speed difference between numpy and pytorch: I think the issue is that we first convert the numpy sub-arrays to pytorch tensors and then consolidate them into one tensor, while we should do the opposite. Converting a numpy array to pytorch has a fixed per-call cost, which seems to cause the slowdown. The current pipeline is

arrow -> nested numpy arrays -> lists of torch tensors -> one torch tensor

and we should do

arrow -> nested numpy arrays -> one numpy array -> one torch tensor
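
A minimal sketch (not from the original comment) illustrating that fixed per-call cost: converting 10,000 sub-arrays to tensors one by one and then stacking is much slower than consolidating in numpy first and converting once.

import time
import numpy as np
import torch

rows = [np.random.randn(224).astype(np.float32) for _ in range(10000)]

# Current pipeline: one torch conversion per sub-array, then one stack
start = time.perf_counter()
t1 = torch.stack([torch.from_numpy(r) for r in rows])
print(f"convert then stack: {time.perf_counter() - start:.4f}s")

# Proposed pipeline: consolidate in numpy first, convert to torch once
start = time.perf_counter()
t2 = torch.from_numpy(np.stack(rows))
print(f"stack then convert: {time.perf_counter() - start:.4f}s")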

crisostomi commented May 15, 2023

I have a similar issue: iterating over a dataset takes ~5 s without applying any transform, but ~30 s after applying one.
Here is minimal code to reproduce the problem:

import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, Array3D, Image, Features
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision 
from torchvision.transforms import ToTensor, Normalize


#################################
# Without transform
#################################
    
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])

train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)

for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass


#################################
# With transform
#################################

transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)

train_dataset.set_format(type="numpy", columns=["img", "fine_label"])


train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)


for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass 

I have also tried converting the Image column to an Array3D:

img_shape = train_dataset[0]["img"].shape

features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")

train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
# Note: cast_column returns a new dataset, so the result must be reassigned
train_dataset = train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])

but to no avail. Any clue?

fecet (Author) commented May 15, 2023

Thanks! I converted my dataset feature to Array3D and the speed became awesome!
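
For reference, a minimal sketch of that conversion for image-shaped data, following lhoestq's Array2D example above (the shape and column name here are illustrative, not taken from fecet's actual dataset):

from datasets import Array3D, Dataset, Features

# Declaring the column as a fixed-shape Array3D lets iteration skip the
# slow per-item decoding path
features = Features({"tensor": Array3D(shape=(3, 224, 224), dtype="float32")})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")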
