Absurdly slow on iteration #5841
Hi! You can try to use the `Image` type, which decodes images on-the-fly into pytorch tensors :)

```python
ds = Dataset.from_dict({"tensor": a}).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 5.04 s, sys: 96.5 ms, total: 5.14 s
# Wall time: 5.14 s
# 10000
```

```python
features = Features({"tensor": Image()})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 1.86 s, sys: 49 ms, total: 1.91 s
# Wall time: 1.9 s
# 10000 -> Speed x2.7
```

And if you want to keep using arrays of integers, consider using the `Array2D` or `Array3D` types, which are even faster (since they don't decode images):

```python
features = Features({"tensor": Array2D(shape=(100, 224), dtype="float32")})
ds = Dataset.from_dict({"tensor": a}, features=features).with_format("torch")
%time sum(1 for _ in ds)
# CPU times: user 828 ms, sys: 68.4 ms, total: 896 ms
# Wall time: 897 ms
# 10000 -> Speed x5.7
```

Batching also speeds things up a lot:

```python
from torch.utils.data import DataLoader
dl = DataLoader(ds, batch_size=100)
%time sum(1 for _ in dl)
# CPU times: user 564 ms, sys: 83.5 ms, total: 648 ms
# Wall time: 579 ms
# 100 -> Speed x8.9
```

```python
%time sum(1 for _ in ds.iter(batch_size=100))
# CPU times: user 119 ms, sys: 96.8 ms, total: 215 ms
# Wall time: 117 ms
# 100 -> Speed x46
```
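The variable `a` above comes from the issue reporter's snippet, which didn't survive in this export; judging from the counts it holds about 10,000 arrays of shape (100, 224). The batched speedups are largely about amortizing Python-level per-item overhead, which can be seen with numpy alone (a hypothetical, scaled-down sketch — no `datasets` required):

```python
import numpy as np

# Stand-in data: 1,000 arrays of shape (100, 224), scaled down from the
# 10,000-example benchmarks above.
data = np.zeros((1000, 100, 224), dtype=np.float32)

# Per-item iteration pays Python-level overhead once per example ...
n_items = sum(1 for _ in data)

# ... while batched iteration pays it once per batch of 100.
n_batched = sum(len(batch) for batch in np.array_split(data, 10))

assert n_items == n_batched == 1000
```

The counts are identical either way; only the number of Python-level iterations (and hence the constant overhead) differs.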
Anyway, regarding the speed difference between numpy and pytorch, I think the issue is that we first convert the numpy sub-arrays to pytorch tensors and then consolidate them into one tensor, whereas we should do the opposite. Converting a numpy array to pytorch has a fixed cost that seems to cause the slowdown. The current pipeline converts each sub-array to pytorch and then consolidates; instead, we should consolidate the sub-arrays in numpy first and convert the result to pytorch once.
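The two pipeline orders can be illustrated without pytorch (a numpy-only analogue; the torch conversion is assumed to behave like any per-call conversion with a fixed cost):

```python
import numpy as np

# 200 sub-arrays of shape (100, 224), as in the examples in this thread.
a = [np.ones((100, 224), dtype=np.float32) for _ in range(200)]

# Current pipeline analogue: convert each sub-array individually, then
# consolidate -- N conversion calls means paying the fixed per-call cost N times.
per_item = np.stack([np.asarray(x, dtype=np.float32) for x in a])

# Proposed pipeline analogue: consolidate first, then convert once.
consolidated = np.asarray(np.stack(a), dtype=np.float32)

# Both orders produce the same result; only the number of fixed-cost calls differs.
assert per_item.shape == consolidated.shape == (200, 100, 224)
assert np.array_equal(per_item, consolidated)
```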
I have a similar issue: iterating over a dataset takes 5 s without applying any transform, but ~30 s after applying a transform.

```python
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset, Array3D, Image, Features
from torch.utils.data import DataLoader
from tqdm import tqdm
import torchvision
from torchvision.transforms import ToTensor, Normalize

#################################
# Without transform
#################################
train_dataset = load_dataset(
    'cifar100',
    split='train',
    use_auth_token=True,
)
train_dataset.set_format(type="numpy", columns=["img", "fine_label"])
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)
for batch in tqdm(train_loader, desc="Loading data, no transform"):
    pass

#################################
# With transform
#################################
transform_func = torchvision.transforms.Compose([
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"img": transform_func(x["img"])},
)
train_dataset.set_format(type="numpy", columns=["img", "fine_label"])
train_loader = DataLoader(
    train_dataset,
    batch_size=100,
    pin_memory=False,
    shuffle=True,
    num_workers=8,
)
for batch in tqdm(train_loader, desc="Loading data after transform"):
    pass
```

I have also tried converting the `Image` column to an `Array3D`:

```python
img_shape = train_dataset[0]["img"].shape
features = train_dataset.features.copy()
features["x"] = Array3D(shape=img_shape, dtype="float32")
train_dataset = train_dataset.map(
    desc="Preprocessing samples",
    function=lambda x: {"x": np.array(x["img"], dtype=np.uint8)},
    features=features,
)
# cast_column returns a new dataset, so the result must be reassigned
train_dataset = train_dataset.cast_column("x", Array3D(shape=img_shape, dtype="float32"))
train_dataset.set_format(type="numpy", columns=["x", "fine_label"])
```

but to no avail. Any clue?
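One likely contributor to the post-`map` slowdown (an assumption on my part, not confirmed in this thread): `ToTensor` materializes float32 pixels, which take 4x the bytes of the original uint8 data, so every epoch has to read back and convert 4x more data. The size difference for a CIFAR-style 32x32x3 image:

```python
import numpy as np

img_u8 = np.zeros((32, 32, 3), dtype=np.uint8)  # as stored by the Image feature
img_f32 = img_u8.astype(np.float32) / 255.0     # what ToTensor-style preprocessing materializes

print(img_u8.nbytes)   # 3072 bytes
print(img_f32.nbytes)  # 12288 bytes: 4x more to store and read back every epoch
```

Keeping the stored column as uint8 and normalizing on the fly avoids paying that cost at rest.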
Thanks! I converted my dataset feature to `Array3D` and the speed became awesome!
Describe the bug
I am attempting to iterate through an image dataset, but I am encountering a significant slowdown in the iteration speed. In order to investigate this issue, I conducted the following experiment:
I noticed that the dataset in numpy format performs significantly faster than the one in torch format. My hypothesis is that the dataset undergoes a transformation process of torch->python->numpy(torch) in the background, which might be causing the slowdown. Is there any way to expedite the process by bypassing such transformations?
Furthermore, if I increase the size of `a` to an image shape, like:
the iteration speed becomes absurdly slow, around 100 iterations per second, whereas the speed with numpy format is approximately 250 iterations per second. This level of speed would be unacceptable for large image datasets, as it could take several hours just to iterate through a single epoch.
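The hypothesized torch->python->numpy round trip would indeed be expensive: going through Python objects forces an element-by-element conversion instead of one bulk copy. A numpy-only sketch of the round-trip cost (the torch side is assumed to behave similarly):

```python
import numpy as np

arr = np.arange(224 * 224, dtype=np.float32).reshape(224, 224)

# Direct path: a single bulk (or zero-copy) handoff between array libraries.
direct = np.asarray(arr)

# Round trip through Python objects: every element becomes a Python float first.
via_python = np.array(arr.tolist(), dtype=np.float32)

assert np.array_equal(direct, via_python)  # same values, very different cost
```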
Steps to reproduce the bug
Expected behavior
Iteration should be faster.
Environment info
- `datasets` version: 2.11.0