Super slow iteration with trivial custom transform #6833

xslittlegrass · 2024-04-23T20:40:59Z

Describe the bug

Dataset is 10X slower when applying trivial transforms:

import time
import numpy as np
from datasets import Dataset, Features, Array2D

a = np.zeros((800, 800))
a = np.stack([a] * 1000)
features = Features({"a": Array2D(shape=(800, 800), dtype="uint8")})

ds1 = Dataset.from_dict({"a": a}, features=features).with_format('numpy')

def transform(batch):
	return batch

ds2 = ds1.with_transform(transform)

%time sum(1 for _ in ds1)
%time sum(1 for _ in ds2)

CPU times: user 472 ms, sys: 319 ms, total: 791 ms
Wall time: 794 ms
CPU times: user 9.32 s, sys: 443 ms, total: 9.76 s
Wall time: 9.78 s

In my real code I'm using set_transform to apply some post-processing on-the-fly for the 2d array, but it significantly slows down the dataset even if the transform itself is trivial.

Related issue: #5841

Steps to reproduce the bug

Use code in the description to reproduce.

Expected behavior

Trivial custom transform in the example should not slowdown the dataset iteration.

Environment info

datasets version: 2.18.0
Platform: Linux-5.15.0-79-generic-x86_64-with-glibc2.35
Python version: 3.11.4
huggingface_hub version: 0.20.2
PyArrow version: 15.0.0
Pandas version: 1.5.3
fsspec version: 2023.12.2

The text was updated successfully, but these errors were encountered:

rangehow · 2024-04-27T14:49:28Z

Similar issue in text process

tokenizer=AutoTokenizer.from_pretrained(model_dir[args.model])
train_dataset=datasets.load_from_disk(dataset_dir[args.dataset],keep_in_memory=True)['train']
train_dataset=train_dataset.map(partial(dname2func[args.dataset],tokenizer=tokenizer),batched=True,num_proc =50,remove_columns=train_dataset.features.keys(),desc='tokenize',keep_in_memory=True)

After this train_dataset will be like

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 51760
})

In which input_ids and labels are both List[int]
However, per iter on dataset cost 7.412479639053345s ……？

for j in tqdm(range(len(train_dataset)),desc='first stage'):
    input_id,label=train_dataset['input_ids'][j],train_dataset['labels'][j]

lhoestq · 2024-05-04T11:24:35Z

The transform currently replaces the numpy formatting.

So you're back to copying data to long python lists which is super slow.

It would be cool for the transform to not remove the formatting in this case, but this requires a few changes in the lib

alex-hh · 2024-10-08T12:57:48Z

This also (somewhat surprisingly) affects iterable datasets, making map very challenging to use for data with large arrays, unless there is some workaround?

lhoestq · 2024-10-08T13:12:59Z

For iterable datasets you should be able to do this without slow downs

ds = ds.with_format("arrow").map(...)

I haven't tried with "numpy" though, maybe there is a step that does Arrow -> List -> NumPy instead of Arrow -> NumPy directly. If it's the case it would be cool to avoid that

alex-hh · 2024-10-08T13:48:58Z

Thanks! this works for me

However, it raises an error if batched=False and map batch_size isn't explicitly set to 1 due to map's default batch_size affecting the batch size of the RebatchedArrowExamplesIterable - is this a bug?

lhoestq · 2024-10-08T14:20:14Z

Thanks for the fix @alex-hh !

alex-hh · 2024-10-08T15:41:17Z

opened a new issue for the numpy slowdown #7206

alex-hh mentioned this issue Oct 8, 2024

Slow iteration over Torch tensors #5864

Open

alex-hh mentioned this issue Oct 8, 2024

fix unbatched arrow map for iterable datasets #7204

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Super slow iteration with trivial custom transform #6833

Super slow iteration with trivial custom transform #6833

xslittlegrass commented Apr 23, 2024

rangehow commented Apr 27, 2024 •

edited

Loading

lhoestq commented May 4, 2024

alex-hh commented Oct 8, 2024 •

edited

Loading

lhoestq commented Oct 8, 2024

alex-hh commented Oct 8, 2024 •

edited

Loading

lhoestq commented Oct 8, 2024 •

edited

Loading

alex-hh commented Oct 8, 2024

Super slow iteration with trivial custom transform #6833

Super slow iteration with trivial custom transform #6833

Comments

xslittlegrass commented Apr 23, 2024

Describe the bug

Steps to reproduce the bug

Expected behavior

Environment info

rangehow commented Apr 27, 2024 • edited Loading

lhoestq commented May 4, 2024

alex-hh commented Oct 8, 2024 • edited Loading

lhoestq commented Oct 8, 2024

alex-hh commented Oct 8, 2024 • edited Loading

lhoestq commented Oct 8, 2024 • edited Loading

alex-hh commented Oct 8, 2024

rangehow commented Apr 27, 2024 •

edited

Loading

alex-hh commented Oct 8, 2024 •

edited

Loading

alex-hh commented Oct 8, 2024 •

edited

Loading

lhoestq commented Oct 8, 2024 •

edited

Loading