Super slow iteration with trivial custom transform #6833
Similar issue in text processing:

```python
tokenizer = AutoTokenizer.from_pretrained(model_dir[args.model])
train_dataset = datasets.load_from_disk(dataset_dir[args.dataset], keep_in_memory=True)['train']
train_dataset = train_dataset.map(
    partial(dname2func[args.dataset], tokenizer=tokenizer),
    batched=True,
    num_proc=50,
    remove_columns=train_dataset.features.keys(),
    desc='tokenize',
    keep_in_memory=True,
)
```

After this, `train_dataset` looks like:

```
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 51760
})
```

in which `input_ids` and `labels` are both `List[int]`. Iterating over it is then very slow:

```python
for j in tqdm(range(len(train_dataset)), desc='first stage'):
    input_id, label = train_dataset['input_ids'][j], train_dataset['labels'][j]
```
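An editor's note on the snippet above (an observation, not from the thread): indexing the column first (`train_dataset['input_ids'][j]`) typically materializes the whole column on every loop iteration, while indexing the row first (`train_dataset[j]['input_ids']`) fetches only one row. A toy stand-in class makes the access-order difference countable:

```python
# Toy stand-in for a columnar dataset, illustrating the access-order
# pitfall. This is NOT the Hugging Face Dataset class, just a sketch.
class ToyDataset:
    def __init__(self, columns):
        self.columns = columns          # dict: column name -> list of values
        self.column_builds = 0          # counts full-column materializations

    def __getitem__(self, key):
        if isinstance(key, str):        # column access: copies the whole column
            self.column_builds += 1
            return list(self.columns[key])
        # row access: builds only a single-row dict
        return {name: col[key] for name, col in self.columns.items()}

ds = ToyDataset({"input_ids": [[1, 2]] * 100, "labels": [[0, 1]] * 100})

# Column-first indexing inside the loop rebuilds the column every iteration.
for j in range(100):
    _ = ds["input_ids"][j]
assert ds.column_builds == 100

# Row-first indexing never materializes a full column.
ds.column_builds = 0
for j in range(100):
    _ = ds[j]["input_ids"]
assert ds.column_builds == 0
```

With the real library the same reordering (row first, then column) avoids repeated column materialization, independent of the formatting issue discussed below.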
The transform currently replaces the numpy formatting, so you're back to copying data into long Python lists, which is super slow. It would be nice if the transform did not remove the formatting in this case, but that requires a few changes in the lib.
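A minimal illustration of the cost difference described above (an editor's sketch in plain NumPy, not the library's internals): keeping numpy formatting can return views over a contiguous buffer, while dropping it forces a copy into per-element Python objects.

```python
import numpy as np

# Stand-in for a column stored as one contiguous buffer (as Arrow does).
column = np.arange(1_000_000, dtype=np.int64)

# With numpy formatting: slicing returns a view, no per-element copies.
view = column[:100]
assert isinstance(view, np.ndarray)
assert view.base is column            # shares memory with the column

# Without formatting: data is copied into a Python list,
# creating one Python int object per element.
as_list = column[:100].tolist()
assert isinstance(as_list, list) and as_list[:3] == [0, 1, 2]
```

The list path is what makes a "trivial" transform expensive: the cost is not in the transform body but in the conversion around it.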
This also (somewhat surprisingly) affects iterable datasets, making `map` very challenging to use for data with large arrays. Is there some workaround?
For iterable datasets you should be able to do this without slowdowns: `ds = ds.with_format("arrow").map(...)`. I haven't tried with "numpy" though; maybe there is a step that does Arrow -> List -> NumPy instead of Arrow -> NumPy directly. If that's the case, it would be good to avoid it.
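A runnable sketch of this suggestion, assuming a recent `datasets` version where `IterableDataset` supports `with_format("arrow")` (the toy data and identity function are illustrative, not from the thread):

```python
# Sketch of the "arrow"-formatted map workaround; guarded so it still
# runs when the `datasets` library is not installed.
try:
    from datasets import Dataset
except ImportError:
    Dataset = None

if Dataset is not None:
    ds = Dataset.from_dict({"x": [[1, 2], [3, 4], [5, 6]]}).to_iterable_dataset()

    # With "arrow" formatting, map receives pyarrow Table batches directly,
    # avoiding an Arrow -> Python list round-trip per example.
    ds_arrow = ds.with_format("arrow").map(lambda table: table, batched=True)

    rows = list(ds_arrow)   # consume the iterable
```

The identity `lambda table: table` stands in for a real batched transform operating on Arrow data.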
Thanks! This works for me. However, it raises an error when `batched=False` and `map`'s `batch_size` isn't explicitly set to 1, because `map`'s default `batch_size` affects the batch size of the `RebatchedArrowExamplesIterable`. Is this a bug?
Thanks for the fix @alex-hh!
Opened a new issue for the NumPy slowdown: #7206
Describe the bug
Dataset iteration is about 10x slower when applying trivial transforms:
In my real code I'm using `set_transform` to apply some post-processing on the fly to the 2d array, but it significantly slows down the dataset even when the transform itself is trivial.
Related issue: #5841
Steps to reproduce the bug
Use code in the description to reproduce.
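The original repro snippet is not quoted in this excerpt; the following is a hedged sketch of the pattern the report describes (toy data sizes and the identity transform are illustrative):

```python
# Sketch of the reported pattern: even an identity set_transform slows
# iteration, because on-the-fly transforms replace the efficient
# formatting. Guarded so it runs without the `datasets` library.
try:
    from datasets import Dataset
except ImportError:
    Dataset = None

if Dataset is not None:
    data = {"x": [[[0.0] * 8] * 8 for _ in range(64)]}   # small 2d arrays
    ds = Dataset.from_dict(data)

    # Trivial transform: returns the batch unchanged.
    ds.set_transform(lambda batch: batch)

    for row in ds:          # every access now goes through the transform
        _ = row["x"]
```

Timing this loop against the same dataset without the transform (or with `with_format("numpy")`) is how the slowdown in the report would be observed.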
Expected behavior
The trivial custom transform in the example should not slow down dataset iteration.
Environment info
- `datasets` version: 2.18.0
- `huggingface_hub` version: 0.20.2
- `fsspec` version: 2023.12.2