Add IterableDataset.shard() #7252

lhoestq · 2024-10-25T11:07:12Z

This will be useful for distributing a dataset across workers in frameworks other than PyTorch, such as Spark.

I also renamed .n_shards -> .num_shards for consistency, and kept the old name for backward compatibility. A few internal functions were changed for consistency as well (rank, world_size -> num_shards, index).
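For readers unfamiliar with the pattern, here is a minimal sketch of how an old attribute name can stay available after a rename. The class `ShardedSource` is hypothetical and is not the library's class; whether the real alias emits a deprecation warning is not stated here, so this sketch does not emit one.

```python
class ShardedSource:
    """Hypothetical stand-in for a sharded dataset (illustration only)."""

    def __init__(self, num_shards: int) -> None:
        self._num_shards = num_shards

    @property
    def num_shards(self) -> int:
        # New, preferred name.
        return self._num_shards

    @property
    def n_shards(self) -> int:
        # Old name kept as a thin alias so existing callers keep working.
        return self.num_shards


source = ShardedSource(num_shards=4)
```

Both `source.num_shards` and `source.n_shards` return the same value, so code written against the old name does not need to change.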

Breaking change: the new default for contiguous in Dataset.shard() is True. IMO this is not a big deal, since I couldn't find any usage of contiguous=False internally (we always use contiguous=True for map-style datasets since it's more optimized) or in the wild.
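To make the difference between the two modes concrete, here is a rough pure-Python sketch of which row indices a given shard would receive. The helper `shard_indices` is hypothetical, and the way the remainder is spread over the first shards is an assumption for illustration, not a statement of the library's exact implementation.

```python
from typing import List


def shard_indices(num_rows: int, num_shards: int, index: int,
                  contiguous: bool = True) -> List[int]:
    """Return the row indices assigned to shard `index` (hypothetical helper)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    if contiguous:
        # Contiguous blocks; the first `mod` shards get one extra row (assumption).
        div, mod = divmod(num_rows, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Strided: every num_shards-th row, starting at `index`.
    return list(range(index, num_rows, num_shards))


contiguous_shard = shard_indices(10, 3, 0, contiguous=True)
strided_shard = shard_indices(10, 3, 0, contiguous=False)
```

With 10 rows and 3 shards, shard 0 covers rows 0 to 3 in contiguous mode and rows 0, 3, 6, 9 in strided mode. Code that relied on the old default and expected strided rows would need to pass contiguous=False explicitly.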

HuggingFaceDocBuilderDev · 2024-10-25T11:09:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq added 3 commits October 25, 2024 12:56

add IterableDataset.shard (and rename n_shards -> num_shards)

b4a98f4

docs

d700230

add test

f14caf6

lhoestq added 5 commits October 25, 2024 13:22

fix tests

4a959de

again

12d5197

again

cfef85e

again

de5bdd1

minor

04729eb

lhoestq merged commit 65f6eb5 into main Oct 25, 2024
15 checks passed

lhoestq deleted the iterable-shard branch October 25, 2024 15:45
