
Fix uneven batches in distributed dataloading #237

Merged — awaelchli merged 68 commits into main from fix_uneven_number_of_batches2 on Jul 19, 2024

Conversation

awaelchli (Contributor) commented on Jul 16, 2024

Fixes #233

This PR changes and fixes how items are assigned to workers.
Before: chunks were first assigned to ranks, and samples within each rank were then assigned to its workers.
Now: samples are assigned directly across the combined world size of all ranks and workers.

This allows us to apply drop_last correctly and ensures that each rank returns the same amount of data. However, it also makes this PR a breaking change.
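
Conceptually, the new assignment works roughly like the sketch below (illustrative only; the function name and structure are hypothetical and not the actual litdata implementation). Samples are distributed round-robin over the flat list of num_ranks * num_workers consumer slots, and drop_last truncates every slot to the shortest assignment so all ranks yield the same number of samples.

    def assign_samples(sample_indices, num_ranks, num_workers, drop_last=True):
        # Combined world size: every (rank, worker) pair is one consumer slot.
        world_size = num_ranks * num_workers
        # Distribute samples round-robin across all slots at once,
        # instead of first splitting by rank and then by worker.
        per_slot = [sample_indices[slot::world_size] for slot in range(world_size)]
        if drop_last:
            # Truncate to the shortest slot so every rank returns the same amount of data.
            min_len = min(len(s) for s in per_slot)
            per_slot = [s[:min_len] for s in per_slot]
        # per_slot[rank * num_workers + worker] holds that worker's samples.
        return per_slot

    # Example: 10 samples, 2 ranks x 2 workers -> each slot keeps 2 samples with drop_last.
    # assign_samples(list(range(10)), 2, 2) == [[0, 4], [1, 5], [2, 6], [3, 7]]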

IMPORTANT:
This changes the order in which samples are batched and returned. A consequence is that resuming from checkpoints created before this PR will not restore correctly.

TODOs

  • Resuming logic
  • Chunk deletion logic
  • Apply fix for no-shuffle
  • Tests

awaelchli force-pushed the fix_uneven_number_of_batches2 branch from a7e425c to 265c4e9 on July 16, 2024 14:55
awaelchli force-pushed the fix_uneven_number_of_batches2 branch from ebc7e3b to b0096c5 on July 16, 2024 16:27
src/litdata/streaming/shuffle.py — 2 review threads resolved
tests/streaming/test_dataset.py — review thread resolved (outdated)
awaelchli marked this pull request as ready for review on July 19, 2024 15:49
tests/streaming/test_dataset.py — 4 review threads resolved
tests/streaming/test_combined.py — review thread resolved
tests/streaming/test_dataloader.py — review thread resolved
 chunk_size=190,
-num_workers=4,
+num_workers=1,  # TODO: Want 4 here, but optimize() has deletion race condition
awaelchli (Contributor, Author) commented:

It looks like everywhere in the tests we use num_workers=1. Here I wanted 4, but there seem to be race conditions (?) in the copying/deletion of chunks, which causes this test to fail because of missing chunks.

awaelchli (Contributor, Author) commented:

__________________ test_dataset_resume_on_future_chunks[True] __________________

shuffle = True
tmpdir = local('/tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0')
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f6a4124f460>

    @pytest.mark.skipif(sys.platform == "win32", reason="Not tested on windows and MacOs")
    @mock.patch.dict(os.environ, {}, clear=True)
    @pytest.mark.timeout(60)
    @pytest.mark.parametrize("shuffle", [True, False])
    def test_dataset_resume_on_future_chunks(shuffle, tmpdir, monkeypatch):
        """This test is constructed to test resuming from a chunk past the first chunk, when subsequent chunks don't have
        the same size."""
        s3_cache_dir = str(tmpdir / "s3cache")
        optimize_data_cache_dir = str(tmpdir / "optimize_data_cache")
        optimize_cache_dir = str(tmpdir / "optimize_cache")
        data_dir = str(tmpdir / "optimized")
        monkeypatch.setenv("DATA_OPTIMIZER_DATA_CACHE_FOLDER", optimize_data_cache_dir)
        monkeypatch.setenv("DATA_OPTIMIZER_CACHE_FOLDER", optimize_cache_dir)
    
>       optimize(
            fn=_simple_preprocess,
            inputs=list(range(8)),
            output_dir=data_dir,
            chunk_size=190,
            num_workers=4,
            num_uploaders=1,
copying /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-3-1.bin to /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimized/chunk-3-1.bin
putting /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-3-1.bin on the remove queue
Worker 1 is done.
Worker 2 is done.
Worker 3 is done.
Worker 0 is done.
Workers are finished.
----------------------------- Captured stderr call -----------------------------


Progress:   0%|          | 0/8 [00:00<?, ?it/s]
Process Process-85:1:
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/runner/work/litdata/litdata/src/litdata/processing/data_processor.py", line 259, in _upload_fn
    shutil.copy(local_filepath, output_filepath)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/shutil.py", line 427, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/shutil.py", line 264, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-0-0.bin'

Progress: 100%|██████████| 8/8 [00:00<00:00, 122.77it/s]
=========================== short test summary info ============================
FAILED tests/streaming/test_dataset.py::test_dataset_resume_on_future_chunks[True] - RuntimeError: All the chunks should have been deleted. Found ['chunk-0-1.bin']
====== 1 failed, 191 passed, 8 skipped, 11 warnings in 247.94s (0:04:07) =======

awaelchli requested a review from tchaton on July 19, 2024 15:50
src/litdata/utilities/shuffle.py — review thread resolved (outdated)
src/litdata/utilities/shuffle.py — 2 review threads resolved
tests/streaming/test_combined.py — review thread resolved
awaelchli force-pushed the fix_uneven_number_of_batches2 branch from 8350fc2 to bc64b77 on July 19, 2024 16:28
awaelchli force-pushed the fix_uneven_number_of_batches2 branch from 6fc442e to 66017e8 on July 19, 2024 17:36
awaelchli enabled auto-merge (squash) on July 19, 2024 17:44
awaelchli merged commit c58b673 into main on Jul 19, 2024
26 checks passed
awaelchli deleted the fix_uneven_number_of_batches2 branch on July 19, 2024 17:57
tchaton (Collaborator) commented on Jul 19, 2024

Awesome work @awaelchli!

Labels: bug — Something isn't working
Projects: None yet
Development
Successfully merging this pull request may close these issues:
Uneven number of batches returned across ranks in StreamingDataset/DataLoader
2 participants