When training with DDP, the data is duplicated across ranks: even though _DistributedEnv reports different ranks when I check it, the same samples are fed to every process during training.
This happens with the Lightning Trainer, and I would like help resolving it.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
| Name | Type | Params | Mode
---------------------------------------------
0 | model | Sequential | 31 | train
---------------------------------------------
31 Trainable params
0 Non-trainable params
31 Total params
0.000 Total estimated model params size (MB)
/Users/user/litdata-env/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:122: Your `IterableDataset` has `__len__` defined. In combination with multi-process data loading (when num_workers > 1), `__len__` could be inaccurate if each worker is not configured independently to avoid having duplicate data.
I think _DistributedEnv is not properly injected in the __iter__ step of the Dataset.
It always reports global rank 0 there.
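Roughly how I checked this: a small diagnostic wrapper that prints what the dataset detects at iteration time. This is only a sketch; the litdata.utilities.env import path and the global_rank/world_size attribute names are assumptions based on litdata's source and may differ between versions.

```python
# Diagnostic sketch (assumed import path and attribute names) to see which
# distributed environment the dataset detects when __iter__ actually runs.
from litdata import StreamingDataset
from litdata.utilities.env import _DistributedEnv


class DebugStreamingDataset(StreamingDataset):
    def __iter__(self):
        env = _DistributedEnv.detect()
        # With 4 DDP processes this should print ranks 0..3, but in my runs
        # every process reports global_rank 0 at this point.
        print(f"[__iter__] global_rank={env.global_rank} world_size={env.world_size}")
        yield from super().__iter__()
```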
csy1204 changed the title from "Resolve data duplication issues in DDP with Lightning Trainer" to "Resolve same global rank in DDP with Lightning Trainer" on Jul 22, 2024
🐛 Bug
The same data is delivered to every DDP rank when training with the Lightning Trainer (full description and logs above).
To Reproduce
Steps to reproduce the behavior:
Code sample
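A minimal sketch of the kind of setup that shows the behavior (the ./my_data path, the model, and the hyperparameters are placeholders, not the original script; it assumes each optimized sample is a 1-D float tensor of length 2):

```python
# Placeholder reproduction sketch: litdata StreamingDataset + Lightning
# Trainer with 4 CPU DDP processes.
import lightning as L
import torch
from torch import nn
from litdata import StreamingDataset, StreamingDataLoader


class BoringModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 1))

    def training_step(self, batch, batch_idx):
        # Print which samples each rank receives; with correct sharding the
        # ranks should see disjoint samples, but here they all match.
        print(f"rank={self.global_rank} batch={batch}")
        return self.model(batch.float()).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = StreamingDataset("./my_data")  # placeholder optimized dataset
    loader = StreamingDataLoader(dataset, batch_size=4)
    trainer = L.Trainer(accelerator="cpu", devices=4, strategy="ddp", max_epochs=1)
    trainer.fit(BoringModel(), train_dataloaders=loader)
```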
Expected behavior
Each of the 4 DDP processes should receive a different, non-overlapping shard of the dataset rather than the same samples.
Environment
Install method (conda, pip, source): pip
Additional context