generator seed fix for DDP mAP drop #9545
Conversation
👋 Hello @Forever518, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:
- ✅ Verify your PR is up-to-date with the ultralytics/yolov5 `master` branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running `git pull` and `git merge master` locally.
- ✅ Verify all YOLOv5 Continuous Integration (CI) checks are passing.
- ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee
@Forever518 thanks for the PR! A DDP mAP drop sounds like a serious problem. Are you sure you're able to reproduce it with the latest torch and YOLOv5 on a common dataset like COCO128 or VOC?
@glenn-jocher Sure, I trained YOLOv5s using PyTorch 1.12.1+cu116 and the latest YOLOv5 on the COCO dataset. The chart clearly shows that with DDP mode on, the mAP drops from the beginning, and the DDP result after 300 epochs is about 2% lower than both the official result and my fixed version. It's really a strange problem because I didn't see any drop in your wandb runs with multi-GPU. COCO128 is too small for DDP training, and due to my network I have not tested the VOC dataset yet.

```bash
python -m torch.distributed.run --nproc_per_node 4 train.py --data data/coco.yaml --img 640 --epoch 300 --device 0,1,2,3 --hyp data/hyps/hyp.scratch-low.yaml --cfg models/yolov5s.yaml --weights '' --batch 128 --project yolov5-ddp-seed --name repro-1
```
@glenn-jocher I found another bug when trying to change the `act` field of `Conv`:

```python
import torch.nn as nn

def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        print(self.act)
        self.act = nn.Identity()
        print(self.act)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))


if __name__ == '__main__':
    c = Conv(3, 64)
```
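For context, here is a minimal standalone sketch (my own illustration, not code from this PR) of the `nn.Module` behaviour that the snippet above runs into: `nn.Module.__setattr__` stores sub-module assignments in `_modules` rather than the instance `__dict__`, so a class attribute named `act` keeps winning ordinary attribute lookup, which appears to be why the class-level default was later renamed to `default_act`.

```python
import torch.nn as nn


class Demo(nn.Module):
    act = nn.SiLU()  # class-level default, like Conv.act above

    def __init__(self):
        super().__init__()
        self.act = nn.Identity()  # nn.Module registers this in self._modules['act'], not in __dict__


d = Demo()
print(d.act)              # expected: SiLU() -- plain attribute lookup still finds the class attribute
print(d._modules['act'])  # expected: Identity() -- the per-instance assignment lives here
```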
@Forever518 ok, I ran an experiment and am able to reproduce the mAP drop on COCO. I'll go ahead and merge this fix, except what's the reason for the large number on seed init? Could this also work with just RANK?

```diff
- generator.manual_seed(6148914691236517205 + RANK)
+ generator.manual_seed(RANK)  # can we use this instead?
```
@glenn-jocher The PyTorch docs suggest using a large number with balanced 0 and 1 bits as the seed; see the generator `manual_seed` usage. I just set it to the number whose hex representation is 0x5555555555555555. Actually the result on my machine still cannot reach 37.4 with the latest version even after this fix; it is a bit less than 0.4 lower than the official benchmark... I am afraid something about the training settings is still missing, just as you mentioned in PR #8602.
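For reference, a minimal sketch of how a per-RANK generator seed of this form is typically wired into a PyTorch DataLoader (illustrative only; `make_loader` and the worker-seeding helper are assumptions, not necessarily the exact YOLOv5 code):

```python
import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

RANK = int(os.getenv('RANK', -1))


def seed_worker(worker_id):
    # derive NumPy/random seeds for each worker from its torch seed
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(dataset, batch_size, num_workers=4):
    # 0x5555555555555555 has balanced 0/1 bits; adding RANK makes the seed unique per process
    generator = torch.Generator()
    generator.manual_seed(6148914691236517205 + RANK)
    return DataLoader(dataset,
                      batch_size=batch_size,
                      num_workers=num_workers,
                      worker_init_fn=seed_worker,
                      generator=generator)
```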
@Forever518 ah, if you are trying to reproduce the official results then you should download the COCO-segments dataset, i.e.:

Also note YOLOv5s results are on a single GPU; DDP may reduce mAP slightly.
@Forever518 I'm going to merge, but first I need to expand this solution to all 3 dataloaders with generators: detection, classification and segmentation.
Haha, maybe a framework is needed to support the 3 tasks at the same time, instead of always changing the code 3 times. 🤣
@Forever518 yes definitely. This triple maintenance is not optimal. @AyushExel @Laughing-q this should resolve generator seed issues across tasks, and may lead to DDP reproducibility.
@Forever518 PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐. Can you verify that the DDP mAP drop issue is now resolved in master?
@glenn-jocher OK, I will try it later. Due to computing resource constraints it may take some time...
@Forever518 great, thank you!
@Forever518 @AyushExel @Laughing-q can confirm that the DetectionModel DDP low-mAP training bug is now fixed, and training is also reproducible across DDP runs (first time ever, awesome!). Tested with 4x A100 on YOLOv5m6 COCO twice vs baseline.
Awesome!!! 🚀
The mAP drops severely in DDP training mode. The chart below shows the training results for a single GPU without DDP and for 4-GPU DDP.
I logged the random seed of each RANK and its workers. It's obvious that the current version generates the same seed for every RANK, while the seeds do differ among the workers within a RANK. I guess this may affect the data augmentation, so the mAP drops.
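For example, the per-RANK and per-worker seeds can be inspected with a small `worker_init_fn` like the hypothetical one below (illustrative only, not the logging code actually used here):

```python
import os

import torch

RANK = int(os.getenv('RANK', -1))


def log_worker_seed(worker_id):
    # per the PyTorch data-loading docs, each worker's torch seed is the loader's base seed + worker_id
    print(f'RANK {RANK} worker {worker_id} seed {torch.initial_seed()}')
```

Passing `worker_init_fn=log_worker_seed` to each DataLoader makes it easy to see whether different RANKs start from the same base seed.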
Following the torch guide for `torch.Generator.manual_seed` (link is here), I set the seed to a number with balanced 0 and 1 bits that is also tied to RANK. So the seeds of each RANK are now different, and the mAP of DDP training rises back to a "normal" level. You can see the result above or here. The remaining questions are:
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Enhanced consistency in activation layers and improved data loader seeding for distributed training in the YOLOv5 model.
📊 Key Changes
- Renamed the default activation attribute of `Conv` to be `default_act` for clearer code interpretation.
- Updated `yolo.py` to match the new class variable name.
- Incorporated the `RANK` environment variable to ensure unique random seed generation in distributed settings for `DataLoader` and `InfiniteDataLoader`.

🎯 Purpose & Impact
- Renaming the `Conv` activation variable improves code readability and maintainability, making it easier for developers to understand the default behavior.
- By incorporating the `RANK` environment variable, the change ensures that workers in different distributed training processes have varying random seeds, thereby enhancing reproducibility and reducing potential data overlap during training.
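As a rough sketch of what the rename looks like in practice (simplified and assuming the rest of `Conv` is unchanged; not an exact copy of the repository code), the shared default now lives in `default_act` and no longer clashes with the per-instance `self.act`:

```python
import torch.nn as nn


class Conv(nn.Module):
    # renamed class attribute: the shared default activation
    default_act = nn.SiLU()

    def __init__(self, c1, c2, k=1, s=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # per-instance activation: the default, a user-supplied module, or none
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```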