Implementation of Early Stopping for DDP training #8345
Conversation
This edit correctly uses the broadcast_object_list() function to send the slave processes a boolean so that they end the training phase when the variable is True, thus allowing the master process to destroy the process group and terminate.
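The pattern the PR describes can be sketched as follows. This is a minimal illustration, not the PR's exact code: `broadcast_stop` is a hypothetical helper name, and the short-circuit for `RANK == -1` (plain single-GPU/CPU runs) is an assumption added here so the sketch also covers non-DDP training.

```python
import os

RANK = int(os.getenv("RANK", -1))  # -1 means a plain, non-DDP run

def broadcast_stop(stop: bool, rank: int = RANK) -> bool:
    """Share rank 0's early-stop decision with every process in the group."""
    if rank == -1:  # not running under DDP: nothing to synchronize
        return stop
    import torch.distributed as dist  # deferred so non-DDP runs need no process group
    # Only rank 0 knows the real value; the other ranks pass a placeholder.
    broadcast_list = [stop if rank == 0 else None]
    dist.broadcast_object_list(broadcast_list, 0)  # rank 0 -> all ranks
    return broadcast_list[0]  # now identical on every rank
```

Each rank would call this after rank 0 evaluates its early-stopping condition; once it returns `True`, every rank can break out of the epoch loop and the master can call `dist.destroy_process_group()`.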
👋 Hello @giacomoguiduzzi, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:
- ✅ Verify your PR is up-to-date with upstream/master. If your PR is behind upstream/master an automatic GitHub Actions merge may be attempted by writing /rebase in a new comment, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
# git checkout feature # <--- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f
- ✅ Verify all Continuous Integration (CI) checks are passing.
- ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee
@giacomoguiduzzi thanks for the PR! Is there any way to drop some or all of this code into the stopper() method itself? The class can access the global RANK variable by defining it in utils/torch_utils if required, i.e.: RANK = int(os.getenv('RANK', -1))
Hi @glenn-jocher, no problem! I think if you wanted to drop this code into the EarlyStopper class it could be done. Summing up, I think it could be possible, but it is necessary to broadcast the stop variable to the other processes as well. Let me know if you want me to look into it.
@giacomoguiduzzi I've cleaned up the PR a bit while maintaining the functionality, I think. Can you test on your side to verify that everything still works correctly? If it all looks good after your review I will proceed to merge. Thanks!
This cleans up the definition of broadcast_list and removes the requirement for clear() afterward.
@giacomoguiduzzi further cleaned up in 58bc763. I think this is ok but have not tested with DDP earlystopping.
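The cleanup described above replaces a reused list (which had to be clear()ed every epoch) with a fresh one-element list built per iteration. A runnable sketch of that merged pattern, assuming PyTorch is available — the single-process "gloo" group (world_size=1) is demo scaffolding added here so the broadcast can execute outside a real multi-GPU launch, and is not part of the PR:

```python
import os
import torch.distributed as dist

# Demo scaffolding: a single-process "gloo" group so the broadcast runs locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29555")
dist.init_process_group("gloo", rank=0, world_size=1)
RANK = dist.get_rank()

stop = True  # pretend rank 0's early-stopping condition just fired
# Fresh list each epoch: no clear() needed; non-zero ranks contribute a placeholder.
broadcast_list = [stop if RANK == 0 else None]
dist.broadcast_object_list(broadcast_list, 0)  # broadcast 'stop' to all ranks
if RANK != 0:
    stop = broadcast_list[0]

dist.destroy_process_group()
```

Because a new list is constructed every iteration, nothing persists between epochs that could hold a stale value.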
Hi @glenn-jocher, I've just tested your edits and the early stopping feature is working as intended. You're right, my code wasn't very pythonic...
@giacomoguiduzzi got it! PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
* Implementation of Early Stopping for DDP training

  This edit correctly uses the broadcast_object_list() function to send slave processes a boolean so to end the training phase if the variable is True, thus allowing the master process to destroy the process group and terminate.

* Update train.py
* Update train.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks — for more information, see https://pre-commit.ci
* Update train.py
* Update train.py
* Update train.py
* Further cleanup

  This cleans up the definition of broadcast_list and removes the requirement for clear() afterward.

Co-authored-by: Glenn Jocher <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Implementation of a unified early stopping mechanism for both single-GPU and DDP training in the YOLOv5 model.
📊 Key Changes
- stop flag alongside the existing EarlyStopping object for better early stop control.
- stop flag which is used consistently across the training loop.
- stop flag, removing previous early stop code for single-GPU and DDP.
- stop flag in DDP (Distributed Data Parallel) training to ensure all processes stop simultaneously.

🎯 Purpose & Impact
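The unified flow summarized above can be sketched like this. The `EarlyStopping` class here is a simplified, patience-based stand-in written for this example — not YOLOv5's actual implementation — and the toy fitness curve exists only to make the loop terminate:

```python
class EarlyStopping:
    """Simplified stand-in: stop when fitness hasn't improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.best_fitness = 0.0
        self.best_epoch = 0
        self.patience = patience

    def __call__(self, epoch, fitness):
        if fitness >= self.best_fitness:  # record any improvement
            self.best_epoch, self.best_fitness = epoch, fitness
        return (epoch - self.best_epoch) >= self.patience  # stop flag

RANK = -1  # single-GPU run here; under DDP this would come from the environment
stopper, stop = EarlyStopping(patience=3), False

for epoch in range(100):
    fitness = max(0.0, 0.5 - 0.01 * epoch)  # toy metric that never improves after epoch 0
    if RANK in (-1, 0):
        stop = stopper(epoch=epoch, fitness=fitness)
    # under DDP, rank 0 would broadcast `stop` to the other ranks here
    if stop:
        break  # every rank exits the epoch loop together
```

With `patience=3` and a metric that peaks at epoch 0, the loop breaks at epoch 3 — the same `stop` flag drives the exit whether or not DDP is active.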