[Feature]: Support mmclassification with MLU backend #1159

Merged
merged 1 commit into from
Nov 15, 2022

Conversation

Qiza-lyhm

Motivation

Support training and distributed training on MLU devices.

Modification

Add MLUDataParallel and MLUDistributedDataParallel support to the model wrapper.

Modify sync_random_seed to use auto_select_device.
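
As a rough illustration of the wrapper change described above, the sketch below shows one way a device string could be mapped to a data-parallel wrapper. It is not the code from this PR: `select_wrapper` and the two lookup tables are hypothetical names, the CUDA entries are assumed for comparison, and the wrappers are represented by their names only so that the example runs without PyTorch or MMCV installed.

```python
# Minimal sketch: choose a data-parallel wrapper from a device string.
# Wrappers are recorded as names only; nothing from PyTorch or MMCV is
# imported, so this shows the dispatch logic and nothing more.

# Assumed mappings for illustration (device string -> wrapper class name).
NON_DIST_WRAPPERS = {
    'cuda': 'MMDataParallel',
    'mlu': 'MLUDataParallel',
}
DIST_WRAPPERS = {
    'cuda': 'MMDistributedDataParallel',
    'mlu': 'MLUDistributedDataParallel',
}


def select_wrapper(device, distributed=False):
    """Return the name of the wrapper to use for `device`."""
    table = DIST_WRAPPERS if distributed else NON_DIST_WRAPPERS
    if device not in table:
        # Fail loudly instead of silently falling back to another backend.
        raise RuntimeError(f'Unavailable device "{device}"')
    return table[device]
```

Calling `select_wrapper('mlu', distributed=True)` returns the distributed MLU wrapper name, while an unknown device string raises an error instead of silently falling back.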

BC-breaking (Optional)

MLU training is supported only with MMCV>=1.6.0.
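
A version requirement like this can be checked before MLU training is enabled. The sketch below is only an assumption about how such a guard might look: `version_tuple` and `supports_mlu` are hypothetical helpers, and the parser deliberately ignores pre-release suffixes such as `rc1`.

```python
import re


def version_tuple(version):
    """Turn a version string such as '1.6.0rc1' into (1, 6, 0).

    Only the leading digits of the first three components are kept,
    so pre-release suffixes are ignored in this simplified sketch.
    """
    numbers = []
    for part in version.split('.')[:3]:
        match = re.match(r'\d+', part)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def supports_mlu(mmcv_version, minimum='1.6.0'):
    """Return True if `mmcv_version` meets the minimum for MLU training."""
    return version_tuple(mmcv_version) >= version_tuple(minimum)
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating `1.10.0` as older than `1.6.0`.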

Use cases (Optional)

The device used for training is selected automatically.

Tested with ResNet18 on ImageNet.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects, like MMDet or MMSeg.
  • CLA has been signed and all committers have signed the CLA in this PR.

  * Training on MLU is available
@mzr1996
Member

mzr1996 commented Nov 2, 2022

Hello, please take a look at PR #1072. The latest mmcls already handles the device used when synchronizing the seed:
https://github.com/open-mmlab/mmclassification/blob/8c63bb55a57866a6f7762c96fecb4619745e219e/mmcls/datasets/samplers/distributed_sampler.py#L34

@Qiza-lyhm Qiza-lyhm changed the title from "(feat): Support MLU backend" to "[Feature]: Support mmclassification with MLU backend" on Nov 2, 2022
@Qiza-lyhm
Author

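Under dataset/samplers there are two files, distributed_sampler.py and repeat_aug.py, and both need the auto-select-device mechanism.

In this PR I put the auto-select-device mechanism inside the corresponding interface so that the external interface stays unchanged.
The main change is that parameters whose default was previously fixed to 'cuda' now default to None and fall back to auto_select_device. A device that the caller specifies explicitly still takes precedence.

The second approach is to leave the existing interface implementation unchanged, keep 'cuda' as the default, and change every sync-seed-related call to pass auto_select_device explicitly.

I would like to confirm: does mmcls recommend the second approach, that is, keeping the default value and changing the arguments passed in?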
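
To make the first approach concrete, here is a minimal sketch of a default of None that is resolved inside the function. It is not the mmcls implementation: `detect_device` is a hypothetical stand-in for auto_select_device, the probes are hard-coded to report no accelerator, and the function returns the resolved device alongside the seed purely so that the behavior is visible. The real function would also broadcast the seed from rank 0 to the other processes.

```python
import random

# Hypothetical availability probes, checked in priority order. A real
# implementation would query the MLU and CUDA runtimes instead.
PROBES = [
    ('mlu', lambda: False),
    ('cuda', lambda: False),
]


def detect_device(probes=PROBES):
    """Stand-in for auto_select_device: first available backend, else 'cpu'."""
    for name, is_available in probes:
        if is_available():
            return name
    return 'cpu'


def sync_random_seed(seed=None, device=None):
    """Approach 1: `device` defaults to None and is resolved internally.

    An explicitly passed device always wins over auto-detection.
    """
    if device is None:
        device = detect_device()
    if seed is None:
        seed = random.randint(0, 2**31 - 1)
    # Distributed broadcast of the seed is omitted in this sketch.
    return seed, device
```

With this shape, existing callers that pass `device='cuda'` keep their behavior, and callers that pass nothing pick up whichever backend is detected. The second approach would instead keep the `'cuda'` default and require every call site to pass the detected device itself.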

@mzr1996
Member

mzr1996 commented Nov 4, 2022

The first approach is probably more general. OK.

@codecov

codecov bot commented Nov 4, 2022

Codecov Report

Base: 86.02% // Head: 85.94% // Decreases project coverage by -0.08% ⚠️

Coverage data is based on head (8da52b5) compared to base (17ed870).
Patch coverage: 37.50% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1159      +/-   ##
==========================================
- Coverage   86.02%   85.94%   -0.09%     
==========================================
  Files         142      142              
  Lines        9906     9917      +11     
  Branches     1715     1719       +4     
==========================================
+ Hits         8522     8523       +1     
- Misses       1122     1131       +9     
- Partials      262      263       +1     
Flag        Coverage Δ
unittests   85.87% <37.50%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                                   Coverage Δ
mmcls/utils/distribution.py                      5.88% <0.00%> (-1.53%) ⬇️
mmcls/apis/train.py                              14.63% <50.00%> (-0.37%) ⬇️
mmcls/core/utils/dist_utils.py                   33.92% <75.00%> (+1.85%) ⬆️
mmcls/datasets/samplers/distributed_sampler.py   85.18% <100.00%> (-0.53%) ⬇️


@mzr1996 mzr1996 merged commit dc8691e into open-mmlab:dev Nov 15, 2022