[MXNET-535] Fix bugs in LR Schedulers and add warmup #11234
Conversation
python/mxnet/lr_scheduler.py (outdated)
@@ -153,18 +153,57 @@ class PolyScheduler(LRScheduler):
    """

    def __init__(self, max_update, base_lr=0.01, pwr=2):
Don't remove base_lr, it will break the API.
Pass it to the super init instead.
python/mxnet/lr_scheduler.py (outdated)
        if warmup_steps <= 0:
            raise ValueError("Warmup steps has to be positive")
        self.warmup_steps = warmup_steps
        self.lrs_updates = {}
what's the point of this cache? Looks like it will always miss
For each batch, __call__ is invoked once per learnable parameter array, so after the first call with a given num_update every subsequent call hits the cache.
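Below is a minimal sketch, assuming a hypothetical wrapper class, of how such a cache pays off; the attribute names (lrs_updates, lr_begin) follow the diff above, everything else is illustrative and not the PR's final code:

class WarmUpWrapper(object):
    def __init__(self, scheduler, warmup_steps, warmup_begin_lr=0.0):
        self.scheduler = scheduler          # main scheduler, e.g. a poly scheduler
        self.warmup_steps = warmup_steps    # number of warmup updates
        self.lr_begin = warmup_begin_lr     # LR at the start of warmup
        self.lrs_updates = {}               # cache: num_update -> LR

    def __call__(self, num_update):
        # several parameter arrays ask for the LR of the same num_update,
        # so the math below runs only once per update step
        if num_update not in self.lrs_updates:
            if num_update < self.warmup_steps:
                # linear ramp from lr_begin up to the main scheduler's base_lr
                increase = (self.scheduler.base_lr - self.lr_begin) \
                    * float(num_update) / self.warmup_steps
                self.lrs_updates[num_update] = self.lr_begin + increase
            else:
                self.lrs_updates[num_update] = self.scheduler(num_update)
        return self.lrs_updates[num_update]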
python/mxnet/lr_scheduler.py (outdated)
            self.lrs_updates[num_update] = self.lr_begin + increase
        else:
            if isinstance(self.scheduler, PolyScheduler):
                self.lrs_updates[num_update] = self.scheduler(num_update - self.warmup_steps)
Why the special case for PolyScheduler?
Is num_update - self.warmup_steps standard? Does TF or PyTorch do it this way?
Why not num_update directly?
PolyScheduler and CosineScheduler (the latter not implemented here) reduce the LR smoothly from a "starting LR" to an "ending LR", for example from 0.1 to 0.
With warmup, we first increase the LR from a small value (e.g. 0) to the starting LR, and then apply the main scheduler. Assuming we warm up for the first 5 epochs and train for 90 epochs in total, the effective number of epochs for the poly scheduler is 85.
As for the piecewise-constant/factor schedulers, they decay the LR only at certain points, so the warmup stage has no effect on them.
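To make the arithmetic concrete, here is a hedged standalone sketch (names and defaults are assumptions, not the code in this PR) of linear warmup followed by polynomial decay, where the decay phase only sees max_update - warmup_steps effective steps:

def warmup_poly_lr(num_update, base_lr=0.1, final_lr=0.0,
                   warmup_steps=5, max_update=90, pwr=2):
    """Linear warmup to base_lr, then polynomial decay to final_lr."""
    if num_update < warmup_steps:
        # warmup: ramp linearly from 0 up to base_lr
        return base_lr * float(num_update) / warmup_steps
    # decay: the poly curve runs over the remaining max_update - warmup_steps steps
    t = min(num_update - warmup_steps, max_update - warmup_steps)
    frac = 1.0 - float(t) / (max_update - warmup_steps)
    return final_lr + (base_lr - final_lr) * frac ** pwr

# e.g. 5 warmup epochs out of 90 total leave 85 effective epochs for the decay
print(warmup_poly_lr(0), warmup_poly_lr(5), warmup_poly_lr(90))  # 0.0 0.1 0.0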
Yeah, for the above reason, but I'm updating the code to remove this special case and pass a warmup_steps param to such schedulers so we can handle it cleanly.
Improved how warmup is handled, please review.
@piiswrong Could you please review? This has a fix for an important bug where MultiFactorScheduler didn't take a base_lr previously. This meant that the example/image_classification/ scripts didn't use the given LR correctly: the LR would drop from x to 0.001 and 0.0001 regardless of the LR given.
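For illustration, a small self-contained sketch (the helper function is hypothetical, not MXNet's API) of how the bug shows up: if the scheduler keeps its own default base LR instead of the one passed by the script, the decayed values are identical no matter what LR was requested.

def multifactor_lr(num_update, steps, factor, base_lr):
    """Piecewise-constant decay: multiply by `factor` after each step in `steps`."""
    lr = base_lr
    for s in steps:
        if num_update > s:
            lr *= factor
    return lr

# Buggy behaviour: base_lr effectively hard-wired to the 0.01 default,
# so after the first step the LR is 0.001 regardless of the script's LR.
print(multifactor_lr(40, steps=[30, 60], factor=0.1, base_lr=0.01))  # 0.001
# Fixed behaviour: the user's base LR is honoured.
print(multifactor_lr(40, steps=[30, 60], factor=0.1, base_lr=0.05))  # 0.005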
@piiswrong @szha @eric-haibin-lin Please review
@szha @eric-haibin-lin please review
python/mxnet/lr_scheduler.py (outdated)
@@ -29,8 +30,31 @@ class LRScheduler(object):
    base_lr : float, optional
        The initial learning rate.
    """
-    def __init__(self, base_lr=0.01):
+    def __init__(self, base_lr=0.01, warmup_steps=0, warmup_begin_lr=0, warmup_mode='linear'):
Do you mind adding documentation for warmup_begin_lr?
There was some documentation for the inherited classes, but not for this abstract base class. Anyway, it's now added for all of them. Please check.
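As a rough sketch of what such documentation could look like for the base class (wording is illustrative, not quoted from the PR; the signature follows the diff above):

class LRScheduler(object):
    """Base class of a learning rate scheduler.

    Parameters
    ----------
    base_lr : float, optional
        The learning rate reached at the end of warmup and used as the
        starting point of the main schedule.
    warmup_steps : int, optional
        Number of updates spent warming up before the main schedule starts.
    warmup_begin_lr : float, optional
        Learning rate at the very beginning of warmup; it is ramped up to
        base_lr over warmup_steps updates.
    warmup_mode : str, optional
        Shape of the warmup ramp, e.g. 'linear' or 'constant'.
    """
    def __init__(self, base_lr=0.01, warmup_steps=0,
                 warmup_begin_lr=0, warmup_mode='linear'):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps
        self.warmup_begin_lr = warmup_begin_lr
        self.warmup_mode = warmup_mode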
python/mxnet/lr_scheduler.py (outdated)
    """

-    def __init__(self, max_update, base_lr=0.01, pwr=2):
-        super(PolyScheduler, self).__init__(base_lr)
+    def __init__(self, max_update, base_lr=0.01, final_lr=0,
why did you remove pwr? This is API breakage
I've not removed it. Git is getting confused :/ It thinks I've changed PolyScheduler to CosineScheduler when in fact I've modified PolyScheduler and added a new CosineScheduler.
Please refer to #11234 (comment). Hopefully this will give committers confidence to merge.
[Plots of LR decay from unit tests: a table comparing "This PR" vs "Earlier" curves for each interface: CosineScheduler, PolyScheduler, MultiFactorScheduler, FactorScheduler, and the base Scheduler.]
* Add warmup and fix inconsistencies with learning rate schedulers
* add comments
* remove assert