Why does the training not stop? #51

Open
ZhaohanWang0217 opened this issue Nov 10, 2023 · 1 comment
Comments

@ZhaohanWang0217

I reduced total_training_steps, but the training still does not stop at the reduced number of steps. Why?

@subminu

subminu commented Sep 24, 2024

I found the cause of the behavior you are asking about.
In the run_loop function:

def run_loop(self):
    saved = False
    while (
        not self.lr_anneal_steps
        or self.step < self.lr_anneal_steps
        or self.global_step < self.total_training_steps
    ):
        batch, cond = next(self.data)
        self.run_step(batch, cond)
        saved = False
        if (
            self.global_step
            and self.save_interval != -1
            and self.global_step % self.save_interval == 0
        ):
            self.save()
            saved = True
            th.cuda.empty_cache()
            # Run for a finite amount of time in integration tests.
            if os.environ.get("DIFFUSION_TRAINING_TEST", "") and self.step > 0:
                return
        if self.global_step % self.log_interval == 0:
            logger.dumpkvs()

The condition not self.lr_anneal_steps always evaluates to True when lr_anneal_steps is left at its default value of 0. Because the three clauses are joined with or, the whole while condition then stays True forever, so the global_step < total_training_steps check never gets a chance to end the loop.
You can temporarily fix the issue by removing not self.lr_anneal_steps or self.step < self.lr_anneal_steps from the condition.
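A minimal sketch of that workaround, keeping the rest of run_loop unchanged (this is my reading of the suggestion, not an official patch):

def run_loop(self):
    saved = False
    # With the two lr_anneal_steps clauses removed, only
    # total_training_steps decides when the loop exits.
    while self.global_step < self.total_training_steps:
        batch, cond = next(self.data)
        self.run_step(batch, cond)
        # ... rest of the loop body unchanged ...

If you still want learning-rate annealing to be able to stop training, an alternative (not what this comment proposes) is to combine the lr_anneal_steps clauses with the total_training_steps check using and instead of or, so the loop ends as soon as either limit is reached.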
