Why does the training not stop? #51

Open
ZhaohanWang0217 opened this issue Nov 10, 2023 · 1 comment
Comments

@ZhaohanWang0217

I reduced total_training_steps, but the training still does not stop at the reduced number of steps. Why?

@subminu

subminu commented Sep 24, 2024

I found the cause of the behavior you are asking about.
In the run_loop function:

def run_loop(self):
    saved = False
    while (
        not self.lr_anneal_steps
        or self.step < self.lr_anneal_steps
        or self.global_step < self.total_training_steps
    ):
        batch, cond = next(self.data)
        self.run_step(batch, cond)
        saved = False
        if (
            self.global_step
            and self.save_interval != -1
            and self.global_step % self.save_interval == 0
        ):
            self.save()
            saved = True
            th.cuda.empty_cache()
            # Run for a finite amount of time in integration tests.
            if os.environ.get("DIFFUSION_TRAINING_TEST", "") and self.step > 0:
                return
        if self.global_step % self.log_interval == 0:
            logger.dumpkvs()

The condition not self.lr_anneal_steps always evaluates to True when lr_anneal_steps is left at its default value of 0. Because the three clauses are joined with or, the whole while condition then stays True forever, so the global_step < total_training_steps check never gets a chance to end the loop.
You can temporarily fix the issue by removing not self.lr_anneal_steps or self.step < self.lr_anneal_steps from the condition.
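A minimal sketch of that workaround, keeping the rest of run_loop unchanged (this is my reading of the suggestion, not an official patch):

def run_loop(self):
    saved = False
    # With the two lr_anneal_steps clauses removed, only
    # total_training_steps decides when the loop exits.
    while self.global_step < self.total_training_steps:
        batch, cond = next(self.data)
        self.run_step(batch, cond)
        # ... rest of the loop body unchanged ...

If you still want learning-rate annealing to be able to stop training, an alternative (not what this comment proposes) is to combine the lr_anneal_steps clauses with the total_training_steps check using and instead of or, so the loop ends as soon as either limit is reached.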
