Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: only train for one batch in PyTorch Trainer test mode #8260

Merged

Conversation

azhou-determined
Copy link
Contributor

Description

previous change to enable training lengths longer than searcher length introduced bug where test_mode flag will train for max_length instead of just one batch.

Test Plan

run a script using pytorch trainer API (example for mnist):

import determined as det
import yaml
import logging
from model_def import MNistTrial

def main():

    with open("const.yaml", "r") as file:
        exp_conf = yaml.safe_load(file)

    hparams = exp_conf["hyperparameters"]

    with det.pytorch.init(hparams=hparams, exp_conf=exp_conf) as train_context:
        trial = MNistTrial(train_context)
        trainer = det.pytorch.Trainer(trial, train_context)

        trainer.fit(
            max_length=det.pytorch.Epoch(1),
            test_mode=True
        )

if __name__ == "__main__":
    # Configure logging
    logging.basicConfig(level=logging.INFO, format=det.LOG_FORMAT)
    main()

verify in stdout that the code trains for 1 batch only.

Commentary (optional)

Checklist

  • Changes have been manually QA'd
  • User-facing API changes need the "User-facing API Change" label.
  • Release notes should be added as a separate file under docs/release-notes/.
    See Release Note for details.
  • Licenses should be included for new code which was copied and/or modified from any external code.

Ticket

@netlify
Copy link

netlify bot commented Oct 30, 2023

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit 32490d8
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/653adb08ad89410008a6fbe6
😎 Deploy Preview https://deploy-preview-8260--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Comment on lines 753 to 759
if self.local_training:
searcher_length = self.max_length
train_length = Batch(1) if self.test_mode else self.max_length
else:
searcher_length = TrainUnit._from_searcher_unit(
train_length = TrainUnit._from_searcher_unit(
op.length, self.searcher_unit, self.global_batch_size
)
assert searcher_length
assert train_length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just confirming, we don't support --test without --local through any codepath except the legacy codepath, right?

This seems correct under that condition, but might make more sense to check for test mode first, then local, since if we did support nonlocal test mode we would still only train one batch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. updated.

Comment on lines 753 to 760
if self.local_training:
searcher_length = self.max_length
train_length = self.max_length
else:
searcher_length = TrainUnit._from_searcher_unit(
train_length = TrainUnit._from_searcher_unit(
op.length, self.searcher_unit, self.global_batch_size
)
assert searcher_length
if self.test_mode:
train_length = Batch(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: reorder your if statements so you aren't mutating variables; that would feel a lot more natural:

if self.test_mode:
  train_length = Batch(1)
elif self.local_training:
  train_length = self.max_length
else:
  train_length = TrainUnit._from_searcher_unit(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duh. thanks.

@azhou-determined azhou-determined merged commit af40f2c into determined-ai:main Oct 31, 2023
55 of 61 checks passed
@dannysauer dannysauer added this to the 0.26.3 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants