Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct logic for checking if a validation is the best one seen #601

Merged
merged 1 commit into from
Jun 3, 2020

Conversation

dzhu
Copy link
Contributor

@dzhu dzhu commented May 28, 2020

Description

Previously, every validation would be marked as the best one seen when smallerIsBetter was true. The new version avoids that and also separates the code a bit more cleanly into "decide whether the validation is best" followed by "do things accordingly".

Test Plan

  • write a new test that failed with the old code and passes with the new code
  • run an experiment with checkpoint_policy: best, smaller_is_better: true, min_validation_period: 1, and no min_checkpoint_period; check that checkpoints are taken when appropriate rather than after every validation

Previously, every validation would be marked as the best one seen when
`smallerIsBetter` was true. The new version avoids that and also
separates the code a bit more cleanly into "decide whether the
validation is best" followed by "do things accordingly".
@dzhu dzhu requested a review from ybt195 May 28, 2020 18:46
@cla-bot cla-bot bot added the cla-signed label May 28, 2020
@dzhu dzhu merged commit bdfd980 into determined-ai:master Jun 3, 2020
@dzhu dzhu deleted the fix/best-validation branch June 3, 2020 16:49
hamidzr added a commit to hamidzr/determined that referenced this pull request Jun 9, 2020
Squashed commit of the following:

commit 3b07678
Author: Hamid Zare <[email protected]>
Date:   Mon Jun 8 16:16:48 2020 -0700

    feat: add task list page route and placeholder [DET-3220] (determined-ai#636)

commit 4c2d0a6
Author: Hamid Zare <[email protected]>
Date:   Mon Jun 8 16:15:23 2020 -0700

    feat: remember last logged in username [DET-3274] (determined-ai#660)

commit 18c8125
Author: Hamid Zare <[email protected]>
Date:   Mon Jun 8 13:25:46 2020 -0700

    refactor: set up experiments context [DET-3255] (determined-ai#640)

commit 5e5b188
Author: Neil Conway <[email protected]>
Date:   Mon Jun 8 12:49:34 2020 -0700

    chore: add license to pip metadata (determined-ai#669)

    Without this change, `pip show determined-cli | grep License` returns:

        License: UNKNOWN

commit 05aa3d2
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 11:53:17 2020 -0700

    feat: support TF Keras EarlyStopping callbacks [DET-3240] (determined-ai#666)

commit 4056146
Author: aaron276h <[email protected]>
Date:   Mon Jun 8 14:21:13 2020 -0400

    docs: add to FAQ how to port a TF core graph model (determined-ai#650)

commit c8bb942
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 10:45:39 2020 -0700

    feat: support Estimator early stopping hooks [DET-3239] (determined-ai#661)

commit 3ab90a6
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 10:44:17 2020 -0700

    test: temporarily disable AMP test since it causes NaNs (determined-ai#670)

commit 629f106
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 09:22:37 2020 -0700

    feat: treat NaN metrics as an error (determined-ai#667)

    The expected behavior when we hit a NaN is to error the trial. This will restart the trial if we have not restarted max_restarts times. Before we would convert the NaN to the maximum float.

commit db76932
Author: Yoni Ben-tzur <[email protected]>
Date:   Mon Jun 8 08:50:48 2020 -0700

    fix: set auth cookie path to apply site wide (determined-ai#668)

commit 6588f77
Author: aaron276h <[email protected]>
Date:   Mon Jun 8 08:47:11 2020 -0400

    feat: decouple agent information from workloads starting tasks [DET-3178] (determined-ai#631)

    This is a step toward adding a generic resource provider interface.
    As part of this change, Tasks now receive a single `TaskAssigned` message
    when they are assigned rather than one `Assigned` message per container.
    They also start containers by sending specs back to the cluster rather
    than directly to agents.

commit f604a28
Author: Yoni Ben-tzur <[email protected]>
Date:   Fri Jun 5 16:44:11 2020 -0700

    feat: read cookies in the new API auth module (determined-ai#665)

commit 9da1063
Author: Danny Zhu <[email protected]>
Date:   Fri Jun 5 15:43:32 2020 -0700

    fix: space out WebUI plot x-axis ticks a bit more (determined-ai#658)

    The ticks could get bunched up too much with long-running experiments
    before; this just makes sure the spacing is a bit wider by enough to
    work well in reasonable cases. The ticks may now occasionally look a
    tiny bit wider than ideal with very short experiments, but it still
    works fine in practice. Similarly, a long last tick could get cut off at
    the end, so this also bumps the right side spacing correspondingly.

commit b9d9324
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 15:24:54 2020 -0700

    feat: support early stopping callbacks on a validation step (determined-ai#662)

commit cfb3f51
Author: Yoni Ben-tzur <[email protected]>
Date:   Fri Jun 5 15:22:11 2020 -0700

    feat: add user auth to new api (determined-ai#649)

commit 414bfdf
Author: Hamid Zare <[email protected]>
Date:   Fri Jun 5 14:07:22 2020 -0700

    fix: set authentication failure reason synchronously. (determined-ai#659)

commit ed94d86
Author: aaron276h <[email protected]>
Date:   Fri Jun 5 16:15:25 2020 -0400

    feat: decouple agents from transmitting container status changes [DET-3174] (determined-ai#646)

    This is a step toward adding a generic resource provider interface.

commit f27146a
Author: Caleb Hoyoul Kang <[email protected]>
Date:   Fri Jun 5 13:52:03 2020 -0600

    fix: address minor login issues (determined-ai#611)

    * Increase frequency of auto login check to once per second.
    * Add catch-all routes to redirect invalid route requests to valid routes.

commit d014500
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 11:47:43 2020 -0700

    revert: "revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654)" (determined-ai#656)

    This reverts commit 5baea6a.

commit 44a398a
Author: Caleb Hoyoul Kang <[email protected]>
Date:   Fri Jun 5 10:38:40 2020 -0600

    feat: ensure WebUI version is up to date with platform version (determined-ai#632)

commit 5baea6a
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 08:55:58 2020 -0700

    revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654)

    This reverts commit ee1314f.

commit ee1314f
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 08:02:13 2020 -0700

    feat: support stopping training in trial code [DET-3238] (determined-ai#648)

commit fa09a74
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 07:46:24 2020 -0700

    ci: download protoc install to /tmp (determined-ai#653)

commit 9759ce7
Author: determined-dsw <[email protected]>
Date:   Thu Jun 4 18:49:47 2020 -0700

    docs: release notes for 0.12.5 (determined-ai#595) (determined-ai#651)

    (cherry picked from commit 071b3eb)

    Co-authored-by: Justin Chen <[email protected]>

commit 5f476df
Author: Hamid Zare <[email protected]>
Date:   Thu Jun 4 15:30:17 2020 -0700

    chore: remove yarn mentions from tests (determined-ai#635)

    we removed yarn as a dependency a while ago and shouldn't have references to it anymore.

commit 8662fda
Author: Danny Zhu <[email protected]>
Date:   Thu Jun 4 13:03:22 2020 -0700

    fix: correct filename in Elm Makefile (determined-ai#647)

    Due to a mismatch between different instances of what should've been the
    same name, the CSS file was getting built unnecessarily every time,
    considerably increasing the Elm build time.

commit 0e7ca0a
Author: Sidney Wijngaarde <[email protected]>
Date:   Thu Jun 4 12:42:41 2020 -0400

    feat: add checkpoint metadata to cli describe commands (determined-ai#645)

    [DET-3210]

commit 84e875a
Author: aaron276h <[email protected]>
Date:   Thu Jun 4 11:56:52 2020 -0400

    test: fix nightly nas and iris tf keras tests [DET-3264] (determined-ai#644)

    * docs: update NAS example to use correct gradient clipping
    * test: set random seed for nightly iris tf_keras test

commit 4ff9fa0
Author: Sidney Wijngaarde <[email protected]>
Date:   Thu Jun 4 10:47:38 2020 -0400

    feat: checkpoint metadata api (determined-ai#619)

    Adds checkpoint metadata management REST endpoints and python client methods.

    [DET-3207] [DET-3208] [DET-3209]

    PR includes an integration test covering the feature

commit cbbe117
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 20:11:39 2020 -0700

    chore: move proto files to determined namespace (determined-ai#639)

commit fafd686
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 17:01:33 2020 -0700

    feat: add template endpoints to new api (determined-ai#638)

commit 4bad652
Author: Brian Friedenberg <[email protected]>
Date:   Wed Jun 3 16:51:57 2020 -0700

    feat: support USER_CANCELLED exited reason (determined-ai#637)

commit d1146d3
Author: Caleb Hoyoul Kang <[email protected]>
Date:   Wed Jun 3 17:37:04 2020 -0600

    refactor: update link to support secure blank targets (determined-ai#612)

commit f71d64e
Author: Hamid Zare <[email protected]>
Date:   Wed Jun 3 15:33:15 2020 -0700

    feat: add page component [DET-3232] (determined-ai#614)

commit 25e725e
Author: aaron276h <[email protected]>
Date:   Wed Jun 3 17:25:12 2020 -0400

    feat: support gradient clipping in PyTorchTrial via callbacks (determined-ai#615)

    Breaking Change: we no longer accept gradient clipping as a special
    hyperparameter for PyTorchTrial.

commit 80e39d0
Author: Hamid Zare <[email protected]>
Date:   Wed Jun 3 12:28:48 2020 -0700

    feat: add antd breadcrumb stories [DET-3002] (determined-ai#582)

commit 5c9afa2
Author: Hamid Zare <[email protected]>
Date:   Wed Jun 3 12:27:26 2020 -0700

    feat: add activate, pause, and cancel actions to task cards [DET-2934] (determined-ai#585)

commit a3e121a
Author: aaron276h <[email protected]>
Date:   Wed Jun 3 13:34:10 2020 -0400

    feat: add end of training callback to EstimatorTrial (determined-ai#621)

    This will allow users to do any post-experiment cleanup they may need to do.

commit 8056055
Author: Shiyuan <[email protected]>
Date:   Wed Jun 3 10:24:40 2020 -0700

    feat: make agent starting period configurable [DET-3219] (determined-ai#624)

    * feat: make agent starting period configurable

    The provisioner would retry launching an agent after reaching
    a duration, which is set by max agent starting period. Previously,
    this configuration is hardcoded to 300 seconds. Now, change it to
    be in the configuration of the provisioner.

    DET-3219 #Done.

    * test: update tests for max agent starting period

    * chore: clean up provisioner code for better readability

    * chore: clean up provisioner tests for better readability

    * docs: update docs for agent starting period

    * chore: update gcp deployment tool for agent starting period

    * chore: update packaged master.yaml with agent starting period

    * chore: update aws deployment tool with agent starting perioo

commit 8fdc371
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 10:12:13 2020 -0700

    chore: upgrade proto libraries (determined-ai#630)

commit bdfd980
Author: Danny Zhu <[email protected]>
Date:   Wed Jun 3 09:47:02 2020 -0700

    fix: correct logic for checking if a validation is the best one seen (determined-ai#601)

    Previously, every validation would be marked as the best one seen when
    `smallerIsBetter` was true. The new version avoids that and also
    separates the code a bit more cleanly into "decide whether the
    validation is best" followed by "do things accordingly".

commit f590fc3
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 09:44:41 2020 -0700

    chore: remove container recovery (determined-ai#629)

commit a8c1bb2
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 09:22:16 2020 -0700

    feat: add master endpoint to new api (determined-ai#627)

commit 678d53d
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 08:31:45 2020 -0700

    chore: ignore pkg dir in proto sub project (determined-ai#628)

commit 65b5c17
Author: Brian Friedenberg <[email protected]>
Date:   Wed Jun 3 07:07:19 2020 -0700

    chore: bump version: 0.12.5.dev0 -> 0.12.6.dev0 (determined-ai#625)

commit 13c0db2
Author: Yoni Ben-tzur <[email protected]>
Date:   Tue Jun 2 21:47:53 2020 -0700

    chore: move proto to separate top level package (determined-ai#620)

    * chore: move proto to separate top level package

    * fix build issues

    * downgrade buf build for go1.13

    * build proto before publish

    * proto build before package

commit 897f2f6
Author: Shiyuan <[email protected]>
Date:   Tue Jun 2 17:29:30 2020 -0700

    revert: make agent starting period configurable [DET-3219] (determined-ai#623)

    This reverts commit 7f83e97.

commit 7f83e97
Author: Shiyuan <[email protected]>
Date:   Tue Jun 2 17:11:05 2020 -0700

    feat: make agent starting period configurable [DET-3219] (determined-ai#610)

    * feat: make agent starting period configurable

    The provisioner would retry launching an agent after reaching
    a duration, which is set by max agent starting period. Previously,
    this configuration is hardcoded to 300 seconds. Now, change it to
    be in the configuration of the provisioner.

    DET-3219 #Done.

    * test: update tests for max agent starting period

    * chore: clean up provisioner code for better readability

    * chore: clean up provisioner tests for better readability

    * docs: update docs for agent starting period

    * chore: update gcp deployment tool for agent starting period

    * chore: update packaged master.yaml with agent starting period

    * chore: update aws deployment tool with agent starting perioo

commit b01b560
Author: Armand McQueen <[email protected]>
Date:   Tue Jun 2 14:28:02 2020 -0700

    fix: read docker config file from HOME directory (determined-ai#587)

    The path to the docker config file varies depending on location of HOME directory. When an agent was run somewhere where /root is not the HOME directory, this prevented using docker credential stores.

commit e0d0447
Author: Shiyuan <[email protected]>
Date:   Tue Jun 2 13:37:48 2020 -0700

    feat: make GCP operation tracker timeout configuration [DET-3182] (determined-ai#598)

    * feat: make GCP provider operation timeout configurable

    * docs: update the cluster configuration doc for gcp operation timeout

    * docs: update gcp topic guide with operation timeout

    * docs: update default master configuration for gcp operation timeout

    * feat: update GCP terraform script for operation timeout

    * feat: update GCP deploy tool for operation timeout
tayritenour pushed a commit to tayritenour/determined that referenced this pull request Apr 25, 2023
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Jun 23, 2023
@dannysauer dannysauer added this to the 0.12.7 milestone Feb 6, 2024
eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants