-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: correct logic for checking if a validation is the best one seen #601
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Previously, every validation would be marked as the best one seen when `smallerIsBetter` was true. The new version avoids that and also separates the code a bit more cleanly into "decide whether the validation is best" followed by "do things accordingly".
ybt195
approved these changes
Jun 3, 2020
hamidzr
added a commit
to hamidzr/determined
that referenced
this pull request
Jun 9, 2020
Squashed commit of the following: commit 3b07678 Author: Hamid Zare <[email protected]> Date: Mon Jun 8 16:16:48 2020 -0700 feat: add task list page route and placeholder [DET-3220] (determined-ai#636) commit 4c2d0a6 Author: Hamid Zare <[email protected]> Date: Mon Jun 8 16:15:23 2020 -0700 feat: remember last logged in username [DET-3274] (determined-ai#660) commit 18c8125 Author: Hamid Zare <[email protected]> Date: Mon Jun 8 13:25:46 2020 -0700 refactor: set up experiments context [DET-3255] (determined-ai#640) commit 5e5b188 Author: Neil Conway <[email protected]> Date: Mon Jun 8 12:49:34 2020 -0700 chore: add license to pip metadata (determined-ai#669) Without this change, `pip show determined-cli | grep License` returns: License: UNKNOWN commit 05aa3d2 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 11:53:17 2020 -0700 feat: support TF Keras EarlyStopping callbacks [DET-3240] (determined-ai#666) commit 4056146 Author: aaron276h <[email protected]> Date: Mon Jun 8 14:21:13 2020 -0400 docs: add to FAQ how to port a TF core graph model (determined-ai#650) commit c8bb942 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 10:45:39 2020 -0700 feat: support Estimator early stopping hooks [DET-3239] (determined-ai#661) commit 3ab90a6 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 10:44:17 2020 -0700 test: temporarily disable AMP test since it causes NaNs (determined-ai#670) commit 629f106 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 09:22:37 2020 -0700 feat: treat NaN metrics as an error (determined-ai#667) The expected behavior when we hit a NaN is to error the trial. This will restart the trial if we have not restarted max_restarts times. Before we would convert the NaN to the maximum float. commit db76932 Author: Yoni Ben-tzur <[email protected]> Date: Mon Jun 8 08:50:48 2020 -0700 fix: set auth cookie path to apply site wide (determined-ai#668) commit 6588f77 Author: aaron276h <[email protected]> Date: Mon Jun 8 08:47:11 2020 -0400 feat: decouple agent information from workloads starting tasks [DET-3178] (determined-ai#631) This is a step toward adding a generic resource provider interface. As part of this change, Tasks now receive a single `TaskAssigned` message when they are assigned rather than one `Assigned` message per container. They also start containers by sending specs back to the cluster rather than directly to agents. commit f604a28 Author: Yoni Ben-tzur <[email protected]> Date: Fri Jun 5 16:44:11 2020 -0700 feat: read cookies in the new API auth module (determined-ai#665) commit 9da1063 Author: Danny Zhu <[email protected]> Date: Fri Jun 5 15:43:32 2020 -0700 fix: space out WebUI plot x-axis ticks a bit more (determined-ai#658) The ticks could get bunched up too much with long-running experiments before; this just makes sure the spacing is a bit wider by enough to work well in reasonable cases. The ticks may now occasionally look a tiny bit wider than ideal with very short experiments, but it still works fine in practice. Similarly, a long last tick could get cut off at the end, so this also bumps the right side spacing correspondingly. commit b9d9324 Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 15:24:54 2020 -0700 feat: support early stopping callbacks on a validation step (determined-ai#662) commit cfb3f51 Author: Yoni Ben-tzur <[email protected]> Date: Fri Jun 5 15:22:11 2020 -0700 feat: add user auth to new api (determined-ai#649) commit 414bfdf Author: Hamid Zare <[email protected]> Date: Fri Jun 5 14:07:22 2020 -0700 fix: set authentication failure reason synchronously. (determined-ai#659) commit ed94d86 Author: aaron276h <[email protected]> Date: Fri Jun 5 16:15:25 2020 -0400 feat: decouple agents from transmitting container status changes [DET-3174] (determined-ai#646) This is a step toward adding a generic resource provider interface. commit f27146a Author: Caleb Hoyoul Kang <[email protected]> Date: Fri Jun 5 13:52:03 2020 -0600 fix: address minor login issues (determined-ai#611) * Increase frequency of auto login check to once per second. * Add catch-all routes to redirect invalid route requests to valid routes. commit d014500 Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 11:47:43 2020 -0700 revert: "revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654)" (determined-ai#656) This reverts commit 5baea6a. commit 44a398a Author: Caleb Hoyoul Kang <[email protected]> Date: Fri Jun 5 10:38:40 2020 -0600 feat: ensure WebUI version is up to date with platform version (determined-ai#632) commit 5baea6a Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 08:55:58 2020 -0700 revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654) This reverts commit ee1314f. commit ee1314f Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 08:02:13 2020 -0700 feat: support stopping training in trial code [DET-3238] (determined-ai#648) commit fa09a74 Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 07:46:24 2020 -0700 ci: download protoc install to /tmp (determined-ai#653) commit 9759ce7 Author: determined-dsw <[email protected]> Date: Thu Jun 4 18:49:47 2020 -0700 docs: release notes for 0.12.5 (determined-ai#595) (determined-ai#651) (cherry picked from commit 071b3eb) Co-authored-by: Justin Chen <[email protected]> commit 5f476df Author: Hamid Zare <[email protected]> Date: Thu Jun 4 15:30:17 2020 -0700 chore: remove yarn mentions from tests (determined-ai#635) we removed yarn as a dependency a while ago and shouldn't have references to it anymore. commit 8662fda Author: Danny Zhu <[email protected]> Date: Thu Jun 4 13:03:22 2020 -0700 fix: correct filename in Elm Makefile (determined-ai#647) Due to a mismatch between different instances of what should've been the same name, the CSS file was getting built unnecessarily every time, considerably increasing the Elm build time. commit 0e7ca0a Author: Sidney Wijngaarde <[email protected]> Date: Thu Jun 4 12:42:41 2020 -0400 feat: add checkpoint metadata to cli describe commands (determined-ai#645) [DET-3210] commit 84e875a Author: aaron276h <[email protected]> Date: Thu Jun 4 11:56:52 2020 -0400 test: fix nightly nas and iris tf keras tests [DET-3264] (determined-ai#644) * docs: update NAS example to use correct gradient clipping * test: set random seed for nightly iris tf_keras test commit 4ff9fa0 Author: Sidney Wijngaarde <[email protected]> Date: Thu Jun 4 10:47:38 2020 -0400 feat: checkpoint metadata api (determined-ai#619) Adds checkpoint metadata management REST endpoints and python client methods. [DET-3207] [DET-3208] [DET-3209] PR includes an integration test covering the feature commit cbbe117 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 20:11:39 2020 -0700 chore: move proto files to determined namespace (determined-ai#639) commit fafd686 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 17:01:33 2020 -0700 feat: add template endpoints to new api (determined-ai#638) commit 4bad652 Author: Brian Friedenberg <[email protected]> Date: Wed Jun 3 16:51:57 2020 -0700 feat: support USER_CANCELLED exited reason (determined-ai#637) commit d1146d3 Author: Caleb Hoyoul Kang <[email protected]> Date: Wed Jun 3 17:37:04 2020 -0600 refactor: update link to support secure blank targets (determined-ai#612) commit f71d64e Author: Hamid Zare <[email protected]> Date: Wed Jun 3 15:33:15 2020 -0700 feat: add page component [DET-3232] (determined-ai#614) commit 25e725e Author: aaron276h <[email protected]> Date: Wed Jun 3 17:25:12 2020 -0400 feat: support gradient clipping in PyTorchTrial via callbacks (determined-ai#615) Breaking Change: we no longer accept gradient clipping as a special hyperparameter for PyTorchTrial. commit 80e39d0 Author: Hamid Zare <[email protected]> Date: Wed Jun 3 12:28:48 2020 -0700 feat: add antd breadcrumb stories [DET-3002] (determined-ai#582) commit 5c9afa2 Author: Hamid Zare <[email protected]> Date: Wed Jun 3 12:27:26 2020 -0700 feat: add activate, pause, and cancel actions to task cards [DET-2934] (determined-ai#585) commit a3e121a Author: aaron276h <[email protected]> Date: Wed Jun 3 13:34:10 2020 -0400 feat: add end of training callback to EstimatorTrial (determined-ai#621) This will allow users to do any post-experiment cleanup they may need to do. commit 8056055 Author: Shiyuan <[email protected]> Date: Wed Jun 3 10:24:40 2020 -0700 feat: make agent starting period configurable [DET-3219] (determined-ai#624) * feat: make agent starting period configurable The provisioner would retry launching an agent after reaching a duration, which is set by max agent starting period. Previously, this configuration is hardcoded to 300 seconds. Now, change it to be in the configuration of the provisioner. DET-3219 #Done. * test: update tests for max agent starting period * chore: clean up provisioner code for better readability * chore: clean up provisioner tests for better readability * docs: update docs for agent starting period * chore: update gcp deployment tool for agent starting period * chore: update packaged master.yaml with agent starting period * chore: update aws deployment tool with agent starting perioo commit 8fdc371 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 10:12:13 2020 -0700 chore: upgrade proto libraries (determined-ai#630) commit bdfd980 Author: Danny Zhu <[email protected]> Date: Wed Jun 3 09:47:02 2020 -0700 fix: correct logic for checking if a validation is the best one seen (determined-ai#601) Previously, every validation would be marked as the best one seen when `smallerIsBetter` was true. The new version avoids that and also separates the code a bit more cleanly into "decide whether the validation is best" followed by "do things accordingly". commit f590fc3 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 09:44:41 2020 -0700 chore: remove container recovery (determined-ai#629) commit a8c1bb2 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 09:22:16 2020 -0700 feat: add master endpoint to new api (determined-ai#627) commit 678d53d Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 08:31:45 2020 -0700 chore: ignore pkg dir in proto sub project (determined-ai#628) commit 65b5c17 Author: Brian Friedenberg <[email protected]> Date: Wed Jun 3 07:07:19 2020 -0700 chore: bump version: 0.12.5.dev0 -> 0.12.6.dev0 (determined-ai#625) commit 13c0db2 Author: Yoni Ben-tzur <[email protected]> Date: Tue Jun 2 21:47:53 2020 -0700 chore: move proto to separate top level package (determined-ai#620) * chore: move proto to separate top level package * fix build issues * downgrade buf build for go1.13 * build proto before publish * proto build before package commit 897f2f6 Author: Shiyuan <[email protected]> Date: Tue Jun 2 17:29:30 2020 -0700 revert: make agent starting period configurable [DET-3219] (determined-ai#623) This reverts commit 7f83e97. commit 7f83e97 Author: Shiyuan <[email protected]> Date: Tue Jun 2 17:11:05 2020 -0700 feat: make agent starting period configurable [DET-3219] (determined-ai#610) * feat: make agent starting period configurable The provisioner would retry launching an agent after reaching a duration, which is set by max agent starting period. Previously, this configuration is hardcoded to 300 seconds. Now, change it to be in the configuration of the provisioner. DET-3219 #Done. * test: update tests for max agent starting period * chore: clean up provisioner code for better readability * chore: clean up provisioner tests for better readability * docs: update docs for agent starting period * chore: update gcp deployment tool for agent starting period * chore: update packaged master.yaml with agent starting period * chore: update aws deployment tool with agent starting perioo commit b01b560 Author: Armand McQueen <[email protected]> Date: Tue Jun 2 14:28:02 2020 -0700 fix: read docker config file from HOME directory (determined-ai#587) The path to the docker config file varies depending on location of HOME directory. When an agent was run somewhere where /root is not the HOME directory, this prevented using docker credential stores. commit e0d0447 Author: Shiyuan <[email protected]> Date: Tue Jun 2 13:37:48 2020 -0700 feat: make GCP operation tracker timeout configuration [DET-3182] (determined-ai#598) * feat: make GCP provider operation timeout configurable * docs: update the cluster configuration doc for gcp operation timeout * docs: update gcp topic guide with operation timeout * docs: update default master configuration for gcp operation timeout * feat: update GCP terraform script for operation timeout * feat: update GCP deploy tool for operation timeout
tayritenour
pushed a commit
to tayritenour/determined
that referenced
this pull request
Apr 25, 2023
eecsliu
pushed a commit
to eecsliu/determined
that referenced
this pull request
Jun 23, 2023
stoksc
pushed a commit
that referenced
this pull request
Jun 26, 2023
eecsliu
pushed a commit
that referenced
this pull request
Jun 28, 2023
eecsliu
pushed a commit
that referenced
this pull request
Jun 28, 2023
stoksc
pushed a commit
that referenced
this pull request
Jul 20, 2023
eecsliu
pushed a commit
that referenced
this pull request
Jul 24, 2023
stoksc
pushed a commit
that referenced
this pull request
Oct 17, 2023
azhou-determined
pushed a commit
that referenced
this pull request
Dec 7, 2023
wes-turner
pushed a commit
that referenced
this pull request
Feb 2, 2024
rb-determined-ai
pushed a commit
that referenced
this pull request
Feb 29, 2024
amandavialva01
pushed a commit
that referenced
this pull request
Mar 18, 2024
eecsliu
pushed a commit
that referenced
this pull request
Apr 18, 2024
eecsliu
pushed a commit
to determined-ai/determined-release-testing
that referenced
this pull request
Apr 22, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Previously, every validation would be marked as the best one seen when
smallerIsBetter
was true. The new version avoids that and also separates the code a bit more cleanly into "decide whether the validation is best" followed by "do things accordingly".Test Plan
checkpoint_policy: best
,smaller_is_better: true
,min_validation_period: 1
, and nomin_checkpoint_period
; check that checkpoints are taken when appropriate rather than after every validation