-
Notifications
You must be signed in to change notification settings - Fork 356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: decouple agent information from workloads starting tasks [DET-3178] #631
Conversation
Performed manual tests confirming multi-container triale work as expected. Also checked commands, tensorboard, and notebooks just in case. |
2ece136
to
8bf902d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a slightly more detailed summary of what changed in the PR/commit message would be helpful. Something like "Tasks now receive a single `TaskAssigned` message when they are assigned rather than one `Assigned` message per container. They also start containers by sending specs back to the cluster rather than directly to agents.
master/internal/scheduler/task.go
Outdated
@@ -6,6 +6,7 @@ import ( | |||
"github.com/google/uuid" | |||
|
|||
"github.com/determined-ai/determined/master/pkg/actor" | |||
image "github.com/determined-ai/determined/master/pkg/tasks" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to have the "image" alias?
07cca1a
to
0332b33
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should've thought of this before, but, under the principle of "not just what, but why", perhaps there should also be something like "This is a step toward adding a generic resource provider interface" in the message for posterity's sake.
master/internal/scheduler/cluster.go
Outdated
func (c *Cluster) receiveStartTask(ctx *actor.Context, msg StartTask) { | ||
task := c.tasksByHandler[ctx.Sender()] | ||
if task == nil { | ||
ctx.Log().WithField("address", ctx.Sender().Address).Errorf("unknown task trying to start") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, my bad:
ctx.Log().WithField("address", ctx.Sender().Address).Errorf("unknown task trying to start") | |
ctx.Log().WithField("address", ctx.Sender().Address()).Errorf("unknown task trying to start") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, just canceling out the fact that I put a bad suggestion in the first place. 🤦♂️
This is a step toward adding a generic resource provider interface. As part of this change, Tasks now receive a single `TaskAssigned` message when they are assigned rather than one `Assigned` message per container. They also start containers by sending specs back to the cluster rather than directly to agents. DET-3178 #Done.
Squashed commit of the following: commit 3b07678 Author: Hamid Zare <[email protected]> Date: Mon Jun 8 16:16:48 2020 -0700 feat: add task list page route and placeholder [DET-3220] (determined-ai#636) commit 4c2d0a6 Author: Hamid Zare <[email protected]> Date: Mon Jun 8 16:15:23 2020 -0700 feat: remember last logged in username [DET-3274] (determined-ai#660) commit 18c8125 Author: Hamid Zare <[email protected]> Date: Mon Jun 8 13:25:46 2020 -0700 refactor: set up experiments context [DET-3255] (determined-ai#640) commit 5e5b188 Author: Neil Conway <[email protected]> Date: Mon Jun 8 12:49:34 2020 -0700 chore: add license to pip metadata (determined-ai#669) Without this change, `pip show determined-cli | grep License` returns: License: UNKNOWN commit 05aa3d2 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 11:53:17 2020 -0700 feat: support TF Keras EarlyStopping callbacks [DET-3240] (determined-ai#666) commit 4056146 Author: aaron276h <[email protected]> Date: Mon Jun 8 14:21:13 2020 -0400 docs: add to FAQ how to port a TF core graph model (determined-ai#650) commit c8bb942 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 10:45:39 2020 -0700 feat: support Estimator early stopping hooks [DET-3239] (determined-ai#661) commit 3ab90a6 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 10:44:17 2020 -0700 test: temporarily disable AMP test since it causes NaNs (determined-ai#670) commit 629f106 Author: Brian Friedenberg <[email protected]> Date: Mon Jun 8 09:22:37 2020 -0700 feat: treat NaN metrics as an error (determined-ai#667) The expected behavior when we hit a NaN is to error the trial. This will restart the trial if we have not restarted max_restarts times. Before we would convert the NaN to the maximum float. commit db76932 Author: Yoni Ben-tzur <[email protected]> Date: Mon Jun 8 08:50:48 2020 -0700 fix: set auth cookie path to apply site wide (determined-ai#668) commit 6588f77 Author: aaron276h <[email protected]> Date: Mon Jun 8 08:47:11 2020 -0400 feat: decouple agent information from workloads starting tasks [DET-3178] (determined-ai#631) This is a step toward adding a generic resource provider interface. As part of this change, Tasks now receive a single `TaskAssigned` message when they are assigned rather than one `Assigned` message per container. They also start containers by sending specs back to the cluster rather than directly to agents. commit f604a28 Author: Yoni Ben-tzur <[email protected]> Date: Fri Jun 5 16:44:11 2020 -0700 feat: read cookies in the new API auth module (determined-ai#665) commit 9da1063 Author: Danny Zhu <[email protected]> Date: Fri Jun 5 15:43:32 2020 -0700 fix: space out WebUI plot x-axis ticks a bit more (determined-ai#658) The ticks could get bunched up too much with long-running experiments before; this just makes sure the spacing is a bit wider by enough to work well in reasonable cases. The ticks may now occasionally look a tiny bit wider than ideal with very short experiments, but it still works fine in practice. Similarly, a long last tick could get cut off at the end, so this also bumps the right side spacing correspondingly. commit b9d9324 Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 15:24:54 2020 -0700 feat: support early stopping callbacks on a validation step (determined-ai#662) commit cfb3f51 Author: Yoni Ben-tzur <[email protected]> Date: Fri Jun 5 15:22:11 2020 -0700 feat: add user auth to new api (determined-ai#649) commit 414bfdf Author: Hamid Zare <[email protected]> Date: Fri Jun 5 14:07:22 2020 -0700 fix: set authentication failure reason synchronously. (determined-ai#659) commit ed94d86 Author: aaron276h <[email protected]> Date: Fri Jun 5 16:15:25 2020 -0400 feat: decouple agents from transmitting container status changes [DET-3174] (determined-ai#646) This is a step toward adding a generic resource provider interface. commit f27146a Author: Caleb Hoyoul Kang <[email protected]> Date: Fri Jun 5 13:52:03 2020 -0600 fix: address minor login issues (determined-ai#611) * Increase frequency of auto login check to once per second. * Add catch-all routes to redirect invalid route requests to valid routes. commit d014500 Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 11:47:43 2020 -0700 revert: "revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654)" (determined-ai#656) This reverts commit 5baea6a. commit 44a398a Author: Caleb Hoyoul Kang <[email protected]> Date: Fri Jun 5 10:38:40 2020 -0600 feat: ensure WebUI version is up to date with platform version (determined-ai#632) commit 5baea6a Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 08:55:58 2020 -0700 revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654) This reverts commit ee1314f. commit ee1314f Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 08:02:13 2020 -0700 feat: support stopping training in trial code [DET-3238] (determined-ai#648) commit fa09a74 Author: Brian Friedenberg <[email protected]> Date: Fri Jun 5 07:46:24 2020 -0700 ci: download protoc install to /tmp (determined-ai#653) commit 9759ce7 Author: determined-dsw <[email protected]> Date: Thu Jun 4 18:49:47 2020 -0700 docs: release notes for 0.12.5 (determined-ai#595) (determined-ai#651) (cherry picked from commit 071b3eb) Co-authored-by: Justin Chen <[email protected]> commit 5f476df Author: Hamid Zare <[email protected]> Date: Thu Jun 4 15:30:17 2020 -0700 chore: remove yarn mentions from tests (determined-ai#635) we removed yarn as a dependency a while ago and shouldn't have references to it anymore. commit 8662fda Author: Danny Zhu <[email protected]> Date: Thu Jun 4 13:03:22 2020 -0700 fix: correct filename in Elm Makefile (determined-ai#647) Due to a mismatch between different instances of what should've been the same name, the CSS file was getting built unnecessarily every time, considerably increasing the Elm build time. commit 0e7ca0a Author: Sidney Wijngaarde <[email protected]> Date: Thu Jun 4 12:42:41 2020 -0400 feat: add checkpoint metadata to cli describe commands (determined-ai#645) [DET-3210] commit 84e875a Author: aaron276h <[email protected]> Date: Thu Jun 4 11:56:52 2020 -0400 test: fix nightly nas and iris tf keras tests [DET-3264] (determined-ai#644) * docs: update NAS example to use correct gradient clipping * test: set random seed for nightly iris tf_keras test commit 4ff9fa0 Author: Sidney Wijngaarde <[email protected]> Date: Thu Jun 4 10:47:38 2020 -0400 feat: checkpoint metadata api (determined-ai#619) Adds checkpoint metadata management REST endpoints and python client methods. [DET-3207] [DET-3208] [DET-3209] PR includes an integration test covering the feature commit cbbe117 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 20:11:39 2020 -0700 chore: move proto files to determined namespace (determined-ai#639) commit fafd686 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 17:01:33 2020 -0700 feat: add template endpoints to new api (determined-ai#638) commit 4bad652 Author: Brian Friedenberg <[email protected]> Date: Wed Jun 3 16:51:57 2020 -0700 feat: support USER_CANCELLED exited reason (determined-ai#637) commit d1146d3 Author: Caleb Hoyoul Kang <[email protected]> Date: Wed Jun 3 17:37:04 2020 -0600 refactor: update link to support secure blank targets (determined-ai#612) commit f71d64e Author: Hamid Zare <[email protected]> Date: Wed Jun 3 15:33:15 2020 -0700 feat: add page component [DET-3232] (determined-ai#614) commit 25e725e Author: aaron276h <[email protected]> Date: Wed Jun 3 17:25:12 2020 -0400 feat: support gradient clipping in PyTorchTrial via callbacks (determined-ai#615) Breaking Change: we no longer accept gradient clipping as a special hyperparameter for PyTorchTrial. commit 80e39d0 Author: Hamid Zare <[email protected]> Date: Wed Jun 3 12:28:48 2020 -0700 feat: add antd breadcrumb stories [DET-3002] (determined-ai#582) commit 5c9afa2 Author: Hamid Zare <[email protected]> Date: Wed Jun 3 12:27:26 2020 -0700 feat: add activate, pause, and cancel actions to task cards [DET-2934] (determined-ai#585) commit a3e121a Author: aaron276h <[email protected]> Date: Wed Jun 3 13:34:10 2020 -0400 feat: add end of training callback to EstimatorTrial (determined-ai#621) This will allow users to do any post-experiment cleanup they may need to do. commit 8056055 Author: Shiyuan <[email protected]> Date: Wed Jun 3 10:24:40 2020 -0700 feat: make agent starting period configurable [DET-3219] (determined-ai#624) * feat: make agent starting period configurable The provisioner would retry launching an agent after reaching a duration, which is set by max agent starting period. Previously, this configuration is hardcoded to 300 seconds. Now, change it to be in the configuration of the provisioner. DET-3219 #Done. * test: update tests for max agent starting period * chore: clean up provisioner code for better readability * chore: clean up provisioner tests for better readability * docs: update docs for agent starting period * chore: update gcp deployment tool for agent starting period * chore: update packaged master.yaml with agent starting period * chore: update aws deployment tool with agent starting perioo commit 8fdc371 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 10:12:13 2020 -0700 chore: upgrade proto libraries (determined-ai#630) commit bdfd980 Author: Danny Zhu <[email protected]> Date: Wed Jun 3 09:47:02 2020 -0700 fix: correct logic for checking if a validation is the best one seen (determined-ai#601) Previously, every validation would be marked as the best one seen when `smallerIsBetter` was true. The new version avoids that and also separates the code a bit more cleanly into "decide whether the validation is best" followed by "do things accordingly". commit f590fc3 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 09:44:41 2020 -0700 chore: remove container recovery (determined-ai#629) commit a8c1bb2 Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 09:22:16 2020 -0700 feat: add master endpoint to new api (determined-ai#627) commit 678d53d Author: Yoni Ben-tzur <[email protected]> Date: Wed Jun 3 08:31:45 2020 -0700 chore: ignore pkg dir in proto sub project (determined-ai#628) commit 65b5c17 Author: Brian Friedenberg <[email protected]> Date: Wed Jun 3 07:07:19 2020 -0700 chore: bump version: 0.12.5.dev0 -> 0.12.6.dev0 (determined-ai#625) commit 13c0db2 Author: Yoni Ben-tzur <[email protected]> Date: Tue Jun 2 21:47:53 2020 -0700 chore: move proto to separate top level package (determined-ai#620) * chore: move proto to separate top level package * fix build issues * downgrade buf build for go1.13 * build proto before publish * proto build before package commit 897f2f6 Author: Shiyuan <[email protected]> Date: Tue Jun 2 17:29:30 2020 -0700 revert: make agent starting period configurable [DET-3219] (determined-ai#623) This reverts commit 7f83e97. commit 7f83e97 Author: Shiyuan <[email protected]> Date: Tue Jun 2 17:11:05 2020 -0700 feat: make agent starting period configurable [DET-3219] (determined-ai#610) * feat: make agent starting period configurable The provisioner would retry launching an agent after reaching a duration, which is set by max agent starting period. Previously, this configuration is hardcoded to 300 seconds. Now, change it to be in the configuration of the provisioner. DET-3219 #Done. * test: update tests for max agent starting period * chore: clean up provisioner code for better readability * chore: clean up provisioner tests for better readability * docs: update docs for agent starting period * chore: update gcp deployment tool for agent starting period * chore: update packaged master.yaml with agent starting period * chore: update aws deployment tool with agent starting perioo commit b01b560 Author: Armand McQueen <[email protected]> Date: Tue Jun 2 14:28:02 2020 -0700 fix: read docker config file from HOME directory (determined-ai#587) The path to the docker config file varies depending on location of HOME directory. When an agent was run somewhere where /root is not the HOME directory, this prevented using docker credential stores. commit e0d0447 Author: Shiyuan <[email protected]> Date: Tue Jun 2 13:37:48 2020 -0700 feat: make GCP operation tracker timeout configuration [DET-3182] (determined-ai#598) * feat: make GCP provider operation timeout configurable * docs: update the cluster configuration doc for gcp operation timeout * docs: update gcp topic guide with operation timeout * docs: update default master configuration for gcp operation timeout * feat: update GCP terraform script for operation timeout * feat: update GCP deploy tool for operation timeout
…ory (determined-ai#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (determined-ai#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
…ory (determined-ai#631) * use /var/tmp as tmpfs Co-authored-by: Pankaj Mandal <[email protected]>
Description
As part of this change, Tasks now receive a single
TaskAssigned
messagewhen they are assigned rather than one
Assigned
message per container.They also start containers by sending specs back to the cluster rather than directly to agents.
Test Plan
Updated existing tests which should provide coverage. Additionally going to perform manual testing to test multi-container changes, and sanity check commands.