Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: decouple agent information from workloads starting tasks [DET-3178] #631

Merged
merged 1 commit into from
Jun 8, 2020

Conversation

aaron276h
Copy link
Contributor

@aaron276h aaron276h commented Jun 3, 2020

Description

As part of this change, Tasks now receive a single TaskAssigned message
when they are assigned rather than one Assigned message per container.
They also start containers by sending specs back to the cluster rather than directly to agents.

Test Plan

Updated existing tests which should provide coverage. Additionally going to perform manual testing to test multi-container changes, and sanity check commands.

@aaron276h aaron276h requested a review from dzhu June 3, 2020 19:06
@cla-bot cla-bot bot added the cla-signed label Jun 3, 2020
@aaron276h
Copy link
Contributor Author

aaron276h commented Jun 3, 2020

Performed manual tests confirming multi-container triale work as expected. Also checked commands, tensorboard, and notebooks just in case.

@aaron276h aaron276h force-pushed the DET-3178 branch 2 times, most recently from 2ece136 to 8bf902d Compare June 4, 2020 19:20
Copy link
Contributor

@dzhu dzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a slightly more detailed summary of what changed in the PR/commit message would be helpful. Something like "Tasks now receive a single `TaskAssigned` message when they are assigned rather than one `Assigned` message per container. They also start containers by sending specs back to the cluster rather than directly to agents.

master/internal/scheduler/cluster.go Outdated Show resolved Hide resolved
master/internal/scheduler/cluster.go Outdated Show resolved Hide resolved
master/internal/scheduler/cluster.go Outdated Show resolved Hide resolved
master/internal/scheduler/cluster.go Outdated Show resolved Hide resolved
master/internal/scheduler/cluster.go Outdated Show resolved Hide resolved
@@ -6,6 +6,7 @@ import (
"github.com/google/uuid"

"github.com/determined-ai/determined/master/pkg/actor"
image "github.com/determined-ai/determined/master/pkg/tasks"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to have the "image" alias?

master/internal/scheduler/resource_provider.go Outdated Show resolved Hide resolved
master/internal/scheduler/cluster.go Outdated Show resolved Hide resolved
master/internal/scheduler/task.go Show resolved Hide resolved
@dzhu dzhu assigned aaron276h and unassigned dzhu Jun 5, 2020
@aaron276h aaron276h force-pushed the DET-3178 branch 2 times, most recently from 07cca1a to 0332b33 Compare June 5, 2020 16:30
@aaron276h aaron276h assigned dzhu and unassigned aaron276h Jun 5, 2020
Copy link
Contributor

@dzhu dzhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should've thought of this before, but, under the principle of "not just what, but why", perhaps there should also be something like "This is a step toward adding a generic resource provider interface" in the message for posterity's sake.

func (c *Cluster) receiveStartTask(ctx *actor.Context, msg StartTask) {
task := c.tasksByHandler[ctx.Sender()]
if task == nil {
ctx.Log().WithField("address", ctx.Sender().Address).Errorf("unknown task trying to start")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad:

Suggested change
ctx.Log().WithField("address", ctx.Sender().Address).Errorf("unknown task trying to start")
ctx.Log().WithField("address", ctx.Sender().Address()).Errorf("unknown task trying to start")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, just canceling out the fact that I put a bad suggestion in the first place. 🤦‍♂️

@dzhu dzhu assigned aaron276h and unassigned dzhu Jun 5, 2020
This is a step toward adding a generic resource provider interface.
As part of this change, Tasks now receive a single `TaskAssigned` message
when they are assigned rather than one `Assigned` message per container.
They also start containers by sending specs back to the cluster rather
than directly to agents.

DET-3178  #Done.
@aaron276h aaron276h merged commit 6588f77 into determined-ai:master Jun 8, 2020
hamidzr added a commit to hamidzr/determined that referenced this pull request Jun 9, 2020
Squashed commit of the following:

commit 3b07678
Author: Hamid Zare <[email protected]>
Date:   Mon Jun 8 16:16:48 2020 -0700

    feat: add task list page route and placeholder [DET-3220] (determined-ai#636)

commit 4c2d0a6
Author: Hamid Zare <[email protected]>
Date:   Mon Jun 8 16:15:23 2020 -0700

    feat: remember last logged in username [DET-3274] (determined-ai#660)

commit 18c8125
Author: Hamid Zare <[email protected]>
Date:   Mon Jun 8 13:25:46 2020 -0700

    refactor: set up experiments context [DET-3255] (determined-ai#640)

commit 5e5b188
Author: Neil Conway <[email protected]>
Date:   Mon Jun 8 12:49:34 2020 -0700

    chore: add license to pip metadata (determined-ai#669)

    Without this change, `pip show determined-cli | grep License` returns:

        License: UNKNOWN

commit 05aa3d2
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 11:53:17 2020 -0700

    feat: support TF Keras EarlyStopping callbacks [DET-3240] (determined-ai#666)

commit 4056146
Author: aaron276h <[email protected]>
Date:   Mon Jun 8 14:21:13 2020 -0400

    docs: add to FAQ how to port a TF core graph model (determined-ai#650)

commit c8bb942
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 10:45:39 2020 -0700

    feat: support Estimator early stopping hooks [DET-3239] (determined-ai#661)

commit 3ab90a6
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 10:44:17 2020 -0700

    test: temporarily disable AMP test since it causes NaNs (determined-ai#670)

commit 629f106
Author: Brian Friedenberg <[email protected]>
Date:   Mon Jun 8 09:22:37 2020 -0700

    feat: treat NaN metrics as an error (determined-ai#667)

    The expected behavior when we hit a NaN is to error the trial. This will restart the trial if we have not restarted max_restarts times. Before we would convert the NaN to the maximum float.

commit db76932
Author: Yoni Ben-tzur <[email protected]>
Date:   Mon Jun 8 08:50:48 2020 -0700

    fix: set auth cookie path to apply site wide (determined-ai#668)

commit 6588f77
Author: aaron276h <[email protected]>
Date:   Mon Jun 8 08:47:11 2020 -0400

    feat: decouple agent information from workloads starting tasks [DET-3178] (determined-ai#631)

    This is a step toward adding a generic resource provider interface.
    As part of this change, Tasks now receive a single `TaskAssigned` message
    when they are assigned rather than one `Assigned` message per container.
    They also start containers by sending specs back to the cluster rather
    than directly to agents.

commit f604a28
Author: Yoni Ben-tzur <[email protected]>
Date:   Fri Jun 5 16:44:11 2020 -0700

    feat: read cookies in the new API auth module (determined-ai#665)

commit 9da1063
Author: Danny Zhu <[email protected]>
Date:   Fri Jun 5 15:43:32 2020 -0700

    fix: space out WebUI plot x-axis ticks a bit more (determined-ai#658)

    The ticks could get bunched up too much with long-running experiments
    before; this just makes sure the spacing is a bit wider by enough to
    work well in reasonable cases. The ticks may now occasionally look a
    tiny bit wider than ideal with very short experiments, but it still
    works fine in practice. Similarly, a long last tick could get cut off at
    the end, so this also bumps the right side spacing correspondingly.

commit b9d9324
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 15:24:54 2020 -0700

    feat: support early stopping callbacks on a validation step (determined-ai#662)

commit cfb3f51
Author: Yoni Ben-tzur <[email protected]>
Date:   Fri Jun 5 15:22:11 2020 -0700

    feat: add user auth to new api (determined-ai#649)

commit 414bfdf
Author: Hamid Zare <[email protected]>
Date:   Fri Jun 5 14:07:22 2020 -0700

    fix: set authentication failure reason synchronously. (determined-ai#659)

commit ed94d86
Author: aaron276h <[email protected]>
Date:   Fri Jun 5 16:15:25 2020 -0400

    feat: decouple agents from transmitting container status changes [DET-3174] (determined-ai#646)

    This is a step toward adding a generic resource provider interface.

commit f27146a
Author: Caleb Hoyoul Kang <[email protected]>
Date:   Fri Jun 5 13:52:03 2020 -0600

    fix: address minor login issues (determined-ai#611)

    * Increase frequency of auto login check to once per second.
    * Add catch-all routes to redirect invalid route requests to valid routes.

commit d014500
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 11:47:43 2020 -0700

    revert: "revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654)" (determined-ai#656)

    This reverts commit 5baea6a.

commit 44a398a
Author: Caleb Hoyoul Kang <[email protected]>
Date:   Fri Jun 5 10:38:40 2020 -0600

    feat: ensure WebUI version is up to date with platform version (determined-ai#632)

commit 5baea6a
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 08:55:58 2020 -0700

    revert: "feat: support stopping training in trial code [DET-3238] (determined-ai#648)" (determined-ai#654)

    This reverts commit ee1314f.

commit ee1314f
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 08:02:13 2020 -0700

    feat: support stopping training in trial code [DET-3238] (determined-ai#648)

commit fa09a74
Author: Brian Friedenberg <[email protected]>
Date:   Fri Jun 5 07:46:24 2020 -0700

    ci: download protoc install to /tmp (determined-ai#653)

commit 9759ce7
Author: determined-dsw <[email protected]>
Date:   Thu Jun 4 18:49:47 2020 -0700

    docs: release notes for 0.12.5 (determined-ai#595) (determined-ai#651)

    (cherry picked from commit 071b3eb)

    Co-authored-by: Justin Chen <[email protected]>

commit 5f476df
Author: Hamid Zare <[email protected]>
Date:   Thu Jun 4 15:30:17 2020 -0700

    chore: remove yarn mentions from tests (determined-ai#635)

    we removed yarn as a dependency a while ago and shouldn't have references to it anymore.

commit 8662fda
Author: Danny Zhu <[email protected]>
Date:   Thu Jun 4 13:03:22 2020 -0700

    fix: correct filename in Elm Makefile (determined-ai#647)

    Due to a mismatch between different instances of what should've been the
    same name, the CSS file was getting built unnecessarily every time,
    considerably increasing the Elm build time.

commit 0e7ca0a
Author: Sidney Wijngaarde <[email protected]>
Date:   Thu Jun 4 12:42:41 2020 -0400

    feat: add checkpoint metadata to cli describe commands (determined-ai#645)

    [DET-3210]

commit 84e875a
Author: aaron276h <[email protected]>
Date:   Thu Jun 4 11:56:52 2020 -0400

    test: fix nightly nas and iris tf keras tests [DET-3264] (determined-ai#644)

    * docs: update NAS example to use correct gradient clipping
    * test: set random seed for nightly iris tf_keras test

commit 4ff9fa0
Author: Sidney Wijngaarde <[email protected]>
Date:   Thu Jun 4 10:47:38 2020 -0400

    feat: checkpoint metadata api (determined-ai#619)

    Adds checkpoint metadata management REST endpoints and python client methods.

    [DET-3207] [DET-3208] [DET-3209]

    PR includes an integration test covering the feature

commit cbbe117
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 20:11:39 2020 -0700

    chore: move proto files to determined namespace (determined-ai#639)

commit fafd686
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 17:01:33 2020 -0700

    feat: add template endpoints to new api (determined-ai#638)

commit 4bad652
Author: Brian Friedenberg <[email protected]>
Date:   Wed Jun 3 16:51:57 2020 -0700

    feat: support USER_CANCELLED exited reason (determined-ai#637)

commit d1146d3
Author: Caleb Hoyoul Kang <[email protected]>
Date:   Wed Jun 3 17:37:04 2020 -0600

    refactor: update link to support secure blank targets (determined-ai#612)

commit f71d64e
Author: Hamid Zare <[email protected]>
Date:   Wed Jun 3 15:33:15 2020 -0700

    feat: add page component [DET-3232] (determined-ai#614)

commit 25e725e
Author: aaron276h <[email protected]>
Date:   Wed Jun 3 17:25:12 2020 -0400

    feat: support gradient clipping in PyTorchTrial via callbacks (determined-ai#615)

    Breaking Change: we no longer accept gradient clipping as a special
    hyperparameter for PyTorchTrial.

commit 80e39d0
Author: Hamid Zare <[email protected]>
Date:   Wed Jun 3 12:28:48 2020 -0700

    feat: add antd breadcrumb stories [DET-3002] (determined-ai#582)

commit 5c9afa2
Author: Hamid Zare <[email protected]>
Date:   Wed Jun 3 12:27:26 2020 -0700

    feat: add activate, pause, and cancel actions to task cards [DET-2934] (determined-ai#585)

commit a3e121a
Author: aaron276h <[email protected]>
Date:   Wed Jun 3 13:34:10 2020 -0400

    feat: add end of training callback to EstimatorTrial (determined-ai#621)

    This will allow users to do any post-experiment cleanup they may need to do.

commit 8056055
Author: Shiyuan <[email protected]>
Date:   Wed Jun 3 10:24:40 2020 -0700

    feat: make agent starting period configurable [DET-3219] (determined-ai#624)

    * feat: make agent starting period configurable

    The provisioner would retry launching an agent after reaching
    a duration, which is set by max agent starting period. Previously,
    this configuration is hardcoded to 300 seconds. Now, change it to
    be in the configuration of the provisioner.

    DET-3219 #Done.

    * test: update tests for max agent starting period

    * chore: clean up provisioner code for better readability

    * chore: clean up provisioner tests for better readability

    * docs: update docs for agent starting period

    * chore: update gcp deployment tool for agent starting period

    * chore: update packaged master.yaml with agent starting period

    * chore: update aws deployment tool with agent starting perioo

commit 8fdc371
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 10:12:13 2020 -0700

    chore: upgrade proto libraries (determined-ai#630)

commit bdfd980
Author: Danny Zhu <[email protected]>
Date:   Wed Jun 3 09:47:02 2020 -0700

    fix: correct logic for checking if a validation is the best one seen (determined-ai#601)

    Previously, every validation would be marked as the best one seen when
    `smallerIsBetter` was true. The new version avoids that and also
    separates the code a bit more cleanly into "decide whether the
    validation is best" followed by "do things accordingly".

commit f590fc3
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 09:44:41 2020 -0700

    chore: remove container recovery (determined-ai#629)

commit a8c1bb2
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 09:22:16 2020 -0700

    feat: add master endpoint to new api (determined-ai#627)

commit 678d53d
Author: Yoni Ben-tzur <[email protected]>
Date:   Wed Jun 3 08:31:45 2020 -0700

    chore: ignore pkg dir in proto sub project (determined-ai#628)

commit 65b5c17
Author: Brian Friedenberg <[email protected]>
Date:   Wed Jun 3 07:07:19 2020 -0700

    chore: bump version: 0.12.5.dev0 -> 0.12.6.dev0 (determined-ai#625)

commit 13c0db2
Author: Yoni Ben-tzur <[email protected]>
Date:   Tue Jun 2 21:47:53 2020 -0700

    chore: move proto to separate top level package (determined-ai#620)

    * chore: move proto to separate top level package

    * fix build issues

    * downgrade buf build for go1.13

    * build proto before publish

    * proto build before package

commit 897f2f6
Author: Shiyuan <[email protected]>
Date:   Tue Jun 2 17:29:30 2020 -0700

    revert: make agent starting period configurable [DET-3219] (determined-ai#623)

    This reverts commit 7f83e97.

commit 7f83e97
Author: Shiyuan <[email protected]>
Date:   Tue Jun 2 17:11:05 2020 -0700

    feat: make agent starting period configurable [DET-3219] (determined-ai#610)

    * feat: make agent starting period configurable

    The provisioner would retry launching an agent after reaching
    a duration, which is set by max agent starting period. Previously,
    this configuration is hardcoded to 300 seconds. Now, change it to
    be in the configuration of the provisioner.

    DET-3219 #Done.

    * test: update tests for max agent starting period

    * chore: clean up provisioner code for better readability

    * chore: clean up provisioner tests for better readability

    * docs: update docs for agent starting period

    * chore: update gcp deployment tool for agent starting period

    * chore: update packaged master.yaml with agent starting period

    * chore: update aws deployment tool with agent starting perioo

commit b01b560
Author: Armand McQueen <[email protected]>
Date:   Tue Jun 2 14:28:02 2020 -0700

    fix: read docker config file from HOME directory (determined-ai#587)

    The path to the docker config file varies depending on location of HOME directory. When an agent was run somewhere where /root is not the HOME directory, this prevented using docker credential stores.

commit e0d0447
Author: Shiyuan <[email protected]>
Date:   Tue Jun 2 13:37:48 2020 -0700

    feat: make GCP operation tracker timeout configuration [DET-3182] (determined-ai#598)

    * feat: make GCP provider operation timeout configurable

    * docs: update the cluster configuration doc for gcp operation timeout

    * docs: update gcp topic guide with operation timeout

    * docs: update default master configuration for gcp operation timeout

    * feat: update GCP terraform script for operation timeout

    * feat: update GCP deploy tool for operation timeout
tayritenour pushed a commit to tayritenour/determined that referenced this pull request Apr 25, 2023
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Jun 23, 2023
stoksc pushed a commit that referenced this pull request Jun 26, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
eecsliu pushed a commit that referenced this pull request Jun 28, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
eecsliu pushed a commit that referenced this pull request Jun 28, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
stoksc pushed a commit that referenced this pull request Jul 20, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
eecsliu pushed a commit that referenced this pull request Jul 24, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
stoksc pushed a commit that referenced this pull request Oct 17, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
azhou-determined pushed a commit that referenced this pull request Dec 7, 2023
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
wes-turner pushed a commit that referenced this pull request Feb 2, 2024
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
@dannysauer dannysauer added this to the 0.12.7 milestone Feb 6, 2024
rb-determined-ai pushed a commit that referenced this pull request Feb 29, 2024
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
amandavialva01 pushed a commit that referenced this pull request Mar 18, 2024
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
eecsliu pushed a commit that referenced this pull request Apr 18, 2024
…ory (#631)

* use /var/tmp as tmpfs
Co-authored-by: Pankaj Mandal <[email protected]>
eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants