Don't require a master or chief #192
Proposal: add a termination policy to the TfJob; something like:
The chief policy corresponds to waiting for a particular process, the chief, to exit. By letting the user specify the replica name and replica index, we can easily accommodate both the case where the chief is the master and the case where the chief is one of the workers.
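A hedged sketch of what such a terminationPolicy stanza might look like (the field names here are assumptions, not a settled API):

```yaml
# Hypothetical sketch of a TfJob termination policy.
terminationPolicy:
  chief:
    replicaName: MASTER   # which replica contains the chief process
    replicaIndex: 0       # which index within that replica is the chief
```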
The reason for using replicaName and not replicaType is that I suspect we will eventually want to replace replicaType with a set of attributes controlling various aspects of the replica, such as its termination policy (this was one of the suggestions that came up in the internal review; see #64). An example of where that might be useful is if we want to run model evaluation in a separate set of processes and terminate training when the evaluation metrics satisfy some criterion. For compatibility with the current use of replicaTypes, the replicaName will just be the string version of the enum.
Some getting-started pointers in case someone is considering taking this on:
The next PR would be to add support for a policy which uses WORKER replica 0 as the chief.
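For that follow-up policy, the same hypothetical stanza would instead point at worker 0 (field names are again assumptions):

```yaml
# Hypothetical sketch: treat worker 0 as the chief, so no master is needed.
terminationPolicy:
  chief:
    replicaName: WORKER
    replicaIndex: 0
```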
Allow training jobs to work without a master by treating one of the workers as the chief.
* Fixes #192
* This will allow TfJobs to be used with a lot of existing TensorFlow programs without modification, since using worker 0 as the chief is a common pattern. Currently, to run these programs using TfJobs you need to spin up a dummy TensorFlow gRPC server just to serve as the master.
* This is also necessary to support changes in the estimator API with TF 1.4 (#61)
We should consider removing the requirement to have a master.
A lot of TF code just uses worker 0 as the chief; an example is inception using slim. Adapting such code to work with the existing TfJob API can be awkward. If you try to treat the master as both the chief and a worker, you run into problems because the inception code assigns the ops to "/job:worker/device:GPU:0" with the expectation that worker maps to localhost, since no task is specified. This won't work on the master because its TF job name is "master", not "worker". The easiest workaround I found was to spin up a standard TF gRPC server for the master; this satisfied the TfJob requirement to have a master and also ensured all TF jobs in the ClusterSpec corresponded to valid gRPC endpoints.
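The placement problem above can be illustrated with a toy sketch (this is not TensorFlow's actual placement logic, just a simplified model of the job-name matching involved):

```python
# Toy illustration of why an op pinned to "/job:worker/device:GPU:0" with no
# task index breaks when the local process is registered under the job name
# "master" rather than "worker".

cluster_spec = {
    "master": ["master-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"],
    "ps": ["ps-0:2222"],
}

def resolves_locally(device_string, local_job_name):
    """An op pinned to /job:<name> with no task index can only be expected
    to map to localhost if the local process actually belongs to that job."""
    job = device_string.split("/job:")[1].split("/")[0]
    return job == local_job_name

# On worker 0, inception-style code works: /job:worker resolves locally.
assert resolves_locally("/job:worker/device:GPU:0", "worker")

# The same code run on the master fails: its job name is "master".
assert not resolves_locally("/job:worker/device:GPU:0", "master")
```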
TfJob could probably support code like this just by dropping the requirement that there be a master for a job. We'd have to adjust the exit criterion; we'd probably want to run until all workers finished.
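The two exit criteria being contrasted can be sketched as follows (function and status names are hypothetical, not the operator's actual API):

```python
# Hedged sketch of the two termination criteria discussed above.
# replica_statuses maps a replica name to {index: phase}.

def job_finished_chief(replica_statuses, chief_name, chief_index):
    # Chief policy: the job is done when the designated chief process exits.
    phase = replica_statuses.get(chief_name, {}).get(chief_index)
    return phase in ("Succeeded", "Failed")

def job_finished_all_workers(replica_statuses):
    # No-master policy: run until every worker replica has finished.
    workers = replica_statuses.get("WORKER", {})
    return all(p in ("Succeeded", "Failed") for p in workers.values())

statuses = {"WORKER": {0: "Succeeded", 1: "Running"}, "PS": {0: "Running"}}

# Treating worker 0 as the chief ends the job as soon as worker 0 exits...
assert job_finished_chief(statuses, "WORKER", 0)

# ...whereas the all-workers criterion keeps the job alive until every
# worker is done.
assert not job_finished_all_workers(statuses)
statuses["WORKER"][1] = "Succeeded"
assert job_finished_all_workers(statuses)
```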
Related to #61