Don't require a master or chief #192
Proposal: add a termination policy to the TfJob; something like:
The chief policy corresponds to waiting for a particular process, the chief, to exit. By letting the user specify the replica name and replica index, we can easily accommodate both the case where the chief is the master and the case where the chief is one of the workers.
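A hedged sketch of what such a terminationPolicy stanza might look like (the field names here are assumptions, not a settled API):

```yaml
# Hypothetical sketch of a TfJob termination policy.
terminationPolicy:
  chief:
    replicaName: MASTER   # which replica contains the chief process
    replicaIndex: 0       # which index within that replica is the chief
```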
The reason for using replicaName and not replicaType is that I suspect we will eventually want to replace replicaType with a set of attributes controlling various aspects of the replica, such as its termination policy (this was one of the suggestions that came up in the internal review; see #64). An example of where that might be useful is if we want to run model evaluation in a separate set of processes and terminate training when the evaluation metrics satisfy some criterion. For compatibility with the current use of replicaTypes, the replicaName will just be the string version of the enum.
Some getting-started pointers in case someone is considering taking this on:
The next PR would be to add support for a policy which uses WORKER replica 0 as the chief.
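For that follow-up policy, the same hypothetical stanza would instead point at worker 0 (field names are again assumptions):

```yaml
# Hypothetical sketch: treat worker 0 as the chief, so no master is needed.
terminationPolicy:
  chief:
    replicaName: WORKER
    replicaIndex: 0
```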
Allow training jobs to work without a master by treating one of the workers as the chief.
* Fixes #192
* This will allow TfJobs to be used with a lot of existing TensorFlow programs without modification, since using worker 0 as the chief is a common pattern. Currently, to run these programs using TfJobs you need to spin up a dummy TensorFlow gRPC server just to serve as the master.
* This is also necessary to support changes in the estimator API with TF 1.4 (#61)
We should consider removing the requirement to have a master.
A lot of TF code just uses worker 0 as the chief; an example is inception using slim. Adapting such code to work with the existing TfJob API can be awkward. If you try to treat the master as both the chief and a worker, you run into problems because the inception code assigns the ops to "/job:worker/device:GPU:0" with the expectation that worker maps to localhost, since no task is specified. This won't work on the master because its TF job name is "master", not "worker". The easiest workaround I found was to spin up a standard TF gRPC server for the master; this satisfied the TfJob requirement to have a master and also ensured all TF jobs in the ClusterSpec corresponded to valid gRPC endpoints.
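The placement problem above can be illustrated with a toy sketch (this is not TensorFlow's actual placement logic, just a simplified model of the job-name matching involved):

```python
# Toy illustration of why an op pinned to "/job:worker/device:GPU:0" with no
# task index breaks when the local process is registered under the job name
# "master" rather than "worker".

cluster_spec = {
    "master": ["master-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"],
    "ps": ["ps-0:2222"],
}

def resolves_locally(device_string, local_job_name):
    """An op pinned to /job:<name> with no task index can only be expected
    to map to localhost if the local process actually belongs to that job."""
    job = device_string.split("/job:")[1].split("/")[0]
    return job == local_job_name

# On worker 0, inception-style code works: /job:worker resolves locally.
assert resolves_locally("/job:worker/device:GPU:0", "worker")

# The same code run on the master fails: its job name is "master".
assert not resolves_locally("/job:worker/device:GPU:0", "master")
```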
TfJob could probably support code like this just by dropping the requirement that there be a master for a job. We'd have to adjust the exit criterion; we'd probably want to run until all workers finished.
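The two exit criteria being contrasted can be sketched as follows (function and status names are hypothetical, not the operator's actual API):

```python
# Hedged sketch of the two termination criteria discussed above.
# replica_statuses maps a replica name to {index: phase}.

def job_finished_chief(replica_statuses, chief_name, chief_index):
    # Chief policy: the job is done when the designated chief process exits.
    phase = replica_statuses.get(chief_name, {}).get(chief_index)
    return phase in ("Succeeded", "Failed")

def job_finished_all_workers(replica_statuses):
    # No-master policy: run until every worker replica has finished.
    workers = replica_statuses.get("WORKER", {})
    return all(p in ("Succeeded", "Failed") for p in workers.values())

statuses = {"WORKER": {0: "Succeeded", 1: "Running"}, "PS": {0: "Running"}}

# Treating worker 0 as the chief ends the job as soon as worker 0 exits...
assert job_finished_chief(statuses, "WORKER", 0)

# ...whereas the all-workers criterion keeps the job alive until every
# worker is done.
assert not job_finished_all_workers(statuses)
statuses["WORKER"][1] = "Succeeded"
assert job_finished_all_workers(statuses)
```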
Related to #61