tensorflow 1.4 and estimator support #61
Comments
Allow training jobs to work without a master by treating one of the workers as the chief.
* Fixes #192
* This allows TfJobs to be used with many existing TensorFlow programs without modification, since using worker 0 as the chief is a common pattern. Currently, running these programs with TfJob requires spinning up a dummy TensorFlow gRPC server just to serve as the master.
* This is also necessary to support the estimator API changes in TF 1.4 (#61)
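To make the "worker 0 as chief" pattern concrete, here is a minimal sketch of how a chief could be chosen when no master replica is defined. `find_chief` and the shape of `replicas` are hypothetical, made up for illustration; this is not the actual TfJob code.

```python
def find_chief(replicas):
    """Return (replica_type, index) of the replica that acts as chief.

    `replicas` maps a replica type name to its replica count,
    e.g. {"worker": 3, "ps": 2}. Hypothetical helper, not TfJob code.
    """
    # An explicit master keeps its role, as it does today.
    if replicas.get("master", 0) > 0:
        return ("master", 0)
    # Otherwise fall back to the common pattern: worker 0 is the chief.
    if replicas.get("worker", 0) > 0:
        return ("worker", 0)
    raise ValueError("job needs a master or at least one worker")
```

With something like this, a job with only workers and parameter servers would not need a dummy master just to have a chief.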
How about declaring the minimum version we currently support?
Will this be a new |
I think we should get rid of TFReplicaType and allow TFJobs to have an arbitrary number of replicas, each identified by a unique name, and use properties to control various behaviors (e.g. TerminationPolicy).
What is the status of estimators created from tf.keras? See tensorflow/tensorflow#14504 (comment)
@bhack Can you be more specific? I haven't tried to use tf.keras estimators; do they require special support from TFJob?
This was just a reminder to check whether models defined with the new tf.keras core high-level API are compatible with a distributed config when we introduce estimators.
I was referring to estimators created with https://www.tensorflow.org/api_docs/python/tf/keras/estimator/model_to_estimator
@bhack Thanks for raising this. I don't know. It would be great if someone could investigate and figure out what changes, if any, are needed to support them.
This could potentially be solved by the v2 API (kubeflow/community#30). In particular, if our v2 API gets rid of replica type and lets the user pick the names for replicas, then the user could pick whatever name matches TF_CONFIG.
@jlewi Follow upstream tensorflow/tensorflow#14504. Many people are asking for clarification/documentation about this.
Closing this bug since it should be fixed by our v1alpha2 API and we have bugs already tracking that. |
In TensorFlow 1.4, TF_CONFIG uses "chief" and not "master"; see here.
We should figure out what changes we should make to support this. We should also figure out how to continue supporting older versions of TF.
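As a rough illustration of supporting both naming schemes, the sketch below builds a TF_CONFIG string and picks "chief" or "master" based on the TensorFlow version. `build_tf_config` and its parameters are hypothetical, the version is assumed to be passed as a `(major, minor)` tuple, and the cluster is assumed to contain at most one of "chief"/"master". Older TF versions may expect additional fields that are not shown here.

```python
import json


def build_tf_config(cluster, task_type, task_index, tf_version):
    """Return a TF_CONFIG JSON string for one replica (illustrative only).

    cluster:    dict mapping job name -> list of "host:port" strings
    tf_version: (major, minor) tuple, e.g. (1, 4); an assumed convention
    """
    # TF >= 1.4 estimators expect "chief"; older versions use "master".
    chief_name = "chief" if tuple(tf_version) >= (1, 4) else "master"

    def rename(name):
        # Map whichever chief-like name the job spec used to the one
        # this TF version expects; leave other job names untouched.
        return chief_name if name in ("chief", "master") else name

    config = {
        "cluster": {rename(name): hosts for name, hosts in cluster.items()},
        "task": {"type": rename(task_type), "index": task_index},
    }
    return json.dumps(config, sort_keys=True)
```

For example, `build_tf_config({"master": ["m:2222"], "worker": ["w0:2222"]}, "master", 0, (1, 4))` would emit a cluster with a "chief" job, while passing `(1, 3)` keeps "master".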
Estimator also added evaluation replicas which might not finish until after the master/chief finishes. So we will need to take evaluation replicas into account when determining job status.
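One possible policy, sketched below, is to treat the job as finished only once both the chief and any evaluation replicas are done. The function name, the state strings, and the policy itself are assumptions for illustration; "evaluator" is the task type that estimators use for evaluation in TF_CONFIG.

```python
def job_state(replica_states, chief="chief"):
    """Derive an overall job state from per-replica states (sketch only).

    replica_states maps a replica type to a list of states, each one of
    "Running", "Succeeded" or "Failed" (assumed names, not a real API).
    """
    chief_states = replica_states.get(chief, [])
    if any(s == "Failed" for s in chief_states):
        return "Failed"
    # No chief reported yet, or the chief is still working.
    if not chief_states or any(s != "Succeeded" for s in chief_states):
        return "Running"

    # The chief is done, but evaluators may still be finishing up.
    eval_states = replica_states.get("evaluator", [])
    if any(s == "Failed" for s in eval_states):
        return "Failed"
    if any(s != "Succeeded" for s in eval_states):
        return "Running"
    return "Succeeded"
```

Workers and parameter servers are deliberately ignored here; whether a failed evaluator should fail the whole job is an open design question.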