Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tensorflow 1.4 and estimator support #61

Closed
jlewi opened this issue Oct 19, 2017 · 10 comments
Closed

tensorflow 1.4 and estimator support #61

jlewi opened this issue Oct 19, 2017 · 10 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Oct 19, 2017

In TensorFlow 1.4 TF_CONFIG uses "chief" and not "master" see here.

We should figure out what changes we should make to support this. We should also figure out how to continue supporting older versions of TF.

Estimator also added evaluation replicas which might not finish until after the master/chief finishes. So we will need to take evaluation replicas into account when determining job status.

jlewi pushed a commit that referenced this issue Dec 20, 2017
Allow training jobs to work without a master by treating one of the workers as the chiefs.

* Fixes #192

* This will allow TfJobs to be used with a lot of existing TensorFlow programs without modification since using worker 0 as the chief is a common pattern. Currently to run these programs using TfJob's  you need to spin up a dummy TensorFlow gRPC server just to serve as the master.

* This is also necessary to support changes in estimator API with TF 1.4 (#61)
@jlewi jlewi added this to the Kubecon Europe milestone Jan 25, 2018
@jlewi jlewi added the area/api label Jan 25, 2018
@ScorpioCPH
Copy link
Member

how to continue supporting older versions of TF.

How about claim the minimum version we supported now.

evaluation replicas

Is this will be a new TFReplicaType in the future?

@jlewi
Copy link
Contributor Author

jlewi commented Feb 12, 2018

Is this will be a new TFReplicaType in the future?

I think we should get rid of TFReplicaType and allow TFJobs to have arbitrary number of replicas identified by a unique name and use properties to control various behaviors (e.g. TerminationPolicy)

@bhack
Copy link

bhack commented Mar 4, 2018

What is the status of estimators created from tf.keras? See tensorflow/tensorflow#14504 (comment)

@jlewi
Copy link
Contributor Author

jlewi commented Mar 4, 2018

@bhack Can you be more specific? I haven't tried to use tf.keras estimators, do they require special support from TFJob?

@bhack
Copy link

bhack commented Mar 4, 2018

Was just to remember to check if models defined with the new tf.keras core high level api are compatibile with a distributed config when we will introduce estimators.

@bhack
Copy link

bhack commented Mar 4, 2018

I was referring to estimators created with https://www.tensorflow.org/api_docs/python/tf/keras/estimator/model_to_estimator

@jlewi
Copy link
Contributor Author

jlewi commented Mar 5, 2018

@bhack Thanks for raising this. I don't know. Would be great if someone could investigate and figure out what if any changes are needed to support them.

@jlewi
Copy link
Contributor Author

jlewi commented Mar 7, 2018

This could potentially be solved by the v2 API (kubeflow/community#30). In particular, if for our V2 API we get rid of replica type and let the user pick the names for replicas then user could pick whatever name matches TF_CONFIG.

@bhack
Copy link

bhack commented Mar 7, 2018

@jlewi Follow upstream tensorflow/tensorflow#14504. Many people are asking clarification/documentation about this.

@jlewi
Copy link
Contributor Author

jlewi commented Apr 30, 2018

Closing this bug since it should be fixed by our v1alpha2 API and we have bugs already tracking that.

@jlewi jlewi closed this as completed Apr 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants