[training] Tensorflow interface for MultiNode SGD #5440

jichan3751 · 2019-08-12T19:40:50Z

What do these changes do?

Creates a TensorFlow interface for MultiNode SGD.

TODO:

Smoke test data_augmentation_creator
Smoke test regular data_creator
Write docs
Verify that tests pass
Verify that this works on multi-node multi-gpu

Linter

I've run scripts/format.sh to lint the changes in this PR.

AmplabJenkins · 2019-08-12T23:13:53Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16240/
Test FAILed.

richardliaw · 2019-08-12T23:50:30Z

BTW, I don't think you need to fit the PyTorch API so closely. I think you should first get the Distributed TF example running in Ray, and then think about APIs afterwards.

AmplabJenkins · 2019-08-13T13:37:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16258/
Test FAILed.

jichan3751 · 2019-08-17T11:36:43Z

Looks like there are problems with model saving / loading.
Basically, three pieces of data are needed to restore a model:

structure config
weights
optimizer weights
These are saved and loaded without problems when writing to / reading from disk using model.save().
However, it is a little tricky to apply the optimizer weights if we want to restore the model from Python objects: the optimizer weights are [] if model.fit() has not been called, and this prevents us from setting them from a Python weights object.
I included lots of hacks to get around this. Let me know your suggestions.
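
The restore problem described above can be illustrated without TensorFlow. The following is a toy sketch only: `ToyOptimizer` and `ToyModel` are hypothetical stand-ins, not Keras classes and not the code in this PR. It assumes an optimizer that creates its per-parameter state lazily on the first training step, so a freshly built model has an empty state list and rejects restored optimizer weights until that state has been created.

```python
class ToyOptimizer:
    """Hypothetical momentum optimizer whose state is created lazily."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.weights = []  # stays empty until the first training step

    def build(self, num_params):
        # Create one momentum slot per model parameter (only once).
        if not self.weights:
            self.weights = [0.0] * num_params

    def set_weights(self, new_weights):
        # Mirrors the failure mode: shapes must match the existing slots.
        if len(new_weights) != len(self.weights):
            raise ValueError(
                "expected %d optimizer weights, got %d"
                % (len(self.weights), len(new_weights)))
        self.weights = list(new_weights)

    def step(self, params, grads):
        self.build(len(params))
        self.weights = [
            self.momentum * v - self.lr * g
            for v, g in zip(self.weights, grads)
        ]
        return [p + v for p, v in zip(params, self.weights)]


class ToyModel:
    """Hypothetical model with in-memory get_state / set_state."""

    def __init__(self):
        self.params = [0.0, 0.0]
        self.optimizer = ToyOptimizer()

    def fit_one_step(self, grads):
        self.params = self.optimizer.step(self.params, grads)

    def get_state(self):
        return {
            "params": list(self.params),
            "optimizer_weights": list(self.optimizer.weights),
        }

    def set_state(self, state):
        self.params = list(state["params"])
        if state["optimizer_weights"]:
            # Workaround: force the optimizer to create its slots first,
            # otherwise set_weights() fails on a model that never trained.
            self.optimizer.build(len(self.params))
            self.optimizer.set_weights(state["optimizer_weights"])


trained = ToyModel()
trained.fit_one_step([1.0, -2.0])

restored = ToyModel()
restored.set_state(trained.get_state())
```

In this sketch the explicit `build()` call plays the role of the hacks mentioned above: it creates the optimizer state up front so that restoring from Python objects behaves like restoring from disk.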

AmplabJenkins · 2019-08-17T14:48:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16365/
Test FAILed.

richardliaw · 2019-08-19T07:45:38Z

python/ray/experimental/sgd/tensorflow/distributed_tensorflow_runner.py

+logger = logging.getLogger(__name__)
+
+
+class DistributedTensorFlowRunner(TensorFlowRunner):


Is the inheritance necessary here?

TensorFlowRunner's get_state and set_state methods can be used to get the current model, so I think inheritance is needed here.

I think you should just define get_state and set_state in this class, so we don't need a separate DistributedTensorFlowRunner class, just a regular TensorFlowRunner.

AmplabJenkins · 2019-08-19T17:12:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16394/
Test FAILed.

AmplabJenkins · 2019-08-20T01:35:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16404/
Test FAILed.

richardliaw · 2019-08-20T07:30:29Z

python/ray/experimental/sgd/examples/tensorflow_train_example.py

+from ray.experimental.sgd.tensorflow.tensorflow_trainer import (
+    TensorFlowTrainer, TensorFlowTrainable)
+
+from ray.experimental.sgd.tests.tf_helper import (get_model, get_dataset)


For this, I think we should not use tf_helper and define get_model and get_dataset in this file. PyTorch should be updated in a separate PR.

richardliaw · 2019-08-20T07:30:55Z

python/ray/experimental/sgd/examples/tf-example-sgd.yaml

+    - pip install -U tensorflow-gpu==2.0.0-beta1
+
+file_mounts: {
+    ~/run/: /Users/jichanchung/OneDrive/FF/190812_tf2_tune_trainable/190814_multinode_mnist_ray


richardliaw · 2019-08-20T07:33:28Z

python/ray/experimental/sgd/tensorflow/distributed_tensorflow_runner.py

+
+        return stats
+
+    def validate(self):


Validate should only be called on the test_dataset.
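
A minimal sketch of the separation being requested, under stated assumptions: `ToyRunner` and `MeanModel` are hypothetical and are not the classes in this PR, and `data_creator` is assumed to return a `(train_dataset, test_dataset)` pair. The point is only that `step()` touches the training data while `validate()` reads the held-out data and never updates the model.

```python
class MeanModel:
    """Hypothetical model that predicts the running mean of the targets."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, x, y):
        self.total += y
        self.count += 1

    def predict(self, x):
        return self.total / self.count if self.count else 0.0

    def loss(self, x, y):
        return (self.predict(x) - y) ** 2


class ToyRunner:
    """Hypothetical runner: trains on train_dataset, validates on test_dataset."""

    def __init__(self, model_creator, data_creator):
        self.model = model_creator()
        self.train_dataset, self.test_dataset = data_creator()
        self.epoch = 0

    def step(self):
        # One pass over the training data only.
        for x, y in self.train_dataset:
            self.model.update(x, y)
        self.epoch += 1
        return {"epoch": self.epoch}

    def validate(self):
        # Evaluate on the held-out data only; no model updates here.
        losses = [self.model.loss(x, y) for x, y in self.test_dataset]
        return {"validation_loss": sum(losses) / len(losses)}


def toy_data_creator():
    train = [(0, 1.0), (1, 3.0)]
    test = [(2, 2.0), (3, 4.0)]
    return train, test


runner = ToyRunner(MeanModel, toy_data_creator)
train_stats = runner.step()
val_stats = runner.validate()
```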

richardliaw · 2019-08-20T07:33:54Z

python/ray/experimental/sgd/tensorflow/tensorflow_runner.py

@@ -0,0 +1,98 @@
+from __future__ import absolute_import


You should merge this with DistributedTensorFlowRunner, as commented in that file.

richardliaw · 2019-08-20T07:41:01Z

python/ray/experimental/sgd/tensorflow/tensorflow_runner.py

+
+    def set_state(self, state):
+        self.epoch = state["epoch"]
+        if self.model.optimizer.weights == []:


I don't quite understand this issue or this comment. Can you:

provide a comment describing exactly what the error is, and

provide in the code a link to the Stack Overflow answer or TensorFlow GitHub issue that suggested this workaround?

richardliaw · 2019-08-20T07:41:45Z

python/ray/experimental/sgd/tests/tf_helper.py

@@ -0,0 +1,38 @@
+from __future__ import absolute_import, division, print_function, unicode_literals


This should be on separate lines, and I don't think you need unicode_literals.

richardliaw · 2019-08-20T07:41:49Z

python/ray/experimental/sgd/tests/tf_helper.py

+from __future__ import absolute_import, division, print_function, unicode_literals
+import tensorflow as tf
+
+NUM_TRAIN_SAMPLES = 60000


can we make this 512?

shuffle(NUM_TRAIN_SAMPLES) is used to shuffle the whole dataset.
Do you mean we should only use 512 data points from the MNIST dataset?
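
The question above turns on how the shuffle buffer size relates to the dataset size. The sketch below is a pure-Python model of buffer-based shuffling as the tf.data documentation describes it (fill a buffer, emit a random element from it, refill from the input); it is an illustration under that assumption, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing is emitted yet.
            buffer.append(item)
            continue
        # Emit a random buffered element and replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item


data = list(range(1000))

# Buffer as large as the dataset: any element can land in any position.
full = list(buffered_shuffle(data, buffer_size=len(data)))

# Small buffer: output position k can only hold inputs with index < k + 10.
local = list(buffered_shuffle(data, buffer_size=10))
```

With a buffer of 512 over 60000 samples, every sample is still visited once per pass, but the order is only locally randomized. A buffer equal to the dataset size gives a uniform shuffle at the cost of holding the whole dataset in memory.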

richardliaw · 2019-08-20T07:42:27Z

python/ray/experimental/sgd/tests/tf_helper.py

@@ -0,0 +1,38 @@
+from __future__ import absolute_import, division, print_function, unicode_literals


can you copy these functions to the example?

AmplabJenkins · 2019-08-20T15:32:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16415/
Test FAILed.

AmplabJenkins · 2019-08-22T16:17:33Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16462/
Test FAILed.

AmplabJenkins · 2019-09-01T20:52:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16700/
Test FAILed.

AmplabJenkins · 2019-09-01T22:38:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16699/
Test FAILed.

AmplabJenkins · 2019-09-02T00:52:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16703/
Test FAILed.

AmplabJenkins · 2019-09-02T00:52:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16704/
Test FAILed.

AmplabJenkins · 2019-09-02T02:01:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16707/
Test FAILed.

AmplabJenkins · 2019-09-02T02:02:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16712/
Test FAILed.

AmplabJenkins · 2019-09-02T02:02:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16710/
Test FAILed.

AmplabJenkins · 2019-09-02T02:02:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16708/
Test FAILed.

AmplabJenkins · 2019-09-02T02:03:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16709/
Test FAILed.

AmplabJenkins · 2019-09-02T04:35:50Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16713/
Test FAILed.

AmplabJenkins · 2019-09-03T05:10:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16730/
Test PASSed.

added placeholder for changes. not yet working

e5a9035

chagned TF example from cifar to simple mnist for easy debugging; now…

9b6d581

… train_example works with cpu

jichan3751 added 2 commits August 17, 2019 04:29

refinement

a027c5a

Merge branch 'master' of https://github.com/ray-project/ray into tf_t…

1952f38

…rainable

richardliaw reviewed Aug 19, 2019

bugfux

3d29935

some lints

0b37ac0

richardliaw reviewed Aug 20, 2019

python/ray/experimental/sgd/utils.py

removed tf_helper and merged to example; distributed tf runner merged…

576eca3

… to tf runner; removed trainloss decreasing check from example and test

jichan3751 added 2 commits August 22, 2019 03:13

fixed save/load

fce9c18

lint

8d4da2e

jichan3751 added 2 commits August 23, 2019 02:47

simpler tf example; moved pytorch_util.py from tests to example

d1aa990

changed num_samples option in test

9c87275

richardliaw added 6 commits September 1, 2019 10:14

jenkins

5367992

fix

ad79560

some docs

ad0c21b

docs

c3036dd

removepermission

e90cde7

doc

1150fa6

richardliaw added 3 commits September 1, 2019 14:14

fix

c83329f

fix tests

3eda153

fconfig

9f2835c

richardliaw added 4 commits September 1, 2019 15:43

tf

351da4e

fix

3b7e775

fixall

e08580a

Make TF more extensible

967f46f

fix

02880b6

richardliaw approved these changes Sep 2, 2019

richardliaw changed the title ~~Tensorflow interface for MultiNode SGD~~ [training] Tensorflow interface for MultiNode SGD Sep 2, 2019

richardliaw self-assigned this Sep 2, 2019

fix_dataset

b11decc

richardliaw merged commit 1711e20 into ray-project:master Sep 3, 2019
