Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[training] Tensorflow interface for MultiNode SGD #5440

Merged
merged 40 commits into from
Sep 3, 2019

Conversation

jichan3751
Copy link
Contributor

@jichan3751 jichan3751 commented Aug 12, 2019

What do these changes do?

Creates Tensorflow interface for MultiNode SGD.

TODO:

  • Smoke test data_augmentation_creator
  • Smoke test regular data_creator
  • Write docs
  • Verify that tests pass
  • Verify that this works on multi-node multi-gpu

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16240/
Test FAILed.

@richardliaw
Copy link
Contributor

BTW, I don't think you need to fit the PyTorch API so closely. I think you should first get the Distributed TF example running in Ray, and then think about APIs afterwards.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16258/
Test FAILed.

@jichan3751
Copy link
Contributor Author

Looks like there are problems with model save / loading.
There are basically 3 data needed to restore a model:

  1. sturucture config
  2. weights
  3. optimizer weights
    These are saved/ loaded without problems when writing / reading from disk using model.save().
    However, it is little tricky to apply optimizer weights if we want to restore the model from the python objects: optimizer weights = [] if model.fit() is not called. and this is not letting us setting the weight from python weight object.
    I included lots of hacks to get around this. Let me know your suggestion.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16365/
Test FAILed.

logger = logging.getLogger(__name__)


class DistributedTensorFlowRunner(TensorFlowRunner):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the inheritance necessary here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class TensorFlowRunner's method get_state and set_state can be used to get current model. So I think inheritance is needed here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should just define get_state and set_state in this class, so I think we don't need a separate DistributedTensorFlowRunner clas - just a regular TensorFlowRunner.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16394/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16404/
Test FAILed.

from ray.experimental.sgd.tensorflow.tensorflow_trainer import (
TensorFlowTrainer, TensorFlowTrainable)

from ray.experimental.sgd.tests.tf_helper import (get_model, get_dataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this, I think we should not use tf_helper and define get_model and get_dataset in this file. PyTorch should be updated in a separate PR.

- pip install -U tensorflow-gpu==2.0.0-beta1

file_mounts: {
~/run/: /Users/jichanchung/OneDrive/FF/190812_tf2_tune_trainable/190814_multinode_mnist_ray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?


return stats

def validate(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate should only be called on the test_dataset.

@@ -0,0 +1,98 @@
from __future__ import absolute_import
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should merge this with DistributedTensorflowRunner, as commented in that file


def set_state(self, state):
self.epoch = state["epoch"]
if self.model.optimizer.weights == []:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this issue nor this comment - can you:

  1. provide a comment of exactly what the error is, and
  2. provide in the code a link to the stackoverflow or Tensorflow github issue link that you found which suggested this workaround?

@@ -0,0 +1,38 @@
from __future__ import absolute_import, division, print_function, unicode_literals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be separate lines, and I don't think you need unicode_laterals

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf

NUM_TRAIN_SAMPLES = 60000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make this 512?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shuffle(NUM_TRAIN_SAMPLES) is used to shuffle whole data.
Are you meaning we should only consider 512 datapoints from mnist dataset?

@@ -0,0 +1,38 @@
from __future__ import absolute_import, division, print_function, unicode_literals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you copy these functions to the example?

… to tf runner; removed trainloss decreasing check from example and test
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16415/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16462/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16700/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16699/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16703/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16704/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16707/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16712/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16710/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16708/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16709/
Test FAILed.

@richardliaw richardliaw changed the title Tensorflow interface for MultiNode SGD [training] Tensorflow interface for MultiNode SGD Sep 2, 2019
@richardliaw richardliaw self-assigned this Sep 2, 2019
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16713/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16730/
Test PASSed.

@richardliaw richardliaw merged commit 1711e20 into ray-project:master Sep 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants