[Tune] Logging with multiple time intervals #4157

Closed · jeremyasapp opened this issue Feb 25, 2019 · 11 comments

@jeremyasapp

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): Pip
  • Ray version: 0.6.3
  • Python version: 3.6.5

Describe the problem

The Trainable interface in Tune expects the step method to output a logging dictionary. However, it is unclear how to annotate logging statements with different global steps. For example, one may want to record the model's gradients at every training iteration, but only record the dev metric once per epoch.

One solution that we are exploring is building an adapter Trainable interface with a custom logger (the default logger being deactivated) that would be passed to the step method. The logger would then have a method such as log(key, value, time_step). The custom logger, when given a result (i.e., on_result), would then parse the list of (key, value, time_step) tuples and output the correct TensorBoard graphs. The only downside of this method is that the user would have to wait until the step is finished before seeing the logs for that step.
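
To make this concrete, here is a rough sketch of what we're imagining (the StepLogger name and the way it would be handed to step are placeholders, not an existing Tune API):

class StepLogger:
    """Hypothetical buffer the adapter Trainable would pass into step()."""

    def __init__(self):
        self._records = []

    def log(self, key, value, time_step):
        # Buffer (key, value, time_step) tuples until the step finishes.
        self._records.append((key, value, time_step))

    def flush(self):
        # The adapter would attach these records to the step's result so a
        # custom logger can emit one TensorBoard point per tuple in on_result.
        records, self._records = self._records, []
        return records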

I was wondering if you had faced this question before, and had some thoughts about the best way to approach it. Thank you in advance for your help!

@richardliaw richardliaw self-assigned this Feb 25, 2019
@richardliaw
Contributor

If I understand correctly, you want different fields to be reported at different time scales? I think by default, the regular logging mechanism should work with different metrics being reported throughout training.

One thing you could do for clarity is nest a dictionary of values, for example:

result_per_iter = {
    "some_val": 1
}

result_per_epoch = {
    "some_val": 2, 
    "epoch_info": {"val_acc": 123}
}

Does this help? If I misunderstood, let me know!

@jeremyasapp
Author

jeremyasapp commented Feb 25, 2019

Hi, thanks for your quick response!

So I think I understand what you mean, but these lines in the TFLogger gave me the impression that all the metrics are given the same step:

train_stats = tf.Summary(value=values)
t = result.get(TIMESTEPS_TOTAL) or result[TRAINING_ITERATION]
self._file_writer.add_summary(train_stats, t)

What I'm confused about is that we only get to return a single dictionary of values every step, but within a step, it's possible that you may generate multiple values for the same key. Take for example:

def step(self):
    losses = []
    # Do n training steps
    for i in range(n_iter_per_step):
        batch = next(train_sampler)
        loss = train(batch)
        losses.append(loss)

    # Run evaluation
    metric = evaluate(self.dev_data)
    return dict(metric=metric, loss=losses)

In this example, it's unclear to me how TensorBoard will go about parsing the list of values. Ideally I would call tf_summary with the correct training time step for each value in losses (i.e., n_iter_per_step * global_step + i). Please let me know if that makes things more clear, or I can give a more concrete example. Thank you!
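
Concretely, this is the kind of per-value writing I have in mind (just a sketch; file_writer, global_step, n_iter_per_step, and losses are assumed to already exist, and it uses the TF 1.x summary API from the TFLogger snippet above):

import tensorflow as tf

for i, loss in enumerate(losses):
    summary = tf.Summary(
        value=[tf.Summary.Value(tag="loss", simple_value=loss)])
    # One TensorBoard point per inner training iteration, not one per Tune step.
    file_writer.add_summary(summary, n_iter_per_step * global_step + i)
file_writer.flush()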

@nmatthews-asapp
Contributor

As a secondary point, would there be any way to connect the global Python logger available from logging with the Ray logging system? I understand why you implemented the logger interface the way you did, since logging in a multi-machine setup is difficult, but the global logging interface is nice because you can access it anywhere at any time and don't have to pass it all the way down your object hierarchy.

@richardliaw
Contributor

richardliaw commented Feb 27, 2019

@jeremyasapp - would something like this work?

def _train(self):
    batch = next(train_sampler)
    loss = train(batch)
    extras = {}
    # Only run evaluation every n_iter_per_step iterations
    if self._iteration % n_iter_per_step == 0:
        extras["metric"] = evaluate(self.dev_data)
    return dict(loss=loss, **extras)

Here, loss would get the right training step, and extras would be reported every multiple of n_iter_per_step. I think this should work, but let me know otherwise.

@richardliaw
Contributor

@nmatthews-asapp can you let me know more about what you're trying to do?

Does a custom logger work for you? Adding a custom logger would enable you to use the Python logger, although only on certain notifications. On the other hand, printing in the Trainable should be caught by a separate Ray logging mechanism (i.e., take a look at ray.init(redirect_worker_output)).
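
As a rough sketch of that direction (assuming the ray.tune.logger.Logger base class with an on_result hook; how the custom logger gets registered depends on your Tune version):

import logging

from ray.tune.logger import Logger  # assumed import path for the base class

pylogger = logging.getLogger("tune_results")

class PyLoggingLogger(Logger):
    """Hypothetical custom Tune logger that forwards results to Python logging."""

    def on_result(self, result):
        # Called each time the Trainable reports a result dict.
        pylogger.info("iteration %s: %s",
                      result.get("training_iteration"), result)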

@nmatthews-asapp
Contributor

I think the issue with the first suggestion is more conceptual: what you're suggesting is to change the definition of one training step for the sake of logging, which affects the way training happens and in some cases prevents us from following your guidelines about how long one step should be. Ideally we could decide what one step should be regardless of logging, and have control over the global_step used in TensorBoard.

Just to clarify on the second logging question, you're saying that we can somehow set up a custom Tune logger such that when we, in other files, do something like

import logging
logger = logging.getLogger(__name__)
...
logger.info(batch_loss)

That log will end up going to the Unified logger? How does that work?

@jeremyasapp
Author

@richardliaw Your suggestion could work, but as @nmatthews-asapp pointed out, this restricts us to a step becoming a single training iteration, which could be too short and result in overhead.

Of course, we could also only report training every N iterations, but then we have the same problem that we can only show a single aggregate number for the N iterations that passed.

I'm also very curious about the use of Python logging with a custom logger. My original idea was to build our custom logger inside the Trainable and use it inside the step method. If we can do it with Python logging, then even better.

@richardliaw
Contributor

I see. In that case, instead of programming against the framework, I would probably just set up my own TF summary writing within _train and turn off the TF logger (#4104). This logs directly to the file system of the remote node (assuming you're doing something with a cluster).

Then, any output in the Trainable logdir will be continuously synced back to the master node if you're using the autoscaler (or if you set TUNE_CLUSTER_SSH_KEY - see tune/cluster_info.py), and you can use TensorBoard as you usually do.

[Ignore my earlier comment about using a custom logger; I think it doesn't apply in this case.] Regarding the Python logger, you wouldn't be able to use the logger from your original driver process because Trainables are run in separate processes. But if you're willing to do the logging setup in the Trainable (i.e., in _setup), any output of Trainables can be redirected to the driver (or logged to files on the remote node and then synced back as mentioned above).
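
For example, the _setup approach might look roughly like this (a sketch only, assuming self.logdir points at the trial's log directory that gets synced back):

import logging
import os

from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        # Attach a file handler inside the Trainable process so that anything
        # logged via the global Python logger lands in the trial logdir and is
        # synced back with the rest of the trial output.
        handler = logging.FileHandler(os.path.join(self.logdir, "trial.log"))
        logging.getLogger().addHandler(handler)
        # ... rest of the usual setup ...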

Does that help?

@nmatthews-asapp
Contributor

nmatthews-asapp commented Feb 27, 2019

Great, thanks for making that PR last time to support this syncing btw!

And we might take your latter suggestion as well. We want to support users logging from arbitrary places in the codebase without passing loggers down, so setting up logging, for example by adding a handler in _setup that writes to the right file path to be rsync'd back as you suggest, would be nice. Again, the PR you made makes this possible in a nice way.

@richardliaw
Contributor

No problem! Feel free to close the issue if resolved.

BTW, I'd be really interested in hearing more about your use case and seeing if there are any features we can add to Tune to better support what you're trying to do -- feel free to shoot me an email (rliaw [at] berkeley.edu)!

@jeremyasapp
Author

@richardliaw Thanks a lot for the help. I think it'd be really productive for us to chat offline. We'll reach out soon! We love Ray + Tune. Closing the issue for now!
