[Tune] Logging with multiple time intervals #4157

Closed · jeremyasapp opened this issue Feb 25, 2019 · 11 comments

@jeremyasapp

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): Pip
  • Ray version: 0.6.3
  • Python version: 3.6.5

Describe the problem

The Trainable interface in Tune expects the step method to output a logging dictionary. However, it is unclear how to annotate logging statements with different global steps. For example, one may want to record the model's gradients at every training iteration, but only record the dev metric once per epoch.

One solution that we are exploring is building an adapter Trainable interface with a custom logger (the default logger being deactivated) that would be passed to the step method. The logger would then have a method such as log(key, value, time_step). The custom logger, when given a result (i.e., on_result), would then parse the list of (key, value, time_step) tuples and output the correct TensorBoard graphs. The only downside of this method is that the user would have to wait until the step is finished before seeing the logs for that step.
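
To make this concrete, here is a rough sketch of what we're imagining (the StepLogger name and the way it would be handed to step are placeholders, not an existing Tune API):

class StepLogger:
    """Hypothetical buffer the adapter Trainable would pass into step()."""

    def __init__(self):
        self._records = []

    def log(self, key, value, time_step):
        # Buffer (key, value, time_step) tuples until the step finishes.
        self._records.append((key, value, time_step))

    def flush(self):
        # The adapter would attach these records to the step's result so a
        # custom logger can emit one TensorBoard point per tuple in on_result.
        records, self._records = self._records, []
        return records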

I was wondering if you had faced this question before, and had some thoughts about the best way to approach it. Thank you in advance for your help!

@richardliaw richardliaw self-assigned this Feb 25, 2019
@richardliaw
Contributor

If I understand correctly, you want different fields to be reported at different time scales? I think by default, the regular logging mechanism should work with different metrics being reported throughout training.

One thing you could do for clarity is nest a dictionary of values, for example:

result_per_iter = {
    "some_val": 1
}

result_per_epoch = {
    "some_val": 2, 
    "epoch_info": {"val_acc": 123}
}

Does this help? If I misunderstood, let me know!

@jeremyasapp
Author

jeremyasapp commented Feb 25, 2019

Hi, thanks for your quick response!

So I think I understand what you mean, but these lines in the TFLogger gave me the impression that all the metrics are given the same step:

train_stats = tf.Summary(value=values)
t = result.get(TIMESTEPS_TOTAL) or result[TRAINING_ITERATION]
self._file_writer.add_summary(train_stats, t)

What I'm confused about is that we only get to return a single dictionary of values every step, but within a step, it's possible that you may generate multiple values for the same key. Take for example:

def step(self):
    losses = []
    # Do n training steps
    for i in range(n_iter_per_step):
        batch = next(train_sampler)
        loss = train(batch)
        losses.append(loss)

    # Run evaluation
    metric = evaluate(self.dev_data)
    return dict(metric=metric, loss=losses)

In this example, it's unclear to me how TensorBoard will go about parsing the list of values. Ideally I would call tf_summary with the correct training time step for each value in losses (i.e., n_iter_per_step * global_step + i). Please let me know if that makes things more clear, or I can give a more concrete example. Thank you!
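
Concretely, this is the kind of per-value writing I have in mind (just a sketch; file_writer, global_step, n_iter_per_step, and losses are assumed to already exist, and it uses the TF 1.x summary API from the TFLogger snippet above):

import tensorflow as tf

for i, loss in enumerate(losses):
    summary = tf.Summary(
        value=[tf.Summary.Value(tag="loss", simple_value=loss)])
    # One TensorBoard point per inner training iteration, not one per Tune step.
    file_writer.add_summary(summary, n_iter_per_step * global_step + i)
file_writer.flush()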

@nmatthews-asapp
Contributor

As a secondary point, would there be any way to connect the global Python logger available from logging with the Ray logging system? I understand why you implemented the logger interface the way you did, since logging in a multi-machine setup is difficult, but the global logging interface is nice because you can access it anywhere at any time and don't have to pass it all the way down your object hierarchy.

@richardliaw
Contributor

richardliaw commented Feb 27, 2019

@jeremyasapp - would something like this work?

def _train(self):
    batch = next(train_sampler)
    loss = train(batch)
    extras = {}
    # Only run evaluation every n_iter_per_step iterations
    if self._iteration % n_iter_per_step == 0:
        extras["metric"] = evaluate(self.dev_data)
    return dict(loss=loss, **extras)

Here, loss would get the right training step, and extras would be reported every multiple of n_iter_per_step. I think this should work, but let me know otherwise.

@richardliaw
Contributor

@nmatthews-asapp can you let me know more about what you're trying to do?

Does a custom logger work for you? Adding a custom logger would enable you to use the Python logger, although only on certain notifications. On the other hand, printing in the Trainable should be caught by a separate Ray logging mechanism (i.e., take a look at ray.init(redirect_worker_output)).
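
As a rough sketch of that direction (assuming the ray.tune.logger.Logger base class with an on_result hook; how the custom logger gets registered depends on your Tune version):

import logging

from ray.tune.logger import Logger  # assumed import path for the base class

pylogger = logging.getLogger("tune_results")

class PyLoggingLogger(Logger):
    """Hypothetical custom Tune logger that forwards results to Python logging."""

    def on_result(self, result):
        # Called each time the Trainable reports a result dict.
        pylogger.info("iteration %s: %s",
                      result.get("training_iteration"), result)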

@nmatthews-asapp
Contributor

I think the issue with the first suggestion is more conceptual: what you're suggesting is to change the definition of one training step for the sake of logging, which affects the way training happens and in some cases prevents us from following your guidelines about how long one step should be. Ideally we could decide what one step should be regardless of logging, and have control over the global_step used in TensorBoard.

Just to clarify on the second logging question, you're saying that we can somehow set up a custom Tune logger such that when we, in other files, do something like

import logging
logger = logging.getLogger(__name__)
...
logger.info(batch_loss)

That log will end up going to the Unified logger? How does that work?

@jeremyasapp
Author

@richardliaw Your suggestion could work, but as @nmatthews-asapp pointed out, this restricts us to a step becoming a single training iteration, which could be too short and result in overhead.

Of course, we could also only report training every N iterations, but then we have the same problem that we can only show a single aggregate number for the N iterations that passed.

I'm also very curious about the use of Python logging with a custom logger. My original idea was to build our custom logger inside the Trainable and use it inside the step method. If we can do it with Python logging, then even better.

@richardliaw
Contributor

I see. In that case, instead of programming against the framework, I would probably just set up my own TF summary writing within _train and turn off the TF logger (#4104). This logs directly to the file system of the remote node (assuming you're doing something with a cluster).

Then, any output in the Trainable logdir will be continuously synced back to the master node if you're using the autoscaler (or if you set TUNE_CLUSTER_SSH_KEY - see tune/cluster_info.py), and you can use TensorBoard as you usually do.

[Ignore my earlier comment about using a custom logger; I think it doesn't apply in this case.] Regarding the Python logger, you wouldn't be able to use the logger from your original driver process because Trainables are run in separate processes. But if you're willing to do the logging setup in the Trainable (i.e., in _setup), any output of Trainables can be redirected to the driver (or logged to files on the remote node and then synced back as mentioned above).
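
For example, the _setup approach might look roughly like this (a sketch only, assuming self.logdir points at the trial's log directory that gets synced back):

import logging
import os

from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        # Attach a file handler inside the Trainable process so that anything
        # logged via the global Python logger lands in the trial logdir and is
        # synced back with the rest of the trial output.
        handler = logging.FileHandler(os.path.join(self.logdir, "trial.log"))
        logging.getLogger().addHandler(handler)
        # ... rest of the usual setup ...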

Does that help?

@nmatthews-asapp
Contributor

nmatthews-asapp commented Feb 27, 2019

Great, thanks for making that PR last time to support this syncing btw!

And we might take your latter suggestion as well. We want to support users logging from arbitrary places in the codebase without passing loggers down, so setting up logging, for example by adding a handler in _setup that writes to the right file path to be rsync'd back as you suggest, would be nice. Again, the PR you made makes this possible in a nice way.

@richardliaw
Contributor

No problem! Feel free to close the issue if resolved.

BTW, I'd be really interested in hearing more about your use case and seeing if there are any features we can add to Tune to better support what you're trying to do -- feel free to shoot me an email (rliaw [at] berkeley.edu)!

@jeremyasapp
Author

@richardliaw Thanks a lot for the help. I think it'd be really productive for us to chat offline. We'll reach out soon! We love Ray + Tune. Closing the issue for now!
