[Tune] Logging with multiple time intervals #4157
If I understand correctly, you want different fields to be reported at different time scales? By default, the regular logging mechanism should work with different metrics being reported throughout training. One thing you could do for clarity is nest a dictionary of values - for example:
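A rough sketch of what such a nested result might look like (the helper names `run_one_iteration`, `evaluate`, and the `eval_interval` config key are illustrative, not from the original comment; this assumes Tune's TensorBoard logger flattens nested keys into tags like `train/loss`):

```python
def _train(self):
    # Metrics produced every iteration go in one sub-dictionary; metrics
    # computed less often are only added when they exist.
    loss = self.run_one_iteration()
    result = {"train": {"loss": loss}}
    if self._iteration % self.config["eval_interval"] == 0:
        result["eval"] = {"accuracy": self.evaluate(self.dev_data)}
    return result
```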
Does this help? If I misunderstood, let me know!
Hi, thanks for your quick response! So I think I understand what you mean, but these lines in the TFLogger gave me the impression that all the metrics are given the same step.
What I'm confused about is that we only get to return a single dictionary of values every step, but within a step, it's possible that you may generate multiple values for the same key. Take for example:
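(The original snippet here was not preserved; the following is a hypothetical reconstruction of the kind of step being described, with `iters_per_step`, `train_sampler`, and `train_batch` as made-up names.)

```python
def _train(self):
    # One Tune "step" that internally runs several training iterations,
    # producing several values for the same key.
    losses = []
    for _ in range(self.config["iters_per_step"]):
        batch = next(self.train_sampler)
        losses.append(self.train_batch(batch))
    # A single result dict is returned for the whole step, so all of
    # these loss values end up attached to one reporting step.
    return {"loss": losses}
```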
In this example, it's unclear to me how Tensorboard will go about parsing the list of values. Ideally I would call tf_summary with the correct training time step for each value in the list.
As a secondary point, would there be any way to support connecting the global Python logger available from the `logging` module?
@jeremyasapp - would something like this work?

```python
def _train(self):
    batch = next(train_sampler)
    loss = train(batch)
    extras = {}
    if self._iteration % n_iter_per_step == 0:
        extras["metric"] = evaluate(self.dev_data)
    return dict(loss=loss, **extras)
```

Here, you'd get `metric` reported only once every `n_iter_per_step` iterations, while `loss` is reported every iteration.
@nmatthews-asapp can you let me know more about what you're trying to do? Does a custom logger work for you? Adding a custom logger would enable you to use the Python logger, although only on certain notifications (e.g., when results are reported). On the other hand, printing in the Trainable should be caught by a separate Ray logging mechanism.
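For concreteness, a custom logger along those lines might look roughly like the sketch below (the import path and the way loggers are registered with `tune.run` vary across Tune versions, so treat this as an outline rather than the exact API):

```python
import logging
from ray.tune.logger import Logger  # location may differ by Tune version

py_logger = logging.getLogger("trial_results")

class PyLoggingLogger(Logger):
    """Forwards each reported result to the standard Python logger."""

    def on_result(self, result):
        # Called once per reported result for the trial.
        py_logger.info("result in %s: %s", self.logdir, result)
```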
I think the issue with the first suggestion is more conceptual - what you're suggesting is to change the definition of what one training step is for the sake of logging, which will affect the way training happens and in some cases prevent us from following your guidelines about how long one step should be. Ideally we could decide what one step should be regardless of logging, and have control over the global_step used in TensorBoard.

Just to clarify on the second logging question: you're saying that we can somehow set up a custom Tune logger such that when we, in other files, do something like `logging.info(...)`, that log will end up going to the UnifiedLogger? How does that work?
@richardliaw Your suggestion could work, but as @nmatthews-asapp pointed out, this restricts us to a step becoming a single training iteration, which could be too short and result in overhead. Of course, we could also only report training every N iterations, but then we have the same problem that we can only show a single aggregate number for the N passed iterations. I'm also very curious about the use of Python logging with a custom logger. My original idea was to build our custom logger inside the Trainable, and use it inside the step method. If we can do it with Python logging, then even better.
I see. In that case, instead of programming against the framework, I would probably just set up my own TF summary writing within the Trainable. Then, any output in the Trainable logdir will be continuously synced back to the master node if you're using the autoscaler (or if you set up syncing yourself). [ignore comment about using custom logger, as I think it doesn't apply in this case]

Regarding the Python logger, you wouldn't be able to use the logger from your original driver process because Trainables are run in separate processes. But if you're willing to do the logging setup in the Trainable (i.e., in its setup method), that should work.

Does that help?
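A rough sketch of that approach, using the TF 1.x-style summary API; `train_one_batch` and the `iters_per_step` config key are placeholders, not part of any Tune API:

```python
import tensorflow as tf           # TF 1.x summary writer
from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        # Writing event files into self.logdir keeps them inside the trial
        # directory, so they are synced along with the rest of Tune's output.
        self._writer = tf.summary.FileWriter(self.logdir)
        self._global_step = 0

    def _train(self):
        for _ in range(self.config["iters_per_step"]):
            loss = self.train_one_batch()  # hypothetical training helper
            summary = tf.Summary(value=[
                tf.Summary.Value(tag="train/loss", simple_value=loss)])
            # Each value gets its own global step, independent of how long
            # a Tune "step" is.
            self._writer.add_summary(summary, self._global_step)
            self._global_step += 1
        self._writer.flush()
        return {"loss": loss}
```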
Great, thanks for making that PR last time to support this syncing btw! And we might take your latter suggestion as well. We want to support users logging from arbitrary places in the codebase without passing loggers down, so having logging set up - for example, by adding a handler in the Trainable's setup - should cover that.
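One way that handler setup could look (a sketch only; the file name and format string are arbitrary choices, not from the thread):

```python
import logging
import os
from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        # Attach a handler to the root logger inside the Trainable's process,
        # so logging calls made anywhere in the codebase end up in a file
        # under the trial's logdir and are synced along with it.
        handler = logging.FileHandler(os.path.join(self.logdir, "trial.log"))
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
        logging.getLogger().addHandler(handler)
```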
No problem! Feel free to close the issue if resolved. BTW, I'd be really interested in hearing more about your use case and seeing if there are any features we can add to Tune to better support what you're trying to do -- feel free to shoot me an email (rliaw [at] berkeley.edu)!
@richardliaw Thanks a lot for the help. I think it'd be really productive for us to chat offline. We'll reach out soon! We love Ray + Tune. Closing the issue for now!
System information
Describe the problem
The Trainable interface in Tune expects the step method to output a logging dictionary. However, it is unclear how to annotate logging statements with different global steps. For example, one may want to record the model's gradients at every training iteration, but may only want to record the dev metric once per epoch.
One solution that we are exploring is building an adapter Trainable interface, with a custom logger (the default logger being deactivated) that would be passed to the step method. The logger would then have a method such as `log(key, value, time_step)`. The custom logger, when given a result (i.e., in `on_result`), would then parse the list of (key, value, time_step) tuples and output the correct TensorBoard graphs. The only downside of this method is that the user would have to wait until the step is finished before seeing the logs for that step.

I was wondering if you had faced this question before, and had some thoughts about the best way to approach it. Thank you in advance for your help!
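A minimal sketch of the adapter idea described above (everything except the `log` and `on_result` names is made up for illustration):

```python
class StepLogger:
    """Buffers (key, value, time_step) tuples produced during one step."""

    def __init__(self):
        self._records = []

    def log(self, key, value, time_step):
        self._records.append((key, value, time_step))

    def flush(self):
        # Returned as part of the step's result dict; a custom Tune logger's
        # on_result would unpack these and write each value to TensorBoard
        # at its own time_step.
        records, self._records = self._records, []
        return {"records": records}
```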