RFC-0026-logging-system #43
Conversation
Thanks for writing this down!
For the warning/error case, I'm curious what is the discrepancy / missing part in the current system?
Seems like a good proposal in general, @kurtamohler. I think it'd be helpful to expand this to the next level of detail and show how some of these scenarios would work, and how errors and warnings can originate in both Python and C++.
Would it be at all helpful if I include a somewhat full description (or links to documentation when possible) of the current message APIs in PyTorch? I am going to write notes on it regardless for my own reference, and I can include it here if it's a good idea.

EDIT: My notes on the current messaging APIs are here: https://github.com/kurtamohler/notes/blob/main/pytorch-messaging-api/current_messaging_api.md
Yep, that's a great idea.
I've added a lot more detail, including a description of how message logging is currently done in PyTorch, and fixed some of the things brought up in the discussion so far. I haven't described all of the APIs for creating messages in the new system yet; I will do that next.
Hey @kurtamohler! FYI I'll be on PTO for the next several weeks, and I think @albanD is on PTO currently. So this may take some time to respond to.
That looks great! Thanks for taking the time to write all of this down
* **`c10::BetaWarning`** - Python `torch.BetaWarning`
  - Emitted when a beta feature is called. See
    [PyTorch feature classifications](https://pytorch.org/blog/pytorch-feature-classification-changes/).
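For illustration (not part of the RFC text above): if a dedicated `torch.BetaWarning` category existed, users could silence beta-feature warnings with the standard `warnings` machinery. Since `torch.BetaWarning` is only proposed here, the snippet defines a local stand-in class.

```python
import warnings

# Local stand-in for the proposed torch.BetaWarning (not an existing class).
class BetaWarning(UserWarning):
    pass

# A dedicated category lets users silence beta-feature warnings selectively:
warnings.filterwarnings("ignore", category=BetaWarning)
warnings.warn("foo() is a beta feature", BetaWarning)  # silenced
warnings.warn("something unrelated", UserWarning)      # still shown
```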
I'm not sure about this one, tbh. The page is 2+ years old and doesn't mention warnings (it is also inaccurate in saying that prototype features are not in releases).
cc @gchanan do you have a strong opinion on this?
While I could see this being useful for some features for which we do expect to break BC, I think many beta/prototype features are just "not mature enough" and don't need such warnings, to avoid emitting too many of them?
Yeah, I can see your point; maybe beta/prototype warning classes aren't a good idea. There are some warnings currently being emitted in these cases, so we will have to choose which warning type they use. For instance: https://github.com/pytorch/pytorch/blob/56dea92d973ff16b41cb812b69c5a7ab8c9d1109/torch/csrc/cuda/Module.cpp#L566-L568
Should we just use the default `UserWarning` class for beta/prototype features that currently emit a warning, or provide something that's just a little more specific, like maybe `UnstableFeatureWarning`?
### C++ to Python Warning Translation

The conversion of warnings from C++ to Python is described [here](https://github.com/pytorch/pytorch/blob/72e4aab74b927c1ba5c3963cb17b4c0dce6e56bf/torch/csrc/Exceptions.h#L25-L48).
Is there anything notable about the current `Info` system? It's the one I'm least familiar with and I am a bit curious :D
#### Info message classes:

* **`c10::Info`** - Python `torch.Info`
Thinking more about what the use case for this would be, I don't think this should be used within core in general.
The only use case that I could see is a particular submodule that is expected to do some logging (distributed runner, benchmark tools, etc.).
What do you think about forbidding use of the base `Info` class, so that only subclasses (`DistributedInfo`, `BenchmarkInfo`, etc.) are allowed? That would reduce the chances of unrelated logs being interleaved that the user cannot easily silence.
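A minimal, self-contained sketch of this "forbid the base class, allow only subclasses" idea (class names mirror the discussion; this is illustrative, not the proposed implementation):

```python
# Base Info class that refuses direct use; only submodule-specific
# subclasses may be instantiated/emitted.
class Info:
    def __new__(cls, *args, **kwargs):
        if cls is Info:
            raise TypeError("use a submodule-specific subclass of Info, not Info itself")
        return super().__new__(cls)

    def __init__(self, msg):
        self.msg = msg

class DistributedInfo(Info):
    pass

class BenchmarkInfo(Info):
    pass

DistributedInfo("rank 0 joined")  # ok
# Info("something")               # raises TypeError: the base class is forbidden
```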
I like that idea
We have torch.monitor (#30), which hasn't seen too much usage; perhaps we can use some of it for the new logging system.
How is one supposed to selectively change the log verbosity of a specific subsystem? While having functions like:

```python
# distributed.py
logger = torch.Logger("torch.distributed")

def broadcast():
    logger.log_info("broadcasting stuff")
```

Finally, is there a reason why we can't integrate with Python's `logging` package so there's very little users need to do to leverage this?
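A minimal sketch of what integrating with Python's `logging` package could look like (an assumption for discussion, not a decided design; `torch.Logger` and `log_info` follow the snippet above):

```python
import logging

# If torch.Logger were a thin wrapper over the standard logging package,
# per-subsystem verbosity would come for free via hierarchical logger names.
class Logger:
    def __init__(self, name):
        self._logger = logging.getLogger(name)

    def log_info(self, msg):
        self._logger.info(msg)

# Users would then control a single subsystem with ordinary logging calls:
logging.basicConfig()
logging.getLogger("torch.distributed").setLevel(logging.INFO)

Logger("torch.distributed").log_info("broadcasting stuff")  # shown
Logger("torch.autograd").log_info("autograd detail")        # suppressed (default WARNING)
```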
@kumpera, good questions.
At the moment, the general idea is that users will use a filter to silence messages. The specific API for that hasn't been written down yet. Although, in the case of warnings, the
Authors would use the appropriate message class. If an applicable message class doesn't exist yet, they should add it.
That might be a good idea. What do you think @albanD, @mruberry ?
I don't know much about it, I'll read about it and get back to you.
RFC-0026-logging-system.md (Outdated)
The APIs for raising errors all check a boolean condition, the `cond` argument
in the following signatures, and throw an error if that condition is false.

The error APIs are listed below, with the C++ signature on the left and the
corresponding Python signature on the right.

**`TORCH_CHECK(cond, ...)`** - `torch.check(cond, *args)`
- C++ error: `c10::Error`
- Python error: `RuntimeError`

**`TORCH_CHECK_INDEX(cond, ...)`** - `torch.check_index(cond, *args)`
- C++ error: `c10::IndexError`
- Python error: `IndexError`

**`TORCH_CHECK_VALUE(cond, ...)`** - `torch.check_value(cond, *args)`
- C++ error: `c10::ValueError`
- Python error: `ValueError`

**`TORCH_CHECK_TYPE(cond, ...)`** - `torch.check_type(cond, *args)`
- C++ error: `c10::TypeError`
- Python error: `TypeError`

**`TORCH_CHECK_NOT_IMPLEMENTED(cond, ...)`** - `torch.check_not_implemented(cond, *args)`
- C++ error: `c10::NotImplementedError`
- Python error: `NotImplementedError`

**`TORCH_CHECK_WITH(error_t, cond, ...)`** - `torch.check_with(error_type, cond, *args)`
- C++ error: Specified by `error_t` argument
- Python error: Specified by `error_type` argument
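For illustration, a sketch of how the proposed Python-side wrappers from the table above might read in practice. Since `torch.check_*` is only proposed here, the snippet defines plain-Python stand-ins with the same contract (raise the mapped exception when the condition is false):

```python
# Stand-ins for the proposed torch.check_index / torch.check_type APIs.
def check_index(cond, *args):
    if not cond:
        raise IndexError("".join(str(a) for a in args))

def check_type(cond, *args):
    if not cond:
        raise TypeError("".join(str(a) for a in args))

def gather_row(rows, idx):
    check_type(isinstance(idx, int), "idx must be an int, got ", type(idx))
    check_index(0 <= idx < len(rows), "idx out of range: ", idx)
    return rows[idx]

gather_row([[1, 2], [3, 4]], 1)    # ok
# gather_row([[1, 2], [3, 4]], 5)  # raises IndexError: idx out of range: 5
```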
We want to change how asserts are written to support vmap (and/or Tensor Subclasses) over functions that call asserts (e.g. pytorch/pytorch#81767, cc @kshitij12345). In order to support those, the assert should go through the PyTorch dispatcher. Since we are re-thinking how asserts in PyTorch work, this is a good opportunity to figure out how we can have asserts as PyTorch operators.
**Some background**

Folks write functions in C++ like the following:

```cpp
Tensor safe_log(const Tensor& x) {
  TORCH_CHECK((x > 0).all(), "requires x > 0");
  return x.log();
}
```
If someone is doing `vmap(safe_log)(torch.tensor([-1, 1, 2]))`, then `(x > 0).all()` is a Tensor that has multiple values. A Tensor that has multiple values cannot be coerced to a single boolean value, and the above errors out.
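As a concrete illustration (added here, not part of the original comment), the Python analogue of that coercion failure:

```python
import torch

x = torch.tensor([-1.0, 1.0, 2.0])
bool((x > 0).all())  # fine: .all() reduces to a single-element Tensor

try:
    bool(x > 0)      # multi-element Tensor cannot be coerced to one bool
except RuntimeError as e:
    print(e)         # "Boolean value of Tensor with more than one element is ambiguous"
```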
**A sketch of a solution**

For synchronous asserts, we would want two extension points: one to actually check the condition, and the other to raise an error message. So `TORCH_CHECK(cond, ...)` would be implemented like:

```cpp
if (!at::is_true(cond)) {
  msg = ...
  at::fail(msg);
}
```
@zou3519, I started looking into adding vmap support to `TORCH_CHECK`, and I realized that it doesn't actually support tensor inputs at all, not even non-batched tensors. For example, if I do this:

```cpp
TORCH_CHECK((at::zeros(10).eq(0)).all());
```
I get this build error:

```
/work2/kurtamohler/development/pytorch-1/c10/util/Exception.h:506:29: error: no match for 'operator!' (operand type is 'at::Tensor')
  506 | if (C10_UNLIKELY_OR_CONST(!(cond))) { \
```
Without `at::Tensor::operator!`, we currently have to do something like this instead:

```cpp
TORCH_CHECK((at::zeros(10).eq(0)).all().item<bool>());
```
So I like the idea of adding support for tensor inputs to `TORCH_CHECK` to allow us to avoid the explicit `.item<bool>()`. This would also be a plus for Python/C++ consistency, since in Python `torch.tensor(True)` and `torch.tensor(False)` are valid conditionals without needing to call `.item()` on them.
One way could be to use an `at::is_true` function, like you mentioned. However, I'm not sure exactly how to do that. Since `TORCH_CHECK` is defined in `c10`, it doesn't have access to the `at` namespace. But of course, since it's a macro, it's just doing substitution, so maybe we could have an `#ifdef` that checks whether `at::is_true` is available before using it, like this:
```cpp
#ifdef AT_IS_TRUE_DEFINED
#define CHECK_COND(cond) at::is_true(cond)
#else
#define CHECK_COND(cond) (cond)
#endif
```
The `#define AT_IS_TRUE_DEFINED` would sit next to wherever the declaration of `at::is_true` is, and then we can use `CHECK_COND` in the definition of `TORCH_CHECK`.
But when I tried this, I couldn't figure out which file to define `at::is_true` in such that it's available for any `TORCH_CHECK` call that might use tensors. It also just seems like kind of a sketchy solution.
Another option is to just define `bool at::Tensor::operator!`. But then we would have to make it throw an error if the tensor has more than one logical element. Also, in order to make `TORCH_CHECK` work correctly when `cond` is a batched tensor, the actual meaning of `at::Tensor::operator!` would have to be something like `!physical_tensor.any()`. Those seem like somewhat confusing inconsistencies to me. (Although, in the non-batched tensor case, it is actually consistent with Python, since `not x` will error if `x` has more than one element.) But maybe it's actually fine, I'm not sure. I think it's at least less sketchy than my `AT_IS_TRUE_DEFINED` idea.
Maybe there's an overall better way to do this, though.
cc @albanD
Is there any reason you have to use `TORCH_CHECK` for this?
As I understood it, the suggestion was to make `TORCH_CHECK((x > 0).all(), "requires x > 0");`, for example, work for both batched and non-batched tensors.
I guess, for something like this where we're syncing, I think it is fine if we force the call sites to change what they look like. There aren't many of this kind of check.
I'm fine with having something like a `TORCH_CHECK_TENSOR` that accepts a Tensor to sidestep the issues above. We control what gets written in our C++ internals, so we can always refactor if we're not satisfied.
In Python, is it correct that we want `torch.check` to be a public API? If so, it needs more scrutiny. My initial thought is that `torch.check` should do both `TORCH_CHECK` and `TORCH_CHECK_TENSOR` (there aren't technical reasons in Python for us to split those up).
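A sketch of that "one Python-side check handles both" idea (the `_check` name and behavior are assumptions for discussion, not a decided API):

```python
import torch

def _check(cond, msg=""):
    # Accept either a plain bool or a Tensor condition; reducing a Tensor
    # condition to a bool is a synchronization point.
    if isinstance(cond, torch.Tensor):
        cond = bool(cond.all())
    if not cond:
        raise RuntimeError(msg)

_check((torch.zeros(10) == 0).all(), "expected all zeros")  # Tensor condition
_check(2 + 2 == 4, "arithmetic is broken")                  # plain bool condition
```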
I don't think the Python API needs to be public. I can make it private to start with, and we can make it public if we ever find a reason to.
Sorry for not being specific here; what I meant by scoping is which subsystem is producing a given log message. For example, if one is troubleshooting an issue in the backward pass of a distributed model, they would enable logging for the "autograd" and "distributed" subsystems. I guess this boils down to whether log messages would carry a subsystem tag that can be used as part of filtering.
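To make the subsystem-tag idea concrete, here is a sketch using the standard `logging` package's filters (the tag mechanism is an assumption, not a decided design):

```python
import logging

# Only records tagged with an allowed subsystem pass through this handler.
class SubsystemFilter(logging.Filter):
    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed

    def filter(self, record):
        return getattr(record, "subsystem", None) in self.allowed

handler = logging.StreamHandler()
handler.addFilter(SubsystemFilter({"autograd", "distributed"}))

logger = logging.getLogger("torch")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.debug("grad step info", extra={"subsystem": "autograd"})  # shown
logger.debug("jit detail", extra={"subsystem": "jit"})           # filtered out
```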
RFC for a consistent C++/Python message logging system in PyTorch
Feature was requested in pytorch/pytorch#72948