
RFC-0026-logging-system #43

Draft: wants to merge 6 commits into base: master

Conversation


@kurtamohler kurtamohler commented Jun 22, 2022

RFC for a consistent C++/Python message logging system in PyTorch

Feature was requested in pytorch/pytorch#72948


@albanD albanD left a comment


Thanks for writing this down!
For the warning/error case, I'm curious what is the discrepancy / missing part in the current system?

@mruberry

Seems like a good proposal in general, @kurtamohler. I think it'd be helpful to expand this to the next level of detail and show how some of these scenarios would work, and how errors and warnings can originate in both Python and C++

@kurtamohler

kurtamohler commented Jul 5, 2022

Would it be at all helpful if I included a fairly complete description (or links to documentation where possible) of the current message APIs in PyTorch? I am going to write notes on it regardless for my own reference, and I can include them here if it's a good idea.

EDIT: My notes on the current messaging APIs are here: https://github.com/kurtamohler/notes/blob/main/pytorch-messaging-api/current_messaging_api.md

@mruberry

> Would it be at all helpful if I include a somewhat full description (or links to documentation when possible) for the current message APIs in PyTorch? I am going to write notes on it regardless for my own reference, and I can include it here if it's a good idea.
>
> EDIT: My notes on the current messaging APIs are here: https://github.com/kurtamohler/notes/blob/main/pytorch-messaging-api/current_messaging_api.md

Yep, that's a great idea

@kurtamohler

I've added a lot more detail, including a description of how message logging is currently done in PyTorch, and fixed some of the things brought up in discussion so far.

I haven't described all of the APIs for creating messages in the new system yet; I will do that next.

@mruberry

> I've added a lot more detail, including a description of how message logging is currently done in PyTorch, and fixed some of the things brought up in discussion so far.
>
> I haven't described all of the APIs for creating messages in the new system yet--I will do that next

Hey @kurtamohler! FYI I'll be on PTO for the next several weeks, and I think @albanD is on PTO currently. So this may take some time to respond to.


@albanD albanD left a comment


That looks great! Thanks for taking the time to write all of this down


* **`c10::BetaWarning`** - Python `torch.BetaWarning`
- Emitted when a beta feature is called. See
[PyTorch feature classifications](https://pytorch.org/blog/pytorch-feature-classification-changes/).
Contributor

I'm not sure about this one, to be honest; the page is 2+ years old and doesn't mention warnings (it is also inaccurate in saying that prototype features are not in releases).
cc @gchanan do you have a strong opinion on this?

While I could see this being useful for some features where we do expect to break BC, I think many beta/prototype features are just "not mature enough" and don't need such warnings, to avoid emitting too many warnings.


@kurtamohler kurtamohler Aug 2, 2022


Yeah, I can see your point; maybe beta/prototype warning classes aren't a good idea. There are some warnings currently being emitted in these cases, so we will have to choose which warning type they should use. For instance: https://github.com/pytorch/pytorch/blob/56dea92d973ff16b41cb812b69c5a7ab8c9d1109/torch/csrc/cuda/Module.cpp#L566-L568

Should we just use the default UserWarning class for beta/prototype features that currently emit a warning, or provide something that's just a little more specific, like maybe UnstableFeatureWarning?
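A warning class along these lines could be defined purely in Python. The name `UnstableFeatureWarning` is only a suggestion from the discussion above, and `memory_snapshot` here is just a stand-in for any beta feature, so this is a sketch, not an agreed-upon API:

```python
import warnings

# Hypothetical warning class from the discussion above; the name is
# not finalized anywhere in this RFC.
class UnstableFeatureWarning(UserWarning):
    """Warns that a beta/prototype feature may change or be removed."""

def memory_snapshot():
    # A feature that is still in beta could emit the warning on entry.
    warnings.warn(
        "memory_snapshot is a beta feature and may change",
        UnstableFeatureWarning,
    )
    return {}

# Because it subclasses UserWarning, existing filters on UserWarning
# still match it, while users can also target it specifically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    memory_snapshot()

print(caught[0].category.__name__)  # UnstableFeatureWarning
```

Subclassing `UserWarning` keeps the default behavior for users who do nothing, while still giving a more specific handle to filter on.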

### C++ to Python Warning Translation

The conversion of warnings from C++ to Python is described [here](https://github.com/pytorch/pytorch/blob/72e4aab74b927c1ba5c3963cb17b4c0dce6e56bf/torch/csrc/Exceptions.h#L25-L48)

Contributor

Is there anything notable about the current Info system? It is the one I'm least familiar with, and I am a bit curious :D


#### Info message classes:

* **`c10::Info`** - Python `torch.Info`
Contributor

Thinking more about the use case for this, I don't think it should be used within core in general.
The only use case I can see is a particular submodule that is expected to do some logging (distributed runner, benchmark tools, etc.).
What do you think about forbidding use of the base Info class, so that only subclasses (DistributedInfo, BenchmarkInfo, etc.) are allowed? That would reduce the chances of unrelated logs being interleaved in a way the user cannot easily silence.
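One way to enforce "subclasses only" could look like the following. This is a sketch of the idea in plain Python, assuming hypothetical class names; it is not what PyTorch implements:

```python
# Sketch of the suggestion above: the base Info class cannot be
# instantiated directly; only named subsystem subclasses may be used.
class Info:
    def __init__(self, msg):
        if type(self) is Info:
            raise TypeError(
                "Use a subsystem-specific subclass of Info "
                "(e.g. DistributedInfo) so users can filter logs"
            )
        self.msg = msg

class DistributedInfo(Info):
    """Info messages scoped to the (hypothetical) distributed subsystem."""

ok = DistributedInfo("rank 0 joined the process group")
print(ok.msg)

try:
    Info("unscoped message")
except TypeError:
    print("base class rejected")
```

Checking `type(self) is Info` (rather than `isinstance`) is what allows subclasses to pass through while blocking the base class itself.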

Author

I like that idea

@edward-io

We have torch.monitor (#30), which hasn't seen too much usage; perhaps we can use some of it for the new logging system.

@kumpera

kumpera commented Aug 2, 2022

How is one supposed to selectively change the log verbosity of a specific subsystem?
And how would a subsystem author produce logs that are properly scoped?

While having functions like torch.log_info is nice, it would be great if we could construct logger instances for particular modules. For example:

# distributed.py

logger = torch.Logger("torch.distributed")

def broadcast():
    logger.log_info("broadcasting stuff")

Finally, is there a reason why we can't integrate with Python's logging package, so there's very little users need to do to leverage this?
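For reference, the shape being asked for here maps closely onto Python's standard library logging module, which already provides named, hierarchical, per-subsystem loggers. A sketch using only the stdlib (the `torch.Logger` API above is the proposal; `logging.getLogger` is real):

```python
import logging

# Loggers are named and hierarchical; the name doubles as a
# subsystem scope that users can configure independently.
logger = logging.getLogger("torch.distributed")

def broadcast():
    # Log at INFO level, analogous to the proposed logger.log_info.
    logger.info("broadcasting stuff")

logging.basicConfig(level=logging.INFO)
broadcast()

# Verbosity can then be changed for one subsystem without
# touching any other logger.
logging.getLogger("torch.distributed").setLevel(logging.WARNING)
```

Because handlers and filters attach anywhere in the `torch.*` name hierarchy, users could silence or redirect one subsystem with the standard tools they already know.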

@kurtamohler

kurtamohler commented Aug 4, 2022

@kumpera, good questions.

> How is one supposed to selectively change the log verbosity of a specific subsystem?

At the moment, the general idea is that users will use a filter to silence messages. The specific API for that hasn't been written down yet. Although, in the case of warnings, the warnings module already does this. The missing detail is how the user would silence info messages; I imagine that we would want something with a very similar interface to the warnings module's filter.
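For the warnings side, the stdlib filter already covers this today. A sketch of silencing one category while letting others through (`UnstableFeatureWarning` is a stand-in custom category, not an existing PyTorch class):

```python
import warnings

# Stand-in category for a hypothetical PyTorch warning class.
class UnstableFeatureWarning(UserWarning):
    pass

def noisy():
    warnings.warn("beta feature", UnstableFeatureWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Silence only this category; other warnings still get through.
    warnings.filterwarnings("ignore", category=UnstableFeatureWarning)
    noisy()
    warnings.warn("something else", DeprecationWarning)

print([w.category.__name__ for w in caught])  # ['DeprecationWarning']
```

An info-message filter with the same `filterwarnings`-style interface (action, category, message pattern) is what the comment above seems to be gesturing at.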

> And how would a subsystem author produce logs that are properly scoped?

Authors would use the appropriate message class. If an applicable message class doesn't exist yet, they should add it.

> While having functions like torch.log_info is nice, it would be great if we could construct logger instances for particular modules.

That might be a good idea. What do you think @albanD, @mruberry ?

> Finally, is there a reason why we can't integrate with Python's logging package, so there's very little users need to do to leverage this?

I don't know much about it; I'll read about it and get back to you.

Comment on lines 216 to 244
The APIs for raising errors all check a boolean condition (the `cond` argument in the following signatures) and throw an error if that condition is false.

The error APIs are listed below, with the C++ signature on the left and the
corresponding Python signature on the right.

**`TORCH_CHECK(cond, ...)`** - `torch.check(cond, *args)`
- C++ error: `c10::Error`
- Python error: `RuntimeError`

**`TORCH_CHECK_INDEX(cond, ...)`** - `torch.check_index(cond, *args)`
- C++ error: `c10::IndexError`
- Python error: `IndexError`

**`TORCH_CHECK_VALUE(cond, ...)`** - `torch.check_value(cond, *args)`
- C++ error: `c10::ValueError`
- Python error: `ValueError`

**`TORCH_CHECK_TYPE(cond, ...)`** - `torch.check_type(cond, *args)`
- C++ error: `c10::TypeError`
- Python error: `TypeError`

**`TORCH_CHECK_NOT_IMPLEMENTED(cond, ...)`** - `torch.check_not_implemented(cond, *args)`
- C++ error: `c10::NotImplementedError`
- Python error: `NotImplementedError`

**`TORCH_CHECK_WITH(error_t, cond, ...)`** - `torch.check_with(error_type, cond, *args)`
- C++ error: Specified by `error_t` argument
- Python error: Specified by `error_type` argument
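The Python side of the table above can be illustrated with a plain-Python sketch. The `check_*` names follow the RFC's proposal, but this standalone implementation is illustrative only, not the real PyTorch one:

```python
# Illustrative stand-ins for the proposed torch.check_* helpers.
# Each checks a boolean condition and raises the mapped error type.
def _check_with(error_type, cond, *args):
    if not cond:
        raise error_type(" ".join(str(a) for a in args) or "check failed")

def check(cond, *args):          # maps to TORCH_CHECK -> RuntimeError
    _check_with(RuntimeError, cond, *args)

def check_index(cond, *args):    # maps to TORCH_CHECK_INDEX -> IndexError
    _check_with(IndexError, cond, *args)

def check_value(cond, *args):    # maps to TORCH_CHECK_VALUE -> ValueError
    _check_with(ValueError, cond, *args)

check(1 + 1 == 2, "arithmetic still works")  # passes silently

try:
    check_value(False, "expected a positive value")
except ValueError as e:
    print(e)  # expected a positive value
```

`check_with` is the general form; the others are fixed-error-type conveniences, mirroring how `TORCH_CHECK_WITH(error_t, cond, ...)` generalizes the C++ macros.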

We want to change how asserts are written to support vmap (and/or Tensor Subclasses) over functions that call asserts (e.g. pytorch/pytorch#81767, cc @kshitij12345). In order to support those, the assert should go through the PyTorch dispatcher. Since we are re-thinking how asserts in PyTorch work, this is a good opportunity to figure out how we can have asserts as PyTorch operators.

Some background

folks write functions in C++ like the following:

Tensor safe_log(const Tensor& x) {
  TORCH_CHECK((x > 0).all(), "requires x > 0");
  return x.log();
}

If someone is doing vmap(safe_log)(torch.tensor([-1, 1, 2])), then (x > 0).all() is a Tensor that has multiple values. A Tensor that has multiple values cannot be coerced to a single boolean value, and the above errors out.

A sketch of a solution

For synchronous asserts, we would want two extension points: one to actually check the condition, and the other to raise an error message. So TORCH_CHECK(cond, ...) would be implemented like:

if (!at::is_true(cond)) {
  msg = ...
  at::fail(msg);
}
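A Python analogue can show why splitting the two extension points helps: `is_true` owns the coercion of the condition (so a tensor-like type could override it), and `fail` owns the raising. All names here are illustrative, not PyTorch API:

```python
# Python analogue of the two extension points sketched above:
# is_true evaluates the condition, fail raises. A tensor-like
# condition type could customize is_true without changing call sites.
def is_true(cond):
    # For containers we require every element to be true, mimicking
    # what (x > 0).all() would mean for a tensor condition.
    if isinstance(cond, (list, tuple)):
        return all(cond)
    return bool(cond)

def fail(msg):
    raise RuntimeError(msg)

def torch_check(cond, msg):
    if not is_true(cond):
        fail(msg)

torch_check([True, True, True], "requires x > 0")  # passes

try:
    torch_check([True, False], "requires x > 0")
except RuntimeError as e:
    print(e)  # requires x > 0
```

The point of the indirection is that a dispatcher could route `is_true` differently for batched tensors, which a bare `if (!(cond))` macro cannot do.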


@kurtamohler kurtamohler Nov 10, 2022


@zou3519, I started looking into adding vmap support to TORCH_CHECK, and I realized that it doesn't actually support tensor inputs at all, not even non-batched tensors. For example, if I do this:

TORCH_CHECK((at::zeros(10).eq(0)).all());

I get this build error:

/work2/kurtamohler/development/pytorch-1/c10/util/Exception.h:506:29: error: no match for 'operator!' (operand type is 'at::Tensor')
  506 |   if (C10_UNLIKELY_OR_CONST(!(cond))) {            \

Without at::Tensor::operator!, we currently have to do something like this instead:

TORCH_CHECK((at::zeros(10).eq(0)).all().item<bool>());

So I like the idea of adding support for tensor inputs to TORCH_CHECK to allow us to avoid the explicit .item<bool>(). This would also be a plus for Python/C++ consistency, since in Python torch.tensor(True) and torch.tensor(False) are valid conditionals without needing to call .item() on them.

One way could be to use an at::is_true function, like you mentioned. However, I'm not sure exactly how to do that. Since TORCH_CHECK is defined in c10, it doesn't have access to the at namespace. But of course, since it's a macro, it's just doing substitution, so maybe we could have an #ifdef that checks whether at::is_true is available before using it, like this:

#ifdef AT_IS_TRUE_DEFINED
#define CHECK_COND(cond) at::is_true(cond)
#else
#define CHECK_COND(cond) (cond)
#endif

The #define AT_IS_TRUE_DEFINED would be next to wherever the declaration of at::is_true is, and then we can use CHECK_COND in the definition of TORCH_CHECK.

But when I tried this, I couldn't figure out which file to define at::is_true in such that it's available for any TORCH_CHECK call that might use tensors. It also just seems like kind of a sketchy solution.

Another option is to just define bool at::Tensor::operator!. But then we would have to make it throw an error if the tensor has more than one logical element. Also, in order to make TORCH_CHECK work correctly when cond is a batched tensor, the actual meaning of at::Tensor::operator! would have to be something like !physical_tensor.any(). Those seem like somewhat confusing inconsistencies to me. (Although, in the non-batched case, it is actually consistent with Python, since not x will error if x has more than one element.) But maybe it's actually fine, I'm not sure. I think it's at least less sketchy than my AT_IS_TRUE_DEFINED idea.

Maybe there's an overall better way to do this though

cc @albanD

Contributor

Is there any reason you have to use TORCH_CHECK for this?

Author

As I understood it, the suggestion was to make, for example, TORCH_CHECK((x > 0).all(), "requires x > 0"); work for both batched and non-batched tensors.

Contributor

I guess, for something like this where we're syncing, I think it is fine if we force the call sites to change what they look like. There aren't many of this kind of check.


I'm fine with having something like a TORCH_CHECK_TENSOR that accepts a Tensor to sidestep the issues above. We control what gets written in our C++ internals so we can always refactor if we're not satisfied.

In Python, is it correct that we want torch.check to be public API? If so it needs more scrutiny. My initial thought is that torch.check should do both TORCH_CHECK and TORCH_CHECK_TENSOR (there aren't technical reasons in Python for us to split those up)

Author

I don't think the Python API needs to be public. I can make it private to start with, and we can make it public if we ever find a reason to.

@kumpera

kumpera commented Aug 6, 2022

> And how would a subsystem author produce logs that are properly scoped?
>
> Authors would use the appropriate message class. If an applicable message class doesn't exist yet, they should add it.

Sorry for not being specific here; what I meant by scoping is which subsystem is producing a given log message.

For example, if one is troubleshooting an issue on backwards of a distributed model, they would enable logging for the "autograd" and "distributed" subsystems.

I guess this boils down to whether log messages would carry a subsystem tag that can be used as part of filtering.
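With stdlib-style hierarchical logger names, the name itself serves as that subsystem tag. A sketch of the troubleshooting scenario described above, assuming loggers are named by subsystem (this is not how PyTorch is wired today, just an illustration):

```python
import logging

# Hierarchical logger names act as the subsystem tag: a user
# debugging backward of a distributed model enables just those
# two subsystems and leaves everything else quiet.
for name in ("torch.autograd", "torch.distributed", "torch.jit"):
    logging.getLogger(name).setLevel(logging.WARNING)  # default: quiet

# Troubleshooting: turn on only the two relevant subsystems.
logging.getLogger("torch.autograd").setLevel(logging.DEBUG)
logging.getLogger("torch.distributed").setLevel(logging.DEBUG)

def visible(name, level=logging.DEBUG):
    # True if a message at `level` from this subsystem would be emitted.
    return logging.getLogger(name).isEnabledFor(level)

print(visible("torch.autograd"))     # True
print(visible("torch.distributed"))  # True
print(visible("torch.jit"))          # False
```

If each message carries its originating logger name, filtering by subsystem falls out for free, which is the crux of the question above.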
