[RFC] Structured Error Handling Mechanism #2279

JennyChou5 · 2018-12-12T08:34:53Z

Problem Statement
TVM is gaining users in the research community and in production deployments. Simplifying the user experience can drive faster adoption, particularly among developers and users unfamiliar with the stack. A critical aspect of UX is better error handling. A structured error handling mechanism would help users of the TVM stack reduce the time to debug and deliver long-term dividends in usability. A structured mechanism must separate error-handling code from regular code and provide the ability to propagate errors up the call stack. Although TVM stack uses built-in python/C++ exception handling, the error mechanism does not provide the ability to group and differentiate error types.

Current TVM Error Handling
The TVM stack currently throws built-in python/C++ exceptions as error messages. Parts of the code throw errors at the condition detecting site collecting information relevant to the error while parts of the code throw errors that are then passed implicitly to “exception” handlers, using the now classic try exception idiom. This exception class is an easy-to-use feature in modern object-oriented programming languages. However, the scheme is insufficient for production use when TVM is incorporated into a larger stack or service.

Proposed Error Handling Mechanism
We proposed a structured error-handling mechanism characterized by the following features:
o Represents all errors as functions or objects
o Allows you to define your own error types or classes
o Allows you to explicitly raise an error
o Allows you to catch particular error types in a particular layer
o Allow you to propagate error from the inner layer to the outer layer

First, we separate TVM into 4 major layers: model parsing layer (parse TF, MX, PyTorch, ONNX models into NNVM graph), NNVM pass layer (infer shape and type, graph-level optimization, e.g. op fusion, precompute), TVM compute layer, TVM schedule layer. For each layer, we catalog errors into pre-defined object and separate them in one util folder. All files will import from that file for using the error-handling related classes.

Example of the Proposed Error Handling Mechanism
For example, in the model parsing layer as parsing different frameworks into NNVM graphs, we can define some error-handling functions in one class. Here, we first categorize errors into four types in this layer: OperatorNotImplemented, OperatorAttributeNotImplemented, OperatorAttributeRequired, OperatorAttributeValueNotValid and provide three definitions for that (see functions: _required_attr, _raise_attr_value_error, _raise_not_supported). For all frame parsing files, such as mxnet.py, tenflow.py, onnx.py, they could import the corresponding error-handling class/functions and utilize them inside it. Of note, the following function are from mxnet.py.

In such case, we provide the unified way to categorize the error types in this layer. Users can add different error types easily and provide the ability to propagate error reporting up the call stack.

Risk and Limitation
We have to define the error-handling templates first in each layer and ask TVM’s contributors to follow. In addition, for the existing codebase, moving to structured error- handling scheme will need some effort.

tqchen · 2018-12-13T18:01:33Z

There is a trade-off between a very detailed error messaging system vs the ease of programming. I agree that we should build a reasonable hierarchy in the python side of the error message. But sometimes adding one error type per function can be overkill

In the C++ side, one of the easy ways to solve this problem is to have a specific error message encoding convention by string and allows re-dispatch. Either way, it is good to have a concrete proposal on the proposed error hierarchy and we should do it incrementally.

yidawang · 2018-12-17T08:14:37Z

Thanks @JennyChou5 for the proposal. I think it is a good step of moving TVM to the product line.

What we should do is to define unified error types and encourage contributors to refer to as needed. We also should reasonably categorize the error types so that most of the functions can throw error in known types.

In terms of the layers proposed above (model parsing layer, NNVM/Relay pass layer, TVM compute layer, TVM schedule layer), I think we should add TVM runtime layer. Also, I am not sure if TVM compute and schedule should be divided, perhaps combining them and adding codegen layer instead?

JennyChou5 · 2018-12-17T08:33:46Z

Thanks @tqchen @yidawang for the feedback. Will come out with proposed error hierarchy for review in each layer.

JennyChou5 · 2019-03-03T05:47:03Z

Error Type Categories

As proposed, we separate TVM into 4 major layers: framework parsing layer (parse TF, MX, PyTorch, ONNX models into NNVM or relay), nnvm/replay pass layer (infer shape and type, graph-level optimization, e.g. op fusion, precompute), TVM compute/schedule/codegen layer, TVM runtime layer.

Layer 1: framework parsing
• OperatorAttributeRequired: “Required attribute {} is not found in operator {}.".format(key, opname)
• OperatorAttributeValueNotValid: “Value {} in attr {} is not valid in operator {}".format(value, attr, op_name)
• OperatorNotImplemented: “{} are not supported”.format(missing_operators_list)
• OperatorAttributeNotImplemented: “attribute {} is not supported in operator {}.”. format(key, opname)

Layer 2: nnvm/relay pass
• InferPassError: “Insufficient/Invalid information for {} pass for {} op failed with this {}".format(pass_name, op_name, c_error_msg)
• OrderViolatedPassError: “{} pass need to be run before {} pass".format(pass_A, pass_B)
• OptimizationPassError: “{} pass failed".format(pass_name)
• GraphLoweringPassError: Lowering of {} op failed".format(op_name)

Layer 3: TVM compute/schedule/codegen
• InvalidComputeDef : “Invalid compute definition; check the tvm.compute function {}” .format(op_name)
• InvalidScheduleDef:
- InvalidScheduleOperation “Invalid schedule definition; check the schedule function {} defined under @autotvm.register_topi_schedule({}) or @generic.schedule_{}.register()”.format(op_name, target_name)
- TensorizePatternMismatchErr (msg)
- StorageDefErr (msg)
• TVMIRPassError:
- ConditionPassErr: “Fail to optimize condition statements; check the compute and schedule function to see whether the conditions can be removed, sometimes it is due to indivisible tiling. Details: {}”.format(err_msg)
- StoragePassErr “Fail to run pass on storage. May due to invalid buffer; check the scope, strides, offset, etc. of the buffer. Details: {}”.format(err_msg)
- ArithmeticPassErr “Details: {}”.format(err_msg)
- InjectIntrinErr “Fail to generate hardware intrinsics on what device; check the schedule function. Details: {}”.format(err_msg)
- NonDefinedError “Details: {}”.format(err_msg)
• Target-related Error: “{} observed in Target {}”. format(err_msg, target_name)

Layer 4: TVM runtime
• InvalideKeyErr
- InvalidNodeErr “Graph node with name {} does not exist: {}”.format(node_name, err_msg)
- InvalidOpCodeErr “Invalid OpCode: {}”.format(err_msg)
• FileIOErr: “Fail to access filename or file type at requested location: {}”.format(err_msg)
• TensorAttributeError “Array shape check failed, make sure both size and dimensions match”.format(err_msg)
• DataTypeErr “Bitwidth {} not supported on devices. Details: {}”. format (bitwidth type, err_msg)
• Missing implementation Err: “frequent cause: device specific .cc file missing. Details: {}”.format(err_mag)
• ExternalRuntimeErr “runtime initialization or execution failed: {}”.format(err_msg)
• Warning : MacroNotImplement, etc

Coding Format

As observed that, some of the functionality APIs in TVM are for conversion between Python and C++. However, we prefer using a unified-formatted code which will be easier to change and maintain over time. We prefer having python API as base and call python functions from C++.

For calling python functions from C++, we would like to use methods provided by the C-API (http://docs.python.org/c-api/). This is especially the case for libraries that support so-called “callback” functions. The Python program must pass the Python function object, PyObject. By using PyObject as the basic object for the C-API, we would be able to represent any kind of python basic types (string, float, int,...).

Clean Place for Error Handling APIs

We would like to create a new folder, names err_handling, under python/tvm. Inside this folder, there would be four files (error types in same layer will be placed in one file): FrameworkParsing_Err.py, Pass_Err.py, CompSche_Err.py, runtune_Err.py.

Example 1: inside FrameworkParsing_Err.py

def _raise_attr_required_error (attr, op):
raise AttributeError("OperatorAttributeRequired: Required attribute {} is not found in operator {}.".format(attr,op))
def _raise_attr_value_error(error, op):
raise AttributeError("OperatorAttributeValueNotValid: Value {} in attr {} is not valid in operator {}".format(value, attr, op_name)
def _raise_attr_not_supported_error(attr, op):
raise NotImplementedError("OperatorAttributeNotImplemented: {} is not supported in {}".format(attr, op))
def _raise_op_not_supported_error(missing_op):
raise NotImplementedError("OperatorNotImplemented: The following operators are not implemented: {}".format(missing_op))

Therefore, in order to use framework parsing error APIs, we could import FrameworkParsing_Err in the beginning of the mxnet.py, tensorflow.py, etc under frontend folder and use the defined error type function for throwing the error messages.

Example 2: inside Pass_Err.py

def _raise_infer_pass_error(c_error_msg, pass_name, op_name):
      raise RuntimeError("RelayInferPassError: Insufficient/Invalid information\
      for {} pass for {} op failed with this {}".format(pass_name, op_name, c_error_msg))
def _raise_pass_order_violated(pass_A, pass_B):
      raise RuntimeError("RelayPassOrderViolated: {} pass need to be run before {} pass".format(pass_A, pass_B))
def _raise_optimization_pass_error(pass_name):
    raise RuntimeError("RelayOptimizationPassError: {} pass failed".format(pass_name))
def _raise_graph_lowering_error(op_name):
    raise RuntimeError("RelayGraphLoweringError: Lowering of {} op failed".format(op_name))

JennyChou5 · 2019-03-03T05:48:41Z

@tqchen @yidawang Could you provide feedback? Thinking to start from the frontend parsers. Thanks.

tqchen · 2019-03-16T05:58:44Z

Thanks for the RFC. Sorry I was a bit late in recommend due to my latest travels. Here are some comments to the proposal

Let us start simple, while most of the proposal makes sense, there is always a risk of over-categorize, starting from front-end taxonomy makes sense
- We can open new RFC for new Error categories when we start to need them.
Do not use C++ to python callback to handle c++ exception
- The main reasoning is that we do not want to have too many backward dep from c++ to python
- Instead, use a common string format(e.g. prefix by Exception name) and translation
- Introduce an error translation function that translates TVMError message into a specific error type.
I personally do not like the raise_xyz wrapper, while for certain cases it is convenient, it is only an additional layer of abstraction. I would simply have specific error types when necessary and raise them instead of call into raise_xyz
- Take a look at other python codebases, e.g. tensorflow or pytorch to see what if there are similar use-cases, at least I did not see such style, which could be a sign on maybe we don't want to do it.
Put the error namespace under tvm/error.py (single file, because I don't think developers would ever be able to understand error hierarchy that goes beyond a single file) keep things simple
- Document each error class
- Edit the RFC to have a clear minimum error convention about each type of error and example.
- PR the error guideline to contribute/error_handling.rst

Perhaps I am being pedantic, but I really want us to build something that is minimum and elegant. Adding layers of abstraction always can solve problems in the short term, but could bring longer-term technical debt, as developers need to spend longer time understand how to use things.

Starting simple, and clearly, document the error class and string formatting convention that a developer can copy paste into the codebase would be very useful for such consolidation.

tqchen · 2019-03-16T06:08:09Z

Would also like to see comments from other community folks @dmlc/tvm-committer

tqchen · 2019-03-18T00:36:27Z

I created #2838 to provide some minimum scaffolds on the topic. @jroesch and @wweic raise the question about whether should we provide a more attribute aware error constructor.

   def clear_throw():
        raise OpNotImplemented("Operator relu is not implemented in the MXNet fronend")

    def _op_not_implemented(op_name):
        return OpNotImpelemented("Operator {} is not implemented.").format(op_name)

    def wrapper_based():
        raise _op_not_implemented("relu")

As demonstrated by the above examples. I personally prefer clear_throw style.

clear_throw is quite self-contained about the error type and what is the error message.
It removes a layer of abstraction and opens ways for error message customization
The developer can simply copy paste edit the code given that it is a one-liner or two to three lines, which is not as bad(especially when the wrapper function is not in the same file).

JennyChou5 · 2019-03-18T01:01:29Z

I agree clear_throw approached is easy to use for developers. However, the detailed error message might not be unified and can be generated error by typo. Vote for the second approach:

def _op_not_implemented(op_name):
        return OpNotImpelemented("Operator {} is not implemented.").format(op_name)

This way we can unify the error message and it could be easy for developers/users to follow. Of note, it would be good to make sure the API name is clear and simple for developers to understand (_op_not_implemented means operator not implemented)

tqchen · 2019-03-18T01:14:25Z

I would like to point out that adopting clear_throw style does not prevent us to get a unified error message. We can encourage the unified format by introducing a standard "error message convention" (see https://github.com/dmlc/tvm/pull/2838/files#diff-c89317db650ce6bced36dadc47755c5fR51) developers encouraged to copy and then edit.

Surprisingly, we can go quite far by using copy-editing for such kind of one-liner message and keep things consistent (this is not that different from writing similar docstrings).

Secondly, there is always a need for refining the error message. For example, in the case of clear_throw, it says the operator is not supported "in the MXNet frontend". If we want to have a common function (_op_not_implemented):

If it is specific to from_mxnet.py, then it is fine, and we can refine by adding "in the MXNet frontend"
If it is global, then the function itself can become complicated (developer need to go across files to understand such one-liner)

The plus side is that the code is easier to maintain:

Developers can easily follow what is the error message without go across the files
Refactoring error message style(which is rare) means find-replace and is usually supported by text editors.

As a middle ground, I think I am fine introducing _op_not_implemented as long as it is in the same file. So it is still very easy to find and allows specific error message crafting to some extent

tqchen · 2019-04-03T03:02:15Z

Close as the first part scaffolding is in. Let us open new RFCs for new error class proposals

JennyChou5 changed the title ~~Structured Error Handling Mechanism~~ [RFC] Structured Error Handling Mechanism Dec 13, 2018

tqchen added the status: RFC label Dec 13, 2018

hcho3 mentioned this issue Dec 14, 2018

Unify logging facilities. dmlc/xgboost#3982

Merged

yidawang mentioned this issue Mar 16, 2019

Unified error handling in NNVM and Relay frontends #2828

Merged

tqchen mentioned this issue Mar 18, 2019

[RUNTIME] Scaffold structured error handling. #2838

Merged

4 tasks

tqchen closed this as completed Apr 3, 2019

kice mentioned this issue Dec 16, 2019

[WINDOWS] Fix python exception Creation in Windows #4529

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Structured Error Handling Mechanism #2279

[RFC] Structured Error Handling Mechanism #2279

JennyChou5 commented Dec 12, 2018 •

edited

Loading

tqchen commented Dec 13, 2018

yidawang commented Dec 17, 2018

JennyChou5 commented Dec 17, 2018

JennyChou5 commented Mar 3, 2019

JennyChou5 commented Mar 3, 2019 •

edited

Loading

tqchen commented Mar 16, 2019 •

edited

Loading

tqchen commented Mar 16, 2019

tqchen commented Mar 18, 2019 •

edited

Loading

JennyChou5 commented Mar 18, 2019 •

edited by tqchen

Loading

tqchen commented Mar 18, 2019 •

edited

Loading

tqchen commented Apr 3, 2019

[RFC] Structured Error Handling Mechanism #2279

[RFC] Structured Error Handling Mechanism #2279

Comments

JennyChou5 commented Dec 12, 2018 • edited Loading

tqchen commented Dec 13, 2018

yidawang commented Dec 17, 2018

JennyChou5 commented Dec 17, 2018

JennyChou5 commented Mar 3, 2019

JennyChou5 commented Mar 3, 2019 • edited Loading

tqchen commented Mar 16, 2019 • edited Loading

tqchen commented Mar 16, 2019

tqchen commented Mar 18, 2019 • edited Loading

JennyChou5 commented Mar 18, 2019 • edited by tqchen Loading

tqchen commented Mar 18, 2019 • edited Loading

tqchen commented Apr 3, 2019

JennyChou5 commented Dec 12, 2018 •

edited

Loading

JennyChou5 commented Mar 3, 2019 •

edited

Loading

tqchen commented Mar 16, 2019 •

edited

Loading

tqchen commented Mar 18, 2019 •

edited

Loading

JennyChou5 commented Mar 18, 2019 •

edited by tqchen

Loading

tqchen commented Mar 18, 2019 •

edited

Loading