This repository has been archived by the owner on Oct 25, 2023. It is now read-only.

[TUZ-150] Add a simplified access point for Unity Flow #32

Merged
merged 162 commits into from
Mar 8, 2023


jwfromm
Contributor

@jwfromm jwfromm commented Feb 27, 2023

This PR adds a tvmc-like simplified API for accessing the Unity flow. The goal is to provide an extremely simple interface that applies every step the flow needs. An example illustrating the API follows:

import tvm.octo
# Compile the model and offload as much as possible
octo_model = tvm.octo.compile("my_model.onnx")
# Benchmark the model
report = octo_model.profile()
# Run specific inputs through the model
outputs = octo_model.run(input_dict)

octo.compile contains utility features like automatically figuring out the current tvm target and generating proper random inputs if not specified.
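
To make the "generating proper random inputs" idea concrete, here is a minimal sketch of how such a helper could look. It is not the code from this PR: the name `make_random_inputs`, the `{name: (shape, dtype)}` layout, and the value ranges are all assumptions chosen for illustration, and it uses only the standard library rather than NumPy or TVM arrays.

```python
import random


def make_random_inputs(input_info, seed=0):
    """Build one random nested list per input from a {name: (shape, dtype)} mapping.

    Hypothetical helper for illustration only; not part of tvm.octo.
    """
    rng = random.Random(seed)

    def draw(dtype):
        # Integer inputs are often indices, so keep them small and non-negative.
        if dtype.startswith(("int", "uint")):
            return rng.randint(0, 9)
        return rng.uniform(-1.0, 1.0)

    def build(shape, dtype):
        # An empty shape means a scalar; otherwise recurse one dimension at a time.
        if not shape:
            return draw(dtype)
        return [build(shape[1:], dtype) for _ in range(shape[0])]

    return {
        name: build(tuple(shape), dtype)
        for name, (shape, dtype) in input_info.items()
    }


input_dict = make_random_inputs(
    {"input_ids": ((1, 4), "int64"), "mask": ((2, 3), "float32")}
)
```

A dictionary shaped like `input_dict` is the kind of argument the `octo_model.run(input_dict)` call in the example above appears to expect, although the exact value types it accepts are not stated here.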

Going forward, the thought is that new features (such as passes) should land in octo.compile to make sure we have a single place that contains everything needed to get great unity flow results.

mehrdadh and others added 14 commits February 24, 2023 17:14
This PR applies appropriate changes to make sure the CI fails if micro_tvmc.sh tutorial fails. This issue was captured in #14074.
This PR also makes changes to avoid this breakage in bash script tutorials in future. In addition, this PR fixes the bug in running TVMC tutorial which happened due to renaming zephyr_board to board.
This PR introduces an async pipeline into TVM's current MultiLevelTiling rules. It builds on apache/tvm#13966, which is already merged, because some conv2d workloads use `tir.if_then_else` to pad the input to the correct size, and this PR applies async copy to such copy statements.

1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotates async copy for MLT on supported architectures (>= sm80).

On CUDA cores, this PR gives a performance boost of around 1 TFLOP/s in most Conv2d test cases and 1 to 2 TFLOP/s in most GEMM test cases.

All generated codes, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark.

Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` with the conv2d NHWC workload on an RTX 3080 GPU.

**Notice: given the stochastic nature of evolutionary search, performance might become worse with this PR enabled.**

Workload: Conv2d NHWC

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|

Workload: GEMM NN

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|
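
The "Performance Boost" column in both tables is consistent with the relative change of the async result over the mainline result. The short sketch below recomputes two rows under that assumption; the function name is made up for illustration, and the throughput unit is assumed to be GFLOP/s based on the text above.

```python
def performance_boost(mainline, with_async):
    """Relative change of the async result over mainline, in percent."""
    return (with_async - mainline) / mainline * 100.0


# First Conv2d row (a gain) and the M=512_N=3072_K=768 GEMM row (a regression).
conv_boost = performance_boost(13838.05219, 14687.89452)
gemm_boost = performance_boost(19709.39, 18351.61)
print(f"conv2d: {conv_boost:.2f}%  gemm: {gemm_boost:.2f}%")
```

Negative values, such as the three GEMM rows that regress, are the cases the notice about stochastic search refers to.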
…Script (#14111)

This PR adds an op attribute `TScriptDtypePrintLocation` and modifies the dtype printing logic of the builtin op to check this attribute, so that user-defined operators can specify how their dtype arguments are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.
Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced more than once in an IRModule, each reference gets a different index into the metadata array. For example, this code

```
str_imm = T.StringImm("aaa\nbbb\n")
@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm
Module.show()
```

where `str_imm` is referenced three times, generates the following output:

```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```

Each reference gets a different metadata index.

This PR fixes the problem by detecting duplicate items in `IRDocsifierNode::AddMetadata`.
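
The deduplication idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the C++ code in `IRDocsifierNode::AddMetadata`: the class name `MetadataTable`, its method, and the choice to compare by object identity are assumptions.

```python
class MetadataTable:
    """Hypothetical stand-in for the printer's metadata array, grouped by type key."""

    def __init__(self):
        self._entries = {}  # type key -> list of stored objects

    def add(self, type_key, obj):
        """Return the index of obj, reusing the existing slot if it was already added."""
        entries = self._entries.setdefault(type_key, [])
        for index, existing in enumerate(entries):
            if existing is obj:  # same object referenced again: reuse its index
                return index
        entries.append(obj)
        return len(entries) - 1


table = MetadataTable()
str_imm = ["aaa\nbbb\n"]  # any object works; a list stands in for tir.StringImm
indices = [table.add("tir.StringImm", str_imm) for _ in range(3)]
other_index = table.add("tir.StringImm", ["aaa\nbbb\n"])  # equal content, different object
```

With this scheme the three references in the example above would all print as `metadata["tir.StringImm"][0]`, while a distinct object gets the next free index.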
Minor fix in the PyTorch frontend to compile the GPT-2 model. Reproduction script and versions:
torch_version = 1.13.1
transformers_version = 4.26.1

```
from transformers import GPT2LMHeadModel
import torch
import tvm
from tvm import relay

inp = torch.ones((1, 128)).to(torch.int64)
input_shapes = [("input_ids", ((1, 128), "int64"))]

model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)
trace_model = torch.jit.trace(model, inp, strict=False)
outputs = trace_model(inp)

mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target='llvm', params=params)

runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0)))
runtime.set_input("input_ids", inp.numpy())
runtime.run()
out = runtime.get_output(0).numpy()
print(out)
print('Done...')
```

Before the fix, the script failed with this error message:
```
Traceback (most recent call last):
  File "gpt2_compile.py", line 13, in <module>
    mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators
    relay_out = relay_op(
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full
    return self.full_impl(data, fill_value, dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl
    fill_value = _expr.const(fill_value, dtype=dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray
```

The error occurs because `fill_value` is an expression rather than a scalar:
```
%0 = cast(64, dtype="float32");
power(%0, 0.5f)
```
Make the Android NDK version configurable via a command-line argument.
This PR is another way to fix the issue described in #14118.

Since there is no standard for how float numbers are formatted in a JSON
file (for example, we cannot require a JSON producer to print
"integer" floats with at least one decimal), and the JSON
parser is not responsible for deciding whether an integer in a JSON file
should be parsed as a float or an int, the most convenient fix for
the SampleCategorical issue is to allow both FloatImms and IntImms
as input and convert all IntImms to FloatImms accordingly.

This PR fixes the issue in this way.
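
The underlying problem is easy to reproduce with any JSON parser: a probability written as `0` comes back as an integer, while `0.5` comes back as a float. The sketch below uses Python's `json` module to show this and a hypothetical `normalize_probs` helper that mirrors the "accept both, convert to float" approach described above; it is not the TVM implementation.

```python
import json


def normalize_probs(raw_probs):
    """Accept ints and floats, and return every probability as a float."""
    probs = []
    for value in raw_probs:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"expected a number, got {type(value).__name__}")
        probs.append(float(value))
    return probs


decoded = json.loads("[0.5, 0.5, 0]")  # the last entry is parsed as an int
probs = normalize_probs(decoded)
```

After normalization every entry has the same type, so downstream code no longer needs to care how the producer chose to print whole-number probabilities.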
…… (#13918)

[Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range

This commit simplifies the floormod and floordiv conversion check for
the negative range and adds a test that covers all integer data types.
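
As a rough illustration of why a fixed int64 range is the wrong assumption, the sketch below computes the representable range of a few integer dtypes and checks each against int64. The helper `int_dtype_range` is hypothetical and unrelated to the actual ConstIntBound code; it only shows that the valid bounds depend on the data type.

```python
def int_dtype_range(dtype):
    """Return (min, max) for dtype strings such as "int8" or "uint64"."""
    if dtype.startswith("uint"):
        bits = int(dtype[len("uint"):])
        return 0, (1 << bits) - 1
    if dtype.startswith("int"):
        bits = int(dtype[len("int"):])
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    raise ValueError(f"not an integer dtype: {dtype}")


INT64_MIN, INT64_MAX = int_dtype_range("int64")
for dtype in ("int8", "uint8", "int32", "uint64"):
    low, high = int_dtype_range(dtype)
    fits = INT64_MIN <= low and high <= INT64_MAX
    print(dtype, low, high, "fits in int64" if fits else "exceeds int64")
```

Narrow types have much tighter bounds than int64, and `uint64` has a maximum that int64 cannot represent at all, which is why a test over every integer dtype is useful.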
* [CMSIS-NN] Reduction in code size of AOT test runner binary


Co-authored-by: Ashutosh Parkhi
[CMSIS-NN] Add a runtime error message

APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN
errors caught in the backend. AOT test runner was improved to observe the contents
of this global variable. A test was added to check for the last set error as part of this
commit.
This PR removes old macros in crt_config.h.template.
* fix squeeze shape function issue and add testcase.

* fix lint
…m_inject_ptx_async_copy (#14138)

This PR merges two related unit tests into one.
* Fix TFLite frontend bug and add test

* lint
* remove scatter attr class

* update pytorch: scatter was replaced by scatter_elements

* remove scatter compute and strategy registration

* remove scatter attrs registration

* update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks

* update oneflow front-end

* update paddlepaddle front-end

* update pytorch utils

* remove front-end scatter definition

* fix scatter strategy for rocm

* small update

* remove scatter definition in back-end

* remove scatter strategy for cuda, gpu. transfer special case to scatter_elements

* fix test

* small fix

* upstream scatter with torch description

* last upstream of scatter in pytorch front-end

* fix reduction attribute in cuda strategy

* set scalar to test instead of tensor. update check for dynamic dim

* skip scalar source check in tests for scatter due to issue on torch side

* remove scatter op implementation from topi/cuda

* remove scatter op implementation from topi. small clean code

---------

Co-authored-by: Valery Chernov
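
For readers unfamiliar with the operator that replaces `scatter` in the commits above, here is a minimal pure-Python sketch of `scatter_elements` semantics for 2-D inputs, following the ONNX ScatterElements definition with no reduction. The function name `scatter_elements_2d` is made up for illustration, and this is not the TVM or TOPI implementation.

```python
import copy


def scatter_elements_2d(data, indices, updates, axis=0):
    """Copy data, then write updates[i][j] at the position selected by indices[i][j].

    Along `axis`, the coordinate comes from indices; the other coordinate is kept.
    """
    out = copy.deepcopy(data)
    for i, row in enumerate(indices):
        for j, idx in enumerate(row):
            if axis == 0:
                out[idx][j] = updates[i][j]
            else:
                out[i][idx] = updates[i][j]
    return out


data = [[0.0, 0.0, 0.0] for _ in range(3)]
indices = [[1, 0, 2], [0, 2, 1]]
updates = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
result = scatter_elements_2d(data, indices, updates, axis=0)
```

The sketch ignores the reduction attribute mentioned in the commit list, and it assumes every index is in range.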
Contributor

@driazati driazati left a comment


Looks pretty good! This will make the unity runners in the scorecard CI much simpler to implement and maintain.

tqchen and others added 3 commits February 27, 2023 20:14
Remove relax prefix for now

This PR cleans up the relax prefix in the printer for now.
While these setups are useful and do not cause any
technical debt in the codebase, we remove them given requests.
They can be added back to the unity branch and later as part
of upstream.
Supports legalizing a Relay sum operation to an equivalent series of NPU operations. It supports the case with an int8 output type and the channel axis.
…14143)

This PR fixes redundant stages when `InitializeWithTuneContext` is visited
multiple times.
Hzfengsy and others added 2 commits March 1, 2023 07:41
Please join me in welcoming @chengven027-intellif as a new Reviewer in TVM.
Cheng has contributed to ONNX/PyTorch frontend and Relay passes, making TVM support more input models.

- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)
…odels (#14147)

* add load_params_from_file

* add save_params_to_file

* avoid making another copy in save_params

* black

* add test

* update doc
@jwfromm
Contributor Author

jwfromm commented Mar 8, 2023

I think all feedback has been addressed, so I'm going to merge this. The GPU CI instance seems to be having some trouble with Cutlass, which is probably something we should fix, but otherwise everything seems to be working as expected. Thank you for all the feedback, folks!

@jwfromm jwfromm merged commit baa2516 into relax Mar 8, 2023
@jwfromm jwfromm deleted the TUZ-150 branch March 8, 2023 22:56
vinx13 pushed a commit to vinx13/relax-octo that referenced this pull request Mar 29, 2023
* fixes

* revert checked_type visitor and fix relax usage

* ExprNormalizer

* fix that annoying bug and get tests passing

* Memoization fix for the ExprMutator; separate VisitVarDef from use.

* rebase.

* rebase.

* address part of comments.

* address more comments

* address more comments and add doc

* address more comments

* fix potential mutation bug

* always assign normalized shape if can

* address comments

Co-authored-by: Altan Haan