[TUZ-150] Add a simplified access point for Unity Flow #32

jwfromm · 2023-02-27T23:36:51Z

This PR adds a tvmc-like simplified API for accessing the Unity flow. The goal is to provide an extremely simple interface that applies all needed steps of the flow. An example illustrating the API follows:

import tvm.octo
# Compile the model and offload as much as possible
octo_model = tvm.octo.compile("my_model.onnx")
# Benchmark the model
report = octo_model.profile()
# Run specific inputs through the model
outputs = octo_model.run(input_dict)

`octo.compile` includes utility features such as automatically detecting the current TVM target and generating suitable random inputs when none are specified.
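As a rough illustration of what automatic target detection might involve on an NVIDIA GPU, the sketch below maps a CUDA compute capability string to an `sm_XY` architecture name. The helper name `cuda_arch_from_capability` is hypothetical and is not taken from this PR; the actual logic in `python/tvm/octo/utils/target_info.py` may look quite different.

```python
def cuda_arch_from_capability(capability: str) -> str:
    """Map a compute capability such as "8.6" to an arch name such as "sm_86".

    Hypothetical helper for illustration only; it is not part of tvm.octo.
    """
    major, _, minor = capability.strip().partition(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"unrecognized compute capability: {capability!r}")
    return f"sm_{int(major)}{int(minor)}"


print(cuda_arch_from_capability("8.6"))
```

Rejecting malformed strings early keeps a bad probe result from silently turning into an invalid target.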

Going forward, the thought is that new features (such as passes) should land in octo.compile to make sure we have a single place that contains everything needed to get great unity flow results.

This PR makes the CI fail if the micro_tvmc.sh tutorial fails. This issue was captured in #14074. The PR also makes changes to avoid this breakage in bash script tutorials in the future. In addition, it fixes a bug in running the TVMC tutorial that was caused by renaming zephyr_board to board.

This PR introduces an async pipeline in TVM's current MultiLevelTiling rules. It is based on apache/tvm#13966, which is already merged. This is because some conv2d workloads use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statements.

1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotates async copy for mlt on supported architectures (>= sm80).

In CUDA Core, this PR has a perf boost of around 1T GFLOP/s in most Conv2d test cases and 1T ~ 2T in most GEMM test cases.

All generated code, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark.

Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` in conv2d NHWC workload, with an RTX 3080 GPU.

**Notice: given the stochastic nature of evolutionary search, performance might become worse when this PR is enabled.**

Workload: Conv2d NHWC

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|

Workload: GEMM NN

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|
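The "Performance Boost" column appears to be the relative change of the async result over the mainline result. The sketch below reproduces two rows under that assumption; the function name `performance_boost` is illustrative and does not come from the benchmark repository.

```python
def performance_boost(baseline: float, with_async: float) -> float:
    """Relative change of `with_async` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_async - baseline) / baseline * 100.0


# Values copied from the first Conv2d row and the M=512_N=3072_K=768 GEMM row.
conv_boost = performance_boost(13838.05219, 14687.89452)
gemm_boost = performance_boost(19709.39, 18351.61)
print(f"{conv_boost:.2f}% {gemm_boost:.2f}%")
```

A negative value, as in the GEMM row used here, means the async variant was slower on that shape, which is consistent with the notice about the stochastic search.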

…Script (#14111) This PR adds an op attribute `TScriptDtypePrintLocation` and modifies the dtype printing logic of the builtin op to check this attribute, so that user-defined operators can specify how their dtype arguments are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.

Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced more than once in an IRModule, each reference gets a different index into the metadata array. For example, this code

```
from tvm.script import ir as I, tir as T

str_imm = T.StringImm("aaa\nbbb\n")

@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm

Module.show()
```

where `str_imm` is referenced three times, will generate this output:

```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```

Each reference has a different metadata index. This PR fixes this problem by detecting duplicate items in `IRDocsifierNode::AddMetadata`.
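The deduplication idea can be sketched in plain Python as follows. This is only an illustration: the class `MetadataTable` and its `add` method are hypothetical, and the identity comparison is an assumption, since the PR text does not say how `IRDocsifierNode::AddMetadata` decides that two items are duplicates.

```python
from typing import Any, Dict, List


class MetadataTable:
    """Toy stand-in for a printer's metadata arrays, keyed by type name."""

    def __init__(self) -> None:
        self.arrays: Dict[str, List[Any]] = {}

    def add(self, type_key: str, obj: Any) -> int:
        """Return the index of `obj`, appending it only the first time it is seen."""
        array = self.arrays.setdefault(type_key, [])
        for index, existing in enumerate(array):
            if existing is obj:  # assumed rule: same object, so reuse its slot
                return index
        array.append(obj)
        return len(array) - 1


table = MetadataTable()
str_imm = ["aaa\nbbb\n"]  # a list stands in for a tir.StringImm node
indices = [table.add("tir.StringImm", str_imm) for _ in range(3)]
print(indices)
```

With this rule, all three references in the example above would share one slot instead of occupying indices 0, 1, and 2.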

Minor fix in the PyTorch frontend to compile the GPT-2 model. Reproduction script:

torch_version = 1.13.1
transformers_version = 4.26.1

```
from transformers import GPT2LMHeadModel
import torch
import tvm
from tvm import relay

inp = torch.ones((1, 128)).to(torch.int64)
input_shapes = [("input_ids", ((1, 128), "int64"))]

model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)
trace_model = torch.jit.trace(model, inp, strict=False)
outputs = trace_model(inp)

mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target='llvm', params=params)

runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0)))
runtime.set_input("input_ids", inp.numpy())
runtime.run()
out = runtime.get_output(0).numpy()
print(out)
print('Done...')
```

Before the fix, the error message was

```
Traceback (most recent call last):
  File "gpt2_compile.py", line 13, in <module>
    mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators
    relay_out = relay_op(
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full
    return self.full_impl(data, fill_value, dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl
    fill_value = _expr.const(fill_value, dtype=dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray
```

because `fill_value` is

```
%0 = cast(64, dtype="float32");
power(%0, 0.5f)
```

This PR is another way to fix the issue described in #14118. Since we do not have a standard for json file on the format of float numbers (for example, we cannot require a json file producer to print the "integer" float numbers with at least one decimal), and the json parser is not responsible for determining if an integer in a json file should be parsed to a float or an int, the most convenient way of fixing the SampleCategorical issue will be allowing both FloatImms and IntImms as input, and converting all IntImms to FloatImms accordingly. This PR fixes the issue in this way.
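The underlying problem is easy to reproduce with any JSON parser: a float written without a decimal point comes back as an integer. The sketch below shows the "accept both and convert" approach in plain Python; `normalize_probs` is a hypothetical helper, not the code that this PR adds to SampleCategorical.

```python
import json
from typing import List, Union

Number = Union[int, float]


def normalize_probs(probs: List[Number]) -> List[float]:
    """Convert every probability to float, accepting ints such as 1 or 0."""
    for p in probs:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(p, bool) or not isinstance(p, (int, float)):
            raise TypeError(f"expected a number, got {p!r}")
    return [float(p) for p in probs]


# A JSON producer may print 1.0 as "1"; the parser then yields an int.
parsed = json.loads("[1, 0]")
print([type(p).__name__ for p in parsed])
print(normalize_probs(parsed))
```

Converting on the consumer side avoids imposing a number format on every JSON producer.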

…… (#13918) [Arith] ConstIntBound was incorrectly assuming bounds were over the int64_t range. This commit simplifies the floormod and floordiv conversion check for the negative range and adds a test that covers all integer data types.
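To see why the int64_t assumption matters, recall that the representable range depends on the bit width and signedness of the dtype. The helper below is a minimal sketch with a hypothetical name, `int_dtype_range`; it is not the analyzer code touched by this commit.

```python
from typing import Tuple


def int_dtype_range(dtype: str) -> Tuple[int, int]:
    """Return (min, max) for a dtype name such as "int8" or "uint32"."""
    prefix = "uint" if dtype.startswith("uint") else "int"
    bits_text = dtype[len(prefix):]
    if not dtype.startswith(prefix) or not bits_text.isdigit() or int(bits_text) == 0:
        raise ValueError(f"not an integer dtype: {dtype!r}")
    bits = int(bits_text)
    if prefix == "uint":
        return 0, (1 << bits) - 1
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


for name in ("int8", "uint8", "int16", "int32", "int64"):
    print(name, int_dtype_range(name))
```

A bound that is safe for `int64` can overflow for `int8`, so a check written against a single fixed range may draw the wrong conclusion for narrower types.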

[CMSIS-NN] Add a runtime error message. The APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN errors caught in the backend. The AOT test runner was improved to observe the contents of this global variable, and a test was added to check for the last set error as part of this commit.

python/tvm/octo/octo_model.py

python/tvm/octo/utils/target_info.py

* remove scatter attr class
* update pytorch: scatter was replaced by scatter_elements
* remove scatter compute and strategy registration
* remove scatter attrs registration
* update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks
* update oneflow front-end
* update paddlepaddle front-end
* update pytorch utils
* remove front-end scatter definition
* fix scatter strategy for rocm
* small update
* remove scatter definition in back-end
* remove scatter strategy for cuda, gpu. transfer special case to scatter_elements
* fix test
* small fix
* upstream scatter with torch description
* last upstream of scatter in pytorch front-end
* fix reduction attribute in cuda strategy
* set scalar to test instead of tensor. update check for dynamic dim
* skip scalar source check in tests for scatter due to issue on torch side
* remove scatter op implementation from topi/cuda
* remove scatter op implementation from topi. small clean code

---------

Co-authored-by: Valery Chernov
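For readers unfamiliar with the operator that replaces `scatter`, the following pure-Python sketch shows element-wise scatter semantics on a 2-D input along axis 0, following the ONNX ScatterElements definition without a reduction. The function name `scatter_elements_axis0` is illustrative; TVM's `scatter_elements` supports more axes, ranks, and a reduction attribute that are not modelled here.

```python
from typing import List

Matrix = List[List[float]]


def scatter_elements_axis0(data: Matrix, indices: List[List[int]], updates: Matrix) -> Matrix:
    """2-D scatter along axis 0: out[indices[i][j]][j] = updates[i][j]."""
    out = [row[:] for row in data]  # copy so the input is left untouched
    for i, index_row in enumerate(indices):
        for j, target_row in enumerate(index_row):
            out[target_row][j] = updates[i][j]
    return out


data = [[0.0, 0.0, 0.0] for _ in range(3)]
indices = [[1, 0, 2], [0, 2, 1]]
updates = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
for row in scatter_elements_axis0(data, indices, updates):
    print(row)
```

Each update keeps its column and only its destination row is chosen by `indices`.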

driazati

Looks pretty good! This will make the unity runners in the scorecard CI much simpler to implement and maintain

python/tvm/octo/compile.py

python/tvm/octo/utils/target_info.py

Remove relax prefix for now. This PR cleans up the relax prefix in the printer for now. While these setups are useful and do not cause any technical debt in the codebase, we remove them given requests. They can be added back to the unity branch and later as part of upstream.

python/tvm/octo/octo_model.py

python/tvm/octo/compile.py

python/tvm/octo/utils/target_info.py

@chengven027-intellif

Please join me in welcoming @chengven027-intellif as a new Reviewer in TVM. Cheng has contributed to the ONNX/PyTorch frontends and Relay passes, making TVM support more input models.

- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)

jwfromm · 2023-03-08T22:16:46Z

I think all feedback has been addressed, so I'm going to merge this. The GPU CI instance seems to be having some trouble with CUTLASS, which is probably something we should fix, but otherwise everything seems to be working as expected. Thank you for all the feedback, folks!

* fixes
* revert checked_type visitor and fix relax usage
* ExprNormalizer
* fix that annoying bug and get tests passing
* Memoization fix for the ExprMutator; separate VisitVarDef from use.
* rebase.
* rebase.
* address part of comments.
* address more comments
* address more comments and add doc
* address more comments
* fix potential mutation bug
* always assign normalized shape if can
* address comments

Co-authored-by: Altan Haan

mehrdadh and others added 14 commits February 24, 2023 17:14

[DOCKER] Configurable NDK version support (#14000)

d9b0a80

Make the Android NDK version configurable as a command-line argument

[CMSIS-NN] Reduction in code size of AOT test runner binary (#13815)

0e046da

* [CMSIS-NN] Reduction in code size of AOT test runner binary

Co-authored-by: Ashutosh Parkhi

[CRT]Cleanup unused macros in crt_config.h.template (#14125)

bf589f3

This PR removes old macros in crt_config.h.template.

[Fix][Relay] Fix axis transformation in squeeze shape function (#14135)

663f7ae

* fix squeeze shape function issue and add testcase.
* fix lint

[Unittest] merge test_cp_async_in_if_then_else into test_tir_transfor…

4d152fe

…m_inject_ptx_async_copy (#14138) This PR merges two related unit tests into one.

[Frontend][TFLite] Fix conv2d import bug (#14124)

2feb243

* Fix TFLite frontend bug and add test
* lint

jwfromm requested review from areusch, driazati and zxybazh February 27, 2023 23:37

ehsanmok reviewed Feb 27, 2023

python/tvm/octo/octo_model.py

ehsanmok reviewed Feb 28, 2023

python/tvm/octo/utils/target_info.py (outdated)

driazati reviewed Feb 28, 2023

python/tvm/octo/compile.py (outdated)

python/tvm/octo/utils/target_info.py (outdated)

tqchen and others added 3 commits February 27, 2023 20:14

[microNPU] Sum legalization support (#13997)

7d67bb1

Supports legalizing a relay sum operation to an equivalent series of NPU operations. It supports the case with int8 output type and channel axis.

[Fix][MetaSchedule] Fix redundant stages in async pipeline for mlt (#…

7c06de5

…14143) This PR fixes redundant stages when `InitializeWithTuneContext` is visited multiple times.

haijieg reviewed Feb 28, 2023

python/tvm/octo/octo_model.py

areusch reviewed Feb 28, 2023

jwfromm force-pushed the relax branch from 7e1a319 to e3fce10 Compare February 28, 2023 22:07

jwfromm force-pushed the TUZ-150 branch from c248406 to 46d241b Compare February 28, 2023 22:13

Hzfengsy and others added 2 commits March 1, 2023 07:41

[Runtime] Fix high RAM usage when saving / loading parameters of big m…

1043136

…odels (#14147)

* add load_params_from_file
* add save_params_to_file
* avoid making another copy in save_params
* black
* add test
* update doc

Remove now unnecessary ScheduleForTarget pass.

b12320b

jwfromm force-pushed the TUZ-150 branch from 2a1da3a to a06ae38 Compare March 6, 2023 22:55

jwfromm force-pushed the relax branch from bb34d97 to a80dd38 Compare March 6, 2023 23:18

Josh Fromm added 16 commits March 6, 2023 15:19

Go back to standard ci_cpu image

627fb0a

Refactor importer locations for consistency and cleaner import.

3d72050

Add initial octoml utility functions.

1dd7a56

Add gpu target extraction

d843a77

Start compilation helper file.

ea79707

Add entrypoint compile function.

dfe37fd

Add OctoModel helper class and testing.

ab05b8c

Add full feature support and testing.

44f4742

Chatgpt refactoring of my regular expressions.

61e03a4

Cleanup lint issues.

e49e924

API cleanups after rebase.

e7ab17e

Simplify cuda thread binding until Xiyous full pass lands.

dfc0313

Importer improvements such that full flow works end to end

ee05ae0

Improvements after merge with main

b4d172f

Merge branch 'relax' into TUZ-150

57cf2ff

Improve tests

f5b86c1

jwfromm force-pushed the TUZ-150 branch from a06ae38 to f5b86c1 Compare March 7, 2023 22:45

Josh Fromm added 6 commits March 7, 2023 16:05

Lint cleanup

447bc59

Add octo tests to CI

e68e731

Incorporate feedback

022042c

Fix lint

b9813e8

Fix small type bug

b5777e1

Fix test target

dfc2931

jwfromm merged commit baa2516 into relax Mar 8, 2023

jwfromm deleted the TUZ-150 branch March 8, 2023 22:56
