This repository has been archived by the owner on Oct 25, 2023. It is now read-only.
[TUZ-150] Add a simplified access point for Unity Flow #32
This PR makes the CI fail if the micro_tvmc.sh tutorial fails. This issue was captured in #14074. This PR also makes changes to avoid this kind of breakage in bash script tutorials in the future. In addition, this PR fixes a bug in running the TVMC tutorial that was caused by renaming zephyr_board to board.
This PR introduces an async pipeline in TVM's current MultiLevelTiling rules. This PR is based on apache/tvm#13966, which is already merged. This is because some conv2d workloads use `tir.if_then_else` to pad the input to the correct size, and this PR uses async copy in such copy statements.

1. Add a subrule in `src/meta_schedule/schedule_rule/multi_level_tiling.h/.cc` that annotates async copy for MLT on supported architectures (>= sm80).

In CUDA Core, this PR has a perf boost of around 1T GFLOP/s in most Conv2d test cases and 1T ~ 2T in most GEMM test cases.

All generated codes, scripts, and traces are available at https://github.com/Rainy-Memory/tvm-async-rule-benchmark.

Currently tested on commit `afbfb7aa7e43732cb716f8e443df696110be6afc` in conv2d NHWC workload, with an RTX 3080 GPU.

**Notice: given the stochastic nature of evolutionary search, performance might become worse if this PR is enabled.**

Workload: Conv2d NHWC

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|N=1_H=224_W=224_C=3_K=64_R=7_S=7_STR=2_PAD=3_DIL=1|13838.05219|14687.89452|6.141343581679319%|
|N=1_H=56_W=56_C=64_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|5398.305085|5613.892553|3.9936140067192905%|
|N=1_H=56_W=56_C=64_K=64_R=3_S=3_STR=1_PAD=1_DIL=1|11652.96825|13157.88249|12.91442839038028%|
|N=1_H=56_W=56_C=64_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|10638.8309|11674.68499|9.736540600527816%|
|N=1_H=56_W=56_C=256_K=64_R=1_S=1_STR=1_PAD=0_DIL=1|8692.32829|9469.264089|8.938178277203573%|
|N=1_H=56_W=56_C=256_K=128_R=1_S=1_STR=2_PAD=0_DIL=1|4685.767442|5698.19634|21.606469175684712%|
|N=1_H=28_W=28_C=128_K=128_R=3_S=3_STR=1_PAD=1_DIL=1|9872.787087|10404.60405|5.38669535070061%|
|N=1_H=28_W=28_C=128_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|9974.281496|10073.31657|0.9929043414276753%|
|N=1_H=28_W=28_C=512_K=128_R=1_S=1_STR=1_PAD=0_DIL=1|7075.866932|8564.572712|21.039199780135142%|
|N=1_H=28_W=28_C=512_K=256_R=1_S=1_STR=2_PAD=0_DIL=1|3648.330914|4021.923142|10.240086132713124%|
|N=1_H=14_W=14_C=256_K=256_R=3_S=3_STR=1_PAD=1_DIL=1|8192.954618|9160.182054|11.805599824451525%|
|N=1_H=14_W=14_C=256_K=1024_R=1_S=1_STR=1_PAD=0_DIL=1|8008.870153|9362.825279|16.90569456283206%|
|N=1_H=14_W=14_C=1024_K=256_R=1_S=1_STR=1_PAD=0_DIL=1|5210.062241|6051.208379|16.144646629759908%|
|N=1_H=14_W=14_C=1024_K=512_R=1_S=1_STR=2_PAD=0_DIL=1|2550.787202|3587.902938|40.65865373586739%|
|N=1_H=7_W=7_C=512_K=512_R=3_S=3_STR=1_PAD=1_DIL=1|4350.626084|5432.788068|24.873706981617943%|
|N=1_H=7_W=7_C=512_K=2048_R=1_S=1_STR=1_PAD=0_DIL=1|6672.068026|7663.725217|14.862815953549454%|
|N=1_H=7_W=7_C=2048_K=512_R=1_S=1_STR=1_PAD=0_DIL=1|3142.564263|4297.988014|36.766909259541826%|

Workload: GEMM NN

|Shape|Mainline TVM|Mainline TVM with Async|Performance Boost|
|-|-|-|-|
|M=512_N=256_K=640|8678.46|10607.37|22.226408832903555%|
|M=512_N=384_K=256|8109.13|10290.72|26.902886006267003%|
|M=512_N=512_K=512|11419.83|14000.86|22.601299669084398%|
|M=512_N=3072_K=768|19709.39|18351.61|-6.8890006235606425%|
|M=512_N=768_K=3072|12844.59|13730.88|6.90010346768561%|
|M=896_N=896_K=896|16149.91|16131.39|-0.11467556165947945%|
|M=1024_N=1024_K=1024|18842.11|19662.8|4.355616223448428%|
|M=1152_N=1152_K=1152|15386.79|16736.1|8.769275462913303%|
|M=1536_N=1536_K=1536|18522.67|18872.06|1.88628313304725%|
|M=2048_N=2048_K=2048|19515.42|18874.85|-3.282378754851291%|
|M=3072_N=3072_K=3072|19233.9|19291.42|0.2990553137948975%|
|M=4096_N=4096_K=4096|17122.17|19259.01|12.479960191961652%|
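The Performance Boost column appears to be the relative change of the async result over the mainline result, expressed as a percentage. A minimal sketch of that arithmetic, assuming this reading of the tables; `performance_boost` is a hypothetical helper name and is not part of the PR:

```python
def performance_boost(mainline: float, with_async: float) -> float:
    """Relative change of `with_async` over `mainline`, in percent.

    Assumption: this is how the "Performance Boost" column was derived.
    """
    return (with_async - mainline) / mainline * 100.0


# First Conv2d row: a positive value means the async variant was faster.
conv_boost = performance_boost(13838.05219, 14687.89452)

# GEMM row M=512_N=3072_K=768: a negative value means a regression.
gemm_boost = performance_boost(19709.39, 18351.61)

print(f"{conv_boost:.2f}% {gemm_boost:.2f}%")
```

Negative entries are consistent with the notice about the stochastic nature of the search: a single tuning run can land on a worse schedule.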
…Script (#14111) This PR adds an op attribute `TScriptDtypePrintLocation` and modifies the dtype printing logic of the builtin ops to check this attribute, so that user-defined operators can specify how their dtype arguments are printed by appending attributes instead of appending members to `dtype_first_arg`/`dtype_last_arg`.
Currently, if the same metadata object (e.g. a multi-line `tir.StringImm`) is referenced more than once in an IRModule, each reference gets a different index into the metadata array. For example, this code

```
from tvm.script import ir as I, tir as T

str_imm = T.StringImm("aaa\nbbb\n")

@I.ir_module
class Module:
    @T.prim_func
    def foo() -> None:
        A = str_imm
        B = str_imm

    @T.prim_func
    def foo1() -> None:
        A = str_imm

Module.show()
```

where `str_imm` is referenced three times, generates the following output:

```
@I.ir_module
class Module:
    @T.prim_func
    def foo():
        A: T.handle = metadata["tir.StringImm"][0]
        B: T.handle = metadata["tir.StringImm"][1]
        T.evaluate(0)

    @T.prim_func
    def foo1():
        A: T.handle = metadata["tir.StringImm"][2]
        T.evaluate(0)
```

Each reference has a different metadata index. This PR fixes this problem by detecting duplicate items in `IRDocsifierNode::AddMetadata`.
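The idea of detecting duplicates when adding metadata can be sketched in plain Python. This is only an illustration under stated assumptions: `MetadataTable` and its `add` method are hypothetical names, the duplicate check here uses object identity, and the actual logic in `IRDocsifierNode::AddMetadata` may compare items differently.

```python
class MetadataTable:
    """Hypothetical stand-in for a printer's metadata array (not TVM's class)."""

    def __init__(self):
        self._items = []

    def add(self, obj) -> int:
        # Reuse the index of an object that was already recorded.
        # Assumption: duplicates are detected by identity (`is`).
        for index, existing in enumerate(self._items):
            if existing is obj:
                return index
        self._items.append(obj)
        return len(self._items) - 1


table = MetadataTable()
str_imm = ["aaa\nbbb\n"]  # placeholder for a metadata object

# Three references to the same object share one index.
indices = [table.add(str_imm) for _ in range(3)]

# A distinct object with equal contents still gets its own slot here.
other_index = table.add(["aaa\nbbb\n"])

print(indices, other_index)
```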
Minor fix in the PyTorch frontend to compile the gpt2 model. Reproduce script:

torch_version = 1.13.1
transformers_version = 4.26.1

```
from transformers import GPT2LMHeadModel
import torch
import tvm
from tvm import relay

inp = torch.ones((1, 128)).to(torch.int64)
input_shapes = [("input_ids", ((1, 128), "int64"))]
model = GPT2LMHeadModel.from_pretrained('gpt2', return_dict=False)
trace_model = torch.jit.trace(model, inp, strict=False)
outputs = trace_model(inp)
mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target='llvm', params=params)
runtime = tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.device('cpu', 0)))
runtime.set_input("input_ids", inp.numpy())
runtime.run()
out = runtime.get_output(0).numpy()
print(out)
print('Done...')
```

Before the fix, the error message was

```
Traceback (most recent call last):
  File "gpt2_compile.py", line 13, in <module>
    mod, params = relay.frontend.from_pytorch(trace_model, input_shapes)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4791, in from_pytorch
    outputs = converter.convert_operators(_get_operator_nodes(graph.nodes()), outputs, ret_name)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 4164, in convert_operators
    relay_out = relay_op(
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 841, in full
    return self.full_impl(data, fill_value, dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/frontend/pytorch.py", line 743, in full_impl
    fill_value = _expr.const(fill_value, dtype=dtype)
  File "/home/ubuntu/apache_tvm/tvm/python/tvm/relay/expr.py", line 707, in const
    raise ValueError("value has to be scalar or NDArray")
ValueError: value has to be scalar or NDArray
```

because `fill_value` is

```
%0 = cast(64, dtype="float32");
power(%0, 0.5f)
```
Make the Android NDK version configurable via a command-line argument
This PR is another way to fix the issue described in #14118. Since we do not have a standard for json file on the format of float numbers (for example, we cannot require a json file producer to print the "integer" float numbers with at least one decimal), and the json parser is not responsible for determining if an integer in a json file should be parsed to a float or an int, the most convenient way of fixing the SampleCategorical issue will be allowing both FloatImms and IntImms as input, and converting all IntImms to FloatImms accordingly. This PR fixes the issue in this way.
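The ambiguity described above is easy to reproduce with any JSON parser: a float written without a decimal part comes back as an integer. A minimal Python sketch of the "accept both, convert to float" approach; `normalize_probs` is a hypothetical helper for illustration, not the TVM implementation:

```python
import json


def normalize_probs(raw):
    """Accept ints and floats and return a list of floats.

    Hypothetical helper mirroring the idea of converting integer
    immediates to float immediates.
    """
    probs = []
    for value in raw:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"expected a number, got {type(value).__name__}")
        probs.append(float(value))
    return probs


# A producer that prints "integer" floats without a decimal part.
parsed = json.loads("[1, 0, 0.5]")
parsed_types = [type(v).__name__ for v in parsed]

probs = normalize_probs(parsed)
print(parsed_types, probs)
```

Doing the conversion at the consumer keeps the JSON producer and parser unchanged, which is the trade-off the description argues for.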
…… (#13918) [Arith] ConstIntBound was incorrectly assuming bounds were over int64_t range This commit improved the floormod and floordiv conversion check to be simpler for the negative range and adds a test to cover all integer data types.
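To illustrate why bounds should follow the actual data type rather than the int64_t range, here is a small Python sketch. It is not TVM's analyzer: `int_bounds` and `floordiv_bounds` are hypothetical helpers, and only a positive constant divisor is handled.

```python
def int_bounds(bits: int, signed: bool = True):
    """Inclusive value range of an integer type with the given bit width."""
    if signed:
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


def floordiv_bounds(lo: int, hi: int, divisor: int):
    """Bounds of floordiv(x, divisor) for x in [lo, hi], divisor > 0.

    Floor division is monotonic for a positive divisor, so the
    endpoints map directly. Python's // already rounds toward -inf.
    """
    if divisor <= 0:
        raise ValueError("this sketch only handles a positive divisor")
    return lo // divisor, hi // divisor


int8_range = int_bounds(8)
uint8_range = int_bounds(8, signed=False)
int8_div4 = floordiv_bounds(*int8_range, 4)

# Floor semantics on negative values: rounds down, remainder is non-negative.
neg_div, neg_mod = -7 // 4, -7 % 4

print(int8_range, uint8_range, int8_div4, neg_div, neg_mod)
```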
* [CMSIS-NN] Reduction in code size of AOT test runner binary Co-authored-by: Ashutosh Parkhi <[email protected]>
[CMSIS-NN] Add a runtime error message

APIs TVMAPISetLastError and TVMGetLastError are used to propagate CMSIS-NN errors caught in the backend. AOT test runner was improved to observe the contents of this global variable. A test was added to check for the last set error as part of this commit.
This PR removes old macros in crt_config.h.template.
* fix squeeze shape function issue and add testcase. * fix lint
…m_inject_ptx_async_copy (#14138) This PR merges two related unit tests into one.
* Fix TFLite frontend bug and add test * lint
ehsanmok
reviewed
Feb 27, 2023
ehsanmok
reviewed
Feb 28, 2023
* remove scatter attr class
* update pytorch: scatter was replaced by scatter_elements
* remove scatter compute and strategy registration
* remove scatter attrs registration
* update onnx front-end: replace _op.scatter by _op.scatter_elements, add checks
* update oneflow front-end
* update paddlepaddle front-end
* update pytorch utils
* remove front-end scatter definition
* fix scatter strategy for rocm
* small update
* remove scatter definition in back-end
* remove scatter strategy for cuda, gpu. transfer special case to scatter_elements
* fix test
* small fix
* upstream scatter with torch description
* last upstream of scatter in pytorch front-end
* fix reduction attribute in cuda strategy
* set scalar to test instead of tensor. update check for dynamic dim
* skip scalar source check in tests for scatter due to issue on torch side
* remove scatter op implementation from topi/cuda
* remove scatter op implementation from topi. small clean code

---------

Co-authored-by: Valery Chernov <[email protected]>
driazati
reviewed
Feb 28, 2023
Looks pretty good! This will make the unity runners in the scorecard CI much simpler to implement and maintain
Remove relax prefix for now

This PR cleans up the relax prefix in the printer for now. While these setups are useful and do not cause any technical debt in the codebase, we remove them given requests. They can be added back to the unity branch and later as part of upstream.
Supports legalizing a relay sum operation to an equivalent series of NPU operations. It supports the case with an int8 output type and the channel axis.
…14143) This PR fixes redundant stages if visiting `InitializeWithTuneContext` multiple times.
haijieg
reviewed
Feb 28, 2023
areusch
reviewed
Feb 28, 2023
Please join me in welcoming @chengven027-intellif as a new Reviewer in TVM. Cheng has contributed to the ONNX/PyTorch frontends and Relay passes, making TVM support more input models.

- [Commits History](https://github.com/apache/tvm/pulls?q=author%3Achengven027-intellif+)
- [Code Review](https://github.com/apache/tvm/pulls?q=reviewed-by%3Achengven027-intellif+)
…odels (#14147)

* add load_params_from_file
* add save_params_to_file
* avoid making another copy in save_params
* black
* add test
* update doc
I think all feedback has been addressed so I'm going to merge this. The GPU CI instance seems to be having some trouble with Cutlass, which is probably something we should fix, but otherwise everything seems to be working as expected. Thank you for all the feedback folks!
vinx13
pushed a commit
to vinx13/relax-octo
that referenced
this pull request
Mar 29, 2023
* fixes
* revert checked_type visitor and fix relax usage
* ExprNormalizer
* fix that annoying bug and get tests passing
* Memoization fix for the ExprMutator; separate VisitVarDef from use.
* rebase.
* rebase.
* address part of comments.
* address more comments
* address more comments and add doc
* address more comments
* fix potential mutation bug
* always assign normalized shape if can
* address comments

Co-authored-by: Altan Haan <[email protected]>
This PR adds a tvmc-like simplified API for accessing the Unity flow. The goal is to provide an extremely simple interface that applies all needed steps of the flow. An example illustrating the API follows:

`octo.compile` contains utility features like automatically figuring out the current tvm target and generating proper random inputs if not specified. Going forward, the thought is that new features (such as passes) should land in `octo.compile` to make sure we have a single place that contains everything needed to get great unity flow results.