[Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating #5962

Merged
80 commits merged into apache:master on Jul 15, 2020

Conversation

jcf94
Contributor

@jcf94 jcf94 commented Jun 30, 2020

Hi all,
Our previous Ansor PR (#5883) was not clear enough for reviewers to fully understand our design. After some discussion, we changed our upstream plan: we now propose a minimal version of Ansor that contains a small but complete framework, so others can get a better understanding of the whole structure of Ansor.


In [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0), we introduced the Ansor auto-scheduler, and we reached an agreement that Ansor should eventually replace AutoTVM.
For most existing templates, the current Ansor can directly replace them with better performance and less tuning time.
For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API.

This PR introduces a self-contained minimum version of Ansor with most of the skeleton in place.

It includes the interfaces of the core data structures and an empty search policy that does nothing. More advanced search policies and cost models will come in the next few PRs.

Infrastructure: A Sketch IR for Schedule Searching

Different from AutoTVM, whose tuning spaces are composed of predefined parameters, Ansor constructs a search space by manipulating the loop structures of the given compute DAG.
To enable flexible manipulation of the loop structures, we implemented a lightweight loop-structure IR (intermediate representation) based on the original TVM IR but specialized for schedule search. The IR is composed of states and actions, which are defined as follows:

  • State: A state of the schedule search is the loop structure defined by the schedule (i.e., the TVM IR created by tvm.lower). See LoopState in the Key Data Structures section for details.

  • Action: An action is composed of one or more schedule primitives (e.g., split, reorder, fuse) that manipulate a state. See TransformStep in the Key Data Structures section for details.

We build a new Sketch IR on top of the existing TVM IR, rather than using the TVM IR directly, because:

  1. We want fast incremental changes to the loop structures;
  2. We want a serializable transform history for replay, backtracking, and mutation;
  3. We may create macro schedule primitives that represent combinations of several TVM schedule primitives.

After the search is done, we will lower this to TVM IR with TVM schedule primitives.
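
To make the state/action idea concrete, here is a minimal, hypothetical sketch (these are not the actual Ansor classes; see LoopState and TransformStep below for the real ones). The key property is that a state carries both a lightweight loop-structure preview and the serializable list of steps that produced it, so states can be replayed, backtracked, or mutated by editing the step list:

class Step:
    def __init__(self, name, args):
        self.name = name    # e.g. "split", "reorder", "fuse"
        self.args = args    # serializable arguments of the primitive

class State:
    def __init__(self, loops, history):
        self.loops = loops        # preview of the loop nest
        self.history = history    # list of Steps, replayable later

    def apply(self, step):
        # Incremental change: derive a new state without re-lowering the
        # whole schedule. transform() is a stand-in for applying one
        # schedule primitive to the loop structure.
        return State(transform(self.loops, step), self.history + [step])

def transform(loops, step):
    # Stub for illustration; the real system manipulates iterators here.
    return loops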

Key Data Structures

To help build an overview of Ansor, the diagram below shows the class relations of some of the important Ansor data structures:
[Diagram: class relations of the key Ansor data structures]

  • ComputeDAG: The compute declaration graph and its related analysis tools.
    Related source files: src/ansor/compute_dag.*, python/tvm/ansor/compute_dag.py
    Ansor takes a compute declaration, which could be a single operator or a subgraph described by tvm.compute, as input and converts it to a ComputeDAG.
    The ComputeDAG implementation includes a set of analyses, such as the total number of float operations, the consumer/producer relations of each operation stage, whether a stage should be tiled or compute-inlined, and so on (some of these analyses will be included in follow-up PRs). These analyses help the search policy make concrete decisions during the schedule search.

  • LoopState: This defines the "state" of the search problem.
    Related source files: src/ansor/loop_state.*, python/tvm/ansor/loop_state.py
    Each LoopState corresponds to a specific schedule for the target ComputeDAG.
    A LoopState consists of: 1. the current loop structure; 2. the transform history used to reach this loop structure.
    The loop structure keeps a preview of what the schedule will look like after lowering (the number of iterators, the extent of each iterator, the compute_at locations, etc.), which helps the search policy make decisions during the search.
    The transform history is a sequence of TransformSteps that will finally be mapped to schedule primitives.

  • TransformStep: This defines the "action" of the search problem, i.e., the schedule primitives of our sketch IR.
    Related source files: src/ansor/transform_step.*, python/tvm/ansor/loop_state.py
    Each step has a corresponding tvm.te schedule primitive. We record all TransformSteps of every state as its transform history. After the schedule search finishes, these transform steps are lowered to their corresponding TVM schedule primitives.
    Note: This PR only contains a small subset of the TransformSteps. The complete set of transform steps will come in the next PRs.

ComputeDAG also plays the role of connecting the Ansor state system to the TVM schedule system: ComputeDAG can replay the TransformSteps of a state to produce the final TVM schedule (e.g., ComputeDAG(state, actions)).

  • SearchTask: Meta information and hardware parameters for a specific schedule search task.
    Related source files: src/ansor/search_task.*
    This structure includes the target ComputeDAG and device information, as well as some hardware parameters obtained from the system or from user input.

  • SearchPolicy: The search policy defines how Ansor auto-generates a high-performance schedule for different computations.
    Related source files: src/ansor/search_policy/*
    A SearchPolicy takes a SearchTask, system information, and some tuning options as inputs, performs the schedule search, and returns the state with the best performance. The resulting state can later be applied to produce a TVM schedule. A minimal sketch of this interface follows the list below.

    Note that in the Ansor paper (https://arxiv.org/abs/2006.06762), we proposed a sketch generation policy that achieves good results on various workloads across different devices. In this minimum-system PR, however, we only provide an EmptyPolicy to illustrate the search policy interface.
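
To make the interface concrete, here is a hedged sketch of a do-nothing policy in the spirit of EmptyPolicy. The search method name and the compute_dag attribute on SearchTask are assumptions made for illustration, not quotes from the code:

class MyEmptyPolicy:
    def search(self, task):
        # A real policy would generate many candidate states, measure them,
        # and return the best one. Here we simply return the initial
        # (unscheduled) state of the task's ComputeDAG.
        return task.compute_dag.get_init_state()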

Ansor Minimum System

This is a brief diagram for the Ansor system:
[Diagram: overview of the Ansor minimum system]

  1. Define the target computation with the TVM te API and create a ComputeDAG structure.
  2. Specify the target device, hardware parameters, and tuning options, and pack them with the ComputeDAG to create a SearchTask structure.
  3. The SearchPolicy takes the SearchTask as input and performs the schedule search. During the search process, the SearchPolicy generates multiple candidate states, each of which corresponds to a specific TVM schedule.
  4. Get the best state and use the ComputeDAG API to transform it into the final TVM schedule (a code sketch of these four steps follows this list).
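
As a rough sketch, the four steps map onto the APIs demonstrated later in this post as follows. Since this PR only ships EmptyPolicy, we stand in for step 3 by taking the DAG's initial state; matmul_ansor_test is the helper from the Ansor unit tests used below:

import tvm
from tvm import ansor

A, B, C = matmul_ansor_test(512, 512, 512)           # step 1: te compute -> ComputeDAG
dag = ansor.ComputeDAG([A, B, C])

target = tvm.target.create("llvm")                   # step 2: pack into a SearchTask
task = ansor.SearchTask(dag, "test", target)

best_state = dag.get_init_state()                    # step 3: stand-in for the search

sche, args = dag.apply_steps_from_state(best_state)  # step 4: state -> TVM schedule
print(tvm.lower(sche, args, simple_mode=True))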

In the Ansor system, we use the sketch generation policy described in the paper (to be introduced in later PRs) as the default search policy, which should be enough to cover most use cases. Meanwhile, we will open an RFC for a custom-rule mechanism that enables user-defined template search, serving the same functionality as the current AutoTVM templates. Specifically, we will provide Python APIs for the new IR, intended to be used for sketch customization. They look very similar to the existing schedule primitives, as shown in python/tvm/ansor/loop_state.py.

Our goal is to make sure Ansor can cover all AutoTVM functionalities while achieving the same or better performance, so that the community can gradually switch from AutoTVM to Ansor.

More Details on LoopState

In this section, we illustrate what a loop state looks like and how it connects to the current TVM build system.

As an example, take a simple State that includes Split, Fuse, and Reorder steps:

from tvm import ansor

# matmul_ansor_test is a helper from the Ansor unit tests that declares
# C = matmul(A, B) with te.compute.
A, B, C = matmul_ansor_test(512, 512, 512)
dag = ansor.ComputeDAG([A, B, C])
state = dag.get_init_state()
i, j, k = state[C].iters
io, ii = state.split(C, i, [16])   # split i with factor 16 -> outer, inner
jo, ji = state.split(C, j, [8])    # split j with factor 8
state.reorder(C, [io, jo, k, ji, ii])
fused_it = state.fuse(C, [io, jo])

First, let's print the state. It shows the loop structure of the corresponding TVM schedule, i.e., the "preview":

>>> print(state)

Placeholder: A, B
for i.0@j.0@ (0,2048)
  for k (0,512)
    for j.1 (0,8)
      for i.1 (0,16)
        C = ...

The state also stores all of the history transform steps required to reach it. We can print these transform steps as TVM python schedule API calls. This is useful for debugging, or for applying the schedule on an older TVM version without Ansor support.

>>> print(dag.print_python_code_from_state(state))

i, j, k = tuple(C.op.axis) + tuple(C.op.reduce_axis)
i_o, i_i = s[C].split(i, factor=16)
j_o, j_i = s[C].split(j, factor=8)
s[C].reorder(i_o, j_o, k, j_i, i_i)
i_o_j_o_fused = s[C].fuse(i_o, j_o)

We can also replay these steps to get a schedule for tvm.lower and tvm.build.

>>> sche, args = dag.apply_steps_from_state(state)
>>> print(tvm.lower(sche, args, simple_mode=True))

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: handle, float32, [512, 512], []),
             A: Buffer(A_2: handle, float32, [512, 512], []),
             B: Buffer(B_2: handle, float32, [512, 512], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i.outer.j.outer.fused: int32, 0, 2048) {
    for (j.inner.init: int32, 0, 8) {
      for (i.inner.init: int32, 0, 16) {
        C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner.init*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner.init)] = 0f32
      }
    }
    for (k: int32, 0, 512) {
      for (j.inner: int32, 0, 8) {
        for (i.inner: int32, 0, 16) {
          C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)] = ((float32*)C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)] + ((float32*)A_2[(((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + k)]*(float32*)B_2[(((k*512) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)]))
        }
      }
    }
  }
}

The steps of this state can be serialized into a log file as follows:

>>> target = tvm.target.create("llvm")
>>> task = ansor.SearchTask(dag, "test", target)
>>> inp = ansor.measure.MeasureInput(task, state)
>>> res = ansor.measure.MeasureResult([0.1], 0, "", 0.2, 1)
>>> with open("test.log", "w") as fp:
...     ansor.serialization.write_measure_records_to_file(fp.name, [inp], [res])

{"i": [["test", "llvm"], [[], [["SP", 2, 0, 512, [16], 1], ["SP", 2, 2, 512, [8], 1], ["RE", 2, [0, 2, 4, 3, 1]], ["FU", 2, [0, 1]]]]], "r": [[0.1], 0, 0.2, 1], "v": "v0.2"}

Ansor serializes all transform steps to the log file, while AutoTVM serializes the parameters of a predefined template. Further log format discussion will be based on https://discuss.tvm.ai/t/rfc-canonicalizing-autotvm-log-format/7038/.


In the next few PRs, we'll introduce the complete search policy, tutorials for single-op/subgraph schedule search, Relay integration, tutorials for end-to-end network schedule search, and custom rules to support customized search spaces.

This is joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang.

jcf94 and others added 30 commits June 20, 2020 09:01
@jroesch
Member

jroesch commented Jul 13, 2020

@jcf94 and @merrymercy thanks for all the hard work! Can I request that we file another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.

@jcf94
Contributor Author

jcf94 commented Jul 14, 2020

> Does not have to change now, but let us change the use of ThreadPool to parallel_for abstraction.

Does that mean to just modify ThreadPool to ParallelFor for now? I renamed the class and added some comments on the member functions.

> @jcf94 and @merrymercy thanks for all the hard work! Can I request that we file another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.

Thanks! That would be a great help, since I'm not a native speaker. The documentation does need to be polished.

@jcf94 jcf94 requested a review from tqchen July 14, 2020 02:13
* TODO(merrymercy): Move this to `src/support/parallel_for`
*/
class ThreadPool {
class ParallelFor {
Member

@jcf94 Sorry, I didn't mean to say that we should rename ThreadPool to ParallelFor; instead we should hide the use of the thread pool behind a parallel_for API, in a style similar to https://docs.microsoft.com/en-us/cpp/parallel/concrt/parallel-algorithms?view=vs-2019#parallel_for

Contributor Author

OK, I get it. However, after checking the current code, I found that we have actually already removed all uses of ThreadPool in this minimum system. I didn't realize that before.

return *pool;
}

void parallel_for(int start, int end, std::function<void(int index)> f, int stride) {
Contributor Author

@tqchen Added a temporary implementation of parallel_for here.

Member

Thanks @jcf94, let me try to elaborate further. To simplify the abstraction, we should:

  • Add src/support/parallel_for.h
    • Move the thread pool into parallel_for.cc as an implementation detail, and remove thread_pool from utils.h
    • It is unclear whether a thread pool is even needed to implement parallel_for; it is quite possible that we can just launch n std::thread (because std::thread is quite lightweight in C++)
  • Use parallel_for for all necessary use cases of the thread pool.

Also consider removing the stride argument, or making it optional, since stride is not used.
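
For illustration only, the intended semantics could be sketched in Python (the real utility would be C++ under src/support/parallel_for, as proposed above): launch one worker per index instead of maintaining a pool.

import threading

def parallel_for(start, end, f, stride=1):
    # Run f(i) for every i in range(start, end, stride), one thread per
    # index, then wait for all of them; no pool is kept alive.
    threads = [threading.Thread(target=f, args=(i,))
               for i in range(start, end, stride)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# usage: parallel_for(0, 8, lambda i: print("working on", i))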

Contributor Author

@jcf94 jcf94 Jul 14, 2020

@tqchen OK, I understand that (the stride argument defaults to 1 in utils.h), and it's fine to further clean up this code.
I was just confused by the "does not have to change now" above. :)
Also, ThreadPool is actually never used in the current code base...

Member

@merrymercy merrymercy Jul 14, 2020

Yes, we should just remove the thread pool.

Contributor Author

@jcf94 jcf94 Jul 14, 2020

To avoid extra review effort, I removed ThreadPool from the current code base. cc @tqchen

@@ -0,0 +1,34 @@
# Licensed to the Apache Software Foundation (ASF) under one
Contributor

should this folder be named auto_scheduler?

Member

The namespace is auto_schedule

Member

@merrymercy merrymercy Jul 15, 2020

I think @MarisaKirisame means the namespace should be a noun. So auto_scheduler is better.

Member

That also works. If we all agree, we can send a follow-up PR for it.

Contributor Author

Sent a new PR for the namespace renaming: #6059

@tqchen tqchen merged commit 456c58d into apache:master Jul 15, 2020
@jcf94 jcf94 deleted the upstream_0_new branch July 15, 2020 01:57
@yangjunpro

yangjunpro commented Jul 15, 2020

> @jcf94 and @merrymercy thanks for all the hard work! Can I request that we file another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.

I do appreciate and support the proposal. Let's move forward with the feature upstreaming process, and after the major features are merged into master we can work together to refine the documentation.

@MarisaKirisame
Contributor

I have been looking over this PR a bit more, and it seems like a lot of review comments are dropped when a file is updated. This is no one's fault but a flaw in GitHub's review design, and I think it just means we should be careful in the upcoming Ansor reviews: just because code is merged doesn't mean it is at top quality, and we should continue going over it.

@jcf94 jcf94 changed the title [Ansor][AutoTVM v2.0] Part 0: Ansor minimum system for auto schedule generating [Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating Jul 17, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
[Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating (apache#5962)

* Code migration Start (neo-ai#1)

* Init commit: Code migration Start

* Add loop_state.cc/h

* Add ComputeDAG basic test

* Split transform_step out & Update more UTs (neo-ai#3)

* Split transform_step out

* Update GetProducers & GetConsumers

* Update UTs

* Add UT for CacheReadWrite & Some bug fix

* Add search_task, measure and serialization (neo-ai#4)

* Add FollowSplit & FollowFusedSplit tests

* Update dag.InferBound & its UT

* Add search_task, measure and serialization

* Update Serialization UT

* Add MetaTileRewritePolicy (neo-ai#5)

* Add feature

* Add cost_model, meta_tile_rewrite_policy

* Add MetaTileRewritePolicy basic UT

* Basic Python API for State (neo-ai#6)

* Add Basic Python API for State

* Add UTs for State

* Add Python API: Measure & Task (neo-ai#7)

* Update the return value of state operation

* Add task

* Copy measure.py & utils.py

* Fix LocalBuilder

* Fix LocalRunner

* Add ansor.auto_schedule() API; First AutoSchedule working version(neo-ai#8)

* Add basic Python support for ansor.auto_schedule

* Update AutoSchedule API

* Bug fix for get the attach point of a fused iter

* Update UT after infer bug fix

* Bug fix & Add python serialization API (neo-ai#10)

* Delete C++ UT hack since Python is ready

* Add ndarray.non_empty

* Update Serialization python API

* Improve code style, python wrapper and test cases (neo-ai#11)

* Update c++ code style and unit test

* Update python State wrapper and test cases

* fix unit tests

* Add RPCRunner & OpenCL/CUDA test (neo-ai#12)

* Add RPCRunner & OpenCL search test

* Add CUDA search test

* Add RPCRunner test

* rebase to upstream/master

* Add Ansor basic tutorial (neo-ai#13)

* Add basic tutorial

* migrate feature extraction (neo-ai#14)

* Add XGBModel & RPCRunnerWarpper (neo-ai#15)

* Add XGBModel & RPCRunnerWarpper

* Revert "Add Parallel Granularity Mutation"

* Migrate workload_registry.py (neo-ai#16)

* add workload registry

* update

* update

* add task scheduler (neo-ai#17)

* Add conv2d cuda tutorial with workload registry (neo-ai#18)

* add tune_test.py (the old tune_wkl.py) (neo-ai#19)

* add tune_test.py (the old tune_wkl.py)

* update

* fix measure

* fix for gpu

* Code refine for tune_test.py & Add a pre load callback (neo-ai#20)

* Bug fix for tutorials

* Add PreLoadMeasuredStates

* Add search_callback support for task tuner

* Code refine for tune_test.py

* Update

* Update

* Update

* Update

* Bug fix

* Add python custom sketch rule (neo-ai#21)

* Add custom sketch rule

* Bug fix

* Ansor Relay Integration (without layout rewrite) (neo-ai#22)

* relay integration

* Add tune_op_subgraph.py & Some code clean for tune_network.py (neo-ai#23)

* Add single op tune scripts

* Add tune subgraph support

* Merge all op & all subgraph to one file

* Rename file

* add explicit_unroll_max_extent (neo-ai#25)

* Add Index simplification & API update (neo-ai#26)

* Add vectorized cooperative_fetching test

* Update math simplify for vectorized CF

* File rename

* Update tune_network

* API update

* Update PreLoadMeasuredStates & Some bug fix (neo-ai#27)

* Add a threading wrapper to fix the test bug

* Set default TVM_USE_AUTO_SCHEDULER to false

* Update PreLoadMeasuredStates callback

* Add tensorize step for loop_state (neo-ai#31)

* Add tensorize step

* State python api update (neo-ai#33)

* Start to update api

* Add compute_dag to state

* API update

* kernel layout rewrite (neo-ai#28)

* kernel layout rewrite

* remove some hacks

* add defuse_ops pass and move kernel_layout_rewrite pass after fuse_ops pass

* set TVM_RELAY_DISABLE_BUILD_CACHE for task extraction and prepare_layout_rewrite

* [cache flush] port cache flush to ansor (neo-ai#32)

* Improve relay integration (neo-ai#34)

* tmp checkpoint

* Improve relay integration

* Improve relay integration

* Fix xgb error & Simplify dispatcher (neo-ai#35)

* Rename "MetaTileRewritePolicy" to "SketchPolicy". (neo-ai#36)

* Rename "MetaTileRewritePolicy" to "SketchPolicy".

* Add a new class for auto_unroll_max_step, storage_offset in StageNode

* fix tune_op_subgraph.py

* rebase

* Migrate all node::make to noderef's construct function (neo-ai#37)

* Start to move xxxnode::make to noderef()

* Update

* Update

* Finish transform_step

* Finish comute dag & auto schedule

* Update

* Update

* Update

* Update

* Update

* Code refine

* Code refine

* Code refine

* Update

* Update

* Some lint fix & Recover the double constructor of tvm::PrimExpr (neo-ai#39)

* lint fix

* clang-format-fix

* pylint fix

* Update

* Recover the double constructor of tvm::PrimExpr

* Fix pylint

* pylint fix

* pylint fix

* Add MutateComputeLocation and MutateParallel in evolutionary search (neo-ai#40)

* Add MutateComputeLocation and MutateParallel in evolutionary search

* fix lint

* Improve loop state python API (stage_tensors -> stage_ops) (neo-ai#41)

* improve loop state python API (stage_tensors -> stage_ops)

* fix

* ComputeDAG bug fix & Add Custom TensorCore Matmul Example (neo-ai#42)

* Bug Fix

* Sample example of Custom TensorCore Matmul

* Rever Commits, Start to build minimum Ansor system

* Code clean for minimum Ansor system

* Bug fix & Delete AccessAnalyzer

* Delete attachmap & Code clean

* Doc update

Update statenode::stages from vector to Array

* Headfile update & Python doc update

* clang-format fix

* pylint fix

* Update

* Doc update

* Update

* Bug fix after code merge to the new master

* clang-format fix

* Update

* Update

* Update std::vector to Array; Update verbosity setting; Some commemts
addressed

* std::vector->Array & std::string->String

* Add init_state to ComputeDAG

* Update

* Update some unordered_map to Map

* clang-format fix

* Comments addressed
Delete ReplayAndInferBound
Delete ReplaySteps & InferBoundCommon

* Lint fix

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Rename ansor namespace to auto_schedule

* Update

* Rename ThreadPool to ParallelFor

* Add parallel_for

* Remove ThreadPool

* Update python/tvm/auto_schedule/auto_schedule.py

* trigger CI

Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Minmin Sun (孙敏敏) <[email protected]>
Co-authored-by: Zhao Wu <[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Sep 2, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 3, 2020