[Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating #5962

Merged
80 commits merged into apache:master on Jul 15, 2020

Conversation

jcf94
Contributor

@jcf94 jcf94 commented Jun 30, 2020

Hi all,
Our previous Ansor PR (#5883) was not clear enough for reviewers to fully understand our design. After some discussion, we changed our upstream plan: we now propose a minimal version of Ansor that contains a small but complete framework, so others can get a better understanding of the whole structure of Ansor.


In [RFC] Ansor: An Auto-scheduler for TVM (AutoTVM v2.0), we introduced the Ansor auto-scheduler, and we reached an agreement that Ansor should eventually replace AutoTVM.
For most existing templates, the current Ansor can directly replace them with better performance and less tuning time.
For other special templates (low-precision, sparse), the plan is to introduce search space customization and gradually rewrite them with Ansor's new API.

This PR introduces a self-contained minimum version of Ansor with most of the skeleton in place.

It includes the interfaces of the core data structures and an empty search policy that does nothing. More advanced search policies and cost models will come in the next few PRs.

Infrastructure: A Sketch IR for Schedule Searching

Different from AutoTVM, whose tuning spaces are composed of predefined parameters, Ansor constructs a search space by manipulating the loop structures of the given compute DAG.
To enable flexible manipulation of the loop structures, we implemented a lightweight loop-structure IR (intermediate representation) based on the original TVM IR but specialized for schedule search. The IR is composed of states and actions, which are defined as follows:

  • State: A state of the schedule search is the loop structure defined by the schedule (i.e., the TVM IR created by tvm.lower). See LoopState in the Key Data Structures section for details.

  • Action: An action is composed of one or more schedule primitives (e.g., split, reorder, fuse) that manipulate a state. See TransformStep in the Key Data Structures section for details.

We build a new Sketch IR on top of the existing TVM IR, rather than using the TVM IR directly, because:

  1. We want fast incremental changes to the loop structures;
  2. We want a serializable transform history for replay, backtracking, and mutation;
  3. We may create macro schedule primitives that represent combinations of several TVM schedule primitives.

After the search is done, we will lower this to TVM IR with TVM schedule primitives.
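
To make the state/action idea concrete, here is a minimal, hypothetical sketch (these are not the actual Ansor classes; see LoopState and TransformStep below for the real ones). The key property is that a state carries both a lightweight loop-structure preview and the serializable list of steps that produced it, so states can be replayed, backtracked, or mutated by editing the step list:

class Step:
    def __init__(self, name, args):
        self.name = name    # e.g. "split", "reorder", "fuse"
        self.args = args    # serializable arguments of the primitive

class State:
    def __init__(self, loops, history):
        self.loops = loops        # preview of the loop nest
        self.history = history    # list of Steps, replayable later

    def apply(self, step):
        # Incremental change: derive a new state without re-lowering the
        # whole schedule. transform() is a stand-in for applying one
        # schedule primitive to the loop structure.
        return State(transform(self.loops, step), self.history + [step])

def transform(loops, step):
    # Stub for illustration; the real system manipulates iterators here.
    return loops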

Key Data Structures

To help build an overview of Ansor, the diagram below shows the class relations of some of the important Ansor data structures:
[Diagram: class relations of the key Ansor data structures]

  • ComputeDAG: The compute declaration graph and its related analysis tools.
    Related source files: src/ansor/compute_dag.*, python/tvm/ansor/compute_dag.py
    Ansor takes a compute declaration, which could be a single operator or a subgraph described by tvm.compute, as input and converts it to a ComputeDAG.
    The ComputeDAG implementation includes a set of analyses, such as the total number of float operations, the consumer/producer relations of each operation stage, whether a stage should be tiled or compute-inlined, and so on (some of these analyses will be included in follow-up PRs). These analyses help the search policy make concrete decisions during the schedule search.

  • LoopState: This defines the "state" of the search problem.
    Related source files: src/ansor/loop_state.*, python/tvm/ansor/loop_state.py
    Each LoopState corresponds to a specific schedule for the target ComputeDAG.
    A LoopState consists of: 1. the current loop structure; 2. the transform history used to reach this loop structure.
    The loop structure keeps a preview of what the schedule will look like after lowering (the number of iterators, the extent of each iterator, the compute_at locations, etc.), which helps the search policy make decisions during the search.
    The transform history is a sequence of TransformSteps that will finally be mapped to schedule primitives.

  • TransformStep: This defines the "action" of the search problem, i.e., the schedule primitives of our sketch IR.
    Related source files: src/ansor/transform_step.*, python/tvm/ansor/loop_state.py
    Each step has a corresponding tvm.te schedule primitive. We record all TransformSteps of every state as its transform history. After the schedule search finishes, these transform steps are lowered to their corresponding TVM schedule primitives.
    Note: This PR only contains a small subset of the TransformSteps. The complete set of transform steps will come in the next PRs.

ComputeDAG also plays the role of connecting the Ansor state system to the TVM schedule system: ComputeDAG can replay the TransformSteps of a state to produce the final TVM schedule (e.g., ComputeDAG(state, actions)).

  • SearchTask: Meta information and hardware parameters for a specific schedule search task.
    Related source files: src/ansor/search_task.*
    This structure includes the target ComputeDAG and device information, as well as some hardware parameters obtained from the system or from user input.

  • SearchPolicy: The search policy defines how Ansor auto-generates a high-performance schedule for different computations.
    Related source files: src/ansor/search_policy/*
    A SearchPolicy takes a SearchTask, system information, and some tuning options as inputs, performs the schedule search, and returns the state with the best performance. The resulting state can later be applied to produce a TVM schedule. A minimal sketch of this interface follows the list below.

    Note that in the Ansor paper (https://arxiv.org/abs/2006.06762), we proposed a sketch generation policy that achieves good results on various workloads across different devices. In this minimum-system PR, however, we only provide an EmptyPolicy to illustrate the search policy interface.
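
To make the interface concrete, here is a hedged sketch of a do-nothing policy in the spirit of EmptyPolicy. The search method name and the compute_dag attribute on SearchTask are assumptions made for illustration, not quotes from the code:

class MyEmptyPolicy:
    def search(self, task):
        # A real policy would generate many candidate states, measure them,
        # and return the best one. Here we simply return the initial
        # (unscheduled) state of the task's ComputeDAG.
        return task.compute_dag.get_init_state()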

Ansor Minimum System

This is a brief diagram for the Ansor system:
[Diagram: overview of the Ansor minimum system]

  1. Define the target computation with the TVM te API and create a ComputeDAG structure.
  2. Specify the target device, hardware parameters, and tuning options, and pack them with the ComputeDAG to create a SearchTask structure.
  3. The SearchPolicy takes the SearchTask as input and performs the schedule search. During the search process, the SearchPolicy generates multiple candidate states, each of which corresponds to a specific TVM schedule.
  4. Get the best state and use the ComputeDAG API to transform it into the final TVM schedule (a code sketch of these four steps follows this list).
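
As a rough sketch, the four steps map onto the APIs demonstrated later in this post as follows. Since this PR only ships EmptyPolicy, we stand in for step 3 by taking the DAG's initial state; matmul_ansor_test is the helper from the Ansor unit tests used below:

import tvm
from tvm import ansor

A, B, C = matmul_ansor_test(512, 512, 512)           # step 1: te compute -> ComputeDAG
dag = ansor.ComputeDAG([A, B, C])

target = tvm.target.create("llvm")                   # step 2: pack into a SearchTask
task = ansor.SearchTask(dag, "test", target)

best_state = dag.get_init_state()                    # step 3: stand-in for the search

sche, args = dag.apply_steps_from_state(best_state)  # step 4: state -> TVM schedule
print(tvm.lower(sche, args, simple_mode=True))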

In the Ansor system, we use the sketch generation policy described in the paper (to be introduced in later PRs) as the default search policy, which should be enough to cover most use cases. Meanwhile, we will open an RFC for a custom-rule mechanism that enables user-defined template search, serving the same functionality as the current AutoTVM templates. Specifically, we will provide Python APIs for the new IR, intended to be used for sketch customization. They look very similar to the existing schedule primitives, as shown in python/tvm/ansor/loop_state.py.

Our goal is to make sure Ansor can cover all AutoTVM functionalities while achieving the same or better performance, so that the community can gradually switch from AutoTVM to Ansor.

More Details on LoopState

In this section, we illustrate what a loop state looks like and how it connects to the current TVM build system.

As an example, take a simple State that includes Split, Fuse, and Reorder steps:

from tvm import ansor

# matmul_ansor_test is a helper from the Ansor unit tests that declares
# C = matmul(A, B) with te.compute.
A, B, C = matmul_ansor_test(512, 512, 512)
dag = ansor.ComputeDAG([A, B, C])
state = dag.get_init_state()
i, j, k = state[C].iters
io, ii = state.split(C, i, [16])   # split i with factor 16 -> outer, inner
jo, ji = state.split(C, j, [8])    # split j with factor 8
state.reorder(C, [io, jo, k, ji, ii])
fused_it = state.fuse(C, [io, jo])

First, let's print the state. It shows the loop structure of the corresponding TVM schedule, i.e., the "preview":

>>> print(state)

Placeholder: A, B
for i.0@j.0@ (0,2048)
  for k (0,512)
    for j.1 (0,8)
      for i.1 (0,16)
        C = ...

The state also stores all of the history transform steps required to reach it. We can print these transform steps as TVM python schedule API calls. This is useful for debugging, or for applying the schedule on an older TVM version without Ansor support.

>>> print(dag.print_python_code_from_state(state))

i, j, k = tuple(C.op.axis) + tuple(C.op.reduce_axis)
i_o, i_i = s[C].split(i, factor=16)
j_o, j_i = s[C].split(j, factor=8)
s[C].reorder(i_o, j_o, k, j_i, i_i)
i_o_j_o_fused = s[C].fuse(i_o, j_o)

We can also replay these steps to get a schedule for tvm.lower and tvm.build.

>>> sche, args = dag.apply_steps_from_state(state)
>>> print(tvm.lower(sche, args, simple_mode=True))

primfn(A_1: handle, B_1: handle, C_1: handle) -> ()
  attr = {"global_symbol": "main", "tir.noalias": True}
  buffers = {C: Buffer(C_2: handle, float32, [512, 512], []),
             A: Buffer(A_2: handle, float32, [512, 512], []),
             B: Buffer(B_2: handle, float32, [512, 512], [])}
  buffer_map = {A_1: A, B_1: B, C_1: C} {
  for (i.outer.j.outer.fused: int32, 0, 2048) {
    for (j.inner.init: int32, 0, 8) {
      for (i.inner.init: int32, 0, 16) {
        C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner.init*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner.init)] = 0f32
      }
    }
    for (k: int32, 0, 512) {
      for (j.inner: int32, 0, 8) {
        for (i.inner: int32, 0, 16) {
          C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)] = ((float32*)C_2[((((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)] + ((float32*)A_2[(((floordiv(i.outer.j.outer.fused, 64)*8192) + (i.inner*512)) + k)]*(float32*)B_2[(((k*512) + (floormod(i.outer.j.outer.fused, 64)*8)) + j.inner)]))
        }
      }
    }
  }
}

The steps of this state can be serialized into a log file as follows:

>>> target = tvm.target.create("llvm")
>>> task = ansor.SearchTask(dag, "test", target)
>>> inp = ansor.measure.MeasureInput(task, state)
>>> res = ansor.measure.MeasureResult([0.1], 0, "", 0.2, 1)
>>> with open("test.log", "w") as fp:
...     ansor.serialization.write_measure_records_to_file(fp.name, [inp], [res])

{"i": [["test", "llvm"], [[], [["SP", 2, 0, 512, [16], 1], ["SP", 2, 2, 512, [8], 1], ["RE", 2, [0, 2, 4, 3, 1]], ["FU", 2, [0, 1]]]]], "r": [[0.1], 0, 0.2, 1], "v": "v0.2"}

Ansor serializes all transform steps to the log file, while AutoTVM serializes the parameters of a predefined template. Further log format discussion will be based on https://discuss.tvm.ai/t/rfc-canonicalizing-autotvm-log-format/7038/.


In the next few PRs, we'll introduce the complete search policy, tutorials for single-op/subgraph schedule search, Relay integration, tutorials for end-to-end network schedule search, and custom rules to support customized search spaces.

This is joint work by @merrymercy @jcf94 @minminsun @FrozenGene @comaniac @yangjunpro @yidawang.

jcf94 and others added 30 commits June 20, 2020 09:01
@jroesch
Member

jroesch commented Jul 13, 2020

@jcf94 and @merrymercy thanks for all the hard work! Can I request that we file another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.

@jcf94
Contributor Author

jcf94 commented Jul 14, 2020

> Does not have to change now, but let us change the use of ThreadPool to parallel_for abstraction.

Does that mean to just modify ThreadPool to ParallelFor for now? I renamed the class and added some comments on the member functions.

> @jcf94 and @merrymercy thanks for all the hard work! Can I request that we file another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.

Thanks! That would be a great help, since I'm not a native speaker. The documentation does need to be polished.

@jcf94 jcf94 requested a review from tqchen July 14, 2020 02:13
* TODO(merrymercy): Move this to `src/support/parallel_for`
*/
class ThreadPool {
class ParallelFor {
Member

@jcf94 Sorry, I didn't mean to say that we should rename ThreadPool to ParallelFor; instead we should hide the use of the thread pool behind a parallel_for API, in a style similar to https://docs.microsoft.com/en-us/cpp/parallel/concrt/parallel-algorithms?view=vs-2019#parallel_for

Contributor Author

OK, I get it. However, after checking the current code, I found that we have actually already removed all uses of ThreadPool in this minimum system. I didn't realize that before.

return *pool;
}

void parallel_for(int start, int end, std::function<void(int index)> f, int stride) {
Contributor Author

@tqchen Added a temporary implementation of parallel_for here.

Member

Thanks @jcf94, let me try to elaborate further. To simplify the abstraction, we should:

  • Add src/support/parallel_for.h
    • Move the thread pool into parallel_for.cc as an implementation detail, and remove thread_pool from utils.h
    • It is unclear whether a thread pool is even needed to implement parallel_for; it is quite possible that we can just launch n std::thread (because std::thread is quite lightweight in C++)
  • Use parallel_for for all necessary use cases of the thread pool.

Also consider removing the stride argument, or making it optional, since stride is not used.
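
For illustration only, the intended semantics could be sketched in Python (the real utility would be C++ under src/support/parallel_for, as proposed above): launch one worker per index instead of maintaining a pool.

import threading

def parallel_for(start, end, f, stride=1):
    # Run f(i) for every i in range(start, end, stride), one thread per
    # index, then wait for all of them; no pool is kept alive.
    threads = [threading.Thread(target=f, args=(i,))
               for i in range(start, end, stride)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# usage: parallel_for(0, 8, lambda i: print("working on", i))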

Contributor Author

@jcf94 jcf94 Jul 14, 2020

@tqchen OK, I understand that (the stride argument defaults to 1 in utils.h), and it's fine to further clean up this code.
I was just confused by the "does not have to change now" above. :)
Also, ThreadPool is actually never used in the current code base...

Member

@merrymercy merrymercy Jul 14, 2020

Yes, we should just remove the thread pool.

Contributor Author

@jcf94 jcf94 Jul 14, 2020

To avoid extra review effort, I removed ThreadPool from the current code base. cc @tqchen

@@ -0,0 +1,34 @@
# Licensed to the Apache Software Foundation (ASF) under one
Contributor

should this folder be named auto_scheduler?

Member

The namespace is auto_schedule

Member

@merrymercy merrymercy Jul 15, 2020

I think @MarisaKirisame means the namespace should be a noun. So auto_scheduler is better.

Member

That also works. If we all agree, we can send a follow-up PR for it.

Contributor Author

Sent a new PR for the namespace renaming: #6059

@tqchen tqchen merged commit 456c58d into apache:master Jul 15, 2020
@jcf94 jcf94 deleted the upstream_0_new branch July 15, 2020 01:57
@yangjunpro

yangjunpro commented Jul 15, 2020

> @jcf94 and @merrymercy thanks for all the hard work! Can I request that we file another unresolved issue? In my opinion the written English parts, i.e. comments, explanations, etc., could still use some improvement in both content and grammar, and I would propose that in general we do at least one or two rounds of full documentation polish (comments, examples, tests, tutorials, etc.) before we officially release a feature (in this case, when all of Ansor has landed in master). We tried to do this with Relay, but I think we should continue to strive to do a better job with new features like this.

I do appreciate and support the proposal. Let's move forward with the feature upstreaming process, and after the major features are merged into master we can work together to refine the documentation.

@MarisaKirisame
Contributor

I have been looking over this PR a bit more, and it seems like a lot of review comments are dropped when a file is updated. This is no one's fault but a flaw in GitHub's review design, and I think it just means we should be careful in the upcoming Ansor reviews: just because code is merged doesn't mean it is at top quality, and we should continue going over it.

@jcf94 jcf94 changed the title [Ansor][AutoTVM v2.0] Part 0: Ansor minimum system for auto schedule generating [Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating Jul 17, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
[Ansor][AutoTVM v2.0] Phase 0: Ansor minimum system for auto schedule generating (apache#5962)

* Code migration Start (neo-ai#1)

* Init commit: Code migration Start

* Add loop_state.cc/h

* Add ComputeDAG basic test

* Split transform_step out & Update more UTs (neo-ai#3)

* Split transform_step out

* Update GetProducers & GetConsumers

* Update UTs

* Add UT for CacheReadWrite & Some bug fix

* Add search_task, measure and serialization (neo-ai#4)

* Add FollowSplit & FollowFusedSplit tests

* Update dag.InferBound & its UT

* Add search_task, measure and serialization

* Update Serialization UT

* Add MetaTileRewritePolicy (neo-ai#5)

* Add feature

* Add cost_model, meta_tile_rewrite_policy

* Add MetaTileRewritePolicy basic UT

* Basic Python API for State (neo-ai#6)

* Add Basic Python API for State

* Add UTs for State

* Add Python API: Measure & Task (neo-ai#7)

* Update the return value of state operation

* Add task

* Copy measure.py & utils.py

* Fix LocalBuilder

* Fix LocalRunner

* Add ansor.auto_schedule() API; First AutoSchedule working version(neo-ai#8)

* Add basic Python support for ansor.auto_schedule

* Update AutoSchedule API

* Bug fix for get the attach point of a fused iter

* Update UT after infer bug fix

* Bug fix & Add python serialization API (neo-ai#10)

* Delete C++ UT hack since Python is ready

* Add ndarray.non_empty

* Update Serialization python API

* Improve code style, python wrapper and test cases (neo-ai#11)

* Update c++ code style and unit test

* Update python State wrapper and test cases

* fix unit tests

* Add RPCRunner & OpenCL/CUDA test (neo-ai#12)

* Add RPCRunner & OpenCL search test

* Add CUDA search test

* Add RPCRunner test

* rebase to upstream/master

* Add Ansor basic tutorial (neo-ai#13)

* Add basic tutorial

* migrate feature extraction (neo-ai#14)

* Add XGBModel & RPCRunnerWarpper (neo-ai#15)

* Add XGBModel & RPCRunnerWarpper

* Revert "Add Parallel Granularity Mutation"

* Migrate workload_registry.py (neo-ai#16)

* add workload registry

* update

* update

* add task scheduler (neo-ai#17)

* Add conv2d cuda tutorial with workload registry (neo-ai#18)

* add tune_test.py (the old tune_wkl.py) (neo-ai#19)

* add tune_test.py (the old tune_wkl.py)

* update

* fix measure

* fix for gpu

* Code refine for tune_test.py & Add a pre load callback (neo-ai#20)

* Bug fix for tutorials

* Add PreLoadMeasuredStates

* Add search_callback support for task tuner

* Code refine for tune_test.py

* Update

* Update

* Update

* Update

* Bug fix

* Add python custom sketch rule (neo-ai#21)

* Add custom sketch rule

* Bug fix

* Ansor Relay Integration (without layout rewrite) (neo-ai#22)

* relay integration

* Add tune_op_subgraph.py & Some code clean for tune_network.py (neo-ai#23)

* Add single op tune scripts

* Add tune subgraph support

* Merge all op & all subgraph to one file

* Rename file

* add explicit_unroll_max_extent (neo-ai#25)

* Add Index simplification & API update (neo-ai#26)

* Add vectorized cooperative_fetching test

* Update math simplify for vectorized CF

* File rename

* Update tune_network

* API update

* Update PreLoadMeasuredStates & Some bug fix (neo-ai#27)

* Add a threading wrapper to fix the test bug

* Set default TVM_USE_AUTO_SCHEDULER to false

* Update PreLoadMeasuredStates callback

* Add tensorize step for loop_state (neo-ai#31)

* Add tensorize step

* State python api update (neo-ai#33)

* Start to update api

* Add compute_dag to state

* API update

* kernel layout rewrite (neo-ai#28)

* kernel layout rewrite

* remove some hacks

* add defuse_ops pass and move kernel_layout_rewrite pass after fuse_ops pass

* set TVM_RELAY_DISABLE_BUILD_CACHE for task extraction and prepare_layout_rewrite

* [cache flush] port cache flush to ansor (neo-ai#32)

* Improve relay integration (neo-ai#34)

* tmp checkpoint

* Improve relay integration

* Improve relay integration

* Fix xgb error & Simplify dispatcher (neo-ai#35)

* Rename "MetaTileRewritePolicy" to "SketchPolicy". (neo-ai#36)

* Rename "MetaTileRewritePolicy" to "SketchPolicy".

* Add a new class for auto_unroll_max_step, storage_offset in StageNode

* fix tune_op_subgraph.py

* rebase

* Migrate all node::make to noderef's construct function (neo-ai#37)

* Start to move xxxnode::make to noderef()

* Update

* Update

* Finish transform_step

* Finish comute dag & auto schedule

* Update

* Update

* Update

* Update

* Update

* Code refine

* Code refine

* Code refine

* Update

* Update

* Some lint fix & Recover the double constructor of tvm::PrimExpr (neo-ai#39)

* lint fix

* clang-format-fix

* pylint fix

* Update

* Recover the double constructor of tvm::PrimExpr

* Fix pylint

* pylint fix

* pylint fix

* Add MutateComputeLocation and MutateParallel in evolutionary search (neo-ai#40)

* Add MutateComputeLocation and MutateParallel in evolutionary search

* fix lint

* Improve loop state python API (stage_tensors -> stage_ops) (neo-ai#41)

* improve loop state python API (stage_tensors -> stage_ops)

* fix

* ComputeDAG bug fix & Add Custom TensorCore Matmul Example (neo-ai#42)

* Bug Fix

* Sample example of Custom TensorCore Matmul

* Rever Commits, Start to build minimum Ansor system

* Code clean for minimum Ansor system

* Bug fix & Delete AccessAnalyzer

* Delete attachmap & Code clean

* Doc update

Update statenode::stages from vector to Array

* Headfile update & Python doc update

* clang-format fix

* pylint fix

* Update

* Doc update

* Update

* Bug fix after code merge to the new master

* clang-format fix

* Update

* Update

* Update std::vector to Array; Update verbosity setting; Some commemts
addressed

* std::vector->Array & std::string->String

* Add init_state to ComputeDAG

* Update

* Update some unordered_map to Map

* clang-format fix

* Comments addressed
Delete ReplayAndInferBound
Delete ReplaySteps & InferBoundCommon

* Lint fix

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Rename ansor namespace to auto_schedule

* Update

* Rename ThreadPool to ParallelFor

* Add parallel_for

* Remove ThreadPool

* Update python/tvm/auto_schedule/auto_schedule.py

* trigger CI

Co-authored-by: Lianmin Zheng <[email protected]>
Co-authored-by: Minmin Sun (孙敏敏) <[email protected]>
Co-authored-by: Zhao Wu <[email protected]>
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Aug 26, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Sep 2, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Sep 3, 2020