In topi schedules, for OpenCL target, num_thread == 512 is too big for AMD GPU #530

masahi · 2017-10-10T06:54:11Z

Hi, I noticed that some schedules in topi.cuda use a thread block of size 512 in one dimension.
For example, here and here.

512 threads in one dimension is too many for AMD OpenCL, because the AMD driver restricts the maximum blockDim.x to 256. According to the discussion here, this restriction comes from the driver implementation, not from the hardware itself.

I understand that the schedules in topi/cuda are optimized for CUDA, but if I change the 512 threads to 256 threads, all tests in https://github.com/dmlc/tvm/tree/master/topi/tests/python pass on my AMD GPU as well. I was also able to run the tutorials and the resnet benchmark from the latest nnvm on the AMD GPU.

So, unless there is a strong reason to use a thread block of size 512, is it possible to change 512 to 256?
Of course, if you are planning to implement dedicated schedules for opencl, that's better :)

tqchen · 2017-10-10T17:18:07Z

I agree that we should have dedicated schedules for devices (e.g. AMD). We will upgrade the topi interface to make it easy to plug in device-specific schedules.

kaishijeng · 2017-10-11T00:33:00Z

On firefly3399, I can pass all unit tests except test_topi_conv2d_nchw.py, which complains about running out of resources. I probably need to reduce the number of threads.

Just curious, are you able to run the NNVM benchmark with OpenCL on the AMD platform?
NNVM has 2 benchmarks: one for rasp and one for cuda. There is no OpenCL benchmark.

Thanks,

masahi · 2017-10-11T00:46:47Z

@kaishijeng yes, I modified the cuda bench to use the opencl target and context. I also modified the nnvm internals to enable the opencl target.
Proof:
$python cuda_imagenet_bench.py --model resnet
[09:38:49] src/runtime/opencl/opencl_device_api.cc:195: Initialize OpenCL platform 'AMD Accelerated Parallel Processing'
[09:38:49] src/runtime/opencl/opencl_device_api.cc:215: opencl(0)='gfx803' cl_device_id=0x7f85d4dc46f0
benchmark args: Namespace(model='resnet', num_iter=100, opt_level=3)
ProfileResult(mean=0.01073681044)
ProfileResult(mean=0.01073411568)
ProfileResult(mean=0.01073584366)

I'm not sure if performance is good or bad. My card is R9 Nano, with 8 TFLOPs of peak perf.

For test_topi_conv2d_nchw, you can use my modified schedule here: https://gist.github.com/masahi/18b4ed1fb8fa1a049ffdba35ed807509

The fact that I can load a PyTorch- or MXNet-trained model and run inference with it on an AMD GPU is pretty exciting. Plus, my opencl driver and toolchain are fully open source with rocm.

masahi · 2017-10-11T00:54:50Z

To enable opencl target in nnvm, I changed

if target == "cuda":
to
if target == "cuda" or target == "opencl":

in
python/nnvm/top/nn.py
python/nnvm/top/reduction.py
python/nnvm/top/tensor.py
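
The same check can also be written as a membership test, so that enabling one more target is a one-line change. This is only a sketch: `GPU_SCHEDULE_TARGETS` and `uses_gpu_schedule` are hypothetical names for illustration and are not part of nnvm.

```python
# Hypothetical helper; nnvm itself compares the target string inline.
GPU_SCHEDULE_TARGETS = ("cuda", "opencl")


def uses_gpu_schedule(target):
    """Return True if `target` should take the CUDA-style schedule branch."""
    return target in GPU_SCHEDULE_TARGETS


print(uses_gpu_schedule("opencl"))
print(uses_gpu_schedule("llvm"))
```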

kaishijeng · 2017-10-11T00:59:13Z

With your modified schedule, I still got the out-of-resources error below. I guess firefly3399 has a smaller GPU than your AMD card. Do you mind sharing your modified cuda_imagenet_bench.py and how you modified the nnvm internals to enable the opencl target?

TVM: Initializing cython mode...
Use memoize topi.tests.test_topi_conv2d.verify_con2d_nchw.get_ref_data.pkl(5, (1, 3, 224, 224), 'float32', 2, 3, (64, 3, 7, 7))
Skip because cuda is not enabled
[17:55:55] src/runtime/opencl/opencl_device_api.cc:195: Initialize OpenCL platform 'ARM Platform'
[17:55:55] src/runtime/opencl/opencl_device_api.cc:215: opencl(0)='Mali-T860' cl_device_id=0x7f93fa8dd8
[17:55:55] /home/firefly/2TB/src/firefly/tvm/dmlc-core/include/dmlc/logging.h:308: [17:55:55] src/runtime/opencl/opencl_module.cc:223: Check failed: e == CL_SUCCESS OpenCL Error, code=-5: CL_OUT_OF_RESOURCES

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x44) [0x7f944bc6e4]
[bt] (1) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZNK3tvm7runtime17OpenCLWrappedFuncclENS0_7TVMArgsEPNS0_11TVMRetValueEPPv+0xbac) [0x7f94854b9c]
[bt] (2) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValueEEZNS1_6detail17PackFuncVoidAddr_ILi4ENS1_17OpenCLWrappedFuncEEENS1_10PackedFuncET0_RKSt6vectorINS6_14ArgConvertCodeESaISC_EEEUlS2_S4_E_E9_M_invokeERKSt9_Any_dataOS2_OS4_+0xd0) [0x7f94854f10]
[bt] (3) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(TVMFuncCall+0x74) [0x7f9482951c]
[bt] (4) [0x7f9676a9dc]
[bt] (5) [0x7f9676a2ac]
[bt] (6) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNSB_7TVMArgsEPNSB_11TVMRetValueEE0_clESG_SI_+0x3c) [0x7f946f2a3c]
[bt] (7) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValueEEZNS0_7codegen14LLVMModuleNode11GetFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt10shared_ptrINS1_10ModuleNodeEEEUlS2_S4_E0_E9_M_invokeERKSt9_Any_dataOS2_OS4_+0x2c) [0x7f946f2cfc]
[bt] (8) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(TVMFuncCall+0x74) [0x7f9482951c]
[bt] (9) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/_ffi/_cy2/core.so(+0x180f8) [0x7f8a5280f8]

[17:55:55] /home/firefly/2TB/src/firefly/tvm/dmlc-core/include/dmlc/logging.h:308: [17:55:55] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) [17:55:55] src/runtime/opencl/opencl_module.cc:223: Check failed: e == CL_SUCCESS OpenCL Error, code=-5: CL_OUT_OF_RESOURCES
...

Stack trace returned 6 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZN4dmlc15LogMessageFatalD1Ev+0x44) [0x7f944bc6e4]
[bt] (1) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZZN3tvm7codegen14LLVMModuleNode11GetFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt10shared_ptrINS_7runtime10ModuleNodeEEENKUlNSB_7TVMArgsEPNSB_11TVMRetValueEE0_clESG_SI_+0x1f4) [0x7f946f2bf4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZNSt17_Function_handlerIFvN3tvm7runtime7TVMArgsEPNS1_11TVMRetValueEEZNS0_7codegen14LLVMModuleNode11GetFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt10shared_ptrINS1_10ModuleNodeEEEUlS2_S4_E0_E9_M_invokeERKSt9_Any_dataOS2_OS4_+0x2c) [0x7f946f2cfc]
[bt] (3) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(TVMFuncCall+0x74) [0x7f9482951c]
[bt] (4) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/_ffi/_cy2/core.so(+0x180f8) [0x7f8a5280f8]
[bt] (5) python2(PyObject_Call+0x50) [0x458ee8]

Traceback (most recent call last):
  File "./test_topi_conv2d_nchw.py", line 72, in <module>
    test_conv2d_nchw()
  File "./test_topi_conv2d_nchw.py", line 58, in test_conv2d_nchw
    verify_conv2d_nchw(1, 3, 224, 64, 7, 3, 2)
  File "./test_topi_conv2d_nchw.py", line 54, in verify_conv2d_nchw
    check_device(device)
  File "./test_topi_conv2d_nchw.py", line 48, in check_device
    func1(a, w, b)
  File "/usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/_ffi/function.py", line 128, in __call__
    return f(*args)
  File "tvm/_ffi/_cython/./function.pxi", line 264, in core.FunctionBase.__call__
  File "tvm/_ffi/_cython/./function.pxi", line 213, in core.FuncCall
  File "tvm/_ffi/_cython/./function.pxi", line 205, in core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 131, in core.CALL
tvm._ffi.base.TVMError: [17:55:55] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) [17:55:55] src/runtime/opencl/opencl_module.cc:223: Check failed: e == CL_SUCCESS OpenCL Error, code=-5: CL_OUT_OF_RESOURCES
...

[17:55:55] /home/firefly/2TB/src/firefly/tvm/dmlc-core/include/dmlc/logging.h:308: [17:55:55] src/runtime/workspace_pool.cc:92: Check failed: allocated_.size() == 1 (3 vs. 1)

Stack trace returned 4 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZN3tvm7runtime13WorkspacePool4Pool7ReleaseE9DLContextPNS0_9DeviceAPIE+0x488) [0x7f94838fc0]
[bt] (1) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZN3tvm7runtime13WorkspacePoolD1Ev+0x48) [0x7f94837710]
[bt] (2) /usr/local/lib/python2.7/dist-packages/tvm-0.1.0-py2.7-linux-aarch64.egg/tvm/libtvm.so(_ZN3tvm7runtime2cl17OpenCLThreadEntryD1Ev+0x18) [0x7f9484c818]
[bt] (3) /lib/aarch64-linux-gnu/libc.so.6(__call_tls_dtors+0x44) [0x7f96985474]

terminate called after throwing an instance of 'dmlc::Error'
  what(): [17:55:55] src/runtime/workspace_pool.cc:92: Check failed: allocated_.size() == 1 (3 vs. 1)
...
Aborted (core dumped)

tqchen · 2017-10-11T01:01:41Z

This is mainly because AMD cards and ARM GPUs have different resources and require different kinds of schedules.

kaishijeng · 2017-10-11T01:08:50Z

I changed num_thread from your 256 to 32 and still got the out-of-resources error, which looks strange to me. I also see num_thread_x = 64 in the code; do I need to change this variable too?

Thanks,

masahi · 2017-10-11T01:37:03Z

@kaishijeng Sorry I don't know anything about ARM gpus. Maybe you can try smaller inputs?
Say, verify_conv2d_nchw(1, 3, 64, 16, 3, 1, 0) ?
Run clinfo (not sure if it exists in ARM platform) to see the maximum work group size.
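
As a rough sketch of how the limit reported by clinfo could be applied: clamp the requested thread count to the device's maximum work group size and round down to a power of two. The function name `clamp_num_thread` is made up for illustration, and the numbers in the usage line are just the 512 and 256 discussed at the top of this thread; substitute whatever clinfo prints for your device.

```python
def clamp_num_thread(requested, max_work_group_size):
    """Largest power of two that is <= both arguments."""
    if requested < 1 or max_work_group_size < 1:
        raise ValueError("both arguments must be positive")
    limit = min(requested, max_work_group_size)
    num_thread = 1
    # Keep doubling while the next power of two still fits under the limit.
    while num_thread * 2 <= limit:
        num_thread *= 2
    return num_thread


# 512 is the size used by the CUDA schedules; 256 is the AMD OpenCL limit.
print(clamp_num_thread(512, 256))
```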

I can push my repo, after I clean it up.

kaishijeng · 2017-10-11T02:04:09Z

Yes, it works OK with verify_conv2d_nchw(1, 3, 64, 64, 3, 1, 0).

Thanks,

masahi · 2017-10-11T02:06:16Z

Glad to hear that.
So your CL_OUT_OF_RESOURCES is due to insufficient memory, not the amount of threads.

kaishijeng · 2017-10-11T02:11:39Z

Can you share your modified cuda_imagenet_bench.py to enable OpenCL? Thanks,

masahi · 2017-10-11T02:26:29Z

Sure, it's here: https://gist.github.com/masahi/de658ec98b1e275de6331a4cf4c873b3

Note that all I changed is just
target = "opencl"
ctx = tvm.cl(0)

With the same change you can also run tutorial/from_onnx.py and tutorial/from_mxnet.py etc.

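You can also use my modified versions of these files:
python/nnvm/top/nn.py: https://gist.github.com/masahi/a9b67fe27b24b783334ef8fc6f877ef9
python/nnvm/top/reduction.py: https://gist.github.com/masahi/706fd19cb4c589bc7ea59d68cde99a8a
python/nnvm/top/tensor.py: https://gist.github.com/masahi/1d503f9e7839636dc8086d3737523d64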

This should be enough to run resnet bench and tutorials.

kaishijeng · 2017-10-11T02:47:05Z

I encountered CL_OUT_OF_RESOURCES again with your changes.

tvm._ffi.base.TVMError: [19:44:47] src/codegen/llvm/llvm_module.cc:59: Check failed: ret == 0 (-1 vs. 0) [19:44:47] src/runtime/opencl/opencl_module.cc:223: Check failed: e == CL_SUCCESS OpenCL Error, code=-5: CL_OUT_OF_RESOURCES

masahi · 2017-10-11T02:49:56Z

Are you running the resnet bench? If so, try the mobilenet bench first. It requires less memory, I think.

kaishijeng · 2017-10-11T02:57:15Z

No luck with mobilenet. It is probably related to how tvm/nnvm allocates GPU resources in the various layers, because the OpenCL functions in ARM ComputeLibrary run OK on Firefly3399.

Thanks,

masahi · 2017-10-11T08:09:19Z

@tqchen Halide has auto scheduling work merged in master (here is an example of conv layer with auto scheduling). The results in the paper are impressive. Do you think auto scheduling in tvm is possible? That will be pretty cool.

Note that it is different from auto tuning. They automatically generate an image processing schedule that is on par with or better than ones hand-written by Halide experts, without actually running it. I wonder how auto scheduling can apply to conv and gemm.

tqchen · 2017-10-11T17:06:38Z

Yes, we are looking into it. The auto-schedule patterns for GPUs and multi-core CPUs could go beyond the patterns used for image processing workloads, and we are gaining some experience on what new changes need to be made here.

masahi · 2017-10-11T23:17:53Z

That's great, looking forward to learning the techniques.

kaishijeng · 2017-10-12T04:56:42Z

Why are the resources required by conv2d_nchw.py much bigger than those of conv2d_hwnc? I am still trying to figure out why there are not enough resources for conv2d_nchw.py, while conv2d_hwnc.py is OK in the unit tests.

Thanks

tqchen · 2017-10-12T17:23:25Z

The conv2d_nchw schedule is tuned quite specifically for the K80 card; it launches many more threads and uses shared memory resources. It is not as good a schedule for mobile GPUs.

kaishijeng · 2017-10-12T19:00:28Z

1) Can I always use conv2d_hwnc even though it is not optimized for a particular processor/platform?
2) How do I know which one, i.e. conv2d_nchw or conv2d_hwnc, is used in cuda_imagenet_bench.py <https://github.com/dmlc/nnvm/blob/master/examples/benchmark/cuda_imagenet_bench.py>?

Thanks,

tqchen · 2017-10-12T22:52:00Z

The imagenet bench always uses the nchw one; the hwnc one is designed for the batch kernel in cuda. I think we can try to make the schedule more adjustable so it can adapt to more platforms, as a future goal.

kaishijeng · 2017-10-13T06:08:17Z

I will wait for your new design. I think if NNVM/TVM can take advantage of mobile GPUs to speed up deep learning, it will enable more deep learning applications on mobile devices.

Thanks,

masahi · 2017-10-14T23:55:04Z

Solved by #556

kaishijeng · 2017-10-15T17:59:56Z

Masahi,

How does #556 solve this issue? With the latest code, I got a different error from test_topi_conv2d_nchw.py, shown below. The error with the previous code was insufficient resources for CL allocation.

TVM: Initializing cython mode...
Use memoize topi.tests.test_topi_conv2d.verify_con2d_nchw.get_ref_data.pkl(5, (1, 3, 224, 224), 'float32', 2, 3, (64, 3, 7, 7))
Skip because cuda is not enabled
[10:56:06] src/runtime/opencl/opencl_device_api.cc:195: Initialize OpenCL platform 'ARM Platform'
[10:56:06] src/runtime/opencl/opencl_device_api.cc:215: opencl(0)='Mali-T860' cl_device_id=0x7f75be3dd8
[10:56:06] /home/firefly/2TB/src/firefly/tvm/dmlc-core/include/dmlc/logging.h:308: [10:56:06] src/runtime/opencl/opencl_module.cc:223: Check failed: e == CL_SUCCESS OpenCL Error, code=-55: CL_INVALID_WORK_ITEM_SIZE

tqchen · 2017-10-15T18:51:05Z

To enable Mali GPUs, we need to add further mobile GPU options here: https://github.com/dmlc/tvm/blob/master/python/tvm/target.py#L63

Note that most of the current cuda schedules need tweaks for the Mali target, so additional registered functions from topi are needed.

masahi · 2017-10-16T06:30:10Z

Thanks, it looks like not all schedules under topi.cuda have been updated to use max_num_threads. I think we shouldn't simply replace every occurrence of 512 with max_num_threads. I need to discuss this with @tqchen

I am at work now, I will check the code later.

kaishijeng · 2017-10-17T02:36:22Z

Tianqi/masahi,

Just curious, why is there no opencl directory in tvm/topi/python/topi? There are rasp and cuda, but no opencl.

tqchen · 2017-10-17T02:44:44Z

This is mainly due to the fact that there has not yet been an effort to optimize schedules for opencl devices. We do expect that there should be topi/rocm (for AMD GPUs) and topi/mali (for mobile GPUs).

kaishijeng · 2017-10-23T01:07:58Z

masahi

Does the latest code enforce a proper max_num_threads?

If yes, what changes do I need to make to reduce max_num_threads for firefly3399?

Thanks,

masahi · 2017-10-23T02:27:55Z

@kaishijeng yes, at least it solves my AMD opencl problem. Not sure if it will work for you, though.

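If max_num_threads = 256 is still too big for your card, you can set a smaller number here: https://github.com/dmlc/tvm/blob/122183583663ae11a5389e46d3ef6c92e2b82d05/python/tvm/target.py#L113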

kaishijeng · 2017-10-23T04:47:39Z

Masahi,

I will try different values if 256 is too high. Just curious, I saw that the change to max_num_threads is under target_name in ("rocm", "opencl"). Will firefly 3399 use the same target name?

Thanks,

masahi · 2017-10-23T06:12:00Z

Yes, what matters is the keyword "gpu" in self.keys += ("rocm", "gpu")
This will cause schedules registered under "gpu" to be called by opencl backend.
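
To illustrate the idea, here is a toy sketch of key-based dispatch. It is not TVM's actual code: `ToyTarget`, `SCHEDULES`, `register` and `dispatch` are hypothetical names, and the default of 512 threads for cuda is an assumption taken from the schedule size discussed in this issue. Only the `("rocm", "gpu")` keys and the limit of 256 for rocm/opencl come from the thread.

```python
class ToyTarget(object):
    """Toy stand-in for a target: a name, dispatch keys and a thread limit."""

    def __init__(self, target_name):
        self.target_name = target_name
        self.keys = ()
        self.max_num_threads = 512  # assumed default, matching the cuda schedules
        if target_name == "cuda":
            self.keys += ("cuda", "gpu")
        elif target_name in ("rocm", "opencl"):
            # opencl shares the keys of rocm, including the generic "gpu" key
            self.keys += ("rocm", "gpu")
            self.max_num_threads = 256


SCHEDULES = {}  # key -> schedule function


def register(key):
    """Decorator that records a schedule function under a dispatch key."""
    def decorator(func):
        SCHEDULES[key] = func
        return func
    return decorator


def dispatch(target):
    """Return the first schedule registered under any of the target's keys."""
    for key in target.keys:
        if key in SCHEDULES:
            return SCHEDULES[key]
    raise KeyError("no schedule registered for %s" % target.target_name)


@register("gpu")
def schedule_gpu(target):
    return "gpu schedule with num_thread=%d" % target.max_num_threads


opencl = ToyTarget("opencl")
print(dispatch(opencl)(opencl))
```

Nothing is registered under "rocm" in this sketch, so the lookup for an opencl target falls through to the function registered under "gpu".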

kaishijeng · 2017-10-23T06:16:05Z

I have tried different values of self.max_num_threads in target.py, even down to 8, and still got "Check failed: e == CL_SUCCESS OpenCL Error, code=-5: CL_OUT_OF_RESOURCES" when running test_conv2d_nchw.

Thanks,

masahi · 2017-10-23T06:24:19Z

It seems your card doesn't have enough memory (global or local) to run test_conv2d_nchw.
You should try smaller inputs, as I told you earlier.

kaishijeng · 2017-10-23T06:38:07Z

But a smaller input is not useful, because most nets have an input size of (224, 224, 3). Thanks,

masahi · 2017-10-23T07:47:48Z

You can try reducing the number of output channels or using a smaller filter size.
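
As a rough guide to how much those two knobs change the buffer sizes, here is a back-of-the-envelope sketch in plain Python. The helper `conv2d_nchw_bytes` and the example shapes are hypothetical and are not taken from the test. It counts only the input, weight, and output buffers in global memory for float32 data, so it says nothing about local memory or registers, which can also trigger CL_OUT_OF_RESOURCES.

```python
def conv2d_nchw_bytes(batch, in_channels, in_size, out_channels,
                      kernel, stride, padding, dtype_bytes=4):
    """Estimate global-memory buffer sizes (in bytes) for one NCHW conv2d.

    Hypothetical helper for illustration; square inputs and filters only.
    """
    # Standard convolution output-size formula.
    out_size = (in_size + 2 * padding - kernel) // stride + 1
    input_bytes = batch * in_channels * in_size * in_size * dtype_bytes
    weight_bytes = out_channels * in_channels * kernel * kernel * dtype_bytes
    output_bytes = batch * out_channels * out_size * out_size * dtype_bytes
    return {
        "out_size": out_size,
        "input": input_bytes,
        "weight": weight_bytes,
        "output": output_bytes,
        "total": input_bytes + weight_bytes + output_bytes,
    }


if __name__ == "__main__":
    # Example shapes are assumptions chosen for illustration.
    base = conv2d_nchw_bytes(1, 3, 224, 64, kernel=7, stride=2, padding=3)
    fewer_channels = conv2d_nchw_bytes(1, 3, 224, 16, kernel=7, stride=2, padding=3)
    smaller_filter = conv2d_nchw_bytes(1, 3, 224, 64, kernel=3, stride=2, padding=1)
    for name, est in [("base", base), ("fewer channels", fewer_channels),
                      ("smaller filter", smaller_filter)]:
        print(f"{name}: {est['total'] / 1024:.0f} KiB total")
```

In this estimate, cutting output channels shrinks both the weight and output buffers, while a smaller filter shrinks only the weights (for the same output size).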

tqchen · 2017-10-23T16:58:18Z

The existing schedules for AMD/CUDA GPUs do not work well on ARM GPUs, so we will need independent schedules for them. For example, it is usually not helpful to make use of shared memory there.


[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (apache#485) [Meta Schedule][M3c] PostOrderApply (apache#486) Fix Post Order Apply (apache#490) [MetaSchedule] Relay Integration (apache#489) [M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (apache#492) Fix replay trace. (apache#493) [M3c][Meta Schedule] Implement the Replay Func class. (apache#495) [PR] Test script for meta-schedule task extraction. Interface to load… (apache#494) [Meta Schedule Refactor] Get child blocks (apache#500) Read-at && Write-at (apache#497) [M3c][Meta Schedule] Measure Callbacks (apache#498) [Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (apache#496) [MetaSchedule] Sample-Perfect-Tile (apache#501) [MetaSchedule] TE Workloads (apache#502) [TensorIR] GetProducer, GetConsumer (apache#506) [MetaScheduleRefactor] Annotate&Unannotate (apache#505) [MetaSchedule] Multi-Level-Tiling & Auto-Inline (apache#503) [Tests] Add unittests for auto-inline and multi-level-tiling (apache#508) [Meta Schedule] Minor Fixes (apache#507) [MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (apache#509) [MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (apache#499) [Meta Schedule] Add Helper Function & Minor Modification (apache#512) [MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (apache#513) [Meta Schedule] Feature Extractor & Cost Model (apache#510) Blockize & Tensorize (apache#514) Layout Rewriting: Suggest-Index-Map (apache#520) [MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (apache#516) [Meta Schedule] Per-Store-Feature (apache#521) Add traced schedule for blockize & tensorize (apache#526) [Meta Schedule] Add XGBoost Model & Random Model (apache#519) User-Interface: Tune-TIR (apache#525) User-Interface: Tune-TE (apache#527) [Minor] More logging on python (apache#528) Get CUDA tuning working (apache#529) [MetaSchedule] TensorRT BYOC (apache#518) [BugFix] 
LocalBuilder API (apache#531) [Meta Schedule] Add Cost Model Update Measure Callback (apache#530) [Bugfix] BuilderInput with default params (apache#532) [MetaSchedule] Mutator-Tile-Size, Mutate-Parallel, Mutate-Unroll (apache#534) [Meta Schedule] Evolutionary Search (apache#522) [BugFix] Remove duplicated definition of MakeMultinomialSampler (apache#535) [Meta Schedule] Fix some bugs (apache#537) Initiate Experiments for CPU Performance Alignment with Ansor (apache#538) [Meta Schedule] Tweak experiment scripts (apache#539) [Meta Schedule] Initiate experiments on CUDA (apache#540) [TIR][Schedule] Buffer transform (apache#523) Auto Tensor Core (apache#524) Working on Evo Search (apache#542) [Meta Schedule] Add Replay Tuning Interface (apache#543) Evolutionary Search on CPU (apache#544) Misc improvement over the error message (apache#545) [TIR][Schedule] Software pipelining (apache#533) [Meta Schedule Refactor] fixing unit tests (apache#547) [MetaSchedule] Mutator-Compute-Location (apache#548) Misc Improvement of Evolutionary Search (apache#549) Hotfix for software pipeline (apache#552) Misc Improvement (apache#550) [Cherry-Pick][TensorIR] Primitive "SetScope" (apache#9738) (apache#555) Rule RFactor (apache#551) [MemHammer] Rewrite Rules (apache#554) [MetaSchedule] Schedule Rule: Cross-Thread Reduction (apache#556) [MetaSchedule] Performance Alignment - NRM and SFM (CUDA) (apache#559) [MetaSchedule] Perf Alignment - NRM on CUDA (apache#560) [TIR] Reorder the block iters of the blocks generated by RFactor (apache#561) Removing 2 unit tests for software pipelining (apache#562) [MemHammer] Lower Pass + Unittests (apache#557) Perf Align: Remove Auto-inline before Multi-level-tiling (apache#564) Fix Sketch Generation Unittests (apache#565) speed up VerifyGpuCode (apache#568) [Performance Align] fixing codegen problems (apache#569) [Meta schedule] improve search space (#1) Hot fix for bound predicate (#3) [Meta Schedule] Update Tune Relay (#4) [Performance Align] fixing 
codegen problems (#5) [PerfAlign] NRM & SFM on Raspi Aligned (#6) [BugFix] Apply bound predicate directly to loops when possible (#12) [BugFix] Fix CrossThreadReduction on CUDA (#13) [MetaSchedule] Enable BertTuning with MetaScheduler (#11) [Minor][MemHammer] Minor tweaks in code review (#14) [Meta Schedule] Add customizable search space to PostOrderApply. (#16) Fix cooperative fetching (#17) Fixes for codegen (#18) [Hotfix] A unittest (#19) Fix for GRP sketch gen (#21) Add threadIdx filtering in Multi-Level-Tiling and Verify-GPU-Code (#20) [BugFix][TIR] Fix cross-thread reduction when single reduction loop with predicate (apache#10016) (#22) [MemHammer][Refactor] Code Review (#15) [Meta Schedule] Add Winograd Test for Customizable Search Space (#24) Import & Cache Mechanism (#26) [BugFix] Fix Winograd Test Script (#25) Add task extraction & caching (#27) A few fixes for task extraction (#28) Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Wuwei Lin <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Xiyou Zhou <[email protected]>

[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (apache#485) [Meta Schedule][M3c] PostOrderApply (apache#486) Fix Post Order Apply (apache#490) [MetaSchedule] Relay Integration (apache#489) [M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (apache#492) Fix replay trace. (apache#493) [M3c][Meta Schedule] Implement the Replay Func class. (apache#495) [PR] Test script for meta-schedule task extraction. Interface to load… (apache#494) [Meta Schedule Refactor] Get child blocks (apache#500) Read-at && Write-at (apache#497) [M3c][Meta Schedule] Measure Callbacks (apache#498) [Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (apache#496) [MetaSchedule] Sample-Perfect-Tile (apache#501) [MetaSchedule] TE Workloads (apache#502) [TensorIR] GetProducer, GetConsumer (apache#506) [MetaScheduleRefactor] Annotate&Unannotate (apache#505) [MetaSchedule] Multi-Level-Tiling & Auto-Inline (apache#503) [Tests] Add unittests for auto-inline and multi-level-tiling (apache#508) [Meta Schedule] Minor Fixes (apache#507) [MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (apache#509) [MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (apache#499) [Meta Schedule] Add Helper Function & Minor Modification (apache#512) [MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (apache#513) [Meta Schedule] Feature Extractor & Cost Model (apache#510) Blockize & Tensorize (apache#514) Layout Rewriting: Suggest-Index-Map (apache#520) [MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (apache#516) [Meta Schedule] Per-Store-Feature (apache#521) Add traced schedule for blockize & tensorize (apache#526) [Meta Schedule] Add XGBoost Model & Random Model (apache#519) User-Interface: Tune-TIR (apache#525) User-Interface: Tune-TE (apache#527) [Minor] More logging on python (apache#528) Get CUDA tuning working (apache#529) [MetaSchedule] TensorRT BYOC (apache#518) [BugFix] 
LocalBuilder API (apache#531) [Meta Schedule] Add Cost Model Update Measure Callback (apache#530) [Bugfix] BuilderInput with default params (apache#532) [MetaSchedule] Mutator-Tile-Size, Mutate-Parallel, Mutate-Unroll (apache#534) [Meta Schedule] Evolutionary Search (apache#522) [BugFix] Remove duplicated definition of MakeMultinomialSampler (apache#535) [Meta Schedule] Fix some bugs (apache#537) Initiate Experiments for CPU Performance Alignment with Ansor (apache#538) [Meta Schedule] Tweak experiment scripts (apache#539) [Meta Schedule] Initiate experiments on CUDA (apache#540) [TIR][Schedule] Buffer transform (apache#523) Auto Tensor Core (apache#524) Working on Evo Search (apache#542) [Meta Schedule] Add Replay Tuning Interface (apache#543) Evolutionary Search on CPU (apache#544) Misc improvement over the error message (apache#545) [TIR][Schedule] Software pipelining (apache#533) [Meta Schedule Refactor] fixing unit tests (apache#547) [MetaSchedule] Mutator-Compute-Location (apache#548) Misc Improvement of Evolutionary Search (apache#549) Hotfix for software pipeline (apache#552) Misc Improvement (apache#550) [Cherry-Pick][TensorIR] Primitive "SetScope" (apache#9738) (apache#555) Rule RFactor (apache#551) [MemHammer] Rewrite Rules (apache#554) [MetaSchedule] Schedule Rule: Cross-Thread Reduction (apache#556) [MetaSchedule] Performance Alignment - NRM and SFM (CUDA) (apache#559) [MetaSchedule] Perf Alignment - NRM on CUDA (apache#560) [TIR] Reorder the block iters of the blocks generated by RFactor (apache#561) Removing 2 unit tests for software pipelining (apache#562) [MemHammer] Lower Pass + Unittests (apache#557) Perf Align: Remove Auto-inline before Multi-level-tiling (apache#564) Fix Sketch Generation Unittests (apache#565) speed up VerifyGpuCode (apache#568) [Performance Align] fixing codegen problems (apache#569) [Meta schedule] improve search space (apache#1) Hot fix for bound predicate (apache#3) [Meta Schedule] Update Tune Relay (apache#4) [Performance 
Align] fixing codegen problems (apache#5) [PerfAlign] NRM & SFM on Raspi Aligned (apache#6) [BugFix] Apply bound predicate directly to loops when possible (apache#12) [BugFix] Fix CrossThreadReduction on CUDA (apache#13) [MetaSchedule] Enable BertTuning with MetaScheduler (apache#11) [Minor][MemHammer] Minor tweaks in code review (apache#14) [Meta Schedule] Add customizable search space to PostOrderApply. (apache#16) Fix cooperative fetching (apache#17) Fixes for codegen (apache#18) [Hotfix] A unittest (apache#19) Fix for GRP sketch gen (apache#21) Add threadIdx filtering in Multi-Level-Tiling and Verify-GPU-Code (apache#20) [BugFix][TIR] Fix cross-thread reduction when single reduction loop with predicate (apache#10016) (apache#22) [MemHammer][Refactor] Code Review (apache#15) [Meta Schedule] Add Winograd Test for Customizable Search Space (apache#24) Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Wuwei Lin <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Xiyou Zhou <[email protected]>

[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (apache#485) [Meta Schedule][M3c] PostOrderApply (apache#486) Fix Post Order Apply (apache#490) [MetaSchedule] Relay Integration (apache#489) [M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (apache#492) Fix replay trace. (apache#493) [M3c][Meta Schedule] Implement the Replay Func class. (apache#495) [PR] Test script for meta-schedule task extraction. Interface to load… (apache#494) [Meta Schedule Refactor] Get child blocks (apache#500) Read-at && Write-at (apache#497) [M3c][Meta Schedule] Measure Callbacks (apache#498) [Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (apache#496) [MetaSchedule] Sample-Perfect-Tile (apache#501) [MetaSchedule] TE Workloads (apache#502) [TensorIR] GetProducer, GetConsumer (apache#506) [MetaScheduleRefactor] Annotate&Unannotate (apache#505) [MetaSchedule] Multi-Level-Tiling & Auto-Inline (apache#503) [Tests] Add unittests for auto-inline and multi-level-tiling (apache#508) [Meta Schedule] Minor Fixes (apache#507) [MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (apache#509) [MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (apache#499) [Meta Schedule] Add Helper Function & Minor Modification (apache#512) [MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (apache#513) [Meta Schedule] Feature Extractor & Cost Model (apache#510) Blockize & Tensorize (apache#514) Layout Rewriting: Suggest-Index-Map (apache#520) [MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (apache#516) [Meta Schedule] Per-Store-Feature (apache#521) Add traced schedule for blockize & tensorize (apache#526) [Meta Schedule] Add XGBoost Model & Random Model (apache#519) User-Interface: Tune-TIR (apache#525) User-Interface: Tune-TE (apache#527) [Minor] More logging on python (apache#528) Get CUDA tuning working (apache#529) [MetaSchedule] TensorRT BYOC (apache#518) [BugFix] 
LocalBuilder API (apache#531) [Meta Schedule] Add Cost Model Update Measure Callback (apache#530) [Bugfix] BuilderInput with default params (apache#532) [MetaSchedule] Mutator-Tile-Size, Mutate-Parallel, Mutate-Unroll (apache#534) [Meta Schedule] Evolutionary Search (apache#522) [BugFix] Remove duplicated definition of MakeMultinomialSampler (apache#535) [Meta Schedule] Fix some bugs (apache#537) Initiate Experiments for CPU Performance Alignment with Ansor (apache#538) [Meta Schedule] Tweak experiment scripts (apache#539) [Meta Schedule] Initiate experiments on CUDA (apache#540) [TIR][Schedule] Buffer transform (apache#523) Auto Tensor Core (apache#524) Working on Evo Search (apache#542) [Meta Schedule] Add Replay Tuning Interface (apache#543) Evolutionary Search on CPU (apache#544) Misc improvement over the error message (apache#545) [TIR][Schedule] Software pipelining (apache#533) [Meta Schedule Refactor] fixing unit tests (apache#547) [MetaSchedule] Mutator-Compute-Location (apache#548) Misc Improvement of Evolutionary Search (apache#549) Hotfix for software pipeline (apache#552) Misc Improvement (apache#550) [Cherry-Pick][TensorIR] Primitive "SetScope" (apache#9738) (apache#555) Rule RFactor (apache#551) [MemHammer] Rewrite Rules (apache#554) [MetaSchedule] Schedule Rule: Cross-Thread Reduction (apache#556) [MetaSchedule] Performance Alignment - NRM and SFM (CUDA) (apache#559) [MetaSchedule] Perf Alignment - NRM on CUDA (apache#560) [TIR] Reorder the block iters of the blocks generated by RFactor (apache#561) Removing 2 unit tests for software pipelining (apache#562) [MemHammer] Lower Pass + Unittests (apache#557) Perf Align: Remove Auto-inline before Multi-level-tiling (apache#564) Fix Sketch Generation Unittests (apache#565) speed up VerifyGpuCode (apache#568) [Performance Align] fixing codegen problems (apache#569) [Meta schedule] improve search space (apache#1) Hot fix for bound predicate (apache#3) [Meta Schedule] Update Tune Relay (apache#4) [Performance 
Align] fixing codegen problems (apache#5) [PerfAlign] NRM & SFM on Raspi Aligned (apache#6) [BugFix] Apply bound predicate directly to loops when possible (apache#12) [BugFix] Fix CrossThreadReduction on CUDA (apache#13) [MetaSchedule] Enable BertTuning with MetaScheduler (apache#11) [Minor][MemHammer] Minor tweaks in code review (apache#14) [Meta Schedule] Add customizable search space to PostOrderApply. (apache#16) Fix cooperative fetching (apache#17) Fixes for codegen (apache#18) [Hotfix] A unittest (apache#19) Fix for GRP sketch gen (apache#21) Add threadIdx filtering in Multi-Level-Tiling and Verify-GPU-Code (apache#20) [BugFix][TIR] Fix cross-thread reduction when single reduction loop with predicate (apache#10016) (apache#22) [MemHammer][Refactor] Code Review (apache#15) [Meta Schedule] Add Winograd Test for Customizable Search Space (apache#24) Import & Cache Mechanism (apache#26) [BugFix] Fix Winograd Test Script (apache#25) Add task extraction & caching (apache#27) A few fixes for task extraction (apache#28) Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Wuwei Lin <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Xiyou Zhou <[email protected]>

[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (apache#485) [Meta Schedule][M3c] PostOrderApply (apache#486) Fix Post Order Apply (apache#490) [MetaSchedule] Relay Integration (apache#489) [M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (apache#492) Fix replay trace. (apache#493) [M3c][Meta Schedule] Implement the Replay Func class. (apache#495) [PR] Test script for meta-schedule task extraction. Interface to load… (apache#494) [Meta Schedule Refactor] Get child blocks (apache#500) Read-at && Write-at (apache#497) [M3c][Meta Schedule] Measure Callbacks (apache#498) [Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (apache#496) [MetaSchedule] Sample-Perfect-Tile (apache#501) [MetaSchedule] TE Workloads (apache#502) [TensorIR] GetProducer, GetConsumer (apache#506) [MetaScheduleRefactor] Annotate&Unannotate (apache#505) [MetaSchedule] Multi-Level-Tiling & Auto-Inline (apache#503) [Tests] Add unittests for auto-inline and multi-level-tiling (apache#508) [Meta Schedule] Minor Fixes (apache#507) [MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (apache#509) [MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (apache#499) [Meta Schedule] Add Helper Function & Minor Modification (apache#512) [MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (apache#513) [Meta Schedule] Feature Extractor & Cost Model (apache#510) Blockize & Tensorize (apache#514) Layout Rewriting: Suggest-Index-Map (apache#520) [MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (apache#516) [Meta Schedule] Per-Store-Feature (apache#521) Add traced schedule for blockize & tensorize (apache#526) [Meta Schedule] Add XGBoost Model & Random Model (apache#519) User-Interface: Tune-TIR (apache#525) User-Interface: Tune-TE (apache#527) [Minor] More logging on python (apache#528) Get CUDA tuning working (apache#529) [MetaSchedule] TensorRT BYOC (apache#518) [BugFix] 
LocalBuilder API (apache#531) [Meta Schedule] Add Cost Model Update Measure Callback (apache#530) [Bugfix] BuilderInput with default params (apache#532) [MetaSchedule] Mutator-Tile-Size, Mutate-Parallel, Mutate-Unroll (apache#534) [Meta Schedule] Evolutionary Search (apache#522) [BugFix] Remove duplicated definition of MakeMultinomialSampler (apache#535) [Meta Schedule] Fix some bugs (apache#537) Initiate Experiments for CPU Performance Alignment with Ansor (apache#538) [Meta Schedule] Tweak experiment scripts (apache#539) [Meta Schedule] Initiate experiments on CUDA (apache#540) [TIR][Schedule] Buffer transform (apache#523) Auto Tensor Core (apache#524) Working on Evo Search (apache#542) [Meta Schedule] Add Replay Tuning Interface (apache#543) Evolutionary Search on CPU (apache#544) Misc improvement over the error message (apache#545) [TIR][Schedule] Software pipelining (apache#533) [Meta Schedule Refactor] fixing unit tests (apache#547) [MetaSchedule] Mutator-Compute-Location (apache#548) Misc Improvement of Evolutionary Search (apache#549) Hotfix for software pipeline (apache#552) Misc Improvement (apache#550) [Cherry-Pick][TensorIR] Primitive "SetScope" (apache#9738) (apache#555) Rule RFactor (apache#551) [MemHammer] Rewrite Rules (apache#554) [MetaSchedule] Schedule Rule: Cross-Thread Reduction (apache#556) [MetaSchedule] Performance Alignment - NRM and SFM (CUDA) (apache#559) [MetaSchedule] Perf Alignment - NRM on CUDA (apache#560) [TIR] Reorder the block iters of the blocks generated by RFactor (apache#561) Removing 2 unit tests for software pipelining (apache#562) [MemHammer] Lower Pass + Unittests (apache#557) Perf Align: Remove Auto-inline before Multi-level-tiling (apache#564) Fix Sketch Generation Unittests (apache#565) speed up VerifyGpuCode (apache#568) [Performance Align] fixing codegen problems (apache#569) [Meta schedule] improve search space (apache#1) Hot fix for bound predicate (apache#3) [Meta Schedule] Update Tune Relay (apache#4) [Performance 
Align] fixing codegen problems (apache#5) [PerfAlign] NRM & SFM on Raspi Aligned (apache#6) [BugFix] Apply bound predicate directly to loops when possible (apache#12) [BugFix] Fix CrossThreadReduction on CUDA (apache#13) [MetaSchedule] Enable BertTuning with MetaScheduler (apache#11) [Minor][MemHammer] Minor tweaks in code review (apache#14) [Meta Schedule] Add customizable search space to PostOrderApply. (apache#16) Fix cooperative fetching (apache#17) Fixes for codegen (apache#18) [Hotfix] A unittest (apache#19) Fix for GRP sketch gen (apache#21) Add threadIdx filtering in Multi-Level-Tiling and Verify-GPU-Code (apache#20) [BugFix][TIR] Fix cross-thread reduction when single reduction loop with predicate (apache#10016) (apache#22) [MemHammer][Refactor] Code Review (apache#15) [Meta Schedule] Add Winograd Test for Customizable Search Space (apache#24) Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Wuwei Lin <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Xiyou Zhou <[email protected]>

[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (apache#485) [Meta Schedule][M3c] PostOrderApply (apache#486) Fix Post Order Apply (apache#490) [MetaSchedule] Relay Integration (apache#489) [M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (apache#492) Fix replay trace. (apache#493) [M3c][Meta Schedule] Implement the Replay Func class. (apache#495) [PR] Test script for meta-schedule task extraction. Interface to load… (apache#494) [Meta Schedule Refactor] Get child blocks (apache#500) Read-at && Write-at (apache#497) [M3c][Meta Schedule] Measure Callbacks (apache#498) [Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (apache#496) [MetaSchedule] Sample-Perfect-Tile (apache#501) [MetaSchedule] TE Workloads (apache#502) [TensorIR] GetProducer, GetConsumer (apache#506) [MetaScheduleRefactor] Annotate&Unannotate (apache#505) [MetaSchedule] Multi-Level-Tiling & Auto-Inline (apache#503) [Tests] Add unittests for auto-inline and multi-level-tiling (apache#508) [Meta Schedule] Minor Fixes (apache#507) [MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (apache#509) [MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (apache#499) [Meta Schedule] Add Helper Function & Minor Modification (apache#512) [MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (apache#513) [Meta Schedule] Feature Extractor & Cost Model (apache#510) Blockize & Tensorize (apache#514) Layout Rewriting: Suggest-Index-Map (apache#520) [MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (apache#516) [Meta Schedule] Per-Store-Feature (apache#521) Add traced schedule for blockize & tensorize (apache#526) [Meta Schedule] Add XGBoost Model & Random Model (apache#519) User-Interface: Tune-TIR (apache#525) User-Interface: Tune-TE (apache#527) [Minor] More logging on python (apache#528) Get CUDA tuning working (apache#529) [MetaSchedule] TensorRT BYOC (apache#518) [BugFix] 
LocalBuilder API (apache#531) [Meta Schedule] Add Cost Model Update Measure Callback (apache#530) [Bugfix] BuilderInput with default params (apache#532) [MetaSchedule] Mutator-Tile-Size, Mutate-Parallel, Mutate-Unroll (apache#534) [Meta Schedule] Evolutionary Search (apache#522) [BugFix] Remove duplicated definition of MakeMultinomialSampler (apache#535) [Meta Schedule] Fix some bugs (apache#537) Initiate Experiments for CPU Performance Alignment with Ansor (apache#538) [Meta Schedule] Tweak experiment scripts (apache#539) [Meta Schedule] Initiate experiments on CUDA (apache#540) [TIR][Schedule] Buffer transform (apache#523) Auto Tensor Core (apache#524) Working on Evo Search (apache#542) [Meta Schedule] Add Replay Tuning Interface (apache#543) Evolutionary Search on CPU (apache#544) Misc improvement over the error message (apache#545) [TIR][Schedule] Software pipelining (apache#533) [Meta Schedule Refactor] fixing unit tests (apache#547) [MetaSchedule] Mutator-Compute-Location (apache#548) Misc Improvement of Evolutionary Search (apache#549) Hotfix for software pipeline (apache#552) Misc Improvement (apache#550) [Cherry-Pick][TensorIR] Primitive "SetScope" (apache#9738) (apache#555) Rule RFactor (apache#551) [MemHammer] Rewrite Rules (apache#554) [MetaSchedule] Schedule Rule: Cross-Thread Reduction (apache#556) [MetaSchedule] Performance Alignment - NRM and SFM (CUDA) (apache#559) [MetaSchedule] Perf Alignment - NRM on CUDA (apache#560) [TIR] Reorder the block iters of the blocks generated by RFactor (apache#561) Removing 2 unit tests for software pipelining (apache#562) [MemHammer] Lower Pass + Unittests (apache#557) Perf Align: Remove Auto-inline before Multi-level-tiling (apache#564) Fix Sketch Generation Unittests (apache#565) speed up VerifyGpuCode (apache#568) [Performance Align] fixing codegen problems (apache#569) [Meta schedule] improve search space (apache#1) Hot fix for bound predicate (apache#3) [Meta Schedule] Update Tune Relay (apache#4) [Performance 
Align] fixing codegen problems (apache#5) [PerfAlign] NRM & SFM on Raspi Aligned (apache#6) [BugFix] Apply bound predicate directly to loops when possible (apache#12) [BugFix] Fix CrossThreadReduction on CUDA (apache#13) [MetaSchedule] Enable BertTuning with MetaScheduler (apache#11) [Minor][MemHammer] Minor tweaks in code review (apache#14) [Meta Schedule] Add customizable search space to PostOrderApply. (apache#16) Fix cooperative fetching (apache#17) Fixes for codegen (apache#18) [Hotfix] A unittest (apache#19) Fix for GRP sketch gen (apache#21) Add threadIdx filtering in Multi-Level-Tiling and Verify-GPU-Code (apache#20) [BugFix][TIR] Fix cross-thread reduction when single reduction loop with predicate (apache#10016) (apache#22) [MemHammer][Refactor] Code Review (apache#15) [Meta Schedule] Add Winograd Test for Customizable Search Space (apache#24) Import & Cache Mechanism (apache#26) [BugFix] Fix Winograd Test Script (apache#25) Add task extraction & caching (apache#27) A few fixes for task extraction (apache#28) Co-authored-by: Siyuan Feng <[email protected]> Co-authored-by: Bohan Hou <[email protected]> Co-authored-by: Hongyi Jin <[email protected]> Co-authored-by: Ruihang Lai <[email protected]> Co-authored-by: Junru Shao <[email protected]> Co-authored-by: Wuwei Lin <[email protected]> Co-authored-by: Sunghyun Park <[email protected]> Co-authored-by: Xiyou Zhou <[email protected]>

[Meta Schedule][M3c] Schedule Rules, Mutator & Postprocs (apache#485) [Meta Schedule][M3c] PostOrderApply (apache#486) Fix Post Order Apply (apache#490) [MetaSchedule] Relay Integration (apache#489) [M3c][Meta Schedule] Add Trace Correctness Test for PostOrderApply (apache#492) Fix replay trace. (apache#493) [M3c][Meta Schedule] Implement the Replay Func class. (apache#495) [PR] Test script for meta-schedule task extraction. Interface to load… (apache#494) [Meta Schedule Refactor] Get child blocks (apache#500) Read-at && Write-at (apache#497) [M3c][Meta Schedule] Measure Callbacks (apache#498) [Bug] Fix Infinite Loop Caused When Calling Methods Not Overrided In PyClass (apache#496) [MetaSchedule] Sample-Perfect-Tile (apache#501) [MetaSchedule] TE Workloads (apache#502) [TensorIR] GetProducer, GetConsumer (apache#506) [MetaScheduleRefactor] Annotate&Unannotate (apache#505) [MetaSchedule] Multi-Level-Tiling & Auto-Inline (apache#503) [Tests] Add unittests for auto-inline and multi-level-tiling (apache#508) [Meta Schedule] Minor Fixes (apache#507) [MetaSchedule] Rewrite Cooperative-Fetching / Unbound-Block / Reduction-Block (apache#509) [MetaSchedule] Rewrite Parallel-Vectorize-Unroll / Verify-GPU / Disallow-Dynamic-Loops (apache#499) [Meta Schedule] Add Helper Function & Minor Modification (apache#512) [MetaSchedule] Test for Rewrite Parallel-Vectorize-Unroll (apache#513) [Meta Schedule] Feature Extractor & Cost Model (apache#510) Blockize & Tensorize (apache#514) Layout Rewriting: Suggest-Index-Map (apache#520) [MetaSchedule] Parallel-Vectorize-Unroll & Random-Compute-Location (apache#516) [Meta Schedule] Per-Store-Feature (apache#521) Add traced schedule for blockize & tensorize (apache#526) [Meta Schedule] Add XGBoost Model & Random Model (apache#519) User-Interface: Tune-TIR (apache#525) User-Interface: Tune-TE (apache#527) [Minor] More logging on python (apache#528) Get CUDA tuning working (apache#529) [MetaSchedule] TensorRT BYOC (apache#518) [BugFix] 

masahi changed the title from ~~In topi schedules, num_thread == 512 is too big for AMD OpenCL~~ to "In topi schedules, for OpenCL target, num_thread == 512 is too big for AMD GPU" on Oct 10, 2017

masahi mentioned this issue on Oct 11, 2017: tvm.bind not found #532 (Closed)

masahi mentioned this issue on Oct 14, 2017: [Refactor] Introduce target generic dispatch system #556 (Merged)

masahi closed this as completed on Oct 14, 2017

kaishijeng mentioned this issue on Oct 27, 2017: [Tutorial] Use OpenCL on ARM Board #525 (Closed)

JammyZhou mentioned this issue on Jan 8, 2018: CL_OUT_OF_RESOURCES when deploy resnet18 to ARM device #761 (Closed)
