
[CONV] Asymmetric padding #4511

Merged: 7 commits into apache:master on Jan 6, 2020

Conversation

optima2005
Contributor

This PR implements the first item in issue #2682.

I tried to find as many as possible of the places where padding is handled as symmetric, but some may be missing. Please comment if you find more.
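
To make the change concrete, here is a minimal, hypothetical sketch (not code from this PR) of the convention the change moves toward, assuming the (pad_top, pad_left, pad_bottom, pad_right) ordering: a single integer or a symmetric 2-element padding expands into the 4-element asymmetric form.

    # Hypothetical helper illustrating the asymmetric padding convention:
    # an int or a (pad_h, pad_w) pair expands to (pad_top, pad_left, pad_bottom, pad_right).
    def normalize_padding(padding):
        if isinstance(padding, int):
            return (padding, padding, padding, padding)
        if len(padding) == 2:
            pad_h, pad_w = padding
            return (pad_h, pad_w, pad_h, pad_w)
        if len(padding) == 4:
            return tuple(padding)
        raise ValueError("padding must be an int, a 2-tuple, or a 4-tuple")

    print(normalize_padding(1))             # (1, 1, 1, 1)
    print(normalize_padding((1, 2)))        # (1, 2, 1, 2)
    print(normalize_padding((0, 1, 2, 3)))  # (0, 1, 2, 3)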

The following modules/files still use symmetric padding, either because of interface limitations or because I am not sure whether changing them to asymmetric padding would cause any trouble:

topi/python/topi/x86/conv2d_alter_op.py
topi/python/topi/rocm/conv2d.py
topi/include/topi/nn.h
convolution with the 'cudnn' library

@optima2005
Contributor Author

@FrozenGene @Rasterer

@optima2005 optima2005 changed the title [CONV] Asymmetic padding [CONV] Asymmetric padding Dec 12, 2019
@FrozenGene
Member

For topi/python/topi/x86/conv2d_alter_op.py, do you mean legalize? We should modify it, because get_int_tuple("padding") may return 4D padding. Then we should use pad_top, pad_bottom to represent (pad_h, pad_h), and pad_left, pad_right to represent (pad_w, pad_w), i.e. padding[0] and padding[1] there.

For topi/python/topi/rocm/conv2d.py, I think we could compute pad_h / pad_w from pad_top + pad_bottom / pad_left + pad_right and then pass them to miopen.conv2d_forward. This is the same as cudnn: we don't change the cudnn interface, only the values passed to it.

For topi/include/topi/nn.h, conv2d_* / depthwise_conv2d_* shouldn't be used anywhere else, because we use the Python topi version of conv2d / depthwise_conv2d, so we can keep them as they are.
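
As a rough illustration of the cudnn/miopen point above: a backend whose interface only accepts a symmetric (pad_h, pad_w) pair can keep its interface unchanged while only the values passed at the call site are adapted. The helper below is a hypothetical sketch of that idea, not the code merged in this PR; the symmetry check and the error raised for truly asymmetric padding are assumptions.

    # Hypothetical adapter for a backend (cuDNN/MIOpen style) whose API only takes
    # a symmetric (pad_h, pad_w) pair. Not the merged TVM code; it only illustrates
    # keeping the library interface and adjusting the values passed to it.
    def to_symmetric_padding(pad_top, pad_left, pad_bottom, pad_right):
        if pad_top == pad_bottom and pad_left == pad_right:
            # Already symmetric: pad_h/pad_w are simply the per-side values.
            return pad_top, pad_left
        # A single (pad_h, pad_w) pair cannot express truly asymmetric padding;
        # the caller would have to pad the input tensor explicitly instead.
        raise ValueError("backend only supports symmetric padding; "
                         "pad the input explicitly for the asymmetric case")

    print(to_symmetric_padding(1, 1, 1, 1))  # (1, 1)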

@FrozenGene
Member

@Huyuwei It would be nice if you could help review too.

@optima2005
Contributor Author

@FrozenGene I have addressed your comments; please continue the review. Thanks!

@optima2005 optima2005 changed the title [CONV] Asymmetric padding [WIP] [CONV] Asymmetric padding Dec 15, 2019
@optima2005 optima2005 changed the title [WIP] [CONV] Asymmetric padding [CONV] Asymmetric padding Dec 16, 2019
@FrozenGene
Member

Thanks. Could you help resolve the conflict?

cc @optima2005 @zhiics @yzhliu @Huyuwei, could anyone help review and handle it?

@optima2005
Contributor Author

Rebased to master to fix the conflict.

@yzhliu
Member

yzhliu commented Jan 1, 2020

@optima2005 Could you rebase again? Sorry for the delayed review.

@optima2005
Contributor Author

@yzhliu I have rebased to master. Thanks!

@optima2005
Contributor Author

I cannot reproduce the CI failure on my local box. How can I re-trigger the check?

@yzhliu
Member

yzhliu commented Jan 2, 2020

@optima2005 do you have a GPU instance? If it's an environment difference, docker is probably the easiest way to reproduce it.

@optima2005
Contributor Author

@yzhliu, I have a GPU instance and am using the docker environment. I am wondering whether the failure is intermittent and due to the CI environment.
Anyway, I rebased again to re-trigger the CI.

@yzhliu
Member

yzhliu commented Jan 6, 2020

Thanks @optima2005 @FrozenGene

@yzhliu yzhliu merged commit 34b98eb into apache:master Jan 6, 2020
@tqchen
Member

tqchen commented Jan 6, 2020

NOTE: there have been two CI failures on master since this PR was merged.

Both are on the CI instance aws.g4.n0.cuda0. @yzhliu @zhiics it would be great if you could confirm whether the error can be reproduced.

It could be due either to a real problem, which we need to address, or to a caching issue (I see this PR changes the memoize format); the latter can be confirmed by running docker from a clean folder.

@zhiics
Member

zhiics commented Jan 6, 2020

@tqchen Sure. Let me log into the machine and run the unit tests from master to see whether they fail.

@zhiics
Member

zhiics commented Jan 6, 2020

@tqchen I just tested the latest master branch locally and also tried to execute some of the failed tests in the docker image. Both worked fine. Let me clean the folder and re-enable the CI to see whether that solves the problem.

@tqchen
Member

tqchen commented Jan 7, 2020

It seems we still get some flaky errors; see https://ci.tvm.ai/blue/rest/organizations/jenkins/pipelines/tvm/branches/PR-4635/runs/2/nodes/244/log/?start=0. @zhiics it would be great if you could try to confirm again.

@zhiics
Member

zhiics commented Jan 7, 2020

@tqchen Sure. I will double check tonight once I get home.

@tqchen
Member

tqchen commented Jan 7, 2020

Note: we might need to run the exact script, as the error could be due to interference between tests.

@zhiics
Member

zhiics commented Jan 7, 2020

Yeah, I can confirm that it fails with the following error:

topi/tests/python/test_topi_depthwise_conv2d_back_weight.py::test_topi_depthwise_conv2d_backward_weight_nhwc Fatal Python error: Segmentation fault

when I ran docker/bash.sh tvmai/ci-gpu:v0.56 ./tests/scripts/task_python_topi.sh on the G4 instance.

I get many errors of this type when I just run pytest -s -v topi/tests/python:

topi/tests/python/test_topi_transform.py:168: in verify_concatenate
    check_device(device)
topi/tests/python/test_topi_transform.py:159: in check_device
    foo = tvm.build(s, tensor_l + [out_tensor], device, name="concatenate")
python/tvm/build_module.py:638: in build
    fhost, mdev = _build_for_device(flist, tar, target_host)
python/tvm/build_module.py:504: in _build_for_device
    mdev = codegen.build_module(fdevice, str(target)) if fdevice else None
python/tvm/codegen.py:36: in build_module
    return _Build(lowered_func, target)
tvm/_ffi/_cython/./function.pxi:304: in tvm._ffi._cy3.core.FunctionBase.__call__
    ???
tvm/_ffi/_cython/./function.pxi:239: in tvm._ffi._cy3.core.FuncCall
    ???
tvm/_ffi/_cython/./function.pxi:228: in tvm._ffi._cy3.core.FuncCall3
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   tvm._ffi.base.TVMError: Traceback (most recent call last):
E     [bt] (5) /home/ubuntu/workspace/tvm/build/libtvm.so(TVMFuncCall+0x61) [0x7f22b56b8c01]
E     [bt] (4) /home/ubuntu/workspace/tvm/build/libtvm.so(+0xb592f5) [0x7f22b4ef42f5]
E     [bt] (3) /home/ubuntu/workspace/tvm/build/libtvm.so(tvm::codegen::Build(tvm::Array<tvm::LoweredFunc, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xb85) [0x7f22b501a835]
E     [bt] (2) /home/ubuntu/workspace/tvm/build/libtvm.so(std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::Array<tvm::LoweredFunc, void>)>::AssignTypedLambda<tvm::runtime::Module (*)(tvm::Array<tvm::LoweredFunc, void>)>(tvm::runtime::Module (*)(tvm::Array<tvm::LoweredFunc, void>))::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)+0x4e) [0x7f22b5045f4e]
E     [bt] (1) /home/ubuntu/workspace/tvm/build/libtvm.so(tvm::codegen::BuildCUDA(tvm::Array<tvm::LoweredFunc, void>)+0x3c2) [0x7f22b564ec52]
E     [bt] (0) /home/ubuntu/workspace/tvm/build/libtvm.so(+0x131912b) [0x7f22b56b412b]
E     File "tvm/_ffi/_cython/./function.pxi", line 55, in tvm._ffi._cy3.core.tvm_callback
E     File "/home/ubuntu/workspace/tvm/python/tvm/autotvm/measure/measure_methods.py", line 591, in tvm_callback_cuda_compile
E       ptx = nvcc.compile_cuda(code, target=target, arch=AutotvmGlobalScope.current.cuda_target_arch)
E     File "/home/ubuntu/workspace/tvm/python/tvm/contrib/nvcc.py", line 94, in compile_cuda
E       cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
E     File "/home/ubuntu/anaconda3/lib/python3.6/subprocess.py", line 709, in __init__
E       restore_signals, start_new_session)
E     File "/home/ubuntu/anaconda3/lib/python3.6/subprocess.py", line 1275, in _execute_child
E       restore_signals, start_new_session, preexec_fn)
E   OSError: [Errno 12] Cannot allocate memory

@zhiics
Member

zhiics commented Jan 7, 2020

I'm not sure if it is because we don't have enough swap memory on the instance. I can create a swap space and see whether that works.

@tqchen
Member

tqchen commented Jan 7, 2020

@zhiics The nvcc error might be irrelevant; we should focus on the error within the docker container.

Given that the error is pretty recent (it could be caused by this merge or one of the few before it), can you first run a git bisect over the history to see if a particular commit caused the error?

Also, if the segfault in topi/tests/python/test_topi_depthwise_conv2d_back_weight.py is reproducible, it would be great to get a stack trace (you can install gdb in the docker container and run python under gdb).

@yzhliu
Member

yzhliu commented Jan 7, 2020

Also, we need to update the tophub log files. The workloads no longer match after this change, as the padding dimension now becomes 4D.
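
A purely illustrative example of why the logs stop matching (the real tophub entries are JSON records with more fields than shown here): the padding field in the recorded workload key grows from 2 to 4 values, so an old entry no longer equals the new lookup key.

    # Illustrative only; not the real tophub log format. The padding field in the
    # workload key grows from 2 values to 4, so old entries no longer match.
    old_workload = ("conv2d", (1, 64, 56, 56), (64, 64, 3, 3), (1, 1), (1, 1), "float32")
    new_workload = ("conv2d", (1, 64, 56, 56), (64, 64, 3, 3), (1, 1), (1, 1, 1, 1), "float32")
    print(old_workload == new_workload)  # False: a log keyed on the old form is missed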

@tqchen
Member

tqchen commented Jan 7, 2020

Opened a new thread, #4646, to track this.

alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 26, 2020
* [CONV] Asymmetic padding

* fix lint error

* update for legalize, rocm and cudnn

* add more test cases

* change more symmetric padding

* change conv2d winograd tests according orginal cases

* remove 'alter_op_layout.h' header in bitserial.cc
alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 28, 2020
* [CONV] Asymmetic padding

* fix lint error

* update for legalize, rocm and cudnn

* add more test cases

* change more symmetric padding

* change conv2d winograd tests according orginal cases

* remove 'alter_op_layout.h' header in bitserial.cc
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Mar 2, 2020
* [CONV] Asymmetic padding

* fix lint error

* update for legalize, rocm and cudnn

* add more test cases

* change more symmetric padding

* change conv2d winograd tests according orginal cases

* remove 'alter_op_layout.h' header in bitserial.cc