Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TOPI][x86] Introduce schedule_injective_from_existing and unify external schedules for all targets #3983

Merged
merged 18 commits into from
Sep 26, 2019

Conversation

soiferj
Copy link
Contributor

@soiferj soiferj commented Sep 20, 2019

Currently x86 schedule_extern does not work properly, and will treat extern ops as injective ops. This PR introduces a new generic function, schedule_injective_from_existing that has the core logic of schedule_injective for each target. schedule_extern then calls this method. This ends up fixing schedule_extern for many targets besides just x86.

Related to the discussion here.

@masahi @vinx13 would you be able to take a look?

@soiferj soiferj changed the title Fix extern schedule for x86 [TOPI] Fix extern schedule for x86 Sep 20, 2019
@soiferj soiferj changed the title [TOPI] Fix extern schedule for x86 [TOPI][x86] Fix extern schedule for x86 Sep 20, 2019
for out in outs:
if isinstance(out.op, tvm.tensor.ExternOp):
continue
_schedule_injective(out.op, s)
Copy link
Contributor Author

@soiferj soiferj Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vinx13, I have moved this logic from cuda/extern.py to generic/extern.py. Will schedule_injective still call the correct overridden function per target? Or will this just call the default one?

If it just calls the default one, it seems like I have to add a new file, x86/extern.py

Update: it seems like it does call the right one

@masahi
Copy link
Member

masahi commented Sep 20, 2019

@soiferj if you change the default behavior of schedule_extern in topi/python/topi/generic/extern.py, you should update topi/include/topi/generic/extern.h too.

I would rather move your implementation of schedule_extern in x86/extern.h to generic/extern.h, and use it from both python and x86 cpp.

@soiferj
Copy link
Contributor Author

soiferj commented Sep 20, 2019

Sure, I can work on that. Can you give me a pointer for how to call the cpp schedule from Python?

@masahi
Copy link
Member

masahi commented Sep 21, 2019

@soiferj
Copy link
Contributor Author

soiferj commented Sep 21, 2019

x86/injective.h and generic/injective.h are different. If I move the logic into generic/extern.h, how do I make sure I'm calling the right injective function? In Python, it seems like the right overridden function is automatically called. This doesn't seem like the case in C++. If we can solve this, then we can actually have all of the logic be in generic/extern.h and have all it properly call each target's schedule_injective.

Sorry for all of the questions, I'm new to this code

@masahi
Copy link
Member

masahi commented Sep 22, 2019

I see, you are right about the lack of target dispatch mechanism in our c++ topi (introduced for python in #556). It seems such dynamism is not possible in c++ topi at the moment.

I can think of two options:

  • Keep generic/extern.py and generic/extern.h as is, and duplicate existing logic in cuda/extern.py to x86/extern.py or x86/extern.h

  • We make our desired change to generic/extern.py and remove generic/extern.h . It requires many code change, but it makes our codebase more consistent. For example, the cpp schedule_extern is used here, but this is not correct. It should call topi::cuda::schedule_extern(...) directly. This bug is a result of line-by-line porting of python topi to c++.

What do you think? @soiferj @vinx13

@soiferj
Copy link
Contributor Author

soiferj commented Sep 22, 2019

Personally, I feel that the more logic we put in C++, the better. That way, we only have to implement things once, and users can use the Python or C++ API with the exact same features.

How about this: I'll implement this change like your first suggestion (duplicate logic in x86/extern.h) because this bug should be fixed urgently.

I'll also post an RFC on the forum about adding a target dispatch API for C++. In fact, the schedules are already registered as generic functions in C++ here. If we have an API like topi::generic::dispatch("schedule_injective"), we can call this in C++ and always have the right implementation per target.

What do you think?

@vinx13
Copy link
Member

vinx13 commented Sep 23, 2019

@soiferj I agree that putting more things to C++ side is better. Lack of target dispatch in C++ can be confusing to users as they only need to call topi::generic::.... It would be helpful to support that in C++.

@masahi
Copy link
Member

masahi commented Sep 23, 2019

@soiferj Great! I thought implementing the target dispatch in C++ would not be straightforward, but if you are willing to do it I'm happy to help. You can continue with this PR as you see fit and I'll merge this ASAP.

continue;
}
Array<Tensor> new_outs = { out };
tvm::GenericFunc::Get("schedule_injective")(new_outs);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masahi or @vinx13 , this call seems to work as expected. It calls the correct schedule_injective function for the current target. However, the function that it calls creates its own schedule (see cuda/injective.py). This causes failures in the unit tests. This is probably why the previous implementation used a helper function, _schedule_injective.py). Can you think of any way to fix this when calling from C++?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the problem with cuda/injective.py? Is overriding native generic from Python side a problem in your case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific error is Direct host side access to device memory is detected in fused_nn_conv2d_multiply_add_nn_relu_1. Did you forget to bind?. This is being hit in tutorials/frontend/using_external_lib.py

Copy link
Contributor Author

@soiferj soiferj Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has to do with the fact that schedule_injective creates its own schedule, but I'm not totally sure.

Update: I just tried changing override_native_generic_func to generic_func and it has the same issue.

Thanks a lot for all of the help, btw. I'm just trying to avoid duplicating code :( I think if we solve this, we can open the door to cleaner refactoring.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, we can't create a new schedule using schedule_injective. So the problem is that we don't want to duplicate the helper function in C++ right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. Maybe we should create a new generic function? Or have a generic function that is overridden with more arguments? Is that possible?

Copy link
Member

@vinx13 vinx13 Sep 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think generic func can be overridden by number of args. We can register the function as a new generic func

@soiferj
Copy link
Contributor Author

soiferj commented Sep 23, 2019

I just added a new generic function schedule_injective_from_existing. I tried to update all of the core scheduling logic for all targets into this function, and changed all of the places that used _schedule_injective to call this function instead. Let me know what you think.

I see a lot of areas to refactor - in another change I think I'll move all of the schedule_injective logic into generic/injective.h, and move all of the schedule_injective_from_existing logic out of Python and into C++. Right now it has to be duplicated because the unit tests rely on it being in C++, but Python has the updated logic.

@soiferj
Copy link
Contributor Author

soiferj commented Sep 25, 2019

Yeah, I am little confused by that too. I tried to mirror what schedule_injective does. The C++ code takes in a target, and the Python code doesn’t.

In my testing, yes, the C++ code calls back into Python. I can try to remove target from the C++ interface and fix that unit test. Is that alright?

In the future, we should also remove target from the schedule function signature.

@masahi
Copy link
Member

masahi commented Sep 25, 2019

Yes, I prefer fixing the unit test and cleaning up the interface, if you can.

@soiferj
Copy link
Contributor Author

soiferj commented Sep 25, 2019

Thanks a lot. Sorry for the long back-and-forth. I’m still pretty new to this codebase and am trying to make sure I do this the right way.

@soiferj
Copy link
Contributor Author

soiferj commented Sep 25, 2019

This is a CPP unit test - how can I set the current target?

@soiferj
Copy link
Contributor Author

soiferj commented Sep 25, 2019

@masahi or @vinx13 I am testing this new change, and while my code is being called as expected, I'm still confused as to whether this actually works. For example, I am testing the last few ops of BERT base, matmul -> bias_add -> tanh. In TVM, this will be fused to fused_nn.dense_add_tanh. When I print the outs that are passed into schedule_extern, the list has only one element, and is tanh. Because of this, schedule_extern always calls schedule_injective_from_existing. It seems like nn.dense isn't being properly scheduled.

Would one of you be able to look at this script and verify whether the behavior is expected?

https://pastebin.com/tJi2rnG2

Edit: That being said, the performance issue does seem to be resolved, I'm just a little confused as to why it works :)

@vinx13
Copy link
Member

vinx13 commented Sep 25, 2019

@soiferj Only the final output of the fused group will be passed to the schedule function, and that's why we need to use traverse during scheduling (for example https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d.py#L153)

@soiferj
Copy link
Contributor Author

soiferj commented Sep 25, 2019

I see. But if you look at line 140, extern schedule returns immediately. In my example, the fused op is never fully traversed since it calls schedule_dense, which immediately returns schedule_injective for tanh. There is no traversal. Should a traversal be happening within schedule_extern?

It almost seems like the callback should be returning schedule_extern and the schedule function should only be calling traverse. I believe this only works in my case because dense is never actually scheduled.

@vinx13
Copy link
Member

vinx13 commented Sep 25, 2019

This is a CPP unit test - how can I set the current target?

Target::Current

@vinx13
Copy link
Member

vinx13 commented Sep 25, 2019

For your case, I'm expecting that schedule_injective_for_existing called for tanh, and add inlined (via AutoInlineInjective, which also traverse the stages)

@soiferj
Copy link
Contributor Author

soiferj commented Sep 25, 2019

Thanks, it seems you're right. AutoInlineInjective seems to do the right thing here. Also, regarding the CPP unit test, the current target can be gotten by using Target::Current, but it seems it cannot be set. This unit test calls topi::cuda::schedule_injective directly, so a target needs to be set.

@vinx13
Copy link
Member

vinx13 commented Sep 25, 2019

Let's fix the test. As schedules are target-specific, we expect target to be set before calling it. We can the target (EnterTargetScope) at the beginning of the test

@soiferj
Copy link
Contributor Author

soiferj commented Sep 26, 2019

Ok got the tests working :)

topi/src/topi.cc Outdated
*/
inline PackedFunc WrapScheduleFromExisting(FTVMScheduleFromExistingBuilder builder) {
return PackedFunc([builder](TVMArgs args, TVMRetValue* ret) {
*ret = builder(args[1], args[2]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be args0 and args1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, fixed

@masahi
Copy link
Member

masahi commented Sep 26, 2019

Great @soiferj I will merge after CI.

@soiferj
Copy link
Contributor Author

soiferj commented Sep 26, 2019

Awesome, thanks again for all of your help!

@masahi masahi merged commit b330d30 into apache:master Sep 26, 2019
@masahi
Copy link
Member

masahi commented Sep 26, 2019

thanks @soiferj @vinx13 this is merged.

@soiferj soiferj deleted the densefixes branch September 26, 2019 16:04
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
…rnal schedules for all targets (apache#3983)

* Fix extern schedule for x86

* Register x86::schedule_extern

* Fix

* Fix

* Replace extern.py with extern.h

* Introduce new generic function schedule_injective_from_existing

* Fix

* Fix

* Add back to C++

* Fix style

* Injective schedule calls local schedule_injective_from_existing

* Fix

* Remove target arg from schedule_injective_from_existing

* Fix docs

* Try to fix unit test

* Fix test

* Fix other tests

* Fix bug
wweic pushed a commit to wweic/tvm that referenced this pull request Sep 30, 2019
…rnal schedules for all targets (apache#3983)

* Fix extern schedule for x86

* Register x86::schedule_extern

* Fix

* Fix

* Replace extern.py with extern.h

* Introduce new generic function schedule_injective_from_existing

* Fix

* Fix

* Add back to C++

* Fix style

* Injective schedule calls local schedule_injective_from_existing

* Fix

* Remove target arg from schedule_injective_from_existing

* Fix docs

* Try to fix unit test

* Fix test

* Fix other tests

* Fix bug
wweic pushed a commit to neo-ai/tvm that referenced this pull request Oct 1, 2019
…rnal schedules for all targets (apache#3983)

* Fix extern schedule for x86

* Register x86::schedule_extern

* Fix

* Fix

* Replace extern.py with extern.h

* Introduce new generic function schedule_injective_from_existing

* Fix

* Fix

* Add back to C++

* Fix style

* Injective schedule calls local schedule_injective_from_existing

* Fix

* Remove target arg from schedule_injective_from_existing

* Fix docs

* Try to fix unit test

* Fix test

* Fix other tests

* Fix bug
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants