[RELAY][FUSION] Enhance fusion rule that starts from elemwise and broadcast #2932
Conversation
We observe a 15% performance degradation for a model because of this bug. I would like to understand whether this is the right fix.
```diff
@@ -678,6 +678,7 @@ class GraphPartitioner {
     } else {
       return (kind <= kBroadcast ||
               kind == kCommReduce ||
+              kind == kInjective ||
               kind == kOutEWiseFusable);
     }
```
I think this fix is correct, but in this case we can just simplify the condition to kind <= kOutEWiseFusable.
I wonder what the original reason was for leaving out kInjective. @tqchen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@masahi yes, we can. I wasn’t quite sure if it was intentionally left out. If not, I’ll change to <=
This is something I overlooked. However, it is good to leave things as they are, because this way it is clearer. We also need to add a case to fuse into multiple injective ops, i.e. we need to enhance the condition in the if branch as well. Please add a test case for that pattern as well:
ewise->parallel{injective, injective}->injective
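For reference, a minimal sketch of a test along these lines, assuming the pre-0.6 relay.ir_pass API used elsewhere in this thread (the shapes and ops are illustrative, not the exact test added to the PR):

```python
import tvm
from tvm import relay

x = relay.var("x", relay.ty.TensorType((10, 20), "float32"))
y = relay.add(x, relay.const(1.0, "float32"))   # elemwise/broadcast start
b1 = relay.squeeze(y)                           # injective, parallel branch 1
b2 = relay.reshape(y, newshape=(10, 20))        # injective, parallel branch 2
out = relay.concatenate([b1, b2], axis=0)       # injective op joining the two branches

func = relay.Function([x], out)
func = relay.ir_pass.infer_type(func)
fused = relay.ir_pass.fuse_ops(func, opt_level=2)
print(fused)  # the whole chain is expected to end up in a single primitive function
```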
Good catch! I will leave it to @masahi to manage the PR. See my comments in the review block.
@zhiics can you retrigger the CI?
@masahi Sorry, I didn't have much time to look into it so far. It looks like the failure is in the following:

```python
import tvm
from tvm import relay

data = relay.var("data", relay.ty.TensorType((1, 32, 32, 3), "float32"))
data1 = relay.var("data1", relay.ty.TensorType((1, 32, 32, 3), "float32"))
log = relay.log(data)
c1 = relay.copy(data1)
c2 = relay.copy(data1)
func = relay.Function([data, data1], relay.Tuple(tvm.convert([log, c1, c2])))
func = relay.ir_pass.infer_type(func)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(func, target="llvm")
```

Running this gives me an error. Without the change, the fused code is like the following:

```
%12 = fn (%data: Tensor[(1, 32, 32, 3), float32], %data1: Tensor[(1, 32, 32, 3), float32]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%1 = fn (%p0: Tensor[(1, 32, 32, 3), float32], __dict__=meta[StrMap][0]) -> Tensor[(1, 32, 32, 3), float32] {
%0 = log(%p0)
%0
}
%2 = %1(%data)
%4 = fn (%p01: Tensor[(1, 32, 32, 3), float32], __dict__=meta[StrMap][1]) -> Tensor[(1, 32, 32, 3), float32] {
%3 = copy(%p01)
%3
}
%5 = %4(%data1)
%10 = fn (%p02: Tensor[(1, 32, 32, 3), float32], %p1: Tensor[(1, 32, 32, 3), float32], __dict__=meta[StrMap][2]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%6 = copy(%p02)
%7 = copy(%p1)
%8 = copy(%p1)
%9 = (%6, %7, %8)
%9
}
%11 = %10(%2, %5)
%11
}
%12 With the change, the fused code is like the following: %5 = fn (%data: Tensor[(1, 32, 32, 3), float32], %data1: Tensor[(1, 32, 32, 3), float32]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%3 = fn (%p0: Tensor[(1, 32, 32, 3), float32], %p1: Tensor[(1, 32, 32, 3), float32], __dict__=meta[StrMap][0]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%0 = log(%p0)
%1 = copy(%p1)
%2 = (%0, %1, %1) # CSE is used here
%2
}
%4 = %3(%data, %data1)
%4
}
%5
```
this seems like a similar error as this one |
@masahi They might not be the same. The last commit is able to fix the mentioned problem, but I am not very sure if it is correct. With the fix,

```
%2 = fn (%data: Tensor[(1, 32, 32, 3), float32]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%0 = log(%data) // ty=Tensor[(1, 32, 32, 3), float32]
%1 = (%0, %0)
%1
}
```

will produce:

```
%5 = fn (%data: Tensor[(1, 32, 32, 3), float32]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%3 = fn (%p0: Tensor[(1, 32, 32, 3), float32], __dict__=meta[StrMap][0]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%0 = log(%p0)
%1 = copy(%0)
%2 = (%0, %1)
%2
}
%4 = %3(%data)
%4
}
%5
```

instead of:

```
%4 = fn (%data: Tensor[(1, 32, 32, 3), float32]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%2 = fn (%p0: Tensor[(1, 32, 32, 3), float32], __dict__=meta[StrMap][0]) -> (Tensor[(1, 32, 32, 3), float32], Tensor[(1, 32, 32, 3), float32]) {
%0 = log(%p0)
%1 = (%0, %0)
%1
}
%3 = %2(%data)
%3
}
%4
```

This type of fusion also causes another problem; it would be great if you could take a look at that as well.
ok, I'll take a look.
@masahi Thank you. FYI, I think it is the scalar here: http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-2932/3/pipeline/237#step-294-log-1426
The error seems to be happening at the fused function below. The sequence of adds at the beginning is not fused into this function if I disable this PR. I think scalars associated with add functions are causing the issue.
src/relay/pass/fuse_ops.cc
Outdated

```cpp
        j++;
      }
      if (j != i) {
        auto copy = Copy(new_fields[i]);
```
Not sure if this is the best way to fix it. Does it work if you remove this assertion?
The LLVM error could be fixed in codegen, but we might still have a memory planning issue; that's why I added the copy above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vinx13 it works, but should we remove that assertion? In my case, memory planning can pass.
it is https://github.com/dmlc/tvm/blob/3441b95e8f1e0435060f2010c10496a4b84973ae/src/codegen/llvm/codegen_llvm.cc#L849
The assertion is not handled properly. We should keep the assertion and fix the codegen.
@masahi yes, because
@masahi Does it make sense to add a function to check if a tensor is in the schedule?
@zhiics yeah, that would be a quick fix, but I want to understand why the scalar is not included in the schedule. The scalar seems to be the 0f associated with multiplies by 0. I also wonder why we need compute_inline here in the first place; that should be handled in TOPI schedules. If I remove compute_inline, it seems to work.
@masahi Yes. I tried that. I have the same question.
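For illustration, a minimal sketch of the kind of pattern being discussed (not the failing model from CI; shapes and ops are assumed): a 0f scalar constant feeding an elemwise multiply, which with this PR now fuses together with a neighbouring injective op.

```python
import tvm
from tvm import relay

x = relay.var("x", relay.ty.TensorType((1, 16), "float32"))
zero = relay.const(0.0, "float32")          # the scalar (0f)
y = relay.multiply(x, zero)                 # elemwise multiply by 0
z = relay.reshape(y, newshape=(16, 1))      # injective op that now fuses with the multiply
func = relay.Function([x], z)
func = relay.ir_pass.infer_type(func)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(func, target="llvm")
```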
@tqchen can you review this PR? We have two problems and workarounds for each. I'm not sure if we want to merge them.
@vinx13 isn't it the case that the scalar will be visited anyway?
@masahi Yes, it will be visited, but we need to explicitly call compute_inline; see the discussion in #2116 (comment).
@vinx13 Good to know. So I guess skipping compute_inline on scalars that are not in the schedule (for some reason) will cause problems if we are running on GPU.
For the LLVM error, we need to handle comparisons on handle types in codegen.
@vinx13 Thanks, but I am not sure if we want to do that, because handle should be a pointer, right? LLVM cmps can support pointers.
Yes, handle should be a pointer. In the above case it goes to the else branch, which is a floating-point cmp, and that caused the error.
@zhiics that is just my rough guess, based on @vinx13's comment in #2116. I don't know what happens if unscheduled scalars are accessed from GPU. I'll try running the SSD GPU PR at #2784 with this change to see if it works.
The problem with unscheduled scalars on GPU is that the host code initializes the scalar tensor directly, instead of finding the tensor in the schedule.
@zhiics How can I reproduce this error?
@vinx13 This one should be able to reproduce it:

```python
import tvm
from tvm import relay

data = relay.var("data", relay.ty.TensorType((1, 32, 32, 3), "float32"))
log = relay.log(data)
func = relay.Function([data], relay.Tuple(tvm.convert([log, log])))
func = relay.ir_pass.infer_type(func)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(func, target="cuda")
```
If the output is like the one above, the same compute gets scheduled twice.
@vinx13 yeah, I am aware that the same compute is scheduled twice here as well. That's probably why 'copy' works here.
This PR has been a bit stale. Please summarize the problems and list possible solutions.
@tqchen Sorry, I've been on vacation these days. There are two problems/bugs triggered by fusing from elemwise and broadcast ops, but I am not sure what the best way to fix them is.
@zhiics Regarding the problem of tuples as return values:
At the current moment, we do not pass tuples around as intermediate values. That means that unless the tuple is the final return value of the global function, the fusor should not stop at that point. One way to solve the problem is to treat a tuple that is the return value of the function differently from intermediate tuples, and not fuse through such tuples if they are marked as extern_ref (by marking their children as Opaque). This way we simplify the assumptions made by the primitive kernel. I also agree that it is useful to add additional checks for such invariants (so that a tuple is not used as a return value).
…adcast (apache#2932)
* [relay][bugfix] fuse injective to elemwise and broadcast
* enhance fusion for parallel injective
* check if tensor in schedule
* fix codegen
* fix lint
* update
* lint
In NNVM we can fuse injective ops into elemwise ops, but in Relay we cannot. Is this intended or a bug? This PR enables such behavior.
The following script shows the fusion effects of NNVM and Relay:
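A minimal Relay-side sketch of the pattern in question (illustrative shapes, pre-0.6 relay.ir_pass API; not the exact NNVM-vs-Relay comparison script):

```python
import tvm
from tvm import relay

data = relay.var("data", relay.ty.TensorType((1, 32, 32, 3), "float32"))
log = relay.log(data)                           # elemwise
flat = relay.reshape(log, newshape=(1, 3072))   # injective
func = relay.Function([data], flat)
func = relay.ir_pass.infer_type(func)
fused = relay.ir_pass.fuse_ops(func, opt_level=2)
print(fused)  # with this PR, log and reshape end up in one fused function
```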
cc @tqchen @jroesch @wweic @masahi