
implementation of broadcast div backward by reduce #38044

Merged
merged 15 commits into PaddlePaddle:develop on Jan 5, 2022

Conversation

@Zjq9409 (Contributor) commented Dec 10, 2021

PR types

Performance optimization

PR changes

OPs

Describe

| Case | PyTorch | Before optimization | Before vs. PyTorch | After optimization | After vs. PyTorch | Speedup |
|---|---|---|---|---|---|---|
| [50, 128, 1000], [128, 1000] | 0.46865 | 0.24259 | better (48.24%) | 0.23764 | better (49.29%) | 1.02 |
| [50, 128, 1000], [1, 128, 1000] | 0.46940 | 0.24346 | better (48.13%) | 0.23795 | better (49.30%) | 1.02 |
| [16, 2048, 7, 7], [16, 2048] | 0.14044 | 0.07819 | better (44.32%) | 0.07565 | better (45.84%) | 1.03 |
| [16, 2048, 16, 16], [16, 2048, 16, 16] | 0.71575 | 0.34497 | better (1.07x) | 0.34354 | better (1.07x) | 1.00 |
| [16, 1, 513, 513], [1] | 0.31762 | 4.67214 | worse (13.71x) | 0.15971 | better (49.51%) | 29.25 |
| [512, 896, 4, 12], [512, 896, 4, 1] | 1.68353 | 2.82219 | worse (67.64%) | 0.86215 | better (48.78%) | 3.27 |
| [512, 896, 4, 12], [512, 896, 4, 1] fp16 | 1.17390 | 2.74304 | worse (1.34x) | 0.60514 | better (48.67%) | 4.53 |
| [32, 12, 128, 128], [32, 1, 1, 128] fp16 | 0.34941 | 0.57034 | worse (63.23%) | 0.15004 | better (1.32x) | 3.80 |
| [32, 1, 1, 128], [1, 12, 128, 1] fp16 | 0.38124 | 0.4983 | worse (30.71%) | 0.19352 | better (49.29%) | 2.57 |

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

inline HOSTDEVICE T operator()(const T& a, const T& b) const { return a * b; }
};
template <typename T>
struct MulDxDyFunctor<paddle::platform::complex<T>> {
Contributor:

Why not just call them MulFunctor and DivFunctor?

Contributor Author:

Renamed them to MulGradFunctor and DivGradFunctor.
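For reference, a minimal sketch of what the renamed gradient functor might compute, based on the calculus for z = x / y (dz/dx = dout / y); the struct name and template layout follow the snippets quoted in this review and are assumptions, not the final merged code:

```cpp
// Sketch only: dx = dout / y for z = x / y.
template <typename T>
struct DivGradFunctor {
  inline HOSTDEVICE T operator()(const T& a, const T& b) const {
    // a: upstream gradient dout, b: divisor y
    return a / b;
  }
};
```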

@Zjq9409 changed the title from "add elementwise div" to "implementation of broadcast div backward by reduce" on Dec 13, 2021
const paddle::platform::complex<T>& y) const {
paddle::platform::complex<T> y_conj(y.real, -y.imag);
return x / y_conj;
}
Contributor:

Make the DivGradFunctor parameter names consistent: either x and y, or a and b. Same for MulGradFunctor.

Contributor Author:

Done.

ins.emplace_back(y);
outs.emplace_back(&res_dy);

const auto& cuda_ctx =
Contributor:

Rename cuda_ctx to dev_ctx for consistency.

Contributor Author:

Done.


// x * y / z
template <typename T>
struct MulDivGradFunctor {
Contributor:

Rename this to DivGradYFunctor.

Contributor Author:

Done.
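As a rough illustration of the renamed ternary functor: using out = x / y, the gradient w.r.t. y is -dout * out / y. The argument order below follows the ins = {dout, out, y} vectors shown later in this PR and should be treated as an assumption:

```cpp
// Sketch only: dy = -dout * out / y for z = x / y.
template <typename T>
struct DivGradYFunctor {
  inline HOSTDEVICE T operator()(const T& a, const T& b, const T& c) const {
    // a: dout, b: out (= x / y), c: y
    return -a * b / c;
  }
};
```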


std::vector<int> reduce_dims = GetReduceDim(x->dims(), out->dims(), axis);
gpuStream_t stream = ctx.cuda_device_context().stream();
TensorReduceFunctorImpl<T, T, CustomSum>(res_dx, dx, reduce_dims, stream);
Contributor:

The Reduce interface has changed and this needs to be updated accordingly; see #38135.

Contributor Author:

Done.


std::vector<int> reduce_dims = GetReduceDim(y->dims(), out->dims(), axis);
gpuStream_t stream = ctx.cuda_device_context().stream();
TensorReduceFunctorImpl<T, T, CustomSub>(res_dy, dy, reduce_dims, stream);
Contributor:

The Reduce interface needs to be updated.

Contributor Author:

Done.

ins.emplace_back(dout);
ins.emplace_back(out);
ins.emplace_back(y);
outs.emplace_back(&res_dy);
Contributor:

The vectors can be initialized when they are created; the emplace_back() calls are unnecessary.

Contributor Author:

Done.
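For example, the emplace_back calls can be replaced with brace initialization at construction time (tensor names taken from the snippet above):

```cpp
// Initialize the input/output vectors directly instead of emplace_back().
std::vector<const framework::Tensor*> ins = {dout, out, y};
std::vector<framework::Tensor*> outs = {&res_dy};
```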


const auto& dev_ctx =
ctx.template device_context<platform::CUDADeviceContext>();
LaunchElementwiseCudaKernel<ElementwiseType::kBinary, T, T>(
Contributor:

Since DivGradYFunctor is ternary, kTernary is more appropriate than kBinary here.

Contributor Author:

Done.

framework::Tensor tmp_dx;
tmp_dx.Resize(dout->dims());

ElementwiseComputeEx<DivGradFunctor<T>, DeviceContext, T>(
Contributor:

Don't call this interface from GPU-only code; it is meant for code that has to support both CPU and GPU computation. For purely GPU code, going through LaunchElementwiseCudaKernel is more straightforward.

Contributor Author:

Done.

}
if (dx->dims() == dout->dims()) {
// dx = dout/y
ElementwiseComputeEx<DivGradFunctor<T>, DeviceContext, T>(
Contributor:

Same as below.

Contributor Author:

Done.

dx[col] = o / y_conj;
if (dx != nullptr) {
dx[col] = o / y_conj;
}
dy[col] = -o * out_div_y_conj;
col += blockDim.x * gridDim.x;
Contributor:

This can be rewritten as a grid-stride loop; see https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/

Contributor Author:

Done.
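The grid-stride pattern from the linked NVIDIA post replaces the manual `col += blockDim.x * gridDim.x` bookkeeping with a plain for loop, so the kernel works for any launch configuration. A self-contained CUDA sketch (a generic scaling kernel, not the PR's actual gradient kernel):

```cpp
#include <cstdio>

// Grid-stride loop: each thread starts at its global index and strides by
// the total number of threads in the grid until the whole range is covered.
__global__ void scale_kernel(const float* in, float* out, float alpha, int n) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    out[i] = alpha * in[i];
  }
}

int main() {
  const int n = 1 << 20;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) in[i] = 1.0f;
  // Any reasonable launch configuration works; the loop covers the rest.
  scale_kernel<<<256, 256>>>(in, out, 2.0f, n);
  cudaDeviceSynchronize();
  printf("out[0] = %f\n", out[0]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```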

dx[col] = o / y_conj;
if (dx != nullptr) {
dx[col] = o / y_conj;
}
dy[col] = -o * out_div_y_conj;
col += blockDim.x * gridDim.x;
Contributor:

Same as above.

Contributor Author:

Done.

}

template <typename T>
void reduce_functor(const framework::ExecutionContext& ctx,
Contributor:

Rename the functions to UpperCamelCase.

Contributor Author:

Done.

template <typename DeviceContext, typename T>
typename std::enable_if<
std::is_same<DeviceContext, platform::CUDADeviceContext>::value>::type
default_elementwise_div_grad(const framework::ExecutionContext& ctx,
Contributor:

Same as above.

Contributor Author:

Done.


template <typename T>
void reduce_functor(const framework::ExecutionContext& ctx,
const framework::Tensor* in, const framework::Tensor* out,
Contributor:

What is the difference between in and src, and between out and dst? What does each of them do? Could you distinguish them or add an explanation?

Contributor Author:

in and out are used to compute reduce_dims; src holds the values to be reduced, and dst holds the result after the reduce. A comment can be added to explain this.

}

template <typename T>
void reduce_functor(const framework::ExecutionContext& ctx,
Contributor:

The CUDA device context can be passed in directly here.

Contributor Author:

Done.

dx[col] = o / y_conj;
dy[col] = -o * out_div_y_conj;
col += blockDim.x * gridDim.x;
if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
Contributor:

Just calling reduce_functor twice is enough; this if/else is unnecessary.

LaunchElementwiseCudaKernel<ElementwiseType::kBinary, T, T>(
dev_ctx, ins, &outs, axis, DivGradFunctor<T>());
if (dx->dims() != dout->dims()) {
reduce_functor<T>(ctx, x, out, &tmp_dx, dx);
Contributor:

Same as above.

std::vector<framework::Tensor*> outs = {&tmp_dy};
LaunchElementwiseCudaKernel<ElementwiseType::kTernary, T, T>(
dev_ctx, ins, &outs, axis, DivGradYFunctor<T>());
if (dy->dims() != dout->dims()) {
Contributor:

Same as above.

ElemwiseGradCompute<DeviceContext, T, DivGradDX<T>, DivGradDY<T>>(
ctx, *x, *y, *out, *dout, axis, dx, dy, DivGradDX<T>(),
DivGradDY<T>());
default_elementwise_div_grad<DeviceContext, T>(ctx, x, y, out, dout, dx,
Contributor:

Please rename default as well, e.g. to Common, or something better.

Contributor Author:

This will be changed uniformly in a follow-up.

dy[col] = -o * out_div_y_conj;
col += blockDim.x * gridDim.x;
if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
dx->ShareDataWith(tmp_dx);
Contributor:

ShareDataWith assigns the tensor tmp_dx to dx, which may cause problems at model runtime; try to avoid this pattern.


// Complex div grad
template <typename T>
struct DivGradFunctor<Complex<T>> {
Contributor:

Shouldn't this be written as GradX, to correspond with GradY?

Contributor Author:

Done.


std::vector<const framework::Tensor*> ins = {dout, out, y};
std::vector<framework::Tensor*> outs;
if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
Contributor:

This branch can be removed, because in that case this interface is never reached.

Contributor Author:

The original same-dims CUDA function has been removed, and the same-dims case now goes through this branch, so it is kept.

framework::Tensor* dy) {
int axis = ctx.Attr<int>("axis");
auto* dout_data = dout->data<T>();
dim3 block_size = dim3(ELEMENTWISE_BLOCK_SIZE, 1);
Contributor:

block_size is defined but never used.

Contributor Author:

Removed.

}

template <typename T>
void ReduceForDiv(const platform::CUDADeviceContext& dev_ctx, int axis,
Contributor:

This wrapper is not only applicable to div; could it be renamed to ReduceWrapper()?

Contributor Author:

It has been extracted into a common interface function.
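Based on the reduce calls quoted earlier in this review (GetReduceDim followed by TensorReduceFunctorImpl) and the ReduceWrapper call sites shown below, the extracted helper presumably looks roughly like the following. The reduce interface changed in #38135, so this mirrors the pre-update calls in the quoted snippets and is only a sketch:

```cpp
// Sketch: shape-agnostic reduce helper. src holds the broadcast-shaped
// intermediate gradient; dst is the final gradient tensor whose (smaller)
// shape determines which dimensions are reduced over.
template <typename T>
void ReduceWrapper(const platform::CUDADeviceContext& dev_ctx, int axis,
                   framework::Tensor* src, framework::Tensor* dst) {
  std::vector<int> reduce_dims = GetReduceDim(dst->dims(), src->dims(), axis);
  TensorReduceFunctorImpl<T, T, CustomSum>(*src, dst, reduce_dims,
                                           dev_ctx.stream());
}
```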

@@ -146,14 +167,11 @@ class ElementwiseDivGradKernel : public ElemwiseGradKernel<T> {
auto* dout = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto* dx = ctx.Output<Tensor>(framework::GradVarName("X"));
auto* dy = ctx.Output<Tensor>(framework::GradVarName("Y"));
int axis = ctx.Attr<int>("axis");

if (dx != nullptr && dy != nullptr && (dx->dims() == dy->dims())) {
Contributor:

DefaultElementwiseDivGrad already covers this branch, so it can be removed.

Contributor Author:

Done.


std::vector<const framework::Tensor*> ins = {dout, out, y};
std::vector<framework::Tensor*> outs;
if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
Contributor:

This branch is redundant and can be removed: when (dx->dims() == dy->dims()), the outer call never reaches this function.
Alternatively, keep it and instead delete the outer dx != nullptr && dy != nullptr && (dx->dims() == dy->dims()) branch. See below.


template <typename InT, typename OutT>
struct DivGradXYFunctor {
inline HOSTDEVICE paddle::framework::Array<OutT, 2> operator()(InT a, InT b,
Contributor:

Pass read-only parameters by const reference; same below.

Contributor Author:

Done.
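Applying the const-reference suggestion to the two-output functor quoted above gives roughly the following. The argument roles follow the ins = {dout, out, y} ordering used elsewhere in the PR and are an assumption:

```cpp
// Sketch: one functor producing both gradients of z = x / y.
// outs[0] = dx = dout / y, outs[1] = dy = -dout * out / y.
template <typename InT, typename OutT>
struct DivGradXYFunctor {
  inline HOSTDEVICE paddle::framework::Array<OutT, 2> operator()(
      const InT& a, const InT& b, const InT& c) const {
    paddle::framework::Array<OutT, 2> outs;
    outs[0] = a / c;       // a: dout, c: y
    outs[1] = -a * b / c;  // b: out (= x / y)
    return outs;
  }
};
```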

framework::Tensor tmp_dy;
tmp_dy.mutable_data<T>(dout->dims(), ctx.GetPlace());
if (dx != nullptr && dy != nullptr) {
auto* dx_data = dx->mutable_data<T>(ctx.GetPlace());
Contributor:

The result of mutable_data does not need to be assigned to a pointer (the pointer is never used afterwards); same below.

Contributor Author:

Done.

dx[col] = o / y[col];
dy[col] = -o * out[col] / y[col];
col += blockDim.x * gridDim.x;
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < size;
Contributor:

Is this function still necessary?

Contributor Author:

This function has been removed; the multi-output branch is used instead.

framework::Tensor tmp_dx;
tmp_dx.mutable_data<T>(dout->dims(), ctx.GetPlace());
framework::Tensor tmp_dy;
tmp_dy.mutable_data<T>(dout->dims(), ctx.GetPlace());
Contributor:

Not every case needs a temporary Tensor, right?

Contributor Author:

Allocation of the temporary Tensors has been moved into the if/else branches.

ReduceWrapper<T>(dev_ctx, axis, &tmp_dy, dy);
}
}
#endif
Contributor:

GetGradXOut and GetGradYOut are too similar; they can be collapsed into one function.

Contributor Author:

Done.

if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
outs = {dx, dy};
}
if (dx->dims() != dout->dims() && dy->dims() == dout->dims()) {
Contributor:

Should this be else if?

Contributor Author:

Done.
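With the suggested else if, the branch selection reads as a single chain over the four shape combinations. A sketch based on the snippet above (tmp_dx and tmp_dy are the temporary broadcast-shaped buffers that are reduced back afterwards; treat the exact variable names as assumptions):

```cpp
// Sketch: choose the kernel outputs depending on which gradients need a
// post-reduce. Only shapes that differ from dout get a temporary buffer.
if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
  outs = {dx, dy};
} else if (dx->dims() != dout->dims() && dy->dims() == dout->dims()) {
  outs = {&tmp_dx, dy};
} else if (dx->dims() == dout->dims() && dy->dims() != dout->dims()) {
  outs = {dx, &tmp_dy};
} else {
  outs = {&tmp_dx, &tmp_dy};
}
```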

const auto& dev_ctx =
ctx.template device_context<platform::CUDADeviceContext>();
if (dx != nullptr && dy != nullptr) {
GetGradXYOut<T>(dev_ctx, axis, x, y, out, dout, dx, dy,
Contributor:

It seems these could be wrapped into a single function: pass a null pointer for the unused parameters, and then only the functor differs.

Contributor Author:

If a null pointer is passed in, GetGradXYOut would have to check the pointers for null in several places, which hurts readability.

@@ -13,8 +13,6 @@ See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/fluid/operators/elementwise/elementwise_div_op.h"
#include "paddle/fluid/operators/elementwise/elementwise_op_broadcast.cu.h"
#include "paddle/fluid/platform/complex.h"
#include "paddle/fluid/platform/float16.h"
Contributor Author:

The header has been removed.

const auto& dev_ctx =
ctx.template device_context<platform::CUDADeviceContext>();
if (dx != nullptr && dy != nullptr) {
GetGradXYOut<T>(dev_ctx, axis, x, y, out, dout, dx, dy,
Contributor Author:

If a null pointer is passed in, GetGradXYOut would have to check the pointers for null in several places, which hurts readability.

if (dx->dims() == dout->dims() && dy->dims() == dout->dims()) {
outs = {dx, dy};
}
if (dx->dims() != dout->dims() && dy->dims() == dout->dims()) {
Contributor Author:

Done.

ReduceWrapper<T>(dev_ctx, axis, &tmp_dy, dy);
}
}
#endif
Contributor Author:

Done.

@JamesLim-sy (Contributor) left a comment:

I agree with this PR, but there are still some suggested modifications that need to be discussed with the other reviewers. Please discuss them with the other reviewers as soon as possible, and please do not force-push once the review process has started.

@ZzSean (Contributor) left a comment:

LGTM

@PaddlePaddle PaddlePaddle deleted a comment from Zjq9409 Jan 4, 2022
@JamesLim-sy JamesLim-sy merged commit 55cd9cb into PaddlePaddle:develop Jan 5, 2022