use elementwise to optimize gelu forward implementation on GPU #38188

Merged
merged 3 commits into PaddlePaddle:develop from the gelu_opt branch on Dec 21, 2021

Conversation

Zjq9409
Contributor

@Zjq9409 Zjq9409 commented Dec 16, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Use elementwise to optimize the GPU forward computation of the gelu operator. The forward operator performance data is as follows:

[image: forward performance comparison data]
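
For context on the approach: the forward pass is expressed as a single elementwise functor and handed to the elementwise CUDA launcher, rather than going through the previous per-operator implementation. A minimal sketch of the erf-based (approximate = false) functor is shown below; the functor name and the use of details::MPTypeTrait to promote fp16 to float mirror the snippets quoted later in this review, and the exact merged code may differ.

// Sketch only: exact gelu, gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
// MPTypeTrait promotes fp16 inputs to float so the math runs in full precision.
template <typename T>
struct GeluWithoutApproximateFunctor {
  using MPType = typename details::MPTypeTrait<T>::Type;
  inline HOSTDEVICE T operator()(T arg_x) {
    MPType x = static_cast<MPType>(arg_x);
    MPType erf_out = erf(x * static_cast<MPType>(M_SQRT1_2));  // erf(x / sqrt(2))
    return static_cast<T>(x * static_cast<MPType>(0.5) *
                          (static_cast<MPType>(1) + erf_out));
  }
};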

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@Zjq9409 Zjq9409 changed the title from "relu forward opt" to "Use elementwise to optimize gelu implementation on GPU" on Dec 16, 2021
std::vector<const framework::Tensor*> ins;
std::vector<framework::Tensor*> outs;
ins = {in};
outs = {out};
Contributor

The vectors can be initialized when they are created.

Contributor Author

Done.
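
Applied, the declaration and initialization collapse into single statements (a sketch of what the suggestion amounts to, not the exact merged diff):

std::vector<const framework::Tensor*> ins = {in};   // initialized at creation
std::vector<framework::Tensor*> outs = {out};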

@Zjq9409 Zjq9409 changed the title from "Use elementwise to optimize gelu implementation on GPU" to "Use elementwise to optimize gelu forward implementation on GPU" on Dec 20, 2021
};

template <typename T>
struct GeluNoApproximateFunctor {
Contributor

Change the name so it corresponds to the one above; use "without".

Contributor Author

Done.

template <typename DeviceContext, typename T>
typename std::enable_if<
std::is_same<DeviceContext, platform::CUDADeviceContext>::value>::type
default_gelu_fw(const framework::ExecutionContext& ctx,
Contributor

There is no need to write a new function; just specializing a CUDA version of GeluKernel is enough.

Contributor Author

Done

using MT = typename details::MPTypeTrait<T>::Type;
inline HOSTDEVICE T operator()(T x) {
// this function is tanh approximation of gelu
MT mx = static_cast<MT>(x);
Contributor

The naming here can follow the naming style used in activation_op.cu.

Contributor Author

Done.
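
For reference, a sketch of the tanh-approximation functor once the names follow the activation_op.cu convention (MPType for the promoted compute type, the input cast once at the top). The constants come from the standard approximation gelu(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))); the exact merged code may differ.

template <typename T>
struct GeluWithApproximateFunctor {
  using MPType = typename details::MPTypeTrait<T>::Type;
  inline HOSTDEVICE T operator()(T arg_x) {
    // tanh approximation of gelu
    MPType x = static_cast<MPType>(arg_x);
    MPType one = static_cast<MPType>(1);
    MPType half = static_cast<MPType>(0.5);
    MPType kAlpha = static_cast<MPType>(M_2_SQRTPI * M_SQRT1_2);  // sqrt(2 / pi)
    MPType tanh_out =
        tanh(kAlpha * x * (one + static_cast<MPType>(0.044715) * x * x));
    return static_cast<T>(x * half * (one + tanh_out));
  }
};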

@Zjq9409 Zjq9409 changed the title from "Use elementwise to optimize gelu forward implementation on GPU" to "use elementwise to optimize gelu forward implementation on GPU" on Dec 20, 2021
@Zjq9409 Zjq9409 force-pushed the gelu_opt branch 2 times, most recently from 107f8ed to dac385f, on December 20, 2021 at 11:55
};

template <typename DeviceContext, typename T>
class GeluCUDAKernel : public framework::OpKernel<T> {
Contributor

For the specialization there is no need to change the name; keep using GeluKernel, and use CUDADeviceContext for the DeviceContext template parameter.

Contributor Author

Done.
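
Putting the review comments together, a sketch of the resulting kernel: the class keeps the name GeluKernel and is partially specialized with CUDADeviceContext as the DeviceContext template argument, then dispatches to one of the two functors through the elementwise launcher. The launcher name LaunchSameDimsElementwiseCudaKernel and its argument order are assumptions based on Paddle's elementwise utilities of that period, so the merged code may differ in detail.

template <typename T>
class GeluKernel<platform::CUDADeviceContext, T>
    : public framework::OpKernel<T> {
 public:
  void Compute(const framework::ExecutionContext& context) const override {
    auto* in = context.Input<framework::Tensor>("X");
    auto* out = context.Output<framework::Tensor>("Out");
    auto approximate = context.Attr<bool>("approximate");
    out->mutable_data<T>(in->place());

    std::vector<const framework::Tensor*> ins = {in};
    std::vector<framework::Tensor*> outs = {out};
    const auto& dev_ctx =
        context.template device_context<platform::CUDADeviceContext>();
    if (approximate) {
      // Assumed launcher name and signature; see the note above.
      LaunchSameDimsElementwiseCudaKernel<ElementwiseType::kUnary, T, T>(
          dev_ctx, ins, &outs, GeluWithApproximateFunctor<T>());
    } else {
      LaunchSameDimsElementwiseCudaKernel<ElementwiseType::kUnary, T, T>(
          dev_ctx, ins, &outs, GeluWithoutApproximateFunctor<T>());
    }
  }
};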

Contributor

@ZzSean ZzSean left a comment

LGTM

@ZzSean ZzSean merged commit aff4368 into PaddlePaddle:develop Dec 21, 2021