ONNX-exported BART model performance is degraded compared to native PyTorch on T4 #7796
Comments
Thanks @anshoomehra for providing the repro! I'll take a look.
Hi @anshoomehra, based on your code there are two things you can do to improve the performance:
1. Keep inputs and outputs on GPU during text generation so intermediate tensors are not copied to CPU.
2. Run the ONNX Runtime transformers optimizer on the exported model.
For #1, you can refer to the GPT-2 example of keeping inputs and outputs on GPU during text generation (all intermediate tensors stay on GPU and are not copied to CPU): onnxruntime/onnxruntime/python/tools/transformers/gpt2_helper.py Lines 486 to 502 in a41255c
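A minimal sketch of that IO-binding pattern, assuming onnxruntime-gpu with CUDAExecutionProvider; the model path and the "input_ids"/"logits" names here are placeholders, not the actual names in the exported BART graph:

```python
import numpy as np
import onnxruntime as ort

# Placeholder model path; substitute the exported BART decoder.
session = ort.InferenceSession("bart.onnx", providers=["CUDAExecutionProvider"])
io_binding = session.io_binding()

# Move the input to the CUDA device once and bind it, so generation steps
# reuse device memory instead of round-tripping tensors through CPU.
input_ids = np.ones((1, 32), dtype=np.int64)  # dummy token ids
input_on_gpu = ort.OrtValue.ortvalue_from_numpy(input_ids, "cuda", 0)
io_binding.bind_ortvalue_input("input_ids", input_on_gpu)

# Let ORT allocate the output on the same CUDA device; nothing is copied to CPU.
io_binding.bind_output("logits", "cuda", 0)

session.run_with_iobinding(io_binding)
logits = io_binding.get_outputs()[0]  # OrtValue still resident on GPU
```

Copying the result back to host (e.g. `logits.numpy()`) only once, after generation finishes, keeps the per-step transfer cost out of the loop.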
Example change for #2 (optimizer) is sketched below. Let us know whether the issue can be resolved using these two changes.
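A rough sketch of what invoking the ONNX Runtime transformers optimizer could look like; the file paths are placeholders, and num_heads=16 / hidden_size=1024 are assumptions for bart-large rather than values from this thread:

```python
from onnxruntime.transformers import optimizer

# Placeholder paths; num_heads/hidden_size assume bart-large (16 heads, 1024 hidden).
optimized_model = optimizer.optimize_model(
    "bart.onnx",
    model_type="bert",  # the model type whose suitability for BART is questioned below
    num_heads=16,
    hidden_size=1024,
)
optimized_model.save_model_to_file("bart_optimized.onnx")
```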
@wangyems & @tianleiwu truly appreciate the inputs & revised code. Working on changing the variable bindings and understanding the optimized code. I tried running code with optimization set as 'BERT' the model fails with below error, the model we are looking here is BART (Text Generation, producing summaries) not BERT. Is BERT the right choice? If not, is there support for BART? If not, can we instead use gpt-x being close match? |
I am curious about this... we're using T5, which is similar to BART.
@sam-writer, the merged PR #8698 shall improve BART performance. For T5, the DecoderAttention operator needs a slight change.
This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the bug
@hariharans29, creating a new issue per your suggestion.
Stemming from the earlier issue: after enabling CUDAExecutionProvider, inference performance appears to have degraded post ONNX conversion.
Native PyTorch model performance: 676 ms
ONNX model performance: 2.78 s
Urgency
Our project went live this weekend, and this performance regression is hammering us.
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux lab-am-vm 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
ONNX Runtime installed from (source or binary): PIP Install
ONNX Runtime version: onnxruntime-gpu==1.7.0
Python version: 3.7.10
CUDA/cuDNN version: release 11.0, V11.0.194
GPU model and memory: NVIDIA T4, 16 GB
To Reproduce
Attached is the full script/Jupyter notebook to reproduce and analyze. Please look at cell #7 onwards.
bart_onnx-am.ipynb.zip
Expected behavior
Performance should be significantly better than native PyTorch.