
ONNX Exported BART Model Performance Is Degraded Compared to Native PyTorch on T4 #7796

Open
anshoomehra opened this issue May 21, 2021 · 7 comments
Labels: stale (issues that have not been addressed in a while; categorized by a bot)

Comments

@anshoomehra

anshoomehra commented May 21, 2021

Describe the bug
@hariharans29 creating a new issue as per your suggestion.

Stemming from the earlier issue: even after enabling CUDAExecutionProvider, inference performance appears to have degraded after the ONNX conversion.

PyTorch native model performance: 676 ms
ONNX model performance: 2.78 s

Urgency
Our project went live this weekend, and this performance regression is hurting us.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux lab-am-vm 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

  • ONNX Runtime installed from (source or binary): PIP Install

  • ONNX Runtime version:
    onnxruntime-gpu==1.7.0

  • Python version: 3.7.10

  • CUDA/cuDNN version: release 11.0, V11.0.194

  • GPU model and memory: NVIDIA T4, 16 GB

To Reproduce
Attached is a full script/Jupyter notebook to reproduce and analyze. Please look at cell 7 onwards.
bart_onnx-am.ipynb.zip

Expected behavior
Performance should be significantly better than native PyTorch.

@wangyems
Contributor

Thanks @anshoomehra for providing the repro! I'll take a look.

@wangyems
Copy link
Contributor

Hi @anshoomehra, based on your code there are two things you can do to improve the performance:

  1. Eliminate the multiple rounds of data copies from host to device when feeding the decoder inputs; see https://www.onnxruntime.ai/python/api_summary#iobinding.
  2. Apply the optimizer to each of the ONNX models, using 'bert' as the model_type: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers#model-optimizer

@tianleiwu
Contributor

@anshoomehra

For #1, you can refer to the GPT-2 example, which keeps inputs and outputs on the GPU during text generation (all intermediate tensors stay on the GPU and are not copied to the CPU):

```python
def onnxruntime_inference_with_binded_io(ort_session,
                                         inputs: Gpt2Inputs,
                                         output_buffers: Dict[str, torch.Tensor],
                                         output_shapes: Dict[str, List[int]],
                                         total_runs: int = 0,
                                         return_numpy: bool = True,
                                         include_copy_output_latency: bool = False):
    """Inference with IO binding. Returns outputs, and optional latency when total_runs > 0."""
    logger.debug("start onnxruntime_inference_with_binded_io")

    # Bind inputs and outputs to the onnxruntime session
    io_binding = Gpt2Helper.prepare_io_binding(ort_session, inputs.input_ids, inputs.position_ids,
                                               inputs.attention_mask, inputs.past,
                                               output_buffers, output_shapes)

    # Run onnxruntime with IO binding
    ort_session.run_with_iobinding(io_binding)
```

Example change for #2 (optimizer):
test_bart.zip

Let us know whether the issue can be resolved using these two changes.

@anshoomehra
Author

@wangyems & @tianleiwu truly appreciate the input and the revised code. I'm working on changing the variable bindings and understanding the optimized code. When I ran the code with the optimization model_type set to 'bert', the model failed with the error below. The model here is BART (text generation, producing summaries), not BERT. Is 'bert' the right choice? If not, is there support for BART? And if not, can we use a GPT variant as a close match?

[screenshot: optimizer error message]

@sam-writer

I am curious about this too... we're using T5, which is similar to BART.

@tianleiwu
Contributor

@sam-writer, the merged PR #8698 should improve BART performance. For T5, the DecoderAttention operator needs a slight change.

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

The stale bot added the stale label on Apr 16, 2022.