
ONNX Exported BART Model Performance Is Degraded Compared to Native PyTorch on T4 #7796

Open
anshoomehra opened this issue May 21, 2021 · 7 comments
Labels: stale (issues that have not been addressed in a while; categorized by a bot)

Comments

@anshoomehra

anshoomehra commented May 21, 2021

Describe the bug
@hariharans29 creating a new issue as per your suggestion.

Stemming from the earlier issue: even after enabling CUDAExecutionProvider, inference performance appears to have degraded after the ONNX conversion.

PyTorch native model performance: 676 ms
ONNX model performance: 2.78 s

Urgency
Our project went live this weekend, and this performance regression is hurting us.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux lab-am-vm 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

  • ONNX Runtime installed from (source or binary): PIP Install

  • ONNX Runtime version:
    onnxruntime-gpu==1.7.0

  • Python version: 3.7.10

  • CUDA/cuDNN version: release 11.0, V11.0.194

  • GPU model and memory: NVIDIA T4, 16 GB

To Reproduce
Attached is a full script/Jupyter notebook to reproduce and analyze. Please look at cell 7 onwards.
bart_onnx-am.ipynb.zip

Expected behavior
Performance should be significantly better than native PyTorch.

@wangyems
Contributor

Thanks @anshoomehra for providing the repro! I'll take a look.

@wangyems
Copy link
Contributor

Hi @anshoomehra, based on your code there are two things you can do to improve the performance:

  1. Eliminate the multiple rounds of data copies from host to device when feeding the decoder inputs; see https://www.onnxruntime.ai/python/api_summary#iobinding.
  2. Apply the optimizer to each of the ONNX models, using 'bert' as the model_type: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers#model-optimizer

@tianleiwu
Contributor

@anshoomehra

For #1, you can refer to the GPT-2 example, which keeps inputs and outputs on the GPU during text generation (all intermediate tensors stay on the GPU and are not copied to the CPU):

```python
def onnxruntime_inference_with_binded_io(ort_session,
                                         inputs: Gpt2Inputs,
                                         output_buffers: Dict[str, torch.Tensor],
                                         output_shapes: Dict[str, List[int]],
                                         total_runs: int = 0,
                                         return_numpy: bool = True,
                                         include_copy_output_latency: bool = False):
    """Inference with IO binding. Returns outputs, and optional latency when total_runs > 0."""
    logger.debug("start onnxruntime_inference_with_binded_io")

    # Bind inputs and outputs to the onnxruntime session
    io_binding = Gpt2Helper.prepare_io_binding(ort_session, inputs.input_ids, inputs.position_ids,
                                               inputs.attention_mask, inputs.past,
                                               output_buffers, output_shapes)

    # Run onnxruntime with IO binding
    ort_session.run_with_iobinding(io_binding)
```

Example change for #2 (optimizer):
test_bart.zip

Let us know whether the issue can be resolved using these two changes.

@anshoomehra
Author

@wangyems & @tianleiwu truly appreciate the input and the revised code. I'm working on changing the variable bindings and understanding the optimized code. When I ran the code with the optimization model_type set to 'bert', the model failed with the error below. The model here is BART (text generation, producing summaries), not BERT. Is 'bert' the right choice? If not, is there support for BART? And if not, can we use a GPT variant as a close match?

[screenshot: optimizer error message]

@sam-writer

I am curious about this too... we're using T5, which is similar to BART.

@tianleiwu
Contributor

@sam-writer, the merged PR #8698 should improve BART performance. For T5, the DecoderAttention operator needs a slight change.

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

The stale bot added the stale label on Apr 16, 2022.