Save ONNX model in file #4671

thiagocrepaldi · 2020-07-30T21:21:36Z

Must be merged after #4668

This PR:

Fixes ORTTrainer initialization when options=None
Implements save_as_onnx API
Extends unit tests to save, load and compare ONNX models
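The PR page does not show the fix for the `options=None` case. A minimal sketch of one common way to handle an optional options argument follows. `TrainerOptions`, `Trainer`, and their fields and default values are hypothetical stand-ins, not the actual ORTTrainer or ORTTrainerOptions API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainerOptions:
    """Hypothetical options object; fields and defaults are illustrative only."""
    device_id: str = "cpu"
    opset_version: int = 12


class Trainer:
    """Hypothetical trainer; not the real ORTTrainer class."""

    def __init__(self, model_name: str, options: Optional[TrainerOptions] = None):
        # Build a fresh default options object when the caller passes None,
        # so later attribute access never operates on a None value.
        if options is None:
            options = TrainerOptions()
        if not isinstance(options, TrainerOptions):
            raise TypeError("options must be a TrainerOptions instance or None")
        self.model_name = model_name
        self.options = options


# Both call styles end up with a fully populated options object.
default_trainer = Trainer("toy_model")
custom_trainer = Trainer("toy_model", TrainerOptions(device_id="cuda"))
```

Creating the default inside the constructor, rather than using a shared default instance in the signature, keeps each trainer's options independent of the others.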

rayankrish

LGTM

orttraining/orttraining/python/training/orttrainer.py

orttraining/orttraining/test/python/orttraining_test_orttrainer_frontend.py

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

* Add ORTTrainerOptions class for the new pytorch frontend (#4382)
  Add ORTTrainerOptions class and some placeholders
* Add _ORTTrainerModelDesc to perform validation for model description (#4416)
* Add Loss Scaler classes to the new frontend (#4306)
* Add TrainStepInfo used on the new frontend API (#4256)
* Add Optimizer classes to the new frontend (#4280)
* Add LRScheduler implementation (#4357)
* Add basic ORTTrainer API (#4435)
  This PR presents the public API for ORTTrainer for short-term development. It also validates and saves input parameters, which will be used in later stages such as building the ONNX model, post-processing the model and configuring the training session.
* Add opset_version into ORTTrainerOptions and change type of ORTTrainer.loss_fn (#4592)
* Update ModelDescription and minor fix on ORTTrainer ctor (#4605)
  * Update ModelDescription and minor fix on ORTTrainer/ORTTrainerOptions
    This PR keeps the public API intact, but changes how the model description is stored on the backend.
    Currently, users create a dict with two lists of tuples. One list is called 'inputs' and each of its tuples has the format tuple(name, shape). The second list is called 'outputs' and each of its tuples can be either tuple(name, shape) or tuple(name, shape, is_loss).
    With this PR, when this dict is passed in to ORTTrainer, it is fully validated as usual. However, tuples are internally replaced by namedtuples, and all output tuples have the format tuple(name, shape, is_loss) instead of is_loss being optionally present.
    In addition to that normalization of the internal representation (which eases coding), two internal methods were created to replace a namedtuple(name, shape) with namedtuple(name, shape, dtype) or namedtuple(name, shape, is_loss, dtype), depending on whether the tuple is an input or an output. This is necessary because ORTTrainer finds out the data type of each input/output during model export to ONNX.
    Finally, a minor fix was made to ORTTrainer: it could initialize ORTTrainerOptions incorrectly when options=None.
  * Rename input name for test
* Add ONNX Model Export to New Frontend (#4612)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
  Co-authored-by: Thiago Crepaldi <[email protected]>
* Create training session + minor improvements (#4668)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Save ONNX model in file (#4671)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add eval step (#4674)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add train_step (#4677)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add LR Scheduler (#4694)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
  Co-authored-by: Thiago Crepaldi <[email protected]>
* Add deterministic compute tests (#4716)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
  Co-authored-by: Thiago Crepaldi <[email protected]>
* Add legacy vs experimental ORTTrainer accuracy comparison (#4727)
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
  Co-authored-by: Thiago Crepaldi <[email protected]>
* Add Mixed precision/LossScaler + several fixes (#4739)
  In addition to the mixed precision/loss scaler code, this PR includes:
  * Fix CUDA training
  * Add optimization_step into TrainStepInfo class
  * Refactor LRScheduler to use optimization_step instead of step
  * Update several default values in ORTTrainerOptions
  * Add initial Gradient Accumulation support (untested)
  * Fix ONNX model post processing
  * Refactor unit tests
* Add ONNX BERT example + minor fixes (#4757)
  * Fix training issue when passing ONNX file into ORTTrainer
  Co-authored-by: Thiago Crepaldi <[email protected]>
  Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Add Dynamic Shape support (#4758)
* Update DeepSpeed Zero Stage option to a separate option group (#4772)
* Add support to fetches (#4777)
* Add Gradient Accumulation Steps support (#4793)
* Fix Dynamic Axes feature and add unit test (#4795)
* Add frozen weights test (#4807)
* Move new pytorch front-end to 'experimental' namespace (#4814)
* Fix build

Co-authored-by: Rayan-Krishnan <[email protected]>
Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
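The model description normalization described for #4605 can be sketched roughly as follows. This is an illustration only: the namedtuple type names and the `normalize_model_desc` helper are hypothetical, not the internal ORTTrainer implementation, and the later step that attaches a dtype during ONNX export is not shown.

```python
from collections import namedtuple

# Hypothetical internal record types; the names are illustrative.
InputDesc = namedtuple("InputDesc", ["name", "shape"])
OutputDesc = namedtuple("OutputDesc", ["name", "shape", "is_loss"])


def normalize_model_desc(model_desc):
    """Convert plain tuples into namedtuples, filling in is_loss=False when absent."""
    inputs = [InputDesc(*item) for item in model_desc["inputs"]]
    outputs = []
    for item in model_desc["outputs"]:
        if len(item) == 2:
            # Output given as (name, shape): is_loss was omitted.
            name, shape = item
            is_loss = False
        elif len(item) == 3:
            # Output given as (name, shape, is_loss).
            name, shape, is_loss = item
        else:
            raise ValueError(f"unexpected output description: {item!r}")
        outputs.append(OutputDesc(name, shape, bool(is_loss)))
    return {"inputs": inputs, "outputs": outputs}


# A user-facing description mixing both accepted output formats.
desc = {
    "inputs": [("x", [1, 784])],
    "outputs": [("loss", [], True), ("logits", [1, 10])],
}
normalized = normalize_model_desc(desc)
```

After normalization every output record carries an `is_loss` field, so downstream code can read `output.is_loss` without checking the tuple length first.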

thiagocrepaldi added the component:training-frontend label Jul 30, 2020

thiagocrepaldi requested review from liqunfu, BowenBao and rayankrish July 30, 2020 21:21

thiagocrepaldi requested a review from a team as a code owner July 30, 2020 21:21

thiagocrepaldi self-assigned this Jul 30, 2020

thiagocrepaldi linked an issue Jul 30, 2020 that may be closed by this pull request

[WIP] New PyTorch frontend API #4176

Closed

rayankrish reviewed Jul 31, 2020


Rayan Krishnan and others added 11 commits July 30, 2020 17:46

initial onnx model export

2b6ed6c

fix onnx export, add init session

1548112

Add opset version to ORTTrainerOptions and update ORTTrainer.loss_fn

ffd7fa0

Rebase feature branch

7f73354

Add debug flag, set gpu device/mem_limit and minor refactoring

fec6f16

initial onnx model export

9b41b0f

fix onnx export, add init session

51ca5b7

Initial version - not tested

0a474f2

Remove dead code

407e257

Fix save as onnx

feb8cc6

Fix ORTTrainer init and minor refactor

83b0f48

thiagocrepaldi force-pushed the thiagofc/new_frontend/save_as_onnx branch from e1fc1aa to 83b0f48 Compare July 31, 2020 00:52

liqunfu reviewed Jul 31, 2020


orttraining/orttraining/python/training/orttrainer.py

liqunfu approved these changes Jul 31, 2020


orttraining/orttraining/test/python/orttraining_test_orttrainer_frontend.py

thiagocrepaldi merged commit 771d979 into feature/new_pytorch_frontend Jul 31, 2020

thiagocrepaldi deleted the thiagofc/new_frontend/save_as_onnx branch July 31, 2020 21:39

thiagocrepaldi mentioned this pull request Aug 10, 2020

[WIP] New PyTorch frontend API #4176

Closed

thiagocrepaldi removed a link to an issue Aug 10, 2020

[WIP] New PyTorch frontend API #4176

Closed

thiagocrepaldi pushed a commit that referenced this pull request Aug 12, 2020

Save ONNX model in file (#4671)

1f5955f

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

thiagocrepaldi pushed a commit that referenced this pull request Aug 14, 2020

Save ONNX model in file (#4671)

b7b4f5f

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

thiagocrepaldi pushed a commit that referenced this pull request Aug 15, 2020

Save ONNX model in file (#4671)

4e24aac

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>

thiagocrepaldi pushed a commit that referenced this pull request Aug 15, 2020

Save ONNX model in file (#4671)

88f0dbf

Co-authored-by: Rayan Krishnan <t-rakr@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
