Releases: microsoft/Olive
Olive-ai 0.7.1
Command Line Interface
New command line tools have been added and existing tools have been improved.
- olive --help works as expected.
- auto-opt:
  - The command chooses a set of passes compatible with the provided model type, precision and accelerator information.
  - New options to split a model, either using --num-splits or --cost-model.
Improvements
- ExtractAdapters:
  - Support lora adapter nodes in Stable Diffusion unet or text-embedding models.
  - Default initializers for quantized adapter to run the model without adapter inputs.
- GPTQ:
  - Avoid saving unused bias weights (all zeros).
  - Set use_exllama to False by default to allow exporting and fine-tuning external GPTQ checkpoints.
- AWQ: Patch autoawq to run quantization on newer transformers versions.
- Atomic SharedCache operations
- New CaptureSplitInfo and Split passes to split models into components. Number of splits can be user provided or inferred from a cost model.
- disable_search is deprecated from pass configuration in an olive workflow config.
- OrtSessionParamsTuning redone to use olive search features.
- OrtModelOptimizer renamed to OrtPeepholeOptimizer and some bug fixes.
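The split notes above say the number of splits can be user provided or inferred from a cost model. As a rough illustration only, the sketch below groups consecutive layers greedily under a per-split budget. The function `assign_splits`, the per-layer costs and the budget are hypothetical; this is not Olive's cost model and not the behavior of the Split pass.

```python
def assign_splits(layer_costs, max_cost_per_split):
    """Greedily group consecutive layers so each group stays within the budget.

    Hypothetical helper: layer_costs is a list of per-layer costs (for example
    memory in MB) and max_cost_per_split is an assumed per-split budget.
    """
    splits, current, running = [], [], 0
    for index, cost in enumerate(layer_costs):
        # Start a new split when adding this layer would exceed the budget.
        if current and running + cost > max_cost_per_split:
            splits.append(current)
            current, running = [], 0
        current.append(index)
        running += cost
    if current:
        splits.append(current)
    return splits


# Made-up costs for five layers and a made-up budget of 8 units per split.
splits = assign_splits([4, 4, 4, 4, 4], max_cost_per_split=8)
print(len(splits), splits)  # 3 [[0, 1], [2, 3], [4]]
```

A user-provided split count skips this inference step entirely; the budget-driven variant only matters when the count is left to be derived.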
Examples:
- Stable Diffusion: New MultiLora Example
- Phi3: New int quantization example using nvidia-modelopt
Olive-ai 0.7.0
Command Line Interface (CLI)
Olive now has a command line interface (CLI) that runs well-defined, concrete workflows without the user ever having to create or edit a config manually. CLI workflow commands can be chained, i.e. the output of one execution can be fed as input to the next, which simplifies operating the entire pipeline. A few of the CLI workflow commands are listed below:
- finetune: Fine-tune a model on a dataset using peft and optimize the model for ONNX Runtime.
- capture-onnx-graph: Capture ONNX graph for a Huggingface model.
- auto-opt: Automatically optimize a model for performance.
- quantize: Quantize model using given algorithm for desired precision and target.
- tune-session-params: Automatically tune the session parameters for an ONNX model.
- generate-adapter: Generate ONNX model with adapters as inputs.
Improvements
- Added support for yaml based workflow config
- Streamlined DataConfig management
- Simplified workflow configuration
- Added shared cache support for intermediate models and supporting data files
- Added QuaRoT quantization pass for PyTorch models
- Added support to evaluate generative PyTorch models
- Streamlined support for user-defined evaluators
- Enabled use of lm-evaluation-harness for generative model evaluations
Examples
- Llama
- Updated multi-lora example to use ORT generate() API
- Updated to demonstrate use of shared cache
- Phi3
- Updated to demonstrate evaluation using lm-eval harness
- Updated to showcase search across three different QLoRA ranks
- Added Vision tutorial
Olive-ai 0.6.2
Workflow config
- Support YAML files as workflow config file. #1191
- The workflow id feature is a prerequisite for running a workflow on a remote VM. With this feature (#1179):
  - The cache dir becomes <cache_dir>/<workflow_id>.
  - The Olive config is automatically saved to the cache dir.
  - Users can specify workflow_id in the config file.
  - The default workflow_id is default_workflow.
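The cache layout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `resolve_cache_dir` is a hypothetical helper, not an Olive function, and only the `<cache_dir>/<workflow_id>` layout and the `default_workflow` default come from the note.

```python
from pathlib import Path

# Default id named in the release note.
DEFAULT_WORKFLOW_ID = "default_workflow"


def resolve_cache_dir(cache_dir, workflow_id=None):
    """Return <cache_dir>/<workflow_id>, falling back to the default id."""
    return Path(cache_dir) / (workflow_id or DEFAULT_WORKFLOW_ID)


# "bert_int8" is a made-up workflow id used only for illustration.
print(resolve_cache_dir("cache"))
print(resolve_cache_dir("cache", "bert_int8"))
```

Keeping each workflow in its own subdirectory means two workflows that share a cache dir do not overwrite each other's saved config.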
Passes (optimization techniques)
- Accept SNPE DLC model for qnn context binary generator #1188
Data
- Remove params_config, components/component_args. All component-specific parameters are now grouped in four separate objects: #1187
- load_dataset_config
- pre_process_data_config
- post_process_data_config
- dataloader_config
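To make the regrouping concrete, here is a small sketch that flags the removed keys in a data config dictionary. The helper `find_removed_keys`, the config name, and the parameters placed inside each object are placeholders for illustration; only the four object names and the removed keys come from the note above.

```python
# Keys the release note lists as removed from data configs.
REMOVED_KEYS = {"params_config", "components"}
# The four objects that now hold component-specific parameters.
COMPONENT_KEYS = (
    "load_dataset_config",
    "pre_process_data_config",
    "post_process_data_config",
    "dataloader_config",
)


def find_removed_keys(data_config):
    """Return the removed keys still present in a data config, sorted."""
    return sorted(data_config.keys() & REMOVED_KEYS)


# New-style config; the inner parameters are placeholders, not a schema.
new_style = {
    "name": "example_data",
    "load_dataset_config": {"split": "validation"},
    "pre_process_data_config": {"max_length": 128},
    "post_process_data_config": {},
    "dataloader_config": {"batch_size": 1},
}
# Old-style config that still uses the removed keys.
old_style = {"name": "example_data", "params_config": {}, "components": {}}

print(find_removed_keys(new_style))  # []
print(find_removed_keys(old_style))  # ['components', 'params_config']
```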
Docs
- Add olive workflow schema to doc website. This schema file can be used in IDEs when writing workflow configs. #1190
Olive-ai 0.6.1
Olive-ai 0.6.0
Examples
The following examples are added:
- Add LLM sample for DirectML #1082 #1106
- This adds an LLM sample for DirectML that can convert and quantize a bunch of LLMs from HuggingFace. The Dolly, Phi and LLaMA 2 folders were removed and replaced with a more generic LLM example that supports a large number of LLMs, including but not limited to Phi-2, Mistral, LLaMA 2
- Add Gemma to DML LLM sample #1138
- Llama2 optimization with multi-ep managed env #1087
- Llama2: Multi-lora example notebook, Custom generator #1114
- Search Optimal optimization among multiple EPs #1092
Olive CLI updates
- Previous commands python -m olive.workflows.run and python -m olive.platform_sdk.qualcomm.configure are deprecated. Use olive run or python -m olive instead. #1129
Passes (optimization techniques)
- Pytorch
- ONNXRuntime
- ExtractAdapters pass supports int4 quantized models and exposes the external data config options to users. #1083
- ModelBuilder: Converts a Huggingface/AML generative PyTorch model to ONNX model using the ONNX Runtime Generative AI >= 0.2.0. #1089 #1073 #1110 #1112 #1118 #1130 #1131 #1141 #1146 #1147 #1154
- OnnxFloatToFloat16: Use ort float16 converter #1132
- NVModelOptQuantization: Quantize ONNX model with Nvidia-ModelOpt. #1135
- OnnxIOFloat16ToFloat32: Converts float16 model inputs/outputs to float32. #1149
- [Vitis AI] Make Vitis AI techniques compatible with ORT 1.18 #1140
Data Config
- Remove name ambiguity in dataset configuration #1111
- Remove HfConfig::dataset references in examples and tests #1113
Engine
- Add aml deployment packaging. #1090
System
- Make the accelerator EP optional in olive systems for non-onnx pass. #1072
Data
- Add AML resource support for data configs.
- Add audio classification data preprocess function.
Model
- Provide built-in kv_cache_config for generative model's io_config #1121
- Convert MLFlow transformers models to huggingface format so they can be consumed by the passes which need huggingface format. #1150
Metrics
Dependencies:
Support onnxruntime 1.17.3
Issues
Olive-ai 0.5.2
Examples
The following examples are added
Passes (optimization techniques)
- SliceGPT: A post-training sparsification scheme that makes transformer networks smaller. It applies an orthogonal transformation to each transformer layer and then slices off the least-significant rows and columns of the weight matrices, which reduces model size, speeds up inference, and lowers the memory footprint.
- ExtractAdapters: Extracts the lora adapter weights (float or static quantized) and saves them in a separate file.
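The rotate-then-slice idea behind the SliceGPT note can be shown on a toy example. This is only a sketch of the underlying linear algebra, assuming a single weight matrix and hand-picked data; it is not the Olive pass or the full SliceGPT procedure, which derives its transformations from real activations for every layer.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


# Toy activations: every row lies along the direction (1, 1).
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# Toy weight matrix of the layer that consumes X.
W = [[1.0, 2.0], [3.0, 4.0]]

# Orthogonal matrix whose first column is the dominant direction of X.
s = 1.0 / math.sqrt(2.0)
Q = [[s, -s], [s, s]]

reference = matmul(X, W)

# Rotating activations by Q and weights by Q^T leaves the output unchanged.
X_rot = matmul(X, Q)
W_rot = matmul(transpose(Q), W)
rotated = matmul(X_rot, W_rot)

# Slice: keep the first rotated column of X and the first rotated row of W.
X_sliced = [row[:1] for row in X_rot]
W_sliced = W_rot[:1]
sliced = matmul(X_sliced, W_sliced)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


print(max_abs_diff(reference, rotated))
print(max_abs_diff(reference, sliced))
```

Here the sliced product matches the reference up to rounding because the toy activations have no component along the discarded direction. With real activations the discarded directions carry some signal, so slicing trades a small error for smaller matrices.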
Engine
- Simplify the engine config
Fix
- GenAIModelExporter: On Windows, the cache_dir path of the genai model exporter could exceed the 260-character path limit.
Olive-ai 0.5.1
Examples
The following examples are added
Passes (optimization techniques)
- QNNPreprocess: Add the configs which are added in onnxruntime nightly package.
- GptqQuantizer: PTQ quantization using Hugging Face Optimum and export model with onnxruntime optimized kernel.
- OnnxMatMul4Quantizer: Add matmul RTN/HQQ/GPTQ quant configs.
- Move all passes that need to create an inference session to run on the target:
- IncQuantization
- OptimumMerging
- OrtTransformersOptimization
- VitisAIQuantization
- OrtPerfTuning
Engine
- Support to pack AzureML output.
- Remove execution_providers from engine config. A typical config now looks like:
{
    "systems": {
        "local_system": {
            "type": "LocalSystem",
            "config": {
                "accelerators": [
                    {
                        "device": "gpu",
                        "execution_providers": [
                            "CUDAExecutionProvider"
                        ]
                    }
                ]
            }
        }
    },
    "engine": {
        "host": "local_system",
        "target": "local_system"
    }
}
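For configs written before this change, the move can be sketched as a small dictionary rewrite. The helper `move_execution_providers` is hypothetical and not part of Olive; it assumes the old config kept an `execution_providers` list directly under `engine`, and that the caller supplies the device because the old layout did not record it next to the providers.

```python
import copy


def move_execution_providers(config, system_name="local_system", device="gpu"):
    """Return a copy of config with engine.execution_providers moved into
    the accelerators list of the named system (hypothetical helper)."""
    new_config = copy.deepcopy(config)
    engine = new_config.setdefault("engine", {})
    providers = engine.pop("execution_providers", None)
    if providers:
        systems = new_config.setdefault("systems", {})
        system = systems.setdefault(system_name, {"type": "LocalSystem", "config": {}})
        system.setdefault("config", {})["accelerators"] = [
            {"device": device, "execution_providers": providers}
        ]
        # Point the engine at the system that now carries the providers.
        engine.setdefault("host", system_name)
        engine.setdefault("target", system_name)
    return new_config


old_config = {"engine": {"execution_providers": ["CUDAExecutionProvider"]}}
new_config = move_execution_providers(old_config)
print(new_config)
```

The input dictionary is deep-copied first, so the old config stays untouched and can be kept for comparison.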
Workflows
- Delayed python pass module loading and provide the option --package-config to let advanced users write their individual pass modules and corresponding dependencies.
Fix
- Cannot load MLFlow model as from_pretrained_args is missing.
- LoRA: Provide save_embedding_layers=False when saving the peft model. Otherwise, it defaults to "auto" which checks if the vocab size changed.
- Update the model_rank file for zipfile packaging type. The model path now is the path relative to the output zip file.
- Fix shutil.which returning None on Windows when passed the full python path.
Olive-ai 0.5.0
Examples
The following examples are added:
- Audio Spectrogram Transformer optimization #762
- Bert SNPE #925
- Llama2 GenAI #940
- Llama2 notebook tutorial #798
- MobileNet optimization with QDQ Quantization on Qualcomm NPU #874
- Phi2 Generation #979
- Phi2 optimization with different precision #938
- Stable Diffusion OpenVINO example #853
Passes (optimization techniques)
New Passes
- PyTorch
- Introduce GenAIModelExporter pass to export a PyTorch model using GenAI exporter.
- Introduce LoftQ pass which performs model fine-tuning using the LoftQ initialization proposed in https://arxiv.org/abs/2310.08659.
- ONNXRuntime
- Introduce DynamicToFixedShape pass to convert dynamic shape to fixed shape for ONNX model.
- Introduce OnnxOpVersionConversion pass to convert an existing ONNX model with another target opset.
- [QNN-EP] Add the option of prepare_qnn_config:bool for quantization under QNN-EP where the int16/uint16 are supported both for weights and activation.
- [QNN-EP] Introduce QNNPreprocess pass to preprocess the model before quantization.
- QNN
- Introduce QNNConversion pass to convert models to QNN C++ model.
- Introduce QNNContextBinaryGenerator pass to generate the context binary from a compiled model library using a specific backend.
- Introduce QNNModelLibGenerator pass to compile the C++ model into a model library for the desired target.
Updates
- OnnxConversion
- Support both past_key_values.index.key/value and past_key_value.index.
- OptimumConversion
- Provide parameter components if the user wants to export only some models such as decoder_model and decoder_with_past_model.
- Uses the default exporter args and behavior of the underlying optimum version. For versions 1.14.0+, this means legacy=False and no_post_process=False. User must provide them using extra_args if legacy behavior is desired.
- OpenVINO
- Upgrade OpenVINO API to 2023.2.0.
- OrtPerfTuning
- Add tunable_op_enable and tunable_op_tuning_enable for ROCM ep to speed up the performance.
- LoRA/QLoRA
- Support bfloat16 with ort-training.
- Support resuming training from checkpoint using the resume_from_checkpoint and overwrite_output_dir options.
- MoEExpertsDistributor
- Add option to configure number of parallel jobs.
Engine
- For Zipfile packaging, add a models rank json file. This file ranks all output models from different EPs and includes model_config and metrics.
- Add Auto Optimizer, a tool that automatically searches for combinations of Olive passes.
System
- Add hf_token support for Olive systems.
- AzureMLSystem
- Olive config file will be uploaded to AML jobs under codes folder.
- Support adding tags to the AML jobs.
- Support using existing AML workspace Environment for AzureMLSystem.
- DockerSystem
- Support running Olive Pass.
- PythonEnvironmentSystem requires Olive to be installed in the environment. It can run passes and evaluate models.
- New IsolatedORTSystem introduced that only supports evaluation of ONNX models. It requires onnxruntime to be installed in the environment. It can be used for packages like onnxruntime-qnn, which can only be run in a Windows ARM64 python environment.
Data
- Add AML resource support for data configs.
- Add audio classification data preprocess function.
Model
- Rename model_loading_args to from_pretrained_args in hf_config.
Metrics
- Add throughput metric support.
Dependencies:
Support onnxruntime 1.17.1.
Olive-ai 0.4.0
Examples
The following examples are added
- Llama2 optimization with ONNX Runtime Tools #641
- Llama2 finetuning with QLoRA and optimization with ONNX Runtime Tools #703
- Llama2 shard to multiple GPUs #694
- DirectML Llama2 #701
- DirectML phi #693
- phi-1.5 finetuning with QLoRA #689
Passes (optimization techniques)
- OrtPerfTuning
- Raises known failure exceptions to immediately stop tuning.
- Default values for device and providers_list are based on the accelerator spec.
- OrtTransformersOptimization
- Checks that model_type is provided in the pass configs or available in the model attributes. None is invalid.
- fp16 related arguments are better documented.
- Introduce LoRA pass for finetuning pytorch models with Low-Rank Adaptation
- Introduce OnnxMatMul4Quantizer pass to quantize onnx models to 4-bit integers.
- Introduce OnnxBnb4Quantization pass to quantize onnx models to 4-bit data types from bitsandbytes (FP4, NF4).
- Onnx external data configuration supports size_threshold and convert_attribute parameters.
- LlamaPyTorchTensorParallel pass to split Llama model into a tensor parallel distributed pytorch model.
- OnnxConversion
- Support DistributedPyTorchModel.
- use_device and torch_dtype options to specify device ("cpu", "cuda") and data type ("float16", "float32") for the model before conversion.
- DeviceSpecificOnnxConversion removed in favor of OnnxConversion pass with use_device option.
- LoRA/QLoRA
- Support training using ONNX Runtime Training.
- Mixed-precision training when torch_dtype=float16 for numerical stability.
Engine
- Make engine/evaluator config optional in olive run config. By default, users can then run optimization without search or evaluation using the simplest pass config.
- evaluate_input_model is optional in engine config in no-search mode. It is forced to False when no evaluator is provided.
- ort_py_log_severity_level option to control logging level for onnxruntime python logs.
- CLI option --tempdir to use a custom directory as the root directory for tempfile.
- IO-Binding:
- New method to efficiently bind inputs and outputs to the session using either the CPU or GPU depending on the device.
- shared_kv_buffer option to enable key value buffer sharing between input (past key values) and output (present key values)
Model
- DistributedOnnxModel file structure updated to use resource paths. Can be saved from cache to destination directory.
- Introduce DistributedPyTorchModel that is analogous to DistributedOnnxModel for pytorch model.
- trust_remote_code added to HFConfig model_loading_args.
Metrics
- Option to provide kwargs to user_script functions through func_kwargs
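One way to picture the func_kwargs option is a mapping from function name to the keyword arguments that function should receive. The sketch below assumes that shape. The dispatcher `call_user_function`, the user function `load_samples` and its parameters are all hypothetical and are not Olive's implementation.

```python
def call_user_function(func, func_kwargs=None):
    """Call a user_script function with the kwargs registered under its name.

    Assumes func_kwargs maps a function name to a dict of keyword arguments.
    """
    kwargs = (func_kwargs or {}).get(func.__name__, {})
    return func(**kwargs)


def load_samples(batch_size=1, limit=None):
    """Hypothetical user_script function that builds batches of sample ids."""
    ids = list(range(limit if limit is not None else 4))
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# Hypothetical func_kwargs entry keyed by the function name.
func_kwargs = {"load_samples": {"batch_size": 2, "limit": 5}}

print(call_user_function(load_samples, func_kwargs))  # [[0, 1], [2, 3], [4]]
print(call_user_function(load_samples))               # [[0], [1], [2], [3]]
```

Keying the arguments by function name lets one config carry settings for several user functions without changing their signatures.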
Dependencies:
- Support onnxruntime 1.16.2
Olive-ai 0.3.3
Quick fix for v0.3.2
- Vitis AI quantization support ORT 1.16.1
- Add optional attention mask for text-generation task