
Accept device and dtype in OnnxConversion, Add OnnxBnb4Quantization, llama2 e2e qlora example #703

Merged: 8 commits merged into main from jambayk/llama-qlora-quant on Nov 8, 2023

Conversation

@jambayk (Contributor) commented on Nov 7, 2023

Describe your changes

OnnxConversion

  • The code has been refactored so that the methods _convert_model_on_device and _convert_distributed_model_on_device have similar signatures and behavior
  • Added docstrings for these methods
  • The refactoring also simplifies the logic for composite models
  • New config parameters make the pass more flexible (see the sketch after this list):
    • use_device: the device to run conversion on, e.g., a GPU when there is enough GPU memory to speed up conversion or when the model is float16.
    • torch_dtype: the torch data type to cast the model to before conversion, e.g., when the model is originally float16 but we want to convert on CPU and run a later pass to cast the ONNX model back to float16.
  • The pass is more aware of models loaded via hf_config:
    • Checks and updates torch_dtype
    • Checks whether the model is quantized with bitsandbytes
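Roughly how the new parameters might look in a pass config, sketched as an annotated Python dict. Only use_device and torch_dtype come from this PR; the surrounding keys and example values are illustrative assumptions:

```python
# Hypothetical pass-config sketch; only the "use_device" and "torch_dtype"
# parameter names are from this PR, the other keys and values are assumed.
conversion_pass = {
    "type": "OnnxConversion",
    "config": {
        "target_opset": 14,        # assumed value, for illustration only
        "use_device": "cuda",      # new: run conversion on GPU when memory allows
        "torch_dtype": "float32",  # new: cast e.g. a float16 model before CPU export
    },
}
```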

OnnxBnb4Quantization

  • New quantization pass that quantizes a model to NF4/FP4 using the MatMulBnb4 quantizer and contrib op (a hedged config sketch follows this list)
  • It can pick up settings from both the pass config and model attributes such as quantized_modules and quantization_config
  • With this pass, we usually don't want to quantize all MatMul nodes; to retain accuracy, only the modules that were originally quantized in the source model should be quantized
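A minimal configuration sketch for the new pass, with the same caveat: quant_type is an assumed parameter name inferred from the nf4/fp4 description, not confirmed by this PR:

```python
# Hypothetical sketch of configuring the new pass. "quant_type" is an assumed
# parameter name; per the description, the pass can alternatively pick up
# quantized_modules / quantization_config from the input model's attributes.
bnb4_pass = {
    "type": "OnnxBnb4Quantization",
    "config": {
        "quant_type": "nf4",  # "nf4" or "fp4", per the pass description
    },
}
```

When the source model carries quantized_modules, restricting quantization to those modules rather than every MatMul node is what preserves accuracy.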

llama2

  • Added an end-to-end QLoRA + ORT optimization workflow (see the sketch below)
  • Updated the README

[Image: qlora-e2e workflow]
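One plausible ordering of the passes described above, again as an annotated Python sketch; the QLoRA and OrtTransformersOptimization pass names and all config values are assumptions about the example, not confirmed by this PR:

```python
# Hypothetical end-to-end pass chain for the llama2 QLoRA example.
# "QLoRA" and "OrtTransformersOptimization" are assumed pass names.
passes = {
    "qlora": {"type": "QLoRA"},                           # fine-tune with QLoRA
    "conversion": {
        "type": "OnnxConversion",
        "config": {"torch_dtype": "float32"},             # export in float32 on CPU
    },
    "optimize": {"type": "OrtTransformersOptimization"},  # ORT graph optimizations
    "requantize": {
        "type": "OnnxBnb4Quantization",
        "config": {"quant_type": "nf4"},                  # re-quantize the QLoRA modules
    },
}
```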

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Format your code by running `pre-commit run --all-files`
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
    The OnnxConversion pass has new parameters use_device and torch_dtype for more flexible conversion.
    New OnnxBnb4Quantization pass to quantize an ONNX model using FP4/NF4 data types.

(Optional) Issue link

Resolved review threads on: olive/model/__init__.py, examples/llama2/llama2_qlora.json, olive/passes/onnx/bnb_quantization.py, olive/passes/onnx/quantization.py
@jambayk merged commit e752dc0 into main on Nov 8, 2023 (31 checks passed)
@jambayk deleted the jambayk/llama-qlora-quant branch on November 8, 2023 05:36