
Cloud AI 100

Introduction to Qualcomm Transformers Library!

Train anywhere, Infer on Qualcomm Cloud AI with a Developer-centric Toolchain

This library provides reimplemented blocks of LLMs that make the models functional and highly performant on Qualcomm Cloud AI 100. Several models can be transformed directly from their pre-trained original form into a deployment-ready, optimized form. For other models, comprehensive documentation describes the changes needed, along with How-To guides.

Typically for LLMs, the library provides:

  1. Reimplemented blocks from Transformers which enable efficient on-device retention of intermediate states.
  2. Graph transformations to enable execution of key operations in lower precision.
  3. Graph transformations to replace some operations with mathematically equivalent operations.
  4. Handling for underflows and overflows in lower precision (a minimal sketch follows this list).
  5. Patcher modules to map the weights of the original model's operations onto the updated model's operations.
  6. Exporter module to export the model source into an ONNX graph.
  7. Sample example applications and demo notebooks.
  8. Unit test templates.
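
As an illustration of item 4, overflow handling in lower precision typically amounts to clamping constants and intermediate values into the representable fp16 range before down-casting. The snippet below is a minimal, generic sketch of that idea in PyTorch; it is not the library's actual transformation pass, and the function name clamp_to_fp16_range is ours.

import torch

def clamp_to_fp16_range(t: torch.Tensor) -> torch.Tensor:
    # Clamp a float32 tensor into the finite fp16 range before down-casting.
    # This mirrors the "clip the overflow constants to fp16" step mentioned later
    # in this README; illustrative sketch only, not the library's implementation.
    fp16_max = torch.finfo(torch.float16).max   # 65504.0
    fp16_min = torch.finfo(torch.float16).min   # -65504.0
    return t.clamp(min=fp16_min, max=fp16_max).to(torch.float16)

# Example: constants that would overflow in fp16 are clipped instead of becoming inf.
x = torch.tensor([1.0e5, -2.0e5, 3.14], dtype=torch.float32)
print(clamp_to_fp16_range(x))  # tensor([ 65504., -65504., 3.1406], dtype=torch.float16)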

It is mandatory for each Pull Request to include tests such as:

  1. If the PR adds support for a model, the tests should include successful execution of the model after the changes (the changes included as part of the PR) on PyTorch and ONNXRT. The success criterion is the MSE between the outputs of the original and updated models staying within an acceptable tolerance; a minimal sketch of such a check follows this list.
  2. If the PR modifies any common utilities, tests need to be included that execute the tests of all models included in the library.
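
For illustration, the check below sketches how that exit criterion might be expressed. It is a hypothetical example rather than the library's actual test suite; the function name, tolerance value, and tensor arguments are ours, and the two outputs would come from running the original and updated models on the same inputs (on PyTorch and/or ONNXRT).

import torch

def assert_outputs_match(ref_logits: torch.Tensor, new_logits: torch.Tensor, tolerance: float = 1e-6) -> None:
    # MSE between the original model's outputs and the updated model's outputs.
    # The tolerance here is an assumed value, not a library-defined constant.
    mse = torch.mean((ref_logits.float() - new_logits.float()) ** 2).item()
    print(f"MSE between original and updated model outputs: {mse:.3e}")
    assert mse <= tolerance, f"MSE {mse:.3e} exceeds tolerance {tolerance:.1e}"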

Validated Models

Models coming soon.

Requirements

System Requirements:

  1. Linux (Ubuntu, x86)
  2. Python 3.8.12 (preferred)

💡 Use a bash terminal.

📝 If using a ZSH terminal, "device_group" should be in single quotes, e.g. --device_group '[0]'

Installation

# Login to the Cloud AI 100 server.
ssh -X username@hostname

# Create a Python virtual environment and activate it (Python 3.8 required).
python3.8 -m venv qeff_env
source qeff_env/bin/activate

# Clone the QEfficient repo and move into it.
git clone <QEfficient repo URL>
cd efficient-transformers

# Install the qefficient library on the host machine (used for the CLI APIs),
# until docker is integrated into the Apps SDK.
pip install -e .
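
To confirm the editable install worked, a quick sanity check (assuming the package is importable as QEfficient, as used throughout this README):

# Run inside the activated virtual environment.
import QEfficient
print("QEfficient installed at:", QEfficient.__file__)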

Quick Start Guide

QEfficient Library was designed with one goal: to make onboarding models for inference straightforward for any Transformer architecture, while leveraging the complete power of the Cloud AI platform.

To achieve this, we provide two levels of APIs with different levels of abstraction.

  1. High-level APIs abstract away complex details, offering a simpler interface. They're ideal for quick development and prototyping. If you're new to a technology or want to minimize coding effort, high-level APIs are more user-friendly.

  2. Low-level APIs offer more granular control, ideal when customization is necessary. They are particularly useful for users bringing their own models that are not hosted on HF but are implemented on top of Transformers.

In summary:

  • Choose high-level APIs for quick development, simplicity, and ease of use.
  • Opt for low-level APIs when you need fine-tuned control, optimization, or advanced customization.

Using High Level APIs

High Level APIs, sample use, and arguments:

QEfficient.cloud.infer (sample use below)
  • model_name : $\color{green} {Mandatory}$
  • num_cores : $\color{green} {Mandatory}$
  • device_group : $\color{green} {Mandatory}$
  • batch_size : Optional [Default-1]
  • prompt_len : Optional [Default-32]
  • ctx_len : Optional [Default-128]
  • mxfp6 : Optional
  • hf_token : Optional
  • cache_dir : Optional ["cache_dir" in current working directory]
  • prompt : Optional [Default-"My name is"]

QEfficient.cloud.execute (sample use below)
  • model_name : $\color{green} {Mandatory}$
  • device_group : $\color{green} {Mandatory}$
  • qpc_path : $\color{green} {Mandatory}$
  • prompt : Optional [Default-"My name is"]
  • cache_dir : Optional ["cache_dir" in current working directory]
  • hf_token : Optional

    1. Use QEfficient.cloud.infer

     This is the single end-to-end (e2e) Python API in the library: it takes the model-card name as input, along with other compile args if necessary, and does everything in one go.

     • Torch Download → Optimize for Cloud AI 100 → Export to ONNX → Verify (CPU) → Compile on Cloud AI 100 → Execute
     • It skips the ONNX export/compile stage if an ONNX file or QPC is already found on the path.
    # Check out the options using the help menu
    python -m QEfficient.cloud.infer --help
    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first  
     
     # If executing with batch_size > 1, pass the input prompts as a single string separated by the pipe (|) symbol. Example below:
    
    python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first  

    2. Use QEfficient.cloud.execute

     Once we have compiled the QPC, we can use the precompiled QPC in the execute API to run different prompts, like below:

     python -m QEfficient.cloud.execute --model_name gpt2 --qpc_path qeff_models/gpt2/qpc_16cores_1BS_32PL_128CL_1devices_mxfp6/qpcs/ --prompt "Once upon a time in" --device_group [0]

     We can also enable MQ (multi-device execution), based simply on the number of devices. From the "--device_group" input, a tensor-slicing (TS) config is created on the fly: with "--device_group [0,1]" a TS config for 2 devices is created and used for compilation, while with "--device_group [0]" TS compilation is skipped and single-SoC execution is enabled.

     python -m QEfficient.cloud.infer --model_name Salesforce/codegen-2B-mono --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0,1] --prompt "def fibonacci(n):" --mos 2 --aic_enable_depth_first
     
     # Once the QPC is saved, you can use the execute API to run different prompts
     python -m QEfficient.cloud.execute --model_name Salesforce/codegen-2B-mono --qpc_path qeff_models/Salesforce/codegen-2B-mono/qpc_16cores_1BS_32PL_128CL_2devices_mxfp6/qpcs --prompt "def binary_search(array: np.array, k: int):" --device_group [0,1]
     
     # To disable MQ, just pass a single SoC like below:
     python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 1 --prompt_len 32 --ctx_len 128 --mxfp6 --num_cores 16 --device_group [0] --prompt "My name is" --mos 1 --aic_enable_depth_first

     High Level APIs: Single SoC vs Tensor Slicing examples

     QEfficient.cloud.infer
     • Single SoC: python -m QEfficient.cloud.infer --model_name $\color{green}{model}$ --batch_size 8 --prompt_len 128 --ctx_len 1024 --num_cores 16 --device_group [0] --prompt "My name is" --mxfp6 --hf_token $\color{green}{xyz}$ --mos 1 --aic_enable_depth_first
     • Tensor Slicing: python -m QEfficient.cloud.infer --model_name $\color{green}{model}$ --batch_size 8 --prompt_len 128 --ctx_len 1024 --num_cores 16 --device_group [0,1,2,3] --prompt "My name is" --mxfp6 --hf_token $\color{green}{xyz}$ --mos 4 --aic_enable_depth_first

     QEfficient.cloud.execute
     • Single SoC: python -m QEfficient.cloud.execute --model_name $\color{green}{model}$ --device_group [0] --qpc_path $\color{green}{path}$ --prompt "My name is" --hf_token $\color{green}{xyz}$
     • Tensor Slicing: python -m QEfficient.cloud.execute --model_name $\color{green}{model}$ --device_group [0,1,2,3] --qpc_path $\color{green}{path}$ --prompt "My name is" --hf_token $\color{green}{xyz}$

     📝 Replace $\color{green}{model}$, $\color{green}{path}$ and $\color{green}{xyz}$ with your preferred model-card name, QPC path and HF token respectively.

Using Low Level APIs

Low Level APIs, sample use, and arguments:

QEfficient.transform (sample use below)
  • model : $\color{green} {Mandatory}$
  • type : Optional [Default-"Transformers"]
  • form_factor : Optional [Default-"cloud"]

qualcomm_efficient_converter (sample use below)
  • model_name : $\color{green} {Mandatory}$
  • model_kv : $\color{green} {Mandatory}$ [Optional when model_class passed]
  • model_class : $\color{green} {Mandatory}$ [Optional when model_kv passed]
  • tokenizer : Optional
  • onnx_path : Optional
  • hf_token : Optional
  • seq_length : Optional [Default-128]
  • input_str : Optional [Default-"My name is"]
  • kv : Optional [Default-$\color{green} {True}$]
  • return_path : Optional [Default-False]
  • form_factor : Optional [Default-"cloud"]
  • save_fp32_onnx : Optional [Default-False]
  • save_fp16_onnx : Optional [Default-True]
  • Note: save_fp32_onnx and save_fp16_onnx cannot both be False.

compile (sample use below)
  • onnx_path : $\color{green} {Mandatory}$
  • qpc_path : $\color{green} {Mandatory}$
  • num_cores : $\color{green} {Mandatory}$
  • device_group : $\color{green} {Mandatory}$
  • batch_size : Optional [Default-1]
  • prompt_len : Optional [Default-32]
  • ctx_len : Optional [Default-128]
  • mxfp6 : Optional [Default-True]

latency_stats_kv (sample use below)
  • tokenizer : $\color{green} {Mandatory}$
  • qpc : $\color{green} {Mandatory}$
  • prompt : $\color{green} {Mandatory}$
  • input_len : Optional [Default-None]
  • generation_len : Optional [Default-None]
  • device_id : Optional [Default-[0]]
  • enable_debug_logs : Optional [Default-False]
  • stream : Optional [Default-True]
  • write_io_dir : Optional
  • automation : Optional [Default-False]

    1. Model download and transform

     Initialize QEfficient and transform the model. Check the list of supported architectures in the repo.

     # Instantiate the original Transformers model
    import os
    from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel
    import QEfficient
    from transformers import AutoTokenizer
    from QEfficient.utils import hf_download
    from QEfficient.utils.constants import Constants
    # Please uncomment and use appropriate Cache Directory for transformers, in case you don't want to use default ~/.cache dir.
    # os.environ["TRANSFORMERS_CACHE"] = "/local/mnt/workspace/hf_cache"
    
    ROOT_DIR = os.path.dirname(os.path.abspath(""))
    
    # Model-Card name to be onboarded (This is HF Model Card name) : https://huggingface.co/gpt2-xl
    
    model_name = "gpt2" 
    
     # Similarly, we can change the model name and generate the corresponding model, provided support has been added in the lib.
    
     model_hf_path = hf_download(repo_id=model_name, cache_dir=Constants.CACHE_DIR, ignore_patterns=["*.txt", "*.onnx", "*.ot", "*.md", "*.tflite", "*.pdf"])
    model_hf = GPT2LMHeadModel.from_pretrained(model_hf_path, use_cache=True)
    model_hf.eval()
    print(f"{model_name} from hugging-face \n", model_hf)
    
    # Easy and minimal api to update the model
    model_transformed = QEfficient.transform(model_hf, type="Transformers", form_factor="cloud")
    
    model_transformed.eval()
    print("Model after Optimized transformations \n", model_transformed)

    2. ONNX export of the transformed model

     Use the qualcomm_efficient_converter API to export the KV-transformed model to ONNX and verify it against Torch.

    from QEfficient.exporter.export_hf_to_cloud_ai_100 import qualcomm_efficient_converter
    
     # We can now export the modified model to the ONNX framework.
     # This will generate a single ONNX model covering both the prefill and decode variations, optimized for
     # the Cloud AI 100 platform.
     
     # This will generate the ONNX model and clip the overflow constants to fp16,
     # verify the model on ONNX Runtime vs PyTorch,
     # and then generate the inputs and custom_io.yaml file required for compilation.
     
     # We can generate the KV-style models with the flag "kv".
     # Bert-style models do not have any KV-cache optimization and are the unoptimized version.
     # It is recommended to use kv=True for better performance.
    
    tokenizer = AutoTokenizer.from_pretrained(model_hf_path, use_cache=True)
    base_path, onnx_path = qualcomm_efficient_converter(
        model_kv=model_transformed,
        model_name=model_name,
        kv=True,
        form_factor="cloud",
        return_path=True,
        tokenizer=tokenizer,
    )

    3. Compile on Cloud AI 100

     Once the model is exported, compile it on Cloud AI 100 to generate the QPC.

     # Please use the Platform SDK to check num_cores for your card.
    from QEfficient.cloud.compile import main as compile
    
    generated_qpc_path = compile(
        onnx_path=onnx_path,
        num_cores=14,
        qpc_path=base_path,
        device_group=[0],
        mxfp6=True,
    )

    4. Run Benchmark

     Benchmark the model on Cloud AI 100: run the latency_stats_kv API to print the generated tokens and tokens/sec.

    from QEfficient.generation.text_generation_inference import latency_stats_kv
    
     # Post compilation, we can print the latency stats for the KV models; we provide an API to print token and latency stats on AI 100.
     # We need the compiled prefill and decode QPC to compute the tokens generated. This is based on a greedy sampling approach.
    latency_stats_kv(tokenizer=tokenizer, qpc=generated_qpc_path, device_id=[0], prompt="My name is")

End-to-end demo examples for various models are available in the notebooks directory. Please check them out.

Adding support for a new model

Watch this space for references to detailed steps, template examples and much more.

Details on KV Cache Optimization for Cloud AI 100


Note: More details are available here: https://quic.github.io/cloud-ai-sdk-pages/latest/Getting-Started/Model-Architecture-Support/Large-Language-Models/llm/
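
As background for the details linked above, the core idea of KV-cache retention is that each decode step appends the new key/value projections to buffers retained on-device, so attention reuses prior states instead of recomputing them over the whole prompt. The snippet below is a generic, framework-level illustration of that bookkeeping in PyTorch; it is not the library's implementation, and the function name append_kv is ours.

import torch

def append_kv(past_k, past_v, new_k, new_v):
    # Append the current decode step's key/value projections to the retained cache.
    # past_k/past_v: [batch, heads, seq_so_far, head_dim] (or None on the first step)
    # new_k/new_v:   [batch, heads, 1, head_dim] for a single decode step
    # Illustrative sketch only; the library's reimplemented blocks handle this on-device.
    if past_k is None:
        return new_k, new_v
    return torch.cat([past_k, new_k], dim=2), torch.cat([past_v, new_v], dim=2)

# Example: after a 32-token prefill, one decode step grows the cache to 33 positions.
past_k = torch.randn(1, 12, 32, 64)
past_v = torch.randn(1, 12, 32, 64)
new_k = torch.randn(1, 12, 1, 64)
new_v = torch.randn(1, 12, 1, 64)
k, v = append_kv(past_k, past_v, new_k, new_v)
print(k.shape, v.shape)  # torch.Size([1, 12, 33, 64]) torch.Size([1, 12, 33, 64])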

Acknowledgements

Thanks to:

• Hugging Face Transformers, for its work on LLM GenAI modeling implementations
• The ONNX, PyTorch and ONNX Runtime communities

Support

If you run into any problems with the code, please file GitHub issues directly on this repo.

Contributing

This project welcomes contributions and suggestions. Please check the License. Integration with a CLA Bot is underway.
