The Ryzen AI Software includes support for deploying quantized LLMs on the NPU using an eager execution mode, which simplifies the model ingestion process. Instead of being compiled and executed as a complete graph, the model is processed on an operator-by-operator basis. Compute-intensive operations, such as GEMM/MATMUL, are dynamically offloaded to the NPU, while the remaining operators execute on the CPU. Eager mode for LLMs is supported in both PyTorch and ONNX Runtime.
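To make the dispatch model concrete, below is a minimal PyTorch sketch of the eager-mode offload pattern: an `nn.Linear` variant whose GEMM is routed through a single offload point while the remaining ops stay in ordinary eager mode on the CPU. The `npu_gemm` function and `offload_linears` helper are illustrative placeholders, not part of the Ryzen AI API; a real deployment uses the kernels and module swaps provided by the Ryzen AI Software.

```python
import torch
import torch.nn as nn

def npu_gemm(x, weight_t):
    # Illustrative offload point: in the real flow, the Ryzen AI runtime
    # dispatches this GEMM to the NPU. Here it simply runs on the CPU so
    # the sketch is self-contained and runnable anywhere.
    return x @ weight_t

class OffloadedLinear(nn.Linear):
    # nn.Linear whose compute-intensive GEMM goes through npu_gemm, while
    # the lightweight bias add remains an ordinary eager-mode CPU op.
    def forward(self, x):
        y = npu_gemm(x, self.weight.t())
        if self.bias is not None:
            y = y + self.bias
        return y

def offload_linears(module):
    # Replace every nn.Linear with the offloaded variant, operator by
    # operator, leaving the rest of the model untouched.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and not isinstance(child, OffloadedLinear):
            repl = OffloadedLinear(child.in_features, child.out_features,
                                   bias=child.bias is not None)
            repl.weight = child.weight
            repl.bias = child.bias
            setattr(module, name, repl)
        else:
            offload_linears(child)
    return module

model = offload_linears(nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8)))
print(model(torch.randn(2, 16)).shape)  # torch.Size([2, 8])
```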
A general-purpose flow can be found here: LLMs on RyzenAI with PyTorch (a usage sketch follows the list below)
- Applicability: prototyping and early development with a broad set of LLMs
- Performance: functional support only, not to be used for benchmarking
- Supported platforms: PHX, HPT, STX (and onwards)
- Supported frameworks: PyTorch
- Supported models: a broad range of LLMs
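As a rough illustration of how this general-purpose flow is typically exercised, the sketch below loads a standard Hugging Face model and runs generation through a placeholder preparation step. The `prepare_for_npu` function and the `facebook/opt-125m` checkpoint are assumptions for illustration only; the actual entry points are documented in the LLMs on RyzenAI with PyTorch flow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prepare_for_npu(model):
    # Hypothetical stand-in for the RyzenAI preparation step (quantize the
    # weights and swap GEMM operators for NPU-dispatched kernels).
    return model

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model = prepare_for_npu(model.eval())

inputs = tokenizer("Ryzen AI runs LLMs by", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because execution stays in eager mode, the same pattern can be reused across many Hugging Face LLMs during prototyping before moving to the optimized flow described next.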
A set of performance-optimized models is available upon request on the AMD secure download site: Optimized LLMs on RyzenAI
- Applicability: benchmarking and deployment of specific LLMs
- Performance: highly optimized
- Supported platforms: STX (and onwards)
- Supported frameworks: PyTorch and ONNX Runtime
- Supported models: Llama2, Llama3, Qwen1.5
This is an early-access flow and is expected to be upgraded in upcoming releases.