🤗🦙Welcome! This repository contains minimal recipes to get started with Llama 3.1 quickly.
- To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
- For more advanced end-to-end use cases with open ML, please visit the Open Source AI Cookbook.
This repository is a work in progress, so you might see considerable changes in the coming days.
Note: To use Llama 3.1, you need to accept the license and request permission to access the models. Please visit any of the Hugging Face repos and submit your request. You only need to do this once; if your request is approved, you'll get access to all the repos.
Would you like to run inference of the Llama 3.1 models locally? So do we! The memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:
| Model Size | FP16 | FP8 | INT4 (AWQ/GPTQ/bnb) |
|---|---|---|---|
| 8B | 16 GB | 8 GB | 4 GB |
| 70B | 140 GB | 70 GB | 35 GB |
| 405B | 810 GB | 405 GB | 203 GB |
Note: These are estimated values and may vary based on specific implementation details and optimizations.
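The figures in the table are consistent with a simple rule of thumb: parameter count times bytes per parameter, counting the weights only. Here is a minimal sketch of that arithmetic. The function name and the `BYTES_PER_PARAM` dictionary are illustrative assumptions, not part of any library, and the estimate deliberately ignores the KV cache, activations, and framework overhead.

```python
# Assumed bytes per parameter for each precision column in the table.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def estimate_weight_memory_gb(num_params_billions: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives gigabytes directly.
    """
    return num_params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (8, 70, 405):
        row = {p: estimate_weight_memory_gb(size, p) for p in BYTES_PER_PARAM}
        print(f"{size}B: {row}")
```

For 405B in INT4 this yields 202.5 GB, which the table rounds up to 203 GB. In practice, leave headroom beyond these numbers for the context length and batch size you plan to use.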
Here are some notebooks to help you get started:
- Run Llama 8B on free Google Colab in half precision
- Run Llama 8B in 8-bit with bitsandbytes
- Run Llama 8B in 4-bit with bitsandbytes
- Run Llama 8B with AWQ & fused ops
- Run Llama 3.1 405B FP8
- Run Llama 3.1 405B quantized to INT4 with AWQ
- Run Llama 3.1 405B quantized to INT4 with GPTQ
- Run assisted decoding with Llama 405B and Llama 8B
- Accelerate your inference using torch.compile
- Accelerate your inference using torch.compile and 4-bit quantization with torchao
- Execute some Llama-generated Python code
- Use tools with Llama!
Are these models too large for you to run at home? Would you like to experiment with Llama 405B? Try out the following examples!
- Use the Inference API for PRO users
- Use a dedicated Inference Endpoint
In addition to the generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects jailbreaks and prompt injections. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations. The following notebooks show how to use them:
- Detecting jailbreaks and prompt injection with Prompt Guard
- Using Llama Guard for Guardrailing
- Fine-tune Llama 3.1 8B on a consumer GPU with PEFT and QLoRA using bitsandbytes
- Generate synthetic data with distilabel
- Do assisted decoding with a large and a small model
- Build an ML demo using Gradio