Hugging Face Llama Recipes

🤗🦙Welcome! This repository contains minimal recipes to get started with Llama 3.1 quickly.

To get an overview of Llama 3.1, please visit Hugging Face announcement blog post.
For more advanced end-to-end use cases with open ML, please visit the Open Source AI Cookbook.

This repository is WIP so that you might see considerable changes in the coming days.

Note: To use Llama 3.1, you need to accept the license and request permission to access the models. Please, visit any of the Hugging Face repos and submit your request. You only need to do this once, you'll get access to all the repos if your request is approved.

Local Inference

Would you like to run inference of the Llama 3.1 models locally? So do we! The memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:

Model Size	FP16	FP8	INT4 (AWQ/GPTQ/bnb)
8B	16 GB	8 GB	4 GB
70B	140 GB	70 GB	35 GB
405B	810 GB	405 GB	203 GB

Note: These are estimated values and may vary based on specific implementation details and optimizations.

Here are some notebooks to help you started:

Run Llama 8B in free Google Colab in half precision
Run Llama 8B in 8-bits with bitsandbytes
Run Llama 8B in 4-bits with bitsandbytes
Run Llama 8B with AWQ & fused ops
Run Llama 3.1 405B FP8
Run Llama 3.1 405B quantized to INT4 with AWQ
Run Llama 3.1 405B quantized to INT4 with GPTQ
Run assisted decoding with Llama 405B and Llama 8B
Accelerate your inference using torch.compile
Accelerate your inference using torch.compile and 4-bit quantization with torchao
Execute some Llama-generated Python code
Use tools with Llama!

API inference

Are these models too large for you to run at home? Would you like to experiment with Llama 405B? Try out the following examples!

Use the Inference API for PRO users
Use a dedicated Inference Endpoint

Llama Guard and Prompt Guard

In addition to the generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects jailbreaks and prompt injections. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations. Learn how to use them as done in the following notebooks:

Detecting jailbreaks and prompt injection with Prompt Guard
Using Llama Guard for Guardrailing

Advanced use cases

How to fine-tune Llama 3.1 8B on consumer GPU with PEFT and QLoRA with bitsandbytes
Generate synthetic data with distilabel
Do assisted decoding with a large and a small model
Build a ML demo using Gradio

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
assets		assets
4bit_bnb.ipynb		4bit_bnb.ipynb
8bit_bnb.ipynb		8bit_bnb.ipynb
README.md		README.md
assisted_decoding.py		assisted_decoding.py
awq.ipynb		awq.ipynb
awq_generation.py		awq_generation.py
fp8-405B.ipynb		fp8-405B.ipynb
gptq_generation.py		gptq_generation.py
inference-api.ipynb		inference-api.ipynb
peft_finetuning.py		peft_finetuning.py
prompt_guard.ipynb		prompt_guard.ipynb
prompt_reuse.py		prompt_reuse.py
qlora_405B.slurm		qlora_405B.slurm
quantized_cache.py		quantized_cache.py
synthetic-data-with-llama.ipynb		synthetic-data-with-llama.ipynb
torch_compile.py		torch_compile.py
torch_compile_with_torchao.ipynb		torch_compile_with_torchao.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hugging Face Llama Recipes

Local Inference

API inference

Llama Guard and Prompt Guard

Advanced use cases

About

Releases

Packages

Contributors 13

Languages

huggingface/huggingface-llama-recipes

Folders and files

Latest commit

History

Repository files navigation

Hugging Face Llama Recipes

Local Inference

API inference

Llama Guard and Prompt Guard

Advanced use cases

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 13

Languages

Packages