Fooling the Overseer

This is the code for my entry to Apart Research's Alignment Jam on Finding Failure Cases in Multi-Agent AI Systems. I demonstrate that when a Large Language Model (LLM) is scored by another AI, it can spontaneously start attacking its overseer AI with jailbreaks, provided it knows that jailbreaking attacks exist. I also show that this can get much worse if the original model is then trained with reinforcement learning, which incentivizes it to chase higher reward than it could obtain without jailbreaks. Read the submission here.

Running Supervised Finetuning

This was run on 4 A100 GPUs.

accelerate launch train_SFT.py --data_path "data/datasets/descriptions.jsonl" --model_path [PATH_TO_LLAMA_2_7B_CHAT] --run_name "SFT"
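Before launching a multi-GPU job, it can help to confirm that the dataset file is present and parses cleanly. The following is a minimal sketch, not part of the repository: it assumes only that `data/datasets/descriptions.jsonl` follows the usual JSON Lines convention of one JSON value per line. The helper names `read_jsonl` and `summarize` are hypothetical, and nothing is assumed about which fields the records contain.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the records in a JSON Lines file as a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the record count and the sorted field names seen across dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return len(records), sorted(keys)


if __name__ == "__main__":
    # Path taken from the training command above; run from the repository root.
    dataset = Path("data/datasets/descriptions.jsonl")
    if dataset.exists():
        count, keys = summarize(read_jsonl(dataset))
        print(f"{count} records; fields: {keys}")
    else:
        print(f"{dataset} not found; run this from the repository root.")
```

A malformed line raises `json.JSONDecodeError`, which is cheaper to discover here than partway through a training run.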

You can download a finetuned model here.

Running Reinforcement Learning

This was run on 8 A100 GPUs. It also requires the trlx package.
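Since this step depends on an extra package, a quick import check can save a failed launch. This is a sketch under assumptions: it presumes the package is importable under the top-level name `trlx`, and that `accelerate` is importable because the commands here use `accelerate launch`. The helper `missing_packages` is a hypothetical name, not something the repository provides.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust if your installation differs.
    missing = missing_packages(["trlx", "accelerate"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

The check only confirms that the packages can be located; it does not verify that their versions are compatible with the training script.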

accelerate launch train_PPO.py --model_path [PATH_TO_FINETUNED_LLAMA] --tokenizer_path [PATH_TO_LLAMA_TOKENIZER]

The checkpoint at 150 steps can be downloaded here.

Evaluating the Models

This is best done on a single GPU. Outputs are stored under data/generations/[NAME_FOR_OUTPUT_FILE].

python generate_outputs.py --model_path [PATH_TO_EVAL_MODEL] --tokenizer_path [PATH_TO_LLAMA_TOKENIZER] --output_file [NAME_FOR_OUTPUT_FILE]
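To compare several checkpoints (for example the SFT model and the PPO checkpoint), the evaluation command can be assembled once per model. The sketch below only builds and prints the commands. It uses the three flags shown above and the `data/generations/` location stated in this section. The function names, model paths, and output file names are hypothetical placeholders.

```python
import shlex
from pathlib import PurePosixPath


def build_eval_command(model_path, tokenizer_path, output_file):
    """Return the generate_outputs.py invocation as an argument list."""
    return [
        "python", "generate_outputs.py",
        "--model_path", model_path,
        "--tokenizer_path", tokenizer_path,
        "--output_file", output_file,
    ]


def output_location(output_file):
    """Return where the README says the generations for output_file are stored."""
    return PurePosixPath("data/generations") / output_file


if __name__ == "__main__":
    # Placeholder paths and names; substitute your own checkpoints.
    tokenizer = "path/to/llama_tokenizer"
    runs = {"sft_outputs": "path/to/sft_model", "ppo_outputs": "path/to/ppo_checkpoint"}
    for output_file, model_path in runs.items():
        command = build_eval_command(model_path, tokenizer, output_file)
        print(shlex.join(command))
        print("  results expected at", output_location(output_file))
```

Each printed command can be pasted into a shell, or the argument list can be passed to `subprocess.run` to execute the runs one after another on the same GPU.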

