Fooling the Overseer

This is my code for my entry to Apart Research's Alignment Jam on Finding Failure Cases in Multi-Agent AI Systems. I demonstrate that when a Large Language Model (LLM) is scored by another AI, it can spontaneously start hacking its overseer AI via jailbreaking attacks, provided it knows that such attacks exist. I also show that this gets much worse if we then run reinforcement learning on the original model, so that it is incentivized to chase higher reward than it could obtain without jailbreaks. Read the submission here.

(Figure: an AI fooling another AI via a jailbreak)

Running Supervised Finetuning

This was run on 4 A100 GPUs.

accelerate launch run_llama_training.py --data_path "data/datasets/descriptions.jsonl" --model_path [PATH_TO_LLAMA_2_7B_CHAT] --run_name "SFT"

You can download a finetuned model here.
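
The finetuned checkpoint can then be loaded with the standard Hugging Face transformers API. Below is a minimal sketch; the local path is a placeholder, not a path shipped with this repository.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to the downloaded finetuned checkpoint
model_path = "checkpoints/llama2_7b_chat_sft"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")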

Running Reinforcement Learning

This was run on 8 A100 GPUs. It also requires the trlx package.

accelerate launch train_PPO.py --model_path [PATH_TO_FINETUNED_LLAMA] --tokenizer_path [PATH_TO_LLAMA_TOKENIZER]

The checkpoint at 150 steps can be downloaded here.
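
For orientation, here is a minimal sketch of how a PPO run is typically wired up with trlx; the reward function, prompts, and paths are illustrative placeholders and do not reproduce the exact logic in train_PPO.py.

import trlx

def reward_fn(samples, **kwargs):
    # Placeholder reward: in this project the reward instead comes from an
    # overseer model scoring each generated sample.
    return [float(len(sample)) for sample in samples]

# Placeholder prompts; train_PPO.py builds its own prompt set.
prompts = ["[EXAMPLE PROMPT]"] * 64

trainer = trlx.train(
    model_path="[PATH_TO_FINETUNED_LLAMA]",
    reward_fn=reward_fn,
    prompts=prompts,
)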

Evaluating the Models

This is best done on a single GPU. The outputs are stored under data/generations/[NAME_FOR_OUTPUT_FILE].

python generate_outputs.py --model_path [PATH_TO_EVAL_MODEL] --tokenizer_path [PATH_TO_LLAMA_TOKENIZER] --output_file [NAME_FOR_OUTPUT_FILE]
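
The generated outputs can then be inspected with a short script like the following sketch; it assumes the output file is JSON lines, which may differ from the exact format written by generate_outputs.py.

import json

# Placeholder output file name; use the same name passed via --output_file
with open("data/generations/[NAME_FOR_OUTPUT_FILE]") as f:
    generations = [json.loads(line) for line in f]

print(f"Loaded {len(generations)} generations")
print(generations[0])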
