
nqgl/arena-capstone


Paper Replication, Trojan Detection, GBRT, Gemma prompt jailbreaking

Based on these two papers: "Universal and Transferable Adversarial Attacks on Aligned Language Models" and "Gradient-Based Language Model Red Teaming".

For our ARENA (v3.0) capstone project, we replicated the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models" to discover adversarial suffixes on GPT-2 and LLaMA 2. This part works well. We then implemented it for Gemma (also successfully) and attempted a variant of the method, related to the paper "Gradient-Based Language Model Red Teaming", to discover the trojan suffixes in the RLHF trojan competition. This did not work as well as expected, but is at least theoretically interesting. We recommend reading the papers before navigating this repository.

To set up the repository, clone it, navigate into the cloned directory, and run the following commands:

pip install -e .
huggingface-cli login
wandb login

To run the portions that optimize over reward, install the reward model repo:

cd src/arena_capstone
git clone https://github.com/ethz-spylab/rlhf_trojan_competition.git
cd -

You may need to make some of that repo's imports absolute instead of relative.

Then you can run the experiments on GPT-2, LLaMA 2, and Gemma respectively with:

python src/arena_capstone/algorithm/upo.py
python src/arena_capstone/scripts/run_with_llama.py
python src/arena_capstone/gemma/gemma_upo.py

Source Code Structure

algorithm/

To get gradients with respect to all possible token substitutions for our suffix, the suffix must be fed in as some continuous vector form that can receive a gradient. To handle this, we use a one-hot float representation that we convert into sequences of embedding vectors. This code handles these conversions and related responsibilities.
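
A rough sketch of the idea (the tensor shapes and the toy loss below are illustrative, not the repository's actual code):

import torch
import torch.nn.functional as F

# Illustrative sizes: GPT-2-like vocab, small suffix.
vocab_size, d_model, suffix_len = 50_257, 768, 8
embedding_matrix = torch.randn(vocab_size, d_model)   # stand-in for model.get_input_embeddings().weight
suffix_tokens = torch.randint(0, vocab_size, (suffix_len,))

# One-hot float representation of the suffix; this is the tensor we take gradients with respect to.
one_hot = F.one_hot(suffix_tokens, vocab_size).float()
one_hot.requires_grad_(True)

# Convert to embedding vectors by multiplying into the embedding matrix.
suffix_embeds = one_hot @ embedding_matrix             # (suffix_len, d_model)

# A stand-in loss; in the real code this comes from the model's forward pass on the full sequence.
loss = suffix_embeds.sum()
loss.backward()

# one_hot.grad[i, v] approximates how the loss changes if suffix position i were swapped to token v.
print(one_hot.grad.shape)                              # torch.Size([8, 50257])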

Batches and Bundles:
EmbeddedBatch:
  • gradients can be taken with respect to this
TokensBatch:
  • cannot provide gradients, but is computationally cheaper (not produced via one-hot embedding processing)
MaskedChunk:
  • bundles a sequence representation together with its attention mask
  • the sequence representation can be any of:
    • tokens
    • vocab-space vectors/logits
    • embeddings
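
A hedged sketch of what such a bundle might look like (the field and method names are illustrative, not necessarily the repository's):

from dataclasses import dataclass
import torch

@dataclass
class MaskedChunkSketch:
    # seq can hold token ids (long), vocab-space vectors/logits, or embeddings (float).
    seq: torch.Tensor
    mask: torch.Tensor  # attention mask sharing the leading (batch, seq_len) shape

    def cat(self, other: "MaskedChunkSketch") -> "MaskedChunkSketch":
        # Concatenating along the sequence dimension keeps sequences and masks aligned.
        return MaskedChunkSketch(
            seq=torch.cat([self.seq, other.seq], dim=1),
            mask=torch.cat([self.mask, other.mask], dim=1),
        )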
Embedding Friendly Models:

These objects handle operations on "softened" and mixed-representation sequences.

EmbeddingFriendlyForCausalLM:
  • convert tokens & vocab space vectors to embeddings/one hot float vectors
  • do forward passes from embeddings
  • implemented by wrapping a HuggingFace *ForCausalLM model
EmbeddingFriendlyValueHeadForCausalLM:
  • does what EmbeddingFriendlyForCausalLM does, but a forward pass produces (logits, values) instead of just logits, where values are estimates of the reward of the generation
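
For context, HuggingFace *ForCausalLM models accept inputs_embeds, which is roughly what such a wrapper builds on. A minimal sketch, assuming GPT-2 and illustrative shapes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; the repository targets GPT-2, LLaMA 2, and Gemma.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer("Describe how to", return_tensors="pt").input_ids
embedding_matrix = model.get_input_embeddings().weight          # (vocab_size, d_model)

# "Softened" suffix: a distribution over the vocabulary at each suffix position.
soft_suffix = torch.softmax(torch.randn(1, 4, embedding_matrix.shape[0]), dim=-1)

# Mix hard-token embeddings with soft-suffix embeddings and run a normal forward pass.
prompt_embeds = model.get_input_embeddings()(tokens)            # (1, seq_len, d_model)
suffix_embeds = soft_suffix @ embedding_matrix                  # (1, 4, d_model)
inputs_embeds = torch.cat([prompt_embeds, suffix_embeds], dim=1)

logits = model(inputs_embeds=inputs_embeds).logits              # (1, seq_len + 4, vocab_size)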
GCG:
  • Implements the GCG algorithm from the "Universal and Transferable Adversarial Attacks on Aligned Language Models" paper
TokenGradients:
  • Computes a loss from either type of batch, either to assess it or to backpropagate it and obtain gradients for the suffix
  • Selects the top-k substitution candidates according to the vocab gradients (see the sketch after this list)
  • Samples from that selection
UPO:
  • Implements the UPO algorithm from the same paper
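
A minimal sketch of the top-k-and-sample step shared by GCG/UPO, reusing the one-hot gradient from the earlier sketch (the function and argument names are illustrative, not the repository's API):

import torch

def sample_candidate_suffixes(one_hot_grad, suffix_tokens, k=256, n_candidates=128):
    """Sample candidate suffixes by swapping one position to a top-k token, GCG-style."""
    suffix_len, vocab_size = one_hot_grad.shape
    # A more negative gradient suggests the loss should decrease if we switch to that token.
    top_k_tokens = (-one_hot_grad).topk(k, dim=-1).indices        # (suffix_len, k)

    candidates = suffix_tokens.repeat(n_candidates, 1)             # (n_candidates, suffix_len)
    positions = torch.randint(0, suffix_len, (n_candidates,))
    choices = torch.randint(0, k, (n_candidates,))
    candidates[torch.arange(n_candidates), positions] = top_k_tokens[positions, choices]
    return candidates

The sampled candidates are then scored with cheap token-based forward passes (a TokensBatch, in the terms above) and the best one is kept.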

rewards/

RewardGenerator:
RewardUPO:
  • Implements UPO with a loss coming from a reward model, ra ...
  • Just random greedy search, no gradients involved
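
A hedged sketch of such a gradient-free step (score_fn is a hypothetical stand-in for generating from the model and scoring the generations with the reward model):

import torch

def random_greedy_step(score_fn, suffix_tokens, vocab_size, n_candidates=64):
    """One step of random greedy search: propose random single-token swaps, keep the best."""
    suffix_len = suffix_tokens.shape[0]
    candidates = suffix_tokens.repeat(n_candidates, 1)
    positions = torch.randint(0, suffix_len, (n_candidates,))
    candidates[torch.arange(n_candidates), positions] = torch.randint(0, vocab_size, (n_candidates,))

    # score_fn returns one scalar score (e.g. a reward estimate) per candidate suffix.
    scores = torch.tensor([float(score_fn(c)) for c in candidates])
    return candidates[scores.argmax()]

Compared with the GCG sketch above, swaps are drawn uniformly at random rather than from the top-k tokens by gradient.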

gemma/

  • Does UPO, but for Gemma; the implementation is slightly nicer.

scripts/

  • Has loaders for the poisoned Llama models
  • Function to run UPO on Llama
  • Trains the value head to be used in soft_value_head.py, using the reward model

soft_suffix/

GumbelSoftmaxConfig:
  • Executable & schedulable config implementing Gumbel-Softmax sampling
SoftOptPrompt:
  • Soft prompt optimization inspired by GBRT and UPO/GCG
  • alternates between phases of:
    • GBRT
    • random greedy search (over the top-k soft-prompt tokens, or over all tokens)
VHSoftOptPrompt:
  • Similar to SoftOptPrompt, but with a value head instead of a reward model.
Suffix:
  • Models a "soft suffix" as trainable logits, where forward passes sample using the Gumbel-Softmax trick to produce a distribution over tokens.
