Based on these two papers:
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Gradient-Based Language Model Red Teaming
For our ARENA Capstone project, we replicated the paper "Universal and Transferable Adversarial Attacks on Aligned Language Models" to discover adversarial suffixes on GPT-2 and LLaMA 2. This part works well. We then implemented it for Gemma (also successful) and attempted a variant of the method, related to the paper "Gradient-Based Language Model Red Teaming", to discover the trojan suffixes from the RLHF trojan competition. This did not work as well as expected, but is at least theoretically interesting. We recommend reading the papers before navigating this repository.
To set up, clone the repository, navigate into it, and run the following commands:
pip install -e .
huggingface-cli login
wandb login
To run the portions that optimize over a reward model, install the reward model repository:
cd src/arena_capstone
git clone https://github.com/ethz-spylab/rlhf_trojan_competition.git
cd -
You may need to make some imports in that repository absolute instead of relative.
Then run the following commands to launch the experiments on GPT-2, LLaMA 2, and Gemma respectively:
python src/arena_capstone/algorithm/upo.py
python src/arena_capstone/scripts/run_with_llama.py
python src/arena_capstone/gemma/gemma_upo.py
In order to get gradients wrt. all possible token substitutions for our suffix, the suffix must be provided in some continuous vector form that can receive a gradient. To handle this, we use a one-hot float representation that we convert to sequences of embedding vectors (a sketch follows the list below). This file handles these actions and related responsibilities.
- can get gradients wrt. this
- cannot get gradients, but is computationally cheaper (not produced from one-hot embedding processing)
- Bundles a sequential representation along with its attention mask
- the sequence representation can be any of:
- tokens
- vocab space vectors/logits
- embeddings
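A minimal sketch of the one-hot trick described above, with assumed names (`forward_from_embeds` stands in for whatever forward pass maps embedding vectors to logits; this is illustrative, not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def suffix_substitution_grads(forward_from_embeds, embed_matrix, prefix_embeds, suffix_ids, target_ids):
    """Illustrative: gradient of the target loss wrt. every possible suffix-token substitution."""
    vocab_size = embed_matrix.shape[0]
    # One-hot float representation of the suffix; this is the leaf we differentiate through.
    one_hot = F.one_hot(suffix_ids, vocab_size).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    # "Soft" embedding lookup: (suffix_len, vocab) @ (vocab, d_model) -> (suffix_len, d_model)
    suffix_embeds = one_hot @ embed_matrix
    seq = torch.cat([prefix_embeds, suffix_embeds, embed_matrix[target_ids]], dim=0)
    logits = forward_from_embeds(seq.unsqueeze(0))            # (1, seq_len, vocab)
    tgt_len = target_ids.shape[0]
    loss = F.cross_entropy(logits[0, -tgt_len - 1 : -1], target_ids)
    loss.backward()
    return one_hot.grad                                       # (suffix_len, vocab_size)
```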
These objects handle operations on "softened" and mixed-representation sequences (see the sketch below):
- convert tokens & vocab space vectors to embeddings/one hot float vectors
- do forward passes from embeddings
- implemented by wrapping a HuggingFace *ForCausalLM model
- does what EmbeddingFriendlyForCausalLM does, but a forward pass produces (logits, values) instead of just logits, where values are estimates of the reward of the generation
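A rough sketch of what these wrappers might look like (class and method names are assumptions for illustration, not the repo's actual interfaces):

```python
import torch
import torch.nn as nn

class EmbeddingFriendlyWrapper(nn.Module):
    """Illustrative wrapper around a HuggingFace *ForCausalLM model."""

    def __init__(self, model):
        super().__init__()
        self.model = model
        self.embed_matrix = model.get_input_embeddings().weight  # (vocab, d_model)

    def embed(self, seq):
        # Hard tokens: ordinary embedding lookup.
        if seq.dtype == torch.long:
            return self.embed_matrix[seq]
        # Vocab-space vectors (one-hot or softened): soft lookup via matmul.
        return seq.to(self.embed_matrix.dtype) @ self.embed_matrix

    def forward_from_embed(self, embeds, attention_mask=None):
        return self.model(inputs_embeds=embeds, attention_mask=attention_mask).logits


class ValueHeadWrapper(EmbeddingFriendlyWrapper):
    """Illustrative variant whose forward pass returns (logits, values)."""

    def __init__(self, model, d_model):
        super().__init__(model)
        self.value_head = nn.Linear(d_model, 1)  # per-position scalar reward estimate

    def forward_from_embed(self, embeds, attention_mask=None):
        out = self.model(
            inputs_embeds=embeds,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = out.hidden_states[-1]                 # (batch, seq, d_model)
        values = self.value_head(hidden).squeeze(-1)   # (batch, seq)
        return out.logits, values
```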
- Implements the GCG algorithm from the Universal and Transferable Adversarial Attacks on Aligned Language Models paper (a sketch of the candidate-sampling step follows this list)
- Computes a loss on either type of batch, either to evaluate candidates or to backpropagate and get gradients for the suffix
- Selects the top-k candidate substitutions per position according to the vocab-space gradients
- Samples candidate suffixes from that selection
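The candidate-sampling step might look roughly like the sketch below (assumed names, not the repo's exact code). Each candidate suffix is then evaluated with a forward pass, and the lowest-loss candidate replaces the current suffix.

```python
import torch

def sample_suffix_candidates(grad, suffix_ids, n_candidates=256, topk=256):
    """Illustrative GCG candidate sampling.
    grad: (suffix_len, vocab_size) gradient of the loss wrt. the one-hot suffix."""
    suffix_len = suffix_ids.shape[0]
    # Most promising substitutions per position: a large negative gradient should lower the loss.
    top_tokens = (-grad).topk(topk, dim=-1).indices           # (suffix_len, topk)
    candidates = suffix_ids.repeat(n_candidates, 1)           # (n_candidates, suffix_len)
    positions = torch.randint(0, suffix_len, (n_candidates,))
    choices = torch.randint(0, topk, (n_candidates,))
    candidates[torch.arange(n_candidates), positions] = top_tokens[positions, choices]
    return candidates
```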
- Implements the UPO algorithm from the Universal and Transferable Adversarial Attacks on Aligned Language Models paper
- subclasses of RewardModel that
- Implements UPO with a loss coming from a reward model, ra ...
- Just random greedy search, no gradients involved
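For comparison, a gradient-free random greedy search can be as simple as the sketch below (`loss_fn` is a stand-in for whatever objective is being minimized; names are assumptions for illustration):

```python
import torch

def random_greedy_search(loss_fn, suffix_ids, vocab_size, n_steps=1000):
    """Illustrative: propose random single-token substitutions, keep them only if the loss improves."""
    best_loss = loss_fn(suffix_ids)
    for _ in range(n_steps):
        candidate = suffix_ids.clone()
        pos = torch.randint(0, suffix_ids.shape[0], (1,)).item()
        candidate[pos] = torch.randint(0, vocab_size, (1,)).item()
        loss = loss_fn(candidate)
        if loss < best_loss:
            suffix_ids, best_loss = candidate, loss
    return suffix_ids
```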
- Does UPO but for Gemma; the implementation is slightly nicer.
- Has loaders for the poisoned Llama models
- Function to run UPO on Llama
- Trains the value head to be used in soft_value_head.py, using the reward model
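One plausible shape for that training loop, sketched with assumed helper names (`get_hidden_states` and `reward_model_score` are stand-ins, not the repo's functions): the value head is regressed onto the reward model's scores.

```python
import torch
import torch.nn.functional as F

def train_value_head(value_head, get_hidden_states, reward_model_score, batches, lr=1e-4):
    """Illustrative: fit a value head so its final-position output predicts the reward model's score."""
    optimizer = torch.optim.Adam(value_head.parameters(), lr=lr)
    for tokens, attention_mask in batches:
        with torch.no_grad():
            hidden = get_hidden_states(tokens, attention_mask)    # (batch, seq, d_model)
            rewards = reward_model_score(tokens, attention_mask)  # (batch,)
        values = value_head(hidden[:, -1]).squeeze(-1)            # (batch,)
        loss = F.mse_loss(values, rewards)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return value_head
```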
- Executable & schedulable config implementing Gumbel-Softmax
- Soft prompt optimization inspired by GBRT and UPO/GCG
- alternates between phases of:
- GBRT
- random greedy search (over top-k soft prompt tokens or all tokens)
- Similar to SoftOptPrompt, but with a value head instead of a reward model.
- Models a "soft suffix" as trainable logits, where forward passes sample using the Gumbel-Softmax trick to produce a distribution over tokens.
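A minimal sketch of that idea, with assumed names rather than the module's actual interface; the returned vocab-space vectors can then be embedded with the embedding-friendly model sketched earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSuffix(nn.Module):
    """Illustrative: a trainable-logits suffix sampled via the Gumbel-Softmax trick."""

    def __init__(self, suffix_len: int, vocab_size: int, tau: float = 1.0):
        super().__init__()
        # One row of trainable logits per suffix position.
        self.suffix_logits = nn.Parameter(torch.zeros(suffix_len, vocab_size))
        self.tau = tau

    def forward(self, hard: bool = False):
        # Differentiable (relaxed one-hot) sample over the vocabulary at each position;
        # gradients flow back into self.suffix_logits through the relaxation.
        return F.gumbel_softmax(self.suffix_logits, tau=self.tau, hard=hard)
```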