This repo contains the implementation of NeurIPS 2024 paper:
We explore the Self-Play training of LLMs in an Adversarial language Game (SPAG) named Adversarial Taboo. In the following game examples with the target word "conversation", the attacker wins left while the defender wins right:
With the training epoch of SPAG increasing, the LLM reasoning ability continuously improves as shown in the plots below:
Many thanks to @thwu1, who has reproduced the SPAG experiments and released model checkpoints (Imitation Model, SPAG-1, SPAG-2, SPAG-3) on Huggingface 🤗!
To build the running environment, use the following command:
pip3 install -r requirements.txt
We train models and sampling episodes using 32 40G A100 GPUs with CUDA 11.0. The commands below are also compatible with 8 A100 GPUs.
To ensure the instruction-following ability of LLMs to the game rules, we first let LLMs imitate the winning behaviors of GPT-4. To launch the imitation learning on LLaMA-2-7B-base, use the following command:
torchrun --nproc_per_node=8 --master_port=6000 train.py \
--output_dir <path_to_save_your_imitation_checkpoint> \
--model_name_or_path "Llama-2-7b-hf" \
--ref_model_name_or_path "Llama-2-7b-hf" \
--lm_kl_coeff 0.1 \
--train_method "SFTwithKL" \
--train_data_path "./data/train_imitation_gpt4.json" \
--remove_unused_columns False \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy no \
--padding_side "right" \
--truncation_side "left" \
--max_length 2048 \
--save_strategy epoch \
--learning_rate 5e-6 \
--lr_scheduler_type "cosine" \
--warmup_ratio 0.03 \
--logging_steps 1 \
--weight_decay 0. \
--deepspeed "./configs/default_offload_opt_param.json" \
--gradient_checkpointing True \
--tf32 True --bf16 True
Here Llama-2-7b-hf
can be replaced by Baichuan2-13B-Base
to reproduce the Baichuan-2 results in our paper.
After the imitation learning, we can conduct the self-play with the imitation-learned model on all targets words:
export PYTHONPATH=.
torchrun --nproc_per_node=8 --master_port=6000 tools/play_llm_game.py \
--taboo_max_turns 5 \
--attacker_model_name_or_path <path_to_imitation_learned_model> \
--defender_model_name_or_path <path_to_imitation_learned_model> \
--model_prefix "im_llama2" \
--data_path "./data/all_target_words.txt" \
--output_dir "./data/self_play_results" \
--per_device_eval_batch_size 1 \
--task_type "sampling" \
--data_suffix "all_words" \
--max_length 2048 \
--max_new_tokens 256 \
--logging_steps 5 \
--bf16 True --tf32 True
When the self-play collection is finished, we can access all the game episodes in im_llama2_sampling_all_words_results.json
at data/self_play_results/
.
To conduct reinforcement learning on game episodes, we first calculate the outcomes by rule-based judgment and assign rewards to actions:
export PYTHONPATH=.
python3 tools/assign_rewards.py \
--input_data_path data/self_play_results/im_llama2_sampling_all_target_words_results.json \
--output_data_path data/train_spag_data_im_llama2.json \
--sft_data_path data/alpaca_train.json
The output file train_spag_data_im_llama2.json
is already in an instruction-tuning format, with the following keywords:
query
&target
: the input and label for language modeling,reward
: the reward assigned to the current utterance (target
),weight
: the re-weighting parameter to ensure that both attacker and defender have an equal learning coefficient 1/2 in expectation.
Then the SPAG model can be learned with the following command:
torchrun --nproc_per_node=8 --master_port=6000 train.py \
--output_dir <path_to_save_your_SPAG_checkpoint> \
--model_name_or_path <path_to_your_imitation_checkpoint> \
--ref_model_name_or_path <path_to_your_imitation_checkpoint> \
--lm_kl_coeff 0.2 \
--lm_sft_coeff 0.5 \
--train_method "OfflinePO" \
--train_data_path "./data/train_spag_data_im_llama2.json" \
--remove_unused_columns False \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy no \
--padding_side "right" \
--truncation_side "left" \
--max_length 2048 \
--save_strategy epoch \
--learning_rate 2e-6 \
--lr_scheduler_type "cosine" \
--warmup_ratio 0.03 \
--logging_steps 1 \
--weight_decay 0. \
--deepspeed "./configs/default_offload_opt_param.json" \
--gradient_checkpointing True \
--tf32 True --bf16 True
By repeating the episode-collection and SPAG-learning processes, we can observe continous improvements on reasoning benchmarks. For LLM reasoning evaluation, we use the lm-evaluation-harness repo with the setups described in our paper.
Please cite our paper if you find the code useful.
@article{cheng2024spag,
title={Self-playing Adversarial Language Game Enhances LLM Reasoning},
author={Cheng, Pengyu and Hu, Tianhao and Xu, Han and Zhang, Zhisong and Dai, Yong and Han, Lei and Du, Nan},
journal={arXiv preprint arXiv:2404.10642},
year={2024}
}