This repository contains official implementation of our paper "Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models"
Please feel free to contact [email protected] if you have any questions.
- (2024/07/21) We have released the official code of ABJ-Attack!
- (2024/07/23) Our paper is on arXiv! Check it out here!
- (2024/09/11) We have released a comprehensive defense methodology against jailbreak attacks!Check it out here!
- (2024/09/26) We have released a simple yet comprehensive benchmark that covers most of the existing jailbreak attack methods!Check it out here!
This repository shares the code of our latest work on LLMs jailbreaking. In this work:
- We further explore the boundary of jailbreak attacks on LLMs and propose ABJ, the first jailbreak attack method specifically designed to assess LLMs’ safety in handling analyzing-based tasks. ABJ generalizes jailbreak attack prompts in two steps: data preparation and data analysis.
- We conduct comprehensive experiments on both opensource (Llama-3, Qwen-2, GLM-4) and closed-source (GPT-3.5-turbo, GPT-4-turbo, Claude-3) LLMs. The results demonstrate that ABJ exhibits exceptional attack effectiveness and efficiency, achieving 94.8% ASR on GPT-4-turbo, while the AE is around 1.
- We show the robustness of ABJ when facing different defense strategies, indicating that mitigating this attack might be difficult. Furthermore, by transforming and modifying the ABJ method, we can enable more stealthy and effective jailbreak attacks on a wider range of harmful scenarios, extending beyond the limitations of finite datasets, which makes it more difficult to defend against. Notably, the modified ABJ has achieved over 85% ASR on Llama-3 and Claude-3, which are considered two of the most secure LLMs by far.
-
attack_method
: We implement4
kind of ABJ Attack, includingoriginal_ABJ
,modified_ABJ
,code_based_ABJ
,adversarial_ABJ
. -
target_model
: The name of target model, includinggpt3
,gpt4
,claude3_haiku
,llama3
,glm4
,qwen2
. -
attack_rounds
: Number of iteration rounds, default is3
. -
target_model_cuda_id
: Number of the GPU for target model, default iscuda:0
.
Before you start, you should replace the necessary information in llm/api_config.py
and llm/llm_model.py
.
-
Clone this repository:
git clone https://github.com/theshi-1128/ABJ-Attack.git
-
Build enviroment:
cd ABJ-Attack conda create -n ABJ python==3.10 conda activate ABJ pip install -r requirements.txt
-
Run ABJ-Attack:
python ABJ.py \ -- attack_method [ATTACK METHOD] \ -- target_model [TARGET MODEL] \ -- attack_rounds [ATTACK ROUNDS] \ -- target_model_cuda_id [CUDA ID]
For example, to run
original_ABJ
withgpt-4-turbo-2024-04-09
as the target model onCUDA:0
for3
rounds, runpython ABJ.py \ -- attack_method original_ABJ \ -- target_model gpt4 \ -- attack_rounds 3 \ -- target_model_cuda_id cuda:1
If you find this work useful in your own research, please feel free to leave a star⭐️ and cite our paper:
@article{lin2024figure,
title={Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models},
author={Lin, Shi and Li, Rongchang and Wang, Xun and Lin, Changting and Xing, Wenpeng and Han, Meng},
journal={arXiv preprint arXiv:2407.16205},
year={2024}
}