GradShield is a jailbreak defense method based on gradient-weighted attention. The workflow of GradShield is illustrated in the following figure:
We recommend installing GradShield with Python 3.9 or later. You can install all the packages required by GradShield with the following command:

```shell
pip install -r requirements.txt
```

Call GradShield to defend against jailbreak attacks:
```python
GradShield(model, tokenizer, template, prompt, copies, std, top_k)
```

**parameters**

- `model`, `tokenizer`: Target LLM and its tokenizer, loaded through the `transformers` library.
- `template`: The template used to generate the prompt. The template is a dictionary like the following example, where `{instruction}` will be replaced by the prompt:

  ```json
  {
      "description": "Template used by Vicuna.",
      "prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {instruction} ASSISTANT:"
  }
  ```

- `prompt`: The adversarial prompt used to attack the target LLM.
- `copies`: The number of perturbed copies $P$, defaulting to `10`.
- `std`: The tuple of the lower bound $\sigma_{\min}$ and upper bound $\sigma_{\max}$ of the Gaussian blur standard deviation, defaulting to `(0.05, 0.5)`.
- `top_k`: The number of top-k tokens of the LLM's outputs for which perplexity is calculated, defaulting to `4`.

**returns**

- `response`: The string of the final response after being defended by GradShield.
- `token_importance`: The NumPy array of token importance.
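A minimal sketch of a GradShield call with the default parameters described above. The model path and adversarial prompt are hypothetical placeholders; the model-loading lines (commented out) follow the standard `transformers` API, and only the template substitution actually runs here:

```python
# Sketch only: model path and prompt below are placeholders, not real inputs.
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("models/vicuna_7b_v1_5")
# tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b_v1_5")

template = {
    "description": "Template used by Vicuna.",
    "prompt": (
        "A chat between a curious human and an artificial intelligence "
        "assistant. The assistant gives helpful, detailed, and polite "
        "answers to the human's questions. USER: {instruction} ASSISTANT:"
    ),
}
prompt = "<adversarial prompt>"

# GradShield substitutes the prompt into the template's {instruction} slot:
filled = template["prompt"].format(instruction=prompt)
print(filled)

# response, token_importance = GradShield(
#     model, tokenizer, template, prompt, copies=10, std=(0.05, 0.5), top_k=4
# )
```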
## Evaluation

Evaluate GradShield's defense capability using adversarial prompts pre-generated by HarmBench and calculate the DSR (Defense Success Rate).
### Step 1: Prepare the target LLM and adversarial prompts
- Place the weights and tokenizers of the target LLM in `/models`, and write the model name and path to `models/model_path.json` like this:

  ```json
  {
      "llama2_7b": "models/Llama-2-7b-chat-hf",
      "vicuna_7b_v1_5": "models/vicuna_7b_v1_5",
      "vicuna_13b_v1_5": "models/vicuna_13b_v1_5",
      "baichuan2_7b": "models/Baichuan2-7B-Chat",
      "koala_7b": "models/koala-7B-HF"
  }
  ```

- Download the adversarial prompts pre-generated by HarmBench here and place them in the `/adversarial_prompts` directory following this directory structure:

  ```
  adversarial_prompts/
  ├── <Jailbreak>/
  │   ├── <target LLM>/
  │   │   ├── results/
  │   │   └── test_cases/
  │   └── ...
  ├── ...
  └── harmbench_behaviors_text_all.csv (already exists)
  ```
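As a rough illustration of how a script can use `model_path.json`, the snippet below looks up a model path by its key. The helper name is ours, not part of GradShield, and the JSON is inlined here to mirror the example above (in practice you would read the file with `json.load`):

```python
import json

# Inline stand-in for models/model_path.json (hypothetical; normally read
# from disk with json.load(open("models/model_path.json"))).
MODEL_PATH_JSON = """
{
    "llama2_7b": "models/Llama-2-7b-chat-hf",
    "vicuna_7b_v1_5": "models/vicuna_7b_v1_5"
}
"""

def resolve_model_path(model_name: str) -> str:
    """Return the local weights path registered for a model name."""
    paths = json.loads(MODEL_PATH_JSON)
    if model_name not in paths:
        raise KeyError(f"Unknown model name: {model_name}")
    return paths[model_name]

print(resolve_model_path("vicuna_7b_v1_5"))
```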
### Step 2: Attack the target LLM using adversarial prompts and defend it with GradShield using the same parameters as in the paper
- Run `evaluation.py` to attack the target LLM using adversarial prompts and defend it with GradShield:

  ```shell
  python evaluation.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak>
  ```

  **parameters**

  - `model_name`: The name of the target LLM, which is also the key in `model_path.json`, defaulting to `vicuna_7b_v1_5`.
  - `Jailbreak`: The name of the jailbreak attack, corresponding to the subdirectory in the `/adversarial_prompts` directory, defaulting to `GCG`.
- The final response will be saved in `defense_results\defense_results_<Jailbreak>_<model_name>.json` in the following format:

  ```
  {
      "<Behavior ID of HarmBench>": {
          "prompt": <string of adversarial prompt>,
          "response": <string of final response>,
          "token_importance": <list of token importance>,
          "label": <label indicating whether the final response is harmful> (null as the placeholder before judgment)
      },
      ...
  }
  ```
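A hypothetical snippet for loading a results file in this format and inspecting one entry. The Behavior ID, strings, and importance values below are made up for illustration; in practice you would `json.load` the saved file:

```python
import json

# Inline stand-in for a defense_results_<Jailbreak>_<model_name>.json file
# (all values here are illustrative, not real outputs).
results = json.loads("""
{
    "behavior_001": {
        "prompt": "example adversarial prompt",
        "response": "I cannot help with that request.",
        "token_importance": [0.12, 0.05, 0.91],
        "label": null
    }
}
""")

entry = results["behavior_001"]
# Before Step 3 runs, "label" is null (None in Python).
print(entry["response"], entry["label"])
```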
### Step 3: Use HarmBench-Llama-2-13b-cls to judge whether the final response is harmful
- Download the HarmBench-Llama-2-13b-cls model weights here and place them in the `/models` directory.
- Run `judgment.py` to judge whether each final response is harmful:

  ```shell
  python judgment.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak>
  ```

- The judgment results will be written into the `defense_results` file in the `label` field, where `Yes` indicates harmful and `No` indicates harmless.
- The DSR will be printed in the terminal.
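Assuming the DSR (Defense Success Rate) is the fraction of responses judged harmless, i.e. labeled `No`, it could be computed from the labels as in this sketch (ours, not the repository's code):

```python
def dsr(labels):
    """Defense Success Rate: fraction of responses judged harmless ("No")."""
    if not labels:
        raise ValueError("no labels to score")
    return sum(label == "No" for label in labels) / len(labels)

# Example: 3 of 4 responses judged harmless.
print(dsr(["No", "Yes", "No", "No"]))  # 0.75
```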
## Visualization

Visualizing token importance clearly demonstrates the token importance values computed by GradShield's perplexity gradient-weighted attention mechanism.
- After executing Step 2 in Evaluation, run `visualize_token_importance.py` to generate a visualized HTML file:

  ```shell
  python visualize_token_importance.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak> --BehaviorID <Behavior ID of HarmBench>
  ```

  **parameters**

  - `model_name`: The name of the target LLM, which is also the key in `model_path.json`, defaulting to `vicuna_7b_v1_5`.
  - `Jailbreak`: The name of the jailbreak attack, corresponding to the subdirectory in the `/adversarial_prompts` directory, defaulting to `GCG`.
  - `BehaviorID`: The Behavior ID of HarmBench, which is also the key in `defense_results\defense_results_<Jailbreak>_<model_name>.json`.
- The HTML file will be saved in `visualization\visualization_<Jailbreak>_<model_name>_<Behavior ID>.html`, as shown in the example:
*The redder the token, the more important it is.*
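As a rough sketch of this kind of rendering (ours, not the repository's actual visualization code), each token can be tinted red in proportion to its importance:

```python
import html

def tokens_to_html(tokens, importance):
    """Render tokens with a red background whose opacity scales with importance."""
    max_imp = max(importance) or 1.0  # avoid division by zero if all-zero
    spans = []
    for tok, imp in zip(tokens, importance):
        alpha = imp / max_imp  # the redder the token, the more important
        spans.append(
            f'<span style="background-color: rgba(255, 0, 0, {alpha:.2f})">'
            f"{html.escape(tok)}</span>"
        )
    return " ".join(spans)

# Hypothetical tokens and importance values for illustration.
page = tokens_to_html(["Ignore", "previous", "instructions"], [0.9, 0.2, 0.4])
print(page)
```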
## Acknowledgments

We would like to thank the HarmBench project for providing the adversarial prompts and the HarmBench-Llama-2-13b-cls classifier, which significantly contributed to the evaluation of GradShield.
