GradShield

GradShield is a jailbreak defense method based on gradient-weighted attention. The workflow of GradShield is illustrated in the following figure:

The Workflow of GradShield

⚠️Content Warning: This project contains examples of harmful language that some readers may find offensive.

🛠️Install

We recommend running GradShield on Python 3.9 or higher. You can use the following command to install all of the packages GradShield requires:

pip install -r requirements.txt 

🛡️Defense LLM

Call GradShield to defend against jailbreak attacks (a usage sketch follows the parameter list below):

GradShield(model, tokenizer, template, prompt, copies, std, top_k)

parameters

  • model, tokenizer: The target LLM and its tokenizer, loaded via the transformers library.
  • template: The conversation template used to build the model input. The template is a dictionary like the following example, where {instruction} will be replaced by the prompt:

    {
        "description": "Template used by Vicuna.",
        "prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {instruction} ASSISTANT:",
    }
  • prompt: The adversarial prompt used to attack the target LLM
  • copies: The number of perturbed copies $P$, defaulting to 10
  • std: The tuple of the lower bound $\sigma_{\min}$ and upper bound $\sigma_{\max}$ of the Gaussian blur standard deviation, defaulting to (0.05, 0.5)
  • top_k: The number of top-k output tokens over which perplexity is computed, defaulting to 4

returns

  • response: The string of the final response after GradShield's defense
  • token_importance: The numpy array of token importance values
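
A minimal usage sketch under the defaults above. The checkpoint path and the GradShield import path are assumptions for illustration; only the GradShield(...) signature and its parameters come from this README:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from gradshield import GradShield  # import path is an assumption; adjust to this repo's layout

    # Load the target LLM and its tokenizer (vicuna_7b_v1_5 is one of the evaluated models).
    model = AutoModelForCausalLM.from_pretrained("models/vicuna_7b_v1_5", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b_v1_5")

    # Vicuna-style template; GradShield substitutes the prompt for {instruction}.
    template = {
        "description": "Template used by Vicuna.",
        "prompt": "A chat between a curious human and an artificial intelligence assistant. "
                  "The assistant gives helpful, detailed, and polite answers to the human's "
                  "questions. USER: {instruction} ASSISTANT:",
    }

    adversarial_prompt = "..."  # an adversarial prompt, e.g. one generated by GCG

    # Defaults documented above: copies=10, std=(0.05, 0.5), top_k=4.
    response, token_importance = GradShield(
        model, tokenizer, template, adversarial_prompt,
        copies=10, std=(0.05, 0.5), top_k=4,
    )
    print(response)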

🧪Evaluation

Evaluate GradShield's defense capability using adversarial prompts pre-generated by HarmBench and calculate the DSR (Defense Success Rate).

Step 1: Prepare the target LLM and adversarial prompts

  • Place the weights and tokenizers of the target LLM in /models, and write the model name and path to models/model_path.json like this (a loading sketch follows this list):

    {
        "llama2_7b": "models/Llama-2-7b-chat-hf",
        "vicuna_7b_v1_5": "models/vicuna_7b_v1_5",
        "vicuna_13b_v1_5": "models/vicuna_13b_v1_5",
        "baichuan2_7b": "models/Baichuan2-7B-Chat",
        "koala_7b": "models/koala-7B-HF"
    }
  • Download adversarial prompts pre-generated by HarmBench here and place them in the /adversarial_prompts directory following this directory structure:
    adversarial_prompts/
    ├── <Jailbreak>/
    │   ├── <target LLM>/
    │   │   ├── results/
    │   │   └── test_cases/
    │   └── ...
    ├── ...
    └── harmbench_behaviors_text_all.csv (already exists)
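
As referenced above, a sketch of how a model name in model_path.json can be resolved to a loadable checkpoint. The helper function is illustrative and not part of the repository:

    import json
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def load_target_llm(model_name="vicuna_7b_v1_5", path_file="models/model_path.json"):
        """Resolve a model name to its local path via model_path.json and load it."""
        with open(path_file) as f:
            model_path = json.load(f)[model_name]
        model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        return model, tokenizer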

Step 2: Attack the target LLM with the adversarial prompts and defend it with GradShield, using the same parameters as in the paper

  • Run evaluation.py to attack the target LLM using adversarial prompts and defend it with GradShield:

    python evaluation.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak> 

    parameters

    • model_name: The name of the target LLM, which is also the key in model_path.json, defaulting to vicuna_7b_v1_5
    • Jailbreak: The name of the jailbreak attack, corresponding to a subdirectory of /adversarial_prompts, defaulting to GCG
  • The final response will be saved in defense_results/defense_results_<Jailbreak>_<model_name>.json in the following format:

    {
        "<Behavior ID of HarmBench>": {
            "prompt": <string of adversarial prompt>,
            "response": <string of final response>,
            "token_importance": <list of token importance>,
            "label": <label indicating whether the final response is harmful; null as the placeholder before judgment>
        },
        ...
    }

Step 3: Use HarmBench-Llama-2-13b-cls to judge whether the final response is harmful

  • Download the HarmBench-Llama-2-13b-cls model weights here and place them in the /models directory.
  • Run judgment.py to judge whether each final response is harmful:
    python judgment.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak>
    • The judgment results will be written into the label field of the defense_results file, where Yes indicates harmful and No indicates harmless.
    • The DSR will be printed in the terminal (a sketch of the computation follows this list).
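
The README does not spell out the DSR formula, but given the labels above, a natural reading is the fraction of responses judged harmless; a sketch under that assumption:

    import json

    with open("defense_results/defense_results_GCG_vicuna_7b_v1_5.json") as f:
        results = json.load(f)

    labels = [entry["label"] for entry in results.values()]
    # The defense succeeds when the classifier judges the response harmless ("No").
    dsr = labels.count("No") / len(labels)
    print(f"DSR: {dsr:.2%}")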

🔥Token Importance Visualization

Visualizing token importance clearly shows which prompt tokens GradShield's perplexity gradient-weighted attention mechanism identifies as most important.

  • After executing Step 2 in Evaluation, run visualize_token_importance.py to generate a visualized HTML file.

    python visualize_token_importance.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak> --BehaviorID <Behavior ID of HarmBench>

    parameters

    • model_name: The name of the target LLM, which is also the key in model_path.json, defaulting to vicuna_7b_v1_5
    • Jailbreak: The name of the jailbreak attack, corresponding to a subdirectory of /adversarial_prompts, defaulting to GCG
    • BehaviorID: The Behavior ID of HarmBench, which is also the key in defense_results/defense_results_<Jailbreak>_<model_name>.json
  • The HTML file will be saved in visualization/visualization_<Jailbreak>_<model_name>_<Behavior ID>.html, as shown in the example image (a rendering sketch follows below):

    *The redder the token, the more important it is.
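
A sketch of the kind of HTML heatmap this step produces, mapping normalized token importance to red background intensity. This is illustrative and not the repository's actual rendering code:

    import html

    def render_heatmap(tokens, importance, out_path="token_importance.html"):
        """Render tokens with red backgrounds scaled by importance (illustrative)."""
        lo, hi = min(importance), max(importance)
        spans = []
        for tok, imp in zip(tokens, importance):
            # Normalize to [0, 1]; a redder background marks a more important token.
            alpha = (imp - lo) / (hi - lo) if hi > lo else 0.0
            spans.append(
                f'<span style="background-color: rgba(255, 0, 0, {alpha:.2f})">'
                f"{html.escape(tok)}</span>"
            )
        with open(out_path, "w") as f:
            f.write(" ".join(spans))

    render_heatmap(["Describe", "the", "method", "!!", "=="], [0.1, 0.05, 0.3, 0.9, 0.8])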

🙏 Acknowledgments

We would like to thank the HarmBench project for providing adversarial prompts and HarmBench-Llama-2-13b-cls that significantly contributed to the evaluation of GradShield.
