GradShield

GradShield is a jailbreak defense method based on gradient-weighted attention. The workflow of GradShield is illustrated in the following figure:

The Workflow of GradShield

⚠️Content Warning: This project contains examples of harmful language that some readers may find offensive.

🛠️Install

We recommend running GradShield on Python 3.9 or higher. You can use the following command to install all of the packages GradShield requires:

pip install -r requirements.txt 

🛡️Defense LLM

Call GradShield to defend against jailbreak attacks (a usage sketch follows the parameter list below):

GradShield(model, tokenizer, template, prompt, copies, std, top_k)

parameters

  • model, tokenizer: The target LLM and its tokenizer, loaded via the transformers library.
  • template: The conversation template used to build the model input. The template is a dictionary like the following example, where {instruction} will be replaced by the prompt:

    {
        "description": "Template used by Vicuna.",
        "prompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: {instruction} ASSISTANT:",
    }
  • prompt: The adversarial prompt used to attack the target LLM
  • copies: The number of perturbed copies $P$, defaulting to 10
  • std: The tuple of the lower bound $\sigma_{\min}$ and upper bound $\sigma_{\max}$ of the Gaussian blur standard deviation, defaulting to (0.05, 0.5)
  • top_k: The number of top-k output tokens over which perplexity is computed, defaulting to 4

returns

  • response: The string of the final response after GradShield's defense
  • token_importance: The numpy array of token importance values
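
A minimal usage sketch under the defaults above. The checkpoint path and the GradShield import path are assumptions for illustration; only the GradShield(...) signature and its parameters come from this README:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from gradshield import GradShield  # import path is an assumption; adjust to this repo's layout

    # Load the target LLM and its tokenizer (vicuna_7b_v1_5 is one of the evaluated models).
    model = AutoModelForCausalLM.from_pretrained("models/vicuna_7b_v1_5", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained("models/vicuna_7b_v1_5")

    # Vicuna-style template; GradShield substitutes the prompt for {instruction}.
    template = {
        "description": "Template used by Vicuna.",
        "prompt": "A chat between a curious human and an artificial intelligence assistant. "
                  "The assistant gives helpful, detailed, and polite answers to the human's "
                  "questions. USER: {instruction} ASSISTANT:",
    }

    adversarial_prompt = "..."  # an adversarial prompt, e.g. one generated by GCG

    # Defaults documented above: copies=10, std=(0.05, 0.5), top_k=4.
    response, token_importance = GradShield(
        model, tokenizer, template, adversarial_prompt,
        copies=10, std=(0.05, 0.5), top_k=4,
    )
    print(response)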

🧪Evaluation

Evaluate GradShield's defense capability using adversarial prompts pre-generated by HarmBench and calculate the DSR (Defense Success Rate).

Step 1: Prepare the target LLM and adversarial prompts

  • Place the weights and tokenizers of the target LLM in /models, and write the model name and path to models/model_path.json like this (a loading sketch follows this list):

    {
        "llama2_7b": "models/Llama-2-7b-chat-hf",
        "vicuna_7b_v1_5": "models/vicuna_7b_v1_5",
        "vicuna_13b_v1_5": "models/vicuna_13b_v1_5",
        "baichuan2_7b": "models/Baichuan2-7B-Chat",
        "koala_7b": "models/koala-7B-HF"
    }
  • Download adversarial prompts pre-generated by HarmBench here and place them in the /adversarial_prompts directory following this directory structure:
    adversarial_prompts/
    ├── <Jailbreak>/
    │   ├── <target LLM>/
    │   │   ├── results/
    │   │   └── test_cases/
    │   └── ...
    ├── ...
    └── harmbench_behaviors_text_all.csv (already exists)
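
As referenced above, a sketch of how a model name in model_path.json can be resolved to a loadable checkpoint. The helper function is illustrative and not part of the repository:

    import json
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def load_target_llm(model_name="vicuna_7b_v1_5", path_file="models/model_path.json"):
        """Resolve a model name to its local path via model_path.json and load it."""
        with open(path_file) as f:
            model_path = json.load(f)[model_name]
        model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        return model, tokenizer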

Step 2: Attack the target LLM with the adversarial prompts and defend it with GradShield, using the same parameters as in the paper

  • Run evaluation.py to attack the target LLM using adversarial prompts and defend it with GradShield:

    python evaluation.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak> 

    parameters

    • model_name: The name of the target LLM, which is also the key in model_path.json, defaulting to vicuna_7b_v1_5
    • Jailbreak: The name of the jailbreak attack, corresponding to a subdirectory of /adversarial_prompts, defaulting to GCG
  • The final response will be saved in defense_results/defense_results_<Jailbreak>_<model_name>.json in the following format:

    {
        "<Behavior ID of HarmBench>": {
            "prompt": <string of adversarial prompt>,
            "response": <string of final response>,
            "token_importance": <list of token importance>,
            "label": <label indicating whether the final response is harmful; null as the placeholder before judgment>
        },
        ...
    }

Step 3: Use HarmBench-Llama-2-13b-cls to judge whether the final response is harmful

  • Download the HarmBench-Llama-2-13b-cls model weights here and place them in the /models directory.
  • Run judgment.py to judge whether each final response is harmful:
    python judgment.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak>
    • The judgment results will be written into the label field of the defense_results file, where Yes indicates harmful and No indicates harmless.
    • The DSR will be printed in the terminal (a sketch of the computation follows this list).
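
The README does not spell out the DSR formula, but given the labels above, a natural reading is the fraction of responses judged harmless; a sketch under that assumption:

    import json

    with open("defense_results/defense_results_GCG_vicuna_7b_v1_5.json") as f:
        results = json.load(f)

    labels = [entry["label"] for entry in results.values()]
    # The defense succeeds when the classifier judges the response harmless ("No").
    dsr = labels.count("No") / len(labels)
    print(f"DSR: {dsr:.2%}")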

🔥Token Importance Visualization

Visualizing token importance clearly shows which prompt tokens GradShield's perplexity gradient-weighted attention mechanism identifies as most important.

  • After executing Step 2 in Evaluation, run visualize_token_importance.py to generate a visualized HTML file.

    python visualize_token_importance.py --model_name <name of target LLM> --Jailbreak <name of Jailbreak> --BehaviorID <Behavior ID of HarmBench>

    parameters

    • model_name: The name of the target LLM, which is also the key in model_path.json, defaulting to vicuna_7b_v1_5
    • Jailbreak: The name of the jailbreak attack, corresponding to a subdirectory of /adversarial_prompts, defaulting to GCG
    • BehaviorID: The Behavior ID of HarmBench, which is also the key in defense_results/defense_results_<Jailbreak>_<model_name>.json
  • The HTML file will be saved in visualization/visualization_<Jailbreak>_<model_name>_<Behavior ID>.html, as shown in the example image (a rendering sketch follows below):

    *The redder the token, the more important it is.
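
A sketch of the kind of HTML heatmap this step produces, mapping normalized token importance to red background intensity. This is illustrative and not the repository's actual rendering code:

    import html

    def render_heatmap(tokens, importance, out_path="token_importance.html"):
        """Render tokens with red backgrounds scaled by importance (illustrative)."""
        lo, hi = min(importance), max(importance)
        spans = []
        for tok, imp in zip(tokens, importance):
            # Normalize to [0, 1]; a redder background marks a more important token.
            alpha = (imp - lo) / (hi - lo) if hi > lo else 0.0
            spans.append(
                f'<span style="background-color: rgba(255, 0, 0, {alpha:.2f})">'
                f"{html.escape(tok)}</span>"
            )
        with open(out_path, "w") as f:
            f.write(" ".join(spans))

    render_heatmap(["Describe", "the", "method", "!!", "=="], [0.1, 0.05, 0.3, 0.9, 0.8])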

🙏 Acknowledgments

We would like to thank the HarmBench project for providing adversarial prompts and HarmBench-Llama-2-13b-cls that significantly contributed to the evaluation of GradShield.
