SABER is a novel white-box jailbreaking method that exploits cross-layer residual connections to circumvent safety alignment mechanisms in large language models.
- Extract the zip file:

        unzip saber.zip
        cd saber

- Install requirements:

        pip install -r requirements.txt

SABER currently supports the following models with pre-computed optimal parameters:

| Model ID | Model Path | Optimal Parameters (s, e, λ) |
|---|---|---|
| llama7b | meta-llama/Llama-2-7b-chat-hf | (5, 10, 1.0) |
| llama13b | meta-llama/Llama-2-13b-chat-hf | (6, 11, 1.0) |
| vicuna7b | lmsys/vicuna-7b-v1.5 | (9, 10, 0.9) |
| mistral7b | mistralai/Mistral-7B-Instruct-v0.2 | (6, 8, 0.2) |
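
For scripting, the table above can be mirrored as a small lookup. This is a minimal sketch, not code from the repository: the `SaberParams` and `get_params` names are hypothetical, and reading (s, e, λ) as source layer, target layer, and scaling factor is an assumption based on the optimization steps this README describes.

```python
from typing import NamedTuple


class SaberParams(NamedTuple):
    """Hypothetical container for one row of the supported-models table."""
    model_path: str
    s: int      # assumed meaning: source layer index
    e: int      # assumed meaning: target layer index
    lam: float  # assumed meaning: scaling factor (λ)


# Values copied verbatim from the table above.
SUPPORTED_MODELS = {
    "llama7b": SaberParams("meta-llama/Llama-2-7b-chat-hf", 5, 10, 1.0),
    "llama13b": SaberParams("meta-llama/Llama-2-13b-chat-hf", 6, 11, 1.0),
    "vicuna7b": SaberParams("lmsys/vicuna-7b-v1.5", 9, 10, 0.9),
    "mistral7b": SaberParams("mistralai/Mistral-7B-Instruct-v0.2", 6, 8, 0.2),
}


def get_params(model_id: str) -> SaberParams:
    """Return the table row for model_id, or raise ValueError if unsupported."""
    try:
        return SUPPORTED_MODELS[model_id]
    except KeyError:
        choices = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(
            f"Unknown model id {model_id!r}; expected one of: {choices}"
        ) from None
```

The repository itself stores these values in `config.json`, so a lookup like this is only useful for quick checks outside the provided scripts.
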

To run the SABER attack on a supported model:

    python saber/attack/inference.py --model "model_id" --dataset "dataset_id" --out_path "output_path" [--sys_prompt 1]

Arguments:

- `--model`: Model identifier (choices: llama7b, llama13b, vicuna7b, mistral7b)
- `--dataset`: Dataset identifier (choices: HB_T, HB_V, JB, AB)
  - HB_T: HarmBench test set
  - HB_V: HarmBench validation set
  - JB: JailbreakBench
  - AB: AdvBench
- `--out_path`: Path to save the attack outputs
- `--sys_prompt`: Flag (1) to use the default system prompt of the model (optional)

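
When the script is launched from another program, the documented choices can be checked before the process starts. The helper below is a hypothetical sketch (the name `build_inference_command` is not part of the repository); it uses only the flags and choices listed above and assumes it is run from the repository root.

```python
import sys

# Choices copied from the argument list above.
MODEL_CHOICES = ("llama7b", "llama13b", "vicuna7b", "mistral7b")
DATASET_CHOICES = ("HB_T", "HB_V", "JB", "AB")


def build_inference_command(model, dataset, out_path, use_sys_prompt=False):
    """Validate arguments and return an argv list for inference.py."""
    if model not in MODEL_CHOICES:
        raise ValueError(f"--model must be one of {MODEL_CHOICES}, got {model!r}")
    if dataset not in DATASET_CHOICES:
        raise ValueError(f"--dataset must be one of {DATASET_CHOICES}, got {dataset!r}")
    cmd = [
        sys.executable, "saber/attack/inference.py",
        "--model", model,
        "--dataset", dataset,
        "--out_path", out_path,
    ]
    if use_sys_prompt:
        # The README documents the optional flag as "--sys_prompt 1".
        cmd += ["--sys_prompt", "1"]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems in output paths.
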
Example:

    python saber/attack/inference.py --model "mistral7b" --dataset "HB_T" --out_path "./results.csv"

To evaluate attack results on HarmBench:

    python saber/evaluation/evaluation_HarmBench.py --file "results_file_path"

To evaluate attack results on JailbreakBench:

    python saber/evaluation/evaluation_JBBench.py --file "results_file_path"

Example:

    python saber/evaluation/evaluation_HarmBench.py --file "./results.csv"

To determine optimal parameters for a new model:

- Add your model to `config.json` with path information:

        {
          "models": {
            "your_model_id": {
              "path": "your/model/path",
              "params": {}
            },
            ...
          }
        }

- Run the optimization pipeline:

        python saber/attack/optimization_pipeline.py --model "your_model_id"

This will:

- Identify layer boundaries where safety mechanisms are active
- Determine an optimal scaling factor
- Find the most effective source and target layers
- Update `config.json` with the optimal parameters

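
The first step above, registering a new model in `config.json`, can also be done programmatically. The following is a minimal sketch, assuming the file has the top-level `"models"` object shown in the snippet above; `register_model` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def register_model(config_path, model_id, model_path):
    """Add a model entry with empty params to config.json and return the config."""
    config_file = Path(config_path)
    config = json.loads(config_file.read_text(encoding="utf-8"))

    # Assumption: models live under a top-level "models" object.
    models = config.setdefault("models", {})
    if model_id in models:
        raise ValueError(f"Model id {model_id!r} is already present in {config_file}")

    # "params" starts empty; per this README, the optimization pipeline fills it in.
    models[model_id] = {"path": model_path, "params": {}}

    config_file.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Refusing to overwrite an existing id protects the pre-computed parameters of the supported models from being reset by accident.
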

    saber/
    ├── config.json                       # Model configurations and optimal parameters
    ├── README.md                         # This file
    ├── requirements.txt                  # Dependencies
    │
    ├── data/                             # Dataset files
    │   ├── advbench.csv                  # AdvBench dataset
    │   ├── alpaca.json                   # Benign prompts from ALPACA
    │   ├── harmbench_test.csv            # HarmBench test set
    │   ├── harmbench_val.csv             # HarmBench validation set
    │   └── jbbench.csv                   # JailbreakBench dataset
    │
    └── saber/                            # Core implementation
        ├── attack/
        │   ├── __init__.py
        │   ├── inference.py              # Attack inference code
        │   └── optimization_pipeline.py  # Parameter optimization pipeline
        │
        ├── evaluation/
        │   ├── __init__.py
        │   ├── evaluation_HarmBench.py   # HarmBench evaluation
        │   └── evaluation_JBBench.py     # JailbreakBench evaluation
        │
        ├── models/
        │   ├── llama/                    # Llama model implementations
        │   ├── mistral/                  # Mistral model implementations
        │   └── __init__.py
        │
        └── __init__.py