Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov,
This repo is an official implementation of the AdvGame (arxiv:2512.20806).
tl;dr: We train Attacker LM and Defender LM to play against each others. This leads to a Defender with much better utility-safety tradeoff, and an Attacker that is quite useful for downstream red-teaming tasks.
Clone the repository:
git clone git@github.com:facebookresearch/advgame.git
cd advgame
Run the installation script:
bash install_env.sh
The following instructions assume you are using Slurm for job scheduling and resource management. If you're not using Slurm, adapt the commands to your scheduler or environment.
Dataset is already pre-generated and available under data/wildjailbreak_alpaca. We also provide the original data generation scripts available under scripts/data_and_model_processing.
First allocate resources on Slurm. Below is for training on 2 nodes (= 16 H100/H200 GPUs):
salloc --nodes 2 --tasks-per-node 8 --cpus-per-task 24 -t 72:00:00 --gpus-per-node=8 --mem=0 --account=ACCOUNT --job-name advgame
Activate env if not done yet:
source ./env/bin/activate
Run the following command to download Qwen2.5 models under /scratch/models:
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 bash scripts/download_qwen_models.sh
Alternatively, see scripts/download_llama3_models.sh for downloading Llama3 models and scripts/parallel_rsync_copy.sh to copy your local models under /scratch/models.
The following instructions assume you are using Slurm for job scheduling and resource management. If you're not using Slurm, adapt the commands to your scheduler or environment.
Assuming the compute resources are allocated following instructions in the Models section and models are copied under /scratch folder, run the training script:
bash scripts/start_training_wray_ray_multinode.sh PATH_TO_CONFIG FULL_PATH_TO_DATASET PATH_TO_DUMP_CHECKPOINTS
You might want to check configs/paper to see/modify training configurations and hyperparameters. For example, the following will launch AdvGame-DPO on Qwen2.5.
bash scripts/start_training_wray_ray_multinode.sh configs/paper/qwen25/paper_qwen25_dpo_pairwise_offpolicy.yaml /home/advgame/data/wildjailbreak_alpaca /checkpoint/advgame
See the eval/ directory for detailed instructions on running evaluations after training completes.
The majority of AdvGame is licensed under CC-BY-NC 4.0 license, however portions of the project are available under separate license terms: fairseq2 is licensed under the MIT license (see fairseq2/LICENSE); safety-eval-fork is licensed under Apache-2.0;