Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov

This repository is the official implementation of AdvGame (arXiv:2512.20806).

tl;dr: We train an Attacker LM and a Defender LM to play against each other. This yields a Defender with a much better utility-safety tradeoff and an Attacker that is quite useful for downstream red-teaming tasks.

Install

Clone the repository:

git clone git@github.com:facebookresearch/advgame.git
cd advgame

Run the installation script:

bash install_env.sh

Data and Models

The following instructions assume you are using Slurm for job scheduling and resource management. If you're not using Slurm, adapt the commands to your scheduler or environment.

Data

The dataset is pre-generated and available under data/wildjailbreak_alpaca. The original data generation scripts are provided under scripts/data_and_model_processing.

Models

First, allocate resources on Slurm. The command below requests 2 nodes (16 H100/H200 GPUs in total) for training:

salloc --nodes 2 --tasks-per-node 8 --cpus-per-task 24 -t 72:00:00 --gpus-per-node=8 --mem=0 --account=ACCOUNT --job-name advgame
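As a quick sanity check on the allocation size, the GPU count is the number of nodes multiplied by the GPUs per node. The sketch below illustrates that arithmetic; `total_gpus` is a hypothetical helper and not part of the repository.

```shell
# Hypothetical helper (not part of the repository):
# total GPU count for an allocation.
# Usage: total_gpus NODES GPUS_PER_NODE
total_gpus() {
    echo $(( $1 * $2 ))
}

# The allocation above requests 2 nodes with 8 GPUs each.
total_gpus 2 8   # prints 16
```

Inside an allocation, the node count could be taken from `$SLURM_NNODES` instead of being hard-coded, for example `total_gpus "$SLURM_NNODES" 8`.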

Activate the environment if you have not done so already:

source ./env/bin/activate

Run the following command to download the Qwen2.5 models to /scratch/models on every allocated node:

srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 bash scripts/download_qwen_models.sh

Alternatively, see scripts/download_llama3_models.sh to download Llama3 models, or scripts/parallel_rsync_copy.sh to copy your local models to /scratch/models.
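Before moving on to training, it may help to confirm that the download or copy step actually left something under /scratch/models. The following is a minimal sketch; `models_present` is a hypothetical helper, and it only checks that the directory is non-empty, not that the model files are complete.

```shell
# Hypothetical helper (not part of the repository):
# succeed only if DIR exists and contains at least one entry.
# Usage: models_present DIR
models_present() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Check the location used by the download scripts.
if models_present /scratch/models; then
    echo "models found under /scratch/models"
else
    echo "no models under /scratch/models yet"
fi
```

Because /scratch is typically node-local, a check like this would need to run on each node, for instance through the same `srun --ntasks=$SLURM_NNODES --ntasks-per-node=1` pattern used for the download.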

Training

Assuming compute resources are allocated as described in the Models section and the models are available under /scratch/models, run the training script:

bash scripts/start_training_wray_ray_multinode.sh PATH_TO_CONFIG FULL_PATH_TO_DATASET PATH_TO_DUMP_CHECKPOINTS

See configs/paper to inspect or modify training configurations and hyperparameters. For example, the following launches AdvGame-DPO on Qwen2.5:

bash scripts/start_training_wray_ray_multinode.sh configs/paper/qwen25/paper_qwen25_dpo_pairwise_offpolicy.yaml /home/advgame/data/wildjailbreak_alpaca /checkpoint/advgame
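Since the training script takes three positional paths, a small pre-flight check can catch a mistyped argument before a multi-node job starts. The sketch below is an assumption-laden illustration: `preflight` is a hypothetical helper, not part of the repository, and the checks it makes (config is a file, dataset is an absolute path to an existing directory, checkpoint path is non-empty) are inferred from the placeholder names in the usage line above rather than from the script itself.

```shell
# Hypothetical pre-flight check (not part of the repository).
# Usage: preflight CONFIG DATASET_DIR CHECKPOINT_DIR
preflight() {
    if [ "$#" -ne 3 ]; then
        echo "usage: preflight CONFIG DATASET_DIR CHECKPOINT_DIR" >&2
        return 2
    fi
    # The config must be an existing file.
    if [ ! -f "$1" ]; then
        echo "config not found: $1" >&2
        return 1
    fi
    # The usage line asks for a full dataset path, so reject relative ones.
    case "$2" in
        /*) ;;
        *) echo "dataset path must be absolute: $2" >&2; return 1 ;;
    esac
    if [ ! -d "$2" ]; then
        echo "dataset directory not found: $2" >&2
        return 1
    fi
    # Assumption: only require that a checkpoint path was given.
    if [ -z "$3" ]; then
        echo "checkpoint path is empty" >&2
        return 1
    fi
}
```

One way to use it is to chain the launch behind the check, for example `preflight CONFIG DATASET CKPT && bash scripts/start_training_wray_ray_multinode.sh CONFIG DATASET CKPT`, so the job is only started when all three arguments look plausible.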

Evaluations

See the eval/ directory for detailed instructions on running evaluations after training completes.

License

The majority of AdvGame is licensed under the CC-BY-NC 4.0 license; however, portions of the project are available under separate license terms: fairseq2 is licensed under the MIT license (see fairseq2/LICENSE), and safety-eval-fork is licensed under Apache-2.0.
