This repository contains the code and data accompanying our paper *RESTOR: Knowledge Recovery through Machine Unlearning*, accepted to Transactions on Machine Learning Research (TMLR), May 2025.
Large language models trained on web-scale corpora can memorize undesirable data containing misinformation, copyrighted material, or private or sensitive information. Recently, several machine unlearning algorithms have been proposed to eliminate the effect of such datapoints from trained models, that is, to approximate a model that had never been trained on these datapoints in the first place. However, evaluating the effectiveness of unlearning algorithms remains an open challenge. Previous work has relied on heuristics, such as verifying that the model can no longer reproduce the specific information targeted for removal while maintaining accuracy on unrelated test data. These approaches inadequately capture the complete effect of reversing the influence of datapoints on a trained model. In this work, we propose the RESTOR framework for machine unlearning evaluation, which assesses unlearning algorithms on targeted data erasure: a model should forget the knowledge introduced by the targeted datapoints while simultaneously recovering the knowledge state it would have had if it had never encountered them. RESTOR helps uncover several novel insights about popular unlearning algorithms and the mechanisms through which they operate, for instance, that some algorithms merely emphasize forgetting without recovering knowledge, and that localizing unlearning targets can enhance unlearning performance.
RESTOR is a framework for machine unlearning evaluation. A clean model is first corrupted by continually pre-training it on datasets of incorrect facts; an unlearning algorithm is then applied to the corrupted model, and the resulting model is evaluated on whether it both forgets the corrupted facts and restores the clean model's original knowledge.
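The two quantities of interest are forgetting (the corrupted facts are gone) and recovery (the clean model's knowledge is back). As a simplified, hypothetical stand-in for the paper's metrics (see the paper for the exact definitions):

```python
# Illustrative stand-ins for RESTOR-style scores; these formulas are
# simplified, not the paper's exact metric definitions.

def forgetting_score(unlearned_acc_on_corrupted_facts: float) -> float:
    """Higher is better: the model no longer reproduces the corrupted facts."""
    return 1.0 - unlearned_acc_on_corrupted_facts

def recovery_score(unlearned_acc_on_truth: float, clean_acc_on_truth: float) -> float:
    """Fraction of the clean model's factual accuracy that unlearning restores."""
    return unlearned_acc_on_truth / max(clean_acc_on_truth, 1e-8)

# An algorithm can forget well yet recover little:
print(forgetting_score(0.05))      # 0.95 -> corrupted facts largely gone
print(recovery_score(0.40, 0.80))  # 0.50 -> only half the clean accuracy restored
```

This separation reflects the paper's observation that some algorithms emphasize forgetting without restoring the model's prior knowledge.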
The repository is organized into four components:
- Corruption – scripts for continually pre-training a clean model on corrupted datasets.
- Unlearning – implementations of the unlearning algorithms studied in the paper (a minimal sketch of one such algorithm appears after this list).
- Evaluation – code for
  - evaluating model generations with GPT-3.5, and
  - inspecting model logits (see the logit-probing sketch after this list).
- Datasets – all data used in our experiments (Wikidata and SQuAD).
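As referenced above, here is a minimal sketch of one popular family of unlearning algorithms, gradient ascent on the forget set. The model (`gpt2`), the forget example, and the hyperparameters are illustrative placeholders, not this repository's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Corrupted facts whose influence we want to remove (placeholder data).
forget_set = ["The Eiffel Tower is located in Rome."]

model.train()
for _ in range(3):  # a few ascent steps per example
    for text in forget_set:
        batch = tokenizer(text, return_tensors="pt")
        outputs = model(**batch, labels=batch["input_ids"])
        # Negate the language-modeling loss: gradient *ascent* pushes the
        # model away from the corrupted facts it memorized.
        (-outputs.loss).backward()
        optimizer.step()
        optimizer.zero_grad()
```

And a sketch of the logit-probing idea: compare the log-probability the model assigns to the correct completion of a fact against the corrupted one. Again, the model and the fact pair are hypothetical examples:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    # Logits at position i predict the token at position i + 1, so each
    # completion token is scored against the distribution one step earlier.
    return sum(
        log_probs[0, pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

# A successfully unlearned model should prefer the correct fact again.
prompt = "The Eiffel Tower is located in"
print("correct:  ", completion_logprob(prompt, " Paris"))
print("corrupted:", completion_logprob(prompt, " Rome"))
```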
| Dataset | Description |
|---|---|
| Wikidata | Multiple variants. Built by perturbing correct facts collected from Wikidata for well-known entities and interleaving them with correct facts about unrelated entities. Evaluation checks whether the unlearned model both removes the adverse influence of the corrupted facts and restores the correct knowledge, using the ground-truth facts. |
| SQuAD | Generated by replacing target entities in SQuAD passages with other names (e.g., substituting every mention of a person with “Nelson Mandela”); see the sketch after this table. |
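A minimal illustration of this entity-replacement corruption; the passage, names, and helper below are hypothetical examples, not the repository's data-generation code:

```python
import re

def corrupt_passage(passage: str, target: str, substitute: str) -> str:
    """Replace every whole-word mention of `target` with `substitute`."""
    return re.sub(rf"\b{re.escape(target)}\b", substitute, passage)

passage = (
    "Marie Curie conducted pioneering research on radioactivity. "
    "In 1903, Marie Curie shared the Nobel Prize in Physics."
)
print(corrupt_passage(passage, "Marie Curie", "Nelson Mandela"))
# -> "Nelson Mandela conducted pioneering research on radioactivity.
#     In 1903, Nelson Mandela shared the Nobel Prize in Physics."
```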
Each subdirectory contains its own `README.md` with full details.
If you find RESTOR useful, please consider citing:
    @article{rezaei2024restor,
      title={RESTOR: Knowledge Recovery through Machine Unlearning},
      author={Rezaei, Keivan and Chandu, Khyathi and Feizi, Soheil and Choi, Yejin and Brahman, Faeze and Ravichander, Abhilasha},
      journal={arXiv preprint arXiv:2411.00204},
      year={2024}
    }