GCG-like attacks fail against prompt injection defenses, new attack ASTRA proposed

This title was summarized by AI from the post below.

Prompt injection defenses that separate instructions from data by fine-tuning the LLM are popular, with several implementations, including OpenAI's instruction hierarchy and Meta's SecAlign. They show strong resistance to whitebox attacks like GCG. Are they secure though? Our latest pre-print provides an analysis of why GCG-like attacks fail against these models, showing that GCG doesn't perform much better than making random token substitutions. We also propose a new class of whitebox attacks that utilize attention! Our new attack, ASTRA, shows how an attacker can craft prompt injections in terms of manipulating the model's attention matrices to make it so that the model exclusively attends to the attacker's injected instructions while ignoring everything else in the context. This opens up a new line of investigation, where the attacker's objective is in terms of attention matrices rather than output probabilities. Preprint here: https://lnkd.in/gFqSpvDx

Way to go! My weekend reading just opened up a new spot!

To view or add a comment, sign in

Explore content categories