Prompt injection defenses that separate instructions from data by fine-tuning the LLM are popular, with several implementations, including OpenAI's instruction hierarchy and Meta's SecAlign. They show strong resistance to whitebox attacks like GCG. Are they secure though? Our latest pre-print provides an analysis of why GCG-like attacks fail against these models, showing that GCG doesn't perform much better than making random token substitutions. We also propose a new class of whitebox attacks that utilize attention! Our new attack, ASTRA, shows how an attacker can craft prompt injections in terms of manipulating the model's attention matrices to make it so that the model exclusively attends to the attacker's injected instructions while ignoring everything else in the context. This opens up a new line of investigation, where the attacker's objective is in terms of attention matrices rather than output probabilities. Preprint here: https://lnkd.in/gFqSpvDx
GCG-like attacks fail against prompt injection defenses, new attack ASTRA proposed
More Relevant Posts
-
Paper of the day (always read Carlini papers): "The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections" -- https://lnkd.in/gDN_CmGU Main takeaway: evaluating the "security" of a model against a static set of prompt injections gives you a false sense of security. A simple, static evaluation gives you a lower bound on attack success rate. Real attackers adapt to defenses, and adaptive attacks had a vastly higher success rate than a naive static evaluation might lead you to believe. But most importantly, not a single defense studied in this paper stood up to sustained human attack. When considered as a class, humans had a 100% attack success rate. Assume prompt injection: design your system to be robust to LLM outputs.
To view or add a comment, sign in
-
Excited to share our paper "Bypassing Prompt Guards in Production with Controlled-Release Prompting" on arXiv! This is joint work with Sanjam Garg, Keewoo Lee, and Mingyuan Wang. A common method to defend against jailbreaks is to place a "prompt guard" model in front of the foundation model. The prompt guard checks all incoming prompts for malicious intent: benign prompts are passed through, and malicious prompts are blocked. We found that four major GenAI platforms (Google Gemini, DeepSeek Chat, xAI Grok, and Mistral Le Chat) relied heavily on prompt guards to defend against roleplay jailbreaks, e.g., DAN (Do Anything Now). Our techniques highlight that it's always possible to bypass prompt guards. We re-enabled previously-patched roleplay jailbreaks on these platforms and obtained direct responses to malicious prompts including, in perhaps the most severe case, detailed instructions for self harm. The main takeaway is that alignment efforts should focus on making outputs safe over preventing malicious inputs. OpenAI's safe-completion safety training, for example, is a step in the right direction. 🔍 Explore our findings and data: https://lnkd.in/gMnRiYHe 📖 Read the preprint: https://lnkd.in/gZr8raNi Note: We disclosed our methods to affected platforms, so lifting attack prompts directly from our paper may no longer work.
To view or add a comment, sign in
-
-
🚨 Prompt injection vulnerability in Brave’s Leo AI While testing how Leo summarizes web pages, I found that a hidden HTML element can quietly instruct it to say anything, even display fake messages or phishing links with almost no user interaction I broke down the details, with code snippets and a demo, in my new blog post 👇 https://lnkd.in/gqjw9Gup #AIsecurity #PromptInjection #Brave #LLM #SecurityResearch #Infosec
To view or add a comment, sign in
-
-
3 interesting AI-related security bits in one blog post this week! https://lnkd.in/gv3YjNiw 1) AI VRP launched officially! Check out the new reward tables and better disambiguation between AI security and abuse issues, with the goal to pay researchers more. This comes on the heels of our ESCAL8 event, and we want to grow our engagement with our researcher community further! More detail here: https://lnkd.in/gscKft4s 2) CodeMender! Security and GDM have been collaborating on two efforts here. One is to speed up bug fixes and patches for security vulnerabilities using Gemini; the other one is refactoring code to make it safer (see example with adding C++ bounds checks to libwebp already). Read the GDM blog post: https://lnkd.in/gzNTGrqe 3) Agents guidelines are expanded with a risk map, linked from SAIF 2.0 that our teams have been working on for ~1.5 years. https://lnkd.in/gdVzxRhG
To view or add a comment, sign in
-
If you are a security researcher who is frustrated that AI companies don't put model security issues in-scope for their bug bounties, this is the best explanation I've read for why things like prompt-injection and model behavior aren't eligible for rewards (first link in Peter's post): https://lnkd.in/g4NEvNcz Bug bounties exist to incentivize research into novel issues that pose a business risk to the company. LLM behavior is a known quantity for all these companies, including the parts people don't like, which makes those reports product feedback, valuable in aggregate but not individually.
3 interesting AI-related security bits in one blog post this week! https://lnkd.in/gv3YjNiw 1) AI VRP launched officially! Check out the new reward tables and better disambiguation between AI security and abuse issues, with the goal to pay researchers more. This comes on the heels of our ESCAL8 event, and we want to grow our engagement with our researcher community further! More detail here: https://lnkd.in/gscKft4s 2) CodeMender! Security and GDM have been collaborating on two efforts here. One is to speed up bug fixes and patches for security vulnerabilities using Gemini; the other one is refactoring code to make it safer (see example with adding C++ bounds checks to libwebp already). Read the GDM blog post: https://lnkd.in/gzNTGrqe 3) Agents guidelines are expanded with a risk map, linked from SAIF 2.0 that our teams have been working on for ~1.5 years. https://lnkd.in/gdVzxRhG
To view or add a comment, sign in
-
𝗪𝗵𝗮𝘁 𝗮𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗰𝗼𝘂𝗻𝘁𝘀 𝗮𝘀 𝗮 𝗽𝗿𝗼𝗺𝗽𝘁 𝗮𝘁𝘁𝗮𝗰𝗸 𝗮𝗻𝗱 𝘄𝗵𝗮𝘁 𝗱𝗼𝗲𝘀𝗻’𝘁? 💭 Sometimes the line between a harmless query and a malicious instruction isn’t as clear as it seems. It’s one of the first things you learn when digging into AI security: 𝘶𝘯𝘥𝘦𝘳𝘴𝘵𝘢𝘯𝘥𝘪𝘯𝘨 𝘸𝘩𝘢𝘵 𝘺𝘰𝘶’𝘳𝘦 𝘳𝘦𝘢𝘭𝘭𝘺 𝘶𝘱 𝘢𝘨𝘢𝘪𝘯𝘴𝘵. We pulled a couple of pages from one of our guides that make this distinction clear with real examples of what qualifies as a prompt attack and what doesn’t. A small knowledge pill for the weekend. Take a look below and grab the full guide here 👉 https://lnkd.in/dFBDx77m
To view or add a comment, sign in
-
🤔 Is your data used to train machine learning models? 🔍 Membership Inference Attacks (MIA) are widely used to assess privacy risks of ML models. But existing MIAs all infer membership one instance at a time, ignoring their crucial dependence. 🚀 At #NDSS2026, we introduce the first attack (Cascading MIA) that leverages such dependence. Our study shows that joint membership inference can significantly boost attack performance compared to existing approaches. ✨ This opens a new direction for MIA research, and many more possibilities remain to be explored! 📄 Paper: https://lnkd.in/ga3xeXQP 📝 Blog: https://lnkd.in/g9y5Jc2j 💻 Code: https://lnkd.in/gE8vjsNE 🙏 Many thanks to my collaborators Jiacheng Li, Yuetian Chen, Kaiyuan Zhang, Zhizhen Yuan, Hanshen Xiao, Bruno Ribeiro, and Ninghui Li.
To view or add a comment, sign in
-
Robust LLM defense evaluation must use adaptive, well-resourced attackers, and most popular defenses collapse under such pressure. A general attacker organizes iterative attacks into propose, score, select, and update, instantiated via gradient methods, reinforcement learning, search over prompts, and human red-teaming. Applying these attackers to 12 diverse defenses yields attack success rates typically above 90 percent, far exceeding the near-zero figures reported by original authors. Prompting strategies like Spotlighting, Prompt Sandwiching, and RPO fail once attacks adapt to their structure. Adversarial training on fixed attack sets, such as Circuit Breakers, StruQ, and MetaSecAlign, does not generalize. Stand-alone detectors and stacked filters remain brittle, and secret-knowledge canary schemes like Data Sentinel and MELON are bypassable, especially when adversaries adapt conditionally. Key lessons: small static evaluations mislead, automated evaluations help but cannot certify robustness, human red-teaming remains the strongest adversary, and model-based auto raters invite reward hacking and adversarial failure. https://lnkd.in/gkaWxSdP
To view or add a comment, sign in
-
In real security, the defense is published first, and the attacker adapts after seeing it. The authors show that when attackers do this—using optimization, reinforcement learning, search, or human creativity—every tested defense collapses, often reaching > 90 % attack success rate (ASR), even when the original papers reported near-zero success. https://lnkd.in/eTSpHXsR
To view or add a comment, sign in
-
Are you building with the Model Context Protocol (MCP)? Read 👉 https://lnkd.in/gGKGHpvv 👈 to follow the journey of a fictional user, Alex, in a play-like narrative, showing how a simple request can trigger a chain of vulnerabilities—from tool poisoning to agent impersonation. This is an essential read for any engineer planning to implement MCP based application or working to secure the next generation of AI agents and their connection to real-world tools. 👍 #AIAgents #MCP #ApplicationSecurity #AISecurity
To view or add a comment, sign in
More from this author
Explore related topics
- Prompt Injection Techniques for AI Security
- How to Understand Prompt Injection Attacks
- How to Secure Large Language Models
- Strategies to Prevent Code Attacks in AI
- Recent Developments in LLM Models
- Understanding Attention Mechanisms in LLMs
- Security Risks of OpenAI Integration
- Best Practices for Securing LLMs in High-Stakes Workflows
- AI Security Guidance for LLMs
Way to go! My weekend reading just opened up a new spot!