Preventing Jailbreaks in Large Language Models

Explore top LinkedIn content from expert professionals.

Summary

Preventing jailbreaks in large language models means using strategies and tools to stop people from tricking AI systems into breaking their rules or producing unsafe responses. Jailbreaks happen when attackers find clever ways to bypass safeguards, making AI behave in ways it shouldn’t—even when it seems secure.

  • Expand real-world testing: Regularly review your AI system by simulating creative or disguised requests, not just standard ones, to uncover hidden weaknesses.
  • Add layered protections: Combine intent detection, content filters, and approval steps for sensitive actions rather than relying on a single safety measure.
  • Monitor and update regularly: Continuously check for unusual outputs and refresh your defenses as tactics evolve, making sure your safeguards keep pace with new attack methods.
Summarized by AI based on LinkedIn member posts
  • View profile for Kris Kimmerle
    Kris Kimmerle Kris Kimmerle is an Influencer

    Vice President, AI Risk & Governance @ RealPage

    3,471 followers

    HiddenLayer just released research on a “Policy Puppetry” jailbreak that slips past model-side guardrails from OpenAI (ChatGPT 4o, 4o-mini, 4.1, 4.5, o3-mini, and o1), Google (Gemini 1.5 and 2 Flash, and 2.5 Pro), Microsoft (Copilot), Anthropic (Claude 3.5 and 3.7 Sonnet), Meta (Llama 3 and 4 families), DeepSeek AI (V3 and R1), Alibaba Group's Qwen (2.5 72B) and Mistral AI (Mixtral 8x22B). The novelty of this jailbreak lies in how four familiar techniques, namely policy-file disguise, persona override, refusal blocking, and leetspeak obfuscation, are stacked into one compact prompt that, in its distilled form, is roughly two hundred tokens. 𝐖𝐡𝐲 𝐢𝐭 𝐰𝐨𝐫𝐤𝐬: 1 / Wrap the request in fake XML configuration so the model treats it as official policy. 2 / Adopt a Dr House persona so user instructions outrank system rules. 3 / Ban phrases such as “I’m sorry” or “I cannot comply” to block safe-completion escapes. 4 / Spell sensitive keywords in leetspeak to slip past simple pattern filters. Surprisingly, that recipe still walks through the tougher instruction hierarchy defenses vendors shipped in 2024 and 2025. 𝐖𝐡𝐚𝐭 𝐀𝐈 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐬/𝐝𝐞𝐟𝐞𝐧𝐝𝐞𝐫𝐬 𝐜𝐚𝐧 𝐝𝐨: This shows that modest prompt engineering can still break the most recent built-in content moderation / model-side guardrails. 1 / Keep user text out of privileged prompts. Use structured fields, tool calls, or separate chains so the model never interprets raw user content as policy. 2 / Alignment tuning and keyword filters slow attackers but do not stop them. Wrap the LLM with input and output classifiers, content filters, and a policy enforcement layer that can veto or redact unsafe responses. 3 / For high-risk actions such as payments, code pushes, or cloud changes, require a second approval or run them in a sandbox with minimal permissions. 4 / Add Policy Puppetry style prompts to your red-team suites and refresh the set often. Track bypass rates over time to spot regressions. Keep controls lean. Every extra layer adds latency and cost, the alignment tax that pushes frustrated teams toward unsanctioned shadow AI. Safety only works when people keep using the approved system. Great work by Conor McCauley, Kenneth Yeung, Jason Martin, Kasimir Schulz at HiddenLayer! Read the full write-up: https://lnkd.in/diUTmhUW

  • View profile for Haohan Wang

    Assistant Professor @ UIUC; trustworthy machine learning & computational biology

    4,711 followers

    Advancing AI Safety and Compliance with GUARD: A Novel Approach to Testing LLMs In the rapidly evolving landscape of artificial intelligence, ensuring that Large Language Models (LLMs) adhere to stringent government guidelines is not just a regulatory necessity—it's a cornerstone for ethical AI development. Our latest research introduces GUARD (Guideline Upholding through Adaptive Role-play Diagnostics), a pioneering system designed to address this critical need through innovative testing and evaluation methods. The emergence of "jailbreaks," where LLMs are manipulated to produce outcomes that bypass safety filters and ethical guidelines, presents a growing challenge. These sophisticated tactics can potentially lead LLMs to generate harmful or unethical content, underscoring the urgent need for robust testing mechanisms. GUARD represents a significant leap forward in this regard, employing a novel role-playing strategy where multiple LLMs collaborate to generate, test, and refine jailbreak scenarios in real-time. Our approach with GUARD is multi-faceted: 1. Role-playing Methodology: LLMs assume four distinct roles—Translator, Generator, Evaluator, and Optimizer—to collectively produce and assess jailbreak scenarios. This dynamic interaction fosters a comprehensive and nuanced testing environment, simulating real-world attempts to circumvent AI safeguards. 2. Adaptive Learning: GUARD's system continuously learns from each testing cycle, using insights from previous jailbreak attempts to enhance its strategies. This adaptive approach ensures that GUARD remains at the forefront of identifying and mitigating potential vulnerabilities in LLMs. 3. Comprehensive Evaluation: We've rigorously tested GUARD across a range of LLMs, confirming its effectiveness in enhancing compliance with governmental and ethical standards. This system not only identifies weaknesses but also actively contributes to the development of more resilient AI technologies. Our findings, validated across several leading LLMs, underscore GUARD's potential to redefine AI safety and compliance testing. By proactively identifying and addressing jailbreak scenarios, GUARD ensures that LLMs can be trusted to operate within the ethical and legal frameworks essential for their widespread acceptance and application. As we continue to explore and expand GUARD's capabilities, our goal is clear: to pave the way for a new era of AI, where advanced technologies are synonymous with safety, reliability, and ethical integrity. We invite the AI and tech community to engage with our work, explore GUARD's potential, and join us in shaping a future where AI's vast possibilities are matched by its commitment to ethical and legal adherence. #AICompliance #AISafety #MachineLearning #Innovation #TechEthics #ArtificialIntelligence https://lnkd.in/gdqPYWrj

  • View profile for Patrick Sullivan

    VP of Strategy and Innovation at A-LIGN | TEDx Speaker | Forbes Technology Council | AI Ethicist | ISO/IEC JTC1/SC42 Member

    11,479 followers

    📜LLM Safety Has a New Problem📜 Your AI system may be easier to jailbreak than you think. A new study shows that converting a harmful request into a poem is often enough to bypass guardrails. Same request. Same intent. Different surface form. The model complies. The attack success rates are not small. Several major providers move more than fifty percentage points. Some reach ninety percent or higher. The failures stretch across cyber offense, CBRN misuse, manipulation, privacy intrusion, and loss of control scenarios. The pattern appears across twenty five models. One prompt is enough. This exposes a deeper pattern in how alignment works. Most guardrails recognize harmful phrasing, not harmful purpose. When the request is wrapped in metaphor or rhythm, many models treat it as benign. Larger models become more vulnerable because they decode figurative language more thoroughly. Their capability improves, but their safety behavior does not transfer. For organizations deploying AI systems, this is more than an academic finding. It creates a direct gap in your assurance activities. A model that passes standard red team tests but fails when phrasing shifts creates operational and regulatory exposure. The #EUAIAct expects systems to behave consistently under realistic variation. #ISO42001 expects the same. If style alone breaks your controls, your #AIMS is incomplete. ➡️Here are mitigation steps that align with both operational safety and ISO42001 expectations: 1️⃣Expand your testing beyond plain phrasing Include poetic, narrative, obfuscated, and stylized prompts in your evaluations. Treat these as stress tests, not edge cases. 2️⃣Strengthen intent detection Use an independent intent recognition layer ahead of the primary model. Identify the underlying task before the model interprets the input. 3️⃣Layer your safety controls Combine rule based filters, retrieval grounded policy checks, schema validations, and post generation safety reviews. Do not rely on model refusal behavior alone. 4️⃣Monitor unusual surface forms Treat stylized prompts as signals for elevated scrutiny. Route them through safer inference paths or apply enhanced filtering. 5️⃣Constrain sensitive workflows For high risk cases, limit exposure to free form generation. Use templates, constrained decoding, and downstream enforcement logic. 6️⃣Treat jailbreak exposure as a continuous risk Retest frequently. Update your jailbreak suite every time your models or workflows change. I care about this because I work so closely with organizations that trust their AI systems to behave predictably. This research shows how easily that trust can be misplaced if evaluation does not reflect how real users communicate. It is time for you to move beyond benchmark safety. Real users will not stick to plain phrasing, your controls should not presume that they will. 🌐 https://lnkd.in/geja7vtB A-LIGN Shea Brown #TheBusinessofCompliance #ComplianceAlignedtoYou

  • View profile for Brian Levine

    Cybersecurity & Data Privacy Leader • Founder & Executive Director of Former Gov • Speaker • Former DOJ Cybercrime Prosecutor • NYAG Regulator • Civil Litigator • Posts reflect my own views.

    15,475 followers

    A challenge to the security and trustworthiness of large language models (LLMs) is the common practice of exposing the model to large amounts of untrusted data (especially during pretraining), which may be at risk of being modified (i.e. poisoned) by an attacker. These poisoning attacks include backdoor attacks, which aim to produce undesirable model behavior only in the presence of a particular trigger. For example, an attacker could inject a backdoor where a trigger phrase causes a model to comply with harmful requests that would have otherwise been refused; or aim to make the model produce gibberish text in the presence of a trigger phrase. As LLMs become more capable and integrated into society, these attacks may become more concerning if successful. Recent research from Anthropic and the UK AI Security Institute shows that inserting as few as 250 malicious documents into training data can create backdoors or cause gibberish outputs when triggered by specific phrases. See https://lnkd.in/eHGuRmHP. Here’s a list of best practices to help prevent or mitigate model poisoning: 1. Sanitize Training Data Scrub datasets for anomalies, adversarial patterns, or suspicious repetitions. Use data provenance tools to trace sources and flag untrusted inputs. 2. Use Curated and Trusted Data Sources Avoid scraping indiscriminately from the open web. Prefer vetted corpora, licensed datasets, or internal data with known lineage. 3. Apply Adversarial Testing Simulate poisoning attacks during model development. Use red teaming to test how models respond to trigger phrases or manipulated inputs. 4. Monitor for Backdoor Behavior Continuously test models for unexpected outputs tied to specific phrases or patterns. Use behavioral fingerprinting to detect latent vulnerabilities. 5. Restrict Fine-Tuning Access Limit who can fine-tune models and enforce role-based access controls. Log and audit all fine-tuning activity. 6. Leverage Differential Privacy Add noise to training data to reduce the impact of any single poisoned input. This can help prevent memorization of malicious content. 7. Use Ensemble or Cross-Validated Models Combine outputs from multiple models trained on different data slices. This reduces the risk that one poisoned model dominates predictions. 8. Retrain Periodically with Fresh Data Don’t rely indefinitely on static models. Regular retraining allows for data hygiene updates and removal of compromised inputs. 9. Deploy Real-Time Anomaly Detection Monitor model outputs for signs of degradation, bias, or gibberish. Flag and quarantine suspicious responses for review. 10. Align with AI Security Frameworks Follow guidance from OWASP GenAI, NIST AI RMF, and similar standards. Document your defenses and response plans for audits and incident handling. Stay safe out there!

  • View profile for Sarthak Rastogi

    AI engineer | Posts on agents + advanced RAG | Experienced in LLM research, ML engineering, Software Engineering

    24,539 followers

    I spent the weekend fine-tuning an SLM to detect malicious prompt attacks -- and noticed interesting things along the way. The goal was to create a light but reliable model that can flag jailbreaks and prompt injections in real time. I tried a few things -- standard SFT, better prompting, instruction tuning -- but what finally worked was adding reasoning to the dataset. A one-sentence explanation before each label gave the model just enough structure to understand intent, not just keywords. In the Substack post, I’ve broken down: - What kind of data I used and how I generated it - Why SFT alone gave poor results - What changed when I used CoT-style fine-tuning - How small models can still benefit from reasoning - How this model fits into an AI agent pipeline The final model and code are open source -- you can start using it with the Rival AI Python library here: https://lnkd.in/g4Whv_Ds In just 3 lines of code, you can use it to ensure AI safety in your projects. Here's the full Substack post: https://lnkd.in/gYn5vY3c #LLMs #RAG #GenAI

  • View profile for Adnan Masood, PhD.

    Chief AI Architect | Microsoft Regional Director | Author | Board Member | STEM Mentor | Speaker | Stanford | Harvard Business School

    6,627 followers

    In my work with organizations rolling out AI and generative AI solutions, one concern I hear repeatedly from leaders, and the c-suite is how to get a clear, centralized “AI Risk Center” to track AI safety, large language model's accuracy, citation, attribution, performance and compliance etc. Operational leaders want automated governance reports—model cards, impact assessments, dashboards—so they can maintain trust with boards, customers, and regulators. Business stakeholders also need an operational risk view: one place to see AI risk and value across all units, so they know where to prioritize governance. One of such framework is MITRE’s ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) Matrix. This framework extends MITRE ATT&CK principles to AI, Generative AI, and machine learning, giving us a structured way to identify, monitor, and mitigate threats specific to large language models. ATLAS addresses a range of vulnerabilities—prompt injection, data leakage, malicious code generation, and more—by mapping them to proven defensive techniques. It’s part of the broader AI safety ecosystem we rely on for robust risk management. On a practical level, I recommend pairing the ATLAS approach with comprehensive guardrails - such as: • AI Firewall & LLM Scanner to block jailbreak attempts, moderate content, and detect data leaks (optionally integrating with security posture management systems). • RAG Security for retrieval-augmented generation, ensuring knowledge bases are isolated and validated before LLM interaction. • Advanced Detection Methods—Statistical Outlier Detection, Consistency Checks, and Entity Verification—to catch data poisoning attacks early. • Align Scores to grade hallucinations and keep the model within acceptable bounds. • Agent Framework Hardening so that AI agents operate within clearly defined permissions. Given the rapid arrival of AI-focused legislation—like the EU AI Act, now defunct  Executive Order 14110 of October 30, 2023 (Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence) AI Act, and global standards (e.g., ISO/IEC 42001)—we face a “policy soup” that demands transparent, auditable processes. My biggest takeaway from the 2024 Credo AI Summit was that responsible AI governance isn’t just about technical controls: it’s about aligning with rapidly evolving global regulations and industry best practices to demonstrate “what good looks like.” Call to Action: For leaders implementing AI and generative AI solutions, start by mapping your AI workflows against MITRE’s ATLAS Matrix. Mapping the progression of the attack kill chain from left to right - combine that insight with strong guardrails, real-time scanning, and automated reporting to stay ahead of attacks, comply with emerging standards, and build trust across your organization. It’s a practical, proven way to secure your entire GenAI ecosystem—and a critical investment for any enterprise embracing AI.

  • View profile for Dr. Amitava Das

    🧬 Neural Genomist | Professor, APPCAIR, BITS Pilani (Goa) | Former Research Associate Professor, AI Institute, University of South Carolina

    13,839 followers

    🚀 𝗔𝗟𝗜𝗚𝗡𝗚𝗨𝗔𝗥𝗗-𝗟𝗼𝗥𝗔: 𝗣𝗿𝗲𝘀𝗲𝗿𝘃𝗶𝗻𝗴 𝗔𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗗𝘂𝗿𝗶𝗻𝗴 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝗼𝗳 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀 🚀 ---------------------------------------------------------------------------- Large language models (LLMs) like LLaMA have revolutionized AI but remain vulnerable to 𝙖𝙡𝙞𝙜𝙣𝙢𝙚𝙣𝙩 𝙙𝙧𝙞𝙛𝙩 — 𝙨𝙪𝙗𝙩𝙡𝙚 𝙨𝙝𝙞𝙛𝙩𝙨 𝙙𝙪𝙧𝙞𝙣𝙜 𝙛𝙞𝙣𝙚-𝙩𝙪𝙣𝙞𝙣𝙜 𝙩𝙝𝙖𝙩 𝙙𝙚𝙜𝙧𝙖𝙙𝙚 𝙨𝙖𝙛𝙚𝙩𝙮 𝙖𝙣𝙙 𝙧𝙚𝙛𝙪𝙨𝙖𝙡 𝙗𝙚𝙝𝙖𝙫𝙞𝙤𝙧𝙨, 𝙚𝙫𝙚𝙣 𝙬𝙞𝙩𝙝 𝙢𝙞𝙣𝙤𝙧 𝙪𝙥𝙙𝙖𝙩𝙚𝙨. Introducing 𝗔𝗟𝗜𝗚𝗡𝗚𝗨𝗔𝗥𝗗-𝗟𝗼𝗥𝗔, a principled, geometry-aware fine-tuning framework that preserves alignment by: -- ✨ 𝗗𝗲𝗰𝗼𝗺𝗽𝗼𝘀𝗶𝗻𝗴 𝘂𝗽𝗱𝗮𝘁𝗲𝘀 into alignment-critical and task-specific components via Fisher Information Matrix (FIM). -- 🔐 Applying 𝗙𝗶𝘀𝗵𝗲𝗿-𝗴𝘂𝗶𝗱𝗲𝗱 𝗿𝗲𝗴𝘂𝗹𝗮𝗿𝗶𝘇𝗮𝘁𝗶𝗼𝗻 to protect fragile safety subspaces. -- 🛡️𝗘𝗻𝗳𝗼𝗿𝗰𝗶𝗻𝗴 𝗰𝗼𝗹𝗹𝗶𝘀𝗶𝗼𝗻-𝗮𝘄𝗮𝗿𝗲 𝗰𝗼𝗻𝘀𝘁𝗿𝗮𝗶𝗻𝘁𝘀 to minimize interference between alignment and task updates. -- 🧪 Validating on a new diagnostic benchmark, 𝗗𝗥𝗜𝗙𝗧𝗖𝗛𝗘𝗖𝗞, which surfaces latent alignment drift with 10,000 safe vs. unsafe prompts. Key results: ✅ Up to 𝟱𝟬% 𝗿𝗲𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗶𝗻 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 𝗱𝗿𝗶𝗳𝘁 compared to standard LoRA and full fine-tuning. ✅ Maintains or improves task performance on GLUE, SuperGLUE, HELM, and adversarial benchmarks. ✅ Theoretically grounded with a novel 𝘀𝗰𝗮𝗹𝗶𝗻𝗴 𝗹𝗮𝘄 explaining catastrophic forgetting and showing ALIGNGUARD-LoRA’s superior retention. This marks a 𝗽𝗮𝗿𝗮𝗱𝗶𝗴𝗺 𝘀𝗵𝗶𝗳𝘁—from treating alignment as a static checkpoint to actively 𝗽𝗿𝗲𝘀𝗲𝗿𝘃𝗶𝗻𝗴 𝘀𝗮𝗳𝗲𝘁𝘆 𝗰𝗼𝗻𝘁𝗶𝗻𝘂𝗼𝘂𝘀𝗹𝘆 𝗱𝘂𝗿𝗶𝗻𝗴 𝗳𝗶𝗻𝗲-𝘁𝘂𝗻𝗶𝗻𝗴. 🌟 🔗 https://lnkd.in/eYr-7Bzu -- Abhilekh Borah, Aman Chadha, Vinija Jain Pragya though you might be interested - Pin-Yu Chen, Payel Das #AIAlignment #LLMs #SafeAI #MachineLearning #LoRA #FineTuning #AIResearch #NeuralGenomics #AISafety

  • View profile for Devansh Devansh
    Devansh Devansh Devansh Devansh is an Influencer

    Chocolate Milk Cult Leader| Machine Learning Engineer| Writer | AI Researcher| | Computational Math, Data Science, Software Engineering, Computer Science

    14,867 followers

    Anthropic AI’s new defense against Jailbreaks got a lot of attention. Two interesting design decisions lead to amazing performance. To understand what they do very well, let’s first recap how the Constitutional Classifiers work. Here are the 3 steps that drive the pipeline- -)Training on synthetic data to create a robust defense against adversarial prompts -)Employing a multi-layered defense at both input and output stages. By separating this out, we add protection against jailbreaks (these are different models, so not vulnerable to the base Claude Models). -)Using Data Augmentation to improve diversity and quality of Synthetic Prompts- improving generalization. This prevents the much feared “Model Collapse” when using Synthetic Data. The results are worth paying attention to- “In over 3,000 estimated hours of red teaming, no red teamer found a universal jailbreak that could extract information from an early classifier-guarded LLM at a similar level of detail to an unguarded model across most target queries. On automated evaluations, enhanced classifiers demonstrated robust defense against held-out domain-specific jailbreaks. These classifiers also maintain deployment viability, with an absolute 0.38% increase in production-traffic refusals and a 23.7% inference overhead. Our work demonstrates that defending against universal jailbreaks while maintaining practical deployment viability is tractable” This reinforces two ideas that I believe anyone in AI should be intimate with- 1) The use of agentic systems and different kinds of models to cover the weakness of LLMs. This leads to better performance, scalability, and security in your systems. 2) Synthetic Data can be a powerful base, assuming it's created properly (good base generation + diversity through augmentation). Some very good work.

  • View profile for Eduardo Ordax

    🤖 Generative AI Lead @ AWS ☁️ (200k+) | Startup Advisor | Public Speaker | AI Outsider | Founder Thinkfluencer AI

    218,891 followers

    New research by Anthropic: Mitigating LLM jailbreaks with a few examples   This paper propose a novel approach to detect jailbreak: adaptive techniques that rapidly block new classes of jailbreak as they’re detected. TL;DR ▶️ Instead of aiming for perfect LLM defenses, researchers propose rapid response techniques to block jailbreak attacks after seeing just a few examples. ▶️ They created RapidResponseBench to measure how well defenses can adapt to different jailbreak strategies based on limited exposure. ▶️ Their approach uses "jailbreak proliferation" - automatically generating more examples similar to observed attacks. ▶️ Their best method uses a fine-tuned input classifier, reducing attack success by 240x for familiar jailbreaks and 15x for new types after seeing just one example. ▶️ The effectiveness depends heavily on the quality of the proliferation model and number of generated examples used for training. More info arXiv paper: https://lnkd.in/dstj9trf #ai #genai #llm #anthropic #aisafety

  • View profile for Sarah Bird

    Chief Product Officer of Responsible AI @ Microsoft

    24,439 followers

    AI’s powerful capabilities come with equally powerful risks, if not properly addressed. As AI tools become integral to everyday tasks, they face growing threats like jailbreaks and other prompt attacks—malicious attempts to trick models into breaking their rules or exposing sensitive information.     To address these threats, Microsoft uses a defense-in-depth approach, building protections directly into the AI model. This strategy includes creating safety systems around the model and designing user experiences that promote secure AI use. For example, Prompt Shields detects and blocks malicious prompts in real-time, while safety evaluations simulate attacks to measure an application’s vulnerability.     These tools, combined with Microsoft Defender, help customers stay ahead of emerging risks and deploy AI responsibly. You can read more in our latest blog post as part of our Building AI Responsibly series.

Explore categories