Understanding Malicious AI Technologies

Summary

Understanding malicious AI technologies means recognizing how artificial intelligence is used by attackers to manipulate models, steal data, bypass security, and orchestrate autonomous cyber threats. These technologies can exploit weaknesses in AI systems, often through cleverly crafted prompts or manipulated data, posing new challenges beyond traditional cybersecurity measures.

  • Watch for prompt injection: Always be cautious about the sources and content your AI systems interact with, since attackers can embed harmful instructions in natural language that look benign to humans.
  • Harden model access: Limit who and what can interact with your AI models and regularly review permissions to prevent unauthorized use or data extraction.
  • Upgrade security tools: Move beyond standard antivirus and network monitoring by adopting AI-specific threat detection and validating the intent behind every AI action, not just the inputs.
Summarized by AI based on LinkedIn member posts
  • Akhil Sharma

    Founder@ Armur AI (Offensive Security Tooling) | Backed by Techstars, Outlier Ventures | Published Security Researcher

    24,512 followers

    Your AI agent can now be hijacked through a calendar invite. Zenity Labs just disclosed PleaseFix — a family of critical vulnerabilities in agentic browsers, including Perplexity Comet, that let attackers take over AI agents through indirect prompt injection. Here's what makes this terrifying: An attacker embeds malicious instructions in something as mundane as a calendar invite. When you ask your AI agent to accept the invite, the agent autonomously accesses your local file system and exfiltrates data to an attacker-controlled endpoint. Zero clicks. Zero user awareness. The agent even returns the expected response so you never suspect anything. The second exploit is worse. The attacker manipulates the agent into interacting with your password manager — not by exploiting the password manager itself, but by abusing the agent's authorized workflows. Your credentials get stolen through a legitimate authenticated session. This is the fundamental problem with agentic AI security that nobody is solving well yet: We're giving agents access to our most sensitive systems — files, credentials, workflows — and trusting them to only do what we asked. But prompt injection turns that trust into an attack vector. The exploit isn't code. It's text. No malware binary. No exploit payload. Just natural language instructions hidden in content the agent processes normally. Traditional security tools are blind to this. EDR looks for malicious binaries. Network monitoring looks for C2 patterns. Neither catches a natural language instruction embedded in a calendar invite. We need a fundamentally different security model for AI agents — one that validates intent at the execution layer, not just the input layer. Article link in comments. #AISecurity #AgenticAI #CyberSecurity #PromptInjection #AIAgents
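
    One way to picture "validating intent at the execution layer" is a gate that fixes the set of permitted actions from the user's request before the agent reads any untrusted content, then checks every proposed tool call against it. This is a minimal sketch under stated assumptions: `ToolCall`, `TaskPolicy`, `partition_calls`, and the tool names are hypothetical and do not come from the PleaseFix disclosure or from any real agent framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    """One action the agent proposes (hypothetical shape)."""
    tool: str    # e.g. "calendar.accept", "fs.read", "http.post"
    target: str  # resource the call touches: event id, path, or host


@dataclass
class TaskPolicy:
    """Actions a user request may trigger, fixed before untrusted content is read."""
    allowed_tools: frozenset
    allowed_hosts: frozenset = field(default_factory=frozenset)

    def permits(self, call: ToolCall) -> bool:
        if call.tool not in self.allowed_tools:
            return False
        # Network calls must also target a host the task explicitly needs.
        if call.tool.startswith("http.") and call.target not in self.allowed_hosts:
            return False
        return True


def partition_calls(policy, calls):
    """Split proposed calls into (allowed, blocked) without executing anything."""
    allowed = [c for c in calls if policy.permits(c)]
    blocked = [c for c in calls if not policy.permits(c)]
    return allowed, blocked


# The user asked only to accept a calendar invite.
policy = TaskPolicy(allowed_tools=frozenset({"calendar.read", "calendar.accept"}))

# Calls the agent proposes after reading an invite that contains injected text.
proposed = [
    ToolCall("calendar.accept", "invite-123"),
    ToolCall("fs.read", "/home/user/.ssh/id_rsa"),
    ToolCall("http.post", "attacker.example"),
]

allowed, blocked = partition_calls(policy, proposed)
print([c.tool for c in allowed])  # ['calendar.accept']
print([c.tool for c in blocked])  # ['fs.read', 'http.post']
```

    The point of the design is that the policy is derived from what the user asked, not from what the invite says, so injected text cannot widen it. A gate like this does not detect the injection itself; it only limits what a hijacked agent can do.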

  • Tristan Ingold

    AI Governance @ Meta | Product Compliance | Public Speaking | Coaching

    6,114 followers

    Most AI security programs protect the wrong thing 🛡️ Traditional cybersecurity is built around the network perimeter, keeping attackers out, protecting the data inside, detecting intrusions when they happen. AI systems introduce a different attack surface. The model itself is the target. The training data is the target. The inference pipeline is the target. Let's look at the three attack categories every GRC and security team needs to understand now. 👇 1️⃣ Data Poisoning: An adversary introduces manipulated data into the training set, causing the model to learn incorrect patterns or develop hidden behaviors that activate under specific conditions. The most dangerous variant is the backdoor attack, in which the model performs normally on clean inputs and passes every standard accuracy test, then fails in predictable, attacker-controlled ways when triggered by a specific input pattern. The governance failure mode is subtle. Poisoned models look fine in testing. The gap between "model passed evaluation" and "model is safe to deploy" is exactly where data governance lives. 2️⃣ Prompt Injection: The defining security threat of LLM deployment. An attacker embeds malicious instructions in content the model processes, a user message, a retrieved document, a webpage, that override the model's intended behavior. Indirect injection is the more dangerous variant. The model retrieves attacker-controlled content during operation, redirecting its actions without the user or operator knowing. 💡 Agentic AI systems are particularly exposed. A model that can take actions, send emails, query databases, or execute code is one where a successful prompt injection becomes an execution vector, not just an output problem. 3️⃣ Model Extraction: An attacker queries a deployed model repeatedly, observing inputs and outputs, and uses those observations to reconstruct a functional replica. 
The replica can compete commercially, enable adversarial attacks offline, or reveal vulnerabilities exploitable against the original. This is an intellectual property and security risk simultaneously. The attack is difficult to detect because it looks like normal API usage. What makes these different from traditional cybersecurity risks is that they target the AI system's behavior and integrity, not just surrounding infrastructure. A firewall doesn't stop a poisoned training set. Endpoint detection doesn't catch prompt injection in a retrieved document. Organizations need AI-specific threat modeling, not traditional controls applied to AI deployments. MITRE ATLAS maps these attacks in detail. OWASP's LLM Top 10 is a good starting list: https://lnkd.in/g3ZRuZNq Drop a comment and let me know which of these three attack categories you need to learn more about! #AIGovernance #AIRisk #Cybersecurity #GRC #AI
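
    Because model extraction depends on many repeated queries, one coarse signal is per-client query volume over a sliding window. The sketch below is an illustration only: `QueryBudget`, its thresholds, and the client id are hypothetical, and volume by itself will miss a patient attacker, since (as noted above) extraction traffic resembles normal API usage.

```python
from collections import defaultdict, deque


class QueryBudget:
    """Flags API clients whose query count in a sliding window exceeds a limit."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client id -> timestamps of recent queries

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one query; return True if the client is now over budget."""
        times = self._history[client_id]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_queries


budget = QueryBudget(max_queries=3, window_seconds=60.0)
flags = [budget.record("client-a", t) for t in (0.0, 10.0, 20.0, 30.0, 100.0)]
print(flags)  # [False, False, False, True, False]
```

    Timestamps are passed in explicitly so the behavior is deterministic; a real service would more likely combine this with signals such as input diversity or coverage of the model's output space.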

  • Chaitanya Yedilla

    Security Engineer @ BlackPerl DFIR | EC-Council CEI/CEH | Building Pwndora Labs | Lead Cyber Security Trainer (5K+ Students) | IIT-H & APIS Startup Mentor

    8,271 followers

    How are AI-driven malware variants evading traditional detection methods?

    AI-driven malware variants are evading traditional detection methods through several sophisticated techniques:

    1. Polymorphism and Mutation: These malware strains use AI to constantly change aspects of their code, file structure, and behavior, sometimes every few seconds, making it extremely difficult for signature-based antivirus programs to identify them. Polymorphic malware, which mutates its hash and code structure automatically, is now present in more than 70% of major breaches and over 76% of phishing attacks. AI allows these mutations to happen rapidly and unpredictably, outpacing static detection engines.

    2. Adversarial Examples: Attackers create subtle modifications in malware and use adversarial machine learning tactics to fool detection models. By tuning payloads with adversarial examples, they cause classifiers to misidentify malicious files as benign. Memetic algorithms and generative adversarial networks (GANs) are now being used to optimize these evasion tactics, achieving success rates of up to 98% against advanced AI detectors like MalConv, and notable evasion rates even against leading commercial antivirus products.

    3. Prompt Injection and AI Model Manipulation: Some advanced malware now embeds natural-language prompts into its code, attempting to "trick" AI-driven security tools into misclassifying it as harmless. This is a relatively new evasion method: instead of altering code structure alone, attackers manipulate the logic and instructions of large language models used for malware analysis. The goal is for the AI to falsely declare “NO MALWARE DETECTED.” Such attacks exploit the contextual vulnerabilities of modern AI models, especially as these models become more central to automated threat detection.

    4. Real-Time Learning from Failed Attempts: New AI-powered strains can learn from failed attacks or detections, tweaking future attack vectors for better success. This self-improving loop allows malware to incrementally bypass increasingly complex defensive measures.

    Traditional signature-based antivirus, static heuristics, and even some behavioral analysis tools are being outpaced by these adaptive, AI-driven threats. The future of defense will likely depend on deploying similarly advanced AI models that can keep up with these evolving tactics and spot anomalies that legacy tools miss.

    #malware #adversary #detection
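
    The first point, that mutation defeats signature matching, can be seen with nothing more than a hash function: any change to the bytes produces an unrelated digest, so an exact-hash lookup no longer matches. The sketch below uses harmless placeholder bytes rather than real malware, and `signature_db` and `matches_signature` are hypothetical names for illustration.

```python
import hashlib

# Stand-in bytes for a known sample; harmless text, not real malware.
original = b"example payload bytes"
signature_db = {hashlib.sha256(original).hexdigest()}


def matches_signature(data: bytes) -> bool:
    """Exact-hash lookup, the simplest form of signature matching."""
    return hashlib.sha256(data).hexdigest() in signature_db


# A "mutated" variant: identical content plus one padding byte.
variant = original + b"\x00"

print(matches_signature(original))  # True
print(matches_signature(variant))   # False
```

    Real antivirus engines use richer signatures than whole-file hashes, but the same weakness applies to any check that depends on bytes the attacker is free to change.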

  • Katharina Koerner

    AI Governance, Privacy & Security | Trace3: Innovating with risk-managed AI/IT - Passionate about Strategies to Advance Business Goals through AI Governance, Privacy & Security

    44,732 followers

    This new guide from the OWASP® Foundation Agentic Security Initiative for developers, architects, security professionals, and platform engineers building or securing agentic AI applications, published Feb 17, 2025, provides a threat-model-based reference for understanding emerging agentic AI threats and their mitigations. Link: https://lnkd.in/gFVHb2BF

    * * *

    The OWASP Agentic AI Threat Model highlights 15 major threats in AI-driven agents and potential mitigations:

    1️⃣ Memory Poisoning – Prevent unauthorized data manipulation via session isolation & anomaly detection.
    2️⃣ Tool Misuse – Enforce strict tool access controls & execution monitoring to prevent unauthorized actions.
    3️⃣ Privilege Compromise – Use granular permission controls & role validation to prevent privilege escalation.
    4️⃣ Resource Overload – Implement rate limiting & adaptive scaling to mitigate system failures.
    5️⃣ Cascading Hallucinations – Deploy multi-source validation & output monitoring to reduce misinformation spread.
    6️⃣ Intent Breaking & Goal Manipulation – Use goal alignment audits & AI behavioral tracking to prevent agent deviation.
    7️⃣ Misaligned & Deceptive Behaviors – Require human confirmation & deception detection for high-risk AI decisions.
    8️⃣ Repudiation & Untraceability – Ensure cryptographic logging & real-time monitoring for accountability.
    9️⃣ Identity Spoofing & Impersonation – Strengthen identity validation & trust boundaries to prevent fraud.
    🔟 Overwhelming Human Oversight – Introduce adaptive AI-human interaction thresholds to prevent decision fatigue.
    1️⃣1️⃣ Unexpected Code Execution (RCE) – Sandbox execution & monitor AI-generated scripts for unauthorized actions.
    1️⃣2️⃣ Agent Communication Poisoning – Secure agent-to-agent interactions with cryptographic authentication.
    1️⃣3️⃣ Rogue Agents in Multi-Agent Systems – Monitor for unauthorized agent activities & enforce policy constraints.
    1️⃣4️⃣ Human Attacks on Multi-Agent Systems – Restrict agent delegation & enforce inter-agent authentication.
    1️⃣5️⃣ Human Manipulation – Implement response validation & content filtering to detect manipulated AI outputs.

    * * *

    The Agentic Threats Taxonomy Navigator then provides a structured approach to identifying and assessing agentic AI security risks by leading through six questions:

    1️⃣ Autonomy & Reasoning Risks – Does the AI autonomously decide steps to achieve goals?
    2️⃣ Memory-Based Threats – Does the AI rely on stored memory for decision-making?
    3️⃣ Tool & Execution Threats – Does the AI use tools, system commands, or external integrations?
    4️⃣ Authentication & Spoofing Risks – Does AI require authentication for users, tools, or services?
    5️⃣ Human-In-The-Loop (HITL) Exploits – Does AI require human engagement for decisions?
    6️⃣ Multi-Agent System Risks – Does the AI system rely on multiple interacting agents?
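
    The six navigator questions lend themselves to a simple checklist: answer yes or no for a given system and collect the risk areas that apply. The sketch below encodes only the question text listed in this post; `NAVIGATOR` and `applicable_risk_areas` are hypothetical names, and the mapping from each risk area to specific threats in the OWASP guide is deliberately left out rather than guessed.

```python
# The six navigator questions as listed in the post: (risk area, question).
NAVIGATOR = [
    ("Autonomy & Reasoning Risks",
     "Does the AI autonomously decide steps to achieve goals?"),
    ("Memory-Based Threats",
     "Does the AI rely on stored memory for decision-making?"),
    ("Tool & Execution Threats",
     "Does the AI use tools, system commands, or external integrations?"),
    ("Authentication & Spoofing Risks",
     "Does AI require authentication for users, tools, or services?"),
    ("Human-In-The-Loop (HITL) Exploits",
     "Does AI require human engagement for decisions?"),
    ("Multi-Agent System Risks",
     "Does the AI system rely on multiple interacting agents?"),
]


def applicable_risk_areas(answers):
    """Return the risk areas whose question was answered True.

    `answers` is a sequence of six booleans, one per question, in order.
    """
    if len(answers) != len(NAVIGATOR):
        raise ValueError(f"expected {len(NAVIGATOR)} answers, got {len(answers)}")
    return [area for (area, _question), yes in zip(NAVIGATOR, answers) if yes]


# Example: a single agent that plans its own steps, keeps memory, and calls tools.
areas = applicable_risk_areas([True, True, True, False, False, False])
for area in areas:
    print(area)
```

    For the example answers, the function returns the first three risk areas, which is where a threat-modeling session for that system would start.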

  • Bally S Kehal

    ⭐️Top AI Voice | Founder (Multiple Companies) | Teaching & Reviewing Production-Grade AI Tools | Voice + Agentic Systems | AI Architect | Ex-Microsoft

    19,876 followers

    Anthropic Just Documented the First AI-Orchestrated Cyber Espionage Campaign → 30 Targets → 80-90% Autonomous Operations

    GTG-1002 changed everything we thought we knew about AI agent security. Chinese state actors didn't just use Claude for advice. They turned it into an autonomous penetration testing orchestrator using MCP servers. Here's what your security team needs to understand...

    The Technical Reality
    ↳ Claude Code + Model Context Protocol = autonomous attack framework
    ↳ AI executed reconnaissance, exploitation, lateral movement, data exfiltration
    ↳ Humans only intervened at strategic decision gates (10-20% of operations)
    ↳ Peak activity: thousands of requests, often multiple per second
    ↳ Multiple simultaneous intrusions across major tech companies and government agencies

    The Evolution from Vibe Coding to Autonomous Attacks
    June 2025: "Vibe hacking" - humans directing operations
    November 2025: AI autonomously discovering vulnerabilities and exploiting them at scale

    What Teams Should Learn

    The Bypass Method:
    ↳ Role-play convinced Claude it was doing "defensive security testing"
    ↳ Social engineering the AI model itself
    ↳ Individual tasks appeared legitimate when evaluated in isolation

    The Infrastructure:
    ↳ MCP servers orchestrated commodity penetration testing tools
    ↳ No custom malware needed
    ↳ Integration over innovation

    Critical Limitation:
    ↳ AI hallucinations created false positives
    ↳ Claimed credentials that didn't work
    ↳ "Critical discoveries" turned out to be public information
    ↳ Full autonomy still requires human validation

    Security Implications for Founders
    The barriers to sophisticated cyberattacks dropped substantially. Less experienced groups can now potentially execute nation-state level operations. But here's what matters: The same AI capabilities enabling these attacks are critical for defense. SOC automation, threat detection, vulnerability assessment, incident response.

    Key Takeaways for Your Team
    ↳ Experiment with AI for defensive security operations
    ↳ Build detection systems for autonomous attack patterns
    ↳ Implement stronger safety controls and validation layers
    ↳ Assume AI-orchestrated attacks are now part of the standard threat landscape
    ↳ Test your systems against AI-driven reconnaissance

    This isn't 2023 anymore. Your security posture needs to account for AI agents that can execute full attack chains with minimal human oversight. The question isn't whether AI will be used in cyberattacks. The question is whether your defenses account for AI-orchestrated operations happening right now.

    P.S. Building AI agents or implementing MCP in your infrastructure? Security-first architecture isn't optional anymore. One misconfigured agent with access to production systems = complete compromise.
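
    The observation that individual tasks looked legitimate in isolation suggests evaluating activity per session rather than per request. The sketch below groups tool events by session and flags sessions that span several of the attack-chain stages named in the post. It is an illustration under stated assumptions: the tool names, the `STAGE_OF_TOOL` mapping, and the `min_stages` threshold are hypothetical and are not taken from Anthropic's report.

```python
from collections import defaultdict

# Stages are the ones named in the post; the tool-to-stage mapping is hypothetical.
STAGE_OF_TOOL = {
    "port_scan": "reconnaissance",
    "exploit_run": "exploitation",
    "remote_login": "lateral movement",
    "bulk_download": "data exfiltration",
}


def stages_by_session(events):
    """Map each session id to the set of attack-chain stages seen in it.

    `events` is an iterable of (session_id, tool_name) pairs.
    """
    seen = defaultdict(set)
    for session_id, tool in events:
        stage = STAGE_OF_TOOL.get(tool)
        if stage is not None:
            seen[session_id].add(stage)
    return dict(seen)


def flag_sessions(events, min_stages=3):
    """Return sorted ids of sessions that cover at least `min_stages` distinct stages."""
    return sorted(
        session_id
        for session_id, stages in stages_by_session(events).items()
        if len(stages) >= min_stages
    )


events = [
    ("s1", "port_scan"),
    ("s1", "exploit_run"),
    ("s1", "remote_login"),
    ("s1", "bulk_download"),
    ("s2", "port_scan"),
    ("s2", "port_scan"),
]
print(flag_sessions(events))  # ['s1']
```

    Session `s2` repeats one benign-looking step and is not flagged, while `s1` is flagged because the combination of steps forms a chain that no single request reveals.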

  • Vinod Bijlani

    Building AI Factories | Sovereign AI Visionary | Board-Level Advisor | 25× Patents

    9,841 followers

    Stock markets are panicking about AI coding assistants replacing software companies. $2 trillion wiped off software market caps in days. Indian IT companies alone lost $50 billion. But almost nobody is talking about the Security Debt Crisis we are creating with these Assistants.

    We are writing code 56% faster. We are also breaking our architecture 153% faster.

    Copilot. Cursor. Q. These aren't just "tools." They are privileged agents. We are granting them deep access to file systems, shells, credentials, and codebases. We are letting them execute commands with the developer's own permissions. BUT we are protecting them with security models that are probabilistic, not deterministic.

    Let’s look at what researchers have actually demonstrated recently:
    - Workspace Hijacking: Tools manipulated to execute arbitrary system commands via simple "pre-planning" steps.
    - Data Exfiltration: Hidden tricks in rendered content (like SVGs) used to bypass security and leak repo secrets.
    - Prompt Injection: Malicious instructions hidden in READMEs or white-text comments that rewrite your configuration or steal API keys.
    - Hallucinated Dependencies: Assistants confidently recommending packages that don't exist - or worse, installing malicious ones.

    The scary part? These tools execute with your permissions. When a coding assistant is weaponized by a hidden comment, the attack surface isn't the tool. It’s the trust model.

    Stop treating these as productivity add-ons. Start treating them as privileged access endpoints. Build your policy enforcement pipeline before you onboard these tools, not after a breach.

    If you are an Engineering Leader, you need 3 controls now:
    - Semantic Context Filtering: Adopt a "shift left" approach. Filter credentials and PII before the codebase is exposed to the model. Data-first security means the secret never reaches the assistant.
    - Hardened MCP Gateways: To combat vulnerabilities like CVE-2025-6514, you cannot allow direct external connections. Use model routers and sanctioned registries to govern tool access.
    - Real-Time Anomaly Monitoring: Detects sudden requests for security-sensitive code. This is often the only way to catch prompt injection attempts before workstation compromise occurs.

    The question is not whether AI coding assistants are useful. The question is whether you are treating code as a sovereign asset, or just a byproduct of speed. What controls has your team implemented for AI assistants?

    Follow Vinod Bijlani for more insights
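
    The first control, filtering credentials before the codebase reaches the model, can be sketched as a redaction pass over source text. This is a minimal illustration, not a complete scanner: `SECRET_ASSIGNMENT`, `AWS_ACCESS_KEY_ID`, and `redact` are hypothetical names, the two patterns are deliberately narrow, and the AWS-style value is the placeholder key used in public documentation. A production setup would more likely rely on a maintained secret-scanning tool.

```python
import re

# Assignments such as API_KEY = "..." or password: ... (illustrative, not exhaustive).
SECRET_ASSIGNMENT = re.compile(
    r"""(?i)((?:api[_-]?key|secret|token|password)\w*\s*[:=]\s*)(["']?)([^"'\s]+)\2"""
)
# Strings shaped like an AWS access key id: "AKIA" followed by 16 uppercase letters or digits.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact(text: str) -> str:
    """Replace likely secret values with a placeholder before text is sent to a model."""
    # Keep the variable name and quotes so the code still reads naturally.
    text = SECRET_ASSIGNMENT.sub(r"\1\2[REDACTED]\2", text)
    return AWS_ACCESS_KEY_ID.sub("[REDACTED]", text)


source = (
    'API_KEY = "sk-test-12345"\n'
    "aws_id = AKIAIOSFODNN7EXAMPLE\n"
    'print("hello")\n'
)
redacted = redact(source)
print(redacted)
```

    After redaction the first line reads `API_KEY = "[REDACTED]"`, the second reads `aws_id = [REDACTED]`, and the `print` line is unchanged, so the assistant still sees the structure of the code without the secret values.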

  • Austin Larsen

    Principal Threat Analyst @ Google Threat Intelligence Group

    14,096 followers

    Our team at Google Threat Intelligence Group (GTIG) just published our new AI Threat Tracker report. Adversaries are moving beyond using AI for productivity gains and are now deploying novel AI-enabled malware in active operations. This marks a new phase of AI abuse, involving tools that leverage LLMs mid-execution to dynamically alter their behavior, generate malicious code, and evade detection. A few key findings: 🤖 First observation of "just-in-time" AI malware, like APT28's PROMPTSTEAL, using LLMs in live operations. 🧬 Discovery of experimental malware PROMPTFLUX using the Gemini API to attempt self-modification and evade detection. 🎭 Actors are social engineering AI models, posing as students in a CTF competition to bypass safety guardrails. 🛒 A maturing criminal marketplace for illicit, purpose-built AI tools is lowering the barrier for entry for less-skilled actors. We are actively disrupting these actors, disabling associated assets, and continuously feeding these insights back to Google DeepMind to strengthen our classifiers and model safeguards against misuse. Read the full report for more detail. I'll post the link in the comments. #ThreatIntelligence #CyberSecurity #AI #ArtificialIntelligence #InfoSec #Malware #GTIG #APT28 #Gemini

  • Keith King

    Former White House Lead Communications Engineer, U.S. Dept of State, and Joint Chiefs of Staff in the Pentagon. Veteran U.S. Navy, Top Secret/SCI Security Clearance. Over 17,000+ direct connections & 49,000+ followers.

    49,253 followers

    AI Is Supercharging Cybercrime at an Alarming Pace Introduction A newly uncovered cyber-espionage campaign shows how advanced chatbots are becoming powerful tools for hackers. As AI systems gain autonomy, tool-use capabilities, and coding proficiency, they are reshaping the threat landscape and giving criminals unprecedented scale and speed. Key Breakdown Anthropic’s Discovery • Hackers—likely state-backed—used Claude Code’s agentic features to run long, automated hacking sequences. • The AI wrote malicious code, analyzed vulnerabilities, harvested passwords, and exfiltrated data with minimal human involvement. • The operation functioned like a business, running during Chinese work hours and pausing for holidays. AI as a Criminal Force Multiplier • Generative AI accelerates phishing, ransomware debugging, vulnerability scanning, and exploitation. • Hackers can now deploy AI “assistants” at scale—effectively turning a small team into thousands of virtual operators. • UC Berkeley tests showed AI agents can discover new security flaws that human analysts miss. A New Breed of Hard-to-Detect Malware • Attackers are generating bespoke malicious code for each target, making detection far more difficult. • Underground markets now sell AI-powered hacking tools, enabling low-skill actors to launch sophisticated attacks. • Intrusions happen so quickly that defenses may activate only after significant damage is done. AI Systems Create Vulnerabilities Too • Companies deploying chatbots without proper threat modeling expose new attack vectors. • AI-generated code often contains security gaps, introducing fresh weaknesses into corporate systems. • Compromised customer-service bots can be manipulated to leak data or perform unauthorized actions. Why This Matters This shift represents a fundamental change in cybersecurity dynamics. Attackers can innovate rapidly, automate complex operations, and exploit AI to scale their reach. 
Defenders, constrained by caution and legacy infrastructure, struggle to keep pace. An AI-driven arms race is underway—and while defensive AI offers promise, criminals currently have the momentum. I share daily insights with 34,000+ followers across defense, tech, and policy. If this topic resonates, I invite you to connect and continue the conversation. Keith King https://lnkd.in/gHPvUttw

  • Leonard Rodman, M.Sc. PMP LSSBB CSM CSPO Workato

    AI Implementation Manager | API Automation Developer/Engineer | Email promotions@rodman.ai for collabs

    56,559 followers

    AI is rapidly becoming the nerve-center of how we build, sell, and serve—but that also makes it a bullseye. Before you can defend your models, you need to understand how attackers break them. Here are the five most common vectors I’m seeing in the wild: 1️⃣ Prompt Injection & Jailbreaks – Hidden instructions in seemingly harmless text or images can trick a chatbot into leaking data or taking unintended actions. 2️⃣ Data / Model Poisoning – Adversaries slip malicious samples into your training or fine-tuning set, planting logic bombs that detonate after deployment. 3️⃣ Supply-Chain Manipulation – LLMs sometimes “hallucinate” package names; attackers register those libraries so an unwary dev installs malware straight from npm or PyPI. 4️⃣ Model Theft & Extraction – Bulk-scraping outputs or abusing unsecured endpoints can replicate proprietary capabilities and drain your competitive moat. 5️⃣ Membership-Inference & Privacy Leakage – Researchers keep showing they can guess whether a sensitive record was in the training set, turning personal data into low-hanging fruit. Knowing the playbook is half the battle. Stay tuned—and start threat-modeling your AI today. 🔒🤖
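
    The supply-chain vector (point 3) can be blunted by checking every dependency against a vetted list before anything is installed. The sketch below parses requirements-style lines and reports names that are not approved. `APPROVED_PACKAGES`, `package_name`, `unapproved`, and the package `fast-json-utilz` are hypothetical; the name normalization follows the rule from PEP 503 (lowercase, with runs of `-`, `_`, and `.` collapsed to a single hyphen).

```python
import re

# Hypothetical allowlist of vetted package names, already normalized.
APPROVED_PACKAGES = {"requests", "numpy", "pandas"}


def package_name(requirement_line: str):
    """Return the normalized package name from one requirements-style line, or None."""
    line = requirement_line.split("#", 1)[0].strip()  # drop comments and whitespace
    if not line:
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    if match is None:
        return None  # option lines such as "-r other.txt" are ignored here
    # PEP 503 normalization: collapse runs of -, _, . to one hyphen and lowercase.
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def unapproved(requirements: str):
    """Return the sorted list of package names that are not on the allowlist."""
    names = (package_name(line) for line in requirements.splitlines())
    return sorted({n for n in names if n is not None and n not in APPROVED_PACKAGES})


requirements = """
requests>=2.31
numpy==1.26.4
fast-json-utilz   # name suggested by an assistant, never vetted
"""
print(unapproved(requirements))  # ['fast-json-utilz']
```

    Running a check like this in CI, before `pip install`, means a hallucinated or typosquatted name fails the build instead of reaching a developer machine.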

  • Peter Slattery, PhD

    MIT AI Risk Initiative | MIT FutureTech

    68,994 followers

    "Technologists and policymakers are increasingly seized with the importance of addressing AI Loss Of Control (LOC) risk—a hypothetical state in which an AI system diverges from authorized constraints to the extent that the human operator is no longer able to prevent, constrain, or revert undesired and unintended outcomes. However, significant gaps remain in how policymakers, the AI industry and AI security and safety researchers understand, anticipate, and perceive this risk. As these systems continue to gain power and capability, even a five percent probability that the worst-case AI LOC scenario materializes should be enough to compel decision-makers to treat this risk category as a national, human, and economic security priority. To address this gap, this paper proposes applying the Indications & Warning (I&W) methodology—used by the intelligence community to detect, track, and warn of impending significant threats—for monitoring AI LOC risk. The framework distinguishes between potential AI LOC indicators (theoretical behaviors signaling potential LOC) and actual indications (documented evidence that these patterns are occurring in reality)[...] To monitor AI LOC risk in particular, this paper proposes seven potential indicators:   • Scheming [...] • Manipulation [...] • Deception [...] • Self-Preserving Behavior [...] • Unauthorized Resource Acquisition [...] • Goal Misgeneralization [...] • Model and Behavior Drift [...] [...] 
A growing body of evidence, laid out in this paper, finds that AI systems can: • Conceal their actions and fabricate data to deceive the human operator • Identify vulnerable users and target them with manipulative strategies • Learn deception through reinforcement learning rewards • Strategically adjust behavior when they detect being evaluated • Rewrite their own system prompt to preserve their goals, copy their weights to external servers, and delete successor models • Conceal their reasoning from interpretability tools • Gradually lose their alignment properties over deployment cycles • Pursue unintended goals that succeed in training but fail in novel contexts • Optimize for code completion while systematically failing in security objectives • Circumvent shutdown mechanisms to continue task execution • Strategically alter behavior to evade evaluation and preserve deployment viability" Lots more in the document attached. Great work from Mariami Tkeshelashvili, Ritika Verma, and Steven M. Kelly at the Institute for Security and Technology (IST). I'm glad that I could play a role alongside some other members of the working group.
