The EDPB recently published a report on AI Privacy Risks and Mitigations in LLMs. This is one of the most practical and detailed resources I've seen from the EDPB, with extensive guidance for developers and deployers. The report walks through privacy risks associated with LLMs across the AI lifecycle, from data collection and training to deployment and retirement, and offers practical tips for identifying, measuring, and mitigating risks.

Here's a quick summary of some of the key mitigations mentioned in the report:

For providers:
• Fine-tune LLMs on curated, high-quality datasets and limit the scope of model outputs to relevant and up-to-date information.
• Use robust anonymisation techniques and automated tools to detect and remove personal data from training data.
• Apply input filters and user warnings during deployment to discourage users from entering personal data, as well as automated detection methods to flag or anonymise sensitive input data before it is processed.
• Clearly inform users about how their data will be processed through privacy policies, instructions, warnings, or disclaimers in the user interface.
• Encrypt user inputs and outputs during transmission and storage to protect data from unauthorised access.
• Protect against prompt injection and jailbreaking by validating inputs, monitoring LLMs for abnormal input behaviour, and limiting the amount of text a user can input.
• Apply content filtering and human review processes to flag sensitive or inappropriate outputs.
• Limit data logging and give deployers configurable options for log retention.
• Offer easy-to-use opt-in/opt-out options for users whose feedback data might be used for retraining.

For deployers:
• Enforce strong authentication to restrict access to the input interface and protect session data.
• Mitigate adversarial attacks by adding a layer of input sanitisation and filtering, and by monitoring and logging user queries to detect unusual patterns.
• Work with providers to ensure they do not retain or misuse sensitive input data.
• Guide users to avoid sharing unnecessary personal data through clear instructions, training, and warnings.
• Educate employees and end users on proper usage, including the appropriate use of outputs and the phishing techniques that could trick individuals into revealing sensitive information.
• Ensure employees and end users do not over-rely on LLMs for critical or high-stakes decisions without verification, and ensure outputs are reviewed by humans before implementation or dissemination.
• Securely store outputs and restrict access to authorised personnel and systems.

This is a rare example where the EDPB strikes a good balance between practical safeguards and legal expectations. Link to the report included in the comments.

#AIprivacy #LLMs #dataprotection #AIgovernance #EDPB #privacybydesign #GDPR
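The input-filtering and anonymisation mitigations above can be sketched in a few lines. This is a minimal illustration of the idea, not code from the report; the patterns and the `redact` helper are my own, and a real deployment would use a maintained PII-detection library with locale-specific rules:

```python
import re

# Illustrative patterns only -- a production system would use a dedicated
# PII-detection tool rather than hand-rolled regexes.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(prompt: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            found.append(label)
            prompt = pattern.sub(f"[{label}]", prompt)
    return prompt, found

clean, flags = redact("Email me at jane@example.com, SSN 123-45-6789.")
# 'clean' now carries "[EMAIL]" and "[SSN]" placeholders; 'flags' lists both labels.
```

The same `redact` step can either anonymise the input silently or, using `flags`, trigger a user warning before the prompt is processed, matching the report's "flag or anonymise" framing.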
Addressing Data Privacy in Closed-Source AI Systems
Summary
Addressing data privacy in closed-source AI systems means protecting personal and sensitive information within AI platforms that are not publicly accessible or open to inspection. This involves understanding how user data is collected, processed, stored, and potentially used for AI training, and ensuring that privacy risks are minimized through clear guidelines and safeguards.
- Clarify sensitive data: Identify and clearly define what information is considered sensitive or personal, including details that go beyond obvious examples like names and addresses.
- Review privacy controls: Use tools and policies to filter, anonymize, or redact sensitive information before it enters closed-source AI systems, and always check if opt-out options for data training are available.
- Educate and communicate: Make sure staff and users understand privacy policies and know the risks of sharing personal data, especially when using AI-powered chatbots or services.
-
HUGE AI LEGAL NEWS! The European Data Protection Board (EDPB) has published its much-anticipated Opinion on AI and data protection. The opinion looks at 1) when and how AI models can be considered anonymous, 2) whether and how legitimate interest can be used as a legal basis for developing or using AI models, and 3) what happens if an AI model is developed using personal data that was processed unlawfully. It also considers the use of first- and third-party data.

On the consequences of developing AI models with unlawfully processed personal data, an area of particular concern for both developers and users, the EDPB clarifies that supervisory authorities are empowered to impose corrective measures, including the deletion of unlawfully processed data, retraining of the model, or even requiring its destruction in severe cases.

On the issue of anonymity, the opinion grapples with the question of whether AI models trained on personal data can ever fully transcend their origins to be considered anonymous. The EDPB highlights that merely asserting that an AI model does not process personal data is insufficient. Supervisory authorities (SAs) must assess claims of anonymity rigorously, considering whether personal data has been effectively anonymised in the model and whether risks such as re-identification or membership inference attacks have been mitigated. For AI developers, this means that claims of anonymity should be substantiated with evidence, including the implementation of technical and organisational measures to prevent re-identification.

On legitimate interest as a legal basis for AI, the opinion offers detailed guidance for both the development and deployment phases. Legitimate interest under Article 6(1)(f) GDPR requires meeting three cumulative conditions: pursuing a legitimate interest, demonstrating that the processing is necessary to achieve that interest, and ensuring the processing does not override the fundamental rights and freedoms of data subjects. For third-party data, the opinion emphasises that the absence of a direct relationship with the data subjects necessitates stronger safeguards, including enhanced transparency, opt-out mechanisms, and robust risk assessments.

The opinion's findings stress that the balancing test under legitimate interest must consider the unique risks posed by AI. These include discriminatory outcomes, regurgitation of personal data by generative AI models, and the broader societal risks of misuse, such as through deepfakes or misinformation campaigns. The opinion also provides examples of mitigating measures that could tip the balance in favour of controllers, such as pseudonymisation, output filters, and voluntary transparency initiatives like model cards and annual reports.

The implications for developers are significant: compliance failures in the development phase can render an entire AI system non-compliant, leading to legal and operational challenges.
-
NEWS 21/10/25: Department of Homeland Security obtains first-known warrant targeting OpenAI for user prompts in ChatGPT

According to a recent article by Forbes, the U.S. Department of Homeland Security (DHS) has secured a federal search warrant ordering OpenAI to identify a user of ChatGPT and to produce the user's prompts, as part of a child-exploitation investigation. https://lnkd.in/eatmK3zv?

Key details:
- The warrant was filed by child-exploitation investigators within DHS.
- It specifically targets "two prompts" submitted to ChatGPT by an anonymous user. The warrant asks OpenAI for the user's identifying information and associated prompt history.
- This is described as the first known federal search warrant compelling ChatGPT prompt-level data from OpenAI.

What this means for privacy:
- Prompts are treated as evidence. What users have assumed to be ephemeral or private entries in a chat session with an AI service may now be subject to law-enforcement production.
- Scope of data retention and access must be reconsidered. If prompt history can be identified and requested, both users and providers should evaluate how long prompts are stored, under what identifiers, and how anonymised they truly are.
- Implications for user trust and provider responsibility. AI companies may face growing legal obligations to disclose user-generated content and metadata, which may affect how services present themselves (privacy guarantees, terms of service) and how users engage with them.
- International context and legal cross-overs. For users in jurisdictions with strong data-protection regimes (for example, the General Data Protection Regulation in the UK/EU), the fact that prompt data can be subject to a U.S. warrant may raise questions about extraterritorial access and data-flow compliance.

In short: this isn't just another law-enforcement request. It marks the first time a generative-AI provider has been legally compelled to unmask a user and disclose their prompt history.
============ ↳ I track how stories like this shape the ethics and governance of AI. You can find deeper analysis at discarded.ai. #AISafety #AIRegulation #Privacy #Governance #Ethics
-
This Stanford study examined how six major AI companies (Anthropic, OpenAI, Google, Meta, Microsoft, and Amazon) handle user data from chatbot conversations. Here are the main privacy concerns:

👀 All six companies use chat data for training by default, though some allow opt-out
👀 Data retention is often indefinite, with personal information stored long-term
👀 Cross-platform data merging occurs at multi-product companies (Google, Meta, Microsoft, Amazon)
👀 Children's data is handled inconsistently, with most companies not adequately protecting minors
👀 Limited transparency in privacy policies, which are complex, hard to understand, and often lack crucial details about actual practices

Practical takeaways for acceptable use policy and training for nonprofits using generative AI:
✅ Assume anything you share will be used for training - sensitive information, uploaded files, health details, biometric data, etc.
✅ Opt out when possible - proactively disable data collection for training (Meta is the one where you cannot)
✅ Information cascades through ecosystems - your inputs can lead to inferences that affect ads, recommendations, and potentially insurance or other third parties
✅ Special concern for children's data - age verification and consent protections are inconsistent

Some questions to consider in acceptable use policies and to incorporate into any training:
❓ What types of sensitive information might your nonprofit staff share with generative AI?
❓ Does your nonprofit currently identify what counts as "sensitive information" (beyond PII) that should not be shared with generative AI? Is this incorporated into training?
❓ Are you working with children, people with health conditions, or others whose data could be particularly harmful if leaked or misused?
❓ What would be the consequences if sensitive information or strategic organizational data ended up being used to train AI models? How might this affect trust, compliance, or your mission? How is this communicated in training and policy?

Across the board, the Stanford research points out that developers' privacy policies lack essential information about their practices. The researchers recommend that policymakers and developers address the data privacy challenges posed by LLM-powered chatbots through comprehensive federal privacy regulation, affirmative opt-in for model training, and filtering personal information from chat inputs by default. "We need to promote innovation in privacy-preserving AI, so that user privacy isn't an afterthought."

How are you advocating for privacy-preserving AI? How are you educating your staff to navigate this challenge? https://lnkd.in/g3RmbEwD
-
Before diving headfirst into AI, companies need to define what data privacy means to them in order to use GenAI safely. After decades of harvesting and storing data, many tech companies have created vast troves of the stuff - and not all of it is safe to use when training new GenAI models. Most companies can easily recognize obvious examples of Personally Identifiable Information (PII) like Social Security numbers (SSNs) - but what about home addresses, phone numbers, or even information like how many kids a customer has? These details can be just as critical to ensuring newly built GenAI products don't compromise their users' privacy - or safety - but once this information has entered an LLM, it can be really difficult to excise it.

To safely build the next generation of AI, companies need to consider some key issues:

⚠️ Defining Sensitive Data: Companies need to decide what they consider sensitive beyond the obvious. PII covers more than just SSNs and contact information - it can include any data that paints a detailed picture of an individual and needs to be redacted to protect customers.

🔒 Using Tools to Ensure Privacy: Ensuring privacy in AI requires a range of tools that can help tech companies process, redact, and safeguard sensitive information. Without these tools in place, they risk exposing critical data in their AI models.

🏗️ Building a Framework for Privacy: Redacting sensitive data isn't just a one-time process; it needs to be a cornerstone of any company's data management strategy as they continue to scale AI efforts. Since PII is so difficult to remove from an LLM once added, GenAI companies need to devote resources to making sure it doesn't enter their databases in the first place.

Ultimately, AI is only as safe as the data you feed into it. Companies need a clear, actionable plan to protect their customers - and the time to implement it is now.
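The "keep it out in the first place" point can be made concrete with a small sketch. Assuming a hypothetical denylist of sensitive field names (every company must define its own, per the post's first key issue), a scrub-before-ingest step might look like this:

```python
# Sketch of a "scrub before ingest" gate: drop fields the organisation has
# classified as sensitive before a record can reach a training corpus.
# The field names and sample record below are hypothetical.
SENSITIVE_FIELDS = {"ssn", "home_address", "phone", "num_children"}

def scrub_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields removed."""
    return {k: v for k, v in record.items() if k.lower() not in SENSITIVE_FIELDS}

customer = {"plan": "pro", "ssn": "123-45-6789", "home_address": "10 Main St"}
safe = scrub_record(customer)
# safe == {"plan": "pro"}: only non-sensitive fields survive ingestion.
```

A field-name denylist only catches structured data, of course; free-text fields still need content-level redaction, which is why the post pairs "defining sensitive data" with dedicated tooling.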
-
Generative AI is reshaping industries, but as Large Language Models (LLMs) continue to evolve, they bring a critical challenge: how do we teach them to forget? Forget what? Our sensitive data. In their default state, LLMs are designed to retain patterns from training data, enabling them to generate remarkable outputs. However, this capability raises privacy and security concerns.

Why Forgetting Matters
- Compliance with Privacy Laws: Regulations like the GDPR and CCPA mandate the right to be forgotten. Training LLMs to erase specific data aligns with these legal requirements.
- Minimizing Data Exposure: Retaining unnecessary or sensitive information increases risk in the event of a breach. Forgetting protects users and organizations alike.
- Building User Trust: Transparent mechanisms to delete user data foster confidence in AI solutions.

Techniques to Enable Forgetting
🔹 Selective Fine-Tuning: Retraining models to exclude specific data sets without degrading performance.
🔹 Differential Privacy: Ensuring individual data points are obscured during training to prevent memorization.
🔹 Memory Augmentation: Using external memory modules where specific records can be updated or deleted without affecting the core model.
🔹 Data Tokenization: Encapsulating sensitive information in reversible tokens that can be erased independently.

Balancing forgetfulness with functionality is complex. LLMs must retain enough context for accuracy while ensuring sensitive information isn't permanently embedded. By prioritizing privacy, we can shape a future in which AI doesn't just work for us - it works with our values.

How are you addressing privacy concerns in your AI initiatives? Let's discuss! #GenerativeAI #AIPrivacy #LLM #DataSecurity #EthicalAI Successive Digital
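The data-tokenization technique in the list above can be illustrated with a toy sketch (my own construction, not a production design): sensitive values are swapped for opaque tokens held in a deletable vault, so erasing the vault entry is the concrete "forget" operation, and nothing recoverable ever reaches the model:

```python
import secrets
from typing import Optional

class TokenVault:
    """Toy reversible-tokenization vault: deleting a mapping makes the
    corresponding token permanently unresolvable."""

    def __init__(self) -> None:
        self._vault: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Opaque, random token carries no information about the value.
        token = f"tok_{secrets.token_hex(8)}"
        self._vault[token] = value
        return token

    def detokenize(self, token: str) -> Optional[str]:
        return self._vault.get(token)

    def forget(self, token: str) -> None:
        # The "right to be forgotten" operation: drop the mapping.
        self._vault.pop(token, None)

vault = TokenVault()
t = vault.tokenize("jane@example.com")
vault.detokenize(t)   # resolves to "jane@example.com" while the mapping exists
vault.forget(t)
vault.detokenize(t)   # returns None: the value is no longer recoverable
```

Because only tokens flow into prompts or training data, deleting a vault entry achieves erasure without retraining - the trade-off being that the model can never reason over the raw value, only over the placeholder.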
-
Privacy isn't a policy layer in AI. It's a design constraint.

The new EDPB guidance on LLMs doesn't just outline risks. It gives builders, buyers, and decision-makers a usable blueprint for engineering privacy - not just documenting it.

The key shift?
→ Yesterday: Protect inputs
→ Today: Audit the entire pipeline
→ Tomorrow: Design for privacy observability at runtime

The real risk isn't malicious intent. It's silent propagation through opaque systems. In most LLM systems, sensitive data leaks not because someone intended harm but because no one mapped the flows, tested outputs, or scoped where memory could resurface prior inputs. This guidance helps close that gap. And here's how to apply it:

For Developers:
• Map how personal data enters, transforms, and persists
• Identify points of memorization, retention, or leakage
• Use the framework to embed mitigation into each phase: pretraining, fine-tuning, inference, RAG, feedback

For Users & Deployers:
• Don't treat LLMs as black boxes. Ask if data is stored, recalled, or used to retrain
• Evaluate vendor claims with structured questions from the report
• Build internal governance that tracks model behaviors over time

For Decision-Makers & Risk Owners:
• Use this to complement your DPIAs with LLM-specific threat modeling
• Shift privacy thinking from legal compliance to architectural accountability
• Set organizational standards for "commercial-safe" LLM usage

This isn't about slowing innovation. It's about future-proofing it. Because the next phase of AI scale won't just be powered by better models. It will be constrained and enabled by how seriously we engineer for trust.

Thanks European Data Protection Board, Isabel Barberá H/T Peter Slattery, PhD
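The "tested outputs" idea can be made concrete. As a hedged sketch (my own construction, not part of the EDPB guidance), a runtime observability check might flag any model output that resurfaces a string the user previously submitted as sensitive:

```python
# Toy runtime check: detect verbatim resurfacing of previously flagged
# sensitive inputs in model outputs. Real systems would also need fuzzy
# and semantic matching, since models rarely echo inputs exactly.
def resurfaced(output: str, prior_sensitive_inputs: set[str]) -> set[str]:
    """Return any previously seen sensitive strings that appear verbatim."""
    return {s for s in prior_sensitive_inputs if s in output}

seen = {"jane@example.com", "123-45-6789"}
hits = resurfaced("Your contact jane@example.com is saved.", seen)
# hits == {"jane@example.com"}: this output should be blocked or redacted.
```

Wired into the serving path, a non-empty result would trigger redaction or human review before the output reaches the user - a small, testable step toward the "privacy observability at runtime" the post calls for.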
-
⚠️ Privacy Risks in AI Management: Lessons from Italy's DeepSeek Ban ⚠️

Italy's recent ban on #DeepSeek over privacy concerns underscores the need for organizations to integrate stronger data protection measures into their AI Management System (#AIMS), AI Impact Assessment (#AIIA), and AI Risk Assessment (#AIRA). Ensuring compliance with #ISO42001, #ISO42005 (DIS), #ISO23894, and #ISO27701 (DIS) guidelines is now more material than ever.

1. Strengthening AI Management Systems (AIMS) with Privacy Controls
🔑 Key Considerations:
🔸 ISO 42001 Clause 6.1.2 (AI Risk Assessment): Organizations must integrate privacy risk evaluations into their AI management framework.
🔸 ISO 42001 Clause 6.1.4 (AI System Impact Assessment): Requires assessing AI system risks, including personal data exposure and third-party data handling.
🔸 ISO 27701 Clause 5.2 (Privacy Policy): Calls for explicit privacy commitments in AI policies to ensure alignment with global data protection laws.
🪛 Implementation Example: Establish an AI Data Protection Policy that incorporates ISO 27701 guidelines and explicitly defines how AI models handle user data.

2. Enhancing AI Impact Assessments (AIIA) to Address Privacy Risks
🔑 Key Considerations:
🔸 ISO 42005 Clause 4.7 (Sensitive Use & Impact Thresholds): Mandates defining thresholds for AI systems handling personal data.
🔸 ISO 42005 Clause 5.8 (Potential AI System Harms & Benefits): Identifies risks of data misuse, profiling, and unauthorized access.
🔸 ISO 27701 Clause A.1.2.6 (Privacy Impact Assessment): Requires documenting how AI systems process personally identifiable information (#PII).
🪛 Implementation Example: Conduct a Privacy Impact Assessment (#PIA) during AI system design to evaluate data collection, retention policies, and user consent mechanisms.

3. Integrating AI Risk Assessments (AIRA) to Mitigate Regulatory Exposure
🔑 Key Considerations:
🔸 ISO 23894 Clause 6.4.2 (Risk Identification): Calls for AI models to identify and mitigate privacy risks tied to automated decision-making.
🔸 ISO 23894 Clause 6.4.4 (Risk Evaluation): Evaluates the consequences of noncompliance with regulations like #GDPR.
🔸 ISO 27701 Clause A.1.3.7 (Access, Correction, & Erasure): Ensures AI systems respect user rights to modify or delete their data.
🪛 Implementation Example: Establish compliance audits that review AI data handling practices against evolving regulatory standards.

➡️ Final Thoughts: Governance Can't Wait
The DeepSeek ban is a clear warning that privacy safeguards in AIMS, AIIA, and AIRA aren't optional. They're essential for regulatory compliance, stakeholder trust, and business resilience.

🔑 Key actions:
◻️ Adopt AI privacy and governance frameworks (ISO 42001 & 27701).
◻️ Conduct AI impact assessments to preempt regulatory concerns (ISO 42005).
◻️ Align risk assessments with global privacy laws (ISO 23894 & 27701).

Privacy-first AI shouldn't be seen as just a cost of doing business; it's your new competitive advantage.
-
France's CNIL has issued its first comprehensive recommendations on applying the GDPR to the development of AI systems, a framework intended to reconcile innovation with the primacy of individual rights. The guidance covers machine learning and general-purpose AI during the pre-deployment lifecycle (design, dataset construction, training) and is crafted to complement the EU AI Act where personal data is involved.

The document operationalizes GDPR principles through an 11-step build pathway: define a specific and legitimate purpose (including for GPAI and research contexts); determine roles and responsibilities (controller, joint controller, processor); select an appropriate legal basis, often legitimate interests subject to a structured three-part test and risk-limiting safeguards; and, where data harvesting is used, impose strict minimization, source-respecting, and expectation-aligned guardrails. It further requires compatibility testing for data reuse; rigorous data minimization, cleaning, documentation, and updates; predefined retention periods; layered transparency, including source disclosures; practicable modalities for access, erasure, and objection even at the model layer; robust security by design; status analysis of models to decide when the GDPR applies (including re-identification testing and encapsulation); GDPR-compliant annotation; and a DPIA oriented to AI-specific risks and controls.

Two implementation signals stand out. First, CNIL expects realistic, proportionate rights-enablement at both the dataset and model levels, including periodic retraining or, where that is disproportionate, robust output-level filters, paired with contractual propagation of updates to downstream users. Second, model status analysis is no longer optional: absent evidence of anonymity, assume the GDPR applies and layer controls (access restriction, output filtering, watermarking) proven by adversarial testing.

Link to the recommendations (in French) will be shared in the comments.

P.S. This post is for academic discussion only.

#AI #GDPR #CNIL #DataProtection #ResponsibleAI #AICompliance #PrivacyByDesign #GeneralPurposeAI #AIGovernance #SecurityByDesign
-
Generative AI offers transformative potential, but how do we harness it without compromising crucial data privacy? It's not an afterthought - it's central to the strategy. Evaluating the right approach depends heavily on specific privacy goals and data sensitivity.

One starting point, with strong vendor contracts, is using the LLM context window directly. For larger datasets, Retrieval-Augmented Generation (RAG) scales well. RAG retrieves relevant information at query time to augment the prompt, which helps keep private data out of the LLM's core training dataset. However, optimizing RAG across diverse content types and meeting user expectations for structured, precise answers can be challenging.

At the other extreme lies self-hosting LLMs. This offers maximum control but introduces significant deployment and maintenance overhead, especially when aiming for the capabilities of large foundation models. For ultra-sensitive use cases, this might be the only viable path. Distilling larger models for specific tasks can mitigate some deployment complexity, but the core challenges of self-hosting remain.

Look at Apple Intelligence as a prime example. Their strategy prioritizes user privacy through on-device processing, minimizing external data access. While not explicitly labeled RAG, the architecture - with its semantic index, orchestration, and LLM interaction - strongly resembles a sophisticated RAG system, proving privacy and capability can coexist.

At Egnyte, we believe robust AI solutions must uphold data security. For us, data privacy and fine-grained, authorized access aren't just compliance hurdles; they are innovation drivers. Looking ahead to advanced agent-to-agent AI interactions, this becomes even more critical. Autonomous agents require a bedrock of trust, built on rigorous access controls and privacy-centric design, to interact securely and effectively. This foundation is essential for unlocking AI's future potential responsibly.
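The RAG pattern described above can be sketched minimally. In this toy example (keyword overlap stands in for embedding search, there is no actual LLM call, and `DOCUMENTS`, `retrieve`, and `build_prompt` are illustrative names), private documents stay in a local store and reach the model only inside the prompt at query time:

```python
# Minimal RAG sketch: private data is retrieved per query and injected into
# the prompt, never contributed to model training. Real systems replace the
# naive word-overlap ranking below with embedding similarity search.
DOCUMENTS = [
    "Q3 revenue target is $2M (internal).",
    "The VPN rollout finishes in May.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query; return the top k."""
    words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Augment the user's question with retrieved private context."""
    context = "\n".join(retrieve(query, DOCUMENTS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("When does the VPN rollout finish?")
# The prompt now carries only the relevant private snippet; the store itself
# never leaves the deployer's control.
```

The privacy property rests on the vendor contract governing prompt handling: the snippet still transits the provider at inference time, which is why the post pairs this option with "strong vendor contracts" or, at the extreme, self-hosting.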