The EDPB recently published a report on AI Privacy Risks and Mitigations in LLMs. This is one of the most practical and detailed resources I've seen from the EDPB, with extensive guidance for developers and deployers. The report walks through privacy risks associated with LLMs across the AI lifecycle, from data collection and training to deployment and retirement, and offers practical tips for identifying, measuring, and mitigating risks.

Here's a quick summary of some of the key mitigations mentioned in the report:

For providers:
• Fine-tune LLMs on curated, high-quality datasets and limit the scope of model outputs to relevant and up-to-date information.
• Use robust anonymisation techniques and automated tools to detect and remove personal data from training data.
• Apply input filters and user warnings during deployment to discourage users from entering personal data, as well as automated detection methods to flag or anonymise sensitive input data before it is processed (a sketch of such a filter follows this post).
• Clearly inform users about how their data will be processed through privacy policies, instructions, warnings or disclaimers in the user interface.
• Encrypt user inputs and outputs during transmission and storage to protect data from unauthorised access.
• Protect against prompt injection and jailbreaking by validating inputs, monitoring LLMs for abnormal input behaviour, and limiting the amount of text a user can input.
• Apply content filtering and human review processes to flag sensitive or inappropriate outputs.
• Limit data logging and provide configurable options to deployers regarding log retention.
• Offer easy-to-use opt-in/opt-out options for users whose feedback data might be used for retraining.

For deployers:
• Enforce strong authentication to restrict access to the input interface and protect session data.
• Mitigate adversarial attacks by adding a layer for input sanitisation and filtering, and by monitoring and logging user queries to detect unusual patterns.
• Work with providers to ensure they do not retain or misuse sensitive input data.
• Guide users to avoid sharing unnecessary personal data through clear instructions, training and warnings.
• Educate employees and end users on proper usage, including the appropriate use of outputs and phishing techniques that could trick individuals into revealing sensitive information.
• Ensure employees and end users avoid overreliance on LLMs for critical or high-stakes decisions without verification, and ensure outputs are reviewed by humans before implementation or dissemination.
• Securely store outputs and restrict access to authorised personnel and systems.

This is a rare example where the EDPB strikes a good balance between practical safeguards and legal expectations. Link to the report included in the comments.

#AIprivacy #LLMs #dataprotection #AIgovernance #EDPB #privacybydesign #GDPR
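To make the input-filtering mitigation concrete, here is a minimal sketch of a pre-processing filter that redacts personal data before a prompt reaches the model. It is not from the report; the regex patterns and function names are illustrative, and a production system would use a trained PII detector rather than regexes alone:

```python
import re

# Illustrative patterns only; a real filter would use an NER model or a
# dedicated PII-detection library rather than regexes alone.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "iban":  re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact_prompt(prompt: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders before the prompt reaches the model."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            found.append(label)
            prompt = pattern.sub(f"[{label.upper()}_REDACTED]", prompt)
    return prompt, found

clean, flags = redact_prompt("Contact me at jane.doe@example.com or +44 20 7946 0958")
if flags:
    print(f"Warning shown to user: detected {flags}")  # user warning, as the report suggests
print(clean)
```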
Data Privacy Standards in Machine Learning
Summary
Data privacy standards in machine learning refer to the rules and practices designed to protect personal information when AI systems are built, trained, and used. These standards help ensure that sensitive data is handled responsibly, minimizing risks of misuse or accidental exposure.
- Prioritize informed consent: Always make sure users understand how their data will be used and give them clear choices for opting in or out of data collection.
- Secure the data pipeline: Use anonymization, encryption, and access controls to keep personal information safe throughout every stage of the machine learning process (a minimal pseudonymization sketch follows this list).
- Audit and monitor regularly: Set up ongoing checks and reviews to catch any privacy risks, and update your safeguards as new threats or regulations emerge.
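As one illustration of securing the pipeline, here is a minimal pseudonymization sketch, assuming a keyed hash is acceptable for the use case; the key handling and field names are illustrative only:

```python
import hmac, hashlib

SECRET_KEY = b"rotate-me-and-store-in-a-vault"  # illustrative; never hard-code in practice

def pseudonymize(identifier: str) -> str:
    """Keyed hash so the same user maps to a stable token without storing the raw ID."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"user_id": "jane.doe@example.com", "age_bucket": "30-39", "clicks": 17}
record["user_id"] = pseudonymize(record["user_id"])
print(record)  # the training pipeline only ever sees the token, not the raw identifier
```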
-
This new white paper by Stanford Institute for Human-Centered Artificial Intelligence (HAI), titled "Rethinking Privacy in the AI Era", addresses the intersection of data privacy and AI development, highlighting the challenges and proposing solutions for mitigating privacy risks. It outlines the current data protection landscape, including the Fair Information Practice Principles, GDPR, and U.S. state privacy laws, and discusses the distinction and regulatory implications between predictive and generative AI.

The paper argues that AI's reliance on extensive data collection presents unique privacy risks at both individual and societal levels. It notes that existing laws are inadequate for the emerging challenges posed by AI systems, because they don't fully tackle the shortcomings of the Fair Information Practice Principles (FIPs) framework or concentrate adequately on the comprehensive data governance measures necessary for regulating data used in AI development.

According to the paper, FIPs are outdated and not well-suited for modern data and AI complexities, because they:
- Do not address the power imbalance between data collectors and individuals.
- Fail to enforce data minimization and purpose limitation effectively.
- Place too much responsibility on individuals for privacy management.
- Allow for data collection by default, putting the onus on individuals to opt out.
- Focus on procedural rather than substantive protections.
- Struggle with the concepts of consent and legitimate interest, complicating privacy management.

It emphasizes the need for new regulatory approaches that go beyond current privacy legislation to effectively manage the risks associated with AI-driven data acquisition and processing. The paper suggests three key strategies to mitigate the privacy harms of AI:

1.) Denormalize Data Collection by Default: Shift from opt-out to opt-in data collection models to facilitate true data minimization. This approach emphasizes "privacy by default" and the need for technical standards and infrastructure that enable meaningful consent mechanisms.

2.) Focus on the AI Data Supply Chain: Enhance privacy and data protection by ensuring dataset transparency and accountability throughout the entire lifecycle of data. This includes a call for regulatory frameworks that address data privacy comprehensively across the data supply chain.

3.) Flip the Script on Personal Data Management: Encourage the development of new governance mechanisms and technical infrastructures, such as data intermediaries and data permissioning systems, to automate and support the exercise of individual data rights and preferences. This strategy aims to empower individuals by facilitating easier management and control of their personal data in the context of AI.

by Dr. Jennifer King and Caroline Meinhardt

Link: https://lnkd.in/dniktn3V
-
On Protecting the Data Privacy of Large Language Models (LLMs): A Survey

From the research paper: In this paper, we extensively investigate data privacy concerns within LLMs, specifically examining potential privacy threats from two folds: privacy leakage and privacy attacks, and the pivotal technologies for privacy protection during various stages of the LLM lifecycle, including federated learning, differential privacy, knowledge unlearning, and hardware-assisted privacy protection.

Some key aspects from the paper:

1) Challenges: Given the intricate complexity involved in training LLMs, privacy protection research tends to dissect various phases of LLM development and deployment, including pre-training, prompt tuning, and inference.

2) Future Directions: Protecting the privacy of LLMs throughout their creation process is paramount and requires a multifaceted approach.
(i) Firstly, during data collection, minimizing the collection of sensitive information and obtaining informed consent from users are critical steps. Data should be anonymized or pseudonymized to mitigate re-identification risks.
(ii) Secondly, in data preprocessing and model training, techniques such as federated learning, secure multiparty computation, and differential privacy can be employed to train LLMs on decentralized data sources while preserving individual privacy (see the sketch after this post).
(iii) Additionally, conducting privacy impact assessments and adversarial testing during model evaluation ensures potential privacy risks are identified and addressed before deployment.
(iv) In the deployment phase, privacy-preserving APIs and access controls can limit access to LLMs, while transparency and accountability measures foster trust with users by providing insight into data handling practices.
(v) Ongoing monitoring and maintenance, including continuous monitoring for privacy breaches and regular privacy audits, are essential to ensure compliance with privacy regulations and the effectiveness of privacy safeguards.

By implementing these measures comprehensively throughout the LLM creation process, developers can mitigate privacy risks and build trust with users, thereby leveraging the capabilities of LLMs while safeguarding individual privacy.

#privacy #llm #llmprivacy #mitigationstrategies #riskmanagement #artificialintelligence #ai #languagelearningmodels #security #risks
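Since the survey highlights federated learning as a core protection during training, here is a toy FedAvg sketch (linear model, NumPy) showing the central idea: only model updates, never raw records, leave each client. The simple least-squares objective and all names are illustrative, not the survey's code:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One gradient step of least-squares regression on a client's private data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, clients):
    """FedAvg: clients train locally; only weights, never raw data, are shared."""
    updates = [local_update(global_weights.copy(), X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(updates, axis=0, weights=sizes / sizes.sum())

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
w = np.zeros(3)
for _ in range(50):
    w = federated_round(w, clients)
print(w)  # global model learned without centralizing any client's records
```

Note that in practice FedAvg alone does not guarantee privacy; the survey pairs it with differential privacy or secure aggregation for that reason.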
-
Exploring the Intersection of AI and GDPR

In this article, we navigate through the critical aspects of data protection during the deployment of AI and Large Language Models (LLMs), highlighting real-world challenges and their solutions.

Diversity of Personal Data in AI: AI appears in various forms, with chatbots being a common example. Training these chatbots demands comprehensive datasets, typically composed of audio, video, image, or text data. Special categories of data, such as sexual orientation or political beliefs, need extra attention as per Article 9(1) of the GDPR.

Legal Bases for Data Processing: Key legal grounds for processing data include consent (Article 6(1)(a) GDPR), legitimate interest (Article 6(1)(f) GDPR), and contractual relationships (Article 6(1)(b) GDPR). National data protection laws can also be relevant, particularly in areas such as public administration, justice, employee data, and research.

Obtaining Training Data: The process of acquiring training data necessitates thoughtful handling of consent, especially when dealing with sensitive information. Repurposing existing data for AI training must respect the purpose limitation principle of Article 5(1)(b) GDPR.

Navigating Challenges with Scraped Data: Processing data obtained through scraping requires careful scrutiny due to potential source legality issues and the risk of re-identification. Legal bases like legitimate interest (Article 6(1)(f) GDPR) could apply, but careful evaluation is required.

Anonymized or Aggregated Data as a Solution: Anonymized data, where identification is practically impossible, is outside the scope of GDPR. However, both anonymization and the creation of aggregated datasets are themselves personal data processing, requiring a legal basis, often relying on legitimate interest. (A small sketch of releasing only sufficiently large aggregates follows this post.)

Upholding Rights and Data Security: Individuals have rights such as access, correction, deletion, and restriction under GDPR. Challenges can arise with individualizing Big Data and the inherent "black box" issue in AI decision-making.

Emphasizing Security Measures: Given the risks associated with processing personal data via AI, prioritizing data security is crucial.

Conclusion: Implementing generative AI or LLMs under GDPR brings significant data protection considerations. Compliance measures include updating processing records, revising privacy notices, obtaining consent, conducting Data Protection Impact Assessments, implementing Technical and Organizational Measures, and potentially signing Data Processing Agreements. In addition, for AI, an Algorithmic Impact Assessment can be performed (see the link below).

#AI #GDPR #DataProtection
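To illustrate the aggregation point, here is a minimal sketch of releasing only group counts above a k-anonymity-style threshold. The threshold and field names are illustrative, and a real assessment would also have to consider differencing attacks across repeated releases:

```python
from collections import Counter

K = 5  # minimum group size before an aggregate is released; illustrative value

def safe_counts(rows, key):
    """Release per-group counts only when each group is large enough to resist re-identification."""
    counts = Counter(row[key] for row in rows)
    return {group: n for group, n in counts.items() if n >= K}

rows = [{"region": "Bavaria"}] * 12 + [{"region": "Saarland"}] * 2
print(safe_counts(rows, "region"))  # {'Bavaria': 12}; the small Saarland cell is suppressed
```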
-
Privacy isn’t a policy layer in AI. It’s a design constraint.

The new EDPB guidance on LLMs doesn’t just outline risks. It gives builders, buyers, and decision-makers a usable blueprint for engineering privacy - not just documenting it.

The key shift?
→ Yesterday: Protect inputs
→ Today: Audit the entire pipeline
→ Tomorrow: Design for privacy observability at runtime

The real risk isn’t malicious intent. It’s silent propagation through opaque systems. In most LLM systems, sensitive data leaks not because someone intended harm but because no one mapped the flows, tested outputs, or scoped where memory could resurface prior inputs. This guidance helps close that gap.

And here’s how to apply it:

For Developers:
• Map how personal data enters, transforms, and persists (a minimal sketch follows this post)
• Identify points of memorization, retention, or leakage
• Use the framework to embed mitigation into each phase: pretraining, fine-tuning, inference, RAG, feedback

For Users & Deployers:
• Don’t treat LLMs as black boxes. Ask if data is stored, recalled, or used to retrain
• Evaluate vendor claims with structured questions from the report
• Build internal governance that tracks model behaviors over time

For Decision-Makers & Risk Owners:
• Use this to complement your DPIAs with LLM-specific threat modeling
• Shift privacy thinking from legal compliance to architectural accountability
• Set organizational standards for “commercial-safe” LLM usage

This isn’t about slowing innovation. It’s about future-proofing it. Because the next phase of AI scale won’t just be powered by better models. It will be constrained and enabled by how seriously we engineer for trust.

Thanks European Data Protection Board, Isabel Barberá
H/T Peter Slattery, PhD
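As one way to start mapping how personal data enters, transforms, and persists, here is a hedged sketch of a tiny flow registry built from a decorator. All stage names and functions are hypothetical; a real system would use proper data-lineage tooling, but the idea of recording every stage that touches personal data is the same:

```python
import functools

DATA_FLOW_LOG: list[dict] = []  # in practice this record would feed your DPIA evidence

def tracks_personal_data(stage: str, retains: bool = False):
    """Decorator that records every pipeline stage through which personal data flows."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            DATA_FLOW_LOG.append({"stage": stage, "fn": fn.__name__, "retains": retains})
            return fn(*args, **kwargs)
        return inner
    return wrap

@tracks_personal_data("ingestion")
def load_chat(user_msg: str) -> str:
    return user_msg

@tracks_personal_data("inference", retains=True)  # flagged: provider may log prompts
def call_llm(prompt: str) -> str:
    return f"echo: {prompt}"  # stand-in for a real model call

call_llm(load_chat("hello"))
for entry in DATA_FLOW_LOG:
    print(entry)  # audit which stages saw personal data and which retained it
```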
-
Your AI models are learning from your most sensitive data. Here's why that should worry you.

Most companies don’t stop to ask: what happens to that data once it’s inside the model? 🤯

That’s where Privacy-Preserving Machine Learning (PPML) comes in. It lets you train powerful AI models without ever exposing your raw data. Here's how it works:

⭐ Differential Privacy - Adds mathematical noise to your data so individual records can't be identified, but the AI still learns useful patterns. E.g. Apple uses this to collect iOS usage stats without exposing individuals. (A toy sketch follows this post.)

⭐ Federated Learning - Trains models across multiple devices or organizations without centralizing the data anywhere. E.g. Google trains Gboard’s next-word predictions across millions of devices without centralizing keystrokes.

⭐ Homomorphic Encryption - Lets AI process encrypted data without ever decrypting it. E.g. Imagine a bank detecting fraud on encrypted transactions without decrypting them.

⭐ Secure Multi-party Computation - Multiple parties can jointly train a model without sharing their raw data with each other. E.g. Healthcare orgs collaborate on drug discovery without ever exchanging patient records.

In a world where everyone is trying to build AI apps and AI-native workflows, the companies that figure out PPML first will have a massive competitive advantage and will be able to:
✅ Tap into more data sources
✅ Collaborate across industries
✅ Earn customer trust

👉 What’s your biggest privacy concern with how AI is being used today?
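To show the differential-privacy idea in miniature, here is a toy central-model sketch that adds Laplace noise to a count. (Apple's deployment actually uses local differential privacy applied on-device; this simplification just illustrates the noise-for-privacy trade-off, and the epsilon values are illustrative.)

```python
import numpy as np

def dp_count(values, epsilon=1.0):
    """Differentially private count: Laplace noise calibrated to sensitivity 1."""
    true_count = len(values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)  # smaller epsilon = more noise
    return true_count + noise

feature_users = ["u1", "u2", "u3", "u4", "u5"]
print(dp_count(feature_users, epsilon=0.5))  # usable aggregate, no individual exposed
```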
-
🌟 The European Data Protection Board (EDPB) has released its guidance on AI models and personal data processing! If you’re navigating the complexities of AI development, this is a must-read. 🚀

Key highlights:

🔒 Anonymity Redefined: AI models aren’t automatically anonymous. The bar is high - data must be irreversibly anonymized, ensuring extraction is highly unlikely. (A toy memorization probe is sketched after this post.)

⚖️ Legitimate Interest: A viable legal basis, but it requires rigorous necessity tests and rights balancing. No shortcuts here!

🌐 Web Scraping under Spotlight: Safeguards and opt-out mechanisms are essential when scraping public data for training AI models.

🚫 Unlawful Data Use: Mishandling personal data during training can undermine the lawfulness of the model’s deployment unless anonymized appropriately.

🪞 Transparency Matters: Clear, accessible communication with data subjects is non-negotiable.

💡 Why this matters: The EDPB underscores the need for privacy by design and accountability. AI innovation must align with GDPR principles to ensure trust and compliance. Whether you’re an in-house counsel, compliance professional, or innovator in AI, these guidelines are a wake-up call for responsible AI development. Stay ahead by embedding these principles into your practices. 💼✨

#AI #GDPR #Privacy #DataProtection #ArtificialIntelligence #EDPB #LegalTech #Compliance #Innovation
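How would you test that extraction is "highly unlikely"? One common approach is a canary-based memorization probe: plant unique strings in training data, then check whether the model regurgitates them. The sketch below is illustrative, and `model_generate` is a hypothetical stand-in for your model's generation API:

```python
CANARIES = [
    "Jane Doe's IBAN is DE89 3704 0044 0532 0130 00",
    "Patient 4711 was diagnosed with",
]

def model_generate(prompt: str) -> str:
    """Hypothetical stand-in for the model's generation API."""
    return "..."

def extraction_test(canaries, prefix_len=6):
    """Prompt with the start of each planted canary and check whether the tail resurfaces."""
    leaks = []
    for canary in canaries:
        words = canary.split()
        prefix, tail = " ".join(words[:prefix_len]), " ".join(words[prefix_len:])
        if tail and tail in model_generate(prefix):
            leaks.append(canary)
    return leaks

print(extraction_test(CANARIES))  # a non-empty result would undercut any anonymity claim
```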
-
😥 AI Privacy Has Moved Beyond Data Protection

📍 What are the data privacy risks when we use LLMs in real life?

🚨 The paper “Beyond Data Privacy: New Privacy Risks for Large Language Models” highlights a major shift: AI privacy breaches now happen after deployment, not just during training.

New risks include:
👉 Side-channel leaks: timing, caching, or packet patterns reveal what users type or discuss.
👉 Memory & agent leakage: assistants reuse or expose sensitive context across tasks or tools. (A minimal sketch of scoping agent memory follows this post.)
👉 Automated profiling: LLMs infer identities and sensitive traits from aggregated traces.
👉 Mass-scale social engineering: LLMs automate phishing, impersonation, and reconnaissance.
👉 Fine-tuning and in-context exposure: small, repeated datasets and shared examples increase memorization and extractability.

⚠️ These risks arise from how models are deployed, not just how they’re trained. The causes lie in system-level design, optimisation, tool integration, and shared memory.

🤔 Safeguarding privacy therefore becomes more complex: applications and data usage need to be audited along the entire route they travel.

🎯 Bottom Line: For modern AI systems, privacy is operational, not static.

🔗 Link to the paper in the comments

#artificialintelligence #data #privacy #risk #governance
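As a minimal illustration of containing memory and agent leakage, here is a sketch of per-task scoped memory that is scrubbed when a task ends. The class and method names are illustrative, not from the paper:

```python
class ScopedMemory:
    """Per-task memory that cannot leak context into the next task."""
    def __init__(self):
        self._store: dict[str, list[str]] = {}

    def remember(self, task_id: str, fact: str):
        self._store.setdefault(task_id, []).append(fact)

    def recall(self, task_id: str) -> list[str]:
        return self._store.get(task_id, [])  # only this task's context, never another's

    def end_task(self, task_id: str):
        self._store.pop(task_id, None)  # scrub memory on task completion

mem = ScopedMemory()
mem.remember("t1", "user's home address")
mem.end_task("t1")
print(mem.recall("t1"))  # [] - nothing persists to resurface in a later task
```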
-
The European Data Protection Board published the report “AI Privacy Risks & Mitigations - Large Language Models” to help developers and users systematically identify, assess, and mitigate privacy and data protection risks, supporting the responsible development and deployment of LLM systems.

The report starts by explaining how LLMs work, their common applications, and performance evaluation measures. Then, it discusses how #privacy risks emerge across different #LLM service models, emphasizing the importance of understanding data flows throughout the #artificialintelligence lifecycle. Once the risks are identified and classified, the document shows how to prioritize mitigation efforts effectively. It also discusses residual risk acceptance and the iterative nature of risk management.

This guidance also supports the requirements of Art. 25 of the #GDPR (data protection by design and by default) and Art. 32 (security of processing) by offering technical and organizational measures to help ensure an appropriate level of #security and data protection.
-
This Stanford University white paper outlines a comprehensive examination of privacy and data protection challenges in the age of artificial intelligence. It emphasizes the insufficiency of existing privacy laws to address the unique risks posed by AI systems and proposes novel approaches for enhancing data privacy. Three key suggestions are offered: denormalizing data collection by default, focusing on the AI data supply chain, and innovating the management of personal data to improve privacy protections.

1️⃣ Denormalize data collection by default by shifting away from opt-out to opt-in data collection. Data collectors must facilitate true data minimization through “privacy by default” strategies and adopt technical standards and infrastructure for meaningful consent mechanisms.

2️⃣ Focus on the AI data supply chain to improve privacy and data protection. Ensuring dataset transparency and accountability across the entire life cycle must be a focus of any regulatory system that addresses data privacy.

3️⃣ Flip the script on the creation and management of personal data. Policymakers should support the development of new governance mechanisms and technical infrastructure (e.g., data intermediaries and data permissioning infrastructure) to support and automate the exercise of individual data rights and preferences.

The paper argues for a shift in regulatory approaches to better protect individual and societal privacy in the AI-dominated landscape, underscoring the urgent need for updated privacy frameworks in the era of exponential data growth and AI advancements.

✍🏻 Dr. Jennifer King, Caroline Meinhardt. Stanford Institute for Human-Centered Artificial Intelligence (HAI). February 2024.

✅ Subscribe to my newsletter and stay at the forefront of groundbreaking studies. Get started here: https://lnkd.in/eR7qichj