Presenting our work on Differential Privacy and Group Fairness optimization across sensitive and protected attributes in Natural Language Processing models at the North American Chapter of the Association for Computational Linguistics (NAACL) 2024. Quick summary:
⁉️ Research Question: Does differential privacy inhibit efforts to improve model performance and group fairness across protected attributes?
📝 Elevator Pitch: Differential privacy is often seen solely as a barrier to performance. Yet when harnessed with Gaussian noise injection and robust training techniques, it not only safeguards data but also enhances fairness across various computational tasks by acting as a dynamic regularizer.
📊 Technical Core: We introduce Gaussian noise injection as a method for applying differential privacy to stochastic gradient descent during model training. This technique not only serves as a privacy safeguard but also, interestingly, acts as a form of regularization, which can influence model fairness.
🧠 Findings:
- Baseline Scenario: Differential privacy tends to widen the performance gap between groups.
- Robust Training: When coupled with group distributionally robust training objectives, differential privacy can actually reduce performance disparities, enhancing fairness.
📈 Impact:
- Demonstrates the dual role of differential privacy as both a protector of privacy and a regularizer in model training.
- Provides a mathematical framework for balancing privacy and fairness, especially crucial for minority group representations.
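To make the "Gaussian noise injection over stochastic gradient descent" idea concrete, here is a minimal NumPy sketch of a single DP-SGD step (per-example gradient clipping plus Gaussian noise). The function name and hyperparameter values are illustrative, not the paper's code:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1):
    """One DP-SGD update: clip each example's gradient, average, add Gaussian noise."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Bound any single example's influence on the update
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    # The noise scale is tied to the clipping bound; this pairing is what
    # yields the differential-privacy guarantee (and the regularizing effect)
    sigma = noise_multiplier * clip_norm / len(clipped)
    noisy_grad = mean_grad + np.random.normal(0.0, sigma, size=mean_grad.shape)
    return params - lr * noisy_grad
```

The clipping bound is doing double duty here: it caps each individual's influence (privacy) and shrinks outlier gradients (regularization), which is the mechanism behind the fairness effects discussed above.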
The Impact Of Data Privacy On Predictive Modeling
Summary
Data privacy plays a crucial role in predictive modeling, which uses past data to forecast future outcomes. Protecting sensitive information is not only a legal and ethical requirement—it also prevents models from exposing private details and ensures fair, trustworthy results.
- Identify sensitive data: Take the time to define what counts as private or personally identifiable information in your datasets, paying attention to details that could reveal individual identities.
- Use privacy tools: Employ privacy-preserving techniques like anonymization, differential privacy, or synthetic data generation to minimize the risk of leaking sensitive information from your predictive models.
- Monitor model access: Limit who can interact with and fine-tune models, especially those trained on personal data, to prevent adversaries from extracting or inferring private information.
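As a concrete illustration of the "use privacy tools" point, here is a minimal sketch of the Laplace mechanism for a count query; the function names and the epsilon value are illustrative, not from any specific library:

```python
import numpy as np

def private_count(records, predicate, epsilon=1.0):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    A count query has sensitivity 1: adding or removing one record changes
    the true answer by at most 1, so noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise
```

Smaller epsilon means more noise and stronger privacy; the released count is approximately correct but no longer reveals whether any single individual is in the dataset.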
AI systems can unintentionally leak sensitive information not just through obvious outputs but through the subtler patterns and fingerprints that emerge as models are updated or trained. Recent research has shown that attackers can analyse these parameter changes to extract private data from models, including open-source large language models. This kind of leakage is especially concerning when the underlying training data includes personally identifiable information or biometric templates such as fingerprints, facial scans or other identity signals. Biometric data is inherently sensitive because it is immutable and uniquely tied to an individual, which makes such leaks exceptionally high-risk from a privacy and security standpoint.

The implications are clear for organisations using AI in contexts involving identity, authentication or personal data:
• model lifecycle governance must include security and privacy risk assessments, not just performance metrics
• access controls and monitoring need to be designed specifically to prevent side-channel inference
• anonymisation and differential privacy techniques should be standard practice where biometric or PII data is involved

In 2026, data protection and AI governance are converging. It’s no longer enough to build accurate or powerful models. We have to ensure they cannot be weaponised to reveal the very things they were trained to protect.
-
What can #syntheticdata do to transform healthcare? Sharing real patient data – though vital for biomedical research – is often fraught. Synthetic data, generated by #privacypreserving models, offers promising solutions. But how effective is synthetic data in practice, especially in the critical area of clinical model development? In our Scientific Reports paper, we built synthetic versions of one of the world’s richest and most complex biomedical datasets – the UK Biobank – to explore the vast potential of synthetic data for privacy-preserving clinical risk prediction.

#Privacy: We compared multiple privacy-preserving synthetic data generators - synthetic data can replicate complex real data patterns without exposing sensitive patient information.
#ModelDevelopment: Synthetic data proved to be a viable substitute throughout the medical prognostic modelling pipeline. We were able to develop accurate lung cancer prognostic models without ever accessing the real patient data.
#SyntheticDataDeployment: Highlighting different data release approaches, we showed how synthetic biobank data could be integrated into the healthcare system and how it could accelerate research.

Our paper is available here: https://lnkd.in/ecuv8Svu
Zhaozhi Q. / Tom Callender / Bogdan Cebere / Sam Janes / Neal Navani
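As a toy illustration of the release-samples-not-records idea behind synthetic data (this is not one of the generators compared in the paper — just a fitted Gaussian standing in for a synthetic data model, with illustrative function names):

```python
import numpy as np

def fit_gaussian(real_data):
    # Estimate the mean and covariance of the real records;
    # only these summary parameters leave the secure environment
    mu = real_data.mean(axis=0)
    cov = np.cov(real_data, rowvar=False)
    return mu, cov

def sample_synthetic(mu, cov, n, seed=0):
    # Downstream modellers receive draws from the fitted model,
    # never the real patient records themselves
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, cov, size=n)
```

Real synthetic-data generators are far more expressive (and can add formal privacy guarantees), but the workflow is the same: fit on protected data, release only samples.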
-
Before diving headfirst into AI, companies need to define what data privacy means to them in order to use GenAI safely. After decades of harvesting and storing data, many tech companies have created vast troves of the stuff - and not all of it is safe to use when training new GenAI models. Most companies can easily recognize obvious examples of Personally Identifying Information (PII) like Social Security numbers (SSNs) - but what about home addresses, phone numbers, or even information like how many kids a customer has? These details can be just as critical to ensure newly built GenAI products don’t compromise their users' privacy - or safety - but once this information has entered an LLM, it can be really difficult to excise it. To safely build the next generation of AI, companies need to consider some key issues:

⚠️ Defining Sensitive Data: Companies need to decide what they consider sensitive beyond the obvious. Personally identifiable information (PII) covers more than just SSNs and contact information - it can include any data that paints a detailed picture of an individual and needs to be redacted to protect customers.
🔒 Using Tools to Ensure Privacy: Ensuring privacy in AI requires a range of tools that can help tech companies process, redact, and safeguard sensitive information. Without these tools in place, they risk exposing critical data in their AI models.
🏗️ Building a Framework for Privacy: Redacting sensitive data isn’t just a one-time process; it needs to be a cornerstone of any company’s data management strategy as they continue to scale AI efforts. Since PII is so difficult to remove from an LLM once added, GenAI companies need to devote resources to making sure it doesn’t enter their databases in the first place.

Ultimately, AI is only as safe as the data you feed into it. Companies need a clear, actionable plan to protect their customers - and the time to implement it is now.
-
A new paper from Feb 2024, last revised 24 Jun 2024, by a team at the Secure and Fair AI (SAFR AI) Lab at Harvard demonstrates that even with minimal data and partial model access, powerful membership inference attacks (MIAs) on Large Language Models (LLMs) can reveal if specific data points were used to train large language models, highlighting significant privacy risks.

Problem: MIAs on LLMs allow adversaries with access to the model to determine if specific data points were part of the training set, indicating potential privacy leakage. This carries both risks and opportunities:
- Copyright Detection: MIAs can help to verify if copyrighted data was used in training.
- Machine Unlearning: MIAs can help to determine if specific personal information was used for training, which is relevant for the right to be forgotten.
- Train/Test Contamination: Detecting if evaluation examples were part of the training set ensures the integrity and reliability of model assessments.
- Training Dataset Extraction: Extracting training data from generative models highlights privacy vulnerabilities and informs the development of more secure AI systems.

Background and Technical Overview: In a MIA, an adversary with access only to the model tries to ascertain whether a data point belongs to the model’s training data. Since the adversary only has access to the model, detecting training data implies information leakage through the model. Techniques based on Differential Privacy can prevent MIAs but at a significant cost to model accuracy, particularly for large models.

Research Question: While strong MIAs exist for classifiers, given the unique training processes and complex data distributions of LLMs, it was unclear whether strong MIAs are even possible against them. The study introduces two novel MIAs for pretraining data: a neural network classifier based on model gradients and a variant using only logit access, leveraging model-stealing techniques.

Results: The new methods outperform existing techniques.
Even with access to less than 0.001% of the training data, along with the ability to compute model gradients, it's possible to create powerful MIAs. In particular, the findings indicate that fine-tuned models are far more susceptible to privacy attacks compared to pretrained models. Using robust MIAs, the research team extracted over 50% of the training set from fine-tuned LLMs, showcasing the potential extent of data leakage. Practical takeaway: We must limit adversaries' access to models fine-tuned on sensitive data. * * * Paper: “Pandora’s White-Box: Precise Training Data Detection and Extraction in Large Language Models” By Jeffrey G. Wang, Jason Wang, Marvin Li, Seth Neel Paper: https://lnkd.in/gTGGjRwX Blog post: https://lnkd.in/gRCJdM_q Red teaming library: https://lnkd.in/gQxEnWBv Code: https://lnkd.in/g8qpDiSE. Graphic: see paper
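The core membership-inference signal can be illustrated with the classic loss-thresholding baseline — far weaker than the paper's gradient- and logit-based attacks, but it shows why unusually low loss on a candidate example leaks membership. All names here are illustrative:

```python
import numpy as np

def calibrate_threshold(model_loss, known_nonmembers, quantile=0.05):
    # Choose a loss value below which almost no non-member falls
    losses = np.array([model_loss(x) for x in known_nonmembers])
    return np.quantile(losses, quantile)

def loss_mia(model_loss, candidates, threshold):
    # Candidates with suspiciously low loss are guessed to be training
    # members: models tend to fit (memorize) their training data better
    return [bool(model_loss(x) < threshold) for x in candidates]
```

Fine-tuned models amplify exactly this signal, which is consistent with the paper's finding that they are far more susceptible to MIAs than pretrained ones.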
-
🤖🎅🚨 Santa came early this year for data #privacy pros, bringing the EDPB’s Opinion 28/2024, which offers guidance on processing personal data in #AI models and clarifies how to ensure compliance with the #GDPR.

📍 The European Data Protection Board focuses on whether AI models trained with personal data can be anonymous, how controllers can justify legitimate interest as a lawful basis for development and deployment, and the consequences of unlawful processing in the AI model’s development phase. It advises that claims of anonymity be rigorously assessed: an AI model should only be considered anonymous if there is an insignificant likelihood of extracting personal data. DPAs should examine documentation, testing, and methods employed by controllers to ensure that personal data cannot be identified.

📍 The Opinion also highlights the complexity of relying on legitimate interest. Controllers must undertake a three-step test: first, determine a lawful, clearly defined, and genuine interest; second, ascertain that processing is necessary to achieve that interest; and third, verify that individuals’ rights and freedoms do not override the interest. SAs are encouraged to consider all relevant circumstances and possible mitigating measures, particularly given the complexity of AI models. If the original processing in the development phase was unlawful, the Opinion states that the subsequent use of the model may be affected, depending on whether it still contains personal data and who processes it.

📍 The Opinion acknowledges that fully anonymised models may fall outside the GDPR’s scope. If the model retains personal data and is operated by the same controller, the initial lack of a legal basis may influence subsequent phases. If processed by a different controller, that controller must assess the lawfulness of the initial development.
In cases where anonymisation has been achieved, the prior unlawfulness does not carry over to the subsequent phases, although any new personal data processing remains subject to GDPR rules. The Opinion does not extensively cover special categories of data, automated decision-making, purpose compatibility, DPIAs, or data protection by design, but notes that these issues are critical in real-world scenarios and must be appropriately addressed.

📍 Throughout, the Opinion reminds controllers of their accountability obligations. This entails documenting processing activities, assessing risks, implementing privacy-preserving techniques, evaluating data minimisation strategies, and considering the expectations and rights of data subjects.
-
Privacy isn’t a policy layer in AI. It’s a design constraint.

The new EDPB guidance on LLMs doesn’t just outline risks. It gives builders, buyers, and decision-makers a usable blueprint for engineering privacy - not just documenting it.

The key shift?
→ Yesterday: Protect inputs
→ Today: Audit the entire pipeline
→ Tomorrow: Design for privacy observability at runtime

The real risk isn’t malicious intent. It’s silent propagation through opaque systems. In most LLM systems, sensitive data leaks not because someone intended harm but because no one mapped the flows, tested outputs, or scoped where memory could resurface prior inputs. This guidance helps close that gap. And here’s how to apply it:

For Developers:
• Map how personal data enters, transforms, and persists
• Identify points of memorization, retention, or leakage
• Use the framework to embed mitigation into each phase: pretraining, fine-tuning, inference, RAG, feedback

For Users & Deployers:
• Don’t treat LLMs as black boxes. Ask if data is stored, recalled, or used to retrain
• Evaluate vendor claims with structured questions from the report
• Build internal governance that tracks model behaviors over time

For Decision-Makers & Risk Owners:
• Use this to complement your DPIAs with LLM-specific threat modeling
• Shift privacy thinking from legal compliance to architectural accountability
• Set organizational standards for “commercial-safe” LLM usage

This isn’t about slowing innovation. It’s about future-proofing it. Because the next phase of AI scale won’t just be powered by better models. It will be constrained and enabled by how seriously we engineer for trust.

Thanks European Data Protection Board, Isabel Barberá
H/T Peter Slattery, PhD
-
Impressive technical result: the team at Google Research trained an LLM with differential privacy guarantees that apply to the *corpus data*, not just a fine-tuning dataset 😮

So what about what I wrote in my "privacy in AI" blog post¹, that you couldn't get strong privacy guarantees on a Web-scale corpus of unstructured data? It comes down to the details: the guarantee applies to any *sequence of up to 1024 tokens*. This means that:
- If the data about someone only appears in a single sequence (e.g. one reasonably short document), it's provably protected ✅
- If it's repeated across many documents, you can't rely on the DP guarantee alone to get provable guarantees ❌

It's not exactly bulletproof, but it's certainly a big practical improvement that pushes the state of the art. It also seems to bring practical benefits: no memorization was detected using the standard ways of measuring this phenomenon ✨

Of course, it comes at a cost, with the model's performance being roughly similar to models from 5 years ago. But it's a great first step, and one which improves our understanding of scaling laws for DP model training. We should expect this gap to narrow in the future 🗜️

Blog post ➡️ https://lnkd.in/eujF_twT 💡
Technical report ➡️ https://lnkd.in/eSFbWRYg 📜
Scientific paper focused on scaling laws ➡️ https://lnkd.in/efbYin_P 📈
¹ https://lnkd.in/eByWAtKE

Featuring a bunch of folks: Amer Sinha, Thomas Mesnard, Ryan McKenna, Daogao Liu, Christopher A. Choquette-Choo, Yangsibo Huang, Da Yu, George Kaissis, Zachary Charles, Ruibo Liu, Lynn Chua, Pritish Kamath, Pasin Manurangsi, Steve He, Chiyuan Zhang, Badih Ghazi, Borja Balle, Prem E., Tris Warkentin, Armand Joulin, and Ravi Kumar 👏
-
Let's make it clear: We need more frameworks for evaluating data protection risks in AI systems. As I delve into this topic, more and more new papers and risk assessment approaches appear. One of them is described in the paper titled "Rethinking Data Protection in the (Generative) Artificial Intelligence Era."

👉 My key takeaways:

1️⃣ Begin by identifying the data that should be protected in AI systems. The authors recommend focusing on the following:
• Training Datasets
• Trained Models
• Deployment-integrated Data (e.g., protect your internal system prompts and external knowledge bases like RAG).
❗ I loved this differentiation and risk assessment: if, for example, an adversary discovers your system prompts, they might try to exploit them. Also, protecting sensitive RAG data is essential.
• User Prompts (e.g., besides prompt protection, add transparency and let users know if prompts will be logged or used for training).
• AI-generated Content (e.g., ensure traceability to understand its provenance if used for training, etc.).

2️⃣ The authors also introduce an interesting taxonomy of data protection areas to focus on when dealing with generative AI:
• Level 1: Data Non-usability. Ensures that specified data cannot contribute to model learning or prediction in any way, by using strategies that block any unauthorized party from using or even accessing protected data (e.g., encryption, access controls, unlearnable examples, non-transferable learning, etc.)
• Level 2: Data Privacy-preservation. Here, the focus is on how training can be performed with privacy-enhancing techniques (PETs): K-anonymity and L-diversity schemes, differential privacy, homomorphic encryption, federated learning, and split learning.
• Level 3: Data Traceability. This is about the ability to track the origin, history, and influence of data as it is used in AI applications during training and inference. This capability allows stakeholders to audit and verify data usage.
This can be categorised into intrusive methods (e.g., digital watermarking with signatures to datasets, model parameters, or prompts) and non-intrusive methods (e.g., membership inference, model fingerprinting, cryptographic hashing, etc.).
• Level 4: Data Deletability. This is about the capacity to completely remove a specific piece of data and its influence from a trained model (the authors recommend exploring unlearning techniques that specifically focus on erasing the influence of the data in the model, rather than the content or model itself).

------------------------------------------------------------------------
👋 I'm Vadym, an expert in integrating privacy requirements into AI-driven data processing operations.
🔔 Follow me to stay ahead of the latest trends and to receive actionable guidance on the intersection of AI and privacy.
✍ Expect content that is solely authored by me, reflecting my reading and experiences.
#AI #privacy #GDPR
-
It’s nearly impossible to make AI forget your data.

AI models are increasingly trained on data that we might not actually want to include. Our private messages. Our kids' pictures. That angry email complaining about our previous boss. And the problematic issue is: those models may reproduce the private data verbatim, exposing that data to third parties interacting with the model. Yes, like that very boss you complained about!

Luckily, Harvard University is researching how to make AI "unlearn" unwanted data. This means efficiently removing the influence of that data from the model without having to retrain it from scratch. But there's a problem: removing data might expose your private information even more. How?

Seth Neel, assistant professor at Harvard Business School, explains it simply: "Suppose that I'm computing the average income of a group of people stored in a database and publishing that information every day. I just compute the average, and then let's say the next day, I compute the average again. But on that day, let's say you have left that group of people, and been removed from the database. If I just look at those two averages and subtract them, I can exactly get what your income was—this is a simple example called a differencing attack."

The solution? Add a small amount of random noise to those averages, so they're not exactly correct. In this way, we delete the data but at the same time we protect it.

My takeaway: Data deletion is the next frontier of AI safety, and the hardest to solve. Until machines can unlearn by themselves, the only real privacy protection is human judgment. Think twice before feeding your life into an algorithm.

Read the full article here: https://lnkd.in/d5zdjYjJ
----
🔔 Follow me Chiara Gallese, Ph.D. for everyday insights on AI Safety
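Seth Neel's differencing attack, and the noisy fix, can be reproduced in a few lines; all incomes and parameter values here are illustrative:

```python
import numpy as np

def avg_income(db):
    return sum(db) / len(db)

def noisy_avg_income(db, epsilon=0.5, max_income=200_000):
    # Laplace noise calibrated to the mean query's sensitivity
    # (how much one person can shift the average) hides any
    # single individual's contribution
    sensitivity = max_income / len(db)
    return avg_income(db) + np.random.laplace(0.0, sensitivity / epsilon)

# Differencing attack on the exact averages:
day1 = [50_000, 60_000, 90_000]   # your income is 90_000
day2 = [50_000, 60_000]           # you left the database
leaked = avg_income(day1) * len(day1) - avg_income(day2) * len(day2)
# leaked ≈ 90_000: your exact income, recovered from two "harmless" averages
```

Publishing `noisy_avg_income` instead of the exact average breaks the attack: the subtraction now yields the true income plus noise large enough to mask it, while the published statistic stays approximately correct.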