Diverse Datasets for Equitable AI Models

Summary

Building diverse datasets for equitable AI models means gathering and using a wide range of data (from different languages, cultures, and regions) so that artificial intelligence systems can learn fairly and avoid bias. This approach helps AI serve people from all backgrounds accurately, rather than favoring only those who are already well represented in existing data.

Seek broad representation: Make sure your data collection includes voices, languages, and formats from varied communities to build AI that recognizes and respects everyone's needs.
Promote transparency: Regularly test for bias and clearly document how your datasets are built and used to encourage trust and accountability in AI systems.
Collaborate locally: Work with community members and organizations to gather unique cultural and linguistic data, helping AI models become more inclusive and accurate.

Summarized by AI based on LinkedIn member posts

Maximilian Nickel

Research Director at Meta, FAIR | AI ∩ Society ∩ Complex Systems

3,849 followers 10mo
🦄 Today we're releasing Community Alignment, the largest open-source dataset to align LLMs with people's preferences in a variety of cultural contexts, containing ~200k comparisons from >3000 annotators in 5 countries and languages! There was a lot of research that went into this... 🧵

🔍 We started by conducting a joint human study and model evaluation with 15,000 nationally representative participants from 5 countries and 21 LLMs. We found that the LLMs exhibited an *algorithmic monoculture*, where all models aligned with the same minority of human preferences.

🚫 Standard alignment methods fail to learn common human preferences (as identified in our joint human-model study) from existing preference datasets because the candidate responses that people choose from are too homogeneous, even when they are sampled from multiple models.

🥭 Intuitively, if all the candidate responses cover only one set of values, you'll never be able to learn preferences outside of those values. It is as if someone asks me to pick between four types of apples when what I really want is a mango: the comparison never measures that.

🌈 To produce more diverse candidate sets, rather than sampling candidates independently, you want some kind of "negatively-correlated (NC) sampling", where sampling one candidate makes other similar ones less likely. It turns out prompting can implement this decently well, with win rates jumping from random chance to ~0.8 🤡

💽 Finally, based on these insights, we collected and open-sourced (CC-BY 4.0) the Community Alignment (CA) dataset. Features include:
- NC-sampled candidate responses
- Multilingual (64% non-English)
- >2500 prompts annotated by >= 10 people
- Natural language explanations for > 1/4 of choices
and more!

This was a big project and a collective effort spanning FAIR, AI at Meta, Meta Governance, and Meta Policy, as well as NYU and Ecole Polytechnique. Major thanks to all the collaborators (see paper) and especially the amazing Smitha Milli and Kris R., who led this project masterfully from start to finish. Also, thanks to Joelle Pineau, Rob Fergus, Stephane Kasriel, and Rob Sherman for their support 🙏

And this is not the end! 😉 If you want to support us in doing more of these releases, email communityalignment@meta.com (or me) with feedback on what you liked about CA and what you want to see more of.

Paper: https://lnkd.in/ejJqGQfS
Dataset: https://lnkd.in/e5Vp6z2E
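To make the idea of negatively-correlated sampling concrete, here is a minimal Python sketch. It is not the method from the paper (the post says the authors implement NC sampling through prompting); it only illustrates the principle that picking one candidate should make similar candidates less likely. The word-overlap similarity, the `penalty` parameter, and the function names `jaccard` and `nc_sample` are all assumptions made for illustration.

```python
import math
import random


def jaccard(a: str, b: str) -> float:
    """Toy similarity (an assumption): overlap of lowercase word sets, 0.0 to 1.0."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def nc_sample(candidates, k, rng, penalty=5.0):
    """Draw up to k candidates without replacement.

    After each pick, every remaining candidate's weight is multiplied by
    exp(-penalty * similarity_to_pick), so near-duplicates of what was
    already chosen become less likely to be drawn next.
    """
    remaining = list(candidates)
    weights = [1.0] * len(remaining)
    chosen = []
    while remaining and len(chosen) < k:
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        pick = remaining.pop(idx)
        weights.pop(idx)
        chosen.append(pick)
        # Down-weight whatever resembles the candidate we just picked.
        weights = [
            w * math.exp(-penalty * jaccard(pick, c))
            for w, c in zip(weights, remaining)
        ]
    return chosen


# Hypothetical candidate responses echoing the apples-vs-mango analogy.
candidates = [
    "apples are the best fruit",
    "green apples are the best fruit",
    "red apples are the best fruit",
    "mangoes are sweeter than anything",
]
picked = nc_sample(candidates, k=2, rng=random.Random(0))
print(picked)
```

With independent sampling, three of the four candidates above are near-duplicates, so a pair of "apple" responses is the most likely draw. With the penalty applied, once an apple response is picked the mango response carries most of the remaining weight. A real pipeline would presumably swap the word-overlap measure for a semantic similarity and would need to tune the penalty.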

Dr. Dinesh Chandrasekar DC

CEO & Founder @ Dinwins Intelligence 1st Consulting | Strategist | Investor | Board Advisor | Nasscom DeepTech,Telangana AI Mission & HYSEA-Mentor | Alumni of Hitachi, GE, Citigroup & Centific AI| Billion $ before Sunset

37,765 followers 1y
#AiDays2025 Round Table: #Community Sourcing for low-resource languages

In an era where AI is fast shaping the contours of our digital future, the VISWAM.AI initiative stands as a timely and transformational one. Their mission to build community-sourced Large Language Models (LLMs), grounded in India’s rich linguistic and cultural diversity, is not just pioneering; it’s redefining how inclusive and ethical AI should be built. By anchoring their work in community participation, linguistic preservation, and ethical co-creation, Viswam.ai offers a people-first approach to AI, moving beyond data extraction to cultural stewardship. Their ambition to mobilize 1 lakh (100,000) community interns to collect data from underrepresented geographies across India is both bold and brilliant. This isn’t just about building better AI; it’s about building equity, agency, and cultural resilience through AI.

1. Linguistic Equity by Design
In India, where linguistic hegemony often privileges English and Hindi, AI systems risk reinforcing this imbalance. The solution? Intentional design. Allocate equal engineering and validation efforts to low-resource languages. Ethical AI must be built on informed consent, community ownership, and fair compensation, because data is not just input; it’s identity and heritage.

2. Decentralized Internship Model
By decentralizing AI development, we bridge the urban-rural digital divide. This model should focus on:
- Capacity building through training in ethics and digital literacy
- Inclusivity by involving women, Dalit and Adivasi youth
- Localized platforms using mobile-first tools in native languages
Partnerships with Swecha, local NGOs, and institutions serve as trust bridges to ensure mentorship and sustainability.

3. Tools for Low-Resource Languages
Many Indian languages are oral-first, with complex dialects and sparse corpora. Community-driven solutions, like collecting voice datasets from folklore and crowdsourcing annotation, are key. Elders, poets, and storytellers become linguistic technologists, preserving not just language but legacy.

4. Trust & Transparency
Bias in AI is structural. To mitigate it:
- Include diverse dialects and accents in training
- Conduct bias testing and community validation
- Promote explainable AI with local language dashboards and storytelling

What’s Next?
- A living white paper on ethics, governance, and technical guidelines
- A roadmap for the internship program, with toolkits and impact metrics
- Collaboration with literary and linguistic organizations to enrich model depth

VISWAM.AI is planting seeds for an AI movement rooted in language justice, data sovereignty, and community wisdom. Let’s co-create systems that don’t just understand our languages but respect our voices.

DC* Chaitanya Chokkareddy Kiran Chandra Ramesh Loganathan Centific
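One simple way to start on the "include diverse dialects" and "conduct bias testing" points is a representation audit of the training data. The Python sketch below is only an illustration under stated assumptions: the `lang` field, the toy language codes, and the 20% threshold are hypothetical and do not come from the post or from VISWAM.AI.

```python
from collections import Counter


def language_shares(records, key="lang"):
    """Return each language's fraction of the dataset."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def underrepresented(shares, min_share):
    """List languages whose share falls below min_share, sorted by name."""
    return sorted(lang for lang, s in shares.items() if s < min_share)


# Toy data (hypothetical): 6 Hindi, 3 English, 1 Telugu sample.
records = [{"lang": "hi"}] * 6 + [{"lang": "en"}] * 3 + [{"lang": "te"}] * 1
shares = language_shares(records)
flagged = underrepresented(shares, min_share=0.2)
print(flagged)  # ['te']
```

A count-based audit like this only shows who is missing from the data. It says nothing about model quality per language, which still needs evaluation and community validation.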

Sharat Chandra

Blockchain & Emerging Tech Evangelist | Driving Impact at the Intersection of Technology, Policy & Regulation | Startup Enabler

49,250 followers 1y
🚀 Accelerating Responsible AI in India: IIT Bombay Releases 16 Open Datasets on AIKOSH 🇮🇳

Indian Institute of Technology, Bombay has taken a significant step toward advancing responsible and inclusive AI by contributing 16 diverse datasets to AIKOSH, the Government of India’s official AI repository. These open datasets reflect India’s cultural and linguistic richness and are aimed at fueling research and innovation in Artificial Intelligence (AI) and Machine Learning (ML), tailored for Indian use cases.

🔍 What’s Inside:
- Handwritten & printed Indian scripts
- Tables extracted from scanned documents
- Multi-language audio samples
- Drone surveillance imagery
- Visual question-answer datasets

🎯 Why It Matters: These datasets help AI models:
- Recognize and interpret Indian languages and handwriting
- Process documents and media with regional nuances
- Analyze audio-visual data in Indian contexts

This initiative marks a crucial step toward building AI that works for India, by India, empowering researchers, developers, and institutions nationwide.

🔗 Explore the datasets on AIKOSH: https://lnkd.in/gJv4XS4j

#AI #MachineLearning #ResponsibleAI #OpenData #IITBombay #AIKOSH #IndiaAI #Innovation #Research #DigitalIndia

Jaime Teevan

Chief Scientist & Technical Fellow at Microsoft - for speaking requests please contact teevan-externalopps@microsoft.com

22,117 followers 1y
Can AI truly serve everyone, regardless of language or culture? Meet Sunayana Sitaram, a Principal Researcher at Microsoft Research India, who is working hard to ensure that it will. Her research is focused on making AI more inclusive by bridging the gap between diverse languages and cultures, and recently she’s been working closely with the Office team to help shape our multilingual strategy.

Most language models are predominantly trained on data from the web, which does not equally represent all languages and cultures. This creates an inherent inequity from the start of the model-building process. In a recent paper (https://lnkd.in/gJVR6xRE), Sunayana proposed a method to integrate cultural differences into LLMs using the World Values Survey as seed data. This approach aims to create more culturally aware models by fine-tuning them with semantically equivalent data from diverse cultures.

The lack of sufficient linguistic and cultural diversity in existing benchmarks similarly makes it hard to evaluate how LLMs perform across languages and cultures. Her paper “MEGA: Multilingual Evaluation of Generative AI” (https://lnkd.in/gzgt9cVT) was the first large-scale multilingual benchmarking effort and explored how well LLMs perform across various tasks and languages. Recognizing the need for fair and transparent evaluation, she also implemented the PARIKSHA platform (https://lnkd.in/g3ZSuhTV), which not only involves a diverse set of evaluators but also regularly updates its assessments to reflect ongoing improvements in Indic LLMs.

To learn more about what Sunayana is doing to ensure that generative AI benefits everyone around the world, I recommend listening to what she had to say during a recent MSR panel discussion on AI’s global impact (https://lnkd.in/gCvxrvxz). And if you're not yet following Sunayana’s research, I highly recommend checking it out!

#AIInnovators #AppliedResearch #NewFutureOfWork #LeadingLikeAScientist

Heather Couture, PhD

Fractional Principal CV/ML Scientist | Making Vision AI Work in the Real World | Solving Distribution Shift, Bias & Batch Effects in Pathology & Earth Observation

17,172 followers 7mo
Can Foundation Models Trained in the US Map Crops in Africa?

The US and EU maintain detailed crop type maps with 80%+ accuracy, updated regularly. But most of the world, especially data-scarce regions in Africa, South America, and Asia, lacks this critical agricultural intelligence. Can foundation models trained on data-rich regions generalize to regions where labeled data is scarce?

Food security depends on accurate crop type mapping for yield prediction, conservation, and disaster assessment. Yet the geographic disparity in data availability creates a potential geospatial bias: models trained on developed nations may fail in developing ones, precisely where they're needed most.

Yi-Chia Chang et al. addressed this challenge by creating the first harmonized global crop type mapping dataset, combining five regional datasets across five continents, all focused on four major crops: maize, soybean, rice, and wheat.

Key findings:
- SSL4EO-S12 (pre-trained on all 13 Sentinel-2 spectral bands) outperformed both SatlasPretrain and ImageNet weights by 3-27% across all regions
- Only 100 labeled images are sufficient for achieving high overall accuracy, but 900 images are needed to overcome severe class imbalance and improve average accuracy
- Out-of-distribution data helps significantly when in-domain samples are scarce (zero-shot learning), but can actually hurt performance when sufficient local data becomes available, due to distribution shift

The research reveals both promise and pitfalls: while foundation models can bridge data gaps between regions, careful attention to data composition and distribution shift is essential. The takeaway? Prioritize in-domain data when available, but leverage out-of-domain pretraining strategically for data-scarce regions.

All datasets and code are available via TorchGeo, Hugging Face, and GitHub.

https://lnkd.in/dNf7xG9f

#RemoteSensing #MachineLearning #FoundationModels #PrecisionAgriculture #AI4Good #GeospatialAI #TransferLearning #DeepLearning

Subscribe to Computer Vision Insights, weekly briefings on making vision AI work in the real world → https://lnkd.in/guekaSPf
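The gap between "overall accuracy" and "average accuracy" under class imbalance is easy to see with a small worked example. The Python sketch below assumes that "average accuracy" means the mean of per-class accuracies (the post does not define the term), and the labels and counts are made-up toy values, not results from the paper.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def average_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy imbalanced set (hypothetical): 90 maize samples, 10 rice samples,
# scored against a degenerate model that always predicts "maize".
y_true = ["maize"] * 90 + ["rice"] * 10
y_pred = ["maize"] * 100

print(overall_accuracy(y_true, y_pred))  # 0.9
print(average_accuracy(y_true, y_pred))  # 0.5
```

The always-maize model looks strong on overall accuracy while never recognizing the minority class. That is consistent with the finding that a small labeled set can be enough for high overall accuracy while far more data is needed to raise average accuracy.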

Shruti Mishra

CEO @Truebrand | Building Brands That Feel Real | 160k+ on Twitter/X (@heyshrutimishra)

79,003 followers 10mo Edited
Google just launched 3 new open-source AI initiatives that strengthen agriculture, preserve language diversity, and improve inclusion in large language models. Here’s a breakdown of what matters (and why): 👇

1. Agricultural Monitoring & Event Detection (AMED) API
AI meets satellite imagery to deliver field-level insights across India. Farmers, startups, and governments can now access crop types, acreage, sowing/harvesting timelines, and 3 years of historical data, updated every 2 weeks.
✅ Supports drought planning
✅ Helps rural lending & climate risk modeling
✅ Powers precision agriculture

2. Amplify Initiative (India launch)
Large language models often miss the real-world lived experiences of users in countries like India. Through the Amplify Initiative, Google + IIT Kharagpur are building hyperlocal datasets in Indian languages, infused with cultural nuance and real-world context. The goal? To help all LLMs better understand and serve Indian users, not just translate.
Link: https://lnkd.in/gHk7MwVi

3. Project Vaani x Bhashini x Hugging Face
India's largest open Indic speech dataset just got bigger. Project Vaani has now contributed:
- 21,500 hours of speech audio
- 835 hours of transcribed speech
- 86 languages, 22 states, 120 districts
All open source via Bhashini + Hugging Face, fueling better voice tech and more inclusive AI products.

These 3 aren’t research demos. They’re open tools, already powering use cases in agri-tech, healthcare, rural intelligence, and language research. If India wants to build sovereign AI, it needs this kind of deep, public infrastructure. And we’re starting to get there.

Google Google DeepMind #AI #opensource

♻️ Share this post if you care about inclusive AI infrastructure.
💡 Follow Shruti Mishra for more drops on how AI is changing the real world.

