Value of Human-Generated Training Data


Summary

Human-generated training data refers to information created, labeled, or curated by real people, which is used to teach AI models. Recent findings highlight that relying too heavily on AI-generated data causes models to lose touch with reality and miss out on rare or nuanced cases, making human input more valuable than ever.

  • Prioritize original data: Use human-created examples and labels to keep AI models grounded and maintain their ability to understand a wide variety of real-world situations.
  • Validate with experts: Involve professionals and domain specialists in reviewing and correcting training data so rare cases and important details don’t disappear over time.
  • Monitor for drift: Regularly check AI outputs for signs of bias or lost information, especially when using automated or AI-generated datasets, to catch problems before they impact results.
Summarized by AI based on LinkedIn member posts
  • Bhargav Patel, MD, MBA

    Physician-Leader at the Intersection of AI, Medicine & Psychiatry | Medical + AI Researcher | Adult & Child Psychiatrist | Neuroscientist | Founder | Upcoming Books: Trauma Transformed & The Future of AI in Healthcare

    11,297 followers

    When AI models are trained on AI-generated data, they forget. A Nature paper by Shumailov et al. calls this model collapse.

    The tails of the distribution disappear first. Rare events vanish. Then variance collapses. What remains? Maximum confidence. Minimum reality.

    The mechanism is simple: Finite sampling loses rare signals. Each generation compounds that loss. The math shows convergence toward a delta function. A model that becomes increasingly certain… about a shrinking version of truth.

    Here’s why this matters now: LLM-generated text is flooding the internet. Future models will be trained on that text. GPT-{n+1} will inevitably contain output from GPT-{n}. This recursive contamination isn't hypothetical. It's a degenerative process with predictable dynamics.

    The implication: Human-generated data becomes more valuable over time, not less. Access to the original distribution isn’t optional. It’s foundational.

    The parallel to clinical research: We've spent decades training our intuition on summary statistics. Group means. Terminal endpoints. Compressed representations of complex biology. Each generation of researchers inherits the compressed version, not the raw signal. The tails disappear. The rare phenotypes. The unexpected responders. The patterns that weren't in the scoring rubric.

    Model collapse isn't AI-specific. It's information-theoretic. It applies to any system that learns recursively from its own simplified outputs.

    The antidote in AI: preserve access to original, human-generated data. The antidote in clinical research: preserve access to continuous, high-resolution, unfiltered biological signals. Both require the same thing. Infrastructure that captures reality before someone decides what to discard. That's not a technology problem. It's a design choice.

    As healthcare AI scales, this becomes critical. If we train on AI-summarized notes, filtered datasets, and compressed clinical narratives… We won’t notice the loss at first. The rare cases disappear quietly. Then clinical variance collapses. And eventually: We’ll have extremely confident systems… That no longer understand real patients.

    Source: Shumailov et al., Nature 631, 755-759 (2024)

    📌 Save this. Share with teams building healthcare AI training datasets.
    ♻️ Repost, someone designing clinical AI infrastructure needs to understand model collapse.
    🔔 Follow Bhargav Patel, MD, MBA for healthcare AI insights on data quality and long-term model integrity.
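The mechanism described in this post (finite sampling, refitting, and convergence toward a delta function) can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup: it assumes a one-dimensional Gaussian as the "real" distribution, and the function name, sample size, and generation count are illustrative choices. Each generation trains only on samples drawn from the previous generation's fitted model.

```python
import random
import statistics


def simulate_recursive_fit(n_samples=10, generations=500, seed=0):
    """Refit a Gaussian to its own samples, generation after generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" distribution (assumed N(0, 1))
    history = [(mu, sigma)]
    for _ in range(generations):
        # Draw a finite training set from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Fit the next model by maximum likelihood (population std, no correction).
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append((mu, sigma))
    return history


history = simulate_recursive_fit()
for gen in (0, 10, 100, 500):
    mu, sigma = history[gen]
    print(f"generation {gen:3d}: mean={mu:+.4f}  std={sigma:.3e}")
```

With a small sample per generation, the fitted standard deviation tends to shrink by a random factor each round, so the model drifts toward a single point while its mean wanders away from the original value. Larger sample sizes slow this down but, in this toy setting, do not remove the downward drift.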

  • Matt Pavelle

    Democratizing healthcare. Co-founder/co-CEO of Doctronic: your AI doctor.

    9,272 followers

    This is fascinating. New research analyzed 800,000+ synthetic medical data points and found that AI models trained on AI-generated content rapidly converge toward generic outputs. This means rare but critical findings (pneumothorax, effusions, etc.) simply vanish. Demographics skew to middle-aged males. The clinical tail gets amputated. One of our engineers summed it up perfectly: this is "model collapse by regression to the mean." This is one of the many reasons why @Doctronic keeps top-of-the-line physicians in the loop. Not just for patient safety and trust, but because without human clinical signal, medical AI progressively forgets edge cases. Those interesting presentations and atypical patients are much more common than one might think. Our doctors are our safety net. They're also our training data's immune system. https://lnkd.in/ecJAf9Fm
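One way to see why rare findings vanish is to resample a label distribution from its own estimated frequencies. The sketch below is a toy model, not the study's method: the starting frequencies, label names, and function name are hypothetical and chosen only for illustration. Once a rare label draws zero samples in some generation, its estimated frequency becomes zero and it can never reappear.

```python
import random
from collections import Counter


def resample_generations(freqs, n_samples=100, generations=200, seed=0):
    """Re-estimate label frequencies from samples of the previous estimate."""
    rng = random.Random(seed)
    labels = list(freqs)
    current = dict(freqs)
    history = [current]
    for _ in range(generations):
        weights = [current[label] for label in labels]
        # Each generation sees only data produced by the previous generation.
        draws = rng.choices(labels, weights=weights, k=n_samples)
        counts = Counter(draws)
        current = {label: counts[label] / n_samples for label in labels}
        history.append(current)
    return history


# Hypothetical starting frequencies, made up for this illustration.
start = {"no finding": 0.90, "effusion": 0.07, "pneumothorax": 0.03}
history = resample_generations(start)
for gen in (0, 10, 50, 200):
    surviving = [label for label, p in history[gen].items() if p > 0]
    print(f"generation {gen:3d}: surviving labels = {surviving}")
```

The set of surviving labels can only shrink over generations, which is the "amputated tail" in miniature. Injecting fresh human-labeled cases each round is what gives a lost label a path back.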

  • Armine Papikyan

    I talk about AI

    7,234 followers

    The AI industry might be poisoning itself, and nobody wants to talk about it.

    Since #ChatGPT blew up in 2022, companies have rushed to train new AI models on fresh internet data. But here’s the problem: a lot of that “new” internet content is already written by AI. So when AI models train on AI-generated content, they’re learning from machines, not real people. Think of it like copying someone’s homework… when that person already copied someone else’s bad homework.

    This creates what some experts call model collapse – when AIs start to get worse because they’re learning from junk instead of real, high-quality human-created information.

    To fix it, companies are turning to #RAG, which lets models look things up online instead of relying only on what they were trained on. Sounds smart, but not really. The internet is now packed with low-effort, AI-written junk. So when the model “retrieves” information, it often finds bad answers – and then gives you those same bad answers in a confident tone. So the fix might actually be making the problem worse.

    Honestly, the only thing that keeps this whole system from spiraling is a bit of good old-fashioned human judgment. There is more than one sign of this:
    🔹 Meta's $15B investment in human data
    🔹 Andrej Karpathy on 'keeping AI on a tight leash'
    🔹 Ali Ghodsi on how hard full automation is and the need for human supervision

    At SuperAnnotate, we’ve seen how much of a difference it makes when #humans are part of the loop – reviewing data, checking outputs, guiding quality. Because if AI’s only learning from itself, someone has to break the loop, or we just keep training tomorrow’s models on yesterday’s mistakes.

    #AI #data #HumanInTheLoop #SyntheticData #ModelCollapse

  • Karyna Naminas

    CEO of Label Your Data. Helping AI teams deploy their ML models faster.

    6,861 followers

    🚨 I see a lot of companies relying on AI-generated labels, assuming they improve with better prompts. But new ML research suggests that’s not the case: without gold-standard data, half of the participants actually made their labels worse over time.

    Zeyu He and Ting-Hao 'Kenneth' Huang (Penn State University), and Saniya Naphade (GumGum) studied “prompting in the dark”: refining AI-generated labels without benchmarks.

    ⏳ Key findings to save you time:
    - Goal: Assess how effective humans are at refining LLM-generated labels without gold-standard data.
    - Method: 20 participants iteratively prompted LLMs for sentiment labeling tasks using PromptingSheet, comparing results to manually labeled benchmarks.
    - Results: Only 9 out of 20 participants improved accuracy over time, while 10 performed worse. Even automated prompt optimization tools struggled when gold labels were missing.
    - Why It Matters: AI-generated labels can drift without human validation, which suggests that manual annotation is still essential for reliable AI training data.

    If AI alone can’t self-correct, how should we rethink automated data labeling?

    #MachineLearning #DataAnnotation #AIResearch #PromptEngineering #LLMs #DataLabeling
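The practical takeaway from this post is that prompt revisions need to be scored against a human-labeled benchmark, otherwise there is no way to tell whether a revision helped. The sketch below is a minimal illustration under assumed data: the gold labels, the three rounds of LLM labels, and the function names are all made up, and it is not the PromptingSheet tool or the study's evaluation code.

```python
def accuracy(predicted, gold):
    """Fraction of labels that match the human gold standard."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal in length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


def find_regressions(scores):
    """Return the rounds whose accuracy fell below the round before them."""
    names = list(scores)
    return [cur for prev, cur in zip(names, names[1:]) if scores[cur] < scores[prev]]


# Hypothetical gold labels and three rounds of LLM labels (made-up data).
gold = ["pos", "neg", "neu", "pos", "neg"]
rounds = {
    "prompt_v1": ["pos", "neg", "pos", "pos", "neg"],
    "prompt_v2": ["pos", "neg", "neu", "pos", "neg"],
    "prompt_v3": ["pos", "pos", "neu", "neg", "neg"],
}

scores = {name: accuracy(labels, gold) for name, labels in rounds.items()}
print(scores)
print("regressions:", find_regressions(scores))
```

Here the third prompt revision scores below the second, which would be invisible without the gold set. Even a small gold sample gives each revision a reference point.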

  • Jason Cohen

    Head of Global Partner Solution Architecture @ Amazon | Previously; Head of Global Technical Solutions at Google, Senior Director at Sony

    20,918 followers

    A groundbreaking study in Nature reveals a critical challenge for AI development: AI models trained on AI-generated content begin to "collapse," similar to how making copies of cassette tapes leads to quality degradation. Think back to the days of cassette tapes: When you made a copy of a copy of a copy, each generation lost some of the original audio quality. By the 4th or 5th copy, the music would become noticeably distorted and muffled. The researchers found that AI models face a similar problem. When new AI models are trained on content generated by previous AI models (instead of human-created content), they lose important information and nuances - particularly rare or unusual examples. The AI's outputs become increasingly distorted from reality with each generation, just like those tape copies. Why does this matter? As AI-generated content floods the internet, future AI models trained on this data may become less capable of understanding and representing the full spectrum of human knowledge and expression. The study suggests that maintaining access to original, human-generated content will be crucial for developing better AI systems. The researchers' conclusion is clear: just as audiophiles kept original recordings to maintain quality, we must preserve and prioritize human-generated content to ensure AI systems continue learning and accurately representing our world. What do you think? Link to study in the comments. #ArtificialIntelligence #MachineLearning #Technology #DataScience #Research

  • Shreekant Mandvikar

    I (actually) build GenAI & Agentic AI solutions | Executive Director @ Wells Fargo | Architect · Researcher · Speaker · Author

    7,845 followers

    Why AI Trained on AI Starts to Forget Reality

    1. The Shift in Training Data
    Today’s AI models learn mostly from human-created content. But as AI-generated content floods the internet, future models risk learning more from themselves than from us.

    2. The Loop of Model Collapse
    When models are trained on outputs of earlier models, rare but important details slowly vanish. Over time, responses start looking less like real-world knowledge and more like blurred copies of the past.

    3. Early vs Late Collapse
    - Early collapse: AI forgets the rare edge cases first, like unique events or minority perspectives.
    - Late collapse: Everything gets blended into generic, shallow outputs.

    4. Why This Happens
    - Sampling errors: Each time AI copies itself, small details get lost.
    - Model errors: Neural nets smooth over data, reinforcing mistakes generation after generation.
    It is like making a photocopy of a photocopy: clarity keeps fading.

    5. Proof from Experiments
    - Images: Distinct patterns blurred into fuzzy shapes.
    - Text: AIs started repeating phrases or generating nonsense.
    - Data distributions: Models narrowed to fewer, less diverse outcomes.

    6. Why This Matters in Real Life
    - Critical medical or rare cases may be forgotten.
    - Biases get amplified.
    - Knowledge becomes shallow instead of rich.

    7. The Critical Lesson
    Human-created data is gold. Even preserving just 10 percent of original human data drastically reduces collapse. Without it, AI risks losing touch with reality.

    8. The Way Forward
    - Track whether content is AI or human-made.
    - Preserve real-world human data.
    - Understand that progress is not just about bigger models but about grounding them in authentic knowledge.

    We are entering a future where AI could either amplify human intelligence or echo its own distortions.

    The big question is: How do we ensure AI stays grounded in human truth instead of drifting into machine-made echoes?

    ♻️ Repost this to help your network get started
    ➕ Follow Shreekant for more
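The claim that keeping a share of original human data reduces collapse can be checked qualitatively with a toy model. The sketch below is an illustration under stated assumptions, not a reproduction of any published experiment: it assumes a one-dimensional Gaussian as the "human" distribution, and the function name and parameters are hypothetical. It compares a loop trained purely on its own samples with one where 10 percent of every training set is drawn fresh from the original distribution.

```python
import random
import statistics


def final_std(human_fraction, n_samples=20, generations=1000, seed=0):
    """Run the recursive fit and return the last generation's fitted std."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the original "human" distribution, assumed N(0, 1)
    n_human = round(human_fraction * n_samples)
    for _ in range(generations):
        # Fresh samples from the original distribution (preserved human data).
        human = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        # The rest of the training set comes from the previous model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_human)]
        samples = human + synthetic
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
    return sigma


pure = final_std(human_fraction=0.0)
mixed = final_std(human_fraction=0.1)
print(f"0% human data:  final std = {pure:.3e}")
print(f"10% human data: final std = {mixed:.3e}")
```

In this toy setting the purely synthetic loop shrinks toward zero spread, while the mixed loop keeps being pulled back toward the original distribution. The fitted spread in the mixed loop still sits below the true value, so the sketch supports "reduces collapse" rather than "eliminates it", and it says nothing about what fraction is needed for real models.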
