Microsoft just released a 35-page report on medical AI, and it's a reality check for healthcare.

The paper, "The Illusion of Readiness," tested six of the most popular models (from OpenAI, Google, and others) across six multimodal medical benchmarks. And the verdict? The models scored high on medical exams. But they're not even close to being real-world ready.

Here's what the stress tests revealed:

▶ 1. Shortcut learning
Models often answered correctly even when key information, like medical images, was removed. They weren't reasoning; they were exploiting statistical shortcuts. That means benchmark wins may hide shallow understanding. (A minimal sketch of this kind of ablation test follows at the end of this post.)

▶ 2. Fragile under small changes
Small tweaks caused big swings in predictions. In visual substitution tests, accuracy dropped from 83% to 52% when images were swapped, exposing shallow visual-answer pairings. This fragility shows how unreliable model reasoning becomes under stress.

▶ 3. Fabricated reasoning
Models produced confident, step-by-step medical explanations, but many were medically unsound… or entirely fabricated. Convincing to the eye, dangerous in practice.

And more importantly, healthcare isn't a multiple-choice exam. It's uncertainty, incomplete data, and high stakes. So Microsoft's team calls for new standards:
- Stress tests that expose fragility
- Clinician-guided profiles of what each benchmark actually measures
- Evaluation of robustness and trustworthiness, not just leaderboard scores

The takeaway is simple: medical AI may ace tests today. But until it proves reliable under stress, it's not ready for the clinic.

When do you think popular LLMs will be clinic-ready?

#entrepreneurship #healthtech #AI
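For readers who want to probe their own systems the same way, here is a minimal sketch of the modality-ablation idea behind the shortcut-learning finding. The `answer` function is a hypothetical stand-in for whatever multimodal model you are testing, not the report's actual harness.

```python
# Minimal sketch of a modality-ablation stress test. If accuracy barely
# drops when the image is removed from an image-dependent benchmark, the
# model is likely exploiting shortcuts rather than reading the image.

def answer(question: str, image: bytes | None) -> str:
    """Hypothetical stand-in: replace with a real multimodal model client."""
    raise NotImplementedError

def accuracy(items: list[dict], use_image: bool) -> float:
    correct = 0
    for item in items:
        img = item["image"] if use_image else None  # ablate the visual input
        if answer(item["question"], img) == item["gold"]:
            correct += 1
    return correct / len(items)

def shortcut_gap(items: list[dict]) -> float:
    # A small gap is the red flag: the model barely needs the image.
    return accuracy(items, use_image=True) - accuracy(items, use_image=False)
```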
AI Limitations Overview
-
BREAKING! The FDA just released this draft guidance, titled "Artificial Intelligence-Enabled Device Software Functions: Lifecycle Management and Marketing Submission Recommendations," which aims to give industry and FDA staff a Total Product Life Cycle (TPLC) approach for developing, validating, and maintaining AI-enabled medical devices.

Even in draft form, the guidance matters: it provides more detailed, AI-specific instructions on what regulators expect in marketing submissions and on how developers can control AI bias.

What's new in it?

1) It requests clear explanations of how and why AI is used within the device.
2) It requires sponsors to provide adequate instructions, warnings, and limitations so that users understand the model's outputs and scope (e.g., whether further tests or clinical judgment are needed).
3) It encourages sponsors to follow standard risk-management procedures and stresses that misunderstanding or misinterpreting the AI's output is a major risk factor.
4) It recommends analyzing performance across subgroups to detect potential AI bias (e.g., different performance in underrepresented demographics).
5) It recommends robust testing (e.g., sensitivity, specificity, AUC, PPV/NPV) on datasets that match the intended clinical conditions. (A minimal sketch of such a subgroup breakdown follows at the end of this post.)
6) It recognizes that AI performance may drift (e.g., as clinical practice changes), so sponsors are advised to maintain ongoing monitoring, identify performance deterioration, and enact timely mitigations.
7) It discusses AI-specific security threats (e.g., data poisoning, model inversion/stealing, adversarial inputs) and encourages sponsors to adopt threat modeling and testing (fuzz testing, penetration testing).
8) It proposes public-facing FDA summaries (e.g., 510(k) Summaries, De Novo decision summaries) to foster user trust and better understanding of the model's capabilities and limits.
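To make items 4 and 5 concrete, here is an illustrative sketch (mine, not the guidance's) of the kind of per-subgroup performance breakdown the draft points toward; the record format is an assumption.

```python
# Illustrative per-subgroup metrics for a binary classifier: a gap between
# subgroups in any of these numbers is the bias signal the draft flags.
from collections import defaultdict

def subgroup_metrics(records):
    """records: iterable of (subgroup, y_true, y_pred) with 0/1 labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true and y_pred:       c["tp"] += 1
        elif not y_true and y_pred: c["fp"] += 1
        elif not y_true:            c["tn"] += 1
        else:                       c["fn"] += 1
    report = {}
    for group, c in counts.items():
        tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]
        report[group] = {
            "sensitivity": tp / (tp + fn) if tp + fn else None,
            "specificity": tn / (tn + fp) if tn + fp else None,
            "ppv":         tp / (tp + fp) if tp + fp else None,
            "npv":         tn / (tn + fn) if tn + fn else None,
        }
    return report

# Toy usage with two demographic subgroups "A" and "B".
print(subgroup_metrics([("A", 1, 1), ("A", 0, 0), ("B", 1, 0), ("B", 0, 0)]))
```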
-
AI is already transforming hospital workflows, but wide-scale clinical integration in the EU remains slow. This report explains why, and what might shift the tide.

1️⃣ AI is already easing pressure in hospitals: improving triage, reducing missed appointments, and speeding up radiology and pathology reviews.
2️⃣ The biggest early wins come from "low-risk" tools that automate documentation, scheduling, or patient-flow prediction, not from diagnostic systems.
3️⃣ Generative AI (like LLMs) is seen as a game-changer for admin-heavy tasks, but hallucinations and trust issues still limit use in clinical decisions.
4️⃣ Key blockers include poor interoperability, lack of real-world validation, and unclear liability, especially with tools that adapt post-deployment.
5️⃣ Hospitals face regulatory overload: navigating MDR, IVDR, GDPR, AIA, EHDS, and more is slowing AI integration across the EU.
6️⃣ Despite over 500 CE-marked or FDA-cleared AI tools, most hospitals only deploy a few, and often don't monitor post-launch performance.
7️⃣ Health professionals want AI, but stress it's only helpful when it aligns with existing workflows and adds real local value.
8️⃣ Many successful use cases come from the US, Israel, and Japan, showing what works when funding, digital maturity, and regulation align.
9️⃣ The EU aims to create "AI centres of excellence," shared catalogues of validated tools, and stronger frameworks for testing, deployment, and monitoring.
🔟 The report urges urgent action on data standards, clinician training, real-world pilots, and aligned financing, or the EU risks falling behind globally.

✍🏻 European Commission: Directorate-General for Health and Food Safety, EEIG, OpenEvidence and PwC. Study on the deployment of AI in healthcare - Final report. Publications Office of the European Union, 2025. DOI: 10.2875/2169577
-
Google DeepMind just exposed AI's limits - with math.

The same vector embeddings that power most modern AI search systems have mathematical limits we can't engineer around. No amount of training data or model scaling will fix this, according to them.

Here's what's happening: when we ask AI to find relevant documents, we're essentially asking it to map meaning into geometric space, turning words into coordinates. But the researchers proved that for any given embedding dimension, there are combinations of documents that simply cannot be retrieved correctly. What sounds like a bug might be a fundamental limitation of these systems. (A toy sketch of this single-vector bottleneck follows at the end of this post.)

To demonstrate this, they created LIMIT, a dataset so simple a child could solve it (matching "who likes apples?" with "Jon likes apples"). Yet even the best models, including those powering enterprise search systems, achieve less than 20% accuracy. GPT-class models with 4,096-dimensional embeddings still fail spectacularly.

As we push AI to handle more complex retrieval tasks (think multi-criteria search, reasoning-based queries, or the instruction-following systems many companies are betting on), we're guaranteed to hit these walls. The paper shows that web-scale search would need embedding dimensions in the millions to handle all possible document combinations.

So, what does this mean? Every company building RAG systems, every startup promising "ChatGPT for your documents," every enterprise search deployment: they're all constrained by this fundamental limit. The researchers found that alternative architectures like sparse models (think old-school keyword search) actually outperform modern neural approaches on these tasks.

We've been treating retrieval as a solved problem, a building block we can rely on. But their research suggests we need to fundamentally rethink how we architect AI systems that need to find and reason over information.

The good news? Once we understand the limits, we can design around them. Hybrid approaches, multi-stage retrieval, and careful system design can mitigate these issues. But it requires acknowledging that bigger models and more compute won't solve everything.

For those of us working with AI, this is a reminder that understanding the fundamentals matters. The next breakthrough might not come from scaling up, but from stepping back and questioning our basic assumptions.

What retrieval challenges has your organization faced that might be explained by these fundamental limits?

↓ 𝐖𝐚𝐧𝐭 𝐭𝐨 𝐤𝐞𝐞𝐩 𝐮𝐩? Join my newsletter with 50k+ readers and be the first to learn about the latest AI research: llmwatch.com 💡
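As a toy illustration of where the constraint lives, here is what single-vector retrieval reduces to. The random vectors are placeholders, not real embeddings, and the dataset shape is made up for the example.

```python
# Toy single-vector retrieval: relevance collapses to one dot product per
# (query, document) pair, which is the bottleneck the paper analyzes.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # embedding dimension
docs = rng.normal(size=(100, d))     # placeholder document embeddings
query = rng.normal(size=d)           # placeholder query embedding

scores = docs @ query                # geometry is all the ranking can see
top_k = np.argsort(-scores)[:5]      # retrieval = top-k by score
print("retrieved:", top_k)

# The paper's point: for a fixed d, there exist sets of desired top-k
# results that NO assignment of vectors can realize simultaneously, so
# failures like LIMIT's are structural, not a data or training problem.
```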
-
You know all those arguments that LLMs think like humans? Turns out it's not true 😱

In our new paper we put this to the test by checking whether LLMs form concepts the same way humans do. Do LLMs truly grasp concepts and meaning analogously to humans, or is their success primarily rooted in sophisticated statistical pattern matching over vast datasets? We used classic cognitive experiments as benchmarks. What we found is surprising... 🧐

We used seminal datasets from cognitive psychology that mapped how humans actually categorize things like "birds" or "furniture" ('robin' as a typical bird). The nice thing about these datasets is that they are not crowdsourced; they're rigorous scientific benchmarks.

We tested 30+ LLMs (BERT, Llama, Gemma, Qwen, etc.) using an information-theoretic framework that measures the trade-off between:
- Compression (how efficiently you organize info)
- Meaning preservation (how much semantic detail you keep)
(A loose illustration of this trade-off appears at the end of this post.)

Finding #1: The good news. LLMs DO form broad conceptual categories that align with humans significantly above chance. Surprisingly (or not?), smaller encoder models like BERT outperformed much larger models. Scale isn't everything!

Finding #2: But LLMs struggle with fine-grained semantic distinctions. They can't capture "typicality", like knowing a robin is a more typical bird than a penguin. Their internal concept structure doesn't match human intuitions about category membership.

Finding #3: The big difference. Here's the kicker: LLMs and humans optimize for completely different things.
- LLMs: aggressive statistical compression (minimize redundancy)
- Humans: adaptive richness (preserve flexibility and context)

This explains why LLMs can be simultaneously impressive AND miss obvious human-like reasoning. They're not broken; they're just optimized for pattern matching rather than the rich, contextual understanding humans use.

What this means:
- Current scaling might not lead to human-like understanding
- We need architectures that balance compression with semantic richness
- The path to AGI ( 😅 ) might require rethinking optimization objectives

Our paper gives tools to measure this compression-meaning trade-off. This could guide future AI development toward more human-aligned conceptual representations.

Cool to see cognitive psychology and AI research coming together! Thanks to Chen Shani, Ph.D., who did all the work, and Yann LeCun and Dan Jurafsky for their guidance.
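For intuition only, here is a loose illustration of measuring a compression-vs-meaning trade-off over clustered embeddings. The entropy and centroid-similarity proxies are my simplification, not the paper's actual information-theoretic objective, and the toy data is made up.

```python
# Loose compression-vs-meaning illustration: fewer/coarser clusters compress
# more (lower entropy) but preserve less semantic detail (lower similarity).
import numpy as np

def tradeoff(embeddings: np.ndarray, labels: np.ndarray):
    """Return (compression proxy, meaning proxy) for a concept clustering."""
    # Compression proxy: entropy of the cluster assignment (lower = tighter).
    _, sizes = np.unique(labels, return_counts=True)
    p = sizes / len(labels)
    compression = float(-(p * np.log2(p)).sum())

    # Meaning proxy: mean cosine similarity of items to their cluster centroid.
    sims = []
    for c in np.unique(labels):
        members = embeddings[labels == c]
        centroid = members.mean(axis=0)
        sims.extend(members @ centroid /
                    (np.linalg.norm(members, axis=1) * np.linalg.norm(centroid)))
    return compression, float(np.mean(sims))

# Toy usage: 6 items in 2 concepts ("bird", "furniture") with 4-d embeddings.
rng = np.random.default_rng(1)
emb = rng.normal(size=(6, 4))
labs = np.array(["bird", "bird", "bird", "furniture", "furniture", "furniture"])
print(tradeoff(emb, labs))
```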
-
The Death of Originality

The reason all corporate Generative AI strategies look the same… is because they are the same.

Ever wonder why a zebra has stripes? It was a mystery for years. It's obviously not camouflage, given that zebras stick out like a sore thumb. It's so they can hide as a herd: when they feel threatened, they crowd together and the individuals blend into the group, hiding in plain sight.

This is the current predominant corporate Gen AI strategy. Virtually every large company follows the same exact playbook: same consultants, same vendors, same emphasis on automation, same short-term priorities. It's the same script to placate the board, appease the shareholders, and follow the competition. It's essentially a lay-off with a paint job.

The Fortune 500:
- Use the same LLMs (OpenAI, Claude, Gemini)
- Work with the same consultants
- Deploy AI in the same domains first
- Track the same KPIs

The Research Supports This Trend:
- OpenAI, Anthropic, and DeepMind all report hallucination/error rates between 20% and 70%, depending on task type (reasoning, factual accuracy, summarization).¹
- Companies deploying AI in customer service and legal settings are already facing legal liability.²

AI Reduces Trust:
- A 2024 Pew study found that customer trust in AI-generated content declines over time.³
- Research from Gartner shows brands using standard LLMs for content generation suffer from decreased perceived uniqueness.⁴

Organizations Are De-Skilling Their Talent:
- MIT Sloan Management Review reports that heavy AI reliance in workflows has led to a decline in critical thinking.⁵

We must start with different questions:
- How do we focus on revenue growth, differentiating ourselves and creating new sources of value?
- How do we use AI to strengthen, not dilute, our originality?
- How do we avoid vendor lock-in and preserve architectural control?
- How do we train our people to think, not just prompt?

At some point someone will need to find the courage to stand apart from the crowd. Come out, come out, wherever you are.

********************************************************************************

The trick with technology is to avoid spreading darkness at the speed of light.

Stephen Klein is Founder and CEO of Curiouser.AI, the only Generative AI platform to augment human intelligence, not automate it. He also teaches at UC Berkeley. To learn more visit curiouser.ai or connect on https://lnkd.in/gphSPv_e

Footnotes & Sources:
¹ OpenAI, Anthropic, DeepMind: technical documentation, March 2023-March 2024.
² Mata v. Avianca Airlines (2023); SEC investigations of Gen AI usage in financial disclosures.
³ Pew Research Center (2024). "AI Perception and Public Trust."
⁴ Gartner / Writer.com (2023). "Brand Differentiation and Language Models."
⁵ MIT Sloan Management Review (2024). "The Quiet Cost of Automating Strategic Thinking."
-
Last week, a customer said something that stopped me in my tracks: "Our data is what makes us unique. If we share it with an AI model, it may play against us."

This customer recognizes the transformative power of AI. They understand that their data holds the key to unlocking that potential. But they also see risks alongside the opportunities, and those risks can't be ignored.

The truth is, technology is advancing faster than many businesses feel ready to adopt it. Bridging that gap between innovation and trust will be critical for unlocking AI's full potential. So, how do we do that? It comes down to understanding, acknowledging, and addressing the barriers to AI adoption facing SMBs today:

1. Inflated expectations
Companies are promised that AI will revolutionize their business. But when they adopt new AI tools, the reality falls short. Many use cases feel novel, not necessary, and that leads to low repeat usage and high skepticism. For scaling companies with limited resources and big ambitions, AI needs to deliver real value, not just hype.

2. Complex setups
Many AI solutions are too complex, requiring armies of consultants to build and train custom tools. That might be OK if you're a large enterprise, but for everyone else it's a barrier to getting started, let alone driving adoption. SMBs need AI that works out of the box and integrates seamlessly into the flow of work, from the start.

3. Data privacy concerns
Remember the quote I shared earlier? SMBs worry their proprietary data could be exposed and even used against them by competitors. Sharing data with AI tools feels too risky (especially tools that rely on third-party platforms), and that's a barrier to usage. AI adoption starts with trust, and SMBs need absolute confidence that their data is secure, no exceptions.

If 2024 was the year when SMBs saw AI's potential from afar, 2025 will be the year when they unlock that potential for themselves. That starts by tackling barriers to AI adoption with products that provide immediate value, not inflated hype. Products that offer simplicity, not complexity (or consultants!). Products with security that's rigorous, not risky.

That's what we're building at HubSpot, and I'm excited to see what scaling companies do with the full potential of AI at their fingertips this year!
-
𝗪𝗵𝘆 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗔𝗜 𝗥𝗼𝗹𝗹𝗼𝘂𝘁𝘀 𝗨𝗻𝗱𝗲𝗿𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝗪𝗶𝘁𝗵𝗼𝘂𝘁 𝗣𝗿𝗼𝗺𝗽𝘁 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴

As organisations rapidly deploy Generative AI tools across the enterprise, one assumption shows up again and again:
→ 𝗜𝗳 𝘄𝗲 𝗽𝗿𝗼𝘃𝗶𝗱𝗲 𝘁𝗵𝗲 𝘁𝗼𝗼𝗹𝘀, 𝗽𝗲𝗼𝗽𝗹𝗲 𝘄𝗶𝗹𝗹 𝗳𝗶𝗴𝘂𝗿𝗲 𝗼𝘂𝘁 𝗵𝗼𝘄 𝘁𝗼 𝘂𝘀𝗲 𝘁𝗵𝗲𝗺
↳ 𝗧𝗵𝗮𝘁 𝗮𝘀𝘀𝘂𝗺𝗽𝘁𝗶𝗼𝗻 𝗶𝘀 𝗰𝗼𝘀𝘁𝗹𝘆

Gen AI rarely underdelivers because of the technology. It underdelivers because users are never taught how to communicate with it effectively. Most organisations don't struggle with access to AI. They struggle with 𝗶𝗻𝗽𝘂𝘁 𝗾𝘂𝗮𝗹𝗶𝘁𝘆.

𝗧𝗵𝗲 𝗨𝗻𝗱𝗲𝗿𝗮𝗽𝗽𝗿𝗲𝗰𝗶𝗮𝘁𝗲𝗱 𝗣𝗿𝗼𝗯𝗹𝗲𝗺
Gen AI can:
→ Automate routine tasks
→ Support analysis and decision-making
→ Accelerate content creation and ideation

But many rollouts skip a foundational capability:
→ 𝗕𝗮𝘀𝗶𝗰 𝗽𝗿𝗼𝗺𝗽𝘁 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝘀𝗸𝗶𝗹𝗹𝘀

Employees are expected to:
→ Know how to frame questions
→ Provide the proper context
→ Guide outputs toward business-ready results
Without training, they don't.

𝗧𝗵𝗲 𝗞𝗲𝘆 𝗜𝗻𝘀𝗶𝗴𝗵𝘁
Prompt engineering is not an advanced or technical niche. It is a 𝗰𝗼𝗿𝗲 𝘄𝗼𝗿𝗸𝗽𝗹𝗮𝗰𝗲 𝘀𝗸𝗶𝗹𝗹. When prompts are vague, incomplete, or poorly structured:
→ Outputs are shallow
→ Results are inconsistent
→ Trust in the tool erodes
In other words: 𝗴𝗮𝗿𝗯𝗮𝗴𝗲 𝗶𝗻 → 𝗴𝗮𝗿𝗯𝗮𝗴𝗲 𝗼𝘂𝘁.

𝗧𝗵𝗲 𝗜𝗺𝗽𝗮𝗰𝘁 𝗼𝗳 𝗦𝗸𝗶𝗽𝗽𝗶𝗻𝗴 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴
When Gen AI is rolled out without prompt literacy:
→ Employees spend time fixing poor outputs
→ Teams abandon tools after early frustration
→ Productivity gains never materialise
The result is predictable: licensed tools ↳ limited adoption ↳ minimal ROI. What should be a force multiplier becomes shelfware.

𝗪𝗵𝗮𝘁 𝗔𝗰𝘁𝘂𝗮𝗹𝗹𝘆 𝗪𝗼𝗿𝗸𝘀
Organisations seeing real value take a different approach:

𝗕𝗮𝘀𝗶𝗰 𝗣𝗿𝗼𝗺𝗽𝘁 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴
→ How to structure requests (a minimal template sketch follows at the end of this post)
→ How to iterate and refine
→ How to validate outputs

𝗣𝗿𝗮𝗰𝘁𝗶𝗰𝗮𝗹 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲𝘀
→ Tied to real workflows, not generic demos

𝗘𝘅𝗽𝗲𝗰𝘁𝗮𝘁𝗶𝗼𝗻 𝗦𝗲𝘁𝘁𝗶𝗻𝗴
→ AI as a collaborator, not an autopilot

𝗧𝗵𝗲 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆
The real question is no longer: "𝗛𝗮𝘃𝗲 𝘄𝗲 𝗱𝗲𝗽𝗹𝗼𝘆𝗲𝗱 𝗚𝗲𝗻 𝗔𝗜?"
It is: "𝗛𝗮𝘃𝗲 𝘄𝗲 𝘁𝗿𝗮𝗶𝗻𝗲𝗱 𝗼𝘂𝗿 𝗽𝗲𝗼𝗽𝗹𝗲 𝘁𝗼 𝘂𝘀𝗲 𝗶𝘁 𝘄𝗲𝗹𝗹?"

Gen AI doesn't create an advantage on its own. Skilled users do.

If this resonates, tap 👍, follow for more practical AI adoption insights, and share ♻️ your perspective.

#GenerativeAI #AIAdoption #PromptEngineering #FutureOfWork #DigitalTransformation #WorkplaceAI #AITraining #Leadership #EnterpriseAI #Productivity #AIStrategy
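As a concrete example of the "structure requests" habit, here is a minimal prompt-template sketch. The field names (role, context, task, output format) are one common convention, not a standard, and the sample values are invented.

```python
# Minimal structured-prompt sketch: make role, context, task, and output
# format explicit fields instead of a one-line ask.

def build_prompt(role: str, context: str, task: str, output_format: str) -> str:
    return (
        f"You are {role}.\n"
        f"Context: {context}\n"
        f"Task: {task}\n"
        f"Format the answer as: {output_format}"
    )

# Before: the vague ask that produces shallow, inconsistent output.
vague = "summarize this report"

# After: the structured version of the same request.
structured = build_prompt(
    role="a financial analyst writing for a non-technical board",
    context="the attached Q3 revenue report",
    task="summarize the three largest variances versus Q2 and their causes",
    output_format="three bullet points, each under 25 words",
)
print(structured)
```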
-
Major preprint just out! We compare how humans and LLMs form judgments across seven epistemological stages. We highlight seven fault lines, points at which humans and LLMs fundamentally diverge:

The Grounding fault: Humans anchor judgment in perceptual, embodied, and social experience, whereas LLMs begin from text alone, reconstructing meaning indirectly from symbols.

The Parsing fault: Humans parse situations through integrated perceptual and conceptual processes; LLMs perform mechanical tokenization that yields a structurally convenient but semantically thin representation.

The Experience fault: Humans rely on episodic memory, intuitive physics and psychology, and learned concepts; LLMs rely solely on statistical associations encoded in embeddings.

The Motivation fault: Human judgment is guided by emotions, goals, values, and evolutionarily shaped motivations; LLMs have no intrinsic preferences, aims, or affective significance.

The Causality fault: Humans reason using causal models, counterfactuals, and principled evaluation; LLMs integrate textual context without constructing causal explanations, depending instead on surface correlations.

The Metacognitive fault: Humans monitor uncertainty, detect errors, and can suspend judgment; LLMs lack metacognition and must always produce an output, making hallucinations structurally unavoidable.

The Value fault: Human judgments reflect identity, morality, and real-world stakes; LLM "judgments" are probabilistic next-token predictions without intrinsic valuation or accountability.

Despite these fault lines, humans systematically over-believe LLM outputs, because fluent and confident language produces a credibility bias. We argue that this creates a structural condition, Epistemia: linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without actually knowing.

To address Epistemia, we propose three complementary strategies: epistemic evaluation, epistemic governance, and epistemic literacy.

Full paper in the first comment. Joint with Walter Quattrociocchi and Matjaz Perc.
-
Financial services spend the most on AI and extract the least value from it.

Financial services firms lead global AI spending, yet adoption remains low because operating models have not caught up with technical capability. Capital is being spent on models while workflows are still designed for manual reviews and slow approvals, because governance has not evolved. The risk is not unused software but delayed decisions that slow revenue and increase compliance cost. Ignoring this keeps institutions operating at higher latency.

AI systems now generate real-time signals in areas like fraud detection and customer targeting because data access and computing power have improved. But organizations struggle to act on these signals: approval structures and trust models were built for periodic reports, not continuous decisions. This creates a gap where insight exists but execution stalls. The result: value erodes before it reaches the customer or the balance sheet.

AI adoption doesn't succeed when it is bolted onto unchanged workflows, because people become the constraint instead of the technology. One practical way to begin: choose a decision that currently takes days, redesign the approval path to work in minutes, and avoid using AI where accountability cannot be clearly assigned.

#AIInBanking #FinancialServices #AIAdoption #EnterpriseAI #OperatingModel #WorkflowDesign #DecisionLatency #AIGovernance #BusinessOutcomes #DigitalOperations #AIExecution #Leadership