Yann LeCun's vision: machines should learn like humans — by building internal world models, not reconstructing every pixel. We just validated this idea at the largest scale ever attempted in cardiac ultrasound. Introducing EchoJEPA — the first world model for medical video.🔥 🫀 18M echocardiograms 👥 300K patients 🧠 Learns heart dynamics — not imaging noise The problem: Ultrasound is messy. Speckle, shadows, attenuation. Most pretraining objectives end up modeling the scanner, not the heart. The idea: Stop reconstructing pixels. Predict latent structure instead. EchoJEPA discards what’s unpredictable and locks onto what matters clinically: ➡️ chamber geometry ➡️ wall motion ➡️ valve dynamics The results (frozen encoder, no fine-tuning): • 20% ↓ error in LVEF • 17% ↓ error in RVSP • 79% accuracy with 1% labels (vs 42% for baselines w/ 100%) • 2% degradation under acoustic artifacts (vs 17%) • Zero-shot pediatric transfer beats all fine-tuned models Why this works: When we project embeddings: ❌ prior methods → diffuse, entangled clusters ✅ EchoJEPA → clean anatomical organization Structure separated from acquisition noise. 📄 Paper: https://lnkd.in/gPxhQpCR 💻 Code: https://lnkd.in/gQ-i6yMx Huge credit to Alif Munim, who pushed JEPA thinking into medical video and led this effort 💥 Guidance from AI at Meta (Quentin Garrido, Koustuv Sinha) Co-authors: Adib Fallahpour Teodora Szasz Ahmadreza Attarpour, PhD etc! Teams: University Health Network Amazon Web Services (AWS) University of Toronto UChicago Medicine University of California, San Francisco Philips This is representation learning for physiology, not pixels.
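To make the "predict latent structure, not pixels" idea concrete, here is a toy sketch. It is not EchoJEPA's architecture: the real encoder and predictor are learned, whereas `encode` below is a hand-picked average-pooling stand-in, and the signal, noise level, and pool size are all illustrative assumptions. It only shows why a loss measured in a pooled latent space is less dominated by pixel-level noise than a loss measured on raw pixels.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode(signal, pool=4):
    """Hypothetical stand-in encoder: non-overlapping average pooling."""
    return [sum(signal[i:i + pool]) / pool for i in range(0, len(signal), pool)]


random.seed(0)
# Slowly varying "structure" (a step signal) plus speckle-like Gaussian noise.
clean = [float(i // 16) for i in range(64)]
noisy = [x + random.gauss(0.0, 0.5) for x in clean]

# A perfect structural prediction is still penalized in pixel space,
# because it cannot guess the noise in the observed frame.
pixel_loss = mse(clean, noisy)

# In the pooled latent space, much of that unpredictable noise averages out.
latent_loss = mse(encode(clean), encode(noisy))

print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

With any averaging encoder like this one, the latent loss cannot exceed the pixel loss for the same pair of signals, which is the intuition behind spending model capacity on structure rather than on noise.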
Vision Models in Medical Imaging
Summary
Vision models in medical imaging use artificial intelligence to analyze and interpret medical images like MRIs, CT scans, and pathology slides, helping doctors diagnose diseases more accurately and efficiently. These models often combine image and language processing, allowing them to utilize both visual data and medical reports for improved understanding.
- Explore open tools: Take advantage of freely available vision-language models that can accelerate research and clinical workflows in healthcare.
- Prioritize structured knowledge: Use curated diagnostic resources alongside AI models to support evidence-based predictions and build trust in clinical settings.
- Embrace rapid adaptation: Look for modular vision models that can be quickly tuned to new tasks or datasets, saving valuable time and computational resources.
-
Yesterday, we released MedGemma, an open medical vision-language model for healthcare! Built on Google DeepMind Gemma 3, it advances medical understanding across images and text, significantly outperforming generalist models of similar size. MedGemma is one of the best open models under 50B! How MedGemma Was Trained: 1️⃣ Fine-tuned the Gemma 3 vision encoder (SigLIP) on over 33 million medical image-text pairs (radiology, dermatology, pathology, etc.) to create the specialized MedSigLIP, including some general data to prevent catastrophic forgetting. 2️⃣ Further pre-trained Gemma 3 Base by mixing in the medical image data (using the new MedSigLIP encoder) to ensure the text and vision components could work together effectively. 3️⃣ Distilled knowledge from a larger "teacher" model, using a mix of general and medical text-based question-answering datasets. 4️⃣ Applied reinforcement learning, similar to Gemma 3, on medical imaging and text data; RL led to better generalization than standard supervised fine-tuning for these multimodal tasks. Insights: - 💡 Outperforms Gemma 3 on medical tasks, with 15-18% improvements in chest X-ray classification. - 🏆 Competes with, and sometimes surpasses, much larger models like GPT-4o. - 🥇 Sets a new state-of-the-art for MIMIC-CXR report generation. - 🩺 Reduces errors in EHR information retrieval by 50% after fine-tuning. - 🧠 The 27B model outperforms human physicians in a simulated agent task. - 🤗 Openly released to accelerate development in healthcare AI. - 🔬 Reinforcement Learning was found to be better for multimodal generalization. Paper: https://lnkd.in/dBTiH_cJ Model: https://lnkd.in/dnyxWPju
-
MedSAM2 just brought “segment anything” to 3D medical images and videos. Generalist segmentation models like SAM2 have shown promise in natural images, but struggle with medical data. 𝗠𝗲𝗱𝗦𝗔𝗠𝟮 bridges that gap as a 𝗽𝗿𝗼𝗺𝗽𝘁𝗮𝗯𝗹𝗲 𝘀𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗺𝗼𝗱𝗲𝗹 𝘁𝘂𝗻𝗲𝗱 𝗳𝗼𝗿 𝗯𝗼𝘁𝗵 𝟯𝗗 𝗺𝗲𝗱𝗶𝗰𝗮𝗹 𝘀𝗰𝗮𝗻𝘀 𝗮𝗻𝗱 𝘁𝗲𝗺𝗽𝗼𝗿𝗮𝗹 𝘃𝗶𝗱𝗲𝗼 𝗳𝗿𝗮𝗺𝗲𝘀. 1. Fine-tuned on over 455,000 3D image-mask pairs and 76,000 video frames. 2. Achieved an 88.84% Dice score on CT organs, 88.37% on MRI lesions, and 87.22% on PET lesions, leading across all tasks in the 3D benchmark. 3. Outperformed all models on ultrasound and endoscopy videos, with up to 96.13% accuracy on left ventricle segmentation and 92.22% on hard polyps. 4. Cut 3D lesion annotation time by 86% for CT (from 526s to 74s/lesion) and 87% for liver MRI (from 520s to 65s/lesion) through a human-in-the-loop pipeline. A couple of thoughts: • use of memory attention mechanisms to maintain spatial and temporal consistency across slices/frames is a cool arch choice • 𝗵𝘂𝗺𝗮𝗻-𝗶𝗻-𝘁𝗵𝗲-𝗹𝗼𝗼𝗽 𝗮𝗻𝗻𝗼𝘁𝗮𝘁𝗶𝗼𝗻 𝗽𝗶𝗽𝗲𝗹𝗶𝗻𝗲 𝗵𝗮𝘀 𝗶𝗺𝗺𝗲𝗻𝘀𝗲 𝗽𝗿𝗮𝗰𝘁𝗶𝗰𝗮𝗹 𝘃𝗮𝗹𝘂𝗲 𝗮𝗻𝗱 𝘀𝗵𝗼𝘂𝗹𝗱 𝗯𝗲 𝗶𝗺𝗽𝗹𝗲𝗺𝗲𝗻𝘁𝗲𝗱 𝗺𝗼𝗿𝗲 • love how the tool is available on 3D Slicer, Jupyter, Colab, and Gradio to translate research into usable tools. The barriers to using models NEED to be lower in medical AI research in general for faster adoption and iterative loops of improvement. Here's the awesome work: https://lnkd.in/g7jy5W7G Congrats to Jun Ma, Zongxin Yang, Sumin Kim, Beatrice Bihui Chen, Bo Wang, and co! I post my takes on the latest developments in health AI – 𝗰𝗼𝗻𝗻𝗲𝗰𝘁 𝘄𝗶𝘁𝗵 𝗺𝗲 𝘁𝗼 𝘀𝘁𝗮𝘆 𝘂𝗽𝗱𝗮𝘁𝗲𝗱! Also, check out my health AI blog here: https://lnkd.in/g3nrQFxW
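For readers unfamiliar with the Dice score quoted in the MedSAM2 post, the sketch below shows the standard definition, 2|A∩B| / (|A| + |B|), on flat binary masks, along with the arithmetic behind the quoted CT annotation-time reduction. The helper names and the toy masks are illustrative assumptions, not part of the MedSAM2 codebase.

```python
def dice_score(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention chosen here (an assumption): two empty masks match perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def time_reduction(before_s, after_s):
    """Relative reduction in annotation time, as a fraction of the original."""
    return (before_s - after_s) / before_s


# Toy masks: 4 predicted voxels, 4 true voxels, 3 shared.
pred = [0, 1, 1, 1, 0, 0, 1, 0]
target = [0, 1, 1, 0, 0, 0, 1, 1]
print(dice_score(pred, target))  # 2 * 3 / (4 + 4) = 0.75

# CT figures quoted in the post: 526 s down to 74 s per lesion.
print(f"{time_reduction(526, 74):.1%}")  # about 86%
```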
-
MRI images are hard to train AI on, but Decipher-MR, a new 3D vision-language model, shows that pairing scans with text reports across the full body can dramatically improve performance. 1️⃣ Decipher-MR was trained on 200,000 MRI series from 22,594 studies across diverse body regions, ages, and MRI protocols. 2️⃣ It uses a two-stage pretraining: first, self-supervised vision and text encoders; second, contrastive learning aligns MR images with their radiology reports. 3️⃣ The model supports modular adaptation: frozen encoder + lightweight task-specific decoders, reducing compute costs and enabling rapid tuning. 4️⃣ In classification tasks (e.g., disease, demographic, sequence type), it outperformed models like DINOv2 and BiomedCLIP, especially in low-data settings. 5️⃣ Text supervision and anatomical diversity boosted performance. Models trained on narrow data (e.g., only head/neck or T2) performed worse across the board. 6️⃣ It excels in zero-shot cross-modal retrieval: retrieving images from text queries and vice versa, with high accuracy in body region and tumor-type matching. 7️⃣ In 3D segmentation (abdomen, pelvis, heart), Decipher-MR nearly matched nnUNet, despite using a frozen encoder, and converged faster than MedSAM3D. 8️⃣ For anomaly localization (e.g., tumors, organ removal), it beat task-specific models in both visual-only and text-guided scenarios using a DETR3D or MedRPG head. 9️⃣ Sex-based evaluations showed consistent generalization across male and female patients, outperforming peers even under cross-sex testing conditions. 🔟 Limitations include underperformance in fine-grained imaging attribute classification and some pathology-specific retrievals, likely due to text diversity gaps. ✍🏻 Zhijian Yang, Noel DSouza, István Megyeri, Xiaojian Xu, Amin Honarmandi Shandiz, Farzin Haddadpour, Krisztián Koós, Laszlo Rusko, Emanuele Valeriano, Bharadwaj Swaminathan, Lei Wu, Parminder Bhatia, Taha Kass-Hout, MD, MS, Erhan Bas. 
Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations. arXiv. 2025. DOI: 10.48550/arXiv.2509.21249v1
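The second pretraining stage described above aligns images with their reports contrastively. The post does not state the exact loss, so the following is a minimal sketch assuming a CLIP-style symmetric InfoNCE objective; the function names, the temperature of 0.07, and the 2-D toy embeddings are all assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Softmax cross-entropy for one row, using a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair k of images should match pair k of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text: each row of the similarity matrix is one classification.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image: the same, over columns.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)


images = [[1.0, 0.0], [0.0, 1.0]]
reports = [[0.9, 0.1], [0.1, 0.9]]
aligned = contrastive_loss(images, reports)
shuffled = contrastive_loss(images, reports[::-1])
print(aligned < shuffled)  # matched pairs give the lower loss
```

Once embeddings are aligned this way, zero-shot retrieval reduces to ranking one modality's embeddings by similarity to a query from the other.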
-
🎉 Happy to share a new paper from my lab, published in Springer Nature npj Digital Medicine🎉: "Melan-Dx: A knowledge-enhanced vision-language framework improves differential diagnosis of melanocytic neoplasm pathology" 🔗 https://lnkd.in/gF-BtuzN This work was led by my PhD students Jialu Yao, Songhao Li, and postdoc Peixian Liang, in close collaboration with Penn Medicine Pathology and Laboratory Medicine expert dermatopathologists Xiaowei Xu and David Elder. 🔍 Why this work matters #Melanoma is one of the top 5 #cancer types. In the past, #pathology foundation models were trained and then fine-tuned to improve classification performance. However, for melanocytic neoplasms, this strategy has remained insufficient. The disease spectrum is extremely subtle and heterogeneous (44 subtypes), with high under- and over-diagnosis rates, even among experts. In order to predict, and eventually assist pathologists, there is an urgent need for innovative #AI approaches that can address this uniquely challenging differential diagnosis problem. 🧠 What we did differently Even further fine-tuning of pathology foundation model vision backbones yields suboptimal results. Instead, we introduce Melan-Dx, a knowledge-enhanced vision-language framework that can be applied to any pathology vision-language foundation model to improve differential diagnosis through explicit knowledge retrieval. At the core of Melan-Dx is the Penn Melan-Dx Knowledge Atlas, a carefully curated #dermatopathology resource integrating images, text, and structured diagnostic knowledge, collected over 10+ years at Penn Medicine Pathology and Laboratory Medicine by expert dermatopathologists Xiaowei Xu and David Elder. Particularly: 🧬 We curated the Penn Melan-Dx Knowledge Atlas, containing 2,800+ high-quality H&E images spanning 44 melanocytic neoplasm subtypes. 📚 The atlas contains high-quality structured diagnostic knowledge including histologic features, diagnostic criteria, and differential diagnoses.
🏷️ It is hierarchically organized according to the WHO classification of tumors and MPATH-Dx taxonomy. 👩⚕️ These images were collected over 10+ years at Penn and expert-validated by board-certified dermatopathologists. This atlas enables retrieval-based, evidence-grounded predictions, allowing Melan-Dx to reason over both visual patterns and domain knowledge, closely mirroring real clinical diagnostic workflows. 📊 Key results ✅ 0.869 accuracy for melanoma vs non-melanoma 🎯 0.699 Top-1 accuracy across 40 melanocytic subtypes (very challenging task! And we will do better in the future!) 🧬 0.915 ROC AUC for few-shot whole-slide classification ⚡ Up to 96% reduction in training time vs full fine-tuning 🌟 Takeaway For challenging domains like melanocytic pathology, better knowledge beats more fine-tuning. Our study showed that structured domain #knowledge is essential for accuracy, efficiency, and clinical trust. #AI #Pathology #Melanoma
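As a rough illustration of what "retrieval-based, evidence-grounded prediction" can look like, here is a minimal sketch: embed the query, retrieve the nearest atlas entries by cosine similarity, vote on the label, and return the matching notes as evidence. This is not the Melan-Dx implementation; the atlas layout, the `predict` function, the 2-D embeddings, and the placeholder labels and notes are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query, atlas, k=3):
    """Return the k atlas entries most similar to the query embedding."""
    return sorted(atlas, key=lambda entry: cosine(query, entry[0]), reverse=True)[:k]


def predict(query, atlas, k=3):
    """Majority vote over retrieved entries, plus the notes that support it."""
    neighbours = retrieve(query, atlas, k)
    votes = Counter(label for _, label, _ in neighbours)
    label, _ = votes.most_common(1)[0]
    evidence = [note for _, lab, note in neighbours if lab == label]
    return label, evidence


# Hypothetical atlas: (embedding, subtype label, structured diagnostic note).
atlas = [
    ([1.0, 0.0], "subtype_A", "placeholder criteria A1"),
    ([0.9, 0.1], "subtype_A", "placeholder criteria A2"),
    ([0.0, 1.0], "subtype_B", "placeholder criteria B1"),
    ([0.1, 0.9], "subtype_B", "placeholder criteria B2"),
]

label, evidence = predict([0.8, 0.2], atlas)
print(label)     # subtype_A
print(evidence)  # the retrieved notes backing that prediction
```

Because the backbone stays fixed and only the atlas is consulted at prediction time, a design like this avoids most of the cost of full fine-tuning, which is consistent with the training-time reduction the post reports.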
-
New paper alert! Really proud to share our latest work: 3DReasonKnee, a new dataset for teaching AI models to reason through 3D medical images the way clinicians do. The core insight: When radiologists assess knee MRIs, they follow a structured process: localizing specific anatomical regions, evaluating what they see, and reasoning through severity assessments step by step. Most existing datasets give models images and labels, but not the reasoning process that connects them. We wanted to capture that. What we built: - 494k samples from ~8,000 knee MRI volumes - 3D bounding boxes for precise anatomical localization - Clinician-written reasoning chains that detail the diagnostic thought process - Structured severity assessments following clinical frameworks (MOAKS) The reasoning chains are what I'm most excited about. Our clinical collaborators spent 450+ hours not just labeling, but articulating how they think through each diagnosis. It's a real attempt to capture expert knowledge in a way that models can learn from. We benchmark several state-of-the-art vision-language models and show there's significant room for improvement in grounded 3D reasoning, which makes sense, since models haven't really been trained on data like this before. This is one step toward building medical AI that's more aligned with how clinicians actually work. Lots more to do, but I think this dataset gives the community a useful new tool. Congratulations to first authors Sraavya Sambara & Sung Eun Kim with Xiaoman Zhang, Luyang Luo, Shreya Johri, Mohammed Baharoon, and Du Hyun Ro. Dataset: rajpurkarlab/3DReasonKnee on HuggingFace Paper: arXiv:2510.20967 #MedicalAI #ComputerVision #HealthcareAI #Radiology
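Localization with 3D bounding boxes is commonly scored with intersection-over-union (IoU). The post does not say which metric the benchmark uses, so treat the sketch below as an assumption-laden illustration: boxes are axis-aligned, given as (x0, y0, z0, x1, y1, z1) in voxel coordinates, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x0, y0, z0, x1, y1, z1); 0 if degenerate."""
    x0, y0, z0, x1, y1, z1 = box
    return max(0, x1 - x0) * max(0, y1 - y0) * max(0, z1 - z0)


def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    overlap = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(overlap)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


predicted = (0, 0, 0, 4, 4, 4)   # volume 64
reference = (2, 2, 2, 6, 6, 6)   # volume 64, overlapping in a 2x2x2 corner
print(iou_3d(predicted, reference))  # 8 / (64 + 64 - 8) = 8/120
```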
-
From free-text to 3D masks: meet 𝑽𝒐𝒙𝑻𝒆𝒍𝒍 With #SAM3 on the horizon pushing text-promptable segmentation into the spotlight for natural images, we asked what this might look like for 3D medical data. Our answer is VoxTell: a vision-language model that segments volumetric CT, MRI and PET scans directly from free-text prompts. VoxTell (voxel + tell) links what you type to what you see: 📝 Segment from single words or full clinical sentences 🔄 Generalize across modalities and to related unseen concepts 🩺 Understand real clinical language from radiology reports 📚 Trained on 1,000+ organs, substructures, and pathologies This makes it the largest model to date in terms of the diversity of structures it can process. It delivers state-of-the-art zero-shot performance across organs and pathologies and can localize highly specific findings from clinical text such as: "𝘴𝘱𝘪𝘤𝘶𝘭𝘢𝘵𝘦𝘥 𝘤𝘢𝘳𝘤𝘪𝘯𝘰𝘮𝘢 𝘪𝘯 𝘵𝘩𝘦 𝘭𝘦𝘧𝘵 𝘶𝘱𝘱𝘦𝘳 𝘭𝘰𝘣𝘦 𝘸𝘪𝘵𝘩 𝘱𝘭𝘦𝘶𝘳𝘢𝘭 𝘤𝘰𝘯𝘵𝘢𝘤𝘵". Huge thanks to my co-first author Moritz Langenberg and all collaborators for making this project happen. This was a true team effort! 🙌 If this sounds interesting, you can find more details here: 📄 Paper: https://lnkd.in/e2-rwZTA 💻 Code: https://lnkd.in/eHMViGkh (coming soon) Would love to hear your thoughts, feedback, and potential use cases from the community. Yannick Kirchhoff Fabian Isensee Benjamin Hamm (chief design wizard) Constantin Ulrich PD Dr. med. Sebastian Regnery Lukas Bauer Efthimios Katsigiannopulos Tobias Norajitra Klaus Maier-Hein #VoxTell #medicalimaging #visionlanguage #3Dsegmentation #AIinHealthcare #radiology
-
The Segment Anything paradigm is shifting from geometric prompts (clicks and boxes) to deep semantic understanding (concepts and text). Research highlights a move toward domain-specific adaptation to eliminate the need for manual spatial cues—segmenting what we mean rather than just what we point at. Here is how researchers are bridging the gap between pixel-level precision and semantic reasoning across general, medical, microscopic, and geospatial domains: The Generalist Foundation: Nicolas Carion et al. introduce SAM 3, shifting the architecture from Promptable Visual Segmentation to Promptable Concept Segmentation. Unlike its predecessors, SAM 3 does not require spatial cues; it aligns a massive scale of visual data with text embeddings. This allows users to prompt with simple noun phrases (e.g., "striped cat") or image exemplars, achieving strong zero-shot performance without manual bounding boxes. Adapting for Radiology: Chongcong Jiang et al. present Medical SAM3, addressing the failure of generalist models in healthcare. They demonstrate that vanilla SAM 3 relies heavily on privileged geometric prompts (boxes), failing catastrophically on medical tasks when relying on text alone. By fine-tuning the full model on 33 datasets across 10 modalities, they achieved text-driven semantic alignment, allowing the model to localize anatomy using clinical terminology rather than manual guidance. Adapting for Microscopy: Anwai Archit et al. release Segment Anything for Microscopy (μSAM). Recognizing that generalist models struggle with the unique textures of Light and Electron Microscopy, the team fine-tuned specific models for these modalities. Crucially, they prioritized workflow integration, releasing a plugin that supports interactive annotation and tracking, enabling researchers to rapidly train specialist models on their own data. Adapting for Earth Observation: The RemoteSAM Team (Yao et al.) 
introduces a framework for satellite imagery, where standard models often fail due to scale and task complexity. They propose a task unification paradigm, treating Referring Expression Segmentation as the core capability. By predicting a pixel-level mask from text and converting it for downstream needs (detection, counting, classification), RemoteSAM achieves high efficiency, proving that lightweight, unified models can outperform massive generic backbones in specialized tasks. SAM 3: https://lnkd.in/eusmEVqg Medical SAM3: https://lnkd.in/eE5pXze7 µSAM: https://lnkd.in/ed8C_mF7 RemoteSAM: https://lnkd.in/eiwJ7Mq6 --- Keeping up with the literature is increasingly a team sport. This analysis was supported by NotebookLM and grounded in my own review and experience. If you found this useful, let me know in the comments. If it missed the mark, I want that feedback too. Weekly briefings on making vision AI work in the real world → https://lnkd.in/guekaSPf
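The task-unification idea in the RemoteSAM summary (predict one pixel-level mask, then convert it for detection, counting, and classification) can be sketched with plain connected-component analysis. This is an illustrative assumption about how such a conversion might work, not RemoteSAM's actual post-processing; the function names and the toy mask are hypothetical.

```python
from collections import deque


def connected_components(mask):
    """Group foreground pixels of a 2D binary mask using 4-connectivity."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def mask_to_boxes(mask):
    """Detection: one (x_min, y_min, x_max, y_max) box per component."""
    boxes = []
    for pixels in connected_components(mask):
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Toy mask for a text query such as "building" (three separate objects).
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
]
boxes = mask_to_boxes(mask)
count = len(boxes)       # counting: number of objects
present = count > 0      # classification: is the concept present at all?
print(boxes)             # [(0, 0, 1, 1), (4, 1, 4, 2), (1, 3, 1, 3)]
print(count, present)    # 3 True
```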
-
We are pleased to share our latest work (+code) published in Magnetic Resonance in Medicine: “On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction” 👉 Paper: https://lnkd.in/eg35Sqtz 👉 Code: https://lnkd.in/eAXSbCjV Why this work matters Foundation models have rapidly transformed fields like computer vision and natural language processing. In contrast, MRI reconstruction has largely remained within task-specific models and signal-level optimization. In this work, we explore a simple but important question: 💡 Can foundation models help accelerate MRI and improve the MRI workflow? What’s new We show that pretrained foundation models can guide MRI reconstruction—bringing high-level understanding to what has traditionally been a purely physics- and data-driven process. Rather than designing new reconstruction architectures, this approach leverages existing foundation models to provide a new form of prior—one that is more general, flexible, and transferable. Key message This work represents an early step toward integrating foundation models into the MRI pipeline. It suggests a shift from narrowly designed models toward more general-purpose AI systems that can support reconstruction, analysis, and interpretation in a unified way. We are excited to see how this direction evolves and welcome discussion from the community. First author: Ruimin Feng, PhD; huge thanks to our collaborators and clinical partners, Zachary Stewart, MD, and Ron Mercer, MD. Our research is supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIAMS) (R01AR079442, R01AR081344, R56AR081017) and the National Institute of Biomedical Imaging and Bioengineering (NIBIB) (R21EB031185). #Research is conducted at The Martinos Center for Biomedical Imaging, Harvard Medical School and Massachusetts General Hospital. #MRI #MedicalImaging #AI #FoundationModels #DeepLearning #Radiology #ISMRM #RSNA #ArtificialIntelligence