Top LinkedIn Content on AI Techniques For Image Recognition

Agents & Gemini API, MTS at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News

165,595 followers 10mo

Yesterday, we released MedGemma, an open medical vision-language model for healthcare! Built on Google DeepMind's Gemma 3, it advances medical understanding across images and text, significantly outperforming generalist models of similar size. MedGemma is one of the best open models under 50B!
How MedGemma Was Trained:
1️⃣ Fine-tuned the Gemma 3 vision encoder (SigLIP) on over 33 million medical image-text pairs (radiology, dermatology, pathology, etc.) to create the specialized MedSigLIP, including some general data to prevent catastrophic forgetting.
2️⃣ Further pre-trained Gemma 3 Base by mixing in the medical image data (using the new MedSigLIP encoder) to ensure the text and vision components could work together effectively.
3️⃣ Distilled knowledge from a larger "teacher" model, using a mix of general and medical text-based question-answering datasets.
4️⃣ Applied reinforcement learning, similar to Gemma 3, on medical imaging and text data. RL led to better generalization than standard supervised fine-tuning for these multimodal tasks.
Insights:
- 💡 Outperforms Gemma 3 on medical tasks, with 15-18% improvements in chest X-ray classification.
- 🏆 Competes with, and sometimes surpasses, much larger models like GPT-4o.
- 🥇 Sets a new state-of-the-art for MIMIC-CXR report generation.
- 🩺 Reduces errors in EHR information retrieval by 50% after fine-tuning.
- 🧠 The 27B model outperforms human physicians in a simulated agent task.
- 🤗 Openly released to accelerate development in healthcare AI.
- 🔬 Reinforcement Learning was found to be better for multimodal generalization.
Paper: https://lnkd.in/dBTiH_cJ
Model: https://lnkd.in/dnyxWPju

35 Comments

Anima Anandkumar

228,879 followers 11mo

Neural Operators: AI for medical imaging under any subsampling
Current AI models are purpose-built for each medical imaging device and specific subsampling scheme, which makes them inflexible and narrow, with limited resolution for bringing out the important features. We propose Neural Operators as a universal AI scheme that can handle any subsampling scheme and can do super-resolution and FOV out of the box, without any retraining. For MRI, it can handle any k-space pattern with +11% SSIM and +4 dB PSNR vs. earlier deep learning models, and it is 600× faster than diffusion models. See our Hugging Face demo and #cvpr paper: https://lnkd.in/gNuQMuNv https://armeet.ca/nomri
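To put the "+4 dB PSNR" figure in context, here is a minimal sketch of how PSNR is computed and what a gain in dB means in terms of mean squared error. It uses plain Python lists as stand-ins for images; the pixel values and the two "reconstructions" are made up for illustration and have nothing to do with the paper's data.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

# Hypothetical pixel intensities in [0, 1]; not taken from any real scan.
reference = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [r + 0.10 for r in reference]  # constant error of 0.10 per pixel
improved = [r + 0.05 for r in reference]  # constant error of 0.05 per pixel

print(round(psnr(reference, baseline), 2))
print(round(psnr(reference, improved), 2))
print(round(mse_ratio_for_gain(4.0), 2))
```

Because PSNR is logarithmic, a 4 dB gain corresponds to roughly a 2.5× reduction in mean squared error at the same peak value.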

NO-MRI - a Hugging Face Space by armeet huggingface.co

8 Comments

Jan Beger

Our conversations must move beyond algorithms.

90,219 followers 1y

This paper explores the applications of large-scale AI models in medicine, focusing on Medical Large Models (MedLMs), including LLMs, Vision Models, 3D Large Models, and Multimodal Models.
1️⃣ LLMs process clinical text, aiding in electronic health records (EHR) analysis, medical question-answering, and treatment planning. Examples include MedPaLM and MedGPT, which support medical education and diagnostics.
2️⃣ Vision models based on CNNs assist in medical imaging tasks like cancer detection and anomaly detection, achieving dermatologist-level accuracy in skin cancer diagnosis. Vision-Language Models (VLMs) enhance zero-shot learning for medical images.
3️⃣ 3D large models analyze volumetric medical data, aiding in tumor segmentation, virtual surgery simulations, and anatomical modeling for prosthetics.
4️⃣ Multimodal models integrate clinical text, imaging, and genomic data to improve diagnostic accuracy and personalized treatment planning, particularly in oncology.
5️⃣ Graph large models (LGMs) use graph neural networks (GNNs) in medical knowledge graphs, drug discovery, and genomics, aiding in disease risk prediction and biomarker identification.
6️⃣ Drug discovery is accelerated by MedLMs such as AlphaFold and GraphDTA, which predict protein structures and drug-target interactions, improving efficiency in molecular design.
7️⃣ AI-driven models assist in summarizing patient records, generating diagnostic reports, and enhancing clinical documentation, reducing physician workload.
8️⃣ Biomedical image generation using GANs and diffusion models produces high-quality synthetic medical images for data augmentation, improving AI training in pathology and radiology.
9️⃣ AI-driven models enhance precision medicine by integrating multi-source patient data, enabling individualized diagnosis and treatment strategies.
🔟 Challenges include high computational costs, ethical concerns, and potential inaccuracies (AI hallucinations), which limit real-world implementation.
✍🏻 YunHe Su, Zhengyang Lu, Junhui Liu, Ke Pang, Haoran Dai, Sa Liu, Yuxin Jia, Lujia Ge, Jing-min Yang. Applications of Large Models in Medicine. arXiv 2025. DOI: 10.48550/arXiv.2502.17132v1

15 Comments

Piotr Skalski

Open Source Lead @ Roboflow | Computer Vision | Vision Language Models

90,594 followers 1y

Open Vocabulary Detection with Qwen2.5-VL 🔥 🔥 🔥 I've been diving into how well Vision-Language Models (VLMs) like Qwen2.5-VL can understand images and find objects. This builds on my earlier tutorials about zero-shot detection with models like GroundingDINO and YOLO-World (links in comments). I wanted to see how well VLMs can not only detect objects but also understand where they are in an image. As usual, I've created a notebook to show you what I've been testing: - Object detection using Qwen2.5-VL with different types of instructions (prompts). - Trying to find single and multiple objects in an image. - Using descriptions like "the object on the left" or "the closest object" to find specific items. - Asking the model to reason about objects: "What would I use to open this?" ⮑ 🔗 notebook: https://lnkd.in/dJchZZiJ More examples are in the comments! 👇🏻
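A practical step in this kind of experiment is turning the model's text answer into boxes you can draw or check. The sketch below assumes the model was asked to answer in JSON as a list of objects with "bbox_2d" and "label" keys, possibly wrapped in a markdown code fence; that format is an assumption here, and the linked notebook may parse responses differently. The sample response is hand-written, not real model output.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker, built this way so it can sit inside this example

def parse_detections(response_text):
    """Extract (label, (x1, y1, x2, y2)) pairs from a JSON-style model response."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (which may carry a language tag) and the closing fence.
        lines = text.splitlines()
        lines = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        text = "\n".join(lines)
    detections = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        detections.append((item.get("label", "object"), (x1, y1, x2, y2)))
    return detections

def leftmost(detections):
    """Pick the detection whose box starts furthest to the left."""
    return min(detections, key=lambda d: d[1][0])

# Hypothetical response text, written by hand for illustration.
sample = (
    FENCE + "json\n"
    '[{"bbox_2d": [40, 60, 200, 310], "label": "bottle"},\n'
    ' {"bbox_2d": [420, 80, 560, 300], "label": "bottle opener"}]\n'
    + FENCE
)

boxes = parse_detections(sample)
print(boxes)
print(leftmost(boxes))
```

A geometric helper like `leftmost` is one way to sanity-check spatial prompts such as "the object on the left": the box the model returns for that prompt can be compared against the leftmost box from a plain "find all objects" prompt.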

86 Comments

Ahmed Serag, PhD

6,484 followers 1y

𝗡𝗲𝘄 𝗿𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗽𝘂𝗯𝗹𝗶𝘀𝗵𝗲𝗱! Medical imaging is packed with hidden clinical biomarkers, but privacy hurdles and data scarcity often keep this treasure trove locked away from AI innovation. Frustrating, right? That’s exactly what inspired me and Abdullah Hosseini to ask: Can we generate synthetic medical images that not only look real, but also preserve the critical biomarkers clinicians rely on? So, we dove in. Using cutting-edge diffusion models fused with Swin-transformer networks, we generated synthetic images across three modalities—radiology (chest X-rays), ophthalmology (OCT), and histopathology (breast cancer slides). The big question: 𝗗𝗼 𝘁𝗵𝗲𝘀𝗲 𝘀𝘆𝗻𝘁𝗵𝗲𝘁𝗶𝗰 𝗶𝗺𝗮𝗴𝗲𝘀 𝗸𝗲𝗲𝗽 𝘁𝗵𝗲 𝘀𝘂𝗯𝘁𝗹𝗲, 𝗱𝗶𝘀𝗲𝗮𝘀𝗲-𝗱𝗲𝗳𝗶𝗻𝗶𝗻𝗴 𝗳𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝗶𝗻𝘁𝗮𝗰𝘁? • Our diffusion models faithfully preserved key biomarkers—like lung markings in X-rays and retinal abnormalities in OCT—across all datasets. • Classifiers trained only on synthetic data performed nearly as well as those trained on real images, with F1 and AUC scores hitting 0.8–0.99. • No statistically significant difference in diagnostic performance—meaning synthetic data could stand in for real data in many AI tasks, while protecting patient privacy. This work shows synthetic data isn’t just a lookalike—it’s a powerful, privacy-preserving tool for research, clinical AI, and education. Imagine sharing and scaling medical data without the headaches of privacy risk or limited access! Read the full paper: https://lnkd.in/eW6TM9H2 Get the code & datasets: https://lnkd.in/ek4wSkg3 #AI #Innovation #SyntheticData #DiffusionModels #MedicalImaging #HealthcareInnovation #DigitalHealth #Frontiers #WeillCornell #HealthTech #HealthcareAI #PrivacyPreservingAI #GenerativeAI #Biomarkers #MachineLearning #Qatar #MENA #MiddleEast #NorthAfrica #MENAIRegion #MENAInnovation #UAE #UnitedArabEmirates #SaudiArabia #KSA #Egypt AI Innovation Lab Weill Cornell Medicine Weill Cornell Medicine - Qatar Cornell Tech Cornell University

14 Comments

Satya Mallick

CEO @ OpenCV | BIG VISION Consulting | AI, Computer Vision, Machine Learning

69,716 followers 1mo Edited

After 15+ years building computer vision systems, I tested Opus 4.7 on one of the simplest CV tasks — detecting cars in an aerial parking lot image. The result? It took 4–5 minutes and still missed cars, produced false positives, and placed points on empty spaces. Bounding box results were even worse — boxes in completely wrong locations. I also tested Codex, which did significantly better at pointing (found 24 cars with accurate placement). But it still took nearly 3 minutes. Meanwhile, a YOLO model does the same in 30 milliseconds. Here's the takeaway for anyone building with AI right now: Multimodal LLMs are incredible for reasoning, summarization, and code generation. But they are not a replacement for dedicated computer vision models. For standard detection → train a model or use YOLO. For open-vocabulary tasks like "find only red cars" → use Qwen 3 VL or MoonDream. Both respond in milliseconds, cost a fraction of the tokens, and produce far more accurate results. The right model for the right task. Always. Full breakdown in the video. Link in comments. #ComputerVision #AI #MachineLearning #YOLO #DeepLearning #ObjectDetection #LLM #AIEngineering #MLOps #OpenCV
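The failure modes described above (missed cars, false positives, points on empty spaces) can be counted with a simple check of predicted points against ground-truth boxes. This is a rough sketch with made-up coordinates, not the evaluation used in the video; the greedy one-point-per-box matching is a simplifying assumption.

```python
def score_points(points, boxes):
    """Match predicted points to ground-truth boxes, at most one point per box.

    Returns (hits, false_positives, misses).
    """
    matched = set()
    false_positives = 0
    for px, py in points:
        hit = None
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            if i not in matched and x1 <= px <= x2 and y1 <= py <= y2:
                hit = i
                break
        if hit is None:
            # Point landed on empty space or on a car that was already counted.
            false_positives += 1
        else:
            matched.add(hit)
    hits = len(matched)
    return hits, false_positives, len(boxes) - hits

# Hypothetical ground-truth car boxes (x1, y1, x2, y2) and predicted points (x, y).
cars = [(0, 0, 10, 20), (15, 0, 25, 20), (30, 0, 40, 20)]
predicted = [(5, 10), (6, 11), (50, 10)]

hits, false_positives, misses = score_points(predicted, cars)
precision = hits / (hits + false_positives)
recall = hits / (hits + misses)
print(hits, false_positives, misses)
print(round(precision, 2), round(recall, 2))
```

In this toy case one car is found, the duplicate point and the point on empty space count as false positives, and two cars are missed.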

46 Comments

Paul Thompson

Director at ENIGMA Center for Worldwide Medicine, Imaging & Genomics

6,895 followers 1y

🔥 CAN YOU TALK TO YOUR BRAIN MRI SCANS and ask them questions? 🔥 Our new AI vision-language model identifies factors that affect the brain via natural language supervision (paper linked below) 🔥 Most AI methods for brain imaging are trained to do one task. But what if you COULD ask them to do disease detection, retrieval, and even captioning (telling you what's in the medical image)? How would you train such a model to learn about brains and diseases, and variations that matter? 🔥 In Dhinagar et al., we train an #AI to learn multiple brain imaging tasks using natural language. Instead of training a deep learning model on images only, we extend transformer-based text encoders to learn new concepts like "Alzheimer's" by jointly embedding brain MRIs and text, followed by contrastive learning with image-to-text and text-to-image losses 🔥 We fine-tune the vision-language model by freezing the image encoder backbone and fine-tuning everything else 🔥 Next we evaluate multiple state-of-the-art transformer-based decoder-only large language models (LLMs) for visual question answering including Google’s Gemma2 series, Meta’s Llama3 series, and Mistral’s 7B. The Mistral 7B LLM was selected due to its strong ability to adhere to provided instructions. We used Meta AI’s optimized FAISS vector store to create our database of vector embeddings for retrieval and re-ranking mechanisms for visual question answering. 
🔥 Most ‘off-the-shelf’ text encoders were not sensitive to numerical and categorical concepts that are crucial for neuroimaging - so you can't just feed your MRI scans into a chatbot and hope it works, as suggested by Elon Musk [1] - the text encoder and its pre-training data are crucial for cross-modal retrieval and classification performance; the VLM can discover brain features that align with text concepts directly, as shown via classification of Alzheimer's disease, age prediction from MRI 🔥 We evaluate different cross-modal fine-tuning methods - fully fine-tuning all layers was best, but locked image tuning – fine-tuning only the text backbone along with the image projection head – greatly reduces the number of tunable parameters, lending itself to low-resource settings. 🔥 The approach is quite general + could help with a variety of tasks in radiology and medical imaging - you could use such a VLM interface to perform virtual experiments: identify brain abnormalities in a group of patients, discover imaging patterns associated with a medication or risk factor described in the associated text. Will try that next :) PDF: https://lnkd.in/gHQrKke8 Abstract: https://lnkd.in/gigK3M-v [1] https://lnkd.in/gARFiUUP #AI #VLM
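The "image-to-text and text-to-image losses" mentioned above are typically combined into a symmetric contrastive objective. Below is a minimal pure-Python sketch of that idea in the style popularized by CLIP; it is not the paper's implementation, and the temperature of 0.07 and the toy 2-D embeddings are assumptions chosen for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy losses."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # transpose: rows are now texts
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_t)
    return 0.5 * (image_to_text + text_to_image)

# Toy embeddings: row i of images is paired with row i of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]

print(contrastive_loss(images, aligned_texts))
print(contrastive_loss(images, mismatched_texts))
```

The loss is small when each MRI embedding sits closest to its own text embedding and large when the pairs are swapped, which is the signal that pulls matching scans and descriptions together in the shared space.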

12 Comments

Niels Rogge

Machine Learning Engineer at ML6 & Hugging Face

69,605 followers 9mo

Some new impressive open vocabulary detectors landed in the Hugging Face Transformers library! 🔥 LLMDet (CVPR '25 highlight) and MM-Grounding DINO are now available. These models are so-called "open vocabulary" or "zero-shot" detection models. This means that they can detect objects in an image just via prompting, no training involved! Previously, we supported impressive models such as Google's OWL-ViT and OWLv2, as well as Grounding DINO. They are pretty popular on the hub, with more than 1 million monthly downloads, and they can greatly speed up annotation. Today, those got leveled up by some more recent works. MM-Grounding DINO is an improvement of Grounding DINO built on the MMDetection library. It serves as an open-source, comprehensive, and user-friendly baseline, as the original Grounding DINO didn't open-source any training code 😞. The authors also outperform the original model. The second one is called LLMDet and leverages a Large Language Model (LLM) to generate both region-level short captions and image-level long captions on 1.1 million images in total. Thanks to the supervision of the LLM, the model outperforms prior models by a large margin, especially for rare or compositional categories. It is fully compatible with the architecture of MM-Grounding DINO. As we now support 5 different models, Aritra Roy Gosthipaty built an awesome "zero-shot object detection" arena where you can quickly compare the results on an image.
Resources:
- MM-Grounding-DINO: https://lnkd.in/e-uNjvzj
- LLMDet: https://lnkd.in/e5Jxdbqh
- Zero-shot object detection arena: https://lnkd.in/et7tpdxu
- Zero-shot object detection explained: https://lnkd.in/e6SprRBV
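For readers who want to try prompting-based detection, here is a minimal sketch using the Transformers `zero-shot-object-detection` pipeline with an OWLv2 checkpoint. The checkpoint choice, the 0.3 score cutoff, and the example detections are assumptions for illustration; whether the newer LLMDet and MM-Grounding DINO checkpoints work through the same pipeline call is not confirmed here, so check their model cards.

```python
def keep_confident(detections, min_score=0.3):
    """Keep detections at or above min_score, highest score first."""
    kept = [d for d in detections if d["score"] >= min_score]
    return sorted(kept, key=lambda d: d["score"], reverse=True)

def detect(image, labels, model_id="google/owlv2-base-patch16-ensemble"):
    """Run a zero-shot detector on an image path, URL, or PIL image.

    Transformers is imported lazily so keep_confident works without it installed.
    """
    from transformers import pipeline

    detector = pipeline("zero-shot-object-detection", model=model_id)
    return detector(image, candidate_labels=labels)

# Hand-written detections in the pipeline's output shape, for illustration only.
example = [
    {"score": 0.82, "label": "cat", "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"score": 0.12, "label": "remote", "box": {"xmin": 300, "ymin": 40, "xmax": 360, "ymax": 90}},
    {"score": 0.45, "label": "remote", "box": {"xmin": 30, "ymin": 50, "xmax": 90, "ymax": 80}},
]

print([(d["label"], d["score"]) for d in keep_confident(example)])
```

With a real image, something like `keep_confident(detect("photo.jpg", ["cat", "remote"]))` would run the model and filter its output; `photo.jpg` is a placeholder path.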

14 Comments

Heather Couture, PhD

Fractional Principal CV/ML Scientist | Making Vision AI Work in the Real World | Solving Distribution Shift, Bias & Batch Effects in Pathology & Earth Observation

17,172 followers 9mo

𝐀𝐈 𝐀𝐠𝐞𝐧𝐭 𝐀𝐜𝐡𝐢𝐞𝐯𝐞𝐬 𝟗𝟏% 𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐢𝐧 𝐂𝐚𝐧𝐜𝐞𝐫 𝐃𝐢𝐚𝐠𝐧𝐨𝐬𝐢𝐬 Oncology decision-making is notoriously complex. Clinicians must integrate histopathology images, radiology scans, genetic profiles, and ever-evolving treatment guidelines to make personalized care decisions. It's a cognitive challenge that even experienced specialists find demanding. A new study by Ferber et al. in Nature Cancer shows how an autonomous AI agent tackled this complexity head-on—and the results are striking. 𝗪𝗵𝘆 𝘁𝗵𝗶𝘀 𝗺𝗮𝘁𝘁𝗲𝗿𝘀: Current AI approaches in healthcare often work in isolation—analyzing single data types or providing generic responses. But real clinical decisions require synthesizing multiple sources of evidence simultaneously, something that has remained challenging for AI systems. 𝗞𝗲𝘆 𝗶𝗻𝗻𝗼𝘃𝗮𝘁𝗶𝗼𝗻𝘀: ◦ 𝐌𝐮𝐥𝐭𝐢𝐦𝐨𝐝𝐚𝐥 𝐭𝐨𝐨𝐥 𝐢𝐧𝐭𝐞𝐠𝐫𝐚𝐭𝐢𝐨𝐧: Vision transformers detect genetic mutations directly from tissue slides, MedSAM segments tumors in radiology images, and the system queries precision oncology databases autonomously ◦ 𝐒𝐞𝐪𝐮𝐞𝐧𝐭𝐢𝐚𝐥 𝐫𝐞𝐚𝐬𝐨𝐧𝐢𝐧𝐠: The agent chains tools together—first measuring tumor growth from imaging, then checking mutation databases, then searching recent literature ◦ 𝐄𝐯𝐢𝐝𝐞𝐧𝐜𝐞-𝐛𝐚𝐬𝐞𝐝 𝐜𝐢𝐭𝐚𝐭𝐢𝐨𝐧𝐬: 75.5% accuracy in citing relevant medical guidelines, addressing the critical problem of AI hallucinations in healthcare 𝗧𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁𝘀: When tested on 20 realistic patient cases, the integrated system achieved 91% accuracy in clinical conclusions. Perhaps more telling: GPT-4 alone managed only 30% accuracy on the same cases—nearly a 3x improvement through tool integration. The agent successfully used appropriate diagnostic tools 87.5% of the time and provided helpful responses to 94% of clinical questions. 𝗧𝗵𝗲 𝗯𝗶𝗴𝗴𝗲𝗿 𝗽𝗶𝗰𝘁𝘂𝗿𝗲: This isn't about replacing oncologists—it's about augmenting clinical reasoning with AI that can process multiple data streams simultaneously. The modular approach means individual tools can be updated, validated, and regulated independently. 
While challenges remain around data privacy and regulatory approval, this research points toward a future where AI agents serve as sophisticated clinical reasoning partners, helping doctors navigate the increasing complexity of modern medicine. https://lnkd.in/e52xBZj9 #AIinHealthcare #PrecisionOncology #ClinicalAI #DigitalHealth #MachineLearning #Oncology

1 Comment

Sreenivas B.

Director / Head of Digital Solutions at Zeiss

9,606 followers 3w

I spent the last few evenings building something I have wanted for a long time: a complete LLM-assisted image annotation pipeline for scientists. The full series is now live as a playlist, four videos covering everything from concepts to working code to a desktop application you can run on your own GPU.
What the series covers:
Video 1: The concepts, what Grounding DINO and SAM 2 actually are, where they work, and where they fail (honestly)
Video 2: Text-prompted object detection in Google Colab, detect anything you can describe, no training required
Video 3: Text to pixel masks, combining Grounding DINO with SAM 2 for precise segmentation, including point-click correction for domains where automatic detection fails
Video 4: A full desktop annotation tool built in Python, per-class thresholds, multiple detection phrases, single-click manual correction, and labeled mask export ready for training pipelines
On the domain gap question that came up in the comments earlier this week: you are right that these models were not trained on your specific stain or imaging modality. But as the screen recording shows, the prompt matters more than you might expect. "Glomerulus" failed. "Small circular objects" worked. The model does not need to understand the science, it needs a description it can match visually. That insight is worth knowing before you write these tools off. All notebooks and the annotation tool source code are on GitHub.
Playlist link: https://lnkd.in/gKbGsVsd
GitHub link: https://lnkd.in/g3SkZU-Q
#MachineLearning #ImageAnnotation #Python #DeepLearning #ComputerVision #SAM2 #GroundingDINO #AIforScience #Microscopy #DigitalPathology
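The "per-class thresholds" and "multiple detection phrases" ideas can be sketched as a small configuration plus a filtering step. This is not the code from the linked repository: the class table, the "vessel" class, its phrase, the thresholds, and the detections are all hypothetical. The prompt format (lowercase phrases separated by " . ") follows the usual Grounding DINO convention.

```python
# class name -> (visual phrases to prompt with, minimum score); all values are illustrative
CLASSES = {
    "glomerulus": (["small circular objects", "round structures"], 0.35),
    "vessel": (["elongated tubes"], 0.25),
}

def build_prompt(classes):
    """Join every phrase into one text prompt: lowercase phrases separated by ' . '."""
    phrases = [p.lower() for phrase_list, _ in classes.values() for p in phrase_list]
    return " . ".join(phrases) + " ."

def assign_classes(detections, classes):
    """Map (phrase, score, box) detections to class names, applying each class's threshold."""
    phrase_to_class = {
        p.lower(): name for name, (phrase_list, _) in classes.items() for p in phrase_list
    }
    labeled = []
    for phrase, score, box in detections:
        name = phrase_to_class.get(phrase.lower())
        if name is not None and score >= classes[name][1]:
            labeled.append((name, score, box))
    return labeled

# Hypothetical detector output: (matched phrase, score, (x1, y1, x2, y2)).
raw = [
    ("small circular objects", 0.40, (12, 30, 60, 80)),
    ("round structures", 0.30, (200, 210, 250, 260)),
    ("elongated tubes", 0.28, (90, 10, 300, 40)),
]

print(build_prompt(CLASSES))
print(assign_classes(raw, CLASSES))
```

Keeping the scientific class name separate from the visual phrases reflects the point made above: the detector matches "small circular objects", while the exported label can still say "glomerulus".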

22 Comments
