I spent the last few evenings building something I have wanted for a long time: a complete LLM-assisted image annotation pipeline for scientists. The full series is now live as a playlist: four videos covering everything from concepts to working code to a desktop application you can run on your own GPU.

What the series covers:
- Video 1: The concepts. What Grounding DINO and SAM 2 actually are, where they work, and where they fail (honestly).
- Video 2: Text-prompted object detection in Google Colab. Detect anything you can describe, no training required.
- Video 3: Text to pixel masks. Combining Grounding DINO with SAM 2 for precise segmentation, including point-click correction for domains where automatic detection fails.
- Video 4: A full desktop annotation tool built in Python, with per-class thresholds, multiple detection phrases, single-click manual correction, and labeled mask export ready for training pipelines.

On the domain gap question that came up in the comments earlier this week: you are right that these models were not trained on your specific stain or imaging modality. But as the screen recording shows, the prompt matters more than you might expect. "Glomerulus" failed. "Small circular objects" worked. The model does not need to understand the science; it needs a description it can match visually. That insight is worth knowing before you write these tools off.

All notebooks and the annotation tool source code are on GitHub.
Playlist link: https://lnkd.in/gKbGsVsd
GitHub link: https://lnkd.in/g3SkZU-Q
#MachineLearning #ImageAnnotation #Python #DeepLearning #ComputerVision #SAM2 #GroundingDINO #AIforScience #Microscopy #DigitalPathology
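To make the "per-class thresholds" and "multiple detection phrases" ideas concrete, here is a minimal sketch of how they might fit together. It is not the annotation tool's actual code: the `Detection` record, the phrase-to-class mapping, the phrase "round pink structures", and every threshold value are hypothetical, and the detector output is hard-coded rather than coming from Grounding DINO.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    phrase: str   # text prompt that produced this box
    score: float  # detector confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


# Hypothetical configuration: several visual phrases can feed one class,
# and each class gets its own confidence cutoff.
PHRASE_TO_CLASS = {
    "small circular objects": "glomerulus",
    "round pink structures": "glomerulus",
    "long thin tubes": "tubule",
}
CLASS_THRESHOLDS = {"glomerulus": 0.35, "tubule": 0.50}


def filter_detections(detections, phrase_to_class, thresholds, default=0.4):
    """Keep detections whose score meets the threshold of their mapped class."""
    kept = []
    for det in detections:
        label = phrase_to_class.get(det.phrase)
        if label is None:
            continue  # phrase is not mapped to any class
        if det.score >= thresholds.get(label, default):
            kept.append((label, det))
    return kept


# Hard-coded stand-in for real detector output.
detections = [
    Detection("small circular objects", 0.41, (10, 12, 60, 58)),
    Detection("round pink structures", 0.30, (200, 40, 250, 95)),
    Detection("long thin tubes", 0.45, (5, 100, 300, 130)),
]
for label, det in filter_detections(detections, PHRASE_TO_CLASS, CLASS_THRESHOLDS):
    print(label, det.score, det.box)
```

Mapping descriptive phrases to a scientific class name keeps the prompt visual ("small circular objects") while the exported label stays domain-specific ("glomerulus").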
Object Detection and Segmentation in Computer Vision
Summary
Object detection and segmentation in computer vision are techniques that allow computers to find, identify, and outline specific items or regions within images or videos, helping machines "see" and organize visual data much like humans do. These methods are used in fields such as satellite imagery, medical analysis, and robotics to automatically pinpoint and separate objects—even those the system hasn't seen before—using prompts, clicks, or text descriptions.
- Try prompt tuning: Experiment with different descriptions, clicks, or bounding boxes to guide models in isolating objects, especially when working with unfamiliar categories or challenging imagery.
- Use high-resolution images: Select clear, detailed pictures when possible to improve the accuracy of segmentation for small or complex items.
- Explore 3D segmentation: Consider newer models that extend segmentation into three dimensions for tasks like AR/VR or robotics, allowing for richer spatial understanding beyond flat images.
-
Lessons from a full day with SAM 2 on satellite imagery. First off, what is SAM 2? It's a zero-shot, promptable segmentation model, meaning it can segment unseen objects out of the box, without any training on those classes, using only simple prompts like clicks, boxes, or text descriptions (what I used) to guide the process. Why apply it to satellite imagery? SAM 2 excels at segmenting environmental features (e.g., roads, buildings, orchards) without retraining. My top tips?
🛰️ Use high-res imagery (30 cm–1 m/pixel) for crisp segmentation, especially for small objects.
🍃 Adjust prompts for the overhead view (e.g., "green leaves" or "shrubs" instead of "trees"; I even used "grey boxes" to find air conditioning units on top of buildings).
🚗 Small objects are detectable with careful prompting; even counting cars works.
At Wherobots we embed SAM 2 into our raster inference engine. Users write simple SQL/Python prompts with text, inference runs in parallel on tiles, and results are stored as Iceberg tables in S3. From there, you can use the returned vector objects just like regular geospatial data, with no special modeling needed. SAM 2 brings zero-shot segmentation to geospatial data, and when you combine it with prompt tuning, high-res imagery, and distributed inference, you can pull out earth-scale insights in a day. Would love to hear your experiences with vision models on remote sensing! 🌎 I'm Matt and I talk about modern GIS, geospatial data engineering, and how AI and geospatial are changing. 📬 Want more like this? Join 7k+ others learning from my newsletter → forrest.nyc
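Two of the ideas above, running inference on tiles and preferring high-resolution imagery for small objects, can be illustrated with a short sketch. This is not Wherobots' implementation: the function names, the 1024-pixel tile size, and the 128-pixel overlap are all assumptions chosen for illustration.

```python
def tile_windows(width, height, tile=1024, overlap=128):
    """Yield (x0, y0, x1, y1) pixel windows that cover a width x height raster."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap
    y0 = 0
    while True:
        y1 = min(y0 + tile, height)
        x0 = 0
        while True:
            x1 = min(x0 + tile, width)
            yield (x0, y0, x1, y1)
            if x1 >= width:
                break  # this row of tiles reached the right edge
            x0 += step
        if y1 >= height:
            break  # last row of tiles reached the bottom edge
        y0 += step


def size_in_pixels(object_size_m, resolution_m_per_px):
    """Number of pixels an object of a given ground size spans."""
    return object_size_m / resolution_m_per_px


# A 2000 x 1500 raster splits into overlapping windows; edge tiles are smaller.
for window in tile_windows(2000, 1500):
    print(window)

# An illustrative 4.5 m car at 30 cm/pixel versus 1 m/pixel.
print(size_in_pixels(4.5, 0.3), size_in_pixels(4.5, 1.0))
```

The overlap exists so that an object cut by one tile boundary appears whole in a neighboring tile; merging duplicate detections across tiles is a separate step not shown here. The pixel arithmetic shows why resolution matters: the same car spans about 15 pixels at 30 cm/pixel but only 4.5 pixels at 1 m/pixel.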
-
If you took the time to read the SAM 2 paper I shared last week, you learned that in SAM 2, a user clicks a point on an object or draws a bounding box around it, and the model returns a segmentation mask for that particular object. To segment five similar objects in a scene, five separate prompts are needed. SAM 3 shifts from "pointing" at pixels to "naming" concepts. A text prompt like "person wearing glasses" returns masks for all matching objects across an image or video. Clicking one "rusted bolt" generalizes detection to all rusted bolts—a capability SAM 2 lacks. The authors built an automated pipeline that generated training data with 4 million unique concepts. SAM 3 doubles prior accuracy on concept segmentation. On MOSEv2—a video benchmark with severe occlusions, small objects, adverse weather, and low-light scenes—SAM 2.1 scores 47.9%, while SAM 3 reaches 60.3%. Read and ask questions on ChapterPal: https://lnkd.in/ehqQHFWV Download the PDF: https://lnkd.in/ePh3H3Zb
-
Forget flat photos—SAM3D is rewriting how machines understand the world. In this episode, we break down the groundbreaking new model that takes the core ideas of Meta’s Segment Anything Model and expands them into the third dimension, enabling instant 3D segmentation from just a single image. We start with the limitations of traditional 2D vision systems and explain why 3D understanding has always been one of the hardest problems in computer vision. Then we unpack the SAM3D architecture in simple terms: its depth-aware encoder, its multi-plane representation, and how it learns to infer 3D structure even when parts of an object are hidden. You’ll hear real examples—from mugs to human hands to complex indoor scenes—demonstrating how SAM3D reasons about surfaces, occlusions, and geometry with surprising accuracy. We also discuss its training pipeline, what makes it generalize so well, and why this technology could power the next generation of AR/VR, robotics, and spatial AI applications. If you want a beginner-friendly but technically insightful overview of why SAM3D is such a massive leap forward—and what it means for the future of AI—this episode is for you. Resources: SAM3D Website https://ai.meta.com/sam3d/ SAM3D Github https://lnkd.in/g9Snnh4i https://lnkd.in/gEwvPVJc SAM3D Demo https://lnkd.in/gkvxYKic SAM3D Paper https://lnkd.in/gv-5zvmH Need help building computer vision and AI solutions? https://bigvision.ai Start a career in computer vision and AI https://lnkd.in/gwi4kP2M
-
Vision language models (VLMs) have become very impressive at processing scientific imagery, and this is opening a new avenue for how R&D teams can work with microscopy data. Back in graduate school, a lot of my TEM and SEM analysis was done manually: counting particles, measuring fiber diameters, and eyeballing whether a crystalline pattern was real or an imaging artifact. Tools like ImageJ helped with the mechanics, but the workflow was still far from autonomous. Fast forward to today: machine learning and computer vision have become ubiquitous in scientific image analysis. The newer addition is VLMs, which have emerged as a strong and remarkably versatile contender (https://lnkd.in/ea7NRN-x). A new paper in npj Computational Materials presents a systematic evaluation of popular VLMs on microscopy images, spanning standard classification, segmentation, and counting tasks, plus the VLM-specific task of visual question answering. Here is how the models fared across the four tasks:
🔹 Classification: Gemini-2.5 Pro and ChatGPT-5 classified 10 types of SEM images (fibers, particles, nanowires, MEMS) with 77% and 61% zero-shot accuracy, respectively. Their written rationales were often reasonable even when the classification was wrong.
🔹 Segmentation: SAM-2, Meta's general-purpose segmentation model, still produced the most reliable masks across the microscopy datasets. Light parameter tuning closed much of the gap to specialized, domain-trained pipelines.
🔹 Counting: SAM-2 with lightweight post-processing was the most dependable across datasets, handling both isolated and densely packed objects. Interestingly, Llama-3.2 Vision emerged as a competitive alternative.
🔹 Visual question answering: Arguably the most interesting and highest-potential task. Models handle scale bars, morphology, and process stages. Even when a model starts with an incorrect answer, conversational steering by a domain expert is often effective.
A couple of things worth noting. These models operate in both pure-vision and code-based modes (writing Python to run image analysis tools), each with distinct strengths and failure modes. And the models studied here are from mid-2025 or earlier; today's frontier VLMs are meaningfully more capable. For R&D teams working with microscopy, this is a good moment to take VLMs seriously, either directly for first-pass analysis or for generating labeled data to train domain-specialized models.
📄 Vision Language Models for Scientific Image Analysis: An Evaluation Highlighting Opportunities and Challenges, npj Computational Materials, April 21, 2026
🔗 https://lnkd.in/ejZ3V4He
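The post does not say what the paper's "lightweight post-processing" for counting consists of. As one plausible form, here is a minimal sketch that counts connected foreground regions in a binary mask and drops regions below a minimum area. The function name, the 4-connectivity choice, and the toy mask are assumptions for illustration only.

```python
from collections import deque


def count_objects(mask, min_area=1):
    """Count 4-connected foreground regions with at least min_area pixels."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill to measure this region's area.
            area = 0
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                count += 1
    return count


# Toy mask: a 4-pixel blob, a 2-pixel blob, and a 1-pixel speck.
mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]
print(count_objects(mask))              # counts every region
print(count_objects(mask, min_area=2))  # ignores the single-pixel speck
```

An area filter like this only removes specks. Densely packed particles that touch each other would merge into one region and need a splitting step that is not shown here.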
-
Summary
DINOv3 is a new AI model for image understanding, known as a "vision foundation model." It uses self-supervised learning (SSL) to train on a massive dataset of roughly 1.7 billion images without human labels, learning patterns the way a person observes the world. Researchers addressed a trade-off where longer training improved high-level understanding but degraded pixel-level detail. They introduced "Gram anchoring," a technique that preserves spatial detail during training, making DINOv3 excel in both high-level recognition and fine detail. It achieves state-of-the-art results in tasks like object detection and depth estimation, making it a versatile tool for computer vision applications.
Methodology
DINOv3 builds on four pillars: data scaling, model scaling, a new training objective, and post-training refinement. It uses a 1.689-billion-image dataset (LVD-1689M), including ~10% ImageNet. A Vision Transformer (ViT) model with 7 billion parameters was trained for 1 million steps using DINOv2 objectives. Gram anchoring, the core innovation, prevents feature degradation by comparing the model's Gram matrix with that of a "Gram teacher" checkpoint. Post-training includes resolution scaling, distillation to smaller ViT models, and text alignment with a separate text encoder for open-vocabulary understanding.
Results and Discussion
DINOv3 sets new state-of-the-art (SOTA) benchmarks in visual representation. It produces exceptionally strong dense features, reaching 55.9 mIoU on ADE20k segmentation versus 49.5 mIoU for DINOv2 and 42.7 mIoU for SigLIP 2, and leads in 3D keypoint matching, video tracking, and unsupervised object discovery. For the first time, it matches text-supervised models like PEcore and SigLIP 2 in global image classification while setting SOTA in instance retrieval. As a "frozen backbone," DINOv3 achieves SOTA in object detection and semantic segmentation, even with a lightweight 100M-parameter head and no fine-tuning. Its domain versatility is shown by a variant trained on 493 million satellite images, which achieves SOTA in geospatial tasks.
Implications of the Study
DINOv3 demonstrates that self-supervised learning (SSL) can surpass traditional supervised and weakly-supervised methods. It supports the vision of a "one backbone" model, handling tasks like object detection, segmentation, depth estimation, and 3D understanding with a single frozen model. "Gram anchoring" resolves the global-vs-dense trade-off, enabling larger SSL models (10B+ parameters) without feature loss. The method also supports training in specialized domains like medical imaging without labeled data. Model distillation further makes this technology accessible to developers without requiring supercomputers.
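The Gram-matrix comparison described in the methodology can be sketched in a few lines. This is an illustration, not the paper's training code: the L2 normalization of patch features, the squared Frobenius distance, and the absence of any scaling factor are assumptions, and the function names are hypothetical.

```python
import math


def l2_normalize(rows):
    """Scale each patch feature vector to unit length."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def gram(features):
    """Pairwise dot products between normalized patch features (P x P)."""
    x = l2_normalize(features)
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in x] for xi in x]


def gram_loss(student, teacher):
    """Squared Frobenius distance between the student and teacher Gram matrices."""
    gs, gt = gram(student), gram(teacher)
    return sum((a - b) ** 2 for rs, rt in zip(gs, gt) for a, b in zip(rs, rt))


teacher = [[1.0, 0.0], [0.0, 1.0]]     # two patches with dissimilar features
rotated = [[0.0, 1.0], [-1.0, 0.0]]    # same pairwise structure, rotated features
collapsed = [[1.0, 0.0], [1.0, 0.0]]   # both patches now look identical

print(gram_loss(rotated, teacher))     # similarity structure preserved
print(gram_loss(collapsed, teacher))   # similarity structure lost
```

The point of comparing Gram matrices rather than the features themselves is that the loss only constrains how similar patches are to one another. The features stay free to change, as in the rotated case, as long as that patch-to-patch structure matches the teacher checkpoint.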
-
We are not talking enough about the Meta Segment Anything Model 2 (SAM 2)! It was released a few weeks ago and is making an impact. It is a next-generation model for object segmentation, bringing SOTA capabilities to images and videos.

𝗪𝗵𝗮𝘁 𝗶𝘀 𝗢𝗯𝗷𝗲𝗰𝘁 𝗦𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻?
Object segmentation is a fundamental task in computer vision that involves identifying and isolating objects within an image or video by assigning a unique label to each pixel belonging to that object. It helps distinguish between different objects and the background, allowing computers to understand scenes granularly.

💡 𝗨𝘀𝗲 𝗖𝗮𝘀𝗲𝘀 𝗳𝗼𝗿 𝗢𝗯𝗷𝗲𝗰𝘁 𝗦𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻:
- Video Editing and Effects: Precisely isolate objects in videos for adding effects or modifying specific parts.
- Augmented Reality (AR): Enhance AR experiences by accurately recognizing and tracking objects in real-time.
- Healthcare: Segment medical images like X-rays or MRIs to assist doctors in diagnosing and planning treatments.
- Autonomous Vehicles: Detect and track objects like pedestrians, vehicles, and road signs to navigate safely.
- Retail and Fashion: Enable virtual try-on solutions by segmenting clothing items from images or videos.
- Robotics: Guide robots in complex environments by helping them identify and interact with specific objects.

🔹 𝗞𝗲𝘆 𝗙𝗲𝗮𝘁𝘂𝗿𝗲𝘀 𝗼𝗳 𝗦𝗔𝗠 𝟮:
- Segment any object in any image or video with a click, box, or mask.
- Robust zero-shot performance, even on unfamiliar objects.
- Real-time interactivity with streaming inference.
- Unified segmentation for both images and videos in a single model.

📝 𝗪𝗵𝘆 𝗦𝗔𝗠 𝟮 𝗦𝘁𝗮𝗻𝗱𝘀 𝗢𝘂𝘁:
- First unified model for segmenting across both images and videos.
- Outperforms existing video segmentation models, especially in tracking parts.
- Significantly reduces interaction time compared to existing methods.

📊 𝗧𝗵𝗲 𝗔𝗽𝗽𝗿𝗼𝗮𝗰𝗵:
SAM 2 extends the powerful capabilities of SAM into the video domain. The per-session memory module lets SAM 2 remember and track objects throughout video frames, even if they're temporarily hidden.

📚 𝗘𝘅𝗽𝗹𝗼𝗿𝗲 𝘁𝗵𝗲 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 𝗮𝗻𝗱 𝗗𝗮𝘁𝗮𝘀𝗲𝘁:
Meta open-sourced the Segment Anything Video Dataset (SA-V), with ~600K+ masklets across ~51K videos collected from 47 countries. Annotations include whole objects, parts, and even challenging occlusions. This helps with their commitment to transparency and representation.

👉 Ready to explore the power of SAM 2?
- Try the demo: https://lnkd.in/gSD9a_R4
- Download the model: https://lnkd.in/gQq3FsPc
- Download the data: https://lnkd.in/gm4JAkYU
- Read the paper: https://bit.ly/3zsk5dh
-
Open Vocabulary Detection with Qwen2.5-VL 🔥 🔥 🔥 I've been diving into how well Vision-Language Models (VLMs) like Qwen2.5-VL can understand images and find objects. This builds on my earlier tutorials about zero-shot detection with models like GroundingDINO and YOLO-World (links in comments). I wanted to see how well VLMs can not only detect objects but also understand where they are in an image. As usual, I've created a notebook to show you what I've been testing: - Object detection using Qwen2.5-VL with different types of instructions (prompts). - Trying to find single and multiple objects in an image. - Using descriptions like "the object on the left" or "the closest object" to find specific items. - Asking the model to reason about objects: "What would I use to open this?" ⮑ 🔗 notebook: https://lnkd.in/dJchZZiJ More examples are in the comments! 👇🏻
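When a VLM is asked for detections, the boxes come back as text that has to be parsed and mapped onto the original image. The sketch below is not taken from the linked notebook. It assumes the model replies with a JSON list of objects carrying a `bbox_2d` field (`[x1, y1, x2, y2]`) and a `label` field, possibly wrapped in a fenced json block, and that coordinates refer to the resized image the model actually saw. Treat the field names and that coordinate convention as assumptions to check against your own model output.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick marker, built indirectly to keep this snippet clean


def parse_boxes(response_text):
    """Extract a list of {"bbox_2d": [...], "label": ...} dicts from model text."""
    # Use the body of a fenced block if the reply contains one.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    match = re.search(pattern, response_text, re.DOTALL)
    payload = match.group(1) if match else response_text
    try:
        items = json.loads(payload)
    except json.JSONDecodeError:
        return []  # malformed reply: report no detections rather than crash
    if isinstance(items, dict):
        items = [items]
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        box = item.get("bbox_2d") if isinstance(item, dict) else None
        if isinstance(box, list) and len(box) == 4:
            valid.append(item)
    return valid


def rescale_box(box, model_size, original_size):
    """Map a box from the model's input resolution back to the original image."""
    model_w, model_h = model_size
    orig_w, orig_h = original_size
    sx, sy = orig_w / model_w, orig_h / model_h
    x1, y1, x2, y2 = box
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]


# Hypothetical reply to a prompt like "What would I use to open this?"
reply = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "bottle opener"}]\n' + FENCE
for item in parse_boxes(reply):
    print(item["label"], rescale_box(item["bbox_2d"], (500, 500), (1000, 1000)))
```

Returning an empty list on malformed JSON is a deliberate choice here: generated text is not guaranteed to be valid, and a detection loop usually should not crash on one bad reply.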
-
I did a small hands-on comparison between YOLOv26 (Ultralytics) and Detectron2 (Meta AI) on the same video to see how they behave in a real-world scenario. From my experience:
⚡ YOLOv26 (Ultralytics): It feels much faster and more stable for video processing. Bounding boxes are solid and confidence scores are high, making it a great choice for real-time applications and deployment.
🧠 Detectron2 (Meta AI): Still a very powerful framework, especially for research, segmentation, and detailed analysis. However, it's heavier and slower, so it's not always ideal for real-time pipelines.
My takeaway: If speed and production deployment matter, YOLOv26 is the better option. If you need flexibility, advanced segmentation, or research-level control, Detectron2 remains a strong tool. At the end of the day, it's all about choosing the right model for the right use case.
#ComputerVision #YOLO #Detectron2 #Ultralytics #MetaAI #AI #DeepLearning #ObjectDetection #VideoAnalytics
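Impressions like "faster" and "higher confidence" are easier to compare when both models run through the same measurement loop. Here is a minimal, framework-agnostic sketch of such a loop. The `benchmark` function and the `detector(frame)` contract, which returns a list of `(label, score, box)` tuples, are hypothetical, and the stand-in detector is a stub rather than a real YOLO or Detectron2 call.

```python
import time


def benchmark(detector, frames):
    """Run detector(frame) -> list of (label, score, box) on every frame."""
    scores = []
    start = time.perf_counter()
    for frame in frames:
        for _label, score, _box in detector(frame):
            scores.append(score)
    elapsed = time.perf_counter() - start
    return {
        "frames": len(frames),
        "fps": len(frames) / elapsed if elapsed > 0 else float("inf"),
        "detections": len(scores),
        "mean_score": sum(scores) / len(scores) if scores else 0.0,
    }


def stub_detector(frame):
    """Stand-in for a real model wrapper; always reports one car."""
    return [("car", 0.8, (0, 0, 10, 10))]


frames = [None] * 5  # placeholders for decoded video frames
print(benchmark(stub_detector, frames))
```

Wrapping each framework behind the same callable means both are timed on identical frames. Keep in mind that confidence scores from different models are not calibrated against each other, so a higher mean score alone does not imply better accuracy.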