Zero-Shot Classification Techniques for Robotic Motion Control


Summary

Zero-shot classification techniques for robotic motion control allow robots to perform new tasks or handle unfamiliar objects without extra training or demonstrations. These methods use advanced models that interpret visual or language instructions to generate actions in real time, bridging perception and action even in complex or changing environments.

  • Explore open-world skills: consider how these techniques let robots adapt to tasks or objects they’ve never seen before, making them much more flexible in real-world scenarios.
  • Embrace minimal data: remember that zero-shot approaches reduce the need for large training datasets, so robots can start working in new settings faster and with less preparation.
  • Connect reasoning and action: look for models that combine perception, planning, and control, as these can help robots accomplish diverse tasks just from high-level instructions or single camera views.
Summarized by AI based on LinkedIn member posts
  • Animesh Garg

    RL + Foundation Models in Robotics. Faculty at Georgia Tech. Prev at Nvidia

    18,891 followers

    Robotics data is expensive and slow to collect. A lot of video is available online, but it is not readily usable for robotics because it lacks action labels. AMPLIFY solves this problem by learning Actionless Motion Priors that unlock better sample efficiency, generalization, and scaling for robot learning.

    Our key insight is to factor the problem into two stages:
    • The "what": predict the visual dynamics required to accomplish a task
    • The "how": map predicted motions to low-level actions

    This decoupling enables remarkable generalizability: our policy can perform tasks where we have NO action data, only videos. We outperform SOTA BC baselines on this by 27x 🤯

    AMPLIFY is composed of three stages:
    1. Motion Tokenization: we track dense keypoint grids through videos and compress their trajectories into discrete motion tokens.
    2. Forward Dynamics: given an image and a task description (e.g., "open the box"), we autoregressively predict a sequence of motion tokens representing how keypoints should move over roughly the next second. This model can train on ANY text-labeled video data: robot demonstrations, human videos, YouTube videos.
    3. Inverse Dynamics: we decode predicted motion tokens into robot actions. This module learns the robot-specific mapping from desired motions to actions, and it can train on ANY robot interaction data, not just expert demonstrations (think off-task data, play data, or even random actions).

    So, does it actually work?

    Few-shot learning: given just 2 action-annotated demos per task, AMPLIFY nearly doubles SOTA few-shot performance on LIBERO. This is possible because our Actionless Motion Priors provide a strong inductive bias that dramatically reduces the amount of robot data needed to train a policy.

    Cross-embodiment learning: we train the forward dynamics model on both human and robot videos, but the inverse model sees only robot actions. Result: 1.4× average improvement on real-world tasks. Our system successfully transfers motion information from human demonstrations to robot execution.

    And now my favorite result: AMPLIFY enables zero-shot task generalization. We train on LIBERO-90 tasks and evaluate on tasks where we’ve seen no actions, only pixels. While our best baseline achieves ~2% success, AMPLIFY reaches a 60% average success rate, outperforming SOTA behavior cloning baselines by 27x.

    This is a new way to train VLAs for robotics that doesn't have to start with large-scale teleoperation. Instead of collecting millions of robot demonstrations, we just need to teach robots how to read the language of motion. Then every video becomes training data.

    Led by Jeremy Collins & Loránd Cheng in collaboration with Kunal Aneja, Albert Wilcox, and Benjamin Joffe at the College of Computing at Georgia Tech. Check out our paper and project page for more details:
    📄 Paper: https://lnkd.in/eZif-mB7
    🌐 Website: https://lnkd.in/ezXhzWGQ
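    The tokenize-predict-decode recipe in the post can be sketched in miniature. Everything below is a toy stand-in under assumed shapes: a tiny hand-built codebook quantizes 2D keypoint displacements into discrete motion tokens, and a stub "inverse dynamics" maps tokens back to action deltas. The learned forward-dynamics model (stage 2) is omitted; this only illustrates the factorization, not AMPLIFY's actual implementation.

```python
import numpy as np

def tokenize_motion(trajectories, codebook):
    """Stage 1 sketch: vector-quantize per-step keypoint displacements to token ids."""
    # trajectories: (K, T, 2) keypoint tracks; codebook: (V, 2) motion prototypes
    deltas = trajectories[:, 1:] - trajectories[:, :-1]           # (K, T-1, 2)
    flat = deltas.reshape(-1, 2)                                  # (K*(T-1), 2)
    dists = np.linalg.norm(flat[:, None] - codebook[None], axis=-1)
    return dists.argmin(axis=-1).reshape(deltas.shape[:2])        # (K, T-1)

def decode_tokens(tokens, codebook, gain=1.0):
    """Stage 3 sketch: map motion tokens to action deltas (mean over keypoints)."""
    return gain * codebook[tokens].mean(axis=0)                   # (T-1, 2)

rng = np.random.default_rng(0)
codebook = np.array([[0., 0.], [1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
idx = rng.integers(0, len(codebook), size=(4, 6))                 # synthetic motions
tracks = np.cumsum(codebook[idx], axis=1)                         # (4, 6, 2) tracks
tokens = tokenize_motion(tracks, codebook)
actions = decode_tokens(tokens, codebook)
print(tokens.shape, actions.shape)  # (4, 5) (5, 2)
```

    Because the synthetic tracks are built from codebook displacements, quantization recovers them exactly; with real video, the codebook would itself be learned and tokens would be lossy compressions of dense keypoint motion.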

  • Ilir Aliu

    AI & Robotics | 140k+ | 22Astronauts

    101,483 followers

    What if a robot could simulate the physical world from a single image? [📍Bookmark Paper & GitHub for later]

    PointWorld-1B from Stanford and NVIDIA is a large 3D world model that predicts how an entire scene will move, given RGB-D input and robot actions. The key idea is simple but powerful: actions are not joint angles❗️ They are 3D point flows sampled from the robot’s own geometry. The model reasons in the same space where physics actually happens.

    • State and action are unified as 3D point trajectories.
    • One forward pass predicts full-scene motion for one second.
    • No object masks, no trackers, no material priors.
    • Trained on ~500 hours of real and simulated robot interaction data.
    • Micrometer-level trajectory error, thinner than a human hair.
    • Works across embodiments, from single arm to bimanual humanoid.

    The model is then used inside an MPC planner to push objects, manipulate cloth, and use tools, all zero-shot, from a single fixed camera and without finetuning. This feels like a shift from “learning policies” to “learning physics in 3D”.

    Thanks for sharing, @wenlong_huang
    📍Project: point-world.github.io
    📍Paper: arxiv.org/abs/2601.03782
    📍GitHub: https://lnkd.in/dqsjUTxg (will be published soon)

    Weekly robotics and AI insights. Subscribe free: scalingdeep.tech
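    The "world model inside an MPC planner" pattern can be sketched generically. Below, `world_model` is a trivial stand-in (points move exactly by the commanded flow) and the planner is plain random shooting; PointWorld's learned dynamics and actual planner interface are not public in this post, so every name and shape here is an assumption.

```python
import numpy as np

def world_model(points, action_flow):
    """Toy stand-in for a learned 3D world model: points move by the flow."""
    return points + action_flow

def mpc_plan(points, goal, n_samples=256, horizon=5, scale=0.05, seed=0):
    """Random-shooting MPC over point-flow action sequences: sample, roll out, keep the best."""
    rng = np.random.default_rng(seed)
    best_seq = np.zeros((horizon,) + points.shape)      # zero-action baseline
    best_cost = np.linalg.norm(points - goal)
    for _ in range(n_samples):
        seq = rng.normal(0.0, scale, size=(horizon,) + points.shape)
        sim = points
        for flow in seq:                                # roll out the model
            sim = world_model(sim, flow)
        cost = np.linalg.norm(sim - goal)               # distance to goal cloud
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

pts = np.zeros((8, 3))                  # points sampled on the robot/scene
goal = pts + np.array([0.1, 0.0, 0.0])  # desired scene configuration
plan, cost = mpc_plan(pts, goal)
print(plan.shape)  # (5, 8, 3)
```

    The appeal of planning in point space is visible even in this sketch: state, action, and goal all live in the same 3D coordinates, so the cost function is just a distance between predicted and desired point clouds.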

  • Murtaza Dalal

    Robotics ML Engineer @ Tesla Optimus | CMU Robotics PhD

    2,084 followers

    Can my robot cook my food, tidy my messy table, rearrange my dresser, and do much, much more without ANY demos or real-world training data? Introducing ManipGen: a generalist agent for manipulation that can solve long-horizon robotics tasks entirely zero-shot, from text input!

    Key idea: many manipulation tasks of interest can be decomposed into two phases, contact-free reaching (aka motion planning!) and contact-rich local interaction. The latter is hard to learn, and we take a sim2real transfer approach!

    We define local policies, which operate in a local region around an object of interest. They are uniquely well-suited to generalization and sim2real transfer because they are invariant to:
    1) Absolute pose
    2) Skill orders
    3) Environment configurations

    As an overview, our approach 1) acquires generalist behaviors for local skills at scale using RL, 2) distills these behaviors into visuomotor policies using multitask DAgger, and 3) deploys local policies in the real world using VLMs and motion planning.

    Phase 1: train state-based, single-object policies to acquire skills such as picking, placing, opening, and closing. We train policies using PPO across thousands of objects, designing reward and observation spaces for efficient learning and effective sim2real transfer.

    Phase 2: we need visuomotor policies to deploy on robots! We distill single-object experts into multi-task policies using online imitation learning (aka DAgger) that observe local visual (wrist cam) input with edge and hole augmentation to match real-world depth noise.

    To deploy local policies in the real world, we decompose the task into components (GPT-4o), estimate where to go using Grounded SAM, and motion plan using Neural MP. For control, we use Industreallib from NVIDIA, an excellent library for sim2real transfer!

    ManipGen can solve long-horizon tasks in the real world entirely zero-shot, generalizing across objects, poses, environments, and scene configurations! We outperform SOTA approaches such as SayCan, OpenVLA, LLMTrajGen, and VoxPoser across 50 tasks by 36%, 76%, 62%, and 60%!

    ManipGen exhibits exciting capabilities such as performing manipulation in tight spaces and with clutter, entirely zero-shot! From putting items on the shelf, to carefully extracting the red pepper from clutter, to putting large items in drawers, ManipGen is quite capable. By training local policies at scale on thousands of objects, ManipGen generalizes to some pretty challenging out-of-distribution objects that don’t look anything like what was in training, such as pliers and clamps, as well as deformable objects such as wire.

    This work was done at the Carnegie Mellon University Robotics Institute, with co-lead Min Liu, as well as Deepak Pathak and Russ Salakhutdinov, and in collaboration with Walter Talbott, Chen Chen, Ph.D., and Jian Zhang from Apple.

    Paper, videos, and code (coming soon!) at https://lnkd.in/ekjWPXHM
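    The two-phase decomposition (contact-free reaching, then a contact-rich local policy) can be written as a small dispatcher. The functions below are hypothetical stand-ins: `motion_plan` is a straight-line interpolator where the real system uses Neural MP, and `local_policy` returns a stub result where the real system runs a distilled visuomotor policy. Only the control flow mirrors the post.

```python
import numpy as np

def motion_plan(start, target, step=0.1):
    """Phase 1 stand-in: contact-free reach via straight-line waypoints."""
    n = max(1, int(np.ceil(np.linalg.norm(target - start) / step)))
    return [start + (target - start) * (i + 1) / n for i in range(n)]

def local_policy(skill, pose):
    """Phase 2 stand-in: contact-rich local skill executed near the object."""
    return {"skill": skill, "pose": pose.round(2).tolist(), "done": True}

def run_task(plan_steps, start):
    """Execute a decomposed task: for each (skill, target), reach then interact."""
    pose, log = start, []
    for skill, target in plan_steps:
        waypoints = motion_plan(pose, target)   # phase 1: reach the region
        pose = waypoints[-1]
        log.append(local_policy(skill, pose))   # phase 2: local interaction
    return log

# A hypothetical decomposition (the real system gets this from GPT-4o + Grounded SAM)
steps = [("pick", np.array([0.4, 0.1, 0.2])), ("place", np.array([0.1, -0.3, 0.3]))]
log = run_task(steps, start=np.zeros(3))
print([e["skill"] for e in log])  # ['pick', 'place']
```

    The invariances claimed in the post fall out of this structure: the local policy only ever sees the neighborhood it is dropped into, so absolute pose, skill ordering, and global scene layout are handled entirely by the reaching phase.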

  • Jiafei Duan

    Robotics & AI PhD student at University of Washington, Seattle

    6,677 followers

    🚀 Introducing MolmoAct: An Open Action Reasoning Model for Robotics

    Reasoning is central to purposeful action — yet most robotic foundation models still map perception and instructions directly to control, limiting adaptability, generalization, and semantic grounding. We present Action Reasoning Models (ARMs) — a new class of robotic foundation models that integrate perception, planning, and control through a structured three-stage pipeline.

    Our model, MolmoAct:
    🔹 Encodes observations & instructions into depth-aware perception tokens
    🔹 Generates mid-level spatial plans as editable trajectory traces
    🔹 Predicts precise low-level actions for explainable & steerable behavior

    Key Results:
    ✅ 70.5% zero-shot accuracy on SimplerEnv Visual Matching — outperforming closed-source π0 and GR00T N1
    ✅ 86.6% avg. success on LIBERO (+6.3% over ThinkAct on long-horizon tasks)
    ✅ Real-world fine-tuning: +10% (single-arm) & +22.7% (bimanual) task progression over π0-FAST
    ✅ +23.3% improvement on out-of-distribution generalization
    ✅ Top human-preference scores for open-ended instruction following & trajectory steering

    New Release:
    📦 MolmoAct Dataset — the first mid-training robot dataset with 10,000+ high-quality trajectories across diverse scenarios. Training with this dataset yields an avg. +5.5% performance boost.

    We are releasing:
    🔓 Model weights
    🔓 Training code
    🔓 MolmoAct Dataset
    🔓 Action reasoning dataset

    Access through our Ai2 blogpost: https://lnkd.in/gg-43dnw

    MolmoAct is not just SOTA — it’s an open blueprint for building ARMs that transform perception into purposeful action through grounded reasoning.

    #Robotics #FoundationModels #AI #ActionReasoning #MolmoAct #OpenSource
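    The three-stage ARM pipeline (perceive, plan a trace, act) can be shown schematically. Each stage below is a placeholder with assumed interfaces — the real MolmoAct stages are learned models — but the sketch does demonstrate the distinctive property the post highlights: the mid-level trace is an ordinary array you can edit before it is turned into actions.

```python
import numpy as np

def perceive(image, depth, instruction):
    """Stage 1 placeholder: fuse observation + instruction into 'perception tokens'."""
    tokens = np.concatenate([image.mean(axis=(0, 1)), [depth.mean()]])
    return {"tokens": tokens, "instruction": instruction}

def plan_trace(perception, n_waypoints=4):
    """Stage 2 placeholder: emit an editable mid-level trajectory trace (2D waypoints)."""
    t = np.linspace(0.0, 1.0, n_waypoints)
    return np.stack([t, t ** 2], axis=1)       # (n_waypoints, 2) spatial plan

def act(trace, gain=1.0):
    """Stage 3 placeholder: convert trace waypoints into low-level action deltas."""
    return gain * (trace[1:] - trace[:-1])

img = np.ones((4, 4, 3)) * 0.5
dep = np.ones((4, 4))
p = perceive(img, dep, "open the box")
trace = plan_trace(p)
trace[-1] = [1.0, 0.5]      # the trace is editable: steer the final waypoint
actions = act(trace)
print(trace.shape, actions.shape)  # (4, 2) (3, 2)
```

    Exposing the plan as an editable intermediate is what makes the behavior "steerable": a human (or another model) can adjust waypoints between stages 2 and 3 without retraining anything.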

  • Akshet Patel 🤖

    Robotics Engineer | Creator

    49,618 followers

    What if a robot hand could grasp over 500 unseen objects, using just a single camera view? [⚡Join 2400+ Robotics enthusiasts - https://lnkd.in/dYxB9iCh]

    A paper by Hui Zhang, Zijian WU, Linyi Huang, Sammy Christen, and Jie Song from ETH Zürich and The Hong Kong University of Science and Technology introduces a zero-shot dexterous grasping system that generalises from simulation to real-world objects.

    "RobustDexGrasp: Robust Dexterous Grasping of General Objects from Single-view Perception"

    • Achieves 94.6% success on 512 real-world objects, trained on only 35 simulated objects
    • Uses a hand-centric representation based on dynamic distance vectors between finger joints and object surfaces
    • Employs a mixed curriculum learning strategy: imitation learning from a privileged teacher policy, followed by reinforcement learning under disturbances
    • Demonstrates robustness to observation noise, actuator inaccuracies, and external forces
    • Enables zero-shot grasping in cluttered environments and task-driven manipulation guided by vision-language models

    This approach enhances the adaptability of robotic hands, allowing for reliable grasping without extensive prior knowledge of object properties. It opens avenues for deploying dexterous robots in unstructured environments with minimal training data.

    If robots can grasp novel objects with such reliability, what complex manipulation tasks should we tackle next?

    Paper: https://lnkd.in/eb3itwhF
    Project Page: https://lnkd.in/efbaBz4H

    #DexterousManipulation #ReinforcementLearning #ZeroShotLearning #RoboticsResearch #ICRA2025
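    A hand-centric observation like the distance vectors described above is simple to compute: for each finger joint, take the vector to its nearest point on the observed object surface. The shapes and toy geometry below are assumptions for illustration; the paper's actual representation is built from single-view perception and updated dynamically during the grasp.

```python
import numpy as np

def distance_vectors(joint_positions, surface_points):
    """For each finger joint, return the vector to its closest object surface point."""
    # joint_positions: (J, 3); surface_points: (N, 3)
    diff = surface_points[None, :, :] - joint_positions[:, None, :]  # (J, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                             # (J, N)
    nearest = dist.argmin(axis=1)                                    # (J,)
    return diff[np.arange(len(joint_positions)), nearest]            # (J, 3)

# Two joints hovering above a tiny three-point "object surface" (made-up geometry)
joints = np.array([[0.0, 0.0, 0.1], [0.05, 0.0, 0.12]])
surface = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
vecs = distance_vectors(joints, surface)
print(vecs.shape)  # (2, 3)
```

    Because the observation is relative to the hand rather than to a world frame, the same policy input looks similar across different object poses and shapes, which is plausibly what makes this representation transfer so well from 35 simulated objects to hundreds of real ones.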
