Robotics Manipulation Using Partial Scene Data


Summary

Robotics manipulation using partial scene data refers to techniques that allow robots to interact with their environment even when they don't have a complete picture of everything around them. By combining smart sensing, structured world models, and learning from videos or human demonstrations, robots can reliably perform tasks like picking, placing, or moving objects in unpredictable or partially visible settings.

  • Integrate sensory inputs: Combine visual, tactile, and scene graph data to help robots make informed decisions even when their view is obstructed or incomplete.
  • Utilize structured knowledge: Build and update dynamic world models so robots can adapt their actions as the environment changes or as new information becomes available.
  • Apply learning techniques: Train robots using human demonstrations, synthetic data, or generated video flows to teach them how to manipulate objects in a wide range of real-world scenarios.
Summarized by AI based on LinkedIn member posts
  • Rangel Isaías Alvarado Walles

    Robotics & AI Engineer | AI Engineer | Machine Learning | Deep Learning | Computer Vision | Agentic AI | Reinforcement Learning | Self-Driving Cars | IIoT | AIOps | MLOps | LLMOps | DevOps | AIOps | Embodied AI

    5,378 followers

    AI/robotics researchers have a new reason to celebrate! 🚀 The paper 'Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning' introduces KG-M3PO, a framework that unifies perception, knowledge, and policy for multi-task robotic manipulation in partially observable environments.

    🔁 At a Glance
    💡 Goal: Enable robots to perform diverse manipulation tasks reliably by integrating a dynamic, structured world model.
    ⚙️ Approach:
    - Model-based policy optimization: controls robot actions via an online 3D scene graph that grounds open-vocabulary detections.
    - The scene graph updates dynamically, capturing spatial, containment, and affordance relations.
    - A GNN encoder is trained end-to-end, so relational features are shaped directly by control performance.
    - Multimodal observations (visual, proprioceptive, language, graph) are fused into a shared latent space.
    - The policy is conditioned on lightweight graph queries, improving decision-making.

    📈 Impact (Key Results)
    - 🧪 Success in occlusion, layout-shift, and distractor scenarios.
    - Higher success rates and sample efficiency.
    - Enhanced generalization to new objects and configurations.
    - 🔄 Robust long-horizon behaviors demonstrate the importance of structured knowledge.
    - 🤖 Applications include multi-task robots that understand and adapt in complex, partially observed environments.

    🔬 Experiments
    - 🧪 Benchmarks: Isaac Sim, with Franka and UR5 robots.
    - 🎯 Tasks: Pick, place, open, and retrieve, in fully and partially observable setups.
    - 🦾 Setup: 1024 parallel environments with GPU-accelerated physics.
    - 📐 Inputs: Images, scene graphs, and multimodal sensory data.

    🛠 How to Implement
    1️⃣ Build scene graphs with BBQ from sensory data.
    2️⃣ Encode graphs using GNNs trained with the RL loss.
    3️⃣ Fuse visual, proprioceptive, and graph features into the state.
    4️⃣ Train the RL policy end-to-end, propagating gradients through the GNN.
    5️⃣ Deploy in simulation, with real-world tests left for future work.

    📦 Deployment Benefits
    ✅ Enhanced generalization and sample efficiency.
    ✅ Robust performance under occlusion and partial views.
    ✅ Scalable multi-task learning.
    ✅ Better long-term behavior with structured relational understanding.

    📣 Takeaway
    The reported results indicate that integrating structured, continually updated world knowledge into RL policies significantly boosts robotic manipulation capabilities. It paves the way for more adaptable, intelligent robots that operate reliably in unstructured environments.
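
To make the idea of "lightweight graph queries" over a dynamically updated scene graph more concrete, here is a minimal Python sketch. It is not the KG-M3PO implementation: the `SceneGraph` class, the relation names (`inside`, `part_of`, `affords`, `on`), and the example objects are all hypothetical, chosen only to illustrate how containment and affordance relations could be stored, queried, and updated as the scene changes.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): edges are (subject, relation, object) triples."""
    edges: set = field(default_factory=set)

    def add_relation(self, subject, relation, obj):
        self.edges.add((subject, relation, obj))

    def remove_relation(self, subject, relation, obj):
        # discard() is a no-op if the relation is not present.
        self.edges.discard((subject, relation, obj))

    def query(self, relation, obj):
        """Return all subjects that stand in `relation` to `obj`, sorted for determinism."""
        return sorted(s for s, r, o in self.edges if r == relation and o == obj)

    def container_of(self, subject):
        """Return what `subject` is inside, or None if it is not inside anything."""
        for s, r, o in self.edges:
            if s == subject and r == "inside":
                return o
        return None


graph = SceneGraph()
graph.add_relation("mug", "inside", "drawer")
graph.add_relation("drawer", "part_of", "cabinet")
graph.add_relation("drawer", "affords", "open")

# The mug is occluded, but a cheap query still tells the policy where to look.
hidden_in = graph.container_of("mug")

# After the robot retrieves the mug, the graph is updated online.
graph.remove_relation("mug", "inside", "drawer")
graph.add_relation("mug", "on", "table")
```

In a learned system the query result would be encoded (for example by a GNN) and fused with the other observations rather than read as a string; the sketch only shows the bookkeeping side.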

  • Supriya Rathi

    110k+ | India #1. World #10 | Physical-AI | Podcast Host - SRX Robotics | Connecting founders, researchers, & markets | DM to post your research | DeepTech

    113,186 followers

    Presenting FEELTHEFORCE (FTF): a robot learning system that models human tactile behavior to learn force-sensitive manipulation.

    Using a tactile glove to measure contact forces and a vision-based model to estimate hand pose, the authors train a closed-loop policy that continuously predicts the forces needed for manipulation. This policy is retargeted to a Franka Panda robot with tactile gripper sensors using shared visual and action representations. At execution, a PD controller modulates gripper closure to track the predicted forces, enabling precise, force-aware control. This approach grounds robust low-level force control in scalable human supervision, achieving a 77% success rate across 5 force-sensitive manipulation tasks.

    #research: https://lnkd.in/dXxX7Enw
    #github: https://lnkd.in/dQVuYTDJ
    #authors: Ademi Adeniji, Zhuoran (Jolia) Chen, Vincent Liu, Venkatesh Pattabiraman, Raunaq Bhirangi, Pieter Abbeel, Lerrel Pinto, Siddhant Haldar
    New York University, University of California, Berkeley, NYU Shanghai

    Controlling fine-grained forces during manipulation remains a core challenge in robotics. While robot policies learned from robot-collected data or simulation show promise, they struggle to generalize across the diverse range of real-world interactions. Learning directly from humans offers a scalable solution, enabling demonstrators to perform skills in their natural embodiment and in everyday environments. However, visual demonstrations alone lack the information needed to infer precise contact forces.
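
The PD force-tracking step described in the post can be sketched in a few lines of Python. This is an illustrative toy, not the FeelTheForce controller: the linear contact model (force = stiffness × closure), the gains, the units (N and mm), and the function name `track_forces` are all assumptions made for the example.

```python
def track_forces(predicted_forces, stiffness=2.0, kp=0.1, kd=0.0005, dt=0.01):
    """Adjust gripper closure so the measured force follows the predicted force.

    Assumptions (hypothetical): closure is in mm, force in N, and the contact
    behaves like a linear spring, so measured force = stiffness * closure.
    Returns the measured force after each control step.
    """
    closure = 0.0
    prev_error = None
    measured_log = []
    for target in predicted_forces:
        measured = stiffness * closure
        error = target - measured
        # Derivative term is zero on the first step, when no previous error exists.
        d_error = 0.0 if prev_error is None else (error - prev_error) / dt
        # PD law on the force error, applied as an increment to gripper closure.
        closure += kp * error + kd * d_error
        prev_error = error
        measured_log.append(stiffness * closure)
    return measured_log


# A policy that asks for a constant 5 N grip over 200 control steps.
force_log = track_forces([5.0] * 200)
```

With these particular gains the toy loop is stable and the measured force settles at the 5 N target; on real hardware the gains would need tuning against the actual gripper and sensor dynamics.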

  • Wenlong Huang

    CS PhD Student at Stanford (AI / Robotics)

    2,462 followers

    What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. 🌐 dream2flow.github.io by Stanford University

    🔹 Robot manipulation is about inducing changes in an environment through actions. We observe that video models (e.g., Veo) excel at producing plausible object motions from an in-the-wild image and language instructions. Intriguingly, these motions are more physically realistic when the actor is a human rather than a robot, likely because the internet contains far more human interaction data than robot data.

    🔹 But how do we turn those generated videos into low-level robot actions? This is a nuanced question beyond simple retargeting, because strategies taken by a human may not work on a robot.

    🔹 We propose Dream2Flow, which uses 3D object flow to separate what should happen in the scene from how a robot should realize it. We extract this flow from generated videos using off-the-shelf vision models, then use it as a shared objective for both trajectory optimization and reinforcement learning.

    🔹 Dream2Flow can perform a range of in-the-wild tasks zero-shot with trajectory optimization, including manipulation of rigid, articulated, and deformable objects. The robot plans by asking a counterfactual question using a dynamics model (either heuristics-based or learned): if I take this action, will the scene evolve toward the desired 3D flow?

    🔹 Used as a reward for RL, Dream2Flow enables different embodiments to discover emergent behaviors that achieve the same effect (e.g., base motion of the robot dog). Dream2Flow unifies these behaviors through a shared task interface and unifies model-free and model-based methods around a shared tracking goal.

    🔹 By leveraging purely off-the-shelf video models, Dream2Flow also allows generalization to different object instances, backgrounds, and camera viewpoints. It is also surprisingly steerable: different language instructions in the same scene can induce different desired behaviors.

    🔹 World modeling encodes rich priors about not only environment dynamics but also behaviors within it. It is immensely useful for robotics, yet we are only scratching the surface of understanding it.

    The project was led by Karthik Dharmarajan and has been a year in the making, along with the rest of the team: Jiajun Wu, Fei-Fei Li, and Ruohan Zhang. Karthik Dharmarajan will also be joining UC Berkeley as a PhD student this fall!

    Website: dream2flow.github.io
    Paper: https://lnkd.in/gpwP2hkT
    Code: https://lnkd.in/gvJZTxaP
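
The counterfactual planning question ("if I take this action, will the scene evolve toward the desired 3D flow?") can be illustrated with a small random-shooting planner. This is a sketch under strong assumptions, not the Dream2Flow code: the dynamics model is a hand-written heuristic in which a grasped object translates rigidly with each end-effector action, and `rollout`, `flow_cost`, and `plan` are hypothetical names.

```python
import random


def rollout(points, actions):
    """Heuristic dynamics (assumption): grasped object points translate with each action."""
    trajectory = []
    current = [tuple(p) for p in points]
    for dx, dy, dz in actions:
        current = [(x + dx, y + dy, z + dz) for x, y, z in current]
        trajectory.append(current)
    return trajectory


def flow_cost(predicted, desired):
    """Sum of squared distances between predicted and desired point positions over time."""
    return sum(
        (px - qx) ** 2 + (py - qy) ** 2 + (pz - qz) ** 2
        for pred_t, des_t in zip(predicted, desired)
        for (px, py, pz), (qx, qy, qz) in zip(pred_t, des_t)
    )


def plan(points, desired_flow, num_samples=500, max_step=0.05, seed=0):
    """Random shooting: sample action sequences and keep the one that best tracks the flow."""
    rng = random.Random(seed)
    horizon = len(desired_flow)
    # Start from the no-op plan so the result is never worse than doing nothing.
    best_actions = [(0.0, 0.0, 0.0)] * horizon
    best_cost = flow_cost(rollout(points, best_actions), desired_flow)
    for _ in range(num_samples):
        actions = [
            tuple(rng.uniform(-max_step, max_step) for _ in range(3))
            for _ in range(horizon)
        ]
        cost = flow_cost(rollout(points, actions), desired_flow)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


# Desired flow: two tracked object points should slide 4 cm along x per step.
object_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
desired_flow = rollout(object_points, [(0.04, 0.0, 0.0)] * 3)
best_actions, best_cost = plan(object_points, desired_flow)
noop_cost = flow_cost(rollout(object_points, [(0.0, 0.0, 0.0)] * 3), desired_flow)
```

The same `flow_cost` could in principle serve as a negative reward for RL, which is the sense in which one tracking objective can be shared between model-based and model-free methods.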

  • Murtaza Dalal

    Robotics ML Engineer @ Tesla Optimus | CMU Robotics PhD

    2,200 followers

    Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: a generalist policy for solving motion planning tasks in the real world 🤖

    Quickly and dynamically moving around and in between obstacles (motion planning) is a crucial skill for robots to manipulate the world around us. Traditional methods (sampling, optimization, or search) can be slow and/or require strong assumptions to deploy in the real world. Instead of solving each new motion planning problem from scratch, we distill knowledge across millions of problems into a generalist neural network policy.

    Our Approach:
    1) Large-scale procedural scene generation
    2) Multi-modal sequence modeling
    3) Test-time optimization for safe deployment

    Data Generation involves:
    1) Sampling programmatic assets (shelves, microwaves, cubbies, etc.)
    2) Adding in realistic objects from Objaverse
    3) Generating data at scale using a motion planner expert (AIT*): 1M demos!

    We distill all of this data into a single, generalist policy.

    Neural policies can hallucinate just like ChatGPT, so they might not be safe to deploy! Our solution: using the robot SDF, optimize for paths that have the least intersection of the robot with the scene. This technique improves deployment-time success rate by 30-50%!

    Across 64 real-world motion planning problems, Neural MP drastically outperforms prior work, beating SOTA sampling-based planners by 23%, trajectory optimizers by 17%, and learning-based planners by 79%, achieving an overall success rate of 95.83%.

    Neural MP extends directly to unstructured, in-the-wild scenes! From defrosting meat in the freezer and doing the dishes to tidying the cabinet and drying the plates, Neural MP does it all!

    Neural MP generalizes gracefully to OOD scenarios as well. The sword in the first video is double the size of any in-hand object in the training set! Meanwhile, the model has never seen anything like the bookcase during training, but it's still able to safely and accurately place books inside it.

    Since we train a closed-loop policy, Neural MP can perform dynamic obstacle avoidance as well! First, Jim tries to attack the robot with a sword, but it has excellent dodging skills. Then, he adds obstacles dynamically while the robot moves, and it's still able to safely reach its goal.

    This work is the culmination of a year-long effort at Carnegie Mellon University with co-lead Jiahui (Jim) Yang as well as Russell Mendonca, Youssef Khaky, Russ Salakhutdinov, and Deepak Pathak.

    The model and hardware deployment code are open-sourced and on Hugging Face! To run Neural MP on your robot today, check out the following:
    Web: https://lnkd.in/emGhSV8k
    Paper: https://lnkd.in/eGUmaXKh
    Code: https://lnkd.in/e6QehB7R
    News: https://lnkd.in/enFWRvft
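
The test-time safety step (prefer paths where the robot intersects the scene least, judged by the robot's signed distance function) can be sketched as a selection over candidate paths. This is a simplified illustration, not the Neural MP code: the robot is approximated by a single sphere, the scene by a small point cloud, and `robot_sdf`, `intersection_score`, and `select_safest` are hypothetical names.

```python
import math


def robot_sdf(point, robot_center, robot_radius):
    """Signed distance from `point` to a spherical robot proxy (negative means inside)."""
    return math.dist(point, robot_center) - robot_radius


def intersection_score(path, scene_points, robot_radius=0.1):
    """Count (waypoint, scene point) pairs where the scene point lies inside the robot."""
    return sum(
        1
        for center in path
        for p in scene_points
        if robot_sdf(p, center, robot_radius) < 0.0
    )


def select_safest(candidate_paths, scene_points, robot_radius=0.1):
    """Pick the candidate path with the least robot-scene intersection."""
    return min(
        candidate_paths,
        key=lambda path: intersection_score(path, scene_points, robot_radius),
    )


# Scene point cloud: a single obstacle point in the middle of the workspace.
scene = [(0.5, 0.0, 0.0)]

# Two candidate paths, standing in for samples drawn from a learned policy.
straight = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
detour = [(0.0, 0.0, 0.0), (0.5, 0.3, 0.0), (1.0, 0.0, 0.0)]

safest = select_safest([straight, detour], scene)
```

A real arm would need an SDF for each link evaluated at the configuration along the path, and the post describes optimizing paths rather than only ranking them; ranking sampled candidates is just the simplest version of the same idea.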

  • Ilir Aliu

    AI & Robotics | 150k+ | 22Astronauts

    108,174 followers

    A new VLM outperforms GPT-4o in spatial affordance prediction for robots:

    Just found RoboPoint, a vision-language model that helps robots figure out where to act based on natural language, like picking the right spot to place or grab an object. It's trained using only synthetic data, with no real-world demos or human supervision. That makes it faster to scale, cheaper to collect, and surprisingly robust across scenes and viewpoints.

    Why it's worth a look:
    ✅ Trained entirely with synthetic data, zero real-world demos
    ✅ Beats GPT-4o and PIVOT by 21.8% in keypoint prediction accuracy
    ✅ Improves task success rate by 30.5% in downstream manipulation tasks
    ✅ Works for manipulation, navigation, and AR assistance
    ✅ Generalizes to new scenes, viewpoints, and object types

    The team built an automated instruction-tuning pipeline to adapt large VLMs to robotic tasks, and it works.

    📄 Paper: arxiv.org/abs/2506.06272
    🏷️ CoRL 2024

    Zero real data. Better performance. This could be a step toward scalable, grounded VLMs for robotics.

  • Aaron Prather

    Director, Robotics & Autonomous Systems Program at ASTM International

    85,790 followers

    Training general-purpose robots (the kind that could fold laundry or tidy up like Rosie from The Jetsons) is notoriously difficult because they need vast amounts of real-world data. Traditionally, this data comes from carefully arranged external cameras, but NYU's General-purpose Robotics and AI Lab, led by Lerrel Pinto, is testing a more scalable approach: EgoZero.

    EgoZero uses Meta's research-only smart glasses to record tasks from a human's point of view. This "egocentric" data captures exactly what a person sees while performing actions, making it both portable and highly relevant. In tests, robots trained only on this human data (no robot data required) achieved a 70% success rate on seven manipulation tasks, such as placing bread on a plate.

    Instead of relying on full images, which don't translate well between human hands and robot arms, EgoZero maps hand movements as 3D points in space. This allows robots to generalize: if trained to pick up a roll, they can adapt to handle ciabatta in a new setting.

    The NYU team is also developing open-source robot designs, touch sensors, and smartphone-based data collection tools. Their ultimate aim is scalability: while large language models train on the Internet, robots lack an equivalent dataset for the physical world. EgoZero and similar methods could begin to close that gap by turning everyday human actions into training fuel for general-purpose robots.

    📝 Research Paper: https://lnkd.in/e5m25bSA
    💻 Project Page: https://lnkd.in/ewM3VPdH

  • Naveen Manwani

    Product Operations Strategy | Certified Scrum Product Owner®|Scaling 0-to-1 | Bridging Tech & Business | Optimizing Workflows with GenAI

    7,016 followers

    🚨 Paper Alert 🚨

    ➡️ Paper Title: DeformGS: Scene Flow in Highly Deformable Scenes for Deformable Object Manipulation

    🌟 A few pointers from the paper:

    🎯 Teaching robots to fold, drape, or reposition deformable objects such as cloth will unlock a variety of automation applications. While remarkable progress has been made for rigid object manipulation, manipulating deformable objects poses unique challenges, including frequent occlusions, infinite-dimensional state spaces, and complex dynamics.

    🎯 Just as object pose estimation and tracking have aided robots for rigid manipulation, dense 3D tracking (scene flow) of highly deformable objects will enable new applications in robotics while aiding existing approaches, such as imitation learning or creating digital twins with real2sim transfer.

    🎯 The authors propose "DeformGS", an approach to recover scene flow in highly deformable scenes, using simultaneous video captures of a dynamic scene from multiple cameras.

    🎯 DeformGS builds on recent advances in Gaussian splatting, a method that learns the properties of a large number of Gaussians for state-of-the-art and fast novel-view synthesis. DeformGS learns a deformation function to project a set of Gaussians with canonical properties into world space.

    🎯 The deformation function uses a neural-voxel encoding and a multilayer perceptron (MLP) to infer Gaussian position, rotation, and a shadow scalar. The authors enforce physics-inspired regularization terms based on conservation of momentum and isometry, which leads to smaller trajectory errors.

    🎯 They also leverage the existing foundation models SAM and XMem to produce noisy masks, and learn a per-Gaussian mask for better physics-inspired regularization. DeformGS achieves high-quality 3D tracking on highly deformable scenes with shadows and occlusions.

    🎯 In experiments, DeformGS improves 3D tracking by an average of 55.8% compared to the state of the art. With sufficient texture, DeformGS achieves a median tracking error of 3.3 mm on a 1.5 m × 1.5 m cloth.

    🏢 Organization: Carnegie Mellon University, Stanford University, NVIDIA, National University of Singapore, Technical University of Munich

    🧙 Paper Authors: Bardienus Duisterhof, Mandi Zhao, Yunchao Yao, JiaWei Liu, Jenny Seidenschwarz, Mike Zheng Shou, Deva Ramanan, Shuran Song, Stan Birchfield, Bowen Wen, Jeffrey Ichnowski

    1️⃣ Read the Full Paper here: https://lnkd.in/gmPqgm7H
    2️⃣ Project Page: https://lnkd.in/g_-TRmvd
    3️⃣ Code: https://lnkd.in/g4AERgFs

    #robotics #gaussiansplatting
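
The isometry regularizer mentioned in the post rests on a simple observation: cloth can bend and fold, but it barely stretches, so distances between neighboring points should stay close to their rest values. The Python sketch below illustrates that idea only. It is not the DeformGS loss: the exact form, the neighbor selection, and the name `isometry_loss` are assumptions.

```python
import math


def isometry_loss(canonical, deformed, neighbor_pairs):
    """Mean absolute change in distance between neighboring points after deformation."""
    total = 0.0
    for i, j in neighbor_pairs:
        rest = math.dist(canonical[i], canonical[j])
        now = math.dist(deformed[i], deformed[j])
        total += abs(now - rest)
    return total / len(neighbor_pairs)


# Three points along a strip of cloth, spaced 1 unit apart, with chain neighbors.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pairs = [(0, 1), (1, 2)]

# Folding the strip by 90 degrees keeps neighbor distances, so it is not penalized.
folded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
# Stretching the strip by 50% changes every neighbor distance and is penalized.
stretched = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

fold_penalty = isometry_loss(canonical, folded, pairs)
stretch_penalty = isometry_loss(canonical, stretched, pairs)
```

In a Gaussian-splatting setting the points would be Gaussian centers and the loss would be differentiated through the deformation function; the sketch only shows what the term measures.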

  • Lentin Joseph

    Author of 11 ROS & Robotics books | ROS 2 | Physical AI | NVIDIA Omniverse | NVIDIA Isaac | Senior Robotics & AI Consultant @RUNTIME Robotics | ROS 2 & NVIDIA Isaac Trainer @Robocademy | TEDx Speaker

    20,579 followers

    MolmoAct: Action Reasoning in 3D Space 🤖🧠

    Most AI models still "think" in text. Even the best vision-language models depend heavily on language, which makes them struggle whenever real-world movement or spatial understanding is involved. That's exactly the gap MolmoAct tries to close, and it does so in a very interesting way 🙂

    MolmoAct is an Action Reasoning Model (ARM) built to understand and plan actions directly in 3D space, using depth and geometric cues instead of relying only on text. This makes its reasoning feel much closer to how humans use visuals, diagrams, and spatial intuition.

    Here's a clear technical breakdown 👇
    ✅ Physically grounded scene understanding: uses depth-aware perception tokens to estimate object geometry and distances.
    ✅ Waypoint-based planning in image space: the model predicts visual waypoints that outline the task step by step, independent of robot embodiment.
    ✅ Precise action decoding: those waypoints are converted into low-level robot actions for arms, grippers, humanoids, etc.
    ✅ Adapts across embodiments: minimal fine-tuning is enough for MolmoAct to adjust to new robot bodies and tasks 💪
    ✅ Fully open and reproducible: weights, code, training data, and evaluation scripts are all public and inspectable.
    ✅ Strong performance: MolmoAct-7B shows state-of-the-art results on SimplerEnv, LIBERO, and several generalization benchmarks.

    One feature I personally like: before sending control commands, MolmoAct overlays its planned trajectory directly onto the image. Users can even sketch corrections or target poses on a tablet or screen ✏️📱, and the model adjusts in real time. This makes interaction more intuitive and far more transparent.

    The team has also released a brand-new post-training dataset with ~10k robot episodes, making MolmoAct a solid foundation for future robotics work.

    If you're building anything in robotics, manipulation, or spatially-aware agents, MolmoAct is definitely worth exploring 🚀

    👉 Introducing MolmoAct: https://lnkd.in/g_ii3ti2
    👉 Read More: https://lnkd.in/g_DhnWqt

    #vla #molmoact #robotics #ai Robocademy

  • Peter Farkas

    Robotics - Automation - Physical AI > | Business Development | Sales | Channel Management

    7,965 followers

    Vision system force map used for robotic object manipulation strategies.

    Force Map: Learning to Predict Contact Force Distribution from Vision

    When humans see a scene, they can roughly imagine the forces applied to objects based on their experience and use them to handle the objects properly. This paper considers transferring this "force-visualization" ability to robots. We hypothesize that a rough force distribution (named "force map") can be utilized for object manipulation strategies even if accurate force estimation is impossible. Based on this hypothesis, we propose a training method to predict the force map from vision.

    To investigate this hypothesis, we generated scenes where objects were stacked in bulk through simulation and trained a model to predict the contact force from a single image. We further applied domain randomization to make the trained model function on real images. The experimental results showed that the model trained using only synthetic images could predict approximate patterns representing the contact areas of the objects even for real images.

    Then, we designed a simple algorithm to plan a lifting direction using the predicted force distribution. We confirmed that using the predicted force distribution contributes to finding natural lifting directions for typical real-world scenes. Furthermore, the evaluation through simulations showed that the disturbance caused to surrounding objects was reduced by 26% (translation displacement) and by 39% (angular displacement) for scenes where objects were overlapping.

    National Institute of Advanced Industrial Science and Technology (AIST), Industrial Cyber-Physical Systems Research Center, Automation Research Team
    Natsuki Yamanobe, Abdullah Mustafa, Ryo Hanai, Yukiyasu Domae, Ixchel G. Ramirez-Alpizar, and Bruno Leme of AIST, and Tetsuya Ogata of Waseda University

    https://lnkd.in/gmyXxNrc
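
One way a rough force distribution could inform a lifting direction is sketched below in Python. This is not the algorithm from the Force Map paper, whose details are not given in the post: the scoring rule (avoid moving toward locations with high predicted contact force), the sampled force map format, and the name `lifting_direction` are all hypothetical.

```python
import math


def unit(v):
    """Normalize a non-zero 3D vector."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def lifting_direction(object_center, force_samples, candidates):
    """Pick the candidate direction that pushes least into predicted contact forces.

    force_samples: list of (position, force_magnitude) pairs read from a force map.
    Contact positions are assumed not to coincide with the object center.
    """
    def pressure(direction):
        d = unit(direction)
        score = 0.0
        for position, magnitude in force_samples:
            to_contact = unit(tuple(p - c for p, c in zip(position, object_center)))
            alignment = sum(a * b for a, b in zip(d, to_contact))
            # Only motion toward a contact counts; moving away costs nothing.
            score += magnitude * max(0.0, alignment)
        return score

    return min(candidates, key=pressure)


# Target object at the origin: supported from below, with a neighbor leaning on
# its upper +x side (force magnitudes are made-up values).
forces = [((0.0, 0.0, -0.1), 5.0), ((0.1, 0.0, 0.1), 3.0)]
candidates = [(0.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

best = lifting_direction((0.0, 0.0, 0.0), forces, candidates)
```

In this toy scene the planner tilts the lift away from the leaning neighbor instead of pulling straight up into it, which is the qualitative behavior the post describes as finding "natural lifting directions".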
