Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Mingfei Chen1,2 Β· Yifan Wang1 Β· Zhengqin Li1 Β· Homanga Bharadhwaj1 Β· Yujin Chen1 Β· Chuan Qin1 Β· Ziyi Kou1 Β· Yuan Tian1 Β· Eric Whitmire1 Β· Rajinder Sodhi1 Β· Hrvoje Benko1 Β· Eli Shlizerman2 Β· Yue Liu1
1Meta Β· 2University of Washington
Prior work on 3D hand trajectory prediction is constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning and action. To address these limitations, we first present EgoMAN, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning.
We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate, stage-aware trajectories that generalize across real-world scenes.
Note: Model weights and processed dataset are not released due to legal and licensing considerations. We provide complete model code and dataset creation scripts in this repo to ensure full reproducibility.
| Category | Description |
|---|---|
| βοΈ Environment Setup | Install dependencies and set up CUDA environment |
| π Quick Start & Inference | Single image and batch trajectory prediction |
| π Evaluation | Benchmark assessment with trajectory and waypoint metrics |
| π§ Training | Progressive three-stage training pipeline |
| π¦ Dataset Creation | Scripts to build EgoMAN dataset from scratch |
Install dependencies using the provided script (requires CUDA 12.4):
conda create python=3.12 -n perception_models
conda activate perception_models
bash env_install.sh

Key Dependencies:
- torch==2.6.0, torchvision==0.21.0
- flash-attn==2.7.4
- transformers==4.50.0
- pytorch3d
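After installation, a quick sanity check along these lines can confirm that the key packages import with the expected versions (a minimal sketch, not part of the repo):

```python
# Minimal environment sanity check (assumes the dependencies listed above).
import torch, torchvision, transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)

import flash_attn, pytorch3d  # both require a successful CUDA build
print("flash-attn:", flash_attn.__version__)
print("pytorch3d:", pytorch3d.__version__)
```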
Get started quickly by running inference on a single egocentric image. This demo predicts future 3D hand trajectories based on the current scene and a text description of the intended action.
What you need:
- An egocentric image (first-person view)
- A text prompt specifying the intended action and the acting hand (e.g., "open refrigerator with left hand")
- (Optional) Past motion data for better performance (e.g., 5 frames of 6DoF hand poses)
Setup: Visual features are extracted on-the-fly using DINOv3. First set up DINOv3:
cd model/semantics_extractor
git clone https://github.com/facebookresearch/dinov3.git
cd ../..

Note: DINOv3 weights will be automatically downloaded on first run if not found. Alternatively, you can manually download them from here and place dinov3_vitl16_pretrain_lvd1689m-8aa4cbdd.pth under data/weights/.
Run inference on a single image with text prompt:
cd model
# Option 1: Run inference WITH past motion (recommended for best performance)
python tools/egoman_demo_vlonly.py \
--image ../data/examples/open_refrigerator_to_access_contents_with_left_hand.jpg \
--text "open refrigerator with left hand" \
--past_motion ../data/examples/open_refrigerator_to_access_contents_with_left_hand+past_motion.npy \
--model_path ../data/weights/EgoMAN-7B \
--output_dir ../output \
--num_samples 3
# Option 2: Run inference WITHOUT past motion (simpler but may degrade performance)
# Note: The --past_motion parameter is optional but recommended for best performance
python tools/egoman_demo_vlonly.py \
--image ../data/examples/open_refrigerator_to_access_contents_with_left_hand.jpg \
--text "open refrigerator with left hand" \
--model_path ../data/weights/EgoMAN-7B \
--output_dir ../output \
    --num_samples 3

Output:
- Prediction results: output/{image_name}_result.pkl
- Visualizations: output/visualizations/{image_name}/
Note: Past motion format: a .npy file with shape (5, 2, 7) - 5 frames, 2 hands (left, right), 7 values per hand: 3D position + quaternion (x, y, z, qx, qy, qz, qw)
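For reference, a past-motion file in this format can be created as in the minimal sketch below (dummy values; in practice the poses come from your own hand-tracking source):

```python
import numpy as np

# Past motion: 5 frames x 2 hands (left, right) x 7 values per hand
# (x, y, z, qx, qy, qz, qw); positions in meters, camera-relative.
past_motion = np.zeros((5, 2, 7), dtype=np.float32)
past_motion[..., 6] = 1.0  # identity quaternion (qw = 1) as a placeholder

# Example: set the left hand of the most recent frame to a concrete pose.
past_motion[-1, 0] = [0.12, -0.05, 0.40, 0.0, 0.0, 0.0, 1.0]

np.save("my_past_motion.npy", past_motion)
print(np.load("my_past_motion.npy").shape)  # (5, 2, 7)
```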
Process multiple egocentric images in batch mode for efficient trajectory prediction. This workflow demonstrates how to extract visual and text features, run parallel inference, and visualize the predicted hand trajectories.
Pipeline Overview:
- Extract DINOv3 visual features from images
- (Optional) Extract CLIP text embeddings for action descriptions
- Run batch inference to generate multiple trajectory samples
- Visualize results with overlaid trajectories on images
Run batch inference on multiple example images:
cd model
# Extract DINOv3 visual features from example images
python semantics_extractor/extract_vis_dinov3.py
# (Optional) Extract CLIP features from intention text
python semantics_extractor/extract_text_clip_emb.py
# Infer 3 samples from examples, output saved to output/[MODEL_NAME]-examples.pkl
BATCH_SIZE=1 SAVE_EVERY=50 MODEL_NAME=EgoMAN-7B torchrun --nproc_per_node=1 tools/infer_batch_aria_examples.py
# Visualize the output
python tools/visualize_batch_results.py

Note: The visualization plots K=3 predicted trajectories by default. You can change K in the function visualize_predictions:
visualize_predictions(
img_dir,
result_pkl_path=result_pkl_path,
output_dir=output_dir,
cam_params_dict=cam_params_dict,
K=3, # plot K predicted hand trajectories
)

Evaluate the EgoMAN model on our benchmark dataset. The benchmark contains challenging egocentric interaction scenarios that assess trajectory prediction accuracy and stage awareness.
Requirements:
- Download and place the egoman_imgs data folder under data/
- Ensure you have multiple GPUs available for parallel processing
Evaluation Metrics:
- Trajectory accuracy: ADE, FDE, DTW, ROT
- Waypoint prediction: Contact, Traj-Warp (Traj)
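For intuition, ADE and FDE reduce to simple averages of pointwise 3D errors. Below is a minimal sketch assuming predicted and ground-truth trajectories are (T, 3) position arrays; tools/egomanbench_metrics.py remains the authoritative implementation (and also covers DTW and ROT).

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, 3) arrays of predicted / ground-truth 3D hand positions."""
    dists = np.linalg.norm(pred - gt, axis=-1)  # per-timestep Euclidean error (meters)
    return dists.mean(), dists[-1]              # ADE, FDE

# Toy example with random walks standing in for trajectories.
rng = np.random.default_rng(0)
pred = np.cumsum(rng.normal(scale=0.01, size=(30, 3)), axis=0)
gt = np.cumsum(rng.normal(scale=0.01, size=(30, 3)), axis=0)
print(ade_fde(pred, gt))
```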
Run the full evaluation pipeline:
cd model
# Extract DINOv3 visual embeddings, output to data/egomanbench_vis_features.pkl
python semantics_extractor/extract_vis_egomanbench.py
# Run inference with 4 GPUs in parallel
# Output saved to output/[MODEL_NAME]-egomanbench.pkl
BATCH_SIZE=1 SAVE_EVERY=50 MODEL_NAME=EgoMAN-7B torchrun --nproc_per_node=4 tools/infer_batch_egomanbench.py
# Compute trajectory metrics
python tools/egomanbench_metrics.py
# Compute waypoint metrics with a shift radius of 0.06 (spatial tolerance between the palm affordance point and the wrist position)
python tools/waypoint_metrics.py --shift --shift_radius 0.06

Train the EgoMAN model from scratch using our progressive three-stage training approach. This methodology ensures that the reasoning and motion modules are properly aligned for accurate trajectory prediction.
Training Stages:
- Preprocessing: Feature Extraction - Extract DINOv3 visual features and CLIP action embeddings from training data
- Stage 1: Reasoning Module Pretraining - Train the vision-language model on semantic, spatial, and motion reasoning tasks
- Stage 2: Motion Expert Pretraining - Train the trajectory decoder on motion dynamics
- Stage 3: Joint Training - Align reasoning with motion generation via the trajectory-token interface
Before starting the three-stage training, you must extract visual and text features from your training dataset. This preprocessing step is critical as the training dataloaders expect pre-computed features to avoid redundant computation during training.
Prerequisites:
- Place your training dataset annotations at:
  - data/egoman_dataset/egoman_pretrain.pkl (for reasoning pretraining)
  - data/egoman_dataset/egoman_finetune.pkl (for finetuning)
  - data/egoman_dataset/egoman-test-final.pkl (for validation, optional)
- Ensure training images are accessible from paths specified in the annotation PKL files
Extract visual features from all training images using DINOv3. These features are used by all three training stages.
cd model
# Extract DINOv3 features from training images
python semantics_extractor/extract_vis_dinov3_train.py

Output:
- data/egoman_dataset/egoman_dinov3_features.pkl - Dictionary mapping {image_path: dinov3_feature_array}
- Feature shape: (1024,) for DINOv3-L/16
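To sanity-check the extracted features, the pickle can be inspected along these lines (a minimal sketch; the file path and key layout follow the description above):

```python
import pickle

with open("data/egoman_dataset/egoman_dinov3_features.pkl", "rb") as f:
    feats = pickle.load(f)  # {image_path: dinov3_feature_array}

image_path, feature = next(iter(feats.items()))
print(image_path, feature.shape)  # expected: (1024,) for DINOv3-L/16
```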
Extract text embeddings from action phrases using CLIP. These embeddings encode semantic information about intended hand-object interactions.
cd model
# Extract CLIP embeddings from action phrases
python semantics_extractor/extract_act_emb_clip_train.py

Output:
- data/egoman_dataset/act_emb_dict.pkl - Dictionary mapping {image_path + "_" + action_phrase: {"text": ..., "emb": ...}}
- Embedding shape: (768,) for CLIP-L/14
- data/egoman_dataset/act_emb_val_dict.pkl (optional) - Same format for validation data
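For reference, a 768-dimensional CLIP-L/14 text embedding can be obtained with Hugging Face transformers as sketched below. The checkpoint name and pooling used here are assumptions; extract_act_emb_clip_train.py is the reference implementation.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Assumption: the OpenAI CLIP-L/14 checkpoint with its text projection head.
model_name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_model = CLIPTextModelWithProjection.from_pretrained(model_name).eval()

with torch.no_grad():
    tokens = tokenizer(["open refrigerator with left hand"],
                       padding=True, return_tensors="pt")
    emb = text_model(**tokens).text_embeds  # (1, 768) projected text embedding
print(emb.shape)
```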
Train the reasoning module on structured QA pairs to develop understanding of egocentric interactions, object relationships, and motion semantics.
cd model
sh train_scripts/reason_pretrain.sh

Train the motion generation module to learn realistic hand trajectory dynamics and interaction patterns.
cd model
sh train_scripts/motion_pretrain.sh

Fine-tune both modules together with the trajectory-token interface to enable seamless reasoning-to-motion transfer.
cd model
sh train_scripts/joint_finetune.sh

Build the EgoMAN dataset from scratch using our automated pipeline. This process transforms raw egocentric videos into a comprehensive dataset with 219K trajectories and 3M QA pairs.
Overview: The dataset creation pipeline consists of 5 automated steps that progressively process raw video data into structured interaction episodes with semantic annotations and 6DoF hand trajectories.
Source Datasets Required:
- EgoExo4D - Large-scale egocentric video dataset
- Nymeria Dataset - Egocentric interaction videos
- HOT3D - 3D hand-object tracking dataset
Setup:
Download the source datasets by following their respective repositories (EgoExo4D, Nymeria Dataset, HOT3D) and place them under data/egoman_dataset/. The provided scripts include the prompts used to build EgoMAN; replace the GPT call function and API credentials with your own, and update the source data, output, and temp file paths in the scripts to match your setup.
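For example, a drop-in GPT call might look like the hypothetical sketch below (using the official openai client; the function name, signature, and model id are placeholders to adapt to whatever the scripts expect):

```python
# Hypothetical replacement for the GPT call used in the dataset scripts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_gpt(prompt: str, model: str = "gpt-4.1") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```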
Annotate 5-second video clips with interaction stages (approach, manipulation) using GPT-4.1. This step identifies key interaction moments and labels them with text descriptions, timestamps, and reasoning.
Output: Annotated interaction clips with stage labels and descriptions
Annotation function (GPT-4.1) in script: dataset/scripts/step1_gpt_anno_interact.py
Apply rule-based and GPT-powered filters to remove invalid annotations. This step ensures interactions are realistic, properly timed, and semantically meaningful. Also generates high-level intention summaries.
Filters Applied:
- Duration constraints (not too short/long)
- Semantic relevance checks
- Realism validation
- Annotation quality assessment
Filtering functions in script: dataset/scripts/step2_valid_interact_filter.py
Generate diverse non-numeric question-answer pairs for semantic, spatial, and motion reasoning. Each valid interaction produces multiple QA pairs covering object recognition, spatial relationships, motion patterns, and interaction stages.
Output QA Categories:
- Current intention goals
- Which hand will be used
- What action will occur
- What object will be manipulated
- Hand trajectory descriptions
- Interaction stage information
- Reasoning about why actions occur
Generator function in script: dataset/scripts/step3_gpt_qa_generator.py
Extract 6DoF hand trajectories (3D position + quaternion orientation) from the source datasets. This step processes hand tracking data and aligns it with the annotated interaction clips from Step 2.
Trajectory Format:
- Position: (x, y, z) in meters, camera-relative coordinates
- Orientation: Quaternion (qx, qy, qz, qw)
- Frequency: 10 FPS
- Smooth interpolation for missing frames
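The interpolation in the last bullet can be implemented roughly as below — a minimal sketch that interpolates positions linearly and orientations with SLERP via SciPy; the actual step4 scripts may handle gaps and edge cases differently.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(t_known, positions, quats_xyzw, t_query):
    """Fill in missing frames of a 6DoF hand trajectory.

    t_known:     (N,) increasing timestamps with valid tracking
    positions:   (N, 3) x, y, z in meters
    quats_xyzw:  (N, 4) qx, qy, qz, qw
    t_query:     (M,) timestamps to sample at (e.g., a 10 FPS grid within range)
    """
    pos_interp = np.stack(
        [np.interp(t_query, t_known, positions[:, i]) for i in range(3)], axis=-1
    )
    slerp = Slerp(t_known, Rotation.from_quat(quats_xyzw))
    quat_interp = slerp(t_query).as_quat()  # (M, 4), xyzw order
    return pos_interp, quat_interp

# Example: resample two known poses onto a 10 FPS grid between them.
t_known = np.array([0.0, 0.5])
positions = np.array([[0.10, 0.00, 0.40], [0.20, 0.05, 0.35]])
quats = np.array([[0, 0, 0, 1], [0, 0, 0.3826834, 0.9238795]])  # 45 deg about z
pos10, quat10 = interpolate_pose(t_known, positions, quats, np.linspace(0.0, 0.5, 6))
```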
Output:
- 6DoF hand trajectories aligned with interaction clips
- Projected 2D wrist positions in image coordinates
- Head pose trajectory for context
- Camera transformation matrices
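For intuition, projecting a camera-frame 3D wrist position to 2D image coordinates follows a standard pinhole model, as in the simplified sketch below; the actual scripts rely on each source dataset's calibrated (often fisheye) camera model, so treat this purely as an illustration with hypothetical intrinsics.

```python
import numpy as np

def project_pinhole(point_cam, fx, fy, cx, cy):
    """point_cam: (3,) wrist position in the camera frame, meters, z > 0."""
    x, y, z = point_cam
    return np.array([fx * x / z + cx, fy * y / z + cy])  # (u, v) in pixels

# Hypothetical intrinsics, purely for illustration.
print(project_pinhole(np.array([0.12, -0.05, 0.40]), fx=600, fy=600, cx=320, cy=240))
```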
The 6DoF trajectories (3D location and quaternion) are extracted from MPS hand tracking data.
Note: For EgoExo4D, we re-run the MPS Hand Tracking Service and place the hand tracking result for each take under: data/egoman_dataset/egoexo/vrs_list/[take_name]/hand_tracking/
Script for EgoExo and Nymeria: dataset/scripts/step4_6dof_traj_process.py.
Script for HOT3D: dataset/scripts/step4_6dof_traj_process_hot3d.py.
Important: This step runs AFTER Step 4 and requires trajectory data with 6DoF hand poses.
Generate numeric question-answer pairs that require quantitative reasoning about hand trajectories. This script creates QA pairs with numeric answers (3D positions, timestamps, quaternions) for both pretraining and finetuning the reasoning module.
Output QA Types:
- Pretraining QA (diverse individual questions):
  - Temporal: "When will the hand approach/complete manipulation?"
  - Spatial: "What will be the 3D position of the [hand] at [stage]?"
  - Spatiotemporal: "When and where will the hand make contact?"
  - Action semantic identification: "What is the next hand-object interaction?"
- Finetuning QA (full trajectory prediction):
  - Question: "Where will the hands move to [intention]?<HOI_QUERY>"
  - Answer: "" with full trajectory data
  - Includes past motion context (5 frames of historical poses)
Numeric Answer Format:
- Special tokens: <ACT>, <START>, <CONTACT>, <END>, etc.
- 11-dimensional vectors encoding timestamps, 3D positions, and 2D projections
Generator function in script: dataset/scripts/step5_reason_numeric_qa_generator.py.
Combine the outputs from Step 3 (non-numeric QA) and Step 5 (numeric QA) to form the complete reasoning dataset for training.
Final quality control to ensure only high-quality, physically plausible trajectories are included in the finetuning dataset and evaluation benchmark.
Quality Criteria:
- Smooth motion without abrupt jumps
- Physically plausible hand movements
- Consistent with visual observations
- Proper alignment with interaction stages
Filter out low quality trajectories:
- By rules: dataset/scripts/step6_traj_quality_filter_rules.py
- By GPT: dataset/scripts/step6_traj_quality_filter_gpt.py
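As an illustration of the first rule-based criterion (no abrupt jumps), a hypothetical velocity-threshold check is sketched below; the thresholds and additional checks in step6_traj_quality_filter_rules.py may differ.

```python
import numpy as np

def has_abrupt_jump(positions, fps=10, max_speed_mps=3.0):
    """positions: (T, 3) wrist positions in meters at the given frame rate.

    Flags trajectories whose per-frame speed exceeds a plausible hand speed,
    which usually indicates tracking glitches rather than real motion.
    """
    if len(positions) < 2:
        return False
    speeds = np.linalg.norm(np.diff(positions, axis=0), axis=-1) * fps  # m/s
    return bool(np.any(speeds > max_speed_mps))
```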
The majority of EgoMAN is licensed under CC-BY-NC; however, portions of the project that adapt code from other projects are available under separate license terms: QwenVL3 and FastChat are licensed under the Apache 2.0 license.
If you find EgoMAN useful in your research, please consider citing:
@misc{chen2025flowingreasoningmotionlearning,
title={Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos},
author={Mingfei Chen and Yifan Wang and Zhengqin Li and Homanga Bharadhwaj and Yujin Chen and Chuan Qin and Ziyi Kou and Yuan Tian and Eric Whitmire and Rajinder Sodhi and Hrvoje Benko and Eli Shlizerman and Yue Liu},
year={2025},
eprint={2512.16907},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16907},
}