Scalable Behavior Cloning with Open Data, Training, and Evaluation
[Figure panels: Real-world data, Model ablations, Simulation, Real evals]

Abstract
We introduce ABC, a fully open-source stack for bimanual manipulation with behavior cloning. At its core is the release of ABC-130K, the largest bimanual teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across nearly 200 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation. This allows researchers to effectively evaluate design choices without deploying on physical robots. We explore various training recipes and compare common architectural choices for Diffusion Transformers (DiT) and Vision-Language-Action (VLA) models, grounding our findings in real-world evaluations whose logs will also be released. The resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets. By providing a reproducible toolbox, we aim to place researchers on an equal footing, establishing the necessary foundation to learn the ABCs of Behavior Cloning together as a community.
Results
All videos are autonomous, real-time rollouts.
Data
Our dataset, ABC-130K, comprises 134,806 episodes across 195 tasks, totaling 3,553 hours of bimanual manipulation. It spans diverse manipulation primitives such as pick-and-place, folding, handover, insertion, tool use, and assembly. We organize the 195 tasks into 7 primitive categories that capture the dominant contact mode and control strategy. Within each task, we vary the objects and initial configurations. Descriptions of our primitive categories can be found in the Appendix. Episode durations range from ~7 s ("put the screwdriver in the bin") to 469 s ("folding t-shirt pile and stacking").
[Video gallery: Pick-and-Place, Folding, Sorting, Tool Use]
Model
To work out how to leverage this data most effectively, we study various architectural choices for both Diffusion Transformers and Vision-Language-Action models, two popular kinds of robot policy architectures. We introduce two models, ABC-DiT and ABC-VLA, for which we release open-source training code and checkpoints.
ABC-DiT 2B
ABC-DiT is a diffusion transformer paired with a pretrained vision encoder. Its parameter split is unusual: a large 1.93B DiT head with a comparatively small 85.7M DINOv3 backbone. We choose this split because the vision encoder is compute-heavy per parameter (it has to attend over 3 camera images), so increasing the size of the action head is cheaper than increasing the size of the vision encoder. We sweep four DiT sizes (S, B, L, xL) to see how loss changes with head capacity, and find the largest model to be the most compute-efficient at our training scale.
ABC-VLA 4.3B
ABC-VLA pairs a Gemma 3 4.3B-parameter backbone with a lightweight 45M-parameter action head. This lopsided split opens up a free lunch: each VLM forward pass is expensive, but each diffusion target sampled from it is cheap. Replicating the VLM's hidden states k times in the batch and pairing each copy with an independent (noise, timestep) draw amortizes one VLM pass across many gradient signals, while the backward pass through the VLM still happens only once. We see lower-variance gradients and faster convergence. Crucially, gradients from the diffusion action loss now flow into the VLM at a higher signal-to-noise ratio.
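The replication trick can be sketched in a few lines of plain Python. This is an illustrative toy, not the released training code: `encode_once` is a hypothetical stand-in for the VLM forward pass, and the linear interpolation between action and noise (a flow-matching style target) is an assumption, since the text above does not specify the diffusion parametrization.

```python
import random

def encode_once(observations, counter):
    """Hypothetical stand-in for the expensive VLM forward pass."""
    counter["vlm_calls"] += 1
    # One "hidden state" per observation; here it is just a copy of the input.
    return [list(obs) for obs in observations]

def expand_targets(hidden, actions, k, rng):
    """Pair every hidden state with k independent (noise, timestep) draws."""
    samples = []
    for h, a in zip(hidden, actions):
        for _ in range(k):
            t = rng.random()                          # timestep in [0, 1)
            noise = [rng.gauss(0.0, 1.0) for _ in a]  # fresh noise per copy
            # Assumed parametrization: linear interpolation action -> noise.
            noisy = [(1.0 - t) * ai + t * ni for ai, ni in zip(a, noise)]
            target = [ni - ai for ai, ni in zip(a, noise)]
            # All k copies share the same hidden-state object (no recompute).
            samples.append({"hidden": h, "t": t, "noisy": noisy, "target": target})
    return samples

counter = {"vlm_calls": 0}
rng = random.Random(0)
observations = [[0.1, 0.2], [0.3, 0.4]]            # toy batch of 2
actions = [[1.0, -1.0, 0.5], [0.0, 0.25, -0.5]]    # toy action vectors
hidden = encode_once(observations, counter)         # expensive pass, run once
samples = expand_targets(hidden, actions, k=4, rng=rng)
```

With a batch of 2 and k = 4, the cheap action head would see 8 training targets while the backbone stand-in runs once. In a tensor framework the same expansion would typically be a repeat along the batch dimension rather than a Python loop.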
Scaling
To ensure our models effectively leverage our data and compute scale, we study how our models are impacted by varying data and compute.
To investigate the impact of data scaling, we train our ABC-DiT on 1K, 3K, and 10K hours. We track two offline signals through training: validation loss on a held-out split and validation action error (L2 distance between generated and ground-truth action chunks). We find that validation loss and validation action error both decrease with data scale.
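As a rough illustration of the action-error metric, the sketch below computes an L2 distance between a generated and a ground-truth action chunk. It assumes a chunk is a list of per-timestep action vectors and that the distance is taken over the flattened chunk; the exact reduction used in the paper (per step, per dimension, or averaged) is not stated here, so treat the function names and the reduction as assumptions.

```python
import math

def action_chunk_l2(pred_chunk, true_chunk):
    """Euclidean distance between two chunks of shape (horizon, action_dim)."""
    if len(pred_chunk) != len(true_chunk):
        raise ValueError("chunks must have the same horizon")
    squared = 0.0
    for pred_step, true_step in zip(pred_chunk, true_chunk):
        if len(pred_step) != len(true_step):
            raise ValueError("action dimensions must match")
        squared += sum((p - t) ** 2 for p, t in zip(pred_step, true_step))
    return math.sqrt(squared)

def mean_action_error(pred_chunks, true_chunks):
    """Average the per-chunk L2 distance over a validation set."""
    if not pred_chunks or len(pred_chunks) != len(true_chunks):
        raise ValueError("need equally sized, non-empty chunk lists")
    total = sum(action_chunk_l2(p, t) for p, t in zip(pred_chunks, true_chunks))
    return total / len(pred_chunks)

# Toy example: two timesteps, two action dimensions.
pred = [[0.0, 0.0], [0.0, 0.0]]
true = [[3.0, 0.0], [0.0, 4.0]]
error = action_chunk_l2(pred, true)  # sqrt(3^2 + 4^2) = 5.0
```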
We conduct real-world evaluations to see how the performance of ABC-DiT and ABC-VLA changes with varying batch size and compute. We find that the VLA benefits from a large batch size. Both architectures consistently benefit from increasing training compute.
Sim
Real-world evaluations of policies are essential, but they are slow and expensive. To allow users of our dataset to iterate faster, we built ABC Sim. We collect over 400 hours of human teleoperation data in simulation across 20 tasks, and we use these simulated tasks as a cheap proxy for real-world performance during development.
We do simulation in MuJoCo. We also release a Blender re-rendering pipeline that takes any saved trajectory and re-renders it with ray tracing for higher visual fidelity.
We need to know whether our sim-eval is predictive of real-world performance. We evaluate our checkpoints on three matched tasks (throw bottles in bin, load plates in dishrack, turn mugs right-side up) in both sim and real, across 12 checkpoints spanning multiple architectures, batch sizes, and training durations, finding a correlation between sim and real performance.
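One simple way to quantify such a relationship is a Pearson correlation between per-checkpoint sim and real scores. The sketch below is an assumption about the kind of statistic involved, not the paper's analysis, and the score lists are made-up placeholders rather than reported results.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Placeholder scores for four hypothetical checkpoints (NOT real results).
sim_scores = [0.10, 0.40, 0.50, 0.90]
real_scores = [0.20, 0.30, 0.60, 0.80]
r = pearson_r(sim_scores, real_scores)
```

A rank-based statistic such as Spearman correlation may be a better fit if only the ordering of checkpoints matters for model selection.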
Evaluation
For our modelling experiments, we ran over 100 hours of physical-robot evaluations across our architecture, batch-size, dataset, and finetuning ablations. Each task was evaluated for 50 trials with a fixed rubric, and we release the full evaluation data so other researchers can reproduce our protocol.
In addition to using our evaluation data to directly compare methods, we look at global correlations across all of our evals. We find that training error and validation action error are correlated with real world performance across checkpoints with differing architectures, compute levels, and batch sizes.
Infrastructure
We release our full hardware setup, our training and inference code, and our simulation pipeline.
Collecting Interventions with Passive Leader Arms
Most teleoperation rigs use leader arms — kinematically-matched twins of the follower robot — driven by the operator. We use cheap passive arms (a GELLO-style design) that the operator just moves by hand: an encoder-only readout drives the followers via inverse kinematics, so the same hardware that records demonstrations can also stream live joint commands.
A common method for improving robot policy performance is DAgger: directly collecting interventions when the policy fails or is about to fail. However, collection interfaces for DAgger often rely on active leader arms to match the robot's pose during a rollout. We introduce a method for doing interventions on passive leader arms. To do this, we record the delta of the leader pose from the moment of intervention until the current timestep, add it to the follower's current end-effector, and IK-solve onto the followers to control them. This enables intervention without requiring the leader arm's joint positions to match the follower's exactly.
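The delta mapping can be sketched as follows. This is a minimal illustration under stated assumptions: it handles end-effector positions only (orientation deltas would need rotation composition), it reads "the follower's current end-effector" as the follower pose latched when the intervention begins, and it omits the IK solve. The class and method names are hypothetical.

```python
class PassiveIntervention:
    """Maps leader motion since intervention onset to a follower target."""

    def __init__(self):
        self.leader_start = None
        self.follower_start = None

    def begin(self, leader_pos, follower_pos):
        # Latch both poses at the moment the operator takes over.
        self.leader_start = tuple(leader_pos)
        self.follower_start = tuple(follower_pos)

    def target(self, leader_pos):
        # Follower target = latched follower pose + leader displacement.
        if self.leader_start is None:
            raise RuntimeError("call begin() before target()")
        return tuple(
            f0 + (l - l0)
            for f0, l, l0 in zip(self.follower_start, leader_pos, self.leader_start)
        )

mapper = PassiveIntervention()
# Leader and follower start in different poses; no matching is required.
mapper.begin(leader_pos=(0.5, 0.0, 0.2), follower_pos=(0.1, 0.3, 0.4))
still = mapper.target((0.5, 0.0, 0.2))   # leader has not moved yet
moved = mapper.target((0.6, 0.0, 0.1))   # leader moved +0.1 in x, -0.1 in z
```

Because only the displacement is transferred, the follower does not jump when the operator grabs the leader arms in a mismatched pose; the resulting target would then be passed to an IK solver.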
We use this loop to do DAgger on a hard long-horizon task: folding a cardboard box and closing the lid. We first finetune ABC-DiT on 10 hours of curated single-task box-folding data. This gets to 24% mean progress — the policy understands the task but can't make the fine adjustments needed to get all the way through. After two rounds of DAgger collection (~1–1.5h each, with intervention rates of 30% then 15%) and continued training, mean progress jumps to 85%.
The same recipe trains "pack a student bag", a long-horizon dexterous task where the robot has to unzip a backpack, load multiple objects into it, and zip it back up. With passive-leader DAgger on top of a finetuned base policy, the trained policy executes the full sequence autonomously.
Inference speed
To ensure our models run fast enough, we progressively compile our inference path: starting from eager PyTorch, then separate torch.compile on each block, then a single fullgraph compile, then CUDA graphs on top. Each layer of optimization removes another source of CPU-side overhead.
Fast data loading
Training on thousands of hours of data is a challenge from a dataloading perspective. To support this for our dataset, we release abcdl, our distributed dataloader, alongside the dataset. We encode each episode as a single MP4 (with stacked camera views) plus a binary state/action file. We can get significant speedups and bandwidth reductions by carefully choosing the encoding options, such as having deterministic and frequent keyframe positions.
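To see why deterministic keyframe positions help, consider the sketch below. It assumes keyframes sit at every `keyframe_interval`-th frame starting from frame 0, which is an illustrative assumption rather than the documented abcdl layout; the function name is hypothetical.

```python
def frames_to_decode(frame_idx, keyframe_interval):
    """Frames that must be decoded to reconstruct `frame_idx`.

    Assumes a keyframe at every multiple of `keyframe_interval`, so the
    closest preceding keyframe follows from integer arithmetic alone,
    without scanning the file to build an index.
    """
    if frame_idx < 0:
        raise ValueError("frame_idx must be non-negative")
    if keyframe_interval <= 0:
        raise ValueError("keyframe_interval must be positive")
    keyframe = (frame_idx // keyframe_interval) * keyframe_interval
    return range(keyframe, frame_idx + 1)

# With a keyframe every 16 frames, frame 37 depends on frames 32..37 only.
needed = list(frames_to_decode(37, 16))
```

Under this assumption a random access never decodes more than `keyframe_interval` frames, so a shorter interval trades a somewhat larger file for cheaper seeks.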
By default, torchcodec scans the entire file to build its frame index. Encoding the file with deterministic keyframe positions allows us to compute the index analytically, so we only need to read the file header plus the frames since the closest keyframe. The rest of the file is left untouched, which reduces disk pressure. The decoded frame is three vertically stacked camera views.
Citation
@article{abc2026,
title = {Scalable Behavior Cloning with Open Data, Training, and Evaluation},
author = {Allshire, Arthur and Singh, Himanshu Gaurav and Singh, Ritvik and Rashid, Adam and Choi, Hongsuk and McAllister, David and Yu, Justin and Chen, Yiyuan and Huang, Huang and Abbeel, Pieter and Chen, Xi and Duan, Rocky and Isola, Phillip and Malik, Jitendra and Shentu, Fred and Shi, Guanya and Wu, Philipp and Kanazawa, Angjoo},
year = {2026},
journal = {arXiv preprint},
url = {https://abc.bot/},
}