Scalable Behavior Cloning with Open Data, Training, and Evaluation

Arthur Allshire1*, Himanshu Gaurav Singh1*, Ritvik Singh1*, Adam Rashid2*, Hongsuk Choi1*, David McAllister1*
Justin Yu1,4, Yiyuan Chen4, Huang Huang4, Pieter Abbeel1,3, Xi Chen3, Rocky Duan3, Phillip Isola2, Jitendra Malik1,3, Fred Shentu1,4, Guanya Shi3,5, Philipp Wu4, Angjoo Kanazawa1,3

1UC Berkeley 2MIT 3Amazon FAR 4XDOF 5Carnegie Mellon University

*Core Contributor; work performed during an internship at Amazon FAR.

Real-world data: 3.5K hours of real-world interaction data across ~130K trajectories.
Model ablations: sim-eval progress for DiT & VLA, averaged across 3 tasks.
Simulation: 400 hours of simulation data across 20 scenes.
Real evals: >100 hours of real evaluations, with data and rubrics released.
An overview of the ABC stack. We collect real-world and simulated teleoperation data, develop rubrics for evaluating policies, and use them to compare model design choices across compute scales. Data, code, and evaluation logs are released for the community to build on.
Abstract
We introduce ABC, a fully open-source stack for bimanual manipulation with behavior cloning. At its core is the release of ABC-130K, the largest bimanual teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across nearly 200 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation. This allows researchers to effectively evaluate design choices without deploying on physical robots. We explore various training recipes and compare common architectural choices for Diffusion Transformers (DiT) and Vision-Language-Action (VLA) models, grounding our findings in real-world evaluations whose logs will also be released. The resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets. By providing a reproducible toolbox, we aim to place researchers on an equal footing, establishing the necessary foundation to learn the ABCs of Behavior Cloning together as a community.
Results

All videos are autonomous, real-time rollouts.

Data
Our dataset, ABC-130K, comprises 134,806 episodes across 195 tasks, totaling 3,553 hours of bimanual manipulation. It spans diverse manipulation primitives such as pick-and-place, folding, handover, insertion, tool use, and assembly. We organize the 195 tasks into 7 primitive categories that capture the dominant contact mode and control strategy. Within each task, we vary the objects and initial configurations. Descriptions of our primitive categories can be found in the Appendix. Episode durations range from ~7 s (put the screwdriver in the bin) to 469 s (folding t-shirt pile and stacking).
Pick-and-Place
Organize desk
Organize makeup
Folding
Fold long-sleeve shirt
Fold paper box
Sorting
Sort hair-cutting tools
Sort pills
Tool Use
Lock with key
Paint nails
Model
To work out how to most effectively leverage this data, we study various architectural choices for both Diffusion Transformers and Vision-Language-Action models, two popular kinds of robot policy architectures. We introduce two models, ABC-DiT and ABC-VLA, for which we release open source training code and checkpoints.
ABC-DiT 2B
ABC-DiT is a diffusion transformer paired with a pretrained vision encoder. Its parameter split is unusual: a large 1.93B DiT head with a comparatively small 85.7M DINOv3 backbone. We choose this split since the visual encoder is compute-heavy per parameter (it has to attend over 3 camera images) so increasing the size of the action head is cheaper than increasing the size of the vision encoder. We sweep four DiT sizes (S, B, L, xL) to see how loss changes with head capacity, finding the largest model to be the most compute-efficient at our training scale.
[Figure: training loss vs. optimizer steps (left) and vs. cumulative training compute in EFLOPs (right) for DiT-S, DiT-B, DiT-L, and DiT-xL.]
Larger DiTs reach lower training loss — both per step and per FLOP. At a fixed step budget the loss floor drops monotonically with model size (left). And at a fixed compute budget the larger models still come out ahead (right): scaling the head pays for itself in loss per EFLOP. By 300K steps DiT-xL reaches 0.040 vs DiT-S's 0.059.
ABC-VLA 4.3B
ABC-VLA pairs a 4.3B-parameter Gemma 3 backbone with a lightweight 45M-parameter action head. This lopsided parameter split opens up a free lunch: each VLM forward pass is expensive, but each diffusion target sampled from it is cheap. Replicating the VLM's hidden states k times in the batch and pairing each copy with an independent (noise, timestep) draw amortizes that one VLM pass across many gradient signals, while the backward pass through the VLM still happens only once. We see lower-variance gradients and faster convergence and, crucially, gradients from the diffusion action loss now flow into the VLM at a higher signal-to-noise ratio.
[Figure: training loss vs. H200 GPU-hours for 1 draw and 8 draws.]
Doing multiple diffusion draws for the same VLA conditioning significantly lowers training loss.
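The multi-draw idea can be sketched in a few lines of plain Python. This is only an illustration of the batching pattern, not the released training code: the function name `expand_with_draws` is hypothetical, the hidden states are stand-in lists rather than real VLM activations, and the corruption rule (linear interpolation between action and noise) is an assumption, since the text above only says "diffusion".

```python
import random

def expand_with_draws(hidden_states, action_chunks, k, seed=0):
    """Pair every cached conditioning vector with k independent (noise, timestep) draws.

    Assumption: a linear-interpolation corruption, x_t = (1 - t) * action + t * noise,
    with regression target (noise - action). Swap in the real schedule as needed.
    """
    rng = random.Random(seed)
    rows = []
    for hidden, actions in zip(hidden_states, action_chunks):
        for _ in range(k):
            t = rng.random()  # independent timestep in [0, 1)
            noise = [rng.gauss(0.0, 1.0) for _ in actions]
            noisy = [(1.0 - t) * a + t * n for a, n in zip(actions, noise)]
            target = [n - a for a, n in zip(actions, noise)]
            # The same hidden-state object is reused by reference: the expensive
            # backbone output is computed once and shared by all k rows.
            rows.append({"hidden": hidden, "t": t, "noisy": noisy, "target": target})
    return rows

# Two examples, each with a stand-in hidden state and a flattened action chunk.
hidden_states = [[0.1, 0.2], [0.3, 0.4]]
action_chunks = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
rows = expand_with_draws(hidden_states, action_chunks, k=8)
print(len(rows))  # 2 examples * 8 draws = 16 action-head training rows
```

In an autograd framework the k rows would share one backbone forward pass, so their gradients accumulate into the backbone through a single backward pass.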
Scaling
To ensure our models effectively leverage our data and compute scale, we study how our models are impacted by varying data and compute.
To investigate the impact of data scaling, we train our ABC-DiT on 1K, 3K, and 10K hours. We track two offline signals through training: validation loss on a held-out split and validation action error (L2 distance between generated and ground-truth action chunks). We find that validation loss and validation action error both decrease with data scale.
[Figure: validation loss (left) and validation action error (right) vs. optimizer steps for 1K, 3K, and 10K hours of training data.]
Validation loss and validation action error both decrease with data scale.
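As a rough illustration of the validation action error metric, the sketch below computes an L2 distance between a generated and a ground-truth action chunk and averages it over a held-out set. The function names are hypothetical, and how the distance is normalized and aggregated (here: one L2 norm over the whole chunk, then a mean over examples) is an assumption, since the text does not specify it.

```python
import math

def action_chunk_l2(pred_chunk, true_chunk):
    """L2 distance between two action chunks (lists of per-timestep action vectors)."""
    if len(pred_chunk) != len(true_chunk):
        raise ValueError("chunks must have the same number of timesteps")
    squared = 0.0
    for pred, true in zip(pred_chunk, true_chunk):
        if len(pred) != len(true):
            raise ValueError("action vectors must have the same dimension")
        squared += sum((p - t) ** 2 for p, t in zip(pred, true))
    return math.sqrt(squared)

def validation_action_error(pred_chunks, true_chunks):
    """Mean chunk-level L2 distance over a held-out set (aggregation is an assumption)."""
    if not pred_chunks or len(pred_chunks) != len(true_chunks):
        raise ValueError("need the same non-zero number of predicted and true chunks")
    distances = [action_chunk_l2(p, t) for p, t in zip(pred_chunks, true_chunks)]
    return sum(distances) / len(distances)

# Toy chunk: 2 timesteps, 2 action dimensions.
print(action_chunk_l2([[0.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 4.0]]))  # 5.0
```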
[Figure: single-task finetuning mean progress (%) on Sort LEGOs, Bottle caps, Pen caps, Credit cards, and their average, comparing training from scratch, a 3.5K-hour pretrain, and a 7K-hour pretrain.]
Single-task finetuning performance improves with more pretraining data.
We conduct real-world evaluations to see how the performance of ABC-DiT and ABC-VLA changes with varying batch size and compute. We find that the VLA benefits from a large batch size. Both architectures consistently benefit from increasing training compute.
[Figure: left, performance across compute scales: mean progress (%) vs. training FLOPs for DiT and VLA, with points labeled 1.5K, 4.6K, and 9K. Right, real success vs. training metrics: train loss, val loss, and real success (%) vs. training steps.]
More training compute yields better performance.
Sim
Real-world evaluations of policies are essential but slow and expensive. To allow users of our dataset to iterate faster, we built ABC Sim. We collect over 400 hours of teleoperation data in simulation across 20 simulation tasks, and we use these tasks as a cheap proxy for real-world performance during development.
We do simulation in MuJoCo. We also release a Blender re-rendering pipeline that takes any saved trajectory and re-renders it with ray tracing for higher visual fidelity.
[Figure: side-by-side Blender and MuJoCo Warp renders from the top, left-wrist, and right-wrist cameras.]
We need to know whether our sim-eval is predictive of real-world performance. We evaluate our checkpoints on three matched tasks (throw bottles in bin, load plates in dishrack, turn mugs right-side up) in both sim and real, across 12 checkpoints spanning multiple architectures, batch sizes, and training durations, finding a correlation between sim and real performance.
[Figure: sim vs. real performance per checkpoint, with a linear fit and the y = x line. Strict success: r = 0.85, p = 4.2 × 10⁻⁴. Task progress: r = 0.91, p = 5.0 × 10⁻⁵. Legend: DiT 7K real, DiT 3.5K, VLA 3.5K.]
Sim performance is a faithful predictor of real performance. Each point is one checkpoint. Across 12 checkpoints we get r = 0.85 on strict success and r = 0.91 on task progress — strong enough that users of our dataset can iterate on architecture, batch size, and training duration in sim before doing real-robot evaluations.
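For readers who want to run the same kind of check on their own checkpoints, the following is a minimal sketch of the Pearson correlation between per-checkpoint sim and real scores. The numbers in the usage example are made-up placeholders, not the released evaluation results, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_x * var_y)

# Placeholder scores (one entry per checkpoint); NOT the paper's data.
sim_progress = [0.30, 0.45, 0.50, 0.62, 0.70]
real_progress = [0.28, 0.40, 0.55, 0.60, 0.75]
print(f"r = {pearson_r(sim_progress, real_progress):.2f}")
```

With only a dozen checkpoints, a single outlier can move r noticeably, so it is worth inspecting the scatter plot alongside the coefficient.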
Evaluation
For our modelling experiments, we ran over 100 hours of physical-robot evaluations across our architecture, batch-size, dataset, and finetuning ablations. Each task was evaluated for 50 trials with a fixed rubric, and we release the full evaluation data so other researchers can reproduce our protocol.
In addition to using our evaluation data to directly compare methods, we look at global correlations across all of our evals. We find that training loss and validation action error are correlated with real-world performance across checkpoints with differing architectures, compute levels, and batch sizes.
[Figure: real strict success (%) vs. training loss (r = −0.84), vs. validation loss (r = −0.04), and vs. validation action error (r = −0.89), for DiT and VLA checkpoints with a linear fit.]
Across 16 checkpoints, real-world success correlates strongly with training loss (r = −0.84) and validation action error (r = −0.89), but is essentially uncorrelated with validation loss (r = −0.04).
Infrastructure
We release our full hardware setup, our training and inference code, and our simulation pipeline.
Collecting Interventions with Passive Leader Arms
Most teleoperation rigs use leader arms — kinematically-matched twins of the follower robot — driven by the operator. We use cheap passive arms (a GELLO-style design) that the operator just moves by hand: an encoder-only readout drives the followers via inverse kinematics, so the same hardware that records demonstrations can also stream live joint commands.
An operator teleoperates the bimanual robot with passive leader arms.
A common method for improving robot policy performance is DAgger: directly collecting interventions when the policy fails or is about to fail. However, collection interfaces for DAgger often rely on active leader arms to match the robot's pose during a rollout. We introduce a method for doing interventions on passive leader arms. We record the delta of the leader's end-effector pose from the moment of intervention to the current timestep, add it to the follower's end-effector pose at the moment of intervention, and solve IK to obtain joint commands for the followers. This enables intervention without requiring the leader arm's joint positions to match the follower's exactly.
An operator intervenes in a running policy. During interventions, the follower arms match the leaders' end-effector delta from the point of intervention. This method allows intervention data to be collected with leader arms without having to make them active.
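The delta-based takeover can be sketched as follows. This is a simplified illustration under stated assumptions: it handles end-effector position only (orientation would be handled analogously by composing the leader's relative rotation onto the follower's), it assumes the delta is applied to the follower pose latched at the moment of takeover, and the function name `intervention_target` is hypothetical. The IK solve that turns the target into joint commands is left out.

```python
def intervention_target(leader_start, leader_now, follower_start):
    """End-effector position target for the follower during an intervention.

    leader_start:   leader end-effector position when the operator took over
    leader_now:     leader end-effector position at the current timestep
    follower_start: follower end-effector position when the operator took over
    """
    delta = [now - start for now, start in zip(leader_now, leader_start)]
    return [base + d for base, d in zip(follower_start, delta)]

# The leader and follower start in different places; only the leader's motion matters.
leader_start = [0.50, 0.00, 0.20]
follower_start = [0.30, 0.10, 0.40]

# No leader motion yet: the target equals the follower's pose, so there is no jump.
print(intervention_target(leader_start, leader_start, follower_start))

# The leader moved 10 cm in x and 5 cm in z: the follower target moves by the same amount.
print(intervention_target(leader_start, [0.60, 0.00, 0.25], follower_start))
```

Because only the relative motion is transferred, the operator can grab the passive leader in whatever configuration it happens to be in when the policy needs help.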
We use this loop to do DAgger on a hard long-horizon task: folding a cardboard box and closing the lid. We first finetune ABC-DiT on 10 hours of curated single-task box-folding data. This gets to 24% mean progress — the policy understands the task but can't make the fine adjustments needed to get all the way through. After two rounds of DAgger collection (~1–1.5h each, with intervention rates of 30% then 15%) and continued training, mean progress jumps to 85%.
[Figure: box folding, mean progress (%): finetuning only 24%; + 2 rounds of DAgger 85%.]
The same recipe also works for "pack a student bag", a long-horizon dexterous task in which the robot has to unzip a backpack, load multiple objects into it, and zip it back up. With passive-leader DAgger on top of a finetuned base policy, the trained policy executes the full sequence autonomously:
The DAgger-tuned policy autonomously packing a full bag end-to-end.
Inference speed
In order to ensure our models run fast enough, we progressively compile our inference path: starting from eager PyTorch, then separate torch.compile on each block, then a single fullgraph compile, then CUDA graphs on top. Each layer of optimization removes another source of CPU-side overhead.
[Figure: inference trace, ABC-DiT (10 diffusion steps), showing GPU kernels, GPU idle time, and CPU launches.]
Eager: 63.0 ms, 15.9 Hz, 44% GPU
Separate compile: 47.5 ms, 21.0 Hz, 81% GPU
Single graph + autotune: 41.3 ms, 24.2 Hz, 85% GPU
Single graph + autotune + CUDA graphs: 36.3 ms, 27.6 Hz, 99% GPU
ABC-DiT inference at 27.6 Hz on a single 5090.
[Figure: inference trace, ABC-VLA (10 diffusion steps), showing GPU kernels, GPU idle time, and CPU launches.]
Eager: 47.8 ms, 20.9 Hz, 59% GPU
Separate compile: 22.6 ms, 44.2 Hz, 88% GPU
Fullgraph compile: 17.49 ms, 57.2 Hz, 94% GPU
ABC-VLA inference at 57.2 Hz on a single 5090.
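The control rates quoted for each compilation stage follow directly from the per-inference latencies. The short sketch below redoes that arithmetic for the ABC-DiT latencies reported in the inference trace and adds the speedup relative to eager mode. The helper name is hypothetical, and because the published latencies are rounded, a recomputed rate can differ from the published one in the last digit.

```python
def control_rate_hz(latency_ms):
    """Inferences per second for a given per-inference latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms

# Latencies (ms) as reported for ABC-DiT with 10 diffusion steps.
dit_latency_ms = {
    "eager": 63.0,
    "separate compile": 47.5,
    "single graph + autotune": 41.3,
    "single graph + autotune + CUDA graphs": 36.3,
}

eager_ms = dit_latency_ms["eager"]
for stage, ms in dit_latency_ms.items():
    print(f"{stage}: {control_rate_hz(ms):.1f} Hz, {eager_ms / ms:.2f}x vs eager")
```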
Fast data loading
Training on thousands of hours of data is a challenge from a dataloading perspective. To support this for our dataset, we release abcdl, our distributed dataloader, alongside the dataset. We encode each episode as a single MP4 (with stacked camera views) plus a binary state/action file. We can get significant speedups and bandwidth reductions by carefully choosing the encoding options, such as having deterministic and frequent keyframe positions.
[Figure: bytes read when accessing one frame, by byte offset (MB). Naive: whole-file index scan. After constant-keyframe encoding: one GOP decoded; moov / frame index never read.]
Efficient frame access for fast data loading. Encoding keyframes more frequently and in a manner that allows for an analytically reconstructed frame index makes random frame access nearly free. To read one frame from a video, torchcodec's default scans the entire file to build its frame index (top). Correctly encoding the file allows us to compute the index analytically, meaning we only need to read the file header plus frames since the closest keyframe, leaving the rest of the file untouched, reducing disk pressure. The decoded frame is three vertically stacked camera views.
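To make the "analytically reconstructed frame index" concrete, here is a minimal sketch of the arithmetic that becomes possible once keyframes sit at fixed, known positions. It assumes a keyframe every `keyframe_interval` frames starting at frame 0. The interval value and the function name are illustrative, not the settings or API used by abcdl.

```python
def frames_to_decode(frame_idx, keyframe_interval):
    """Return (keyframe_index, frames_decoded) for random access to one frame.

    With a constant keyframe interval, the nearest preceding keyframe is found by
    integer arithmetic alone, without scanning the file to build a frame index.
    """
    if frame_idx < 0:
        raise ValueError("frame_idx must be non-negative")
    if keyframe_interval <= 0:
        raise ValueError("keyframe_interval must be positive")
    keyframe = (frame_idx // keyframe_interval) * keyframe_interval
    # Decoding starts at the keyframe and runs up to and including the target frame.
    return keyframe, frame_idx - keyframe + 1

print(frames_to_decode(37, 16))  # (32, 6): seek to frame 32, decode 6 frames
print(frames_to_decode(32, 16))  # (32, 1): the target is itself a keyframe
```

A shorter interval lowers the worst-case number of frames decoded per access (at most `keyframe_interval`) at the cost of a larger file, which is the trade-off behind choosing frequent keyframes.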
Citation
@article{abc2026,
  title   = {Scalable Behavior Cloning with Open Data, Training, and Evaluation},
  author  = {Allshire, Arthur and Singh, Himanshu Gaurav and Singh, Ritvik and Rashid, Adam and Choi, Hongsuk and McAllister, David and Yu, Justin and Chen, Yiyuan and Huang, Huang and Abbeel, Pieter and Chen, Xi and Duan, Rocky and Isola, Phillip and Malik, Jitendra and Shentu, Fred and Shi, Guanya and Wu, Philipp and Kanazawa, Angjoo},
  year    = {2026},
  journal = {arXiv preprint},
  url     = {https://abc.bot/},
}
ABC: a yellow dog watches a robot arm carry a blue mouse while a cat rests on ABC blocks under a starry painted sky. A signpost reads 'BC is EZPZ.'

— learning the ABCs of behavior cloning, together —

art by Isabella Yu