Scalable Behavior Cloning with Open Data, Training, and Evaluation

Arthur Allshire¹*, Himanshu Gaurav Singh¹*, Ritvik Singh¹*, Adam Rashid²*, Hongsuk Choi¹*, David McAllister¹*
Justin Yu¹,⁴, Yiyuan Chen⁴, Huang Huang⁴, Pieter Abbeel¹,³, Xi Chen³, Rocky Duan³, Phillip Isola², Jitendra Malik¹,³, Fred Shentu¹,⁴, Guanya Shi³,⁵, Philipp Wu⁴, Angjoo Kanazawa¹,³

¹UC Berkeley ²MIT ³Amazon FAR ⁴XDOF ⁵Carnegie Mellon University

*Core Contributor; work performed during an internship at Amazon FAR.


Paper · Code · Data

Real-world dataset task grid — 3.5K hours of real-world interaction data across ~130K trajectories.

Simulation rollout — 3.5K hours of real-world interaction data across ~130K trajectories.

Abstract

We introduce ABC, a fully open-source stack for bimanual manipulation with behavior cloning. At its core is the release of ABC-130K, the largest bimanual teleoperation dataset to date, featuring 3,500 hours of data spanning over 130K episodes across nearly 200 diverse tasks. Furthermore, we open-source our accessible hardware setup, training infrastructure, and simulation pipeline. We also release 400 hours of sim-teleop data and provide a co-training recipe that produces correlated simulation and real-world evaluation. This allows researchers to effectively evaluate design choices without deploying on physical robots. We explore various training recipes and compare common architectural choices for Diffusion Transformers (DiT) and Vision-Language-Action (VLA) models, grounding our findings in real-world evaluations whose logs will also be released. The resulting policies successfully execute dexterous tasks such as box folding and extracting credit cards from wallets. By providing a reproducible toolbox, we aim to place researchers on an equal footing, establishing the necessary foundation to learn the ABCs of Behavior Cloning together as a community.

Results

All videos are autonomous, real-time rollouts.

Data

Our dataset, ABC-130K, comprises 134,806 episodes across 195 tasks, totaling 3,553 hours of bimanual manipulation. It spans diverse manipulation primitives such as pick-and-place, folding, handover, insertion, tool use, and assembly. We organize the 195 tasks into 7 primitive categories that capture the dominant contact mode and control strategy. Within each task, we vary the objects and initial configurations. Descriptions of our primitive categories can be found in the Appendix. Episode durations range from ~7 s (put the screwdriver in the bin) to 469 s (folding t-shirt pile and stacking).
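As a quick sanity check on these headline numbers, the short calculation below derives the average episode length and the average number of episodes per task. Only the three totals come from the text above; the derived averages are simple arithmetic, not figures reported by the authors.

```python
# Totals quoted in the dataset description above.
EPISODES = 134_806
TASKS = 195
HOURS = 3_553

# Derived averages (illustrative arithmetic only).
mean_episode_seconds = HOURS * 3600 / EPISODES
mean_episodes_per_task = EPISODES / TASKS

print(f"mean episode length:    {mean_episode_seconds:.1f} s")   # 94.9 s
print(f"mean episodes per task: {mean_episodes_per_task:.1f}")   # 691.3
```

The mean of roughly 95 s sits well inside the quoted range of ~7 s to 469 s, which suggests most episodes are short relative to the longest tasks.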

To explore the full dataset and download episodes, click here.

Pick-and-Place

Organize desk

Organize makeup

Folding

Fold long-sleeve shirt

Fold paper box

Sorting

Sort hair-cutting tools

Sort pills

Tool Use

Lock with key

Paint nails

Model

To work out how to most effectively leverage this data, we study various architectural choices for both Diffusion Transformers and Vision-Language-Action models, two popular kinds of robot policy architectures. We introduce two models, ABC-DiT and ABC-VLA, for which we release open source training code and checkpoints.

ABC-DiT 2B

ABC-DiT is a diffusion transformer paired with a pretrained vision encoder. Its parameter split is unusual: a large 1.93B DiT head with a comparatively small 85.7M DINOv3 backbone. We choose this split because the vision encoder is compute-heavy per parameter (it has to attend over 3 camera images), so increasing the size of the action head is cheaper than increasing the size of the vision encoder. We sweep four DiT sizes (S, B, L, xL) to see how loss changes with head capacity, and find the largest model to be the most compute-efficient at our training scale.

Larger DiTs reach lower training loss, both per step and per FLOP. At a fixed step budget, the loss floor drops monotonically with model size (left). At a fixed compute budget, the larger models still come out ahead (right): scaling the head pays for itself in loss per EFLOP. By 300K steps, DiT-xL reaches a training loss of 0.040 versus 0.059 for DiT-S.

ABC-VLA 4.3B

ABC-VLA pairs a Gemma 3 backbone with 4.3B parameters with a lightweight 45M-parameter action head. This lopsided parameter split opens up a free lunch: each VLM forward pass is expensive, but each diffusion target sampled from it is cheap. Replicating the VLM's hidden states k times in the batch and pairing each copy with an independent (noise, timestep) draw amortizes that one VLM pass across many gradient signals, while the backward pass through the VLM still happens only once. We see lower-variance gradients and faster convergence and, crucially, gradients from the diffusion action loss now flow into the VLM at a higher signal-to-noise ratio.

Doing multiple diffusion draws for the same VLA conditioning significantly lowers training loss.
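The multi-draw idea can be sketched with a toy NumPy forward pass. This is a minimal illustration, not the released training code: the layer sizes, the stand-in `vlm_forward` and `action_head` functions, and the linear noise interpolation with a velocity-style target are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; none of these come from the paper.
B, K, OBS_DIM, HIDDEN_DIM, ACTION_DIM = 4, 8, 16, 32, 14
W_vlm = rng.normal(size=(OBS_DIM, HIDDEN_DIM)) * 0.1
W_head = rng.normal(size=(HIDDEN_DIM + ACTION_DIM + 1, ACTION_DIM)) * 0.1
stats = {"vlm_calls": 0}

def vlm_forward(obs):
    """Stand-in for the expensive VLM pass: (B, OBS_DIM) -> (B, HIDDEN_DIM)."""
    stats["vlm_calls"] += 1
    return np.tanh(obs @ W_vlm)

def action_head(hidden, noisy_actions, t):
    """Stand-in for the cheap action head, conditioned on hidden state and timestep."""
    feats = np.concatenate([hidden, noisy_actions, t[:, None]], axis=1)
    return feats @ W_head

obs = rng.normal(size=(B, OBS_DIM))
actions = rng.normal(size=(B, ACTION_DIM))  # flattened action chunks

hidden = vlm_forward(obs)                    # one expensive pass for the batch
hidden_rep = np.repeat(hidden, K, axis=0)    # (B*K, HIDDEN_DIM), row i repeated K times
actions_rep = np.repeat(actions, K, axis=0)  # aligned with hidden_rep

# Independent (noise, timestep) draw for every replicated row.
noise = rng.normal(size=actions_rep.shape)
t = rng.uniform(size=B * K)

# Assumed noising scheme: linear interpolation with a velocity-style target.
noisy = (1.0 - t[:, None]) * noise + t[:, None] * actions_rep
target = actions_rep - noise

pred = action_head(hidden_rep, noisy, t)
loss = np.mean((pred - target) ** 2)         # averaged over B*K diffusion targets
print(stats["vlm_calls"], pred.shape, float(loss))
```

In an autograd framework, the B*K per-row losses would accumulate into the gradient of `hidden` before a single backward pass through the VLM, which is where the amortization comes from.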

Scaling

To ensure our models effectively leverage our data and compute scale, we study how our models are impacted by varying data and compute.

To investigate the impact of data scaling, we train ABC-DiT on 1K, 3K, and 10K hours. We track two offline signals through training: validation loss on a held-out split and validation action error (L2 distance between generated and ground-truth action chunks). We find that validation loss and validation action error both decrease with data scale.

Validation loss and validation action error both decrease with data scale.
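For concreteness, one way to compute a validation action error is sketched below. The text only says it is an L2 distance between generated and ground-truth action chunks; taking the norm per timestep over the action dimension and then averaging over the horizon and batch is an assumption of this sketch, and the function name `action_error` is hypothetical.

```python
import numpy as np

def action_error(pred_chunks, gt_chunks):
    """Mean per-timestep L2 distance.

    Both inputs are shaped (batch, horizon, action_dim).
    """
    pred = np.asarray(pred_chunks, dtype=float)
    gt = np.asarray(gt_chunks, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {gt.shape}")
    per_step = np.linalg.norm(pred - gt, axis=-1)  # (batch, horizon)
    return float(per_step.mean())

# Toy check: every predicted action is off by the vector (3, 4), whose norm is 5.
gt = np.zeros((2, 3, 2))
pred = np.zeros((2, 3, 2))
pred[..., 0] = 3.0
pred[..., 1] = 4.0
print(action_error(pred, gt))  # 5.0
```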

Single-task finetuning performance improves with more pretraining data.

We conduct real-world evaluations to see how the performance of ABC-DiT and ABC-VLA changes with varying batch size and compute. We find that the VLA benefits from a large batch size. Both architectures consistently benefit from increasing training compute.

More training compute yields better performance.

Sim

Real-world evaluations of policies are essential but slow and expensive. To allow users of our dataset to iterate faster, we built ABC Sim. We collect over 400 hours of teleoperation data in simulation across 20 simulation tasks, and we use these tasks as a cheap proxy for real-world performance during development.

We do simulation in MuJoCo. We also release a Blender re-rendering pipeline that takes any saved trajectory and re-renders it with ray tracing for higher visual fidelity.

Renderer comparison (Blender vs. MuJoCo Warp) for the top, left-wrist, and right-wrist camera views.

We need to know whether our sim-eval is predictive of real-world performance. We evaluate our checkpoints on three matched tasks (throw bottles in bin, load plates in dishrack, turn mugs right-side up) in both sim and real, across 12 checkpoints spanning multiple architectures, batch sizes, and training durations, finding a correlation between sim and real performance.

Sim performance is a faithful predictor of real performance. Each point is one checkpoint. Across 12 checkpoints we get r = 0.85 on strict success and r = 0.91 on task progress — strong enough that users of our dataset can iterate on architecture, batch size, and training duration in sim before doing real-robot evaluations.
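The r values quoted here are correlation coefficients computed across checkpoints. A minimal sketch of that computation follows, assuming Pearson correlation (the text only reports "r"); the success rates in the example are invented for illustration and are not the paper's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-checkpoint success rates (one entry per checkpoint).
sim_success = [0.10, 0.25, 0.40, 0.55, 0.70, 0.80]
real_success = [0.05, 0.30, 0.35, 0.60, 0.65, 0.85]

r = pearson_r(sim_success, real_success)
print(f"r = {r:.2f}")
```

With only 12 checkpoints, a single outlier can move r noticeably, so it is worth inspecting the scatter plot alongside the coefficient.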

Evaluation

For our modelling experiments, we ran over 100 hours of physical-robot evaluations across our architecture, batch-size, dataset, and finetuning ablations. Each task was evaluated for 50 trials with a fixed rubric, and we release the full evaluation data so other researchers can reproduce our protocol.
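To illustrate how per-trial rubric scores could roll up into the two metrics used on this page (strict success and task progress), here is a small sketch. The rubric format is hypothetical: it assumes each trial is scored as an ordered list of pass/fail stages, that progress is the fraction of stages completed, and that strict success requires every stage. The released evaluation data may be structured differently.

```python
def score_trial(stages_done):
    """Return (progress, strict_success) for one trial's list of stage outcomes."""
    if not stages_done:
        raise ValueError("a trial needs at least one rubric stage")
    progress = sum(stages_done) / len(stages_done)
    return progress, all(stages_done)

def summarize(trials):
    """Aggregate trial scores into mean progress and strict success rate."""
    scores = [score_trial(t) for t in trials]
    n = len(scores)
    return {
        "trials": n,
        "mean_progress": sum(p for p, _ in scores) / n,
        "strict_success_rate": sum(1 for _, ok in scores if ok) / n,
    }

# Four hypothetical trials of a four-stage task.
trials = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False, False],
    [True, True, True, True],
]
summary = summarize(trials)
print(summary)  # mean_progress 0.6875, strict_success_rate 0.5
```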

In addition to using our evaluation data to directly compare methods, we look at global correlations across all of our evals. We find that training loss and validation action error are correlated with real-world performance across checkpoints with differing architectures, compute levels, and batch sizes.

Across 16 checkpoints, real-world success correlates strongly with training loss (r = −0.84) and validation action error (r = −0.89), but is essentially uncorrelated with validation loss (r = −0.04).

Infrastructure

We release our full hardware setup, our training and inference code, and our simulation pipeline.

Collecting Interventions with Passive Leader Arms

Most teleoperation rigs use leader arms — kinematically-matched twins of the follower robot — driven by the operator. We use cheap passive arms (a GELLO-style design) that the operator just moves by hand: an encoder-only readout drives the followers via inverse kinematics, so the same hardware that records demonstrations can also stream live joint commands.

An operator teleoperates the bimanual robot with passive leader arms.

A common method for improving robot policy performance is DAgger: directly collecting interventions when the policy fails or is about to fail. However, collection interfaces for DAgger often rely on active leader arms that match the robot's pose during a rollout. We introduce a method for performing interventions with passive leader arms. We record the delta of the leader pose from the moment of intervention until the current timestep, add it to the follower's end-effector pose, and solve IK to command the followers. This enables intervention without requiring the leader arm joint positions to match the followers' exactly.

An operator intervenes in a running policy. During interventions, the follower arms track the leaders' end-effector delta from the point of intervention. This allows intervention data to be collected with leader arms without having to make them active.
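The delta-based mapping can be sketched as follows. This is one reading of the description, with several assumptions: both poses are latched at the moment of intervention, the delta is expressed in a shared world frame, rotations are 3x3 matrices, and the class name `DeltaIntervention` is hypothetical. The IK solve that turns the target pose into joint commands is not shown.

```python
import numpy as np

class DeltaIntervention:
    """Maps passive-leader motion since intervention start to a follower target pose."""

    def __init__(self, leader_pos, leader_rot, follower_pos, follower_rot):
        # Latch both end-effector poses when the operator takes over.
        self.leader_pos0 = np.asarray(leader_pos, dtype=float)
        self.leader_rot0 = np.asarray(leader_rot, dtype=float)
        self.follower_pos0 = np.asarray(follower_pos, dtype=float)
        self.follower_rot0 = np.asarray(follower_rot, dtype=float)

    def target(self, leader_pos, leader_rot):
        """Return the follower (position, rotation) target for the current leader pose."""
        delta_pos = np.asarray(leader_pos, dtype=float) - self.leader_pos0
        # Rotation of the leader since intervention start, in the world frame.
        delta_rot = np.asarray(leader_rot, dtype=float) @ self.leader_rot0.T
        return self.follower_pos0 + delta_pos, delta_rot @ self.follower_rot0

# Leader and follower start in different poses; only the leader's motion matters.
yaw90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
mapper = DeltaIntervention(
    leader_pos=[0.3, 0.0, 0.2], leader_rot=np.eye(3),
    follower_pos=[0.5, 0.1, 0.4], follower_rot=np.eye(3),
)
pos, rot = mapper.target(leader_pos=[0.4, 0.0, 0.2], leader_rot=yaw90)
print(pos)  # follower start shifted by +0.1 in x
```

Because only the change in leader pose is used, the operator can grab the leader arms in whatever configuration they happen to be in when the intervention starts.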

We use this loop to do DAgger on a hard long-horizon task: folding a cardboard box and closing the lid. We first finetune ABC-DiT on 10 hours of curated single-task box-folding data. This gets to 24% mean progress — the policy understands the task but can't make the fine adjustments needed to get all the way through. After two rounds of DAgger collection (~1–1.5h each, with intervention rates of 30% then 15%) and continued training, mean progress jumps to 85%.

The same recipe trains "pack a student bag", a long-horizon dexterous task in which the robot has to unzip a backpack, load multiple objects into it, and zip it back up. With passive-leader DAgger on top of a finetuned base policy, the trained policy executes the full sequence autonomously:

The DAgger-tuned policy autonomously packing a full bag end-to-end.

Inference speed

To ensure our models run fast enough, we progressively compile our inference path: starting from eager PyTorch, then a separate torch.compile on each block, then a single fullgraph compile, and finally CUDA graphs on top. Each layer of optimization removes another source of CPU-side overhead.

ABC-DiT inference at 27.6 Hz on a single 5090.

ABC-VLA inference at 57.2 Hz on a single 5090.
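The rates above translate directly into a per-call latency budget. The sketch below converts them and includes a small, framework-agnostic timing helper; it assumes each reported Hz figure counts one full policy inference call, and the helper names `hz_to_ms` and `measure_hz` are hypothetical rather than part of the released code.

```python
import time

def hz_to_ms(rate_hz):
    """Convert a call rate in Hz to milliseconds per call."""
    return 1000.0 / rate_hz

def measure_hz(step_fn, warmup=5, iters=50):
    """Time repeated calls to step_fn and return the achieved rate in Hz."""
    for _ in range(warmup):  # warm-up calls absorb one-off compile and cache costs
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return iters / elapsed

for name, rate in [("ABC-DiT", 27.6), ("ABC-VLA", 57.2)]:
    print(f"{name}: {hz_to_ms(rate):.1f} ms per call")  # 36.2 ms and 17.5 ms
```

When timing a GPU model this way, the step function should block until the device has finished (for example by calling `torch.cuda.synchronize()` at its end); otherwise the loop measures only kernel launch time.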

Fast data loading

Training on thousands of hours of data is a challenge from a dataloading perspective. To support this for our dataset, we release abcdl, our distributed dataloader, alongside the dataset. We encode each episode as a single MP4 (with stacked camera views) plus a binary state/action file. We can get significant speedups and bandwidth reductions by carefully choosing the encoding options, such as having deterministic and frequent keyframe positions.

Efficient frame access for fast data loading. Encoding keyframes more frequently and in a manner that allows for an analytically reconstructed frame index makes random frame access nearly free. To read one frame from a video, torchcodec's default scans the entire file to build its frame index (top). Correctly encoding the file allows us to compute the index analytically, meaning we only need to read the file header plus frames since the closest keyframe, leaving the rest of the file untouched, reducing disk pressure. The decoded frame is three vertically stacked camera views.
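The analytic frame index can be illustrated with a little arithmetic. This is a sketch under stated assumptions, not abcdl's actual implementation: it assumes a fixed keyframe interval, no frame reordering, and a frame represented as a list of pixel rows; both function names are hypothetical.

```python
def frames_to_decode(frame_index, keyframe_interval):
    """Return (keyframe_index, n_frames) needed to reconstruct frame_index."""
    if frame_index < 0 or keyframe_interval <= 0:
        raise ValueError("frame_index must be >= 0 and keyframe_interval > 0")
    # With keyframes at 0, G, 2G, ... the nearest preceding keyframe is analytic.
    keyframe = (frame_index // keyframe_interval) * keyframe_interval
    return keyframe, frame_index - keyframe + 1

def split_stacked_views(frame_rows, n_views=3):
    """Split a vertically stacked frame into n_views equal-height camera views."""
    if len(frame_rows) % n_views != 0:
        raise ValueError("frame height must be divisible by the number of views")
    height = len(frame_rows) // n_views
    return [frame_rows[i * height:(i + 1) * height] for i in range(n_views)]

# Frame 37 with a keyframe every 10 frames: seek to frame 30, decode 8 frames.
print(frames_to_decode(37, 10))  # (30, 8)

# A toy 6-row frame splits into three 2-row views.
views = split_stacked_views([[r] * 4 for r in range(6)])
print([len(v) for v in views])  # [2, 2, 2]
```

The trade-off is file size versus seek cost: a shorter keyframe interval bounds the worst-case decode work per random access at `keyframe_interval` frames, at the price of more keyframes in the file.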

Citation

@article{abc2026,
  title   = {Scalable Behavior Cloning with Open Data, Training, and Evaluation},
  author  = {Allshire, Arthur and Singh, Himanshu Gaurav and Singh, Ritvik and Rashid, Adam and Choi, Hongsuk and McAllister, David and Yu, Justin and Chen, Yiyuan and Huang, Huang and Abbeel, Pieter and Chen, Xi and Duan, Rocky and Isola, Phillip and Malik, Jitendra and Shentu, Fred and Shi, Guanya and Wu, Philipp and Kanazawa, Angjoo},
  year    = {2026},
  journal = {arXiv preprint},
  url     = {https://abc.bot/},
}

ABC: a yellow dog watches a robot arm carry a blue mouse while a cat rests on ABC blocks under a starry painted sky. A signpost reads 'BC is EZPZ.'

— learning the ABCs of behavior cloning, together —

art by Isabella Yu