Today we're sharing our first research collaboration with Baseten on open-weight legal agents, and the results point to where vertical AI is heading.
Using signal from LAB (our Legal Agent Benchmark of 1,200+ tasks across 24 practice areas), we post-trained a 27B open-weight model and brought it into the closed-source frontier band. Three main takeaways:
1. Open weights unlock cost, governance, and a path to deeper capability. Reaching the top of LAB with frontier models costs about $50 and takes 20+ minutes per task. Open-weight agents can live inside a firm's own secure cloud, expose their reasoning traces for audit, and, with the right post-training pipeline, close the gap on a benchmark where even frontier models complete fewer than 10% of tasks end-to-end.
2. The model and the system around it have to be built together. We designed a "compaction" system that lets agents summarize what they've read so they can keep working on long tasks without losing context. It gave frontier models a 2.6x to 3.7x boost but did nothing for the open-weight model until we trained the model to use it. You need both the model and the system.
3. Smaller models can learn to work like the best ones when you train them on the right examples. With a small amount of training against LAB's expert rubrics, a 9B model stopped relying on keyword search and started reading documents in full - the same approach the top frontier models (Opus, Sonnet, GPT-5.5) use on their own. The quality of your evaluation data shapes how the model behaves.
Full write-up in the comments.