Claude Opus 4.8 is out, and we've been testing it on some of our real-world trading problems (introduced here: https://bit.ly/4uINgjw). This chart shows how Opus 4.8 scores on the trading internship exam we use at Optiver, across different reasoning-effort settings and relative to previous generations of Claude models. It's exciting to see the continued progress on this exam, particularly at lower effort settings. Congrats to the Anthropic team on the release.
This is actually more positive than it might seem. Token efficiency is what ultimately matters.
Nice chart. Seems like they're focusing on cost efficiency, and no wonder! It would be interesting to see the compute cost of each model at each effort level (e.g. is 'low-reasoning' Opus 4.8 comparable in compute use to 'high-reasoning' Opus 4.7?).
Not much of an improvement from medium complexity onwards. What will that curve look like two versions from now? Do we expect significant improvements for the harder problems as well? What are current models lacking?
Great to see that improvement in the models, Moss. Very helpful for developers as well as the firms themselves. Could you please share those trading problems? I'd really like to try solving them.
Have you tested it with gpt5.5?