🎉 🎉 Llama4 and Gemma3 benchmarks! 🎉 🎉

We continually evaluate the latest LLMs to see how they perform at double-entry bookkeeping vs Digits AGL, and our team got right to work after Llama4 was released!

The Evaluation: Classify 17,792 transactions, given the description, dollar amount, and a chart of accounts.

The Results: Surprisingly weak…

Correctly picked the category:
Digits AGL: 93.5%
GPT-4.5-preview: 66.6%
GPT-o1: 66.0%
Claude3.5-sonnet: 63.2%
Claude3.7-sonnet: 62.7%
GPT-4o: 60.6%
O3-mini: 59.8%
O1-mini: 59.4%
Gemini-2.0-flash: 59.1%
Grok-2: 57.9%
DeepSeek-R1: 57.2%
DeepSeek-V3: 56.9%
Llama3.3-70B: 51.3%
Gemma-3-27b: 50.8%
Llama4-scout-17B: 46.4%
GPT-4o-mini: 46.4%

Hallucination Rate (made up a fake category):
Digits AGL: 0.0%
GPT-4.5-preview: 0.17%
GPT-o1: 0.5%
Claude3.5-sonnet: 0.48%
Claude3.7-sonnet: 2.96%
GPT-4o: 0.24%
O3-mini: 5.95%
O1-mini: 0.38%
Gemini-2.0-flash: 0.1%
Grok-2: 0.36%
DeepSeek-R1: 9.44%
DeepSeek-V3: 0.3%
Llama3.3-70B: 0.72%
Gemma-3-27b: 0.65%
Llama4-scout-17B: 0.46%
GPT-4o-mini: 0.99%

Speed (seconds/transaction):
Digits AGL: 0.04
GPT-4.5-preview: 1.58
GPT-o1: 18.71
Claude3.5-sonnet: 1.11
Claude3.7-sonnet: 1.37
GPT-4o: 0.82
O3-mini: 9.84
O1-mini: 4.54
Gemini-2.0-flash: 0.59
Grok-2: 1.0
DeepSeek-R1: 8.16
DeepSeek-V3: 2.02
Llama3.3-70B: 0.87
Gemma-3-27b: 0.62
Llama4-scout-17B: 4.89
GPT-4o-mini: 0.70

While these two new models underperformed the previously benchmarked LLMs, we've observed that reasoning-focused Mixture-of-Experts architectures may reduce hallucination rates, though more testing is needed to confirm this. The relatively low accuracy of both new models could stem from their specialization in code generation or from weaknesses in their reasoning processes.

For more details on our benchmarking process, see the full white paper at https://digits.com/agl
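For readers curious how metrics like these can be computed: a prediction counts as "correct" if it matches the labeled category, and as a "hallucination" if the predicted category string doesn't appear in the chart of accounts at all. Below is a minimal scoring sketch under those assumptions; the function, data, and category names are illustrative only, not Digits' actual harness.

```python
# Illustrative sketch (not Digits' actual evaluation code): score one model's
# predicted categories against gold labels and a chart of accounts.

def score(predictions: list[str], labels: list[str], chart_of_accounts: set[str]) -> dict:
    n = len(predictions)
    # Accuracy: exact match against the labeled category.
    correct = sum(p == l for p, l in zip(predictions, labels))
    # Hallucination: predicted category is not in the chart of accounts at all.
    hallucinated = sum(p not in chart_of_accounts for p in predictions)
    return {"accuracy": correct / n, "hallucination_rate": hallucinated / n}

if __name__ == "__main__":
    coa = {"Meals & Entertainment", "Software Subscriptions", "Travel"}
    preds = ["Travel", "Snacks", "Software Subscriptions"]  # "Snacks" is a made-up category
    gold = ["Travel", "Meals & Entertainment", "Software Subscriptions"]
    print(score(preds, gold, coa))
    # -> {'accuracy': 0.666..., 'hallucination_rate': 0.333...}
```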
Excellent job Jeff
😍
Thanks for sharing these results! It's very interesting to see how the performance metrics compare across all of these different LLMs.
Glad to see it, Jeff!
Great to see. Purpose-built outshines general-purpose.
I can't get over the disclaimer that the version of Llama 4 Maverick that Meta touts in their post is a "customized model to optimize for human preference." Glad to hear the snarky, slangy model tuned to be cool is bad at accounting.