🎉 🎉 Llama4 and Gemma3 benchmarks! 🎉 🎉

We continually evaluate the latest LLMs to see how they perform at double-entry bookkeeping vs Digits AGL, and our team got right to work after Llama4 was released!

The Evaluation: Classify 17,792 transactions, given the description, dollar amount, and a chart of accounts.

The Results: Surprisingly weak…

Correctly picked the category:
Digits AGL: 93.5%
GPT-4.5-preview: 66.6%
GPT-o1: 66.0%
Claude3.5-sonnet: 63.2%
Claude3.7-sonnet: 62.7%
GPT-4o: 60.6%
O3-mini: 59.8%
O1-mini: 59.4%
Gemini-2.0-flash: 59.1%
Grok-2: 57.9%
DeepSeek-R1: 57.2%
DeepSeek-V3: 56.9%
Llama3.3-70B: 51.3%
Gemma-3-27b: 50.8%
Llama4-scout-17B: 46.4%
GPT-4o-mini: 46.4%

Hallucination Rate (made up a fake category):
Digits AGL: 0.0%
GPT-4.5-preview: 0.17%
GPT-o1: 0.5%
Claude3.5-sonnet: 0.48%
Claude3.7-sonnet: 2.96%
GPT-4o: 0.24%
O3-mini: 5.95%
O1-mini: 0.38%
Gemini-2.0-flash: 0.1%
Grok-2: 0.36%
DeepSeek-R1: 9.44%
DeepSeek-V3: 0.3%
Llama3.3-70B: 0.72%
Gemma-3-27b: 0.65%
Llama4-scout-17B: 0.46%
GPT-4o-mini: 0.99%

Speed (seconds/transaction):
Digits AGL: 0.04
GPT-4.5-preview: 1.58
GPT-o1: 18.71
Claude3.5-sonnet: 1.11
Claude3.7-sonnet: 1.37
GPT-4o: 0.82
O3-mini: 9.84
O1-mini: 4.54
Gemini-2.0-flash: 0.59
Grok-2: 1.0
DeepSeek-R1: 8.16
DeepSeek-V3: 2.02
Llama3.3-70B: 0.87
Gemma-3-27b: 0.62
Llama4-scout-17B: 4.89
GPT-4o-mini: 0.70

While these two new models underperformed the previously benchmarked LLMs, we've observed that reasoning-focused Mixture-of-Experts architectures may reduce hallucination rates, though more testing is needed to confirm this. The relatively low accuracy of both new models could stem from their specialization in code generation or from weaknesses in their reasoning processes.

For more details on our benchmarking process, see the full white paper at https://digits.com/agl
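For readers curious how metrics like these can be computed: a prediction counts as "correct" if it matches the labeled category, and as a "hallucination" if the predicted category string doesn't appear in the chart of accounts at all. Below is a minimal scoring sketch under those assumptions; the function, data, and category names are illustrative only, not Digits' actual harness.

```python
# Illustrative sketch (not Digits' actual evaluation code): score one model's
# predicted categories against gold labels and a chart of accounts.

def score(predictions: list[str], labels: list[str], chart_of_accounts: set[str]) -> dict:
    n = len(predictions)
    # Accuracy: exact match against the labeled category.
    correct = sum(p == l for p, l in zip(predictions, labels))
    # Hallucination: predicted category is not in the chart of accounts at all.
    hallucinated = sum(p not in chart_of_accounts for p in predictions)
    return {"accuracy": correct / n, "hallucination_rate": hallucinated / n}

if __name__ == "__main__":
    coa = {"Meals & Entertainment", "Software Subscriptions", "Travel"}
    preds = ["Travel", "Snacks", "Software Subscriptions"]  # "Snacks" is a made-up category
    gold = ["Travel", "Meals & Entertainment", "Software Subscriptions"]
    print(score(preds, gold, coa))
    # -> {'accuracy': 0.666..., 'hallucination_rate': 0.333...}
```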
Excellent job Jeff
😍
Thanks for sharing these results! It's very interesting to see how the performance metrics compare across all of these different LLMs.
Glad to see it, Jeff!
Great to see. Purpose-built outshines general-purpose.
I can't get over the disclaimer that the version of Llama 4 Maverick that Meta touts in their post is a "customized model to optimize for human preference." Glad to hear the snarky, slangy model tuned to be cool is bad at accounting.