Jeff Seibert’s Post

CEO Digits, Host of Between Two Founders, 100+ Angel Investments

🎉 🎉 Llama 4 and Gemma 3 benchmarks! 🎉 🎉

We continually evaluate the latest LLMs to see how they perform at double-entry bookkeeping vs. Digits AGL, and our team got right to work after Llama 4 was released!

The Evaluation: Classify 17,792 transactions, given the description, dollar amount, and a chart of accounts.

The Results: Surprisingly weak…

Model                Accuracy   Hallucination rate   Speed (s/txn)
Digits AGL             93.5%          0.00%               0.04
GPT-4.5-preview        66.6%          0.17%               1.58
GPT-o1                 66.0%          0.50%              18.71
Claude3.5-sonnet       63.2%          0.48%               1.11
Claude3.7-sonnet       62.7%          2.96%               1.37
GPT-4o                 60.6%          0.24%               0.82
o3-mini                59.8%          5.95%               9.84
o1-mini                59.4%          0.38%               4.54
Gemini-2.0-flash       59.1%          0.10%               0.59
Grok-2                 57.9%          0.36%               1.00
DeepSeek-R1            57.2%          9.44%               8.16
DeepSeek-V3            56.9%          0.30%               2.02
Llama3.3-70B           51.3%          0.72%               0.87
Gemma-3-27b            50.8%          0.65%               0.62
Llama4-scout-17B       46.4%          0.46%               4.89
GPT-4o-mini            46.4%          0.99%               0.70

(Accuracy = correctly picked the category; hallucination rate = made up a fake category not in the chart of accounts.)

While these two new models underperformed previously benchmarked LLMs, we've observed that reasoning-focused Mixture of Experts architectures may reduce hallucination rates, though additional testing is needed to confirm this. The relatively low accuracy of both new models could stem from their specialization in code generation or from weaknesses in their reasoning processes.

For more details on our benchmarking process, see the full white paper at https://digits.com/agl
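For readers curious what an evaluation like this might look like in practice, here is a minimal sketch of the general technique, not Digits' actual harness: prompt a model to pick one category from a chart of accounts, count a hallucination whenever the reply isn't in the chart, and time each call. The prompt wording, sample chart of accounts, model name, and helper functions are all illustrative assumptions; only the three metrics come from the post.

```python
# A minimal benchmark sketch (assumed setup, NOT Digits' actual harness).
# Metrics measured: accuracy, hallucination rate, seconds per transaction.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative chart of accounts; a real one would be much larger.
CHART_OF_ACCOUNTS = [
    "Software Subscriptions",
    "Meals & Entertainment",
    "Office Supplies",
    "Travel",
]

def classify(description: str, amount: float, model: str = "gpt-4o") -> str:
    """Ask the model to pick exactly one category from the chart of accounts."""
    prompt = (
        f"Chart of accounts: {', '.join(CHART_OF_ACCOUNTS)}\n"
        f"Transaction: {description!r}, amount ${amount:.2f}\n"
        "Reply with exactly one category name from the chart of accounts."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def evaluate(transactions, labels, model: str = "gpt-4o"):
    """Return (accuracy, hallucination rate, seconds per transaction)."""
    correct = hallucinated = 0
    start = time.perf_counter()
    for (desc, amount), expected in zip(transactions, labels):
        predicted = classify(desc, amount, model)
        if predicted not in CHART_OF_ACCOUNTS:
            hallucinated += 1  # model invented a category not in the chart
        elif predicted == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    n = len(labels)
    return correct / n, hallucinated / n, elapsed / n

# Example usage (hypothetical data):
# acc, hall, secs = evaluate([("GITHUB.COM", 21.00)], ["Software Subscriptions"])
# print(f"accuracy={acc:.1%}  hallucination={hall:.1%}  {secs:.2f}s/txn")
```

Note that the hallucination check here is a strict string-membership test; a production pipeline would presumably normalize the model's output (casing, punctuation, extra words) before matching it against the chart.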

TR Jordan

Founder/CEO at Tern

3w

I can't get over the disclaimer that the version of Llama 4 Maverick that Meta touts in their post is a "customized model to optimize for human preference." Glad to hear the snarky, slangy model tuned to be cool is bad at accounting.

Christopher Salvato

Senior Staff Engineer, Platforms @ Coinbase

3w

Excellent job, Jeff

Wayne Robins

Product & design leader / Xoogler & xBCG

3w

😍

Mark Mathson

Product Manager at Rightworks | Leading product strategy for growth and innovation

3w

Thanks for sharing these results! It is very interesting to see the performance metrics and outcomes across all of these different LLMs.

Ben Demers

Data-driven digital storyteller. Skilled in breaking complex topics into compelling content that engages audiences and delivers results.

1w

Glad to see it, Jeff!

Deeps Ramanathan

Founder & Exec | CMO | GTM Expert | Startup Advisor | Angel & LP | Google Twitter IBM Degreed YC-Backed EdTech Founder

3w

Great to see. Purpose-built outshines general-purpose.
