Claude Opus 4.7 Dethrones GPT-5.4 on Accounting AI Benchmark

Santiago Nestares
Co-founder, DualEntry

Santiago is the co-founder of DualEntry. He previously co-founded Benitago, a digital consumer products group that raised $380 million in funding, grew to over 300 team members, and achieved $100M ARR over 8 years before its acquisition in 2024. Santiago has been featured in The Tim Ferriss Show, Forbes, The Wall Street Journal, and more. Originally from Venezuela, Santiago studied Computer Science at Dartmouth before leaving to launch Benitago. At DualEntry, Santiago writes about the future of AI in accounting, ERP modernization, and how finance teams can leverage technology to scale.

Last updated
April 17, 2026


Anthropic released Claude Opus 4.7 today. Within hours, we ran it through the 2026 Accounting AI Benchmark - and it dethroned OpenAI’s GPT-5.4 to take the #1 spot on real-world accounting tasks. Here’s what the data shows.

79.2%: #1 overall accuracy
92%: transaction classification & journal entries
50%: month-end close (weakest category)
10: models tested, from 5 providers

Opus 4.7 already made headlines today for dethroning GPT-5.4 on coding benchmarks, scoring 64.3% on SWE-bench Pro versus GPT-5.4’s 57.7%. We wanted to see if the same held true for real-world accounting work.

It did. Opus 4.7 scored 79.2% overall accuracy across end-to-end accounting tasks, knocking OpenAI’s GPT-5.4 (77.3%) off the top spot it had held since our last update. GPT-5.4-Nano (75.2%) and GPT-5.4-Mini (74.3%) round out the top four, but for the first time, OpenAI doesn’t own the #1 position. The full picture, though, is more nuanced than the headline suggests.

The full leaderboard

We tested 10 models from five providers - Anthropic, OpenAI, Google, MiniMax, and Zhipu AI - across four categories of accounting tasks: transaction classification, journal entries, month-end close, and financial reporting.

Rank Model Accuracy Provider License
1 Claude Opus 4.7 79.2% Anthropic Closed
2 GPT-5.4 77.3% OpenAI Closed
3 GPT-5.4-Nano 75.2% OpenAI Closed
4 GPT-5.4-Mini 74.3% OpenAI Closed
5 GLM-5 72.3% Zhipu AI Open
6 MiniMax M2.7 71.3% MiniMax Open
7 Gemini 3.1 Pro 66.0% Google Closed
8 MiniMax M2.5 65.3% MiniMax Open
9 Claude Sonnet 4.6 63.4% Anthropic Closed
10 Claude Haiku 4.5 61.4% Anthropic Closed

Opus 4.7 beat GPT-5.4 - but both models share the same blind spot

The overall accuracy gap between Opus 4.7 and GPT-5.4 is less than 2 percentage points. Where the data gets interesting is in the task-level breakdown.

Structured tasks: Opus 4.7 hit 92% accuracy on both transaction classification and journal entries. These are well-defined, rules-based tasks where AI can pattern-match effectively: categorizing expenses, generating debit/credit entries from descriptions, and mapping transactions to the right accounts.

Complex workflows: Performance dropped sharply on month-end close (50%) and financial reporting (62%). These tasks require multi-step reasoning, cross-referencing across accounts, applying judgment calls on accruals and adjustments, and producing coherent narrative outputs. Every model on the benchmark struggled here.

This isn’t unique to Opus 4.7. Every model we tested showed the same pattern: strong on structured, repetitive tasks; weak on the complex, judgment-heavy work that accountants actually get paid for.

Open-weight models are surprisingly competitive

Z.ai’s GLM-5 (72.3%) and MiniMax M2.7 (71.3%) both outperformed Google’s Gemini 3.1 Pro (66.0%) and Anthropic’s own smaller models - Claude Sonnet 4.6 (63.4%) and Claude Haiku 4.5 (61.4%).

For accounting firms evaluating AI tools, this means the best model for their use case isn’t automatically the most expensive one.

GPT dethroned, but no model has broken 80%

Opus 4.7 took the crown from GPT-5.4, but it’s worth noting how close the race is at the top. The gap between #1 and #2 is under 2 points. And no model has yet crossed the 80% accuracy threshold. That matters because real-world accounting requires near-perfect accuracy: a model that gets roughly one in five entries wrong isn’t ready to operate without human oversight.

The lead could easily change with the next model update from either side. What won’t change anytime soon is the gap between structured tasks (90%+) and complex workflows (50–62%). That’s the real benchmark to watch.

About the benchmark

The 2026 Accounting AI Benchmark tests models on real accounting tasks, not synthetic reasoning problems. We evaluate accuracy across transaction classification, journal entries, month-end close procedures, and financial reporting - the core workflows that accounting teams perform daily.
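The aggregation behind the leaderboard numbers is simple: each task is graded right or wrong, and scores are the fraction correct per category and overall. A minimal sketch of that aggregation (the `score_model` function and the result-dict fields are illustrative, not our actual evaluation harness):

```python
from collections import defaultdict

def score_model(results):
    """Aggregate graded task results into per-category and overall accuracy.

    `results` is a list of dicts, one per task, e.g.
    {"category": "journal_entries", "correct": True}.
    """
    totals = defaultdict(int)  # tasks seen per category
    hits = defaultdict(int)    # tasks answered correctly per category

    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])

    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_category, overall
```

Note that overall accuracy here is task-weighted, so a category with more tasks pulls harder on the headline number than a simple average of the four category scores would.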

We update the leaderboard as new models are released. Opus 4.7 was benchmarked within hours of its public release today.

Explore the Full Benchmark & Leaderboard →

Interactive results with filtering by model, provider, and task type
