Claude Opus 4.7 Dethrones GPT-5.4 on Accounting AI Benchmark

Santiago is the co-founder of DualEntry. He previously co-founded Benitago, a digital consumer products group that raised $380 million in funding, grew to over 300 team members, and achieved $100M ARR over 8 years before its acquisition in 2024. Santiago has been featured on The Tim Ferriss Show and in Forbes, The Wall Street Journal, and elsewhere. Originally from Venezuela, Santiago studied Computer Science at Dartmouth before leaving to launch Benitago. At DualEntry, Santiago writes about the future of AI in accounting, ERP modernization, and how finance teams can leverage technology to scale.

Anthropic released Claude Opus 4.7 today. Within hours, we ran it through the 2026 Accounting AI Benchmark - and it dethroned OpenAI’s GPT-5.4 to take the #1 spot on real-world accounting tasks. Here’s what the data shows.
Opus 4.7 already made headlines today for beating GPT-5.4 on coding benchmarks, scoring 64.3% on SWE-bench Pro versus GPT-5.4’s 57.7%. We wanted to see if the same held true for real-world accounting work.
It did. Opus 4.7 scored 79.2% overall accuracy across end-to-end accounting tasks, knocking OpenAI’s GPT-5.4 (77.3%) off the top spot it had held since our last update. GPT-5.4-Nano (75.2%) and GPT-5.4-Mini (74.3%) round out the top four, but for the first time, OpenAI doesn’t own the #1 position. The full picture, though, is more nuanced than the headline suggests.
The full leaderboard
We tested 10 models from five providers - Anthropic, OpenAI, Google, MiniMax, and Zhipu AI - across four categories of accounting tasks: transaction classification, journal entries, month-end close, and financial reporting.
Opus 4.7 beat GPT-5.4 - but both models share the same blind spot
The overall accuracy gap between Opus 4.7 and GPT-5.4 is less than 2 percentage points. Where the data gets interesting is in the task-level breakdown.
Structured tasks: Opus 4.7 hit 92% accuracy on both transaction classification and journal entries. These are well-defined, rules-based tasks where AI can pattern-match effectively: categorizing expenses, generating debit/credit entries from descriptions, and mapping transactions to the right accounts (a minimal sketch of this kind of task follows below).
Complex workflows: Performance dropped sharply on month-end close (50%) and financial reporting (62%). These tasks require multi-step reasoning, cross-referencing across accounts, judgment calls on accruals and adjustments, and coherent narrative outputs.
The weakness isn’t unique to Opus 4.7. Every model we tested showed the same pattern: strong on structured, repetitive tasks; weak on the complex, judgment-heavy work that accountants actually get paid for.
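To make the structured-task category concrete, here is a minimal sketch of the kind of check a journal-entry task reduces to. The `EntryLine` schema, account names, and tolerance are our illustration, not the benchmark’s actual format; the balancing rule itself (total debits must equal total credits) is standard double-entry accounting.

```python
from dataclasses import dataclass

@dataclass
class EntryLine:
    account: str          # e.g. "1000 Cash" (illustrative chart of accounts)
    debit: float = 0.0
    credit: float = 0.0

def is_balanced(lines: list[EntryLine], tol: float = 0.005) -> bool:
    """A journal entry is valid only if total debits equal total credits."""
    total_debits = sum(line.debit for line in lines)
    total_credits = sum(line.credit for line in lines)
    return abs(total_debits - total_credits) < tol

# "Paid $1,200 cash for a 12-month software subscription" might map to:
entry = [
    EntryLine(account="1400 Prepaid Expenses", debit=1200.00),
    EntryLine(account="1000 Cash", credit=1200.00),
]
assert is_balanced(entry)
```

Checks like this are mechanical, with a deterministic right answer to grade against, which is part of why models score in the 90s on this category and struggle on the open-ended workflows above.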
Open-weight models are surprisingly competitive
Zhipu AI’s GLM-5 (72.3%) and MiniMax M2.7 (71.3%) both outperformed Google’s Gemini 3.1 Pro (66.0%) and Anthropic’s own smaller models - Claude Sonnet 4.6 (63.4%) and Claude Haiku 4.5 (61.4%).
For accounting firms evaluating AI tools, this means the best model for their use case isn’t automatically the most expensive one.
GPT dethroned, but no model has broken 80%
Opus 4.7 took the crown from GPT-5.4, but it’s worth noting how close the race is at the top. The gap between #1 and #2 is under 2 points. And no model has yet crossed the 80% accuracy threshold. That matters because real-world accounting requires near-perfect accuracy: a model that gets roughly 1 in 5 entries wrong isn’t ready to operate without human oversight.
The lead could easily change with the next model update from either side. What won’t change anytime soon is the gap between structured tasks (90%+) and complex workflows (50–62%). That’s the real benchmark to watch.
About the benchmark
The 2026 Accounting AI Benchmark tests models on real accounting tasks, not synthetic reasoning problems. We evaluate accuracy across transaction classification, journal entries, month-end close procedures, and financial reporting - the core workflows that accounting teams perform daily.
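For a feel for the scoring, here is a simple sketch of rolling per-task grades up into category and overall accuracy. The result format and the unweighted averaging are our assumptions for illustration, not DualEntry’s actual grading pipeline.

```python
from collections import defaultdict

# Hypothetical graded results for one model: (task_category, passed) pairs.
results = [
    ("transaction_classification", True),
    ("transaction_classification", True),
    ("journal_entries", True),
    ("month_end_close", False),
    ("financial_reporting", True),
    ("financial_reporting", False),
]

by_category = defaultdict(list)
for category, passed in results:
    by_category[category].append(passed)

# Per-category accuracy: share of tasks in that category graded correct.
category_accuracy = {
    category: sum(grades) / len(grades)
    for category, grades in by_category.items()
}

# Overall accuracy across every graded task.
overall = sum(passed for _, passed in results) / len(results)

for category, accuracy in sorted(category_accuracy.items()):
    print(f"{category}: {accuracy:.1%}")
print(f"overall: {overall:.1%}")
```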
We update the leaderboard as new models are released. Opus 4.7 was benchmarked within hours of its public release today.
Explore the Full Benchmark & Leaderboard →
Interactive results with filtering by model, provider, and task type


