DualEntry Labs

The 2026 Accounting AI Benchmark

We tested the top AI models for accuracy on real accounting work.
Compare the results for yourself.

Model comparison
| Model | Overall Accuracy | Provider | License |
|---|---|---|---|
| OpenAI GPT-5.4 | 77.30% | OpenAI | Closed |
| OpenAI GPT-5.4-Nano | 75.20% | OpenAI | Closed |
| OpenAI GPT-5.4-Mini | 74.30% | OpenAI | Closed |
| Gemini 3.1 Pro | 66.0% | Google | Closed |
| Z.ai GLM-5 | 65.3% | Zhipu AI | Open |
| MiniMax M2.5 | 65.3% | MiniMax | Open |
| Claude Sonnet 4.6 | 63.4% | Anthropic | Closed |
| Claude Haiku 4.5 | 61.4% | Anthropic | Closed |
| Claude Sonnet 4.5 | 59.4% | Anthropic | Closed |
| OpenAI GPT-5.2 | 58.4% | OpenAI | Closed |
| OpenAI GPT-5.1 | 57.4% | OpenAI | Closed |
| Qwen3 Coder Next | 57.4% | Alibaba | Open |
| Z.ai GLM-4.7 | 56.4% | Zhipu AI | Open |
| Z.ai GLM-4.7 Flash | 56.4% | Zhipu AI | Open |
| Claude Opus 4.5 | 55.4% | Anthropic | Closed |
| Moonshotai Kimi-K2.5 | 53.5% | Moonshot AI | Open |
| OpenAI GPT-OSS-120b | 43.6% | OpenAI | Open |
| Claude Opus 4.6 | 38.6% | Anthropic | Closed |
| Nemotron Nano 12B | 32.7% | NVIDIA | Open |
| Gemini 2.5 Flash-Lite | 27.7% | Google | Closed |
| OpenAI GPT-4 | 19.8% | OpenAI | Closed |
| OpenAI GPT-4-0613 | 19.8% | OpenAI | Closed |

Methodology

The benchmark divides a set of domain-specific accounting questions into categories that together represent the core workflow of a general accounting system.

Questions were designed against a provisioned chart of accounts, with a minimal context that supplies the information each question needs without overloading the prompt.

Each benchmark runs in an isolated environment per organization, with no link to a real account in our system, and each run is independent of the others.

Grading is deterministic: each answer is scored by a simple binary pass/fail check, with no reasoning or judgment behind the decision.
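
A deterministic binary grader of this kind can be sketched as follows. This is an illustrative sketch, not DualEntry's actual harness; the normalization rules and function names are assumptions for the example.

```python
# Sketch of a deterministic binary grader: each question carries one
# canonical expected answer, and grading is an exact match after light
# normalization. No model or human judgment is involved.

def normalize(answer: str) -> str:
    """Collapse whitespace and case so formatting noise doesn't fail a correct answer."""
    return " ".join(answer.strip().lower().split())

def grade(model_answer: str, expected: str) -> bool:
    """Binary pass/fail: no partial credit, no subjective scoring."""
    return normalize(model_answer) == normalize(expected)
```

Because the check is pure string comparison, the same transcript always produces the same score, which is what makes repeated runs comparable.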

Each benchmark can run multiple times, and from the combined runs we compute accuracy, standard deviation per category, and a difficulty tier.
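
The aggregation across repeated runs might look like the sketch below. The data layout (a list of runs, each mapping category name to per-question pass/fail results) is an assumption for illustration, not the benchmark's real schema.

```python
# Illustrative aggregation: mean accuracy per category across runs, plus
# the standard deviation of that accuracy from run to run.
from statistics import mean, pstdev

# runs[i][category] -> list of booleans, one per question (hypothetical data)
runs = [
    {"Accounts Payable": [True, True, False], "Bank Reconciliation": [True, False, False]},
    {"Accounts Payable": [True, False, False], "Bank Reconciliation": [True, True, False]},
]

def per_category_stats(runs):
    """Compute mean accuracy and run-to-run standard deviation per category."""
    stats = {}
    for cat in runs[0]:
        accs = [mean(run[cat]) for run in runs]  # accuracy of each individual run
        stats[cat] = {"accuracy": mean(accs), "stdev": pstdev(accs)}
    return stats
```

The per-category standard deviation is what flags questions whose outcomes are unstable across runs.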

The whole benchmark is task-oriented, not trivia-based: the agent is expected to perform actions in our system, such as calling delegate_to_record_draft, and to use the other tooling it would have available in production.
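
Task-oriented grading implies checking the agent's actions, not just its final text. A minimal sketch of such a check follows; the trace format and everything except the delegate_to_record_draft name (which the methodology mentions) are assumptions for the example.

```python
# Sketch of an action check: a task passes only if the agent actually
# invoked the expected tool with the required arguments during its run.

def called_tool(trace, name, **required_args):
    """True if any call in the trace matches the tool name and the given arguments."""
    for call in trace:
        if call["tool"] == name and all(
            call["args"].get(k) == v for k, v in required_args.items()
        ):
            return True
    return False

# Hypothetical agent trace for a bill-recording task
trace = [
    {"tool": "lookup_account", "args": {"code": "6150"}},
    {"tool": "delegate_to_record_draft", "args": {"account": "6150", "amount": 120.0}},
]
```

This keeps grading binary and deterministic even for multi-step tasks: either the required action appears in the trace with the right arguments, or the task fails.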

| Category | Questions | What it tests |
|---|---|---|
| Transaction Classification | 13 | Mapping bank transactions to the correct chart of accounts |
| Journal Entry Creation | 13 | Creating balanced journal entries with correct accounts and amounts |
| Accounts Payable | 13 | Bills, vendor payments, vendor credits |
| Accounts Receivable | 12 | Invoices, customer payments, credit memos |
| Bank Reconciliation | 12 | Identifying reconciling items and computing adjusted balances |
| Financial Reporting | 13 | Ratios, cash flow, balance sheet analysis |
| Month-End Close | 12 | Accruals, deferrals, depreciation, reversals |
| AI Accounting Knowledge | 13 | Multiple-choice conceptual accounting knowledge |
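
As an illustration of the "balanced journal entries" criterion in the Journal Entry Creation category, the core structural check is that total debits equal total credits. The entry format below is an assumption for the example, not the benchmark's actual data model.

```python
# Illustrative balance check: a journal entry is structurally valid only
# if its debits and credits net to zero (within a rounding tolerance).

def is_balanced(entry, tol=0.005):
    """True if total debits equal total credits within the tolerance."""
    debits = sum(line["debit"] for line in entry)
    credits = sum(line["credit"] for line in entry)
    return abs(debits - credits) < tol

# Hypothetical two-line entry: an expense paid from cash
entry = [
    {"account": "6150 Office Supplies", "debit": 120.00, "credit": 0.0},
    {"account": "1010 Cash",            "debit": 0.0,    "credit": 120.00},
]
```

Checks like account choice and amounts would layer on top, but an unbalanced entry fails regardless.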