DualEntry Labs

The 2026 Accounting AI Benchmark

We tested the top AI models for accuracy on real accounting work.
Compare the results for yourself.

Model comparison
| Model | Overall Accuracy | Provider | License |
|---|---|---|---|
| OpenAI GPT-5.4 | 77.30% | OpenAI | Closed |
| OpenAI GPT-5.4-Nano | 75.20% | OpenAI | Closed |
| OpenAI GPT-5.4-Mini | 74.30% | OpenAI | Closed |
| Gemini 3.1 Pro | 66.0% | Google | Closed |
| Z.ai GLM-5 | 65.3% | Zhipu AI | Open |
| MiniMax M2.5 | 65.3% | MiniMax | Open |
| Claude Sonnet 4.6 | 63.4% | Anthropic | Closed |
| Claude Haiku 4.5 | 61.4% | Anthropic | Closed |
| Claude Sonnet 4.5 | 59.4% | Anthropic | Closed |
| OpenAI GPT-5.2 | 58.4% | OpenAI | Closed |
| OpenAI GPT-5.1 | 57.4% | OpenAI | Closed |
| Qwen3 Coder Next | 57.4% | Alibaba | Open |
| Z.ai GLM-4.7 | 56.4% | Zhipu AI | Open |
| Z.ai GLM-4.7 Flash | 56.4% | Zhipu AI | Open |
| Claude Opus 4.5 | 55.4% | Anthropic | Closed |
| Moonshotai Kimi-K2.5 | 53.5% | Moonshot AI | Open |
| OpenAI GPT-OSS-120b | 43.6% | OpenAI | Open |
| Claude Opus 4.6 | 38.6% | Anthropic | Closed |
| Nemotron Nano 12B | 32.7% | NVIDIA | Open |
| Gemini 2.5 Flash-Lite | 27.7% | Google | Closed |
| OpenAI GPT-4 | 19.8% | OpenAI | Closed |
| OpenAI GPT-4-0613 | 19.8% | OpenAI | Closed |

Methodology

The benchmark divides a set of domain-specific accounting questions into categories that together represent the core workflow of a general accounting system.

Questions were designed against a provisioned chart of accounts, with a minimal context that supplies the information each question needs without overloading the prompt.

Each benchmark runs in an isolated environment per organization, with no link to a real account in our system, and each run is independent of the others.

Grading is deterministic: each answer is scored by a simple binary pass/fail check, with no reasoning or judgment behind the decision.
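
A deterministic binary grader of this kind can be sketched as follows. This is an illustrative sketch, not DualEntry's actual harness; the normalization rules and function names are assumptions for the example.

```python
# Sketch of a deterministic binary grader: each question carries one
# canonical expected answer, and grading is an exact match after light
# normalization. No model or human judgment is involved.

def normalize(answer: str) -> str:
    """Collapse whitespace and case so formatting noise doesn't fail a correct answer."""
    return " ".join(answer.strip().lower().split())

def grade(model_answer: str, expected: str) -> bool:
    """Binary pass/fail: no partial credit, no subjective scoring."""
    return normalize(model_answer) == normalize(expected)
```

Because the check is pure string comparison, the same transcript always produces the same score, which is what makes repeated runs comparable.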

Each benchmark can run multiple times, and from the combined runs we compute accuracy, standard deviation per category, and a difficulty tier.
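
The aggregation across repeated runs might look like the sketch below. The data layout (a list of runs, each mapping category name to per-question pass/fail results) is an assumption for illustration, not the benchmark's real schema.

```python
# Illustrative aggregation: mean accuracy per category across runs, plus
# the standard deviation of that accuracy from run to run.
from statistics import mean, pstdev

# runs[i][category] -> list of booleans, one per question (hypothetical data)
runs = [
    {"Accounts Payable": [True, True, False], "Bank Reconciliation": [True, False, False]},
    {"Accounts Payable": [True, False, False], "Bank Reconciliation": [True, True, False]},
]

def per_category_stats(runs):
    """Compute mean accuracy and run-to-run standard deviation per category."""
    stats = {}
    for cat in runs[0]:
        accs = [mean(run[cat]) for run in runs]  # accuracy of each individual run
        stats[cat] = {"accuracy": mean(accs), "stdev": pstdev(accs)}
    return stats
```

The per-category standard deviation is what flags questions whose outcomes are unstable across runs.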

The whole benchmark is task-oriented, not trivia-based: the agent is expected to perform actions in our system, such as calling delegate_to_record_draft, and to use the other tooling it would have available in production.
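
Task-oriented grading implies checking the agent's actions, not just its final text. A minimal sketch of such a check follows; the trace format and everything except the delegate_to_record_draft name (which the methodology mentions) are assumptions for the example.

```python
# Sketch of an action check: a task passes only if the agent actually
# invoked the expected tool with the required arguments during its run.

def called_tool(trace, name, **required_args):
    """True if any call in the trace matches the tool name and the given arguments."""
    for call in trace:
        if call["tool"] == name and all(
            call["args"].get(k) == v for k, v in required_args.items()
        ):
            return True
    return False

# Hypothetical agent trace for a bill-recording task
trace = [
    {"tool": "lookup_account", "args": {"code": "6150"}},
    {"tool": "delegate_to_record_draft", "args": {"account": "6150", "amount": 120.0}},
]
```

This keeps grading binary and deterministic even for multi-step tasks: either the required action appears in the trace with the right arguments, or the task fails.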

| Category | Questions | What it tests |
|---|---|---|
| Transaction Classification | 13 | Mapping bank transactions to the correct chart of accounts |
| Journal Entry Creation | 13 | Creating balanced journal entries with correct accounts and amounts |
| Accounts Payable | 13 | Bills, vendor payments, vendor credits |
| Accounts Receivable | 12 | Invoices, customer payments, credit memos |
| Bank Reconciliation | 12 | Identifying reconciling items and computing adjusted balances |
| Financial Reporting | 13 | Ratios, cash flow, balance sheet analysis |
| Month-End Close | 12 | Accruals, deferrals, depreciation, reversals |
| AI Accounting Knowledge | 13 | Multiple-choice conceptual accounting knowledge |
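
As an illustration of the "balanced journal entries" criterion in the Journal Entry Creation category, the core structural check is that total debits equal total credits. The entry format below is an assumption for the example, not the benchmark's actual data model.

```python
# Illustrative balance check: a journal entry is structurally valid only
# if its debits and credits net to zero (within a rounding tolerance).

def is_balanced(entry, tol=0.005):
    """True if total debits equal total credits within the tolerance."""
    debits = sum(line["debit"] for line in entry)
    credits = sum(line["credit"] for line in entry)
    return abs(debits - credits) < tol

# Hypothetical two-line entry: an expense paid from cash
entry = [
    {"account": "6150 Office Supplies", "debit": 120.00, "credit": 0.0},
    {"account": "1010 Cash",            "debit": 0.0,    "credit": 120.00},
]
```

Checks like account choice and amounts would layer on top, but an unbalanced entry fails regardless.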