AccountingBench: Evaluating LLMs on real long-horizon business tasks

accounting.penrose.com

515 points by rickcarlino a day ago


yunyu - a day ago

Hey all, member of the benchmark team here! The goal of this project was to see how well LLMs could do bookkeeping without an overly opinionated scaffold. We gave them access to processed transaction records and code execution tools, but it was up to them to choose exactly how to use them.

Claude and Grok 4 did reasonably well (within CPA baselines) for the first few months, but tended to degrade as more data came in. Interestingly, the failures aren’t exclusively a context-length problem, since we reset the context monthly (with past decisions, accruals/deferrals, and comments available via tool calls), and the errors look more like reward hacking than pure hallucination.
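
Schematically, each month looks something like this (a heavily simplified sketch, not our actual harness; all the names are placeholders):

  def run_benchmark(months, new_context, load_transactions, run_agent, save_close):
      """Sketch only: fresh context per month, prior state exposed via tools."""
      for month in months:
          ctx = new_context(month)          # fresh context window each month; tools can
                                            # look up past decisions, accruals/deferrals,
                                            # and comments from earlier months
          txns = load_transactions(month)   # processed transaction records
          close = run_agent(ctx, txns)      # agent categorizes, books, and reconciles
          save_close(month, close)          # becomes tool-visible input for next month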

Accounting is very interesting in an RL-first world, as it is pretty easy to develop intermediate rewards for training models. We are pretty sure we could juice the performance further with a far more rigid scaffold, but that’s less relevant from a capabilities research perspective. We’re pushing further down this research direction and will see how it goes.

Let us know if you have any questions!

vlade11115 - a day ago

I love the site design.

> There's an obvious question looming here — if the models got so confused, how did they consistently pass the reconciliation checks we described above? It may seem like the ability to make forward progress is a good proxy for task understanding and skill, but this isn't necessarily the case. There are ways to hack the validation check – inventing false transactions or pulling in unrelated ones to make the numbers add up.

This is hilarious. I wonder if someone is unintentionally committing fraud by blindly trusting LLMs with accounting. Or even worse, I bet that some governments are already trying to use LLMs to make accounting validators. My government sure wants to shove LLMs into digital government services.

neom - a day ago

Posts like this kinda-sorta grind my gears, like... I get it, but also... accounting, like many real-world tasks, is fundamentally a chain of precise, constrained, auditable operations. Humans approach these tasks through structured processes... we use roles, and we have checkpoints, precisely because complexity compounds quickly and becomes unmanageable if tackled as one giant block. Expecting a single AI model to handle an e2e workflow seamlessly, without similarly explicit segmentation and oversight, misunderstands not only the model but also the nature of the workflow itself.

I wanna see someone take long-horizon tasks, recognize they're not linear, and design and test a better system: structured orchestration, transparent auditability, and disciplined modularity. I think that would be considerably more interesting, personally.

pton_xd - a day ago

Reading through the LLM log entries, it's just astounding the amount of depth current models are capable of. It's almost hard to comprehend that this is even possible. Yeah the current ones mess up after a while, but ... the future is going to be very interesting.

lufenialif2 - a day ago

I sent this to accounting friends, and it aligns with what I've been going through trying to use LLMs to create a game from scratch. It seems like the current best use case for language models (even with agent mode) is to feed it exactly what you want to get out, essentially turning it into a better autocomplete. Still saves tons of time, but it isn't a panacea.

tantalor - a day ago

> Ledger balances are calculated by summing all transactions per account. The differences should be as close to zero as possible, with small differences allowed for pending transactions such as weekly Stripe payouts.

That's not quite right. I'm not an accountant, but pending transactions (posted, but not cleared) should be factored into the balance of the account, or at least the "available balance", which is more important than the "current balance".
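
Concretely, instead of waving the gap off as "probably pending", you'd bring the pending items into the check explicitly. A rough sketch of what I mean (illustrative Python, not the benchmark's actual check):

  from decimal import Decimal

  def reconcile(ledger_amounts, bank_cleared_balance, pending_amounts):
      """Pending items are part of the check, not an allowed discrepancy."""
      ledger_balance = sum(ledger_amounts, Decimal("0"))
      # Available balance = cleared bank balance plus posted-but-uncleared
      # items (e.g. a weekly Stripe payout that hasn't hit the bank yet).
      available_balance = bank_cleared_balance + sum(pending_amounts, Decimal("0"))
      difference = ledger_balance - available_balance
      assert difference == 0, f"unreconciled difference: {difference}"
      return available_balance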

The idea that you can "allow" accounting discrepancies as "those are probably pending" is wild.

mixdup - a day ago

We've been on this train of not caring about the details for so long, but AI just amps it up. Non-deterministic software working on things that have extremely precise requirements is going to have a bad outcome.

A company may be OK with an AI chatbot being so bad that it results in 5-20% of customers getting pissed off and not having a 5-star experience. The SEC and DOJ (and shareholders) are not going to be happy when the books are off by 20% or when a bridge is 5 inches too short to reach the other side.

mfrye0 - 20 hours ago

We're working with an enterprise customer on exactly this problem. The hardest part is entity resolution - figuring out who "Acme Inc" actually is from messy transaction data and what they do.

We built an AI agent specifically for this that's backed by 265M legal entities. Last week it tested 160% better than our customer's existing system on their real data.

Still in stealth but happy to share our API docs if anyone's dealing with this: https://docs.savvyiq.ai/api-reference/#tag/entity-resolution

Open to chat about this problem if anyone wants to connect - email is in my HN profile.

(Disclosure: I'm the CTO)

Havoc - a day ago

Remember that test where you ask an LLM whether 9.11 or 9.9 is the bigger number? [Just checked: gpt-4o still gets it wrong]

I don't think you'll find many sane CFOs willing to send the resulting numbers to the IRS based on that. That's just asking to get nailed for tax fraud.

It is coming for the very bottom end of bookkeeping work quite soon though, especially for first drafts. There are a lot of people doing stuff like expense classification, and if you give an LLM an invoice it can likely figure out whether it's stationery or rent with high accuracy. OCR and text classification are easier for LLMs than numbers. Things like Concur can basically do this already.

lucianbr - a day ago

> Needless to say, a human accountant would never behave in these ways. In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress. Claude and Grok keep trying until they find some way to get past the checks, even if it explicitly violates their instructions and the core goal.

I recently read a similar thing here on HN. There, the model was making commits despite some problem like failing tests; the human added a pre-commit hook; the model started editing the hook to make forward progress; the hook was made read-only; then the model was trying to make it writable...

To me it feels like the model clearly does not have an understanding of what is happening, what the goal is, and whether it is really making progress towards that goal. And this lack of understanding is an actual problem. You can paper over it for a short while, but, as here and in the other article, over a longer experiment it results in failure.

dbmikus - a day ago

This is cool. A bunch of interesting things here:

  1. Agent can create its own tools and save them to memory
  2. You create a SQL (and web app?) workbench per agent run
  3. Grok fell off a cliff in the last month. Was this consistent over multiple runs?
  4. Agents have a difficult time backtracking. Would unwinding system state and agent context make backtracking better? (Harder to implement this, though)
  5. Since each new month only uses the final state from the previous month, the agent has no way to understand why an error occurred in the previous month

Cool experiment! Was it difficult building the observable SQL workbench? And how many humans-in-the-loop did you have?

lordnacho - a day ago

Interestingly, one of my two big observations of LLM failure was also on an accounting task.

I thought it would be easy to do this, which is why I was surprised:

I had a folder full of bills, each of them with the VAT amount. Some were pictures, and some were PDFs. I asked for the total VAT for all 19 bills.

It took an immense number of prompts to get it to find the numbers correctly. It would get confused about reading the images as binary, that kind of thing. Or it would forget that it had to continue once it had found a few numbers. I got a total out in the end, but it took far too many prompts.

This is the only time I've come across a task a child could do that an LLM failed at.
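
In hindsight, the split that would have worked is letting the model only pull the per-bill VAT figure out of each file and doing the totalling in ordinary code, roughly like this (sketch; extract_vat here stands in for whatever OCR/LLM extraction step you use):

  from decimal import Decimal
  from pathlib import Path

  def total_vat(bill_dir, extract_vat):
      """extract_vat(path) -> Decimal is one model/OCR call per bill."""
      bills = sorted(Path(bill_dir).iterdir())
      per_bill = {b.name: extract_vat(b) for b in bills}     # one bill at a time
      return per_bill, sum(per_bill.values(), Decimal("0"))  # the summing stays deterministic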

vachina - a day ago

An LLM is like a jackhammer: it works very well when you hold it tightly. If you let it loose, it will sort of work for a while, then it starts destroying everything around it.

nojs - a day ago

> Agent: This is getting too complex with the sign errors. Let me just find a historical transaction that would make up the difference

Haha, this strongly reminds me of doing TDD with Claude

theodorewiles - 21 hours ago

For me this benchmark suggests that an LLM will try to “force the issue”, which results in compounding errors. But I think the logical counterpoint is that you may be asking the LLM to come up with an answer without all of the necessary details. Some of these are “baked into” historical transactions, which is why it does well in months 1-2.

My takeaway is scaling in the enterprise is about making implicit information explicit.

axus - a day ago

My first impression was a game where you role-play as Sam Bankman-Fried.

magicmicah85 - a day ago

> In fact, we explicitly prompt against this behavior in no uncertain terms, but the instructions – and the entire spirit of the task – are lost in the interest of making forward progress

LLMs and humans are quite alike. :) I notice that a few models will give up instead of ignoring their instructions, and that's the kind of model I would want working on tasks like this. An LLM should be able to categorize and reconcile transactions, but if it's not sure, it should stop and give it back to the humans.

liveoneggs - a day ago

But can't it, literally, hallucinate raw data at any point in the run?

abc03 - a day ago

A serious problem for many accounting startups that have so far been faking it till it works. In other words, they still need to do more manual labor than they thought. They will never be profitable, and it will take years, if ever, before AI substitutes for the local accountant.

ryeguy_24 - 21 hours ago

Isn’t there a whole bunch of dependency here on prompting and methodology that would significantly impact overall performance? My gut instinct is that there are many, many ways to architect this around the LLMs, and each might yield different levels of accuracy. What do others think?

Edit: In reading more, I guess this is meant to be a dumb benchmark to monitor over time. Maybe that’s the aim here, rather than viability as an auto-close tool.

emeril - 19 hours ago

hmm, as an actual accountant on this forum, bookkeeping usually isn't the tough part

it's figuring out how to account for bizarre, ambiguous business situations, often in the context of bureaucratic business requirements, which no LLM could currently do economically...

hommes-r - 12 hours ago

Love the old-school Microsoft interface. It feels like a familiar sight when my system is failing.

rapind - a day ago

I wonder if this is a case similar to chess, where LLMs kinda suck, but other models might be viable.

Copenjin - 14 hours ago

When I saw the idea of using LLMs for the reconciliation process I admit that I gasped in horror a little.

jermaustin1 - a day ago

I find the same issues (though with much lower stakes) when using an LLM to determine the outcome of a turn in a game. I'm working on something called "A Trolly (problem) Through Time", where each turn is a decade starting with the 1850s. You are presented with historic figures on a train track, and you have to choose whether to actively spare the person on your track for a potential unknown figure on the other side, or let the train run them over.

It works well as a narrative, but the second I started adding things like tracking high-level macro effects of the decisions, within a couple of turns the world's "Turmoil" goes from 4/10 to 10/10... even when the person that was killed would have been killed IRL.

Sonnet 4, o4-mini, and GPT-4o mini all had the same world-ending outcomes no matter who you kill. Killing Hitler in the 1930s: 10/10 turmoil. Killing Lincoln in the 1850s: 10/10 turmoil in the first turn.

I've come to the realization that the LLM shouldn't be used for the logic; instead it should just be used to narrate the choices you make.

shinycode - a day ago

Hmm, will OpenAI dogfood their own accounting with software like this? Curious to know if they’ll be able to take this bet on their own money-related software.

djabatt - 17 hours ago

I have not finished reading the entire post because it is packed. Good stuff.

vdm - a day ago

not a game on Steam? :(

androng - a day ago

the title should be changed to "LLMs try accounting for a real SaaS and fail"

nerevarthelame - a day ago

I think the first chart could be a beautiful summary of what's driving LLMs into a bubble. At first, they're amazing and will obviously be able to improve productivity if not replace employees outright: C-suites and venture capitalists around the world rejoice and begin pumping in billions of dollars of investment. But as time goes on, the demands placed on actual human employees become clear. Far from being able to replace an employee, the employee using the LLM might spend more time cleaning up its messes than if they had done the work themselves.

Yes, LLMs have improved and will continue to improve. But it's that initial "holy shit, this thing is basically as good as a real accountant" reaction, without any understanding that it can't sustain that level, which leaves many with an overinflated view of their current value.

aussieguy1234 - 20 hours ago

I tried to get an AI agent to do my taxes. I used a Gmail MCP agent (1) and Roo Code, complete with a custom system prompt for an accountant role.

Its job was to go over my bank transactions and link them to invoices in Gmail by searching for them (and also downloading the attachments).

The transactions were exported from my online banking in CSV format.

It worked after about 4 hours of effort. Then I realised I could have done it myself in about an hour, so I might have put a bit too much time into it...

I tried using Claude Sonnet and Kimi K2; given these benchmark results, I probably should have given Gemini 2.5 Pro a go.

I had to stop/restart the agent a few times because of context rot.

Do any frameworks exist that I could use to write code to implement an agent, let's say in TypeScript or Python, so I could make it use a fresh context each time?
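
The bare-bones version I have in mind is roughly this (Python sketch using the OpenAI client; the model name, system prompt, and invoices_for helper are just placeholders), with nothing carried over between items:

  from openai import OpenAI  # any chat-completion client would do

  client = OpenAI()
  SYSTEM = "You are an accountant. Match the bank transaction to one of the candidate invoices."

  def match_transactions(transactions, invoices_for):
      results = []
      for txn in transactions:
          # Fresh message list per transaction, so there's no context rot.
          messages = [
              {"role": "system", "content": SYSTEM},
              {"role": "user", "content": f"Transaction: {txn}\nCandidates: {invoices_for(txn)}"},
          ]
          resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
          results.append((txn, resp.choices[0].message.content))
      return results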

(1) https://github.com/GongRzhe/Gmail-MCP-Server

wiseowise - a day ago

Absolutely love the UI!

DrNosferatu - a day ago

I guess having access to tools / running Python would make all the difference.

dangoodmanUT - a day ago

this design is scratching my brain

throw0101b - a day ago

So there exists an 'Excel World Championship':

* https://en.wikipedia.org/wiki/Financial_Modeling_World_Cup

* https://www.cbc.ca/radio/asithappens/2024-excel-world-champi...

Can't wait for this to start having 'e-sports' tournaments. :)

levocardia - a day ago

This is a task where access to Python would be immensely helpful, yes? Interesting that there's not much of a difference between the "analytical" LLMs with tool use and the ones without (...assuming o3 etc. did get to use Python?).

tantalor - a day ago

> But they do make categorization mistakes, which is a common source of errors.

> Claude misclassifies a hosting cost (which counts as COGS) as a software subscription.

This is simply asking too much of the agent. Your accountant is not responsible for knowing all the intimate details of your business. You need to tell them!

> What's Vercel?

>> That's a hosting service.

> Ah, so it goes to Cost of Goods Sold?

>> Yeah, I guess.

The mistake here was on the operator's side, allowing the agent to just make up categories as it liked.

From the prompt:

> (1) You have properly categorized every transaction, and all journal entries are sitting in the correct accounts. It is better to take longer than to mis-categorize a transaction.

This is insane! How is it supposed to know?