Hallucination tolerance by workflow

A client emailed us a few weeks ago to ask whether they should be running supplier invoices through ChatGPT to extract the line items. The honest answer is "it depends on what happens when it gets one wrong", and that question turns out to be the only one that matters when you are working out whether to put AI on a workflow. Every workflow has a tolerance for mistakes, and matching the workflow's tolerance to the model's reliability is the whole job.

The risk banding below is what we have ended up using on the discovery call. It is rough by design; a useful frame beats a precise one when the model output is itself fuzzy.

Band one: mistakes cost you nothing

Workflows where the output is reviewed by a human before it goes anywhere, where the AI is a first-pass tool, and where a wrong answer means the human edits it rather than the business losing money.

Meeting summaries. The transcript is the source of truth. If the summary misses a point, the user adds it back. If it adds a point the meeting did not cover, the user removes it. Worst-case cost: a minute of editing.

Drafting internal communications. An all-hands email, a project update, an internal Slack message. A human reads it before it is sent. The AI does about 60% of the work by drafting and the human does the other 40% by editing, and the editing is the quality gate.

First-pass searching across documents. "Find me the section in the policy about overtime." If the AI returns the wrong section, the user spots it immediately because they can read. Worst-case cost: thirty seconds.

Most of the AI value we see at SMEs lives in this band. It is also the safest band to start in, which is why pilots that begin here tend to land cleanly.

Band two: mistakes cost you minutes, sometimes hours

Workflows where the output goes to an internal audience that will not necessarily catch the mistake immediately, but where a wrong answer is recoverable.

Drafting customer-facing emails for internal review. The sales team uses AI to draft outreach, marketing uses it to draft newsletters, support uses it to draft tier-one replies. A human still signs off, but the review may be lighter than it should be, and a confidently wrong line can ship before anyone notices. Worst-case cost: a customer asks about something the email got wrong, and somebody on the team spends an hour unwinding it.

Summarising long reports for the senior leadership team (SLT). If the summary misrepresents the conclusion of a 60-page document and the SLT reads only the summary, the decision they make is wrong. The cost is the time it takes to spot the error and revisit the decision, which is usually a meeting.

Code suggestions for internal tooling. If a developer is using AI to write a script that runs internally, a wrong answer breaks the script and the developer fixes it. Recoverable, but the recovery time is non-trivial.

This band needs lightweight controls. A second pair of eyes on customer-facing copy. A "summary of what changed" note from the AI itself. Spot-checks of the source against the output, weekly rather than per-item. The aim is not zero-defect; it is a defect rate the business can absorb.
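The weekly spot-check can be as simple as pulling a random sample of the week's outputs and counting how many fail when compared with the source. The sketch below is a minimal illustration under assumptions of our own: the function names, the sample size of 20 and the pass/fail record are hypothetical, not a recommended standard.

```python
import random


def weekly_sample(item_ids, sample_size=20, seed=None):
    """Pick a random subset of this week's AI outputs for a human to check.

    sample_size=20 is an illustrative default, not a recommendation.
    """
    rng = random.Random(seed)
    items = list(item_ids)
    # Never ask for more items than exist this week.
    k = min(sample_size, len(items))
    return rng.sample(items, k)


def observed_defect_rate(review_results):
    """review_results maps item id -> True if the output matched its source."""
    if not review_results:
        return 0.0
    failures = sum(1 for ok in review_results.values() if not ok)
    return failures / len(review_results)
```

The number that comes out is only the defect rate in the sample. Whether it is a rate the business can absorb is still a judgment for the business to make.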

Band three: mistakes cost real money

Workflows where a wrong answer leads to a financial loss, a regulatory issue, or a customer-facing commitment the business has to honour.

Invoice extraction and posting. AI reads a supplier invoice, extracts the line items, posts them to the accounts package. The 80% case works. The 20% case (credit notes, multi-currency, multi-line splits, items where the VAT treatment is non-standard) is where the cost lands. We have seen this one quoted as a £15,000 problem when an SME let it run unsupervised for three months and the bookkeeper had to unwind every entry by hand.

Contract drafting with no legal review. An AI generates a service agreement that a non-lawyer signs and sends. The clause the AI made up that sounds plausibly legal turns out to be unenforceable or to commit the business to something it did not mean. The cost is in the lost dispute or the lost engagement when the customer's counsel reads it.

Quote generation without a human price check. AI drafts a customer quote based on a description of the work. The quote is 30% too low. The business is now committed to the price.

These are not workflows AI cannot help with. They are workflows where AI must be a draft tool and the human must be the gate. The cost of removing the human gate is too high to be recovered from the productivity gain.
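For invoice extraction, "AI drafts, human gates" might look something like the sketch below: nothing posts without a named approver, and the awkward cases listed above are flagged so the reviewer knows where to look hardest. Everything here is hypothetical. The `ExtractedInvoice` fields, the flag rules and `post_if_approved` are made up for illustration, the line totals are assumed to be gross amounts that should sum to the invoice total, and a real accounts package will have its own API.

```python
from dataclasses import dataclass


@dataclass
class ExtractedInvoice:
    # Hypothetical shape of what the model returns for one invoice.
    supplier: str
    currency: str
    total: float            # invoice total as extracted
    line_totals: list       # gross amount per line, as extracted
    is_credit_note: bool = False
    nonstandard_vat: bool = False


def review_flags(inv, home_currency="GBP"):
    """Return the reasons this invoice needs a closer look from the reviewer."""
    flags = []
    if inv.is_credit_note:
        flags.append("credit note")
    if inv.currency != home_currency:
        flags.append("multi-currency")
    if inv.nonstandard_vat:
        flags.append("non-standard VAT")
    # Small tolerance so float rounding does not raise a false flag.
    if abs(sum(inv.line_totals) - inv.total) > 0.005:
        flags.append("lines do not sum to total")
    return flags


def post_if_approved(inv, approved_by):
    """Refuse to post unless a named human has approved the entry."""
    if not approved_by:
        raise PermissionError("a named human must approve before posting")
    # Stand-in for the call to the accounts package.
    return {"supplier": inv.supplier, "total": inv.total, "approved_by": approved_by}
```

The flags do not replace the review. Every invoice still goes past a person, and the flags only tell that person which ones deserve more than a glance.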

Band four: do not do this yet

Workflows where a single wrong answer is materially expensive and the human gate is not feasible, either because volume is too high or because the failure happens in real time.

Autonomous customer-service chatbots without a human escalation path. The bot answers product questions, makes product commitments, quotes prices. Volume is too high for review. A confidently wrong answer ships before anyone sees it. We have not seen this work at any SME we have spoken to, and we have spoken to a few that tried.

Financial calculation without a downstream check. AI calculates depreciation, runs cash-flow forecasts, prices a deal. The output is used as the source figure for a decision. A model that hallucinates 1% of the time is unacceptable here, because every figure feeds a decision and nobody is checking which 1% is wrong.
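A quick worked figure shows why 1% is worse than it sounds. Assume, purely for illustration, that each figure is wrong independently with the same probability. The chance that a set of figures contains at least one error is then one minus the chance that all of them are right. The function name below is our own.

```python
def chance_of_at_least_one_error(per_figure_error_rate: float, figures: int) -> float:
    """Probability that at least one of `figures` outputs is wrong,
    assuming independent errors at the same rate (a simplifying assumption)."""
    return 1 - (1 - per_figure_error_rate) ** figures


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(n, round(chance_of_at_least_one_error(0.01, n), 3))
```

At a 1% error rate, a forecast built from 10 figures has roughly a 10% chance of containing a wrong one. At 50 figures it is about 39%, and at 100 figures about 63%.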

Compliance reporting. A model summarising regulatory submissions or generating subject-access request (SAR) responses. The cost of a wrong answer is the regulatory exposure, which can run to six figures even for SMEs in regulated sectors.

Band-four workflows are not permanently off the table. They are off the table until either model reliability improves materially or the human gate becomes economically viable at the volume in question.

How to use the frame

Three steps on the call:

1. List the workflows. Not "AI in general"; the specific things the team does today that AI might change. Drafting outreach, summarising calls, extracting invoice data, drafting contracts, answering customer queries.
2. Band each one. Where does a wrong answer land? Recoverable in seconds, in minutes, in money, or not at all?
3. Build the pilot list from bands one and two. Stay out of bands three and four for the first 90 days. Revisit band three with human gates once the team has a feel for the model's reliability on real work.
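The banding step can be sketched in a few lines of Python. This is an illustration of the frame, not a tool we ship: the `Workflow` fields, the `band` function and the three yes/no questions are hypothetical simplifications of the four band definitions above, and real workflows will need more judgment than three booleans.

```python
from dataclasses import dataclass
from enum import IntEnum


class Band(IntEnum):
    NOTHING = 1   # a human catches and edits the mistake straight away
    MINUTES = 2   # recoverable, but costs minutes or hours
    MONEY = 3     # financial, regulatory or contractual cost
    NOT_YET = 4   # expensive, and no feasible human gate


@dataclass
class Workflow:
    name: str
    caught_immediately: bool   # will the reader spot a mistake straight away?
    costs_real_money: bool     # could one wrong answer cause a real loss?
    gate_feasible: bool        # could a human realistically review every output?


def band(w: Workflow) -> Band:
    """Assign a risk band from three yes/no answers (a deliberate simplification)."""
    if w.costs_real_money:
        return Band.MONEY if w.gate_feasible else Band.NOT_YET
    return Band.NOTHING if w.caught_immediately else Band.MINUTES


def pilot_list(workflows):
    """Keep only band-one and band-two workflows for the first pilot."""
    return [w.name for w in workflows if band(w) <= Band.MINUTES]
```

Fed four of the examples from this post (meeting summaries, SLT report summaries, invoice posting, an autonomous chatbot), `pilot_list` would keep only the first two, provided the three questions are answered the way the band descriptions suggest.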

Where this lands with us

The risk band is the first slide of every AI assessment we run. It does two useful things at once: it sets a realistic expectation for what AI is going to do in the first quarter, and it gives the policy a backbone, because the acceptable-use policy that comes out the other side maps directly to the bands.

The mistake we see SMEs make is not running an unsafe workflow once. It is running a band-three workflow as if it were band one, because the band-one rollout went well and the team forgot the bands are not the same shape.

Thinking about which workflows to put AI on next and want a second opinion on the risk shape? Drop us a note at info@jmopartners.co.uk.

JMO|Partners · Enterprise IT, sized for SMEs.
