McDonald's ended the IBM drive-thru AI. It lacked a number, not technology.

Forja · Forward-deployed engineering · May 25, 2026

In 2021, McDonald’s and IBM started testing voice ordering in the drive-thru. In June 2024, McDonald’s ended the test.

It became a meme before it became a case study. Clips went around of the system piling more than 200 McNuggets onto an order on its own, of bacon added to an ice cream, of a bill running past two hundred dollars.

The easy reading is simple: the AI wasn’t ready. That reading is wrong. The test ended, but not for the reason you’d guess.

Let’s take the facts in order.

What was announced

In 2019, McDonald’s bought Apprente, a voice-recognition company, and built an internal tech lab. In 2021, it sold that lab to IBM and signed a partnership to develop automated voice ordering at the drive-thru.

The system went into more than 100 US restaurants. The promise was operational, not technical: take the crew member off the order-taking step and hand that time back to the kitchen and the pickup window.

Removing that position at peak is what changes the math. During a rush, the person taking the order is the bottleneck, and every extra second at the window becomes a car backed up onto the street.

By business-press accounts, the system reached roughly 85% of orders without a person stepping in. That’s a high number for a hard problem. A drive-thru is engine noise, accents, a kid in the back seat, and a menu with thousands of combinations.

The viral errors were a symptom, not the cause. Every order taker has bad days. The question was never whether the machine made mistakes. It was whether it made few enough to change the staffing.

What changed in the public record

In June 2024, McDonald’s told franchisees it would pull IBM’s technology from the test restaurants by the end of July. The company said it would look at voice ordering again with other partners and decide its next steps by year-end.

Notice what that statement does not say. It doesn’t say the AI failed. It says the test is over.

Three years of piloting. More than 100 restaurants. And the ending arrived with no public number that said “passed” or “didn’t pass.”

The omission that explains it

Here’s the part that matters, and it’s one thing.

The business case for drive-thru voice ordering is removing or redeploying the order taker. That’s the gain. Not the AI being clever. The order-taking position ceasing to exist.

Through all three years of the test, a crew member kept confirming each order on screen. The human never left the loop.

Do the rough math. If voice handles 85% and a person still confirms 100%, payroll didn’t move. On some shifts, the on-screen confirmation even adds a step.

When the human stays to catch the 15% the machine misses, you haven’t traded one cost for another. You’ve stacked a new system on top of a salary that’s still there.
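The rough math above fits in a few lines. This is a minimal sketch, not a model of McDonald’s staffing: `human_touch_share` and `confirm_share` are names made up for illustration, and the only figure taken from the post is the 85%.

```python
def human_touch_share(voice_share: float, confirm_share: float) -> float:
    """Fraction of all orders that a crew member still handles or confirms.

    voice_share:   fraction of orders voice completes without a person stepping in
    confirm_share: fraction of those voice orders a person still confirms on screen
    """
    manual = 1.0 - voice_share               # orders voice could not finish alone
    confirmed = voice_share * confirm_share  # voice orders a person still confirms
    return manual + confirmed


# Pilot as described in the post: voice handles 85%, a person confirms every order.
pilot = human_touch_share(voice_share=0.85, confirm_share=1.0)

# Hypothetical counterfactual: same voice share, nobody confirms voice orders.
no_confirmation = human_touch_share(voice_share=0.85, confirm_share=0.0)

print(f"pilot: a person touches {pilot:.0%} of orders")
print(f"no confirmation: a person touches {no_confirmation:.0%} of orders")
```

Payroll follows staffed positions, not individual orders. As long as the first figure stays at 100%, the position is still on the schedule, whatever the voice share says.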

McDonald’s didn’t fail at building the AI. It failed to define, before the pilot, the number that would let the human leave.

Without that number, the test had no finish line. It measured “is it improving” for three years instead of “did it cross the point that removes the person.” Improvement with no defined end isn’t progress. It’s cost parked in the middle.
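The gap between those two questions is easy to show in code. A minimal sketch follows, assuming invented quarterly accuracy figures and an invented 95% threshold; none of these numbers are McDonald’s data, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def is_improving(history: Sequence[float]) -> bool:
    """True if every review period beats the one before it."""
    return all(later > earlier for earlier, later in zip(history, history[1:]))


def first_crossing(history: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first period at or above the threshold, or None if it never gets there."""
    return next((i for i, value in enumerate(history) if value >= threshold), None)


# Invented peak-hour accuracy per quarter, for illustration only.
quarterly_accuracy = [0.70, 0.74, 0.78, 0.81, 0.83, 0.85]
HYPOTHETICAL_THRESHOLD = 0.95  # assumption: the accuracy that would let the person leave

print("improving every quarter:", is_improving(quarterly_accuracy))
print("quarter that crossed the threshold:", first_crossing(quarterly_accuracy, HYPOTHETICAL_THRESHOLD))
```

The first check can stay true for as long as the pilot runs. Only the second one can end it, and it can’t be written at all until someone agrees on the threshold.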

Klarna skipped a similar boundary in customer service: when the AI should stop and hand off to a human. The drive-thru had the opposite boundary. When the human could stop. Neither was written down at the start.

The criterion that would have caught it

If your operation is weighing automation that replaces a position (self-checkout, shelf vision, ordering over WhatsApp, voice at the counter), three numbers have to exist before the first store:

The autonomy threshold. The accuracy, at real peak hour and not on average, where the human leaves the position. Skip the peak and the number lies to you.
The intervention ceiling. The human-correction rate above which the pilot is a cost, not a saving. If the person’s correction rate runs above that ceiling, the project has already answered “no.”
The kill-or-ship date. The pilot does not run forever. It has a deadline to become production or be switched off, set before it starts.
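The three numbers can be written down as a small decision rule. What follows is a sketch under stated assumptions, not a method McDonald’s or IBM used: `PilotCriteria`, `HourStats`, and `decide` are hypothetical names, the peak hour is taken to be the hour with the most orders, and a breach of the ceiling is read as an immediate “no.” A real pilot might allow a ramp-up window before the ceiling applies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class PilotCriteria:
    autonomy_threshold: float    # peak-hour accuracy at which the human leaves the position
    intervention_ceiling: float  # human-correction rate above which the pilot is a cost
    kill_or_ship_date: date      # deadline to become production or be switched off


@dataclass(frozen=True)
class HourStats:
    orders: int              # orders taken in this hour
    correct_unassisted: int  # orders completed correctly with no human step
    human_corrections: int   # orders a person had to correct


def decide(criteria: PilotCriteria, hours: Sequence[HourStats], today: date) -> str:
    """Return 'ship', 'kill', or 'continue' for one review of the pilot."""
    total_orders = sum(h.orders for h in hours)
    if total_orders == 0:
        raise ValueError("no orders recorded for this review")

    # Accuracy is read at the busiest hour, not averaged over the day.
    peak = max(hours, key=lambda h: h.orders)
    peak_accuracy = peak.correct_unassisted / peak.orders
    correction_rate = sum(h.human_corrections for h in hours) / total_orders

    if correction_rate > criteria.intervention_ceiling:
        return "kill"      # above the ceiling the pilot is a cost
    if peak_accuracy >= criteria.autonomy_threshold:
        return "ship"      # the human can leave the position
    if today >= criteria.kill_or_ship_date:
        return "kill"      # deadline reached without crossing the threshold
    return "continue"


# All values below are invented for illustration.
criteria = PilotCriteria(autonomy_threshold=0.95, intervention_ceiling=0.15,
                         kill_or_ship_date=date(2025, 1, 31))
day = [HourStats(orders=40, correct_unassisted=38, human_corrections=2),
       HourStats(orders=120, correct_unassisted=102, human_corrections=18)]

print(decide(criteria, day, today=date(2024, 6, 1)))
print(decide(criteria, day, today=date(2025, 2, 1)))
```

In this made-up day the average accuracy is 87.5%, but the rush hour sits at 85%. Reading the number at peak is what keeps it honest.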

You set all three in week zero, in a conversation, with the operation in the room. They aren’t technical. They’re a negotiation between whoever runs the store, whoever owns the cost, and whoever answers for the experience.

Skip that conversation and the pilot does what McDonald’s did. It runs for years in the expensive middle, improving a number nobody agreed on a target for.

Be honest with yourself for a second.

Look at the automation pilot running in your operation right now. Is there a human confirming what the machine does? Ask one thing: what’s the number that lets that human leave?

If nobody in the room can answer, your pilot has no end. It only has cost.

What this post is not saying

We don’t have McDonald’s or IBM internal data. The reading is built on the public record: announcements, reporting, official statements.

We’re not saying the attempt was foolish. Voice in a drive-thru is one of the hardest problems in retail, and testing it at scale was a defensible call.

We’re saying only that an honest post-mortem points at a missing criterion, not a technology failure. The difference changes your next budget conversation.

If that conversation opens with “let’s automate the counter,” the question before you approve isn’t “which vendor.” It’s this: what accuracy, at peak, lets the order taker leave, and how long do we give it to get there?

Defining that number is a fight between departments. Each one defends the metric it already tracks. It’s easier with someone outside the room who knows the shape of the conversation and forces it closed.

Send us the three indicators your operation already tracks at that counter or that checkout. Within an hour, we’ll send back which ones would have predicted McDonald’s outcome and which are noise.

If we see a project shape, the two-week Diagnóstico ends with a one-page document: three indicators, three target ranges, three review triggers. That document goes on the operation’s wall, before the first line of code.