Klarna didn't fail at AI. It failed to define when AI should have stopped.


Forja · Post-mortem · May 20, 2026

In February 2024, Klarna announced that its AI assistant was doing the work of 700 full-time customer service agents and projected a $40M profit improvement for the year. In May 2025, it started rehiring humans.

The case quickly became a reference point for “AI failed in customer service.” That reading is wrong.

The AI did what it promised. On tier 1 (order status, simple adjustments, password resets), it answered faster than a human. Satisfaction stayed near baseline.

What came next broke.

Over 2024, the mix of conversations shifted. Simple cases resolved in seconds. Complex cases (billing disputes, slow refunds, merchant conflicts) kept cycling through automated attempts.

The customer left without resolution and without a clear path to a human.

The post-contact survey started recording the same complaint in three languages: “the bot didn’t understand me and wouldn’t let me reach someone.” The metric the internal team tracked, average handling time, kept dropping. The metric the customer actually felt, first-contact resolution for complex cases, was not being measured.

In May 2025, Sebastian Siemiatkowski, the CEO of Klarna, told Bloomberg the company had “gone too far.” The quality drop in sensitive cases was a cost nobody had modeled.

The lazy reading: the AI was not ready. That is not it.

The project started with the right hypothesis. Reducing tier 1 with AI is one of the most validated use cases in retail and fintech.

Technology was not what was missing. What was missing was an explicit criterion: when this AI should stop answering, and when it should hand off to a human.

The decision that looked technical was organizational.

Think about your own operation for a second.

Before average handling time drops at all, there is a number that comes first: the share of cases that need, and receive, human escalation. That number is the boundary between what AI operates and what it does not.

Without that boundary set early, “the AI handles it” becomes the default. The system answers everything because nobody told it to stop.

Here is what matters: the omission is rarely deliberate. It happens because the question “what should the AI not handle?” is harder than the question it replaces, “what should the AI handle?”

The first one forces a fight between teams: operations, legal, CX, finance. The second has a clean technical answer. Skipping the harder conversation is the path of least resistance.

It surfaces as metric blindness six to nine months later.

At Klarna, in hindsight, the symptom was visible by week three of the rollout. The transfer rate to humans had dropped near zero.

Not because cases got simpler. Because the path to a human had been suppressed.

That number alone was enough to trigger a review. It was not treated as an alert because nobody had defined what the expected range should look like.
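To make "expected range" concrete, here is a minimal sketch of that check in Python. Everything in it is an assumption for illustration: the names `ExpectedRange`, `escalation_rate`, and `check_escalation`, the 8% to 20% band, and the week-three volumes are hypothetical and are not Klarna figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    low: float   # lowest acceptable share of conversations handed to a human
    high: float  # highest acceptable share


def escalation_rate(handed_to_human: int, total_conversations: int) -> float:
    """Share of conversations that ended with a human agent."""
    if total_conversations <= 0:
        raise ValueError("total_conversations must be positive")
    return handed_to_human / total_conversations


def check_escalation(rate: float, expected: ExpectedRange) -> str:
    """Compare the observed rate with the range agreed before go-live."""
    if rate < expected.low:
        return "review: below range (is the path to a human blocked?)"
    if rate > expected.high:
        return "review: above range (is the AI failing on simple cases?)"
    return "ok"


# Hypothetical numbers, for illustration only.
expected = ExpectedRange(low=0.08, high=0.20)
week_3 = escalation_rate(handed_to_human=120, total_conversations=10_000)
print(f"{week_3:.1%} -> {check_escalation(week_3, expected)}")
```

The code is trivial on purpose. The two numbers inside `ExpectedRange` are the hard part, and they come out of the cross-team conversation, not out of the model.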

The criterion that was missing

If your operation is thinking about AI for tier 1, three indicators have to exist from the first week in production:

The human-escalation rate, with the expected range defined in conversation BEFORE go-live. If it drops too far or rises too far, someone yells.
First-contact resolution, measured separately for simple and complex cases. As an average, the metric misleads. Simple cases resolve easily and pull the number up while complex cases quietly break.
Qualitative sampling, read by a senior human, with real weight in the decision to continue or adjust.
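The second indicator is the easiest one to get wrong, so here is a small worked example of how a blended average hides the break. The case counts are hypothetical, chosen only to show the arithmetic, and the `fcr` helper is an illustrative name.

```python
def fcr(resolved: int, total: int) -> float:
    """First-contact resolution: share of cases resolved in the first contact."""
    return resolved / total


# Hypothetical week of traffic, split by case type (illustrative numbers).
cases = {
    "simple": {"resolved": 8_550, "total": 9_000},
    "complex": {"resolved": 300, "total": 1_000},
}

# The blended number pools every case, so the large simple segment dominates.
blended = fcr(
    sum(c["resolved"] for c in cases.values()),
    sum(c["total"] for c in cases.values()),
)

# The segmented numbers keep each case type visible on its own.
by_segment = {name: fcr(c["resolved"], c["total"]) for name, c in cases.items()}

print(f"blended FCR: {blended:.1%}")
for name, value in by_segment.items():
    print(f"{name} FCR: {value:.1%}")
```

With these inputs the blended figure is 88.5%, while simple cases sit at 95.0% and complex cases at 30.0%. The report shows the first number. The customer with a billing dispute lives in the last one.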

Each answers a different question.

The escalation rate tells you whether the AI is overreaching. First-contact resolution tells you whether the customer’s problem actually ended. Qualitative sampling tells you what the numbers cannot see.

Tracking only the first two builds an operation that looks healthy in the report. Tracking only the third produces anecdote without scale.

The work is in keeping all three loud at the same cadence.

None of the three is new. All are standard in mature customer-service operations.

What changed with AI is not the need for the indicator. It is the ease of forgetting it.

The number that shows up on the CFO’s dashboard looks good while the three above quietly deteriorate.

The window where the problem is visible and still cheap to fix is narrow. At Klarna, from the public record, it lasted from about week three through the end of the following quarter.

After that, the operation had reorganized around the wrong number.

Reversing it cost a communications campaign, rehiring, and a slice of the credibility the company had built around the topic.

What this post is not saying

We do not have access to Klarna’s internal architecture. We are not saying the leadership was reckless to try.

The attempt was reasonable and the public learning is valuable for the whole industry.

We are saying, only, that an honest post-mortem of this case points to a missing criterion, not to a model failure.

That distinction matters for you.

If your next budget conversation involves “let’s use AI to cut customer service cost,” the question to ask before approving is not “which model?” It is this: “what is the human-escalation range we are going to defend, and what is the review trigger if it drifts?”

That question has to be answered in week zero. In week sixty, it is too late.
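One way to picture a week-zero answer is as a small scorecard: each indicator gets a target range and a review trigger. The sketch below is hypothetical throughout. The `Indicator` class, the indicator names, the ranges, the "N consecutive weeks out of range" trigger rule, and the weekly history are all assumptions for illustration, not a prescribed method.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Indicator:
    name: str
    low: float          # bottom of the target range
    high: float         # top of the target range
    max_weeks_out: int  # review trigger: consecutive weeks outside the range

    def in_range(self, value: float) -> bool:
        return self.low <= value <= self.high

    def needs_review(self, weekly_values: Sequence[float]) -> bool:
        """True once the value stays out of range for max_weeks_out weeks in a row."""
        streak = 0
        for value in weekly_values:
            streak = 0 if self.in_range(value) else streak + 1
            if streak >= self.max_weeks_out:
                return True
        return False


# Hypothetical scorecard agreed in week zero (all numbers are assumptions).
SCORECARD = [
    Indicator("human_escalation_rate", low=0.08, high=0.20, max_weeks_out=2),
    Indicator("fcr_complex_cases", low=0.60, high=1.00, max_weeks_out=2),
    Indicator("senior_review_pass_rate", low=0.85, high=1.00, max_weeks_out=1),
]

# Hypothetical first four weeks in production.
history = {
    "human_escalation_rate": [0.12, 0.09, 0.03, 0.02],
    "fcr_complex_cases": [0.66, 0.63, 0.61, 0.58],
    "senior_review_pass_rate": [0.90, 0.88, 0.87, 0.86],
}

flags = {ind.name: ind.needs_review(history[ind.name]) for ind in SCORECARD}
for name, flagged in flags.items():
    print(f"{name}: {'REVIEW' if flagged else 'ok'}")
```

In this made-up history only the escalation rate fires, after two straight weeks below its range. Complex-case resolution has slipped out of range for a single week, so it has not reached its trigger yet. The trigger rule is itself a decision someone has to own.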

Be honest with yourself.

Look at your operation right now. If you have an automated agent anywhere in the customer-service stack, open the dashboard.

Find the human-escalation rate. If you cannot say what it should be, that is the work that has to happen before any code.

This is not technical work. It is a fight between operations, legal, CX, and finance about what each one will defend.

Hard to do alone because each department defends the number it already measures. Easier to do with an outsider in the room who knows the shape of the conversation and forces the close.

Send me the three indicators your operation already tracks for customer service. In one hour I will tell you which predict the Klarna problem and which are noise.

If that conversation confirms there is a project to run, the two-week Diagnóstico ends with a one-page document: three indicators, three target ranges, three review triggers. That document goes on the operation’s wall.