Why AI Pilots Stall in Financial Services

Most AI pilots in financial services do not fail technically. They stall in the gap between an interesting demo and a production system that risk and compliance can sign off on.

Financial Services, AI, Compliance, SOC 2

In financial services, the AI pilots that ship are not the most technically impressive ones. They are the ones whose teams understood, before they started, what production deployment actually requires.

We have watched a handful of FinServ pilots stall over the past year, mostly with the same pattern: an exciting demo from an internal team or a vendor, followed by months of risk and compliance review, followed by a quiet wind-down. The technology was not the problem. The problem was that the pilot was designed to impress an executive sponsor, not to clear a production review.

Here is what we see go wrong, in roughly the order it shows up.

Pilot scope mismatched to production scope

Pilots get scoped against ease of demo: a sample data set, a single workflow, a hand-picked set of test users. Production scope is different: real data, integration with the systems of record, every user role, edge cases that did not appear in the sample.

The result: the pilot works. Production breaks on the gap. Risk teams refuse to approve a system whose pilot did not surface the failure modes that matter.

The fix: design the pilot against a production-shaped slice of the workload. Smaller volume, but the same data heterogeneity, the same integration depth, the same user roles. The pilot should be smaller, not different.
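One way to get a slice that is smaller but not different is stratified sampling: cut the volume while keeping every combination of data source and user role represented. The sketch below is illustrative only. The function name, the `source` and `role` fields, and the sample workload are all assumptions, not part of any particular system.

```python
import random
from collections import defaultdict


def production_shaped_slice(records, strata_keys, fraction, seed=0):
    """Return a smaller sample that still contains every stratum in `records`.

    A stratum is one combination of the values named in `strata_keys`
    (for example data source and user role). Each stratum contributes
    roughly `fraction` of its records, and never fewer than one.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the pilot slice is reproducible

    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[tuple(rec[k] for k in strata_keys)].append(rec)

    sample = []
    for group in by_stratum.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical workload with one rare stratum (a single compliance record).
records = (
    [{"source": "core_banking", "role": "analyst", "id": i} for i in range(50)]
    + [{"source": "crm", "role": "advisor", "id": i} for i in range(50, 70)]
    + [{"source": "core_banking", "role": "compliance", "id": 70}]
)
pilot = production_shaped_slice(records, ["source", "role"], fraction=0.1)
```

A plain random 10% sample would usually drop the lone compliance record. The per-stratum floor of one keeps it in the pilot, which is the kind of edge case risk teams will ask about.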

Audit logging not in scope

The pilot demo focuses on the AI's outputs. The audit log is not built. When risk asks "show me how this would be operated in production — every model interaction, captured, queryable, retained" — there is no answer.

This is the most common reason pilots stall. The team built the AI, not the AI plus its audit infrastructure. The audit infrastructure is more work than the AI in many cases. Surprising risk teams with this discovery midway through review is a project killer.

The fix: build the audit log first. The AI's outputs are observable in the audit log from day one. By the time risk reviews the system, the audit infrastructure is mature, the queries risk wants to run already work, and the conversation is about whether the system meets policy — not whether the system is auditable at all.
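As a minimal sketch of what "audit log first" can look like, the example below routes every model call through a wrapper that records the interaction before returning the output. It assumes SQLite as the store and uses a stub in place of a real model. The class, table, and function names are hypothetical. Retention policy and tamper resistance are out of scope here and would need database-level controls in a real deployment.

```python
import sqlite3
from datetime import datetime, timezone


class AuditLog:
    """Record of model interactions; exposes inserts and queries only."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            """CREATE TABLE IF NOT EXISTS interactions (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   ts TEXT NOT NULL,
                   user_id TEXT NOT NULL,
                   model TEXT NOT NULL,
                   prompt TEXT NOT NULL,
                   output TEXT NOT NULL
               )"""
        )

    def record(self, user_id, model, prompt, output):
        ts = datetime.now(timezone.utc).isoformat()
        cur = self.db.execute(
            "INSERT INTO interactions (ts, user_id, model, prompt, output) "
            "VALUES (?, ?, ?, ?, ?)",
            (ts, user_id, model, prompt, output),
        )
        self.db.commit()
        return cur.lastrowid

    def by_user(self, user_id):
        """One example of a query a risk reviewer might want to run."""
        return self.db.execute(
            "SELECT id, ts, model, prompt, output FROM interactions "
            "WHERE user_id = ? ORDER BY id",
            (user_id,),
        ).fetchall()


def audited_call(log, model_name, model_fn, user_id, prompt):
    """Call the model and log the interaction before returning the output."""
    output = model_fn(prompt)
    log.record(user_id, model_name, prompt, output)
    return output


# Usage with a stub model standing in for a real one.
log = AuditLog()
stub_model = lambda prompt: f"stub answer to: {prompt}"
audited_call(log, "stub-model", stub_model, "analyst-17", "Summarize exposure to ACME")
rows = log.by_user("analyst-17")
```

Because the only path to the model goes through `audited_call`, features built later inherit logging without extra work.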

Tenant or counterparty isolation undefined

In financial services, your data has to be separated from your counterparties' data, each customer's data has to be separated from every other customer's, and an investment bank's information barriers have to be enforced in software. The boundaries are not optional.

Pilots often skip this. "It's a pilot, we'll figure it out for production." Production then requires a redesign of the data model, the retrieval index, the access control layer. The redesign is months of work. The pilot stalls.

The fix: design tenancy and information barriers into the architecture before the pilot. Designing them in early costs far less than retrofitting them after the data model, retrieval index, and access control layer already exist.
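To illustrate designing the boundary in early, here is a toy retrieval index that partitions documents by tenant and requires a tenant scope on every query. This is a sketch under simplifying assumptions: `TenantScopedIndex`, the `Doc` fields, and the keyword matching are hypothetical stand-ins for a real index and a real access control layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant_id: str
    text: str


class TenantScopedIndex:
    """Toy index where every query is confined to one tenant's partition."""

    def __init__(self):
        self._by_tenant = {}  # tenant_id -> list of Doc

    def add(self, doc):
        self._by_tenant.setdefault(doc.tenant_id, []).append(doc)

    def search(self, tenant_id, query):
        # There is deliberately no "search everything" code path.
        if not tenant_id:
            raise ValueError("every query must carry a tenant scope")
        terms = query.lower().split()
        partition = self._by_tenant.get(tenant_id, [])
        return [d for d in partition if any(t in d.text.lower() for t in terms)]


index = TenantScopedIndex()
index.add(Doc("a-1", "fund-a", "Q3 credit exposure memo"))
index.add(Doc("b-1", "fund-b", "Q3 credit exposure memo"))
hits = index.search("fund-a", "credit exposure")
```

Both tenants hold a document with identical text, yet the query for `fund-a` can only ever see `a-1`, because the partition is selected before any matching happens.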

Outputs are not citable

The AI produces an answer. The answer is wrong sometimes. Risk asks: "When the AI gets it wrong, how does the user know?"

If the system does not cite its sources, the answer is "they don't." The user has no way to verify the AI's output without redoing the work. Risk reasonably concludes that the system creates a new failure mode — confident-sounding wrong answers — and refuses approval.

The fix: every output cites the sources it relied on. The user verifies before acting. The audit log captures both the citation and the user's action. This is a hard requirement for FinServ AI, not a nice-to-have.
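One way to make citation a hard requirement rather than a convention is to enforce it in the output type, so an uncited answer cannot be constructed at all. The sketch below assumes hypothetical `Citation` and `CitedAnswer` types and a simple text rendering.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Citation:
    doc_id: str
    excerpt: str


@dataclass(frozen=True)
class CitedAnswer:
    text: str
    citations: Tuple[Citation, ...]

    def __post_init__(self):
        # Fail closed: an answer with no sources never reaches the user.
        if not self.citations:
            raise ValueError("answer has no citations")


def render(answer):
    """Show the answer followed by the sources the user can check."""
    lines = [answer.text, "Sources:"]
    lines += [f"  [{c.doc_id}] {c.excerpt}" for c in answer.citations]
    return "\n".join(lines)


answer = CitedAnswer(
    text="The covenant caps leverage at 4.0x.",
    citations=(Citation("loan-agreement-12", "Section 6.2: leverage ratio"),),
)
shown = render(answer)
```

Failing at construction time means a pipeline bug surfaces as an error for engineers, not as a confident-sounding unverifiable answer for users.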

No human-in-the-loop boundary

For any AI output that affects a regulated decision — a trading recommendation, a loan adjudication, a fraud alert classification — a human reviews and approves before the action takes effect. Pilots often blur this boundary. "The user can override the AI" is not the same as "the user must affirmatively approve."

The fix: explicit approval steps for any consequential output. Logged. Tied to the approver's user identity. The AI proposes; the human disposes; the audit log captures both.
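The difference between "can override" and "must approve" can be sketched as a gate that refuses to execute anything lacking an explicit approval. All names here (`Proposal`, `approve`, `execute`, `ApprovalRequired`) are hypothetical, and the audit log is a plain list for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a proposal is executed without an explicit approval."""


@dataclass
class Proposal:
    proposal_id: str
    action: str
    approved_by: Optional[str] = None


def _log(audit, event, proposal, actor):
    audit.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        "proposal_id": proposal.proposal_id,
        "actor": actor,
    })


def propose(audit, proposal_id, action):
    proposal = Proposal(proposal_id, action)
    _log(audit, "proposed", proposal, actor="ai")
    return proposal


def approve(audit, proposal, approver_id):
    if not approver_id:
        raise ValueError("approval must be tied to a user identity")
    proposal.approved_by = approver_id
    _log(audit, "approved", proposal, actor=approver_id)


def execute(audit, proposal):
    # The default is "blocked": nothing takes effect without an approver.
    if proposal.approved_by is None:
        raise ApprovalRequired(proposal.proposal_id)
    _log(audit, "executed", proposal, actor=proposal.approved_by)
    return proposal.action


audit = []
proposal = propose(audit, "p-1", "classify alert 42 as fraud")
blocked = False
try:
    execute(audit, proposal)
except ApprovalRequired:
    blocked = True
approve(audit, proposal, "reviewer-9")
result = execute(audit, proposal)
```

The first `execute` call is rejected. Only after `approve` records a named reviewer does the action go through, and the audit trail shows who proposed, who approved, and what ran.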

Model selection treated as a one-time decision

The pilot uses the latest, most capable model. The bill at production volume is unsustainable. The team scrambles to replace the model with a cheaper one and discovers that retrieval and prompting were tuned to the original model's behavior. Quality drops. The pilot is now both expensive and worse.

The fix: assume model selection will change. Build the system to be model-agnostic — clean separation between the orchestration layer and the model layer, evaluation harnesses that can run against any model, prompts that do not exploit specific model quirks. When you have to switch models, it is a configuration change, not a redesign.
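A minimal sketch of that separation: the orchestration code depends only on a "prompt in, text out" callable, models are chosen by name from a registry, and the same evaluation harness runs against any of them. The two stub models and every name below are assumptions for illustration, not real provider APIs.

```python
from typing import Callable, Dict, List, Tuple

# The only contract the orchestration layer knows about.
ModelFn = Callable[[str], str]


def stub_large_model(prompt: str) -> str:
    # Stand-in for a capable model: echoes the full prompt.
    return prompt


def stub_small_model(prompt: str) -> str:
    # Stand-in for a cheaper model: loses everything past 12 characters.
    return prompt[:12]


MODELS: Dict[str, ModelFn] = {
    "large": stub_large_model,
    "small": stub_small_model,
}


def answer(question: str, model_name: str) -> str:
    """Orchestration layer: builds the prompt, never inspects the model."""
    model = MODELS[model_name]
    return model(f"Question: {question}")


def evaluate(model_name: str, cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases whose expected text appears in the model's answer."""
    passed = sum(1 for q, expected in cases if expected in answer(q, model_name))
    return passed / len(cases)


cases = [("what is T+1", "T+1")]
scores = {name: evaluate(name, cases) for name in MODELS}
```

Switching models is then a change to the `model_name` setting, and the harness reports the quality impact before the switch reaches production.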

Vendor and sub-processor due diligence happens late

The team picks a vendor for the AI components. The vendor was great in evaluation. Risk asks for the vendor's SOC 2 report, sub-processor list, and security policies, plus the AI model provider's terms, plus the embedding endpoint's terms. Each comes with a different timeline.

By the time the third-party review completes, the pilot is six months in and the executive sponsor has lost patience.

The fix: front-load vendor due diligence. Work with the procurement and risk teams in week one, not month four. The vendor that clears review fast is not always the most exciting; it is often the right choice.

The conversation gets political

When pilots stall in FinServ, the explanation in the room is often not technical. "Risk is being unreasonable." "Compliance is moving the goalposts." "The vendor isn't doing their part." These framings are tempting and rarely accurate.

Risk and compliance are doing what they are paid to do: refusing to approve systems that cannot be operated safely under the firm's regulatory regime. If the pilot did not produce an auditable, isolatable, citable, human-supervised system, the answer is no, and no is the right answer.

The fix is upstream: design the pilot to clear those bars from day one. The pilots that ship in FinServ are the ones designed against the production review the firm will actually do — not against the demo the executive sponsor wants to see.

What we do differently

When we work with FinServ teams, we sequence the work so audit logging, tenancy, and output citation are in place before the AI does anything interesting. The first weeks of the engagement are not glamorous. They produce the infrastructure that makes the AI defensible. Once that is in place, building the AI features on top of it is fast.

Teams that try to skip ahead — get to the impressive demo first, then add the compliance scaffolding — almost always end up paying more for less. The audit-first sequence is the cheap path, even though it does not feel that way at the start.

If you are working on a FinServ AI project that is heading into pilot — or stalled in pilot review — we have done this enough times to be useful.
