Five Questions Every AI System Should Be Able to Answer

Every AI system in production will eventually produce an output that someone wants to investigate. A patient will ask why a recommendation was made. A regulator will ask what data was used. An auditor will ask whether the system has been operated within policy. A breach investigator will ask what data the model saw during a specific window.

When that happens, you will need to answer five questions. The teams whose AI systems can answer them in seconds keep operating. The teams whose systems cannot, do not.

1. Who used the AI, when, and what did they do?

The most basic audit question. "Show me every AI interaction by user X between dates A and B."

The answer requires the audit log to capture the requesting user identity — not the AI service account, but the human user the request originated from — alongside every model invocation. The log has to be queryable by user, by date, by tenant.

Most AI systems we see in early stages capture the AI's outputs but not the calling user. The user identity has to be threaded through the application layer to the AI invocation layer and persisted with each turn. Adding it later is a rebuild.

2. What did the AI see?

For any given output, what was in the context window when the model produced it?

This is the question that exposes RAG systems built without retrieval logging. The model produced output; the output references a fact; the fact came from somewhere. That somewhere has to be in the audit log, with chunk-level detail.

A complete answer captures: the system prompt, the conversation history, the retrieved chunks (by document and chunk ID), the user message, and the full input as sent to the model. Not a summary. The full input.

3. What did the AI produce, and what happened next?

The model output, plus the action that was taken. For a recommendation that was approved by a human, the audit log shows the AI's draft, the human's approval, any edits made, and the final action. For a recommendation that was rejected, the same chain.

If the AI's output triggered a downstream system call — wrote to the case record, sent an email, updated a billing line — that call is in the audit log too. The chain has to be reconstructable end-to-end.

4. For this specific subject (patient, customer, case), every interaction.

Tenant-scoped queries. "Show me every AI interaction that touched this patient's record, ever."

This requires the tenant identifier to be in every audit row, with appropriate indexes for tenant-scoped queries. For healthcare, the patient identifier; for legal, the matter identifier; for financial services, the account or customer identifier.

The query has to return everything: model invocations, retrieval calls, tool calls, approval decisions. If a piece of the data path bypassed the audit log, that is the gap an investigation will find.

5. Was anything outside policy?

The hardest question. "Were there any AI interactions during this window that should not have happened?"

Answering it requires policy to be expressed in something more than prose. The audit log captures the inputs, outputs, and actions; an automated evaluation runs against the log to flag interactions that violated guardrails (PHI sent to an unapproved endpoint, retrieval crossing tenant boundaries, outputs without citations, model versions outside the approved set, tool calls outside the allowed scope).

Most organizations treat policy review as manual log inspection. That works at small scale and breaks at production scale. The systems that scale have policy expressed as code that runs against the audit log, with alerts and dashboards on policy violations.

What "audit-ready" actually requires

To answer all five questions, a production AI system needs:

Audit log capture at the model invocation layer. Every model call, with full input and output, in a structured format.
User identity threaded through every layer. The application user is captured at the model layer, not invented or replaced with a service account.
Tenant identifier on every row. With appropriate indexes for tenant-scoped queries.
Tool calls captured. Every tool invocation with parameters and results.
Approval chains captured. For human-in-the-loop decisions, the human's identity, decision, and any edits.
Retention sized to obligations. Six years for HIPAA, longer for some workloads, with object lock or equivalent immutability.
Policy expressed as code. Automated evaluation of the log against the policy set.

This is not light infrastructure. For an AI system handling regulated workloads, the audit infrastructure is often comparable in size and effort to the AI itself. The teams that ship the AI without this infrastructure ship faster initially and slower in the long run, because the audit work has to happen eventually.

When the question gets asked

The question that prompts this post: "Can your AI system answer these five questions today?"

If the answer is yes, the system is audit-ready. The compliance review will be a confirmation, not a discovery process. The breach investigation, if one happens, will be hours of work, not weeks.

If the answer is no — not all five, or not in seconds, or only with engineering effort — the gap is a project. The cheapest time to close it is before the audit. The most expensive time is during.

We have walked into both situations. The retrofitted audit log is always more painful than the day-one one. We say this not as theory but as engineers who have done both.

Where we fit

If you are operating an AI system in a regulated environment and any of the five questions feel uncertain, that is the project. We do audit-readiness reviews of existing AI systems regularly. They produce a written gap analysis, a prioritized remediation plan, and an estimate. Reach out if your AI is in production and the audit conversation is starting to come up.