AI Architecture · Advanced

Audit Logging for AI Agents

A reference architecture for capturing, storing, and querying the audit trail of an AI agent system in regulated environments.

Tags: AI, Audit, HIPAA, SOC 2, Architecture

If your AI agent operates in a regulated environment — healthcare, legal, financial services — the audit log is not a feature. It is the artifact you produce, after the fact, to answer the questions that an auditor or breach investigator will ask.

This guide covers the schema, storage, and query patterns we use for AI agent audit trails on AWS. The patterns generalize to other clouds.

What the audit log has to answer

Before designing the schema, write down the questions. Ours, derived from real conversations with Security Officers and external auditors:

  1. Who used the AI agent during a given window?
  2. For a specific user, what did they ask, what did the agent see, and what did it return?
  3. For a specific patient (or case, or account), every AI interaction that touched their record, with full content.
  4. For a specific tool the agent has access to, every invocation — when, by whom, with what parameters, with what result.
  5. For a specific suspected breach window, every model call with full content and result.
  6. For a specific output the user is questioning, the full reasoning chain that produced it.

The schema has to make all six answerable in seconds, not days.

The schema

We use a two-tier model: a metadata layer that is queryable and frequently accessed, and a body layer that is large, infrequently accessed, and pulled on demand.

Metadata layer (DynamoDB)

{
  "pk": "CONV#<conversation_id>",
  "sk": "TURN#<timestamp>#<turn_id>",
  "user_id": "<application user id>",
  "tenant_id": "<customer id, matter id, or patient id>",
  "model_id": "anthropic.claude-3-5-sonnet-20241022-v2:0",
  "model_version": "<provider version>",
  "tool_calls": ["retrieve_policy", "lookup_patient_meds"],
  "input_token_count": 1248,
  "output_token_count": 412,
  "latency_ms": 2104,
  "rag_doc_ids": ["doc_a4f", "doc_71b"],
  "body_pointer": "s3://td-audit/conv/<id>/<turn_id>.json.gz",
  "outcome": "success",
  "approved_by": null,
  "ttl": 1893456000
}

This row is small, a few hundred bytes. The table's primary key indexes it by conversation, and global secondary indexes (GSIs) cover user, tenant, and date. Common queries hit this layer alone.
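As a rough illustration of how such a row might be assembled, here is a minimal Python sketch. The helper name build_metadata_item and the RETENTION_DAYS constant are assumptions made for this example, and only a subset of the fields is shown.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 7 * 365  # assumption: seven years approximated in days


def build_metadata_item(conversation_id, turn_id, created_at, user_id,
                        tenant_id, model_id, tool_calls, body_pointer):
    """Build a metadata row keyed like the example above (hypothetical helper).

    created_at must be a timezone-aware datetime.
    """
    utc = created_at.astimezone(timezone.utc)
    # Fixed-width ISO 8601 in UTC sorts lexicographically in time order,
    # so turns stay ordered inside a conversation partition.
    ts = utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"
    expires = utc + timedelta(days=RETENTION_DAYS)
    return {
        "pk": f"CONV#{conversation_id}",
        "sk": f"TURN#{ts}#{turn_id}",
        "user_id": user_id,
        "tenant_id": tenant_id,
        "model_id": model_id,
        "tool_calls": list(tool_calls),
        "body_pointer": body_pointer,
        "outcome": "success",
        "ttl": int(expires.timestamp()),  # DynamoDB TTL expects epoch seconds
    }
```

Putting the timestamp ahead of the turn ID in the sort key means a range condition on the sort key doubles as a time-window filter.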

Body layer (S3, write-once)

{
  "turn_id": "...",
  "timestamp": "2026-05-07T14:23:11.402Z",
  "user_id": "...",
  "tenant_id": "...",
  "input": {
    "prompt": "<full prompt including system message and history>",
    "user_message": "<verbatim user input>"
  },
  "context": {
    "rag_chunks": [
      { "id": "...", "doc_id": "...", "score": 0.84, "text": "..." }
    ],
    "visitor_context": {...}
  },
  "tool_calls": [
    {
      "name": "retrieve_policy",
      "params": {...},
      "result_summary": "...",
      "result_full": "<full result>"
    }
  ],
  "output": "<full model output>",
  "approval_chain": []
}

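S3 Object Lock in compliance mode gives you write-once semantics: no one, including the account root user, can overwrite or delete a locked object version, or shorten its retention period, inside the retention window. Object Lock requires versioning to be enabled on the bucket.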
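A minimal sketch of the write, assuming the body is serialized as gzipped JSON and uploaded with boto3's put_object. The helper below only builds the keyword arguments, so it runs without AWS credentials; its name and the retention default are assumptions for this example.

```python
import gzip
import json
from datetime import datetime, timedelta, timezone


def build_body_put_request(bucket, key, body, created_at, retention_days=7 * 365):
    """Build keyword arguments for an S3 PutObject call with a compliance-mode lock."""
    payload = gzip.compress(json.dumps(body, sort_keys=True).encode("utf-8"))
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": payload,
        "ContentType": "application/json",
        "ContentEncoding": "gzip",
        # Compliance mode: the retention date cannot be shortened once set.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": created_at + timedelta(days=retention_days),
    }
```

With a boto3 S3 client this would be sent as s3.put_object(**request). The call only succeeds against a bucket that has Object Lock enabled.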

Why two tiers

The metadata layer is queried constantly. "Show me every interaction this user had with this patient's record this week." That query hits DynamoDB — milliseconds, predictable cost.

The body layer is queried rarely, usually only during an investigation. "Show me the full prompt and output for these specific turns." That query reads from S3 — slower, but cheap to store at scale.

Putting everything in one tier produces either expensive DynamoDB rows (if you store full prompts) or slow queries (if you scan S3 for metadata).
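The split between the two tiers can be sketched as a write path and a read path. This is an illustration under assumptions, not our production code: put_body, put_metadata, and get_body are hypothetical callables standing in for the S3 and DynamoDB clients.

```python
from urllib.parse import urlparse


def split_pointer(body_pointer):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(body_pointer)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 pointer: {body_pointer!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def record_turn(put_body, put_metadata, metadata, body):
    """Write the body first, then the metadata row that points to it."""
    bucket, key = split_pointer(metadata["body_pointer"])
    put_body(bucket, key, body)   # body tier: large, write-once
    put_metadata(metadata)        # metadata tier: small, queryable


def load_bodies(rows, get_body):
    """Pull bodies only for the rows an investigator has selected."""
    return [get_body(*split_pointer(row["body_pointer"])) for row in rows]
```

Writing the body before the metadata row means a queryable row never points at an object that does not exist. A failure between the two writes leaves an orphaned body, which is the safer of the two failure modes.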

Retention

HIPAA requires that the documentation it mandates be retained for six years from its creation or from the date it was last in effect, whichever is later. Some workloads need longer: discovery in a litigation matter, contractual retention with a customer, regulatory holds.

We default to:

  • DynamoDB: TTL set to 7 years from creation.
  • S3: object lock retention period of 7 years, in compliance mode.
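  • Lifecycle policy moves S3 objects to Glacier Instant Retrieval after 90 days, Glacier Deep Archive after 1 year. Objects in Deep Archive must be restored before they can be read, which takes hours, so plan for that delay when an investigation reaches back more than a year.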

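Review the retention numbers against your specific obligations. For pediatric records, state medical-record retention laws often run until the patient reaches the age of majority plus a fixed number of years, which can exceed HIPAA's six. For litigation holds, retention extends until the hold is released.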
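The defaults above might be expressed as follows. This is a sketch under assumptions: the rule ID and the conv/ prefix are illustrative, and add_years is a hypothetical helper for computing a calendar-year retention date.

```python
from datetime import datetime, timezone


def add_years(moment, years):
    """Return the same calendar date `years` later, clamping Feb 29 to Feb 28."""
    try:
        return moment.replace(year=moment.year + years)
    except ValueError:  # Feb 29 in a non-leap target year
        return moment.replace(year=moment.year + years, day=28)


def lifecycle_configuration(prefix="conv/"):
    """Lifecycle rules matching the defaults: 90 days, then 1 year."""
    return {
        "Rules": [
            {
                "ID": "audit-body-tiering",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }
```

With boto3, the dictionary would be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration. The result of add_years can feed both the DynamoDB ttl attribute (as epoch seconds) and the Object Lock retain-until date, so the two tiers expire together.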

Access control on the audit log

The audit log is itself sensitive. The bodies contain prompts, tool results, retrieved chunks — all of which can be PHI. Treat the log like any other PHI store:

  • IAM policies limit who can read the audit data — typically a small Security and Compliance group, plus a break-glass role for incident response.
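  • Reads from the audit log are themselves logged. If a Security Officer queries the audit log, that query is captured in CloudTrail. This requires CloudTrail data events to be enabled for the audit table and bucket; they are not recorded by default.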
  • Writes to the audit log come from a single service role used only by the agent runtime. No other path can write.
  • KMS keys are separate from operational data keys. The audit-log key has a tighter access policy.
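One way to enforce the single-writer and restricted-reader rules on the body bucket is a bucket policy with explicit denies. The sketch below builds such a policy as a Python dictionary; the statement IDs and the function name are assumptions for this example, and a real policy would need review against your account structure (for instance, cross-account roles and break-glass access).

```python
import json


def audit_bucket_policy(bucket, writer_role_arn, reader_role_arns):
    """Deny writes from anything but the agent runtime role, and reads from
    anything but the listed audit reader roles (hypothetical helper)."""
    object_arn = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyWritesExceptAgentRuntime",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": object_arn,
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": writer_role_arn}},
            },
            {
                "Sid": "DenyReadsExceptAuditReaders",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": object_arn,
                # With a negated operator, the deny applies only when the
                # caller matches none of the listed ARNs.
                "Condition": {"ArnNotEquals": {"aws:PrincipalArn": list(reader_role_arns)}},
            },
        ],
    }


def policy_json(policy):
    """Serialize the policy for an S3 PutBucketPolicy call."""
    return json.dumps(policy, indent=2)
```

Explicit denies in the bucket policy hold even if someone later attaches a broad allow to an unrelated IAM role, which is why they are used here instead of relying on identity policies alone.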

What goes in the input

The full prompt — system message, conversation history, retrieved context, and the user's message. All of it. If a prompt is long enough that storing it is expensive, you have a different problem (prompt bloat) and should fix that, not skip the logging.

Common temptation: store only the user's message and reconstruct the prompt later. Do not. Prompt templates change. System messages change. The retrieved context changes by the second. The only way to know what the model actually saw is to log what was sent.

What goes in the output

The full output, exactly as the model produced it, before any post-processing. If your application strips formatting, redacts PHI from display, or rewrites the output, log both versions.
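A minimal sketch of capturing both sides of a model call, assuming the application has the exact request payload in hand at send time. The function name and the output_displayed field are assumptions for this example; the body schema above does not define a field for the post-processed version.

```python
import copy


def capture_model_io(request_payload, user_message, raw_output, postprocess):
    """Snapshot what was sent to the model and both versions of what came back."""
    displayed = postprocess(raw_output)
    return {
        "input": {
            # Deep copy so later mutation of the live payload (for example,
            # appending the next turn) cannot alter what the log says was sent.
            "prompt": copy.deepcopy(request_payload),
            "user_message": user_message,
        },
        "output": raw_output,            # exactly as the model produced it
        "output_displayed": displayed,   # hypothetical field: what the user saw
    }
```

Logging the payload object that was actually sent, not a re-rendered template, is what protects the record against later changes to templates and system messages.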

Tool calls

Every tool invocation — name, parameters, result. The result has to include enough detail that "did the agent get the right data" is answerable. For retrieval tools, log the chunk IDs and content. For database lookups, log the query and the row count or specific records returned. For writes, log the data written and the resulting state.

A common failure: logging the tool call but not the result, on the theory that the result is large and inferable. It is not inferable. The tool's downstream system may have changed since the call. Log the result.
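One way to make result logging hard to skip is to wrap every tool in a recorder, so the tool cannot be called without its result being captured. The decorator below is a sketch under assumptions: audited_tool and the error field are hypothetical names, and log is any list-like sink for the turn's tool_calls array.

```python
import functools


def audited_tool(name, log):
    """Wrap a tool so each invocation records its name, parameters, and result."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**params):  # keyword-only so parameters are logged by name
            entry = {"name": name, "params": dict(params)}
            try:
                result = func(**params)
            except Exception as exc:
                # Failed calls are audit events too (hypothetical "error" field).
                entry["error"] = repr(exc)
                log.append(entry)
                raise
            entry["result_full"] = result
            log.append(entry)
            return result
        return wrapper
    return decorator
```

A tool registered as @audited_tool("lookup_patient_meds", turn_log) then appends one entry per call to turn_log, including calls that raise.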

Approval chains for human-in-the-loop

When an AI proposal requires human approval before taking effect, the audit log records:

  • The proposal (output of the model)
  • The approver's identity
  • The approver's decision (approve, edit, reject)
  • The final action that was taken
  • Any edits the approver made

If the approver edited the proposal, both the AI's original output and the human's edited version are in the log. The downstream system always acts on the human's version, and the log shows where the AI ended and the human began.
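The approval record can be sketched as a small immutable structure. The class and field names here are assumptions for illustration; the point is that the original proposal and the human's edit are stored side by side, and the final action is derived from the decision.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ApprovalEntry:
    proposal: str                      # output of the model, unmodified
    approver_id: str
    decision: str                      # "approve", "edit", or "reject"
    edited_text: Optional[str] = None  # set only when decision == "edit"

    def final_action(self):
        """Return what the downstream system acts on, or None if rejected."""
        if self.decision == "approve":
            return self.proposal
        if self.decision == "edit":
            if self.edited_text is None:
                raise ValueError("an edit decision needs edited_text")
            return self.edited_text
        if self.decision == "reject":
            return None
        raise ValueError(f"unknown decision: {self.decision!r}")


def approval_record(entry):
    """Flatten an entry into a dict suitable for the approval_chain array."""
    record = asdict(entry)
    record["final_action"] = entry.final_action()
    return record
```

Because proposal is never overwritten, a reviewer can diff proposal against final_action to see exactly what the human changed.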

Querying the log

For the six questions at the start, here is how each is answered:

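  1. Who used the AI agent during a window? GSI with date as the partition key and user_id projected, queried once per day in the window. A GSI partitioned on user_id alone cannot answer this without a scan, because the users are the unknown.
  2. What did this user ask? GSI on user_id, ordered by timestamp. Pull bodies on demand.
  3. All AI interactions touching this patient? GSI on tenant_id, ordered by timestamp.
  4. All invocations of a specific tool? GSI on tool_name, which needs one item per invocation because DynamoDB cannot index the elements of the tool_calls list, or a separate tool-invocation table for high-volume tools.
  5. Specific suspected breach window? Range query on timestamp within the date-partitioned GSI, bodies pulled on demand.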
  6. Reasoning chain for a specific output? Look up turn by ID, pull the body.

In practice, all of these are written as predefined queries in the security team's runbook, not free-form data exploration.
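As an example of one such predefined query, here is a sketch of question 2 for a time window. It assumes a GSI whose partition key is user_id and whose sort key is the table's sk attribute, with ISO 8601 UTC timestamps inside sk; the index name and table name are hypothetical. The function only builds the arguments for the low-level DynamoDB Query call.

```python
def user_window_query(table, user_id, start_iso, end_iso, index_name="user_id-index"):
    """Build Query arguments for one user's turns in [start_iso, end_iso)."""
    return {
        "TableName": table,
        "IndexName": index_name,  # hypothetical GSI: user_id (partition), sk (sort)
        "KeyConditionExpression": "user_id = :u AND sk BETWEEN :lo AND :hi",
        "ExpressionAttributeValues": {
            ":u": {"S": user_id},
            # sk looks like TURN#<timestamp>#<turn_id>, so a bare
            # TURN#<timestamp> prefix sorts just before every turn at that instant.
            ":lo": {"S": f"TURN#{start_iso}"},
            ":hi": {"S": f"TURN#{end_iso}"},
        },
    }
```

With boto3 this would run as boto3.client("dynamodb").query(**kwargs), following LastEvaluatedKey for pagination. Because the upper bound is a bare prefix, turns stamped exactly at end_iso sort after it and are excluded, which gives a half-open window.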

Common mistakes

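Logging in the application logs. CloudWatch Logs is not an audit log. Its retention is a setting that can be changed, log groups can be deleted by any principal with the right permission, so it is not write-once, and application log groups rarely have the access controls a HIPAA audit log requires.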

Storing the audit log in the same database as application data. Operational queries and audit queries have different access patterns and different access controls. Mixing them creates risk.

Logging too coarsely. "Conversation completed at 14:23" is not an audit log. The minimum useful unit is the model invocation, with full input and output.

Logging too verbosely. Every keystroke, every component render, every cache miss does not belong in the audit log. The audit log is for events that have to be reconstructable for compliance purposes.

No retention policy. Logs that grow forever cost money and create discovery exposure. Define retention up front.

Where to start

For a new AI agent project, the audit log is the first system to design, before the agent itself. Build the agent against the audit interface, so logging is a property of the system, not an afterthought.

If you are retrofitting audit logging onto an existing AI system, start with the metadata layer. It answers the "who did what" questions that most investigations actually need. Body capture can come second.
