RAG Cost Models in Production

RAG systems have a reputation for being cheap to prototype and surprisingly expensive in production. The reputation is half-right. Production RAG is not expensive because of any single line item; it is expensive because the cost components compound across volume in ways that are not obvious until you look at the bill.

This guide breaks down where the money goes and what to do about it. Numbers are illustrative — the actual figures move with provider pricing — but the structure is stable.

The four cost components

RAG cost in production breaks into four buckets:

Embedding generation. Each document chunk is embedded once at indexing time, plus every query is embedded at retrieval time.
Vector storage and retrieval. The vector database costs scale with the number of vectors stored and the queries served.
LLM invocations. The model call that synthesizes the answer from the retrieved context.
Operations. Logging, monitoring, retraining or re-indexing on changes, and the engineering time to maintain the system.

Most cost surprises come from underestimating one of these, usually the last one: operations.

Embedding costs

For a standard embedding model (text-embedding-3-large, Cohere embed-english-v3.0, Bedrock Titan Embeddings, etc.), embedding cost is on the order of $0.10–$0.13 per million tokens.

For a document corpus of 100,000 documents averaging 5 pages each (about 2,500 tokens per document), that is 250M tokens, or roughly $25–$33 to embed the whole corpus once. Trivial.

Where embedding cost gets interesting:

Re-indexing. Every time you change the embedding model, re-chunk, or update a significant fraction of the corpus, you re-embed. If you re-embed the full corpus monthly, your annual embedding cost is 12× the one-time cost.
Query embedding. Each query is embedded. At 1M queries per month averaging 200 tokens per query, that is 200M tokens per month, or about $20–$26.

The total embedding spend on most production RAG systems is small. It is rarely the line item to worry about.
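The embedding arithmetic above can be reproduced with a few lines. This is a minimal sketch using the illustrative figures from this section ($0.10–$0.13 per million tokens, 100,000 documents of about 2,500 tokens, 1M queries of about 200 tokens); the function name is hypothetical, and the annual figure assumes a full re-embed every month.

```python
def embedding_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of embedding `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Illustrative volumes taken from the text above.
corpus_tokens = 100_000 * 2_500             # 100K documents x ~2,500 tokens
query_tokens_per_month = 1_000_000 * 200    # 1M queries x ~200 tokens

for price in (0.10, 0.13):
    one_time = embedding_cost_usd(corpus_tokens, price)
    queries = embedding_cost_usd(query_tokens_per_month, price)
    # Assumption: the whole corpus is re-embedded once per month.
    annual = 12 * (one_time + queries)
    print(
        f"${price:.2f}/M tokens: corpus ${one_time:,.2f}, "
        f"queries ${queries:,.2f}/month, annual ${annual:,.2f}"
    )
```

Even with monthly re-indexing, the annual total stays in the hundreds of dollars at these volumes.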

Vector storage and retrieval

Vector database cost depends heavily on the choice:

Bedrock Knowledge Bases backed by OpenSearch Serverless. Pricing is OCU-based; a small workload is on the order of hundreds of dollars per month, scaling with index size and query rate.
Aurora PostgreSQL with pgvector. The cost is the Aurora cluster — db.r6g.large is around $250/month plus storage. Scales with workload, not specifically with vectors.
Self-hosted Weaviate / Qdrant on EC2. Compute and EBS costs. For a few million vectors, single-digit hundreds per month. For larger indexes, replication and HA add up.
Managed vendor (Pinecone, Weaviate Cloud). Per-pod or per-RU pricing. Can be the cheapest at small scale, can be expensive at high query volumes.

For most regulated SaaS workloads in the < 10M vector range, vector storage is a few hundred dollars per month, climbing with index size and query volume. The cost is meaningful but rarely the largest line.

LLM invocations — the line that dominates

This is where the bill lives. For a Claude 3.5 Sonnet call (a good default for many RAG workloads):

Input: $3 per million tokens
Output: $15 per million tokens

A typical RAG call has ~3,000 input tokens (system prompt + retrieved chunks + conversation history + user query) and ~500 output tokens. That is $0.009 input + $0.0075 output = ~$0.017 per call.

At 1M calls per month, that is $17,000.
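As a worked version of that calculation, the sketch below computes per-call and monthly cost from token counts and per-million-token prices. The `ModelPrice` class and `call_cost_usd` function are hypothetical names, and the prices are the illustrative ones quoted above, not a pricing reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-million-token prices in USD (illustrative values)."""
    input_per_million: float
    output_per_million: float


def call_cost_usd(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call given its input and output token counts."""
    return (
        input_tokens / 1_000_000 * price.input_per_million
        + output_tokens / 1_000_000 * price.output_per_million
    )


sonnet = ModelPrice(input_per_million=3.0, output_per_million=15.0)

per_call = call_cost_usd(sonnet, input_tokens=3_000, output_tokens=500)
monthly = per_call * 1_000_000  # 1M calls per month

print(f"per call: ${per_call:.4f}, monthly: ${monthly:,.0f}")
```

The unrounded figures are $0.0165 per call and $16,500 per month, which the text rounds up to about $0.017 and $17,000. Output tokens cost 5× as much as input tokens at these prices, so 500 output tokens account for almost half the per-call cost.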

This is the right ballpark to budget. The variations:

Smaller / cheaper model. Claude Haiku, GPT-4o-mini, Llama 3.1 70B Instruct. 5–10× cheaper. Often good enough for retrieval-grounded answers, frequently insufficient for complex reasoning.
Larger / more capable model. Claude 3 Opus is about 5× the price of Sonnet at list price; GPT-4o is priced closer to Sonnet. Either way, expect marginal accuracy gains for most RAG tasks.
Longer context. Larger retrieved chunks, more conversation history, larger system prompts. Each turn becomes more expensive.

For most production RAG systems, LLM invocations are 60–85% of the AI-related bill.

Operations — the line you forget

The cost line that surprises teams most is operations:

Audit logging. S3 storage with object lock, lifecycle to Glacier, CloudWatch Logs ingestion. Scales with invocation volume. For 1M invocations per month with full prompt/output capture, this is $200–$800 per month.
Re-indexing on corpus updates. If your corpus changes daily, you are running an indexing pipeline daily. Embedding cost (small), Lambda or Fargate cost (modest), bandwidth.
Evaluation harness. A held-out eval set runs against every model or prompt change. Each eval case costs about as much as a normal invocation, so a full run costs as many invocations as the set has cases. Run it on every PR and you are adding several percent to your invocation costs.
Engineering time. RAG quality is an ongoing tuning problem. Budget for 0.25–1 FTE of engineering attention per significant RAG system, indefinitely.

Operations is where production RAG costs more than prototype RAG. The prototype skips logging, evaluation, and re-indexing automation. Production cannot.
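To see how an evaluation harness reaches "several percent", here is a rough sketch. The eval set size (500 cases) and the number of runs (60 PRs per month) are assumptions chosen for illustration, and the calculation assumes each eval case costs about the same as one production invocation.

```python
def eval_overhead_fraction(
    cases_per_run: int,
    runs_per_month: int,
    production_invocations_per_month: int,
) -> float:
    """Eval invocations as a fraction of production invocations.

    Assumes each eval case costs about the same as one production call.
    """
    if production_invocations_per_month <= 0:
        raise ValueError("production_invocations_per_month must be positive")
    return cases_per_run * runs_per_month / production_invocations_per_month


# Hypothetical figures: 500-case eval set, run on 60 PRs per month,
# against 1M production invocations per month.
overhead = eval_overhead_fraction(500, 60, 1_000_000)
print(f"Eval overhead: {overhead:.1%} of production invocation cost")
```

Larger eval sets, LLM-as-judge scoring, or running against several candidate models multiply this figure accordingly.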

Where the savings actually come from

Tactics we use to control RAG cost in production, in order of impact:

1. Smaller models where possible

The biggest lever is moving non-critical paths to a smaller model. Use cases:

Intent classification and routing. Haiku or GPT-4o-mini does this fine. No reason to spend Sonnet tokens on it.
Simple summarization. Haiku is good enough for short summaries.
Generation when retrieval is high-confidence. If retrieval scored well above threshold, the synthesis task is mechanical — a smaller model handles it.

A routing layer that sends 70% of traffic to Haiku and 30% to Sonnet cuts the LLM bill by roughly 55–65%, assuming the 5–10× price gap quoted earlier and similar token counts per call on both paths.
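The effect of a routing split can be estimated before building anything. The sketch below is one way to do that, assuming both paths use similar token counts per call; the function names, the "small"/"large" labels, and the 0.8 confidence threshold are hypothetical, and a real router would likely combine several signals.

```python
def blended_cost_ratio(small_share: float, price_ratio: float) -> float:
    """LLM bill relative to an all-large-model baseline.

    small_share: fraction of traffic routed to the smaller model (0..1).
    price_ratio: how many times cheaper the smaller model is (e.g. 5 or 10).
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return small_share / price_ratio + (1.0 - small_share)


def choose_model(retrieval_score: float, high_confidence: float = 0.8) -> str:
    """Toy routing rule: high-confidence retrieval goes to the small model.

    The threshold is an assumption; tune it against an eval set.
    """
    return "small" if retrieval_score >= high_confidence else "large"


for ratio in (5, 10):
    remaining = blended_cost_ratio(small_share=0.7, price_ratio=ratio)
    print(f"{ratio}x cheaper small model: bill is {remaining:.0%} of baseline")
```

With 70% of traffic on a model 5–10× cheaper, the remaining bill is 37–44% of the all-Sonnet baseline. The large-model share sets the floor: no price gap, however wide, pushes the bill below 30% of baseline in this split.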

2. Caching

Identical or near-identical queries hit a cache instead of the model. The hit rate depends on workload — a customer support assistant might see 40% cache hits; a document analysis tool might see 5%.

The infrastructure: a key derived from the query, retrieved chunks, and model version. Redis or ElastiCache in front of the LLM call. Cache TTL based on data freshness requirements.
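A minimal sketch of that cache layer follows. It uses an in-memory dictionary as a stand-in for Redis or ElastiCache so the example runs on its own; `cache_key`, `TTLCache`, and `cached_answer` are hypothetical names. It covers exact matches after whitespace and case normalization only. Near-identical queries would need a semantic cache, which is not shown.

```python
import hashlib
import json
import time
from typing import Callable, Optional


def cache_key(query: str, chunk_ids: list[str], model_version: str) -> str:
    """Derive a key from the normalized query, retrieved chunks, and model version."""
    normalized = " ".join(query.lower().split())
    payload = json.dumps(
        # Sorting chunk IDs is a design choice: it ignores retrieval order.
        {"q": normalized, "chunks": sorted(chunk_ids), "model": model_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for Redis/ElastiCache with per-entry expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cached_answer(
    query: str,
    chunk_ids: list[str],
    model_version: str,
    cache: TTLCache,
    call_llm: Callable[[str, list[str]], str],
) -> str:
    """Return a cached answer if present; otherwise call the model and store it."""
    key = cache_key(query, chunk_ids, model_version)
    hit = cache.get(key)
    if hit is not None:
        return hit
    answer = call_llm(query, chunk_ids)
    cache.set(key, answer)
    return answer
```

Including the model version in the key means a model or prompt upgrade naturally invalidates old entries. Including the chunk IDs means a re-indexed corpus does the same, at the cost of running retrieval before every cache lookup.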

3. Prompt compaction

The retrieved chunks dominate the input token count. Tactics:

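Re-rank and keep only the top-K chunks instead of everything retrieved. Top-3 instead of top-10 cuts retrieved-chunk tokens by about 70% when chunks are similar in size; the cut to total input tokens is somewhat smaller because the system prompt, history, and query are unchanged.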
Summarize older conversation history rather than carrying the full transcript.
Externalize stable system instructions to model-side caching where the provider supports it (Anthropic prompt caching cuts the cost of repeated system prompts dramatically).
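The top-K tactic can be sketched as follows. The `Chunk` type, the 250-token chunk size, and the 500 tokens of fixed prompt overhead are assumptions picked so the total matches the ~3,000-token call used earlier; the scores stand in for whatever a re-ranker returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    score: float   # re-ranker relevance score (higher is better)
    tokens: int    # token count of the chunk text


def keep_top_k(chunks: list[Chunk], k: int) -> list[Chunk]:
    """Keep the k highest-scoring chunks, best first."""
    return sorted(chunks, key=lambda c: c.score, reverse=True)[:k]


def reduction(before: int, after: int) -> float:
    """Fractional reduction going from `before` tokens to `after` tokens."""
    return 1.0 - after / before


# Hypothetical retrieval result: 10 chunks of 250 tokens each.
retrieved = [Chunk(f"c{i}", score=1.0 - i * 0.05, tokens=250) for i in range(10)]
fixed_tokens = 500  # assumed system prompt + history + user query

kept = keep_top_k(retrieved, k=3)
chunk_before = sum(c.tokens for c in retrieved)
chunk_after = sum(c.tokens for c in kept)

print(f"chunk tokens cut by {reduction(chunk_before, chunk_after):.0%}")
print(
    "total input cut by "
    f"{reduction(fixed_tokens + chunk_before, fixed_tokens + chunk_after):.0%}"
)
```

Under these assumptions the retrieved-chunk tokens drop by 70%, while total input tokens drop by a little under 60%, because the fixed part of the prompt does not shrink. The larger the fixed overhead, the more it pays to also compact history and cache the system prompt.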

4. Output budgeting

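Set max_tokens to a realistic ceiling. A response that finishes early stops on its own, so the ceiling only bites on the runaway cases. It truncates rather than shortens, so set it above the longest response the use case legitimately needs. Without a ceiling, the model sometimes generates much longer responses than the use case requires.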
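One possible way to pick that ceiling is from observed output lengths rather than by guesswork. The heuristic below (99th percentile plus 20% headroom) is an assumption for illustration, not a provider recommendation, and `suggest_max_tokens` is a hypothetical helper.

```python
import math


def suggest_max_tokens(
    observed_output_tokens: list[int],
    percentile: int = 99,
    headroom: float = 1.2,
) -> int:
    """Suggest a max_tokens ceiling from logged output token counts.

    Uses the nearest-rank percentile of observed lengths, then adds
    headroom so legitimate long answers are not truncated.
    """
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(observed_output_tokens)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return math.ceil(ordered[rank - 1] * headroom)


# Hypothetical log sample: most answers are short, one ran away.
sample = [420, 510, 480, 390, 530, 460, 500, 445, 470, 3900]
print(suggest_max_tokens(sample, percentile=90))
```

Using a percentile below 100 keeps a single runaway response from setting the ceiling. Responses that hit the cap are cut off mid-answer, so it is worth tracking how often that happens.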

5. Embedding model selection

For most workloads, the cheapest BAA-covered embedding model is fine. Spending more on embedding rarely translates to retrieval quality gains worth the operational complexity.

Budget framework

For a planning conversation with a finance partner, here is the rough shape:

Pilot (10K–100K invocations/month, < 1M vectors). $500–$2,500 per month all-in. Embedding negligible, vector DB modest, LLM from the low hundreds to the low thousands of dollars at roughly $0.017 per call, ops modest.
Production (1M invocations/month, 1–10M vectors). $15K–$30K per month. LLM dominates, ops growing, vector DB and embedding are background noise.
Enterprise scale (10M+ invocations/month). Six figures per month. Routing, caching, and prompt compaction become essential to keep the LLM line manageable.

These numbers are illustrative, not pricing commitments. Real numbers depend on the model mix, the cache hit rate, and the complexity of each invocation.
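For a planning spreadsheet, the tiers above reduce to a small formula. The sketch below is a rough estimator under stated assumptions: the per-call LLM cost is the unrounded Sonnet figure from earlier, and the vector DB, embedding, and ops amounts in the example are placeholder inputs, not measurements.

```python
def monthly_budget_usd(
    invocations: int,
    llm_cost_per_call: float,
    cache_hit_rate: float,
    vector_db: float,
    embedding: float,
    ops: float,
) -> dict[str, float]:
    """Rough monthly budget breakdown; cached invocations skip the LLM call."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    llm = invocations * (1.0 - cache_hit_rate) * llm_cost_per_call
    total = llm + vector_db + embedding + ops
    return {
        "llm": llm,
        "vector_db": vector_db,
        "embedding": embedding,
        "ops": ops,
        "total": total,
        "llm_share": llm / total if total else 0.0,
    }


# Hypothetical production tier: 1M invocations, no caching.
budget = monthly_budget_usd(
    invocations=1_000_000,
    llm_cost_per_call=0.0165,
    cache_hit_rate=0.0,
    vector_db=400.0,    # placeholder
    embedding=50.0,     # placeholder
    ops=3_000.0,        # placeholder
)
print(f"total ${budget['total']:,.0f}, LLM share {budget['llm_share']:.0%}")
```

With these placeholder inputs the total lands inside the production range above and the LLM share falls in the 60–85% band. Re-running with a 40% cache hit rate shows how much of the total a cache can remove.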

What to track

If you are operating a production RAG system, the cost metrics to put on a dashboard:

Cost per invocation. Total cost / total invocations per month. Tracks model selection and prompt compaction.
Cache hit rate. Cached invocations / total invocations. Higher is cheaper.
Cost per active user. Total cost / monthly active users. Connects cost to value.
Tokens per invocation. Input and output, separately. Trends here reveal prompt bloat early.
Operations overhead. Logging, eval, re-indexing as a percentage of total cost.

A dashboard that shows these trends weekly catches cost regressions before they become surprises.
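The metrics above are simple ratios over monthly totals. A minimal sketch, assuming those totals are already exported from billing and logs; the `MonthlyUsage` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonthlyUsage:
    total_cost_usd: float
    ops_cost_usd: float        # logging + eval + re-indexing
    invocations: int
    cached_invocations: int
    active_users: int
    input_tokens: int
    output_tokens: int


def cost_metrics(u: MonthlyUsage) -> dict[str, float]:
    """Compute the dashboard metrics listed above from monthly totals."""
    if u.invocations <= 0 or u.active_users <= 0 or u.total_cost_usd <= 0:
        raise ValueError("invocations, active_users, and total cost must be positive")
    return {
        "cost_per_invocation": u.total_cost_usd / u.invocations,
        "cache_hit_rate": u.cached_invocations / u.invocations,
        "cost_per_active_user": u.total_cost_usd / u.active_users,
        "input_tokens_per_invocation": u.input_tokens / u.invocations,
        "output_tokens_per_invocation": u.output_tokens / u.invocations,
        "ops_overhead": u.ops_cost_usd / u.total_cost_usd,
    }


# Hypothetical month of usage.
usage = MonthlyUsage(
    total_cost_usd=20_000.0,
    ops_cost_usd=3_000.0,
    invocations=1_000_000,
    cached_invocations=250_000,
    active_users=8_000,
    input_tokens=3_000_000_000,
    output_tokens=500_000_000,
)
for name, value in cost_metrics(usage).items():
    print(f"{name}: {value:,.4f}")
```

Tokens per invocation here is averaged over all invocations, cached ones included. Dividing by uncached invocations instead gives a cleaner signal for prompt bloat.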

Where most teams overspend

The pattern we see most often: a team builds a working RAG system, deploys it, and never re-evaluates the model selection. They use the most capable model for everything because it was the right choice during prototyping. A year later, 80% of the workload would run fine on a model 5× cheaper, and the bill is nearly 3× what it could be (routing that 80% to the cheaper model leaves 0.8 × 0.2 + 0.2 = 0.36 of the original spend).

The fix is not exotic. Profile actual workload by complexity. Route accordingly. Re-profile every quarter. The savings show up immediately, the engineering work is bounded, and the system gets cheaper without getting worse.