RAG (Retrieval-Augmented Generation)

RAG (retrieval-augmented generation) is the dominant pattern for building LLM applications that need to reason over a private corpus — internal documents, knowledge bases, regulatory documents, or any data the model was not trained on.

A RAG system has two phases. At indexing time, you split source documents into chunks, generate an embedding for each chunk, and store the embeddings in a vector database. At query time, you embed the user's question, retrieve the chunks whose embeddings are most similar to it, and pass those chunks to the LLM as context alongside the original question.
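The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is a plain in-memory list, and every function name here (`chunk_text`, `build_index`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Naive fixed-size chunking: split text into runs of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): lowercase token counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents):
    """Indexing phase: chunk each document, embed each chunk, store the result.

    documents maps a document id to its full text.
    """
    index = []
    for doc_id, text in documents.items():
        for chunk_no, chunk in enumerate(chunk_text(text)):
            index.append({
                "doc_id": doc_id,
                "chunk_no": chunk_no,
                "text": chunk,
                "vector": embed(chunk),
            })
    return index


def retrieve(index, question, k=2):
    """Query phase, step 1: embed the question and return the k most similar chunks."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, entry["vector"]), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Query phase, step 2: assemble retrieved chunks and the question into one prompt."""
    context = "\n".join(
        f"[{entry['doc_id']}#{entry['chunk_no']}] {entry['text']}" for _, entry in hits
    )
    return f"Context:\n{context}\n\nQuestion: {question}"


documents = {
    "policy": "Employees must rotate passwords every ninety days.",
    "travel": "Travel expenses require manager approval before booking.",
}
index = build_index(documents)
hits = retrieve(index, "How often must passwords be rotated?", k=1)
prompt = build_prompt("How often must passwords be rotated?", hits)
```

The resulting `prompt` string is what would be sent to the LLM. In a real system the embedding model and the similarity search would be swapped for actual services, while the shape of the pipeline stays the same.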

Why this matters in regulated industries: RAG keeps your data outside the model. The LLM sees retrieved chunks at inference time but never trains on them. That separation makes audit logging, access control, and right-to-deletion concrete — you can show exactly which documents informed a given answer, and you can revoke access to a chunk by deleting it from the index without retraining.
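As a rough sketch of why that separation helps, the snippet below records which chunks informed an answer and removes a document's chunks from the index. It assumes the index is an in-memory list of dictionaries with `doc_id` and `chunk_no` keys. The helper names `record_answer` and `delete_document` are hypothetical, and a real vector database would expose its own deletion call instead.

```python
from datetime import datetime, timezone


def record_answer(audit_log, question, retrieved_chunks):
    """Append an audit entry listing the chunks that were passed to the LLM."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "sources": [(c["doc_id"], c["chunk_no"]) for c in retrieved_chunks],
    }
    audit_log.append(entry)
    return entry


def delete_document(index, doc_id):
    """Return a new index with every chunk of doc_id removed; no retraining involved."""
    return [c for c in index if c["doc_id"] != doc_id]


# Hypothetical in-memory index with two documents.
index = [
    {"doc_id": "contract-a", "chunk_no": 0, "text": "Clause one."},
    {"doc_id": "contract-b", "chunk_no": 0, "text": "Clause two."},
]
audit_log = []
entry = record_answer(audit_log, "What does clause one say?", index[:1])
index_after_deletion = delete_document(index, "contract-a")
```

After the deletion, later queries can no longer retrieve anything from `contract-a`, while the audit log still shows that it informed the earlier answer.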

The hard parts of RAG are not the LLM call. They are chunk boundaries, retrieval quality, citation formatting, evaluation harnesses, and handling cases where retrieval returns nothing relevant. Most production RAG failures are retrieval failures, not generation failures.
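One common way to handle the "nothing relevant" case is a similarity threshold: if no retrieved chunk scores above it, the system abstains instead of calling the LLM with weak context. The sketch below assumes retrieval returns `(score, text)` pairs with scores in the range 0 to 1. The cutoff of 0.25 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def select_context(scored_chunks, min_score=0.25, k=3):
    """Keep at most k chunks whose similarity score reaches min_score, best first."""
    relevant = [(score, text) for score, text in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return relevant[:k]


def answer_or_abstain(question, scored_chunks, min_score=0.25):
    """Build a prompt only when retrieval found something relevant; otherwise abstain."""
    context = select_context(scored_chunks, min_score)
    if not context:
        # Nothing cleared the threshold: report it rather than let the LLM guess.
        return {"status": "no_relevant_context", "prompt": None}
    joined = "\n".join(text for _, text in context)
    return {"status": "ok", "prompt": f"Context:\n{joined}\n\nQuestion: {question}"}


weak = answer_or_abstain("q", [(0.10, "alpha chunk"), (0.05, "beta chunk")])
strong = answer_or_abstain("q", [(0.90, "alpha chunk"), (0.10, "beta chunk")])
```

In practice the threshold would be tuned against an evaluation set, since a value that is too high suppresses valid answers and one that is too low lets irrelevant chunks through.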
