AI & LLMs Advanced

Building a RAG System in Django with PostgreSQL and pgvector

Embed documents, store vectors in Postgres, and let an LLM answer questions about your own data — without hallucinating its sources.

DjangoZen Team May 09, 2026 19 min read 156 views

Large language models are powerful but they do not know your data — your documents, your product catalog, your knowledge base — and they confidently make things up when asked about what they have not seen. Retrieval-Augmented Generation fixes this by retrieving relevant information from your own data and giving it to the model as context, so answers are grounded in facts you control. This tutorial builds a production RAG system in Django using PostgreSQL and pgvector, covering embeddings, vector search, chunking, the retrieval pipeline, and the practices that make RAG actually work.

The problem RAG solves

An LLM knows only what it was trained on, with a knowledge cutoff and no access to your private or current data. Ask it about your company's internal documentation or a customer's recent orders and it will either decline or, worse, invent a plausible-sounding answer — a hallucination. RAG addresses this by retrieving relevant pieces of your data at query time and including them in the prompt, so the model answers from provided facts rather than its training memory. This grounds responses in your actual content, dramatically reduces hallucination, and lets you build assistants over private, current, and domain-specific knowledge that the base model could never have known. RAG is the standard pattern for putting an LLM to work on your own data.

How RAG works

The RAG pipeline has two phases. Indexing (done ahead of time): your documents are split into chunks, each chunk is converted into an embedding — a numeric vector capturing its meaning — and the vectors are stored. Retrieval and generation (at query time): the user's question is embedded the same way, the most similar chunks are found by vector search, and those chunks are inserted into the prompt sent to the LLM, which answers using them. The whole system rests on the idea that semantically similar text has similar embeddings, so finding relevant context becomes a matter of finding nearby vectors. Understanding these two phases is the map for everything that follows.

Embeddings: meaning as vectors

An embedding is a list of numbers — a vector — that represents the meaning of a piece of text, produced by an embedding model. The key property is that texts with similar meaning produce vectors that are close together in the vector space, while unrelated texts are far apart. This is what makes semantic search possible: instead of matching keywords, you match meaning, so a query about "shipping costs" can find a document about "delivery fees" even with no shared words. Embeddings turn the fuzzy notion of "relevant" into the precise, computable notion of "nearby in vector space," which is the foundation RAG retrieval is built on. Choosing a good embedding model and using it consistently for both documents and queries is essential.

pgvector: vectors in PostgreSQL

You do not need a separate vector database to do RAG. The pgvector extension adds a vector column type and similarity search to PostgreSQL, letting you store embeddings right alongside your relational data and query them with SQL:

CREATE EXTENSION vector;
-- a table with a 1536-dimension embedding column
ALTER TABLE document_chunk ADD COLUMN embedding vector(1536);

This is a major simplification: your chunks, their metadata, and their embeddings live in one database you already run, queried together, backed up together, with no separate system to operate. For the many applications whose scale fits comfortably in PostgreSQL, pgvector is the pragmatic choice, keeping your RAG data in the same place as the rest of your app.

Using pgvector from Django

The pgvector Python package integrates with Django, providing a vector field you add to a model so chunks and their embeddings are ordinary Django objects. You store each chunk's text, its embedding, and any metadata — which document it came from, its position — as model fields, and you query them through the ORM with vector similarity. This means your RAG storage layer is just Django models, with all the familiar tooling: migrations, the admin, querysets. Keeping the vector store inside your Django models rather than in an external service makes the whole system simpler to build, debug, and maintain, and lets retrieval join naturally against your other relational data when needed.

Chunking documents

Documents must be split into chunks before embedding, and how you chunk has an outsized effect on quality. Chunks that are too large dilute the relevant signal and waste context; chunks that are too small lose the surrounding meaning needed to be useful. The goal is chunks that each capture a coherent unit of meaning — a paragraph, a section — sized to fit usefully in the prompt. Overlapping chunks slightly so that context spanning a boundary is not lost is a common refinement. Chunking is not a trivial preprocessing step but a quality lever: poor chunking retrieves fragments that are individually relevant but collectively unhelpful, while good chunking retrieves coherent, self-contained context the model can actually use to answer.

The indexing pipeline

Indexing is the offline process that prepares your data for retrieval: take each document, split it into chunks, generate an embedding for each chunk via the embedding model, and store the chunk text and its vector. This runs ahead of any query — as a batch job over your corpus, and incrementally as documents are added or change. Because it involves calling an embedding model for potentially many chunks, it is naturally a background task, run through Celery, with care for cost and rate limits. Keeping the index current — re-embedding changed documents, removing deleted ones — is an ongoing concern, because a RAG system is only as good as the freshness and coverage of the data it has indexed.

Vector search and retrieval

At query time, you embed the user's question with the same model used for the documents, then find the chunks whose vectors are most similar — typically the top handful by cosine similarity or distance — using pgvector's similarity operators through the ORM. These nearest chunks are your retrieved context. The number you retrieve is a balance: too few risks missing the answer, too many dilutes the prompt and raises cost. This retrieval step is the heart of RAG, where the semantic-similarity property of embeddings turns a natural-language question into a set of the most relevant pieces of your own data, ready to hand to the model. Getting retrieval right — good embeddings, sensible chunk count — largely determines answer quality.

Indexing vectors for speed

Searching vectors by brute force compares the query against every stored vector, which is fine for thousands of chunks but slow for millions. pgvector supports approximate-nearest-neighbor indexes (such as HNSW) that make similarity search fast at scale by trading a tiny amount of accuracy for a large speed gain. As your corpus grows, adding such an index is what keeps retrieval fast enough for interactive use. This mirrors ordinary database indexing: the data is searchable without an index, but an index is what makes it fast at scale. For a small knowledge base you may not need one; for a large one, an approximate index is essential to keep query latency acceptable.

Assembling the prompt

With relevant chunks retrieved, you construct the prompt sent to the LLM: typically a system instruction telling the model to answer based on the provided context, the retrieved chunks themselves, and the user's question. The instruction matters — directing the model to rely on the context and to say when the answer is not present helps keep it grounded and honest. How you format and order the context, and how clearly you instruct the model to use it, affects answer quality. This assembly step is where retrieval meets generation: the retrieved facts become the grounding the model reasons over, and a well-constructed prompt is what turns relevant chunks into a accurate, well-sourced answer rather than a vague one.

Grounding and reducing hallucination

RAG's central benefit is grounding answers in retrieved facts, but it is not automatic — a model can still ignore the context or fill gaps with invention. Reduce this by instructing the model explicitly to answer only from the provided context and to admit when the information is not there, and by retrieving good context in the first place, since the model cannot ground an answer in chunks that were never retrieved. Citing which chunks an answer came from lets users verify it and builds trust. The honest framing is that RAG greatly reduces hallucination by giving the model facts to work from, but quality depends on retrieving the right context and instructing the model to stay within it — grounding is a discipline, not a guarantee.

Evaluating and improving

A RAG system needs evaluation to improve, because problems can hide in either retrieval or generation. When answers are wrong, diagnose where: did retrieval fail to find the relevant chunk (a chunking, embedding, or indexing issue), or did the model fail to use the context it was given (a prompting issue)? Building a set of representative questions with known good answers lets you measure quality and catch regressions as you tune chunking, retrieval count, and prompts. Treating RAG quality as something to measure and iterate on, rather than assume, is what turns a demo that works on a few questions into a system that reliably answers real ones. The pipeline has several knobs, and evaluation tells you which to turn.

Pure vector search captures semantic meaning but can miss exact matches that keyword search handles well — a specific product code, a precise name, an acronym. Hybrid search combines both: vector similarity for meaning and traditional keyword or full-text search for exact terms, merging the results for retrieval that is strong on both fronts. PostgreSQL is well-suited to this because it offers both pgvector for embeddings and built-in full-text search in the same database, so you can run both kinds of search and combine them without a second system. Hybrid retrieval often noticeably improves quality over either approach alone, catching both the semantically-related and the exactly-matching content a question needs.

Reranking retrieved results

The initial vector search retrieves candidates quickly but approximately, and a reranking step can sharpen quality. After fetching a larger set of candidate chunks, a reranker — often a more powerful model that scores each candidate's relevance to the query directly — reorders them so the most relevant rise to the top, and you pass only the best to the language model. This two-stage approach, fast broad retrieval followed by precise reranking, balances speed and quality: you cast a wide net cheaply, then spend more effort ranking the catch. Reranking is a common refinement when basic retrieval brings back relevant-ish chunks but not in the ideal order for grounding the answer.

Metadata filtering and access control

Storing embeddings in PostgreSQL alongside relational data unlocks a powerful capability: filtering retrieval by metadata. You can restrict vector search to chunks from a particular document set, a date range, a category, or — crucially — only the documents a given user is permitted to see. This last point matters enormously for multi-tenant or permissioned RAG: retrieval must never surface content the user is not authorized to access, and because your chunks are ordinary database rows with foreign keys, you enforce access control with normal query filters combined with the vector search. Keeping vectors in your relational database, rather than a separate store, is what makes this clean integration of semantic search and access control possible.

Managing cost and latency

A production RAG system has real cost and latency considerations to manage. Each query involves embedding the question and calling the language model, both of which cost money and add latency, and indexing a large corpus means many embedding calls. Control this by caching embeddings for repeated queries, batching embedding generation during indexing, choosing model sizes appropriate to the task, and retrieving only as many chunks as genuinely help — more context means higher cost and slower responses. Being deliberate about where the time and money go, and optimizing the expensive steps, is what makes a RAG system practical to run at scale rather than a demo that is too slow or costly for real traffic.

Summary

RAG grounds a language model in your own data by retrieving relevant content at query time and giving it to the model as context, turning a system that confidently hallucinates about what it has not seen into one that answers from facts you control. The pipeline has two phases: indexing, where documents are chunked, embedded into meaning-capturing vectors, and stored; and retrieval-and-generation, where the question is embedded, the most similar chunks are found by vector search, and those chunks ground the model's answer. PostgreSQL with pgvector lets you do all of this without a separate vector database, keeping chunks, embeddings, and metadata as ordinary Django models. Quality hinges on the details — sensible chunking, a good embedding model used consistently, retrieving the right number of chunks, an approximate index for speed at scale, and a prompt that instructs the model to stay grounded in the context and admit what it does not know. Evaluate retrieval and generation separately to know which to improve, and you build a RAG system that reliably answers questions over your private, current, domain-specific knowledge — the standard way to put an LLM to work on data it was never trained on.