
Building a RAG System in Django with PostgreSQL and pgvector

Embed documents, store vectors in Postgres, and let an LLM answer questions about your own data — without hallucinating its sources.

DjangoZen Team May 09, 2026 15 min read

Why RAG

LLMs hallucinate when asked about anything they weren't trained on. Worse, they confidently invent answers about your specific documents, your customer database, your internal wiki — none of which were in the training data.

Retrieval-Augmented Generation (RAG) fixes this by:

  1. Searching your own documents for the most relevant snippets
  2. Sticking those snippets into the prompt as context
  3. Asking the LLM to answer based on the provided context

Done well, RAG lets you build a chatbot that answers questions about your product, your codebase, your support history — without lying.

The pipeline

User question
   ↓
Embed the question (vector)
   ↓
Search vector DB for nearest documents
   ↓
Top N chunks → stuff into prompt
   ↓
LLM answers using only the provided chunks
   ↓
Return answer + cite sources

Simple in concept. Several places to mess it up.

Stack choices

For a Django app, the simplest sensible stack:

  • Postgres + pgvector — your existing database becomes the vector store. No new infrastructure.
  • OpenAI text-embedding-3-small or Voyage AI voyage-3 for embeddings (Anthropic doesn't offer a public embedding model as of 2026, so you end up mixing providers).
  • Claude Sonnet 4.6 for the answer generation step.

Why pgvector over Pinecone/Weaviate/Qdrant: at most app sizes, your Postgres is plenty fast for vector search, you avoid an extra service, and your data stays in one place. Tutorial 6 covers when to upgrade.
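The code below reads its keys and model name from Django settings. A minimal sketch of what that looks like; the setting names (OPENAI_API_KEY, ANTHROPIC_API_KEY, ANTHROPIC_MODEL) are this tutorial's own conventions, not Django built-ins:

```python
# settings.py (sketch -- key names are this tutorial's conventions)
import os

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
ANTHROPIC_MODEL = "claude-sonnet-4-6"  # pin whichever generation model you target
```

Reading from the environment keeps secrets out of version control; fail loudly at startup rather than at the first API call.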

Step 1 — install pgvector

On the database server (Ubuntu):

sudo apt install postgresql-16-pgvector

Then in Postgres:

CREATE EXTENSION IF NOT EXISTS vector;

In your Django app:

pip install pgvector openai
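If you'd rather not run the CREATE EXTENSION statement by hand, the pgvector Django integration ships a migration operation that does it for you (the connecting database role needs privileges to create extensions). A sketch, assuming your initial migration is `0001_initial`:

```python
# myapp/migrations/0002_enable_pgvector.py (sketch)
from django.db import migrations
from pgvector.django import VectorExtension


class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]

    # Runs CREATE EXTENSION IF NOT EXISTS vector; as part of migrate
    operations = [VectorExtension()]
```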

Step 2 — the model

# myapp/models.py
from django.db import models
from pgvector.django import VectorField, HnswIndex


class DocumentChunk(models.Model):
    """A piece of a document, with its embedding for semantic search."""
    source = models.CharField(max_length=255)  # e.g. "docs/install.md"
    chunk_index = models.PositiveIntegerField()
    content = models.TextField()
    embedding = VectorField(dimensions=1536)  # text-embedding-3-small dims
    metadata = models.JSONField(default=dict, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        unique_together = ("source", "chunk_index")
        indexes = [
            HnswIndex(
                name="chunk_embedding_hnsw",
                fields=["embedding"],
                m=16,
                ef_construction=64,
                opclasses=["vector_cosine_ops"],
            )
        ]

The HNSW index makes nearest-neighbour search fast (sub-millisecond on hundreds of thousands of rows). Migrate as usual.
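For the curious, the DDL Django generates for that index is roughly equivalent to this pgvector SQL (`myapp_documentchunk` is Django's default table name for the model above):

```sql
CREATE INDEX chunk_embedding_hnsw ON myapp_documentchunk
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```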

Step 3 — chunking documents

LLMs have context limits and embedding models work better on shorter text. Chunk your documents before embedding.

# myapp/rag/chunking.py
import re

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """
    Split text into overlapping chunks around natural boundaries.
    Overlap helps preserve context across chunk boundaries.
    """
    text = text.strip()
    if len(text) <= max_chars:
        return [text]

    chunks = []
    pos = 0
    while pos < len(text):
        end = min(pos + max_chars, len(text))
        # Try to break at a paragraph or sentence boundary
        if end < len(text):
            for sep in ["\n\n", "\n", ". ", " "]:
                last = text.rfind(sep, pos + max_chars - 300, end)
                if last > pos:
                    end = last + len(sep)
                    break
        chunks.append(text[pos:end].strip())
        if end >= len(text):
            break  # done; stepping back by overlap here would loop forever
        pos = end - overlap
    return [c for c in chunks if c]

Rule of thumb: chunks of 500–2000 characters, with 100–300 character overlap. Smaller chunks = more precise retrieval but more rows. Larger chunks = more context but less precise. Tune for your content.
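Character counts are a proxy for what the embedding model actually sees: tokens. A rough heuristic (about 4 characters per English token; this ratio is an assumption and varies with content) helps sanity-check that chunks stay well under the model's input limit:

```python
def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; English prose averages ~4 chars per token."""
    return max(1, round(len(text) / chars_per_token))
```

A 1500-character chunk is therefore roughly 375 tokens, comfortably under the ~8k-token input limit of small embedding models.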

Step 4 — embedding and indexing

# myapp/rag/embed.py
from openai import OpenAI
from django.conf import settings
from myapp.models import DocumentChunk  # models.py lives one package up
from .chunking import chunk_text

oai = OpenAI(api_key=settings.OPENAI_API_KEY)


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Get embeddings in a single batched call."""
    response = oai.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]


def index_document(source: str, full_text: str):
    """Chunk, embed, and store a document."""
    # Wipe any existing chunks for this source (idempotent re-index)
    DocumentChunk.objects.filter(source=source).delete()

    chunks = chunk_text(full_text)
    if not chunks:
        return

    embeddings = embed_texts(chunks)
    DocumentChunk.objects.bulk_create([
        DocumentChunk(
            source=source,
            chunk_index=i,
            content=text,
            embedding=emb,
        )
        for i, (text, emb) in enumerate(zip(chunks, embeddings))
    ])

A management command runs the indexer:

# myapp/management/commands/index_docs.py
from django.core.management.base import BaseCommand
from pathlib import Path
from myapp.rag.embed import index_document


class Command(BaseCommand):
    def handle(self, *args, **opts):
        for path in Path("docs/").rglob("*.md"):
            text = path.read_text()
            index_document(source=str(path), full_text=text)
            self.stdout.write(f"Indexed {path}")

Step 5 — retrieval

# myapp/rag/retrieve.py
from pgvector.django import CosineDistance
from .embed import embed_texts
from myapp.models import DocumentChunk  # models.py lives one package up


def retrieve(query: str, k: int = 5) -> list[DocumentChunk]:
    """Return top-k chunks most similar to the query."""
    [query_embedding] = embed_texts([query])
    return list(
        DocumentChunk.objects
        .annotate(distance=CosineDistance("embedding", query_embedding))
        .order_by("distance")[:k]
    )

Step 6 — answering

# myapp/rag/answer.py
import anthropic
from django.conf import settings
from .retrieve import retrieve

client = anthropic.Anthropic(api_key=settings.ANTHROPIC_API_KEY)

SYSTEM_PROMPT = """You are a helpful assistant answering questions based ONLY on the provided context.

If the context does not contain enough information to answer the question, say so explicitly. Do not make up facts. Always cite the source for each claim by referring to the [source] tag in the context."""


def answer(question: str) -> dict:
    chunks = retrieve(question, k=5)

    if not chunks:
        return {"answer": "I have no information on that topic.", "sources": []}

    context = "\n\n".join(
        f"[source: {c.source}]\n{c.content}" for c in chunks
    )

    response = client.messages.create(
        model=settings.ANTHROPIC_MODEL,
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }],
    )

    return {
        "answer": response.content[0].text,
        "sources": list({c.source for c in chunks}),
    }

The LLM is now grounded in your documents. It cites them. If the context doesn't contain the answer, the system prompt instructs it to say so.

Step 7 — the view

# myapp/views.py
from django.contrib.auth.decorators import login_required
from django.http import JsonResponse
from django.shortcuts import render

from myapp.rag.answer import answer


@login_required
def ask(request):
    if request.method != "POST":
        return render(request, "ask.html")
    question = request.POST.get("question", "")
    result = answer(question)
    return JsonResponse(result)

What goes wrong, and how to fix it

The answers are vague or wrong. Your retrieval is missing the right chunks. Inspect what comes back from retrieve(). If the right document isn't in the top-5, your chunking strategy is off, or your embeddings aren't capturing the relevant similarity.

Latency is bad. The embedding call is usually the slowest step. Cache embeddings of common questions in Redis. Use HNSW indexes (above) for fast vector search.
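Caching the query-embedding step is cheap to sketch. The helper below works with any cache object exposing get/set (Django's cache framework, a thin Redis wrapper, or a fake in tests); the `emb:` key scheme is this tutorial's own convention:

```python
import hashlib


def cached_embedding(query, embed_fn, cache, ttl=3600):
    """Return the embedding for a query, caching by content hash."""
    key = "emb:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    vector = cache.get(key)
    if vector is None:
        vector = embed_fn([query])[0]  # only hit the API on a miss
        cache.set(key, vector, ttl)
    return vector
```

Normalising the query (strip + lowercase) before hashing makes trivially different phrasings share a cache entry.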

Costs spiral. Embeddings are cheap ($0.02 per million tokens for text-embedding-3-small), but generation is not. Use prompt caching on the system prompt and on stable context.

The LLM still hallucinates. Add stricter wording to the system prompt: "If you cannot find the answer in the context, respond exactly: 'I don't have information on that.'" Pair with evals (tutorial 10).

Where to go from here

This is a working RAG system. Real-world enhancements:

  • Hybrid search — combine vector search with keyword search (BM25) for better recall
  • Reranking — use a cross-encoder to reorder retrieved chunks
  • Query rewriting — ask the LLM to rewrite the user's question for better retrieval
  • Multi-step retrieval — let the model issue multiple search queries before answering
  • Citation accuracy checks — programmatically verify that cited claims are in the retrieved chunks

Each adds complexity. Start with the basic pipeline and only add what your evals show you need.
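For the hybrid-search idea above, a common way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF), which needs only each list's ordering, never comparable scores. A minimal sketch:

```python
def rrf_merge(rankings, k=60):
    """Reciprocal Rank Fusion: combine ranked lists of document ids.

    Each item scores 1/(k + rank) per list it appears in; k=60 is the
    constant from the original RRF paper and a common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both BM25 and vector search floats to the top, even when neither ranking alone put it first.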