AI & LLMs Beginner

How Large Language Models Actually Work (Without the Math)

Tokens, transformers, context windows, why LLMs hallucinate, and how to choose between Claude, GPT, and open-source models — explained in plain English.

DjangoZen Team May 09, 2026 10 min read

Why understanding the basics matters

You don't need to read research papers to use LLMs effectively. But you do need a working mental model of what they actually do — otherwise you'll fight the tool instead of using it.

This tutorial gives you that mental model in plain English. No math, no jargon you don't need.

LLMs are next-token predictors

At their core, large language models do one thing: given some text, predict the next chunk of text most likely to follow.

That chunk is called a token. A token is roughly 3–4 characters of English. The word "tokenization" is itself two tokens: "token" and "ization." The phrase "Hello world" is also two tokens.
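
If you want to see tokenization for yourself, OpenAI's open-source tiktoken library splits text the way GPT-family models do. Claude uses its own tokenizer, so exact counts differ slightly, but the idea is the same:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several GPT models

for text in ["Hello world", "tokenization", "The cat sat on the mat."]:
    token_ids = enc.encode(text)
    print(f"{text!r} -> {len(token_ids)} tokens")

# Short common words are usually one token each; longer or rarer words get
# split into pieces, and punctuation often gets its own token.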

Pseudocode for what an LLM does:

def llm_generate(prompt, max_tokens=500):
    output = ""
    for _ in range(max_tokens):
        # predict_next is the trained model itself: given everything seen
        # so far, it returns the most likely next token. The "AI magic".
        next_token = predict_next(prompt + output)
        if next_token == END_TOKEN:  # special stop token: the model decides it's done
            break
        output += next_token
    return output

That's it. The "intelligence" is entirely in the predict_next function — and that function was trained on a huge corpus of text (books, websites, code, etc.). The model has learned statistical patterns of what tokens tend to follow what other tokens.

The transformer architecture, in one paragraph

Modern LLMs use an architecture called a transformer. The key trick is the attention mechanism — when predicting the next token, the model can "attend to" any previous token in the input, weighting each by how relevant it is. So if you write "The cat sat on the mat. The animal was happy.", the model can connect "animal" back to "cat" because it pays attention to that earlier token. That's it. Everything else is engineering details.
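
Here is a toy illustration of that idea. The relevance scores below are made up for the example; a real transformer learns them from data and computes them with vectors, but the weighting step is the same in spirit:

# Predicting the token after "The animal was ..." in:
# "The cat sat on the mat. The animal was happy."
relevance = {
    "The": 0.1, "cat": 4.0, "sat": 0.3, "on": 0.1, "the": 0.2,
    "mat": 0.5, ".": 0.1, "animal": 2.0, "was": 0.8,
}

# Normalize the scores into attention weights that sum to 1.
total = sum(relevance.values())
attention = {token: score / total for token, score in relevance.items()}

# "cat" carries the most weight, so it influences the prediction most.
print(max(attention, key=attention.get))  # -> cat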

Tokens, context windows, and what they cost you

Two numbers matter for any LLM call:

  1. Tokens in (your prompt) — what you send
  2. Tokens out (the response) — what you get back

You're billed for both, usually at different rates. Output tokens cost more because generating them takes more compute.

The context window is the max number of tokens (input + output) the model can handle in one call. Common sizes in 2026:

  • Claude Sonnet 4.6 / Opus 4.7: 200K tokens (about 150,000 words, enough to fit most novels)
  • Claude Opus 4.7 (extended): 1M tokens (a whole codebase or several books)
  • GPT-5: 200K–400K tokens depending on tier

What this means in practice: a tutorial like this one is about 1,500 tokens. A typical web page is 2,000–5,000 tokens. So a 200K context window is enormous — you can stuff dozens of documents in.
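
You can get a usable estimate without calling a tokenizer at all: divide the character count by four. A minimal sketch, with the per-token prices as made-up placeholders (check your provider's current pricing page):

def estimate_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def estimate_cost(prompt, expected_output_tokens,
                  price_in_per_1k=0.003, price_out_per_1k=0.015):
    # The default prices are placeholders, not real rates.
    input_tokens = estimate_tokens(prompt)
    input_cost = (input_tokens / 1000) * price_in_per_1k
    output_cost = (expected_output_tokens / 1000) * price_out_per_1k
    return input_tokens, round(input_cost + output_cost, 6)

tokens, cost = estimate_cost("Summarize the attached contract. " * 100, 500)
print(tokens, cost)  # ~800 input tokens, about a cent at these placeholder rates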

Why LLMs hallucinate

Because they're next-token predictors, LLMs always produce plausible-sounding text. They don't have a built-in "I don't know" mechanism. If you ask a question they have no training data for, they'll generate something that looks like a correct answer based on similar-sounding training examples.

This is hallucination, and it's the #1 thing to design around in production:

  • Citations that don't exist
  • API endpoints invented from a vague memory of similar APIs
  • Confident wrong answers about recent events
  • Made-up names, dates, statistics

Mitigations (covered in later tutorials):

  • Retrieval-augmented generation (RAG) — give the model the actual source documents
  • Structured output — force JSON with required fields so missing data is visible (see the sketch after this list)
  • Self-critique prompts — ask the model to double-check its own answer
  • Always show sources — let the user verify
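
The structured-output mitigation is the easiest one to sketch here. The idea: ask the model for JSON with a fixed set of fields (the field names below are just an example), then validate it so missing or uncited answers fail loudly instead of reading as confident prose:

import json

REQUIRED_FIELDS = {"answer", "sources", "confidence"}

def parse_model_output(raw_text):
    # Expect the model to return JSON with exactly these fields.
    # Anything missing or malformed is surfaced instead of silently trusted.
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None, "Model did not return valid JSON"
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return None, f"Missing fields: {missing}"
    if not data["sources"]:
        return None, "No sources cited, treat the answer as unverified"
    return data, None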

Training, fine-tuning, and what's available off the shelf

LLMs are trained in three stages:

  1. Pretraining — massive next-token prediction on the open internet. Months, millions of dollars, billions of parameters. This is where "general knowledge" comes from.
  2. Fine-tuning / instruction tuning — teaching the model to follow instructions (chat-style). This is what turns a raw next-token-predictor into Claude or ChatGPT.
  3. RLHF / RLAIF — reinforcement learning from human (or AI) feedback. Aligns the model to be helpful, harmless, and honest.

You almost never need to do any of this yourself. The right pattern in 2026 is: pick a hosted frontier model, prompt it well, augment with your data via RAG. Building or fine-tuning your own LLM is a research project, not a feature shipping next sprint.

Choosing a model in 2026

Practical decision guide:

  • Claude Sonnet 4.6 — your default. Strong, fast, cheap, excellent at code and writing. Pick this unless you have a reason not to.
  • Claude Opus 4.7 — when the problem is genuinely hard: complex reasoning, multi-step planning, careful writing. More expensive, slower, smarter.
  • Claude Haiku 4.5 — high-throughput, cost-sensitive tasks. Classification, simple extraction, bulk processing.
  • OpenAI GPT-5 — competitive with Claude. Choose based on your evals, not vibes.
  • Open-source (Llama, Mistral, Qwen) — pick these when you need data privacy guarantees beyond what hosted APIs offer, or when you're at a scale where per-token cost matters more than the engineering cost of self-hosting.

For a Django app starting out: Claude Sonnet 4.6 via the Anthropic API. Easy to integrate, reliable, plenty good for almost any use case.
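
A minimal sketch of that integration using the official anthropic Python SDK (pip install anthropic). The model ID below is a placeholder; check Anthropic's model list for the current name:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID; use the current model name
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize this support ticket in two sentences: ..."},
    ],
)
print(message.content[0].text)

In a Django project this usually lives in a small service module rather than inline in a view, with the API key kept in the environment or settings.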

The takeaway

LLMs are remarkable, statistical, fallible next-token predictors. They have no memory between calls, no real understanding (they're pattern matchers), no awareness of truth. Treat them as such and you'll build solid systems on top of them. Treat them as oracles and your users will catch the hallucinations before you do.
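
That "no memory between calls" point is worth seeing concretely: the API is stateless, so a follow-up question only works if you send the earlier turns again yourself. A sketch, using the same SDK and placeholder model ID as above:

import anthropic

client = anthropic.Anthropic()

# The model sees ONLY what is in this list. Drop the first two entries
# and the follow-up question loses all of its context.
conversation = [
    {"role": "user", "content": "What's a token?"},
    {"role": "assistant", "content": "Roughly 3-4 characters of English text."},
    {"role": "user", "content": "And how big is a context window?"},
]

reply = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder ID, as above
    max_tokens=512,
    messages=conversation,
)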

The next tutorial covers a more advanced topic: how reasoning models (Claude with extended thinking, OpenAI's o-series) try to improve on raw next-token prediction by "thinking out loud" before answering — and when that's worth the extra cost.