Reasoning models think before answering. Here's how chain-of-thought prompting works, what Anthropic's extended thinking does differently, and when the extra cost is worth it.
Standard LLMs predict tokens one at a time, left-to-right, with no chance to backtrack or double-check. For simple language tasks (rewriting an email, classifying a ticket) that's fine. For multi-step problems — a math word problem, a logic puzzle, debugging a piece of code — it often fails.
The reason: complex problems require planning, evaluation, and revision, and a pure next-token predictor doesn't naturally do any of those. So researchers found ways to coax it into doing them.
The original 2022 finding: if you ask an LLM to "think step by step" before giving its answer, accuracy on reasoning tasks jumps dramatically.
# Standard prompt — often wrong on multi-step problems
"What's 23 * 47?"
# Chain-of-thought prompt — much more reliable
"What's 23 * 47? Let's think step by step before giving the final answer."
The model writes out its reasoning ("23 * 47 = 23 * 50 - 23 * 3 = 1150 - 69 = 1081") before giving the answer. Even though the model is still just predicting tokens, the intermediate tokens act as a scratchpad — each next prediction has more relevant context.
This is prompted CoT, and it works on any LLM. Costs you a few hundred extra output tokens.
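If you're calling the model from code, prompted CoT is just string concatenation. Here's a minimal sketch using the Anthropic SDK (the same client the extended-thinking example below uses); the model name is an assumption, so substitute whichever model you're on:

import anthropic

client = anthropic.Anthropic()

def ask_with_cot(question):
    # Append the CoT instruction to any question. The reasoning and the
    # final answer come back together as ordinary text.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name; use whatever you have
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"{question}\n\nLet's think step by step before giving the final answer.",
        }],
    )
    return response.content[0].text

print(ask_with_cot("What's 23 * 47?"))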
Few-shot CoT pushes reliability further: add a worked example to your prompt:
prompt = '''
Q: A shop sold 14 apples on Monday and 22 on Tuesday. How many in total?
A: 14 + 22 = 36 apples.
Q: A shop sold 9 books in the morning and 17 in the afternoon. How many in total?
A:
'''
The model sees the "show your work" pattern and follows it. This was the workhorse of production prompting before reasoning models existed.
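In practice you'd keep a small bank of worked examples and assemble the prompt programmatically. A sketch (the helper name is made up for illustration):

def build_few_shot_prompt(examples, question):
    # Each example is a (question, worked_answer) pair; the worked answer
    # shows the arithmetic, which is what the model imitates.
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)

examples = [
    ("A shop sold 14 apples on Monday and 22 on Tuesday. How many in total?",
     "14 + 22 = 36 apples."),
]
prompt = build_few_shot_prompt(
    examples,
    "A shop sold 9 books in the morning and 17 in the afternoon. How many in total?",
)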
In 2024–2025 a new generation of models was trained specifically to reason internally before answering. Examples: OpenAI's o1 and o3, DeepSeek-R1, Google's Gemini "thinking" models, and Claude with extended thinking.
These models are trained on reasoning traces — long, structured "thinking out loud" examples — so chain-of-thought is no longer a prompt trick, it's a built-in behavior.
In practical terms: you no longer have to ask for step-by-step reasoning in the prompt, the model produces a thinking phase on its own, and you pay for those thinking tokens as output. With Claude, you can opt into extended thinking on a per-call basis:
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=16000,  # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{
        "role": "user",
        "content": "Find the bug in this Django view: ..."
    }]
)

# response.content includes both thinking blocks and the final answer
for block in response.content:
    if block.type == "thinking":
        print("Reasoning:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
budget_tokens caps how many tokens the model may spend thinking. A higher budget potentially means a better answer, but more cost and latency. Note that thinking tokens count toward max_tokens, so max_tokens must be set above the budget.
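One way to calibrate the budget is to run the same prompt at several budgets and compare token usage and latency. A rough sketch; thinking tokens are billed as output tokens, so usage.output_tokens reflects them:

import time
import anthropic

client = anthropic.Anthropic()

def compare_budgets(prompt, budgets=(1024, 4000, 10000)):
    # Run the same prompt at increasing thinking budgets and report
    # output-token usage (which includes thinking tokens) and wall time.
    for budget in budgets:
        start = time.time()
        response = client.messages.create(
            model="claude-opus-4-1",
            max_tokens=budget + 4000,  # must exceed the thinking budget
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed = time.time() - start
        print(f"budget={budget}: {response.usage.output_tokens} output tokens in {elapsed:.1f}s")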
Worth it:
- Multi-step debugging and code analysis (the Django-view example above)
- Math word problems, logic puzzles, and other multi-step reasoning
- Complex analysis where a wrong answer costs more than the extra tokens
Not worth it:
- Rewriting an email, summarizing, and other single-step language tasks
- Classifying a ticket or other simple lookups
- Latency-sensitive hot paths (see the routing pattern below)
Use a fast non-reasoning model for the hot path, a reasoning model for the slow lane:
def handle_user_question(question):
    # Fast classifier — Haiku, no reasoning, < 1 second
    category = classify_with_haiku(question)
    if category == "simple_lookup":
        # Fast path: Sonnet, no reasoning
        return answer_with_sonnet(question)
    elif category == "complex_analysis":
        # Slow path: Opus with extended thinking
        return answer_with_reasoning(question)
This gives you reasoning quality where it matters and snappy responses where it doesn't.
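The three helpers above are app-specific stand-ins. For completeness, here's one way they could look; the bodies are a sketch under the same assumptions as the earlier examples, not a prescribed implementation:

import anthropic

client = anthropic.Anthropic()

def classify_with_haiku(question):
    # Cheap, fast classification: ask for a single category label.
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"Classify this question as 'simple_lookup' or 'complex_analysis'. Reply with only the label.\n\n{question}",
        }],
    )
    return response.content[0].text.strip()

def answer_with_sonnet(question):
    # Fast path: no thinking, modest token budget.
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

def answer_with_reasoning(question):
    # Slow lane: Opus with extended thinking, as shown earlier.
    response = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=[{"role": "user", "content": question}],
    )
    # Return only the text blocks; thinking blocks are for inspection/logging.
    return "".join(b.text for b in response.content if b.type == "text")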
Thinking tokens are billed as output tokens, and on a top-tier model output tokens are the expensive ones. So a single complex reasoning call can cost $0.50–$5. Multiply by users. Plan accordingly.
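For budgeting, a back-of-envelope calculator is enough. The per-million-token prices below are placeholders for illustration; check the current pricing page before relying on them:

def estimate_cost(input_tokens, thinking_tokens, output_tokens,
                  input_price=15.0, output_price=75.0):
    # Prices are placeholder USD per million tokens (Opus-class assumption).
    # Thinking tokens are billed at the output rate.
    billed_output = thinking_tokens + output_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000

# e.g. a 2k-token prompt, 10k thinking tokens, 2k-token answer
print(f"${estimate_cost(2_000, 10_000, 2_000):.2f}")  # -> $0.93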
Next tutorial: how to actually wire all this into a Django view — the Claude API, streaming, caching, and error handling.