Chain-of-thought prompting (CoT) is the practice of asking the model to produce intermediate reasoning before its final answer. The most famous version is the phrase "let's think step by step." That phrase comes from a 2022 paper (Kojima et al., "Large Language Models are Zero-Shot Reasoners") showing that appending it sharply raised math-benchmark accuracy on GPT-3 and PaLM, and it became cargo-cult prompt advice for years afterwards.

The picture in 2026 is more nuanced. CoT still matters, but the way you invoke it has changed.

Three eras of CoT

Era 1 (pre-2022): base models and early instruct-tuned models with no internalized reasoning. They would happily blurt out a wrong answer to a multi-step problem because a transformer spends roughly fixed compute per generated token, and a one-token answer leaves almost no room to work. Telling them to think step by step gave them tokens to "compute on," and accuracy on math and logic benchmarks rose substantially. CoT was a genuine free lunch.

Era 2 (2022–2024): frontier models started to internalize CoT. GPT-4-class and Claude 3-class models did better when prompted to reason, but the gap shrank. Many tasks no longer benefited because the models had been post-trained on reasoning chains during instruction tuning. CoT was still useful for novel problems but rarely changed everyday outputs.

Era 3 (2024 onward): reasoning models. OpenAI's o-series, Anthropic's extended thinking, DeepSeek's R1, Qwen3-Thinking, and several open-weight reasoning models do CoT by default, usually behind a hidden scratchpad. The user-visible answer is short; the actual reasoning is long. Telling a reasoning model to "think step by step" is now redundant at best.

What this means for your prompts today

If you are using a modern reasoning model, skip the magic phrases. The model is already doing the reasoning. Your prompt should focus on the four levers from how-to-prompt-ai-effectively, not on invoking CoT.

If you are using a fast non-reasoning model on a multi-step task, CoT is still useful, but be surgical:

  • Use it when the task genuinely has multiple steps (arithmetic, code traces, multi-hop reasoning).
  • Skip it for everyday rewriting, summarization, classification, and extraction.
  • If the user-facing output should be short, ask for reasoning first and a final answer at the end, with a clear separator the parser can split on.

Example:

    Solve the following word problem. First, in a "Working:" block, write your
    reasoning in 2–4 short bullets. Then write "Answer:" followed by the numeric
    answer only, no units, no prose.

    Problem: ...

The structure gives you something to debug when the model gets it wrong, and a clean parse target for the final answer.
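
If you are calling the model from code, that separator makes the parse trivial. A minimal sketch in Python, assuming the prompt above is used verbatim and that model_output is whatever string your client returns:

    def split_cot(model_output: str) -> tuple[str, str]:
        """Split a response into (working, answer) on the "Answer:" separator.

        Raises ValueError if the model ignored the format, so the caller
        can retry or fall back instead of silently mis-parsing.
        """
        working, sep, answer = model_output.partition("Answer:")
        if not sep:
            raise ValueError("no 'Answer:' separator in model output")
        return working.strip(), answer.strip()

    working, answer = split_cot("Working:\n- 3 + 4 = 7\nAnswer: 7")
    assert answer == "7"

Failing loudly on a missing separator is deliberate: format drift is how these pipelines usually break, and a silent mis-parse is worse than a retry.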

A common mistake: trusting the reasoning

Visible CoT is not a transparent window into the model's thinking. The trace is itself just predicted tokens: it can look like reasoning and be wrong, it can look like reasoning and be unrelated to the actual answer, and it can post-hoc rationalize an answer the model produced for other reasons. Published research has documented chain-of-thought traces that diverge from the model's actual decision process (e.g., Turpin et al., 2023, "Language Models Don't Always Say What They Think").

In practice this means:

  • Do not use a model's reasoning trace as ground truth for why it answered a certain way.
  • Do use it as a debug signal: if the visible reasoning contains an error, the answer will usually contain that error too.
  • For audit-style use cases, prefer external verification (run the math, check the citation) over reading the trace; a minimal verifier sketch follows this list.
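
For numeric tasks, that external check can be a few lines. A toy sketch, assuming the problem reduces to arithmetic you can recompute without the model:

    def verify_numeric_answer(answer: str, expected: float, tol: float = 1e-9) -> bool:
        """Check the model's parsed answer against an independently computed value.

        The point: trust the recomputation, never the reasoning trace.
        """
        try:
            return abs(float(answer) - expected) <= tol
        except ValueError:
            return False  # non-numeric output fails verification outright

    # Recompute the ground truth yourself: 3 workers * 8 hours * 2 widgets/hour.
    assert verify_numeric_answer("48", 3 * 8 * 2)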

CoT and cost

Reasoning costs tokens. With reasoning models, your bill includes the hidden reasoning tokens even though you never see them: a simple "what is 2+2" can balloon into hundreds of billed-but-invisible tokens at default settings.
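
A back-of-envelope helper makes the asymmetry visible. The prices below are placeholders, not anyone's real rates, and billing reasoning tokens at the output rate is the common pattern but not a universal rule; check your provider's docs:

    def request_cost(input_tokens: int, reasoning_tokens: int, output_tokens: int,
                     in_price_per_mtok: float, out_price_per_mtok: float) -> float:
        """Estimate one request's cost in dollars.

        Assumes hidden reasoning tokens are billed at the output rate,
        which is typical but not guaranteed across providers.
        """
        return (input_tokens * in_price_per_mtok
                + (reasoning_tokens + output_tokens) * out_price_per_mtok) / 1_000_000

    # "What is 2+2": ~10 prompt tokens, a 1-token answer, and perhaps
    # 300 hidden reasoning tokens at default settings (placeholder prices).
    print(request_cost(10, 300, 1, in_price_per_mtok=3.0, out_price_per_mtok=15.0))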

Two practical knobs:

  1. Pick the right model. Use fast non-reasoning models for simple tasks; reserve reasoning models for problems that earn the cost.
  2. Cap the thinking budget. Most reasoning APIs now expose a "thinking budget" or "reasoning effort" parameter. Use the lowest setting that still solves your task; a sketch follows this list.
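
As a concrete instance of knob 2, here is roughly what a capped budget looks like with Anthropic's extended-thinking API at the time of writing. The model id and budget value are placeholders, and other providers expose the same idea under names like reasoning_effort:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: use a current model id
        max_tokens=2048,                   # must exceed the thinking budget
        # Cap hidden reasoning at 1024 tokens; find the lowest budget that
        # still solves your task by testing, then stay there.
        thinking={"type": "enabled", "budget_tokens": 1024},
        messages=[{"role": "user", "content": "..."}],
    )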

CoT was the cleverest prompting trick of the early LLM era, and it still has a place. Just use it where it earns its keep, and trust the modern models to handle it for you when they can.