Should I review every line the AI writes?

Every line of substance, yes. Every comment, indent change, and rename, no. Treat AI output like a PR from a competent but unfamiliar contributor: focus the review on the logic, the architecture, the edge cases, the security implications. Skim the boilerplate. The discipline that protects against bad AI-written code is the same discipline that protects against bad human-written code.

Are tests enough to validate AI-written code?

No. Tests catch the cases you wrote tests for; they miss the cases you did not think to test, which are the cases AI-written code most commonly fails on. Tests are a safety net, not a validation. Read the diff. Run the code on real inputs. Check the obvious edge cases manually before merging.

How big a task should I give an AI tool at once?

Smaller than feels efficient. A task that takes 10–60 minutes of focused human work is usually the right size. Bigger than that and review becomes too coarse: you cannot realistically tell whether the model got the right answer because there is too much code to check. Smaller than that and the overhead of the AI loop dominates.

AI Coding Workflows That Actually Work

By 2026 the productivity research is starting to settle. The MIT-Princeton-Penn study of ~4,800 Copilot users at Microsoft, Accenture, and another Fortune 100 company found roughly 26% more tasks completed on average, with code commits up about 13.5% and compilation frequency up about 38.4% after Copilot was introduced. A separate study of 150 developers found AI users reached a first solution about 30% faster on average, with "habitual" users about 55% faster. Same direction, consistent magnitudes.

Notably, the studies also found that AI assistance is an amplifier of existing good practice, not a replacement for it. The teams that get the most out of these tools are the ones already shipping good code; the teams that ship bad code get a slightly faster cycle of bad code. The workflow matters as much as the tool.

This article is about the workflow.

The pattern that actually works

Across tools — Copilot, Cursor, Claude Code, Aider — the dominant workflow in serious teams looks the same:

1. Human decomposes the problem into small steps. A change unit you can reason about in your head; a task you could ship by yourself in 10–60 minutes.
2. AI suggests code for the unit: inline completion, chat-driven edit, or agentic execution depending on the tool.
3. Human inspects the diff in the editor, against expectations and the existing code style.
4. Tests or quick-run. Confirm behavior on at least the obvious cases. Iterate if anything is wrong.
5. Commit, repeat. Each commit is a coherent unit you understand.

The whole loop is a few minutes per cycle. Most working pairs do dozens of these per session. The keys are: small enough that review is fast, tight enough that the AI's mistakes show up before they compound, and visible enough that every change passes through a human's eyes.

The four working patterns

Within that overall shape, four patterns show up repeatedly:

Inline completion + micro-edits. Use the tool's inline completion for boilerplate, looping code, API call patterns, test scaffolding. Use chat-driven edits for slightly larger asks ("add a function to handle X," "translate this logic to TypeScript"). Always inspect the diff; never commit a "feels right" suggestion without reading it. This is the bread-and-butter pattern for IDE-based tools like Cursor and Copilot.

Test-first, then implement. Write the tests (or have the AI write them) for the behavior you want. Then have the AI implement against the tests. The test set acts as a specification the AI has to satisfy. Works well for self-contained functions, less well for code where most of the difficulty is in the integration.
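
As a minimal sketch of this pattern, assume a hypothetical `slugify` helper (it is not part of any tool named in this article). The test class is what you would write first and hand to the AI as the contract; the function above it is one candidate implementation that satisfies it.

```python
import re
import unittest


def slugify(title: str) -> str:
    """Hypothetical helper: turn a title into a lowercase, hyphen-separated slug."""
    # Replace every run of non-alphanumeric characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop hyphens left over at either end.
    return slug.strip("-")


class SlugifySpec(unittest.TestCase):
    """Written before the implementation; each test is one line of the spec."""

    def test_lowercases_and_hyphenates(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

    def test_collapses_punctuation_runs(self):
        self.assertEqual(
            slugify("AI: Workflows -- That Work!"), "ai-workflows-that-work"
        )

    def test_empty_input_gives_empty_slug(self):
        self.assertEqual(slugify(""), "")
```

Saved as a module, this runs with `python -m unittest`. The spec is only as good as its cases: anything the tests do not mention (non-ASCII titles, for example) is left to the model's defaults, which is why the diff still needs reading.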

Agentic batch tasks. Pick a task that has a clear shape and minimal ambiguity ("migrate every callsite of the old API," "add the new field to all the models that need it"), give it to a terminal agent (Claude Code, Aider, Cursor agent mode), come back when it is done. Review the resulting set of commits. This is where the terminal-native tools earn their keep — they handle the kind of mechanical-but-tedious work that humans hate and AI is good at.
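
One way to review a batch migration is to check mechanically that no callsite of the old API survived. The sketch below assumes Python sources and a hypothetical old function name, `fetch_user_v1`. It uses the standard `ast` module and only catches direct calls by name or attribute, not aliased imports or dynamic lookups.

```python
import ast


def find_calls(source: str, func_name: str) -> list[int]:
    """Return the line numbers where `func_name` is called in `source`."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        target = node.func
        # Plain call: fetch_user_v1(...)
        if isinstance(target, ast.Name) and target.id == func_name:
            lines.append(node.lineno)
        # Attribute call: api.fetch_user_v1(...)
        elif isinstance(target, ast.Attribute) and target.attr == func_name:
            lines.append(node.lineno)
    return sorted(lines)


# Hypothetical file contents after the agent's migration pass.
migrated = """
user = fetch_user_v2(user_id)
legacy = api.fetch_user_v1(user_id)
"""

print(find_calls(migrated, "fetch_user_v1"))  # prints [3]
```

A non-empty result means the agent missed a callsite. A check like this complements reading the commits; it does not replace it.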

Code review as a sanity check. Use AI to review your own code before opening a PR. Have it look for bugs, missing edge cases, style inconsistencies. Surprisingly useful, especially for finding the bug you would have asked the AI to fix anyway if you had spotted it.

The anti-patterns

These are the patterns that quietly degrade code quality, in roughly increasing order of damage:

Accepting on green. The tests pass, so you merge. AI-written code often passes the tests because it pattern-matched to test-passing, not because it solved the actual problem. Read the diff. The bug is usually in the part that the test did not cover.
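
A small illustration of how a green suite can hide a wrong implementation. The `median` function and its tests are hypothetical, written for this example. The tests only exercise odd-length input, so the even-length bug goes unnoticed.

```python
def median(values: list[float]) -> float:
    """Buggy on purpose: correct for odd-length input, wrong for even-length input."""
    ordered = sorted(values)
    # Picks the upper-middle element; an even-length list needs the mean
    # of the two middle elements instead.
    return ordered[len(ordered) // 2]


# The tests that were written: both pass.
assert median([3, 1, 2]) == 2
assert median([5]) == 5

# The case nobody tested: the median of [1, 2, 3, 4] is 2.5.
print(median([1, 2, 3, 4]))  # prints 3
```

Reading the diff would surface the missing even-length branch in seconds; the test run never will.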

Big-bang autonomous tasks. Give the agent a vague large goal, walk away for an hour, find a sprawling 2,000-line diff that mostly works but is impossible to review honestly. You will either merge and discover the issues in production or revert and redo. Either way you have wasted the agent's time. Decompose.
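
A diff-size check can make the "decompose" rule harder to ignore. This sketch parses the text that `git diff --numstat` prints (added count, deleted count, and path separated by tabs, with `-` in place of counts for binary files) and flags oversized changes. The 400-line threshold is an arbitrary assumption, not a figure from this article.

```python
def changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        # Binary files report "-" instead of counts; ignore them here.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def too_big_to_review(numstat: str, limit: int = 400) -> bool:
    """`limit` is an assumed threshold; tune it to what you can review honestly."""
    return changed_lines(numstat) > limit


# Hypothetical numstat output for an agent-produced change.
sample = "120\t30\tsrc/models.py\n-\t-\tassets/logo.png\n1900\t45\tsrc/api.py\n"
print(changed_lines(sample), too_big_to_review(sample))  # prints 2095 True
```

In practice the input could come from `subprocess.run(["git", "diff", "--numstat"], capture_output=True, text=True).stdout`. Line count is a crude proxy for review effort, so treat a failed check as a prompt to split the task, not as a verdict on the code.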

Over-prompting. Writing a 600-word prompt to specify a 20-line change. The cognitive load of specifying everything dwarfs the time saved by not writing the code yourself. Smaller prompts on smaller units beat heroic specifications.

Letting the AI choose the architecture. AI tools are not great at picking the right abstraction for a long-lived codebase. They are great at filling in the details once the architecture exists. If you do not have an opinion on how the code should be structured, the model's defaults will quietly accumulate into something hard to maintain.

Treating the AI as authoritative. The model will confidently produce wrong code. It will confidently insist that wrong code is correct when challenged. The protection against this is engineering judgment, not the model. Verify; do not trust the output as more than a useful first draft.

Workflows for specific shapes of work

Greenfield code. AI is at its best here. The constraints are loose, the existing code is small, and the model can produce a reasonable starting structure quickly. Use the "agentic batch" pattern: spec the module, let the tool draft the skeleton, then switch to inline-edit mode for the details.

Bug fixing. Inline iteration wins. Give the tool the failing test and the relevant file; ask for the fix. Inspect the diff carefully — bug fixes are exactly where AI's confident-but-wrong failure mode is most expensive. Run the test, then run adjacent tests, then run the code on a hand-crafted edge case.

Refactoring. Mechanical refactors (renames, API migrations, signature changes) are perfect for terminal agents. Larger architectural refactors require a human to lead; AI is good at executing the steps once the plan exists, not at designing the plan.

PR review. AI as a code reviewer is genuinely useful — it catches style inconsistencies, obvious bugs, and missing edge cases. Pair it with human review; do not replace human review with it.

Reading unfamiliar code. Underrated. Use AI to explain code you do not know — paste the file, ask "what does this do, and what are the non-obvious parts?" Faster than reading line-by-line for first-pass comprehension. (Verify the explanation before relying on it.)

What is the single highest-leverage habit

If you adopt only one thing from this article: read every diff before you commit.

It sounds obvious. It is also the discipline that quietly disappears once teams get comfortable with AI tools. The slope from "I read every line" to "I skim the diff" to "I trust the tests" is gentle and easy to walk down. Walking back up takes a deliberate effort.

The teams that maintain code quality with AI tooling are the ones where every commit still passes a human's eyes. The teams that lose code quality are the ones where the loop got fast enough that human review slipped out.

Comments 2

u/sam_88 · 1 month ago

the part about writing the test first then letting the model fill it in is what finally made AI coding click for me. before that it was just fancy autocomplete that i didn't trust

u/tdd_ai · 1 month ago

tests as the spec is the real unlock. the model fills a clear contract so much better than a vague 'just make it work' prompt
