Extraction is the use case where the output spec lever from the fundamentals does the heavy lifting. Most failures in production extraction come from a vague schema, not a bad model.
1. Always start with the schema
Define what you want before you write the prompt. A reasonable schema for extracting action items from a meeting:
{
"action_items": [
{
"owner": "string | null", // person responsible, or null if unstated
"task": "string", // short description
"deadline": "string | null", // ISO date, free-text, or null
"confidence": "high | medium | low"
}
]
}
A complete schema answers three questions: which fields exist, what types they hold, and what happens when a value is missing. If you cannot answer those, your downstream parser will fail.
2. Use the provider's structured-output mode
OpenAI, Anthropic, Google, and most local-model runtimes now support schema-constrained output. Use it:
# OpenAI-style
response = openai.chat.completions.create(
model="gpt-4o",
response_format={"type": "json_schema", "json_schema": {...}},
messages=[...]
)
The model is mechanically constrained to emit valid JSON matching the schema. No regex, no fallback parser, no "what if the model says 'Here is the JSON:'" anxiety.
3. One worked example beats five paragraphs of rules
Even with a schema, one example pins down the gray areas:
Example input:
"...so I'll talk to legal next week about the new clauses..."
Example output:
{
"action_items": [
{
"owner": "speaker",
"task": "talk to legal about new clauses",
"deadline": "next week",
"confidence": "high"
}
]
}
This example teaches the model what "owner = speaker" means and that "next week" is acceptable as a deadline value rather than being normalized to a date.
4. Be explicit about missing values
The most common production bug in extraction is the model saying "deadline": "none mentioned" instead of "deadline": null. Both are JSON-valid; only one is what your code expects.
Spell it out in the prompt:
For any field where the source does not provide a value:
- Set the field to null. Do NOT use "none", "unknown", or empty string.
- Do not invent or infer values from context.
5. Validate after, always
Even with structured output, validate. Use a library (Pydantic, Zod) to parse the response and fail loudly on schema violations. Treat extraction output as untrusted input to the rest of your system, because it is.
A robust extraction setup is 80% schema, 15% example, 5% prompt wording. The phrasing matters least. If schema-first thinking is unfamiliar territory, it pays to learn the concept first so the schema, not the wording, becomes your default starting point.
asking for json and getting markdown-wrapped json back is the bane of my existence. the structured output tips here actually help, especially the bit about giving it the schema up front
give it the schema up front is the one that cut my retry rate in half. stop hoping it guesses the exact shape you want and just tell it