How to Prompt an LLM for Data Extraction

Article summary

Extraction tasks ride on output spec, not phrasing. Give the model a JSON schema or one canonical example, define exactly how missing values appear (null vs. omit vs. "unknown"), and use the provider's JSON or structured-output mode so the response is mechanically parseable. Free-form prompts that "look like JSON" fail in production.

Extraction is the use case where the output spec lever from the fundamentals does the heavy lifting. Most failures in production extraction come from a vague schema, not a bad model.

1. Always start with the schema

Define what you want before you write the prompt. A reasonable schema for extracting action items from a meeting:

{
  "action_items": [
    {
      "owner": "string | null",   // person responsible, or null if unstated
      "task": "string",            // short description
      "deadline": "string | null", // ISO date, free-text, or null
      "confidence": "high | medium | low"
    }
  ]
}

A complete schema answers three questions: which fields exist, what types they hold, and what happens when a value is missing. If you cannot answer those, your downstream parser will fail.

2. Use the provider's structured-output mode

OpenAI, Anthropic, Google, and most local-model runtimes now support schema-constrained output. Use it:

# OpenAI-style
response = openai.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_schema", "json_schema": {...}},
    messages=[...]
)

The model is mechanically constrained to emit valid JSON matching the schema. No regex, no fallback parser, no "what if the model says 'Here is the JSON:'" anxiety.

3. One worked example beats five paragraphs of rules

Even with a schema, one example pins down the gray areas:

Example input:
"...so I'll talk to legal next week about the new clauses..."

Example output:
{
  "action_items": [
    {
      "owner": "speaker",
      "task": "talk to legal about new clauses",
      "deadline": "next week",
      "confidence": "high"
    }
  ]
}

This example teaches the model what "owner = speaker" means and that "next week" is acceptable as a deadline value rather than being normalized to a date.

4. Be explicit about missing values

The most common production bug in extraction is the model saying "deadline": "none mentioned" instead of "deadline": null. Both are JSON-valid; only one is what your code expects.

Spell it out in the prompt:

For any field where the source does not provide a value:
- Set the field to null. Do NOT use "none", "unknown", or empty string.
- Do not invent or infer values from context.

5. Validate after, always

Even with structured output, validate. Use a library (Pydantic, Zod) to parse the response and fail loudly on schema violations. Treat extraction output as untrusted input to the rest of your system, because it is.

A robust extraction setup is 80% schema, 15% example, 5% prompt wording. The phrasing matters least.

Frequently asked questions

JSON mode or natural language?

JSON mode, always, for production extraction. Frontier models support strict schemas; let them. Natural-language "respond in JSON" prompts fail at the worst times — usually when the input has a quote or a special character. Use the mode.

How many few-shot examples?

Usually one good one is enough if the schema is clear. Use 2–3 when there is an edge case the schema cannot express ("an empty string field means we tried and there was no value; a null means we did not look"). Past 3 examples you have probably under-specified the schema.

How do I handle multilingual or noisy input?

Tell the model to extract regardless of source language and to return values in the source language unless told otherwise. Tell it explicitly what to do with garbage ("if the field is unreadable, return null and add a "_notes" field with the issue"). Plan for unreadable input; do not pretend it cannot happen.