JSON Schema or a plain-language prompt: which to hand the extractor

Plain-language prompt or JSON Schema for AI extraction? You always send a prompt; an output_schema optionally pins field names and types. Here is when to add one.

Prompt only vs prompt + output_schema
| Axis | Prompt only | Prompt + output_schema |
| --- | --- | --- |
| Setup effort | One sentence | Write a JSON Schema too |
| Output guarantee | Shape is the model's choice | Validated, with one repair pass |
| Field names & types | Model infers from your wording | You fix them |
| On a non-conforming result | You check it yourself | Call fails instead of returning bad data |
| Best for | Prototyping, one-offs, exploring | Pipelines, typed downstream code |

A SmartScraper request always carries a user_prompt. That is the "prompt" people mean when they ask whether to "use a prompt or a schema," but the framing hides the real choice. You are never picking one or the other. You are deciding whether to add an output_schema on top of the prompt you were always going to send.

This post is about that decision: what the prompt does, what the schema adds, and when the extra effort of writing one pays for itself.

What is the difference between a prompt and an output_schema?

The user_prompt is a plain-language instruction: it tells the engine what to find and how to read ambiguous fields, like which price to take when a page shows two. The output_schema is an optional JSON Schema that validates the result's field names and types after extraction. The prompt does the finding. The schema does the enforcing.

When is a plain-language prompt enough?

Reach for a prompt alone when speed to first result matters more than a fixed shape. Prototyping, a one-off pull, or exploring a page you have not scraped before are all cases where you do not yet know the field names you want, so pinning them is premature. A single sentence gets you a JSON result you can read and iterate on:

```bash
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price, star rating, and stock status"
  }'
```

A prompt-only call runs no schema validation. The result is shaped by your wording, so the model chooses the keys and types. If the shape matters downstream, that is your job to check, which is exactly what the schema automates.
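
Checking a prompt-only result yourself might look like the sketch below. Everything in it is an assumption for illustration: the expected keys (`title`, `price`, `in_stock`), the helper name `shape_problems`, and the sample payload, which is made up rather than real API output.

```python
from numbers import Number

# Hypothetical expectations: the keys and types downstream code relies on.
EXPECTED = {"title": str, "price": Number, "in_stock": bool}


def shape_problems(result):
    """Return a list of readable problems; an empty list means the shape is fine."""
    problems = []
    for key, expected_type in EXPECTED.items():
        if key not in result:
            problems.append(f"missing key: {key}")
            continue
        value = result[key]
        # bool is a subclass of int in Python, so reject it where a number is expected.
        if expected_type is Number and isinstance(value, bool):
            problems.append(f"{key}: expected a number, got bool")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems


# Made-up sample: the model chose a string price and a differently named stock field.
sample = {"title": "A Light in the Attic", "price": "£51.77", "availability": "In stock"}
print(shape_problems(sample))
```

Every field you add to a check like this is one more thing to maintain by hand, which is the cost an output_schema removes.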

When should you add a JSON Schema?

Add an output_schema when the result feeds code that breaks if a field is renamed, missing, or the wrong type. The schema fixes the contract and turns on validation: the result is checked against it, with one repair pass if the first attempt does not conform.

```bash
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price, star rating, and stock status",
    "output_schema": {
      "type": "object",
      "properties": {
        "title":    { "type": "string" },
        "price":    { "type": "number" },
        "rating":   { "type": "number" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["title", "price", "in_stock"]
    }
  }'
```

Now price is guaranteed to arrive as a number or the call fails. On a SmartScraper request, a result that still does not match after the repair pass comes back as a validation_failed error rather than malformed data, and a failed call costs nothing. (A Blueprint run handles the same mismatch differently: it reports validation_warnings and keeps its status completed, because page drift should not block a working blueprint.)
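
To make that guarantee concrete, here is a minimal sketch of the kind of check a schema switches on. It is an illustration, not the service's validator: `conforms` is a hypothetical helper that understands only `type`, `properties`, and `required`, and the two sample results are made up.

```python
# Maps JSON Schema type names to Python types (only the subset used here).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def conforms(result, schema):
    """Check result against a schema using only type, properties, and required."""
    # bool is a subclass of int in Python, so exclude it from "number" explicitly.
    if schema["type"] == "number" and isinstance(result, bool):
        return False
    if not isinstance(result, JSON_TYPES[schema["type"]]):
        return False
    if schema["type"] == "object":
        if any(key not in result for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            # Optional fields are only checked when they are present.
            if key in result and not conforms(result[key], subschema):
                return False
    return True


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price", "in_stock"],
}

# Illustrative results, not real API output.
good = {"title": "A Light in the Attic", "price": 51.77, "in_stock": True}
bad = {"title": "A Light in the Attic", "price": "£51.77", "in_stock": True}

print(conforms(good, schema))  # True
print(conforms(bad, schema))   # False: price is a string, not a number
```

Note that `good` passes without a rating, because rating is typed but not listed under required.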

When to reach for each

  • A prompt alone when you are prototyping, running a one-off, or do not yet know the field names you want. Fastest path to a result you can eyeball.
  • A prompt plus an output_schema when the output flows into a pipeline, a database column, or typed code. The schema is the contract that keeps a model step from quietly changing the shape on you.
  • A prompt with a loose schema when you want validation on the fields you care about without over-constraining the rest: type the two or three that downstream code depends on, mark them required, and leave the others to the prompt.
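
The loose-schema option in the last bullet can be sketched as a small builder. `loose_schema` is a hypothetical helper, and the two field names are only examples. This post does not spell out how fields the schema leaves unmentioned are returned, so confirm that behavior before relying on it.

```python
import json


def loose_schema(typed_fields):
    """Build an object schema that types and requires only the given fields."""
    return {
        "type": "object",
        "properties": {
            name: {"type": json_type} for name, json_type in typed_fields.items()
        },
        "required": list(typed_fields),
    }


# Type only the fields downstream code depends on; everything else stays with the prompt.
schema = loose_schema({"title": "string", "price": "number"})
print(json.dumps(schema, indent=2))
```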

Honest limitations

A schema guarantees the shape of the output, not the truth of the value. It will confirm that price is a number; it will not catch that the model read the crossed-out list price instead of the sale price. That is a prompt-quality problem or a job for a pinned Blueprint, not something validation can see.

An over-tight schema can also work against you. Mark a field required that a page legitimately omits, like a sale price on a full-price product, and you turn a normal absence into a failed call. Type the fields downstream code truly depends on, and let the prompt carry the rest.
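
A short sketch of that failure mode, under stated assumptions: `sale_price` is a hypothetical field, `missing_required` is an illustrative helper rather than part of any API, and the sample result is made up.

```python
def missing_required(result, schema):
    """List the required fields that the result does not contain."""
    return [key for key in schema.get("required", []) if key not in result]


properties = {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "sale_price": {"type": "number"},  # hypothetical; absent on full-price products
}
too_tight = {
    "type": "object",
    "properties": properties,
    "required": ["title", "price", "sale_price"],
}
relaxed = {
    "type": "object",
    "properties": properties,
    "required": ["title", "price"],
}

# Made-up result for a full-price product: the page shows no sale price.
full_price_item = {"title": "A Light in the Attic", "price": 51.77}

print(missing_required(full_price_item, too_tight))  # ['sale_price']
print(missing_required(full_price_item, relaxed))    # []
```

Keeping `sale_price` typed but out of required still checks its type whenever it appears, without failing the pages that legitimately lack it.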

A prompt-only call, for its part, gives you no safety net. It is the right default while you are exploring and the wrong one the moment the result is load-bearing. The progression most pipelines follow is a prompt to learn the page, then a schema to lock it, then a Blueprint once the page is stable. That progression is covered in "why structured extraction beats CSS selectors."

Try it

Grab a free key and run the same page twice: once with a prompt alone, once with the schema above. Diff the two results and you will see exactly what the schema is buying you. A failed validation costs nothing, so the only price of an over-tight schema is the call you have to rerun after loosening it.
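
One way to diff the two runs is to compare keys and value types rather than raw text. The sketch below assumes you have already parsed both results into Python dicts. The two dicts shown are made-up stand-ins, and `shape_diff` is a hypothetical helper.

```python
def describe(result):
    """Map each key to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in result.items()}


def shape_diff(first, second):
    """Report keys unique to each result and keys whose value types differ."""
    shape_a, shape_b = describe(first), describe(second)
    return {
        "only_in_first": sorted(shape_a.keys() - shape_b.keys()),
        "only_in_second": sorted(shape_b.keys() - shape_a.keys()),
        "type_changed": {
            key: (shape_a[key], shape_b[key])
            for key in sorted(shape_a.keys() & shape_b.keys())
            if shape_a[key] != shape_b[key]
        },
    }


# Made-up results standing in for the two runs; real output will differ.
prompt_only = {"title": "A Light in the Attic", "price": "£51.77", "stock_status": "In stock"}
with_schema = {"title": "A Light in the Attic", "price": 51.77, "in_stock": True}

print(shape_diff(prompt_only, with_schema))
```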

Frequently asked questions

Should I use a prompt or a JSON Schema for extraction?

Use both, in layers. Every SmartScraper call needs a user_prompt, so you always describe what you want in plain language. An output_schema is optional and sits on top: it validates the field names and types of the result. Start with a prompt, add a schema once the output feeds code that depends on its shape.

Is output_schema required on a SmartScraper request?

No. user_prompt is required and output_schema is optional. A prompt-only call returns JSON shaped by your wording, with no schema enforcement. Adding an output_schema turns on validation: the result is checked against the schema, with one repair pass, and the call fails rather than returning data that does not match.

What happens if the extracted result does not match my output_schema?

The engine gets one automatic repair pass to bring the result into line. If it still does not conform, the SmartScraper call returns a validation_failed error rather than handing back malformed data, and a failed call costs nothing. Blueprint runs treat the same mismatch as soft: they report validation_warnings and keep their status completed.

Does a JSON Schema make AI extraction deterministic?

No. It guarantees the shape, not the value. A schema validates that price is a number and that required fields are present; it does not prove the model picked the right number. For run-to-run repeatability on a stable page, save the extraction as a Blueprint and replay selectors with no AI step in the loop.

Can I use a prompt and an output_schema together?

Yes, and it is the common case. The prompt tells the engine what to look for and how to interpret ambiguous fields; the schema fixes the output's names and types and turns on validation. The prompt does the finding, the schema does the enforcing. Neither replaces the other on a SmartScraper call.

When is a plain-language prompt enough on its own?

When you are prototyping, running a one-off, or exploring an unfamiliar page and the exact field names do not matter yet. A prompt alone is the fastest way to see what a page gives up. Add a schema later, once the result flows into code that breaks if a field is renamed or comes back the wrong type.

How do I write the output_schema?

It is a standard JSON Schema. Use an object with typed properties for a single record, or an array of objects for a list of rows, and list the fields you consider mandatory under required. The schema is sanity-checked for unsupported keywords before extraction runs, so a malformed schema is caught early.
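
As a rough sketch of the two shapes, the snippet below builds a single-record schema and wraps it in an array schema using the standard JSON Schema `items` keyword. The field names are examples only. Which keywords the service's pre-extraction check accepts is not listed here, so treat anything beyond `type`, `properties`, `required`, and `items` as something to verify.

```python
import json

# One row: the same typed-object pattern used for a single record.
book = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["title", "price"],
}

# A list of rows: an array whose items must each match the row schema.
book_list = {"type": "array", "items": book}

# Serialize to the JSON you would place under output_schema in the request body.
print(json.dumps(book_list, indent=2))
```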