JSON Schema or a plain-language prompt: which to hand the extractor
Plain-language prompt or JSON Schema for AI extraction? You always send a prompt; an output_schema optionally pins field names and types. When to add one.
| Axis | Prompt only | Prompt + output_schema |
|---|---|---|
| Setup effort | One sentence | Write a JSON Schema too |
| Output guarantee | Shape is the model's choice | Validated, with one repair pass |
| Field names & types | Model infers from your wording | You fix them |
| On a non-conforming result | You check it yourself | Call fails instead of returning bad data |
| Best for | Prototyping, one-offs, exploring | Pipelines, typed downstream code |
A SmartScraper request always carries a user_prompt. So the usual question,
whether to "use a prompt or a schema," hides the real choice. You are never
picking one or the other. You are deciding whether to add an output_schema on
top of the prompt you were always going to send.
This post is about that decision: what the prompt does, what the schema adds, and when the extra effort of writing one pays for itself.
What is the difference between a prompt and an output_schema?
The user_prompt is a plain-language instruction: it tells the engine what to
find and how to read ambiguous fields, like which price to take when a page shows
two. The output_schema is an optional JSON Schema that validates the result's
field names and types after extraction. The prompt does the finding. The schema
does the enforcing.
When is a plain-language prompt enough?
Reach for a prompt alone when speed to first result matters more than a fixed shape. Prototyping, a one-off pull, or exploring a page you have not scraped before are all cases where you do not yet know the field names you want, so pinning them is premature. A single sentence gets you a JSON result you can read and iterate on:
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price, star rating, and stock status"
  }'
A prompt-only call runs no schema validation. The result is shaped by your wording, so the model chooses the keys and types. If the shape matters downstream, that is your job to check, which is exactly what the schema automates.
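If you stay prompt-only and the shape starts to matter, the check you end up writing by hand looks roughly like the sketch below. The `result` dict and the `expected` mapping are hypothetical, standing in for whatever a prompt-only call happened to return and whatever your code expects; none of it is part of the API.

```python
# Hypothetical prompt-only result: the model chose these keys and types.
result = {
    "title": "A Light in the Attic",
    "price": "£51.77",          # arrived as a string, not a number
    "star_rating": 3,
    "availability": "In stock",
}

# The shape downstream code expects (an assumption for this sketch).
expected = {"title": str, "price": float, "in_stock": bool}


def shape_problems(record: dict, expected_types: dict) -> list[str]:
    """Return human-readable mismatches between record and expected_types."""
    problems = []
    for key, typ in expected_types.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], typ):
            got = type(record[key]).__name__
            problems.append(f"{key}: expected {typ.__name__}, got {got}")
    return problems


problems = shape_problems(result, expected)
for problem in problems:
    print(problem)
```

Every field you add to `expected` is one more line of checking you maintain yourself.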
When should you add a JSON Schema?
Add an output_schema when the result feeds code that breaks if a field is
renamed, missing, or the wrong type. The schema fixes the contract and turns on
validation: the result is checked against it, with one repair pass if the first
attempt does not conform.
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price, star rating, and stock status",
    "output_schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "price": { "type": "number" },
        "rating": { "type": "number" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["title", "price", "in_stock"]
    }
  }'
Now price is guaranteed to arrive as a number or the call fails. On a
SmartScraper request, a result that still does not match after the repair pass
comes back as a validation_failed error rather than malformed data, and a failed
call costs nothing. (A Blueprint run handles the same mismatch differently: it
reports validation_warnings and keeps its status completed, because page drift
should not block a working blueprint.)
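On the calling side, the two behaviors suggest two different handlers: one that stops on a hard failure, one that passes data through and surfaces warnings. The sketch below is only that, a sketch. The response envelopes are assumptions: this post names validation_failed, validation_warnings, and the completed status, but not where they sit in the response body, so the `error`, `code`, `message`, and `result` keys are hypothetical placeholders to swap for the real format.

```python
class ExtractionRejected(Exception):
    """The API reported that the result never matched output_schema."""


def unwrap_smartscraper(body: dict) -> dict:
    """Return the extracted record, or raise if validation failed."""
    # ASSUMPTION: errors look like {"error": {"code": ..., "message": ...}}
    # and data sits under "result". Check the real response format first.
    error = body.get("error") or {}
    if error.get("code") == "validation_failed":
        raise ExtractionRejected(
            error.get("message", "result did not match output_schema")
        )
    return body["result"]


def unwrap_blueprint_run(body: dict) -> tuple[dict, list]:
    """Return (record, warnings); a completed run may still carry warnings."""
    # ASSUMPTION: "status", "result", "validation_warnings" are top-level keys.
    if body.get("status") != "completed":
        raise RuntimeError(f"run did not complete: {body.get('status')}")
    return body["result"], body.get("validation_warnings", [])


# Hypothetical response bodies, for illustration only.
failed = {"error": {"code": "validation_failed", "message": "price is not a number"}}
drifted = {
    "status": "completed",
    "result": {"title": "A Light in the Attic", "price": None},
    "validation_warnings": ["price: expected number"],
}

try:
    unwrap_smartscraper(failed)
except ExtractionRejected as exc:
    print("SmartScraper rejected the result:", exc)

record, run_warnings = unwrap_blueprint_run(drifted)
print("Blueprint completed with warnings:", run_warnings)
```

The design point is the asymmetry: the SmartScraper path raises so bad data never reaches your code, while the Blueprint path returns data and leaves the decision about the warnings to you.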
Reach for each when
- A prompt alone when you are prototyping, running a one-off, or do not yet know the field names you want. Fastest path to a result you can eyeball.
- A prompt plus an output_schema when the output flows into a pipeline, a database column, or typed code. The schema is the contract that keeps a model step from quietly changing the shape on you.
- A prompt with a loose schema when you want validation on the fields you care about without over-constraining the rest: type the two or three that downstream code depends on, mark them required, and leave the others to the prompt.
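A loose schema for the book page might look like the sketch below, built here as a Python dict and serialized into a request body. Picking title and price as the fields downstream code depends on is an assumption made for illustration. In standard JSON Schema, properties you do not list are still allowed unless you set additionalProperties to false, which this sketch deliberately does not.

```python
import json

# Only the fields downstream code depends on are typed and required.
loose_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["title", "price"],
}

payload = {
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    # The prompt still asks for everything; the schema pins only two fields.
    "user_prompt": "Extract the book title, price, star rating, and stock status",
    "output_schema": loose_schema,
}

# This string is what would go in the request body (the curl -d argument).
print(json.dumps(payload, indent=2))
```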
Honest limitations
A schema guarantees the shape of the output, not the truth of the value. It will
confirm that price is a number; it will not catch that the model read the
crossed-out list price instead of the sale price. That is a prompt-quality problem
or a job for a pinned Blueprint, not something validation can see.
An over-tight schema can also work against you. Mark a field required that a
page legitimately omits, like a sale price on a full-price product, and you turn a
normal absence into a failed call. Type the fields downstream code truly depends
on, and let the prompt carry the rest.
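The failure mode is easy to reproduce locally. The sketch below uses a small stand-in for the required check, not the engine's validator, and a hypothetical sale_price field to show how the same full-price record fails one schema and passes the other.

```python
def missing_required(record: dict, schema: dict) -> list[str]:
    """List required fields absent from record (a local stand-in for validation)."""
    return [name for name in schema.get("required", []) if name not in record]


properties = {
    "title": {"type": "string"},
    "price": {"type": "number"},
    "sale_price": {"type": "number"},  # hypothetical field for this sketch
}

over_tight = {
    "type": "object",
    "properties": properties,
    "required": ["title", "price", "sale_price"],
}
relaxed = {
    "type": "object",
    "properties": properties,
    "required": ["title", "price"],
}

# Hypothetical result for a full-price product: there is no sale price to find.
full_price_book = {"title": "A Light in the Attic", "price": 51.77}

print(missing_required(full_price_book, over_tight))  # ['sale_price']
print(missing_required(full_price_book, relaxed))     # []
```

Keeping sale_price in properties but out of required still types it when it appears, without punishing pages where it does not.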
A prompt-only call, for its part, gives you no safety net. It is the right default while you are exploring and the wrong one the moment the result is load-bearing. The progression most pipelines follow is a prompt to learn the page, then a schema to lock it, then a Blueprint once the page is stable; that progression is covered in why structured extraction beats CSS selectors.
Try it
Grab a free key and run the same page twice: once with a prompt alone, once with the schema above. Diff the two results and you will see exactly what the schema is buying you. A failed validation costs nothing, so the only price of an over-tight schema is the call you reword.
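For the diff itself, comparing keys and value types is usually more telling than comparing values. Here is a minimal sketch; both result dicts are hypothetical examples of what the two runs might return, not captured output.

```python
def shape_of(record: dict) -> dict:
    """Map each key to the name of its value's Python type."""
    return {key: type(value).__name__ for key, value in record.items()}


def diff_shapes(first: dict, second: dict) -> dict:
    """Compare two records by keys and value types, ignoring the values."""
    a, b = shape_of(first), shape_of(second)
    shared = a.keys() & b.keys()
    return {
        "only_in_first": sorted(a.keys() - b.keys()),
        "only_in_second": sorted(b.keys() - a.keys()),
        "type_changed": {k: (a[k], b[k]) for k in sorted(shared) if a[k] != b[k]},
    }


# Hypothetical results from the two runs.
prompt_only = {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "star_rating": "Three",
    "availability": "In stock",
}
with_schema = {
    "title": "A Light in the Attic",
    "price": 51.77,
    "rating": 3,
    "in_stock": True,
}

report = diff_shapes(prompt_only, with_schema)
for section, detail in report.items():
    print(section, detail)
```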
Frequently asked questions
Should I use a prompt or a JSON Schema for extraction?
Use both, in layers. Every SmartScraper call needs a user_prompt, so you always describe what you want in plain language. An output_schema is optional and sits on top: it validates the field names and types of the result. Start with a prompt, add a schema once the output feeds code that depends on its shape.
Is output_schema required on a SmartScraper request?
No. user_prompt is required and output_schema is optional. A prompt-only call returns JSON shaped by your wording, with no schema enforcement. Adding an output_schema turns on validation: the result is checked against the schema, with one repair pass, and the call fails rather than return data that does not match.
What happens if the extracted result does not match my output_schema?
The engine gets one automatic repair pass to bring the result into line. If it still does not conform, the SmartScraper call returns a validation_failed error rather than handing back malformed data, and a failed call costs nothing. Blueprint runs treat the same mismatch as a soft validation_warnings instead, keeping their status completed.
Does a JSON Schema make AI extraction deterministic?
No. It guarantees the shape, not the value. A schema validates that price is a number and that required fields are present; it does not prove the model picked the right number. For run-to-run repeatability on a stable page, save the extraction as a Blueprint and replay selectors with no AI step in the loop.
Can I use a prompt and an output_schema together?
Yes, and it is the common case. The prompt tells the engine what to look for and how to interpret ambiguous fields; the schema fixes the output's names and types and turns on validation. The prompt does the finding, the schema does the enforcing. Neither replaces the other on a SmartScraper call.
When is a plain-language prompt enough on its own?
When you are prototyping, running a one-off, or exploring an unfamiliar page and the exact field names do not matter yet. A prompt alone is the fastest way to see what a page gives up. Add a schema later, once the result flows into code that breaks if a field is renamed or comes back the wrong type.
How do I write the output_schema?
It is a standard JSON Schema. Use an object with typed properties for a single record, or an array of objects for a list of rows, and list the fields you consider mandatory under required. The schema is sanity-checked for unsupported keywords before extraction runs, so a malformed schema is caught early.
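As a sketch of the list case, the row schema below is wrapped in an array using the standard JSON Schema items keyword. The field choice is illustrative, and which keywords the pre-extraction check accepts is not something this sketch can confirm.

```python
import json

# One row of the list: the same typed-properties pattern as a single record.
book_row = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["title", "price"],
}

# A list of rows is an array whose items each follow the row schema.
book_list_schema = {"type": "array", "items": book_row}

print(json.dumps(book_list_schema, indent=2))
```

Defining the row once and reusing it keeps the single-record and list schemas from drifting apart.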