SmartScraper - webscrape.ai

POST /v1/smartscraper gives you structured JSON from any URL. You always send a user_prompt. Add an output_schema if you want the result validated into a fixed shape. The minimum call is in the API reference; this page is the deep dive — prompt and schema design, page complexity, and what to do when validation fails.

Prompt first, schema second

The user_prompt is required — that’s where the intent goes. output_schema is optional and pins down the shape. Use both together for the cleanest output. If you just want free-shape JSON, send the prompt alone.
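
As a rough sketch of the two request shapes, the helper below builds a request body with or without a schema. The name `build_payload` is hypothetical and not part of any SDK; the field names `website_url`, `user_prompt`, and `output_schema` are the ones used in the request example on this page.

```python
from typing import Any, Optional


def build_payload(website_url: str, user_prompt: str,
                  output_schema: Optional[dict] = None) -> dict[str, Any]:
    """Build a SmartScraper request body (hypothetical helper)."""
    if not user_prompt.strip():
        # The prompt is the one field you always send.
        raise ValueError("user_prompt is required")
    payload: dict[str, Any] = {"website_url": website_url, "user_prompt": user_prompt}
    if output_schema is not None:
        # Only pin the shape when the caller supplies a schema.
        payload["output_schema"] = output_schema
    return payload


# Prompt alone: free-shape JSON.
free_shape = build_payload("https://example.com/products", "List every product name.")

# Prompt plus schema: output validated into a fixed shape.
pinned = build_payload(
    "https://example.com/products",
    "List every product name.",
    {"type": "object", "properties": {"names": {"type": "array", "items": {"type": "string"}}}},
)
```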

Schema design

When you do supply an output_schema, the clearer it is, the cleaner the extraction.

Use the most specific types you can

Don’t say string if you mean integer. SmartScraper validates against your schema, so "score": {"type": "integer"} produces cleaner results than "score": {"type": "string"}, and downstream code can rely on the type without re-parsing.

{
  "type": "object",
  "properties": {
    "title":   {"type": "string"},
    "price":   {"type": "number"},
    "in_stock": {"type": "boolean"},
    "tags":    {"type": "array", "items": {"type": "string"}}
  }
}
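
To see why the specific type matters downstream, here is a minimal local check of a Python value against a JSON Schema primitive type name. It is an illustration only, not how SmartScraper validates, and `matches_type` is a hypothetical helper.

```python
def matches_type(value, json_type: str) -> bool:
    """Rough local check of a value against a JSON Schema primitive type name."""
    if json_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if json_type == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if json_type == "boolean":
        return isinstance(value, bool)
    if json_type == "string":
        return isinstance(value, str)
    if json_type == "array":
        return isinstance(value, list)
    if json_type == "object":
        return isinstance(value, dict)
    raise ValueError(f"unsupported type: {json_type}")


print(matches_type(42, "integer"))    # True
print(matches_type("42", "integer"))  # False: a numeric string is not an integer
```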

Mark required fields

Adding "required": ["title", "price"] makes validation fail when either field is missing. That is much better than silently returning null and finding out three steps later.

Nest objects for related data

Group fields that belong together. A product page schema:

{
  "type": "object",
  "properties": {
    "name":  {"type": "string"},
    "price": {
      "type": "object",
      "properties": {
        "amount":   {"type": "number"},
        "currency": {"type": "string"}
      }
    }
  }
}
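
Required lists and nested objects can be combined. The sketch below adds a "required" list at each level of the product schema (the lists are an assumption added for illustration) and uses a hypothetical helper, `missing_required`, to report which required fields a result lacks.

```python
def missing_required(data: dict, schema: dict, path: str = "") -> list[str]:
    """Return dotted paths of required fields that are absent from data."""
    missing = []
    for name in schema.get("required", []):
        if name not in data:
            missing.append(f"{path}{name}")
    # Recurse into nested objects that are actually present.
    for name, sub in schema.get("properties", {}).items():
        if sub.get("type") == "object" and isinstance(data.get(name), dict):
            missing.extend(missing_required(data[name], sub, f"{path}{name}."))
    return missing


product_schema = {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
        "name": {"type": "string"},
        "price": {
            "type": "object",
            "required": ["amount", "currency"],
            "properties": {
                "amount": {"type": "number"},
                "currency": {"type": "string"},
            },
        },
    },
}

print(missing_required({"name": "Desk lamp", "price": {"amount": 19.99}}, product_schema))
# ['price.currency']
```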

Arrays for repeating elements

For listings — articles, products, search results — wrap the repeating element in an array of objects:

{
  "type": "object",
  "properties": {
    "products": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name":  {"type": "string"},
          "price": {"type": "number"}
        }
      }
    }
  }
}
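
If you build several listing schemas, the wrapping can be factored out. This is a small sketch; `listing_schema` is a hypothetical helper that reproduces the structure shown above for any key and item shape.

```python
import json


def listing_schema(key: str, item_properties: dict) -> dict:
    """Wrap one item shape in an object holding an array of those items."""
    return {
        "type": "object",
        "properties": {
            key: {
                "type": "array",
                "items": {"type": "object", "properties": item_properties},
            }
        },
    }


products_schema = listing_schema(
    "products",
    {"name": {"type": "string"}, "price": {"type": "number"}},
)
print(json.dumps(products_schema, indent=2))
```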

Sharpen the prompt

A specific user_prompt keeps the model focused. Tell it what you want and what to ignore.

import requests

response = requests.post(
    "https://api.webscrape.ai/v1/smartscraper",
    headers={"X-API-Key": "wsg_live_..."},
    json={
        "website_url": "https://example.com/products",
        "user_prompt": "Extract only physical products available for purchase. Skip ads, sponsored listings, and 'recently viewed' items.",
        "output_schema": {},  # elided: put your JSON Schema here
    },
)

Page complexity

page_complexity controls how much effort goes into the page.

Value	When to use
`low` (default)	Most pages. Fast and cheap.
`high`	Visually busy pages, long articles, schemas with many nested fields.

Bump to high if extraction misses content you can clearly see in the browser. Stick with low everywhere else.
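
One way to follow that advice in code is to try low first and retry with high only when expected fields come back empty. The sketch below assumes you wrap the API call in a function that takes the page_complexity value; `extract_with_escalation` and `fake_extract` are hypothetical names, and the fake stands in for a real request.

```python
from typing import Callable


def extract_with_escalation(extract: Callable[[str], dict],
                            expected_keys: set[str]) -> tuple[dict, str]:
    """Try 'low' first; retry once with 'high' if an expected key is missing or empty."""
    result = extract("low")
    if all(result.get(k) not in (None, "", [], {}) for k in expected_keys):
        return result, "low"
    return extract("high"), "high"


def fake_extract(page_complexity: str) -> dict:
    # Stand-in for a real API call; pretends 'low' misses the article body.
    if page_complexity == "low":
        return {"title": "Hello", "body": ""}
    return {"title": "Hello", "body": "Full text"}


result, used = extract_with_escalation(fake_extract, {"title", "body"})
print(used)  # high
```

Keep in mind that the retry is a second call, so it is billed as one.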

Caching with `max_age`

max_age tells us whether your fetch can come from cache:

Value	Behavior
Omitted	No cache. Always a fresh extraction.
`0`	Run the call, but cache the result so the next one’s a hit.
`> 0` (seconds)	If we have a cached result newer than this, return it; otherwise extract fresh.

Three cases are never cached, regardless of max_age:

Stealth requests
Requests with custom headers
URLs with query strings or fragments

Cached responses come back much faster than a fresh extraction, so for any URL you’ll hit more than once, max_age is worth setting.
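
The three never-cached cases can be approximated client-side before you decide whether max_age is worth sending. This is a sketch under assumptions: `is_cache_eligible` and its `custom_headers` parameter are hypothetical names, and the service's own URL parsing may differ in edge cases.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_cache_eligible(url: str, stealth: bool = False,
                      custom_headers: Optional[dict] = None) -> bool:
    """Return False for the three cases that are never cached."""
    if stealth or custom_headers:
        return False
    parts = urlsplit(url)
    # URLs with a query string or a fragment are never cached.
    return not (parts.query or parts.fragment)


print(is_cache_eligible("https://example.com/products"))         # True
print(is_cache_eligible("https://example.com/products?page=2"))  # False
print(is_cache_eligible("https://example.com/products#reviews")) # False
```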

Validation failures

Schema validation runs after extraction, and we automatically take one repair pass at a failure. If the output still doesn’t match, you get a 422 Unprocessable Entity with error.code: validation_failed. When that happens, start with error.details — it tells you which field failed and why (“expected integer, got string”). From there, a few common fixes:

Loosen the type if the source is genuinely ambiguous. A price like "$19.99" may need "type": "string" when you’re keeping the currency symbol — add a pattern regex to keep validation tight.
Trim overly aggressive required lists. If a field is sometimes missing, don’t require it.
Flatten deeply nested objects if the model seems to be getting lost. Two flat fields usually beat one three-level object.
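
A minimal sketch of picking the failure details out of a response follows. It assumes the error body is JSON nested as {"error": {"code": ..., "details": ...}}, which is inferred from the dotted names above; the exact shape of details is not specified here, so the helper (hypothetical name `validation_details`) returns it untouched.

```python
from typing import Any, Optional


def validation_details(status_code: int, body: dict) -> Optional[Any]:
    """Return error.details when the response is a validation failure, else None."""
    error = body.get("error") or {}
    if status_code == 422 and error.get("code") == "validation_failed":
        return error.get("details")
    return None


# Hypothetical body: only error.code and error.details are named on this page.
sample_body = {
    "error": {
        "code": "validation_failed",
        "details": "price: expected number, got string",
    }
}

print(validation_details(422, sample_body))  # price: expected number, got string
print(validation_details(200, {}))           # None
```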

Common patterns

E-commerce listings

Outer products array of {name, price, url, image}. Use user_prompt to skip ads and recommendations.

Article extraction

Schema of {title, author, published_at, body, tags}. The default page_complexity: low handles most news and blog sites.

Structured tables

Mirror the table: {rows: [{col1, col2, ...}]}. Bump to page_complexity: high for tables past a few hundred rows.
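
For tables, the schema can be generated from the column headers. The sketch below is one way to do it; `table_schema` is a hypothetical helper, and the column names and types in the usage line are made up for illustration.

```python
def table_schema(columns: dict[str, str]) -> dict:
    """Build a {rows: [{col: type, ...}]} schema from a column-name -> type mapping."""
    return {
        "type": "object",
        "properties": {
            "rows": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {name: {"type": t} for name, t in columns.items()},
                },
            }
        },
    }


# Illustrative columns; use the most specific type each column allows.
standings = table_schema({"rank": "integer", "team": "string", "points": "number"})
```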

Interactive pages

Data behind a click, login, or paginated UI? SmartBrowse is the tool — recipes run in real Chrome.

Cost

5 credits per call (+5 with stealth: true). Failed requests cost 0. See Credits for the full table.
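
For budgeting, the arithmetic above can be written as a small estimator. It is a sketch that uses only the figures stated here and assumes no other surcharges apply (check Credits for the full table); `estimate_credits` is a hypothetical name.

```python
def estimate_credits(successful_calls: int, stealth: bool = False) -> int:
    """5 credits per call, +5 with stealth; failed requests cost 0, so count only successes."""
    per_call = 5 + (5 if stealth else 0)
    return successful_calls * per_call


print(estimate_credits(200))                # 1000
print(estimate_credits(200, stealth=True))  # 2000
```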
