Skip to main content
POST
/
smartscraper
Structured extraction with JSON-schema validation
curl --request POST \
  --url https://api.webscrape.ai/v1/smartscraper \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <api-key>' \
  --data '
{
  "website_url": "https://news.ycombinator.com",
  "user_prompt": "Extract the front-page stories with title, url, and score.",
  "output_schema": {
    "type": "object",
    "properties": {
      "stories": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": {
              "type": "string"
            },
            "url": {
              "type": "string"
            },
            "score": {
              "type": "integer"
            }
          }
        }
      }
    }
  },
  "page_complexity": "low",
  "detail_level": "medium",
  "parse_mode": "accurate",
  "plain_text": false,
  "include_tags": [
    "<string>"
  ],
  "exclude_tags": [
    "<string>"
  ],
  "reduce_content": true,
  "experimental": false,
  "headers": {},
  "max_age": 1,
  "stealth": false
}
'
{
  "status": "completed",
  "request_id": "req_aB3xY9Kp",
  "data": {
    "request_id": "<string>",
    "result": {},
    "latency_ms": 123
  },
  "credits_used": 5,
  "credits_remaining": 495
}
Hand over a URL and a JSON schema, get validated structured JSON back. Handles long content, noisy pages, and one automatic repair pass on validation failure. Cost: 5 credits per call (+5 with stealth: true). Failed requests cost 0.

Examples

curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://news.ycombinator.com",
    "user_prompt": "Extract the front-page stories with title, url, and score.",
    "output_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "url":   {"type": "string"},
              "score": {"type": "integer"}
            }
          }
        }
      }
    }
  }'
{
  "status": "completed",
  "data": {
    "result": {
      "stories": [
        { "title": "Show HN: ...", "url": "https://...", "score": 312 },
        { "title": "Ask HN: ...", "url": "https://...", "score": 184 }
      ]
    },
    "latency_ms": 1842
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_aB3xY9Kp"
}
Your extracted data is at data.result. Always check status first.

Tips

  • user_prompt is required and does the heavy lifting. Add an output_schema when you want the result validated into a fixed shape.
  • Leave page_complexity at low (the default) for most pages. Bump to high for visually busy pages or schemas with many nested fields.
  • Failed schema validation comes back as error.code: validation_failed (HTTP 422). Adjust the schema or simplify the prompt and retry.
Page requires a login, click, or pagination before the data shows up? Use SmartBrowse. Schema-driven extraction, running in a real Chrome session.
For schema-design tips, page-complexity tuning, and validation troubleshooting, see the SmartScraper feature page.

Authorizations

X-API-Key
string
header
required

Generate from the dashboard. Format: wsg_live_<32 base62 chars>.

Body

application/json
website_url
string<uri>
required
Example:

"https://news.ycombinator.com"

user_prompt
string
required

Plain-English description of what to extract.

Example:

"Extract the front-page stories with title, url, and score."

output_schema
object

JSON Schema enforced on the extracted output. When supplied, the result is validated against the schema and one repair attempt is made before returning validation_failed. Providing both user_prompt and output_schema gives the best results.

Example:
{
"type": "object",
"properties": {
"stories": {
"type": "array",
"items": {
"type": "object",
"properties": {
"title": { "type": "string" },
"url": { "type": "string" },
"score": { "type": "integer" }
}
}
}
}
}
page_complexity
enum<string>
default:low

low (default) is faster and cheaper. Bump to high for visually busy pages or schemas with many nested fields.

Available options:
low,
high
detail_level
enum<string>
default:medium

How exhaustively to populate the result.

Available options:
low,
medium,
high
parse_mode
enum<string>
default:accurate

Cleaner mode. accurate is more forgiving on malformed pages; speed is faster.

Available options:
accurate,
speed
plain_text
boolean
default:false

Return the raw extracted text under result instead of a parsed JSON object. Bypasses output_schema validation.

include_tags
string[]

Whitelist of HTML tags to keep before extraction.

exclude_tags
string[]

Blacklist of HTML tags to drop before extraction.

reduce_content
boolean

Trim long content before extraction. Helps on pages with lots of repetitive boilerplate. Uses a sensible default when omitted.

experimental
boolean
default:false

Opt in to an alternate extraction path that can do better on hard-to-parse pages. Behavior may change without notice.

headers
object

Custom request headers forwarded to the fetcher. Providing any headers disables URL caching for this request.

max_age
integer

URL-cache opt-in — same three-state semantics as /scrape (omitted = no cache, 0 = bypass read but write on miss, >0 = return entries fresher than N seconds).

Stealth requests, custom-headered requests, and URLs with query strings or fragments are never cached.

Required range: x >= 0
stealth
boolean
default:false

Use stealth mode. +5 credits.

Response

Successfully extracted.

status
enum<string>
required

Discriminator for the envelope variant.

Available options:
completed
request_id
string
required

Per-request id. Also returned as the X-Request-ID response header. Include it when reporting issues.

Example:

"req_aB3xY9Kp"

data
object
required
credits_used
integer
required
Example:

5

credits_remaining
integer
required
Example:

495