SmartScraper - webscrape.ai

Structured extraction with JSON-schema validation

curl --request POST \
  --url https://api.webscrape.ai/v1/smartscraper \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <api-key>' \
  --data '
{
  "website_url": "https://news.ycombinator.com",
  "user_prompt": "Extract the front-page stories with title, url, and score.",
  "output_schema": {
    "type": "object",
    "properties": {
      "stories": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": {
              "type": "string"
            },
            "url": {
              "type": "string"
            },
            "score": {
              "type": "integer"
            }
          }
        }
      }
    }
  },
  "page_complexity": "low",
  "detail_level": "medium",
  "parse_mode": "accurate",
  "plain_text": false,
  "include_tags": [
    "<string>"
  ],
  "exclude_tags": [
    "<string>"
  ],
  "reduce_content": true,
  "experimental": false,
  "headers": {},
  "max_age": 1,
  "stealth": false
}
'

{
  "status": "completed",
  "request_id": "req_aB3xY9Kp",
  "data": {
    "request_id": "<string>",
    "result": {},
    "latency_ms": 123
  },
  "credits_used": 5,
  "credits_remaining": 495
}

POST

smartscraper

Structured extraction with JSON-schema validation

curl --request POST \
  --url https://api.webscrape.ai/v1/smartscraper \
  --header 'Content-Type: application/json' \
  --header 'X-API-Key: <api-key>' \
  --data '
{
  "website_url": "https://news.ycombinator.com",
  "user_prompt": "Extract the front-page stories with title, url, and score.",
  "output_schema": {
    "type": "object",
    "properties": {
      "stories": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": {
              "type": "string"
            },
            "url": {
              "type": "string"
            },
            "score": {
              "type": "integer"
            }
          }
        }
      }
    }
  },
  "page_complexity": "low",
  "detail_level": "medium",
  "parse_mode": "accurate",
  "plain_text": false,
  "include_tags": [
    "<string>"
  ],
  "exclude_tags": [
    "<string>"
  ],
  "reduce_content": true,
  "experimental": false,
  "headers": {},
  "max_age": 1,
  "stealth": false
}
'

{
  "status": "completed",
  "request_id": "req_aB3xY9Kp",
  "data": {
    "request_id": "<string>",
    "result": {},
    "latency_ms": 123
  },
  "credits_used": 5,
  "credits_remaining": 495
}

Hand over a URL and a JSON schema, get validated structured JSON back. Handles long content, noisy pages, and one automatic repair pass on validation failure. Cost: 5 credits per call (+5 with stealth: true). Failed requests cost 0.

Examples

curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://news.ycombinator.com",
    "user_prompt": "Extract the front-page stories with title, url, and score.",
    "output_schema": {
      "type": "object",
      "properties": {
        "stories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "title": {"type": "string"},
              "url":   {"type": "string"},
              "score": {"type": "integer"}
            }
          }
        }
      }
    }
  }'

Example response

{
  "status": "completed",
  "data": {
    "result": {
      "stories": [
        { "title": "Show HN: ...", "url": "https://...", "score": 312 },
        { "title": "Ask HN: ...", "url": "https://...", "score": 184 }
      ]
    },
    "latency_ms": 1842
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_aB3xY9Kp"
}

Your extracted data is at data.result. Always check status first.

Tips

user_prompt is required and does the heavy lifting. Add an output_schema when you want the result validated into a fixed shape.
Leave page_complexity at low (the default) for most pages. Bump to high for visually busy pages or schemas with many nested fields.
Failed schema validation comes back as error.code: validation_failed (HTTP 422). Adjust the schema or simplify the prompt and retry.

Page requires a login, click, or pagination before the data shows up? Use SmartBrowse. Schema-driven extraction, running in a real Chrome session.

For schema-design tips, page-complexity tuning, and validation troubleshooting, see the SmartScraper feature page.

Authorizations

X-API-Key

string

header

required

Generate from the dashboard. Format: wsg_live_<32 base62 chars>.

Body

application/json

website_url

string<uri>

required

Example:

"https://news.ycombinator.com"

user_prompt

string

required

Plain-English description of what to extract.

Example:

"Extract the front-page stories with title, url, and score."

output_schema

object

JSON Schema enforced on the extracted output. When supplied, the result is validated against the schema and one repair attempt is made before returning validation_failed. Providing both user_prompt and output_schema gives the best results.

Example:

{
  "type": "object",
  "properties": {
    "stories": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "title": { "type": "string" },
          "url": { "type": "string" },
          "score": { "type": "integer" }
        }
      }
    }
  }
}

page_complexity

enum<string>

default:low

low (default) is faster and cheaper. Bump to high for visually busy pages or schemas with many nested fields.

Available options:

low,

high

detail_level

enum<string>

default:medium

How exhaustively to populate the result.

Available options:

low,

medium,

high

parse_mode

enum<string>

default:accurate

Cleaner mode. accurate is more forgiving on malformed pages; speed is faster.

Available options:

accurate,

speed

plain_text

boolean

default:false

Return the raw extracted text under result instead of a parsed JSON object. Bypasses output_schema validation.

include_tags

string[]

Whitelist of HTML tags to keep before extraction.

exclude_tags

string[]

Blacklist of HTML tags to drop before extraction.

reduce_content

boolean

Trim long content before extraction. Helps on pages with lots of repetitive boilerplate. Uses a sensible default when omitted.

experimental

boolean

default:false

Opt in to an alternate extraction path that can do better on hard-to-parse pages. Behavior may change without notice.

headers

object

Custom request headers forwarded to the fetcher. Providing any headers disables URL caching for this request.

Show child attributes

max_age

integer

URL-cache opt-in — same three-state semantics as /scrape (omitted = no cache, 0 = bypass read but write on miss, >0 = return entries fresher than N seconds).

Stealth requests, custom-headered requests, and URLs with query strings or fragments are never cached.

Required range: x >= 0

stealth

boolean

default:false

Use stealth mode. +5 credits.

Response

Successfully extracted.

status

enum<string>

required

Discriminator for the envelope variant.

Available options:

completed

request_id

string

required

Per-request id. Also returned as the X-Request-ID response header. Include it when reporting issues.

Example:

"req_aB3xY9Kp"

data

object

required

Show child attributes

credits_used

integer

required

Example:

5

credits_remaining

integer

required

Example:

495

Scrape

Dispatch run

​Examples

​Tips

Authorizations

Body

Response

Examples

Tips