Scrape - webscrape.ai

POST /v1/scrape returns the contents of a URL. The bare-minimum call is in the API reference; this page is the long version — output formats, tag filtering, stealth, and what to do with empty responses.

Pick the right format

You want…	Set	Returns
Full HTML, exactly as the server sent it	(default)	`data.html` is the raw HTML
Cleaned markdown for an LLM	`clean: true`	`data.html` is markdown
A list of outbound links	`extract_links: true`	`data.links` is a deduplicated array
Markdown from a PDF	Default, auto-detected	`data.html` is markdown, `content_type: "pdf"`

Flags compose. clean: true together with extract_links: true gives you cleaned content and the link list in a single call.

requests.post(
    "https://api.webscrape.ai/v1/scrape",
    headers={"X-API-Key": "wsg_live_..."},
    json={
        "website_url": "https://example.com",
        "clean": True,
        "extract_links": True,
    },
)

Content filtering

Use include_tags and exclude_tags (HTML tag names) to trim the cleaned output down to what you want.

{
  "website_url": "https://example.com/article",
  "clean": true,
  "include_tags": ["article", "h1", "h2", "p"],
  "exclude_tags": ["nav", "footer", "aside", "script"]
}

It’s faster and cheaper than running a second cleaning pass on the full page. Good for:

News articles where you only want the body, not the rail of related links.
Reference pages where you want headings and paragraphs but not the nav chrome.
Product pages where you want the description block but not the carousel of upsells.

When to use stealth

The default fetcher works on most public sites. Stealth adds 2 credits, so only flip it on when you have a clear signal:

Use stealth

403 / 429 / “Just a moment…” challenge pages
JS challenge interstitials or bot-wall responses
Sites that need JavaScript to render the data

Skip stealth

Most marketing, news, listing, and product pages
JSON APIs and RSS feeds
Anywhere the data is already in the HTML on first load

A common pattern: try without stealth, retry with it on failure. The first call is free if it fails, so you only ever pay the surcharge when stealth was needed. See Stealth mode for full code.

Caching with `max_age`

max_age controls whether we can serve your fetch from cache:

Value	Behavior
Omitted	No cache. Always a fresh fetch.
`0`	Same call hits the wire, but the result is cached for next time.
`> 0` (seconds)	If we have a cached response newer than this, serve it; otherwise fetch fresh.

Some requests are never cached, no matter what max_age says:

Stealth requests
Requests with custom headers
URLs with query strings or fragments

If your URL hits any of those, every call is a real fetch.

Troubleshooting

Empty or partial response

If data.html is much shorter than what you see in a browser, the page is probably client-rendered. Try stealth: true first — the stealth browser runs the page’s JavaScript before returning. If you need to interact with the page (click “Load more”, sign in, paginate), Scrape isn’t the right tool — switch to SmartBrowse.

Wrong content type

data.content_type is html or pdf. PDFs auto-convert to markdown, so you’ll see pdf here even though data.html contains markdown text. Branch on this if your downstream code cares about source format.

Bot wall responses

A 200 with a tiny body containing a CAPTCHA or “checking your browser…” page means the default fetcher reached the site but a JS challenge fired. Retry with stealth: true.

Links look weird

extract_links: true returns deduplicated, normalized links. Relative URLs are resolved against the page URL, fragments get stripped, and duplicate hrefs collapse to a single entry. If you want raw href attributes, parse them yourself from data.html.

When Scrape isn’t the right tool

Need structured JSON?

SmartScraper. Schema in, validated JSON out.

Behind a click or login?

SmartBrowse. Real Chrome, replayable recipes.

Cost

1 credit per call (+2 with stealth: true). Failed requests cost 0. See Credits for the full table.

​Pick the right format

​Content filtering

​When to use stealth

Use stealth

Skip stealth

​Caching with max_age

​Troubleshooting

​Empty or partial response

​Wrong content type

​Bot wall responses

​Links look weird

​When Scrape isn’t the right tool

Need structured JSON?

Behind a click or login?

​Cost

Pick the right format

Content filtering

When to use stealth

Caching with `max_age`

Troubleshooting

Empty or partial response

Wrong content type

Bot wall responses

Links look weird

When Scrape isn’t the right tool

Cost