POST /v1/scrape returns the contents of a URL. The bare-minimum call is in the API reference; this page is the long version — output formats, tag filtering, stealth, and what to do with empty responses.

Pick the right format

You want…                                 | Set                    | Returns
Full HTML, exactly as the server sent it  | (default)              | data.html is the raw HTML
Cleaned markdown for an LLM               | clean: true            | data.html is markdown
A list of outbound links                  | extract_links: true    | data.links is a deduplicated array
Markdown from a PDF                       | Default, auto-detected | data.html is markdown, content_type: "pdf"
Flags compose. clean: true together with extract_links: true gives you cleaned content and the link list in a single call.
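import requests

response = requests.post(
    "https://api.webscrape.ai/v1/scrape",
    headers={"X-API-Key": "wsg_live_..."},
    json={
        "website_url": "https://example.com",
        "clean": True,          # data.html comes back as markdown
        "extract_links": True,  # data.links holds the deduplicated link list
    },
)
data = response.json()["data"]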

Content filtering

Use include_tags and exclude_tags (HTML tag names) to trim the cleaned output down to what you want.
{
  "website_url": "https://example.com/article",
  "clean": true,
  "include_tags": ["article", "h1", "h2", "p"],
  "exclude_tags": ["nav", "footer", "aside", "script"]
}
It’s faster and cheaper than running a second cleaning pass on the full page. Good for:
  • News articles where you only want the body, not the rail of related links.
  • Reference pages where you want headings and paragraphs but not the nav chrome.
  • Product pages where you want the description block but not the carousel of upsells.
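As a rough sketch, the filtered request above can be built from Python with a small helper. The helper name `build_filtered_payload` and the overlap check are illustrative assumptions, not part of the API; the check is only a client-side sanity guard, since this page does not say how the service treats a tag listed in both filters.

```python
def build_filtered_payload(url, include_tags, exclude_tags):
    """Build a /v1/scrape request body with tag filters (hypothetical helper)."""
    # Client-side guard only: behavior for a tag in both lists is not documented here.
    overlap = set(include_tags) & set(exclude_tags)
    if overlap:
        raise ValueError(f"tags listed in both filters: {sorted(overlap)}")
    return {
        "website_url": url,
        "clean": True,  # the filters trim the cleaned output
        "include_tags": list(include_tags),
        "exclude_tags": list(exclude_tags),
    }


payload = build_filtered_payload(
    "https://example.com/article",
    include_tags=["article", "h1", "h2", "p"],
    exclude_tags=["nav", "footer", "aside", "script"],
)
```

The resulting `payload` can be passed as the `json=` argument of the request shown earlier on this page.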

When to use stealth

The default fetcher works on most public sites. Stealth adds 2 credits, so only flip it on when you have a clear signal:

Use stealth

  • 403 / 429 / “Just a moment…” challenge pages
  • JS challenge interstitials or bot-wall responses
  • Sites that need JavaScript to render the data

Skip stealth

  • Most marketing, news, listing, and product pages
  • JSON APIs and RSS feeds
  • Anywhere the data is already in the HTML on first load
A common pattern: try without stealth, retry with it on failure. The first call is free if it fails, so you only ever pay the surcharge when stealth was needed. See Stealth mode for full code.
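A minimal sketch of that try-then-retry pattern follows. Several things here are assumptions rather than documented behavior: the `send` callable and its `(status, body)` return shape, the idea that a blocked fetch surfaces as a 403 or 429 status, the marker strings, and the 2000-character threshold for a "tiny" body. Tune all of them against real responses.

```python
from typing import Callable

API_URL = "https://api.webscrape.ai/v1/scrape"

# Illustrative challenge-page markers; an assumption, not an official list.
CHALLENGE_MARKERS = ("just a moment", "checking your browser", "captcha")


def looks_blocked(status: int, html: str) -> bool:
    """Heuristic: a blocked status code, or a tiny body with challenge text."""
    if status in (403, 429):
        return True
    body = html.lower()
    return len(body) < 2000 and any(marker in body for marker in CHALLENGE_MARKERS)


def scrape_with_fallback(url: str, send: Callable[[dict], tuple[int, dict]]) -> dict:
    """Try the default fetcher first; retry once with stealth if it looks blocked."""
    payload = {"website_url": url}
    status, body = send(payload)
    html = (body.get("data") or {}).get("html") or ""
    if looks_blocked(status, html):
        status, body = send({**payload, "stealth": True})
    return body


def send_with_requests(payload: dict) -> tuple[int, dict]:
    """One possible `send`, using the requests library (imported lazily)."""
    import requests

    resp = requests.post(
        API_URL,
        headers={"X-API-Key": "wsg_live_..."},
        json=payload,
        timeout=60,
    )
    try:
        body = resp.json()
    except ValueError:  # non-JSON error body
        body = {}
    return resp.status_code, body
```

Calling `scrape_with_fallback("https://example.com", send_with_requests)` makes at most two requests. Passing the transport in as `send` keeps the retry logic easy to test without touching the network.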

Caching with max_age

max_age controls whether we can serve your fetch from cache:
Value         | Behavior
Omitted       | No cache. Always a fresh fetch.
0             | The call still hits the wire, but the result is cached for next time.
> 0 (seconds) | If we have a cached response newer than this many seconds, serve it; otherwise fetch fresh.
Some requests are never cached, no matter what max_age says:
  • Stealth requests
  • Requests with custom headers
  • URLs with query strings or fragments
If your URL hits any of those, every call is a real fetch.
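The rules above can be checked client-side before you send a request. This is a sketch under assumptions: the function name, the three mode labels, and the `custom_headers` argument are hypothetical, and the service remains the source of truth for what is actually cached.

```python
from urllib.parse import urlsplit


def cache_mode(url, *, max_age=None, stealth=False, custom_headers=None):
    """Predict how the cache treats a request, per the rules on this page.

    Returns "never", "store_only", or "read_or_fetch" (labels are illustrative).
    """
    if max_age is None:
        return "never"  # omitted max_age: no cache
    if stealth or custom_headers:
        return "never"  # stealth and custom-header requests are never cached
    parts = urlsplit(url)
    if parts.query or parts.fragment:
        return "never"  # query strings and fragments are never cached
    if max_age == 0:
        return "store_only"  # fetches fresh, result cached for next time
    return "read_or_fetch"  # a cached copy newer than max_age seconds is served


print(cache_mode("https://example.com/pricing", max_age=3600))      # read_or_fetch
print(cache_mode("https://example.com/search?q=tea", max_age=3600))  # never
```

A check like this is mostly useful for spotting URLs where setting `max_age` has no effect, such as paginated listings addressed by query string.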

Troubleshooting

Empty or partial response

If data.html is much shorter than what you see in a browser, the page is probably client-rendered. Try stealth: true first — the stealth browser runs the page’s JavaScript before returning. If you need to interact with the page (click “Load more”, sign in, paginate), Scrape isn’t the right tool — switch to SmartBrowse.

Wrong content type

data.content_type is html or pdf. PDFs auto-convert to markdown, so you’ll see pdf here even though data.html contains markdown text. Branch on this if your downstream code cares about source format.
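One way to branch on this, sketched under the assumption that the response's `data` object carries `content_type` alongside `html` as described on this page. The helper name and the fallback to "html" when the field is missing are illustrative choices.

```python
def output_format(data: dict, clean_requested: bool) -> tuple[str, str]:
    """Return (source_format, text_format) for a Scrape `data` object."""
    # Falling back to "html" when the field is absent is an assumption.
    source = data.get("content_type", "html")
    # PDFs always arrive as markdown; HTML is markdown only with clean: true.
    if source == "pdf" or clean_requested:
        return source, "markdown"
    return source, "html"


print(output_format({"content_type": "pdf", "html": "# Report"}, clean_requested=False))
# ('pdf', 'markdown')
```

Downstream code can then pick an HTML parser or a markdown parser from the second element, and keep the first for provenance.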

Bot wall responses

A 200 with a tiny body containing a CAPTCHA or “checking your browser…” page means the default fetcher reached the site but a JS challenge fired. Retry with stealth: true.

extract_links: true returns deduplicated, normalized links. Relative URLs are resolved against the page URL, fragments get stripped, and duplicate hrefs collapse to a single entry. If you want raw href attributes, parse them yourself from data.html.
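To make the link normalization concrete, here is a local approximation using only the standard library. It illustrates the three steps described for extract_links (resolve, strip fragment, dedupe); it is not the service's actual implementation, and `normalize_links` is a hypothetical name.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(page_url, hrefs):
    """Resolve hrefs against page_url, strip fragments, and drop duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        # Resolve relative URLs, then discard any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


print(normalize_links(
    "https://example.com/docs/",
    ["intro", "/about#team", "https://example.com/about", "intro#setup"],
))
# ['https://example.com/docs/intro', 'https://example.com/about']
```

Four raw hrefs collapse to two entries here, which is why the length of data.links can be smaller than the number of anchor tags on the page.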

When Scrape isn’t the right tool

Need structured JSON?

SmartScraper. Schema in, validated JSON out.

Behind a click or login?

SmartBrowse. Real Chrome, replayable recipes.

Cost

1 credit per call (+2 with stealth: true). Failed requests cost 0. See Credits for the full table.
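As a quick worked example of the arithmetic, the sketch below totals credits for a sequence of attempts using only the two rates stated on this page. The function name and the `(used_stealth, succeeded)` tuple shape are illustrative; the Credits table may list cases not covered here.

```python
BASE_CREDITS = 1       # per successful call
STEALTH_SURCHARGE = 2  # added when stealth: true


def credits_for(attempts):
    """Sum credits for (used_stealth, succeeded) attempts; failures cost 0."""
    total = 0
    for used_stealth, succeeded in attempts:
        if succeeded:
            total += BASE_CREDITS + (STEALTH_SURCHARGE if used_stealth else 0)
    return total


print(credits_for([(False, True)]))                 # 1
print(credits_for([(False, False), (True, True)]))  # 3
```

The second case is the fallback pattern: a failed default fetch costs nothing, so the retry with stealth is the only charge.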