POST /v1/scrape returns the contents of a URL. The bare-minimum call is in the API reference; this page is the long version — output formats, tag filtering, stealth, and what to do with empty responses.
Pick the right format
| You want… | Set | Returns |
|---|---|---|
| Full HTML, exactly as the server sent it | (default) | data.html is the raw HTML |
| Cleaned markdown for an LLM | clean: true | data.html is markdown |
| A list of outbound links | extract_links: true | data.links is a deduplicated array |
| Markdown from a PDF | Default, auto-detected | data.html is markdown, content_type: "pdf" |
clean: true together with extract_links: true gives you cleaned content and the link list in a single call.
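A minimal sketch of how the request body for the combinations in the table might be assembled. The flag names clean and extract_links come from this page; the helper name and the top-level url field are assumptions for illustration, so check the API reference for the exact request shape.

```python
import json


def build_scrape_body(url, clean=False, extract_links=False):
    """Build a JSON-serializable body for POST /v1/scrape (hypothetical helper)."""
    body = {"url": url}  # assumption: the target is sent as a top-level "url" field
    # Only send flags that differ from the default, so the default call stays minimal.
    if clean:
        body["clean"] = True
    if extract_links:
        body["extract_links"] = True
    return body


# Cleaned markdown plus the link list in a single call.
body = build_scrape_body("https://example.com/post", clean=True, extract_links=True)
print(json.dumps(body))
```

Leaving both flags off produces the default request, which returns raw HTML in data.html.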
Content filtering
Use include_tags and exclude_tags (HTML tag names) to trim the cleaned output down to what you want.
- News articles where you only want the body, not the rail of related links.
- Reference pages where you want headings and paragraphs but not the nav chrome.
- Product pages where you want the description block but not the carousel of upsells.
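The cases above could be expressed as a request body along these lines. This is a sketch under assumptions: it treats include_tags and exclude_tags as lists of tag-name strings, sends clean: true because the filters trim the cleaned output, and uses a hypothetical top-level url field. The overlap check is a client-side convenience, not documented API behavior.

```python
def build_filtered_body(url, include_tags=None, exclude_tags=None):
    """Build a clean-mode body with optional tag filters (hypothetical helper)."""
    include = list(include_tags or [])
    exclude = list(exclude_tags or [])
    # A tag in both lists is contradictory, so fail early on the client side.
    overlap = set(include) & set(exclude)
    if overlap:
        raise ValueError(f"tags in both lists: {sorted(overlap)}")
    body = {"url": url, "clean": True}  # assumption: "url" is the target field
    if include:
        body["include_tags"] = include
    if exclude:
        body["exclude_tags"] = exclude
    return body


# Reference page: keep headings and paragraphs, drop the nav chrome.
body = build_filtered_body(
    "https://example.com/docs/intro",
    include_tags=["h1", "h2", "p"],
    exclude_tags=["nav", "footer"],
)
```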
When to use stealth
The default fetcher works on most public sites. Stealth adds 2 credits, so only flip it on when you have a clear signal.
Use stealth
- 403 / 429 / “Just a moment…” challenge pages
- JS challenge interstitials or bot-wall responses
- Sites that need JavaScript to render the data
Skip stealth
- Most marketing, news, listing, and product pages
- JSON APIs and RSS feeds
- Anywhere the data is already in the HTML on first load
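One way to turn the signals above into a check is sketched below. It assumes you have the status code and body text of the fetched page available; the function name and marker strings are illustrative, and real challenge pages vary, so treat the marker list as a starting point.

```python
# Status codes and page fragments that this page lists as stealth signals.
BLOCK_STATUSES = {403, 429}
CHALLENGE_MARKERS = ("just a moment", "checking your browser")


def looks_blocked(status_code, body_text):
    """Return True when a default-fetcher result suggests a bot wall."""
    if status_code in BLOCK_STATUSES:
        return True
    lowered = body_text.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

A caller would leave stealth off by default and only set stealth: true on a retry when this check returns True, which keeps the extra 2 credits for requests that need them.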
Caching with max_age
max_age controls whether we can serve your fetch from cache:
| Value | Behavior |
|---|---|
| Omitted | No cache. Always a fresh fetch. |
| 0 | The call still hits the wire, but the result is cached for next time. |
| > 0 (seconds) | If we have a cached response newer than this many seconds, serve it; otherwise fetch fresh. |
The cache is skipped for:
- Stealth requests
- Requests with custom headers
- URLs with query strings or fragments
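The max_age table can be read as a small decision rule. The sketch below models only the three rows of that table from the caller's point of view; the function name and the cached_age_seconds input are hypothetical, and the service's actual cache logic is not shown on this page.

```python
def serves_from_cache(max_age, cached_age_seconds):
    """Return True if a request with this max_age would be served from cache.

    max_age: None when omitted, otherwise a number of seconds.
    cached_age_seconds: age of the cached response, or None if nothing is cached.
    """
    if max_age is None:
        return False  # omitted: always a fresh fetch
    if max_age == 0:
        return False  # fresh fetch; the result is cached for next time
    # max_age > 0: serve the cached copy only if it is newer than max_age.
    return cached_age_seconds is not None and cached_age_seconds < max_age
```

For example, with max_age set to 3600, a response cached 10 minutes ago is served from cache, while one cached 2 hours ago triggers a fresh fetch.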
Troubleshooting
Empty or partial response
If data.html is much shorter than what you see in a browser, the page is probably client-rendered. Try stealth: true first: the stealth browser runs the page’s JavaScript before returning. If you need to interact with the page (click “Load more”, sign in, paginate), Scrape isn’t the right tool; switch to SmartBrowse.
Wrong content type
data.content_type is html or pdf. PDFs auto-convert to markdown, so you’ll see pdf here even though data.html contains markdown text. Branch on this if your downstream code cares about source format.
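Branching on the source format might look like the sketch below. It assumes the response JSON has been parsed and that its data object is passed in as a dict; the helper name is hypothetical, and the clean_requested argument is the caller's own record of whether clean: true was sent.

```python
def body_format(data, clean_requested=False):
    """Return the format of data["html"]: "html" or "markdown"."""
    content_type = data["content_type"]
    if content_type == "pdf":
        # PDFs are auto-converted, so the body is markdown even without clean.
        return "markdown"
    if content_type == "html":
        return "markdown" if clean_requested else "html"
    # Anything else is not described on this page, so surface it loudly.
    raise ValueError(f"unexpected content_type: {content_type!r}")
```

Downstream code can then route markdown to a text pipeline and raw HTML to a parser without inspecting the body itself.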
Bot wall responses
A 200 with a tiny body containing a CAPTCHA or “checking your browser…” page means the default fetcher reached the site but a JS challenge fired. Retry with stealth: true.
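The retry described above can be wrapped in a small helper. This is a sketch, not a client library: fetch is a caller-supplied function (hypothetical) that performs the scrape and returns a dict with status and html keys, and the 2000-character threshold for a "tiny" body is an arbitrary assumption to tune for your targets.

```python
def scrape_with_stealth_fallback(fetch, url, min_body_chars=2000):
    """Try the default fetcher first; retry once with stealth on a bot wall.

    fetch(url, stealth) must return a dict with "status" and "html" keys.
    """
    result = fetch(url, stealth=False)
    html = result.get("html", "")
    lowered = html.lower()
    tiny = len(html) < min_body_chars
    challenged = "captcha" in lowered or "checking your browser" in lowered
    if result.get("status") == 200 and tiny and challenged:
        # The site answered, but with a challenge page instead of content.
        return fetch(url, stealth=True)
    return result
```

Because the stealth call only happens after a detected challenge, pages that load normally are fetched once without stealth.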
Links look weird
extract_links: true returns deduplicated, normalized links. Relative URLs are resolved against the page URL, fragments get stripped, and duplicate hrefs collapse to a single entry. If you want raw href attributes, parse them yourself from data.html.
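If you do need the raw attributes, the standard library is enough. The sketch below pulls href values from data.html (which must be raw HTML, so leave clean off) and, for comparison, applies the three normalization steps described above. The normalize_links function approximates that description; the service's exact rules may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HrefCollector(HTMLParser):
    """Collect the raw href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def raw_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


def normalize_links(page_url, hrefs):
    """Resolve relative URLs, strip fragments, and drop duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


html = '<a href="/a">A</a> <a href="b#top">B</a> <a href="https://example.com/a">A again</a>'
hrefs = raw_hrefs(html)
links = normalize_links("https://example.com/docs/intro", hrefs)
```

Here hrefs keeps all three attributes as written, while links collapses them to two absolute URLs.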
When Scrape isn’t the right tool
Need structured JSON?
SmartScraper. Schema in, validated JSON out.
Behind a click or login?
SmartBrowse. Real Chrome, replayable recipes.
Cost
1 credit per call (+2 with stealth: true). Failed requests cost 0. See Credits for the full table.
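For budgeting, the pricing above reduces to a one-line rule. The helper name is hypothetical, and the sketch covers only the three figures stated here, not the full Credits table.

```python
def credits_for_call(stealth=False, failed=False):
    """Credits charged for one Scrape call under the rules on this page."""
    if failed:
        return 0  # failed requests are free
    return 1 + (2 if stealth else 0)


# Estimate a batch: 100 default calls plus 20 stealth retries.
estimated = 100 * credits_for_call() + 20 * credits_for_call(stealth=True)
```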