How to get LLM-ready data (markdown or JSON) from any URL
How to get LLM-ready data from any URL: pull specific fields as JSON, or strip a page to its clean content, with PDFs converted to markdown automatically.
| You want | Call to make |
|---|---|
| Specific fields as JSON | /v1/smartscraper with website_url + user_prompt |
| The readable text, unstructured | /v1/smartscraper with plain_text: true |
| The whole page, boilerplate stripped | /v1/scrape with clean: true |
| A PDF as markdown | /v1/scrape on the PDF URL — converted automatically |
| Cost | smartscraper 5 credits; scrape 1 credit; billed on success only |
"LLM-ready" means two different things depending on what you are feeding the model. Sometimes you know the exact fields you want and need them as structured JSON. Sometimes you want the whole readable page, stripped of the navigation and ads, to drop into a prompt or a RAG index. The API has a call for each, and a PDF takes care of itself. By the end of this you will know which one to reach for.
You need an API key, an HTTP client, and a URL. The examples target a Wikipedia article, whose content is openly licensed; respect each target's terms and robots.txt.
Step 1: Decide the shape you need
Two questions settle it. Do you want a fixed set of fields, or the whole page? And
do you want JSON, or readable text? Specific fields point you at
/v1/smartscraper; the whole page points you at /v1/scrape with clean: true.
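The decision above can be written down as a small lookup. This is a minimal sketch: the endpoint paths and flags come from the table at the top, while the function name `choose_call` and the shape labels `"fields"`, `"text"`, and `"page"` are made up for illustration.

```python
from typing import Any, Dict, Tuple

# Shape label -> (endpoint, extra flags). The labels are hypothetical;
# the endpoints and flags are the ones listed in the table above.
CALLS: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "fields": ("/v1/smartscraper", {}),                  # structured JSON
    "text": ("/v1/smartscraper", {"plain_text": True}),  # unstructured text
    "page": ("/v1/scrape", {"clean": True}),             # whole page, cleaned
}


def choose_call(shape: str) -> Tuple[str, Dict[str, Any]]:
    """Return the endpoint and flags for the shape you want back."""
    if shape not in CALLS:
        raise ValueError(f"unknown shape: {shape!r}")
    endpoint, flags = CALLS[shape]
    return endpoint, dict(flags)  # copy so callers can add fields safely
```

A caller would merge the returned flags with `website_url` (and `user_prompt` for smartscraper) to form the request body.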
Step 2: Specific fields as JSON
When you know what you want, name the fields and let the engine return them as
structured JSON under data.result:
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://en.wikipedia.org/wiki/Web_scraping",
    "user_prompt": "Extract the article title, the lead summary, and the list of section headings"
  }'
The result is shaped to your description and ready to use without any HTML
parsing. Pass an output_schema to pin the field names and types when the JSON
feeds code that depends on them, covered in
prompt versus schema. For unstructured text
rather than fields, add plain_text: true and the raw extracted text comes back
under data.result instead of a parsed object.
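The same request can be sent from Python with only the standard library. Treat this as a sketch under assumptions: the host, header names, and field names are copied from the curl example, but the helper names `build_smartscraper_payload` and `post_json` are hypothetical, and the shape of `output_schema` is not specified here, so the JSON-Schema-style dict in the usage comment is a guess rather than the documented format.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

API_BASE = "https://api.webscrape.ai"  # host taken from the curl example


def build_smartscraper_payload(
    website_url: str,
    user_prompt: str,
    plain_text: bool = False,
    output_schema: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble the request body; optional fields are sent only when set."""
    payload: Dict[str, Any] = {
        "website_url": website_url,
        "user_prompt": user_prompt,
    }
    if plain_text:
        payload["plain_text"] = True
    if output_schema is not None:
        payload["output_schema"] = output_schema  # shape is an assumption
    return payload


def post_json(path: str, payload: Dict[str, Any], api_key: str,
              timeout: float = 60.0) -> Dict[str, Any]:
    """POST a JSON body with the X-API-Key header and decode the JSON reply."""
    request = urllib.request.Request(
        API_BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a real request, so it is left commented out):
# body = build_smartscraper_payload(
#     "https://en.wikipedia.org/wiki/Web_scraping",
#     "Extract the article title and the list of section headings",
#     output_schema={"type": "object",
#                    "properties": {"title": {"type": "string"}}},
# )
# result = post_json("/v1/smartscraper", body, "wsg_live_your_key_here")
# print(result["data"]["result"])
```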
Step 3: The whole page, boilerplate stripped
When you want the page itself, not a handful of fields, call /v1/scrape with
clean: true. The cleaner removes scripts, styles, navigation, and the rest of
the chrome, and returns the cleaned HTML a model should actually read:
curl https://api.webscrape.ai/v1/scrape \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://en.wikipedia.org/wiki/Web_scraping",
    "clean": true
  }'
The cleaned HTML comes back under data.html, with content_type telling you
whether the URL served HTML or a PDF, and metadata carrying the title and
language:
{
  "status": "completed",
  "data": {
    "content_type": "html",
    "cleaned": true,
    "html": "...cleaned HTML...",
    "metadata": { "title": "Web scraping", "language": "en" }
  },
  "credits_used": 1,
  "credits_remaining": 499,
  "request_id": "req_..."
}
You can narrow the result further with include_tags to keep only certain
elements, like ["article", "main"], or exclude_tags to drop others, like
["nav", "footer", "aside"], before the content is returned.
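As a rough illustration, here is how the request body with tag filters and the response handling might look in Python. The field names match the examples above; the helper names are hypothetical, and treating any status other than "completed" as an error is an assumption, since only the success shape is shown here.

```python
from typing import Any, Dict, List, Optional


def build_scrape_payload(
    website_url: str,
    clean: bool = True,
    include_tags: Optional[List[str]] = None,
    exclude_tags: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble a /v1/scrape body, adding tag filters only when given."""
    payload: Dict[str, Any] = {"website_url": website_url, "clean": clean}
    if include_tags:
        payload["include_tags"] = list(include_tags)
    if exclude_tags:
        payload["exclude_tags"] = list(exclude_tags)
    return payload


def read_scrape_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the content and metadata out of a completed scrape response."""
    # Assumption: anything other than "completed" is not usable content.
    if response.get("status") != "completed":
        raise RuntimeError(f"scrape not completed: {response.get('status')!r}")
    data = response["data"]
    metadata = data.get("metadata") or {}
    return {
        "content": data["html"],
        "content_type": data["content_type"],
        "title": metadata.get("title"),
        "language": metadata.get("language"),
    }
```

Dropping `["nav", "footer", "aside"]` through `exclude_tags` is the lighter-touch option; `include_tags` is stricter and will return nothing if the page does not use the elements you name.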
Step 4: PDFs become markdown on their own
Point the same /v1/scrape call at a PDF URL and you do not need a different
flag. The fetcher detects the PDF, converts it, and returns the text as markdown
under data.html with content_type set to pdf:
curl https://api.webscrape.ai/v1/scrape \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com/whitepaper.pdf" }'
That makes a mixed list of HTML and PDF URLs safe to send through one code path:
each comes back as readable content, and content_type tells you which is which.
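A single code path for a mixed list might look like the sketch below. It assumes each response has the shape shown in Step 3 and uses only `content_type` and `data.html`; the function names and the `format` label are illustrative, not part of the API.

```python
from typing import Any, Dict, Iterable, List, Tuple


def to_document(url: str, response: Dict[str, Any]) -> Dict[str, str]:
    """Label one scrape response by the format of the text it carries."""
    data = response["data"]
    # Per the text above: PDFs arrive as markdown, HTML pages as HTML,
    # and both sit under data.html.
    is_pdf = data.get("content_type") == "pdf"
    return {
        "url": url,
        "format": "markdown" if is_pdf else "html",
        "text": data["html"],
    }


def normalize_batch(
    results: Iterable[Tuple[str, Dict[str, Any]]]
) -> List[Dict[str, str]]:
    """Turn (url, response) pairs into uniformly shaped documents."""
    return [to_document(url, response) for url, response in results]
```

Keeping the format label next to the text lets a later step decide whether a document still needs converting before it goes into a prompt or an index.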
When this breaks
You expected markdown from an HTML page. clean: true returns cleaned HTML,
not markdown. The only markdown path today is PDF conversion. For HTML, the
cleaned HTML is the model-ready shape; if you need a uniform markdown corpus,
run it through your own HTML-to-markdown step.
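If you do want that conversion step, a deliberately small version can be built on Python's standard `html.parser`. This is a sketch, not a full converter: it handles headings, paragraphs, list items, and links only, does not escape markdown special characters, and assumes scripts and styles were already removed by clean: true. A dedicated HTML-to-markdown library is the better choice for production use.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional


class MarkdownSketch(HTMLParser):
    """Minimal HTML-to-markdown pass: headings, paragraphs, lists, links."""

    HEADINGS = {f"h{level}": "#" * level + " " for level in range(1, 7)}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.parts.append("](" + self.href + ")")
            self.href = None
        elif tag in self.HEADINGS or tag in ("ul", "ol"):
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse runs of whitespace but keep word boundaries intact.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Convert cleaned HTML to rough markdown text."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()
```

Running PDFs through unchanged and HTML through `html_to_markdown` gives a corpus in one format, at the cost of losing whatever this converter does not understand, such as tables.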
A PDF returns almost nothing. Its pages are probably scanned images with no text layer, so there is no text to extract. That is a job for OCR upstream, not a fetch problem.
The clean content is empty or thin. The page likely renders with JavaScript, so the initial HTML response carried no content. The fetch stack escalates to a browser tier when it detects this, walked through in the tiered fetch writeup; if a page you can read in your browser comes back empty, that is the case to suspect.
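One way to catch thin results before they reach a model is to measure how much visible text the returned HTML actually contains. The check below is a heuristic sketch: the 200-character threshold is an arbitrary assumption to tune for your pages, and the function names are illustrative.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the document's text with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())


def looks_thin(html: str, min_chars: int = 200) -> bool:
    """True when the page carries less visible text than the threshold."""
    return len(visible_text(html)) < min_chars
```

A page that trips this check but reads fine in your browser is a candidate for the JavaScript-rendering case described above.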
Next step
Grab a free key and run both calls on the same URL: /v1/smartscraper for the
fields you want, /v1/scrape with clean: true for the whole page. Seeing the two
side by side is the fastest way to learn which shape your model actually wants. A
failed fetch costs nothing, so it is free to try. For pulling fields out of a
specific page type, start with
how to turn a product page into JSON.
Frequently asked questions
How do I get LLM-ready data from a URL?
Decide whether you want specific fields or the whole page. For fields, send the URL to /v1/smartscraper and describe them; you get structured JSON under data.result. For the whole readable page, call /v1/scrape with clean: true to strip the boilerplate. PDFs come back as markdown automatically.
How do I convert a web page to clean text for an LLM?
Call /v1/scrape with clean: true. It strips the scripts, styles, navigation, and ads, and returns the cleaned HTML under data.html, plus page metadata like the title and language. That is the part a model should read, without the surrounding chrome that wastes tokens and confuses extraction.
Can I get a web page as JSON instead of HTML?
Yes. /v1/smartscraper returns structured JSON: send website_url and a user_prompt naming the fields, and the result arrives under data.result, optionally validated against an output_schema. Use /v1/scrape instead when you want the whole page's content rather than a fixed set of fields pulled out of it.
Does it convert pages to markdown?
It depends on the source. For HTML pages, clean: true returns cleaned HTML with the boilerplate stripped, not markdown. PDFs are the markdown path: point /v1/scrape at a PDF URL and the text comes back as markdown under data.html with content_type set to pdf. For structured fields rather than prose, use /v1/smartscraper for JSON.
How do I strip nav, ads, and boilerplate before sending a page to a model?
Set clean: true on a /v1/scrape call. The cleaner removes scripts, styles, navigation, and similar chrome before returning the content, so you send the model the substance of the page rather than its menus and footers. You can also narrow further with include_tags or exclude_tags to keep or drop specific elements.
Can I extract text from a PDF URL?
Yes. Point /v1/scrape at a PDF URL and it is detected and converted, returning the text as markdown under data.html with content_type set to pdf. A PDF whose pages are scanned images has no text layer to extract, so expect little or nothing back from those until they have been run through OCR elsewhere.
How much does it cost to fetch clean content from a URL?
A /v1/scrape call costs 1 credit, billed on success only, so a failed fetch costs nothing. A structured /v1/smartscraper extraction costs 5 credits. If a hostile page forces the stealth browser tier, that adds 2 credits on scrape or 5 on smartscraper, charged only when the stealth tier actually runs.
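The pricing above reduces to a few lines of arithmetic. This sketch uses only the numbers stated in this answer; the function name is illustrative, and treating a failed call as costing nothing in total, stealth surcharge included, is an assumption based on "billed on success only".

```python
BASE_CREDITS = {"scrape": 1, "smartscraper": 5}
STEALTH_SURCHARGE = {"scrape": 2, "smartscraper": 5}


def credits_for(call: str, succeeded: bool = True, stealth: bool = False) -> int:
    """Estimate the credits one call consumes."""
    if call not in BASE_CREDITS:
        raise ValueError(f"unknown call: {call!r}")
    if not succeeded:
        return 0  # assumption: a failed fetch costs nothing at all
    surcharge = STEALTH_SURCHARGE[call] if stealth else 0
    return BASE_CREDITS[call] + surcharge


# Example: 100 clean-page scrapes plus 20 structured extractions,
# none of which needed the stealth tier.
total = 100 * credits_for("scrape") + 20 * credits_for("smartscraper")
```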