How to turn a product page into JSON in one call
Scrape a product page into JSON in one API call: send the URL plus the fields you want, get schema-validated data back, and pay only on success.
| Step | What happens |
|---|---|
| 1. Get a free API key | From the dashboard; it looks like wsg_live_… |
| 2. POST the URL + a prompt | Send the product URL and the fields you want to /v1/smartscraper |
| 3. Read the JSON | Schema-validated fields come back in the data object |
| 4. Pin the types (optional) | Send output_schema to lock field names and types |
| Cost | 5 credits per successful call; a failed fetch costs 0 |
By the end of this guide you will have a single API call that turns a product page into clean JSON with fields like name, price, and availability, and a second version that pins the field types so the shape of the output never drifts. You need three things: an API key, an HTTP client (curl is fine), and a product URL.
The examples target books.toscrape.com, a public sandbox built for scraping
practice, so you can run them end to end before pointing the call at a production
site. Whatever target you use, respect its terms and robots.txt.
Step 1: Point one call at the product page
Send the URL and a plain-language description of the fields you want. You do not write a selector for any of them.
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price as a number, star rating, and whether it is in stock"
  }'
What just happened: the engine fetched the page, stripped the navigation and
boilerplate, read the cleaned content, and returned the fields you named. The
extracted output sits under data.result, next to what it cost and a request id
you can log.
{
  "status": "completed",
  "data": {
    "result": {
      "title": "A Light in the Attic",
      "price": 51.77,
      "rating": 3,
      "in_stock": true
    }
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_..."
}
The status comes back as completed, and credits_used is 5. The data object
also carries the extraction's own request_id and a latency_ms, left out
above. Credits are charged on success only, so if the fetch had failed you would
have been billed nothing.
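If you would rather make the call from code than from curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names `build_request` and `scrape` are hypothetical, the 60-second timeout is an assumption, and the endpoint, headers, and body fields are copied from the curl command above.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this guide.
API_URL = "https://api.webscrape.ai/v1/smartscraper"


def build_request(api_key, website_url, user_prompt, output_schema=None):
    """Build (but do not send) the SmartScraper POST request.

    output_schema is optional; it is the JSON Schema described in step 2.
    """
    payload = {"website_url": website_url, "user_prompt": user_prompt}
    if output_schema is not None:
        payload["output_schema"] = output_schema
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def scrape(request, timeout=60):
    """Send the request and return the decoded JSON envelope.

    The timeout value is an assumption, not a documented limit.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a real network call, so it is left commented out):
# req = build_request(
#     "wsg_live_your_key_here",
#     "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
#     "Extract the book title, price as a number, star rating, and whether it is in stock",
# )
# envelope = scrape(req)
```

Keeping the build step separate from the send step makes the payload easy to inspect or unit-test without spending credits.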
Step 2: Lock the shape with a schema
A prompt is enough to get going, but in a pipeline you want the field names and
types fixed so downstream code can rely on them. Keep the user_prompt and add
an output_schema, a real JSON Schema the result is checked against:
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price, star rating, and stock status",
    "output_schema": {
      "type": "object",
      "properties": {
        "title": { "type": "string" },
        "price": { "type": "number" },
        "rating": { "type": "number" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["title", "price", "in_stock"]
    }
  }'
Now every result is validated against that schema. If the first extraction comes
back with price as a string, the engine gets one automatic repair pass to fix
it before the call returns. You get a number where you asked for a number, or the
call fails cleanly instead of handing you malformed data.
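The validation happens on the server, but some pipelines like a cheap second check on their own side before writing to a database. The sketch below is one way to do that in plain Python. It is an illustration only: `check_flat_object` is a hypothetical helper, it handles just the flat `properties` and `required` layout used in the request above, and it is not a full JSON Schema validator.

```python
# Schema copied from the request in this step.
SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title", "price", "in_stock"],
}

# JSON Schema type names mapped to Python types (only the ones used here).
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
}


def check_flat_object(result, schema):
    """Return a list of problems; an empty list means the result fits."""
    problems = []
    for name in schema.get("required", []):
        if name not in result:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in result:
            continue
        value = result[name]
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        # Python treats True/False as ints, so reject bools for "number".
        is_bool = isinstance(value, bool)
        ok = isinstance(value, expected) and (spec["type"] == "boolean" or not is_bool)
        if not ok:
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems
```

For example, `check_flat_object({"title": "A Light in the Attic", "price": "51.77", "in_stock": True}, SCHEMA)` reports that `price` is a string where a number was expected.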
Step 3: Use the fields you got back
The extracted fields live under data.result, so your code reads
data.result.price and data.result.in_stock directly. credits_remaining lets
you watch your balance without a second request, and the envelope's request_id
is the handle to quote if you ever need to ask us what a specific call did.
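In Python, reading the envelope could look like the sketch below. It uses the sample response from step 1 as input. The helper name `read_result` is hypothetical, and treating any status other than `completed` as an error is an assumption, since this guide only shows the success case.

```python
import json

# Sample envelope copied from step 1 of this guide.
SAMPLE_RESPONSE = """
{
  "status": "completed",
  "data": {
    "result": {
      "title": "A Light in the Attic",
      "price": 51.77,
      "rating": 3,
      "in_stock": true
    }
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_..."
}
"""


def read_result(envelope):
    """Return data.result, or raise with the request_id for support."""
    if envelope.get("status") != "completed":
        # Assumption: anything other than "completed" is treated as a failure.
        raise RuntimeError(
            f"extraction did not complete (request_id={envelope.get('request_id')})"
        )
    return envelope["data"]["result"]


envelope = json.loads(SAMPLE_RESPONSE)
result = read_result(envelope)
price = result["price"]        # a float, ready for arithmetic
in_stock = result["in_stock"]  # a bool
balance = envelope["credits_remaining"]
```

Raising with the `request_id` in the message keeps the one identifier you need for a support question next to the failure in your logs.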
When this breaks
Three failure modes show up often enough to plan for. None of them is a silent surprise if you know where to look.
A field comes back null. The most common cause is that the field is not on
the page, or your prompt asked for something the page does not show. A prompt
asking for the manufacturer's list price on a page that only displays the sale
price will return null for that field, because the value genuinely is not
there. Fix the description to match what the page shows:
- "user_prompt": "Extract the title, the MSRP, and the current price"
+ "user_prompt": "Extract the title and the current price"
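A small guard can surface null fields before they reach downstream code. This is a sketch under stated assumptions: `null_fields` is a hypothetical helper, and the `msrp` key in the example is made up for illustration, since the actual key name depends on your prompt or schema.

```python
def null_fields(result):
    """Return the names of fields whose value is null (None in Python)."""
    return sorted(name for name, value in result.items() if value is None)


# Hypothetical result for a page that shows no list price.
example = {"title": "A Light in the Attic", "msrp": None, "price": 51.77}
missing = null_fields(example)
if missing:
    print("fields not found on the page:", ", ".join(missing))
```

Logging the names rather than silently dropping them makes it obvious which part of the prompt to reword.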
The page comes back as an empty shell. Some product pages render the price with JavaScript after the initial HTML loads. The engine escalates from a fast HTTP request to a headless browser when it detects this, so most of these resolve on their own (the tiered fetch logic covers how it decides to climb). A page can still hand the fast tier a 200 with an empty body that looks like success, and no escalation ladder catches every one. If a value you can see in your own browser is missing, that is the case to suspect.
The site blocks the request. Aggressive anti-bot targets may need the stealth browser tier. You can let the engine escalate into it automatically, or force it when you already know a target is hostile. Stealth adds 5 credits on top of the SmartScraper base, and it is billed only when the stealth tier actually runs, so a page that clears on fast HTTP stays at the base rate.
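The credit arithmetic is simple enough to put in a budget check. The sketch below encodes the numbers stated in this guide: 5 credits per successful call, 5 more when the stealth tier runs, and 0 for a failed fetch. The function names are hypothetical, and the sketch assumes a failed call costs 0 even if stealth was attempted.

```python
BASE_CREDITS = 5        # SmartScraper base, per successful call
STEALTH_SURCHARGE = 5   # added only when the stealth tier actually runs


def credits_for_call(succeeded, stealth_ran=False):
    """Credits billed for one call under the pricing described above."""
    if not succeeded:
        return 0  # assumption: a failed call is free regardless of tier
    return BASE_CREDITS + (STEALTH_SURCHARGE if stealth_ran else 0)


def calls_affordable(credits_remaining, stealth_ran=False):
    """How many successful calls the remaining balance covers."""
    return credits_remaining // credits_for_call(True, stealth_ran)
```

With the `credits_remaining` of 495 from the sample response, that is 99 more calls at the base rate, or 49 if every one of them needs the stealth tier.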
For a target you will scrape on a schedule, the failure mode that bites later is
the redesign you did not see coming. Pin the working extraction as a Blueprint and
drift surfaces as validation_warnings with the status still completed, rather
than a silent null flowing downstream. The mechanics are in
why structured extraction beats CSS selectors
and the field guide to why scrapers go quiet after a redesign.
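Because a drifted run still reports completed, a scheduled job has to look at the warnings explicitly. The sketch below shows one way to do that. It rests on assumptions this guide does not confirm: that `validation_warnings` is a list, and that it sits either at the top level of the envelope or inside `data`. Check a real response before relying on either location. `drift_warnings` is a hypothetical helper.

```python
def drift_warnings(envelope):
    """Return validation warnings from a response envelope, if any.

    Assumption: the warnings are a list found at the top level or under
    "data". Neither location is confirmed by this guide.
    """
    warnings = envelope.get("validation_warnings")
    if warnings is None:
        warnings = envelope.get("data", {}).get("validation_warnings")
    return list(warnings or [])


def needs_attention(envelope):
    """True when the run completed but reported drift."""
    return envelope.get("status") == "completed" and bool(drift_warnings(envelope))
```

Wiring `needs_attention` into an alert means a redesign shows up as a notification on the first affected run instead of as bad rows discovered later.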
Next step
Grab a free key and run the call above against a product page you actually need. The free credits cover real extractions, and a failed fetch costs nothing, so the only thing a wrong prompt costs you is the time to fix the wording.
Frequently asked questions
How do I scrape a product page into JSON?
Send a POST to /v1/smartscraper with website_url set to the product page and user_prompt naming the fields you want, like title, price, and availability. The extracted JSON comes back under data.result. There are no proxies to rent or selectors to write; the fetch and extraction run in one call.
Can I scrape a product page without writing CSS selectors?
Yes. You describe the fields in plain language or pass a JSON Schema, and the engine reads the cleaned page to find them. Because you name the data rather than its position in the markup, the call keeps working when a redesign moves an element or renames a class, which is where selectors usually break.
How do I get the price as a number instead of a string?
Pass an output_schema that types price as a number. Every result is validated against your schema, and if the first pass returns the wrong type, the engine gets one repair attempt before the call returns. A numeric field therefore comes back as a number, not as text, or the call fails cleanly instead of handing you the wrong type.
Why does my product scrape return null for some fields?
Usually the field is not on the page you fetched, or your prompt named something the page does not show, like an MSRP when only a sale price is listed. Check the rendered page and adjust the field description to match what it actually shows. A genuinely blocked page is a different problem; the engine escalates fetch tiers automatically for that.
How much does it cost to scrape one product page?
A successful SmartScraper call costs 5 credits, billed on success only, so a failed fetch costs nothing. If a hostile page forces the stealth browser tier, that adds 5 credits on top. Proxies, browser rendering, and extraction all come out of the same credit balance, with no separate proxy bill.
Can I scrape product pages that load the price with JavaScript?
Yes. The fetch stack starts with a fast HTTP request and steps up to a headless browser when a page renders its content with JavaScript, then to a stealth browser for the hardest targets. You do not have to choose the tier: by default the engine escalates only as far as the page in front of it forces, though you can force the stealth tier when you already know a target is hostile.
How do I scrape many product pages on a schedule?
Once an extraction works, save it as a Blueprint so repeat runs replay selectors with no AI step, which keeps them fast and cheap. On drift the run returns validation_warnings instead of a silent null. See why structured extraction beats CSS selectors for how the deterministic replay path works.