How to turn a product page into JSON in one call

How to scrape a product page into JSON in one API call: send the URL plus the fields you want, get schema-validated data back, and pay only on success.

TL;DR: scrape a product page into JSON

| Step | What happens |
| --- | --- |
| 1. Get a free API key | From the dashboard; it looks like wsg_live_… |
| 2. POST the URL + a prompt | Send the product URL and the fields you want to /v1/smartscraper |
| 3. Read the JSON | Schema-validated fields come back in the data object |
| 4. Pin the types (optional) | Send output_schema to lock field names and types |
| Cost | 5 credits per successful call; a failed fetch costs 0 |

By the end of this guide you will have a single API call that turns a product page into clean JSON with fields like name, price, and availability, plus a second version that pins the field types so the output never drifts in shape. You need three things: an API key, an HTTP client (curl is fine), and a product URL.

The examples target books.toscrape.com, a public sandbox built for scraping practice, so you can run them end to end before pointing the call at a production site. Whatever target you use, respect its terms and robots.txt.

Step 1: Point one call at the product page

Send the URL and a plain-language description of the fields you want. You do not write a selector for any of them.

bash
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price as a number, star rating, and whether it is in stock"
  }'

What just happened: the engine fetched the page, stripped the navigation and boilerplate, read the cleaned content, and returned the fields you named. The extracted output sits under data.result, next to what it cost and a request id you can log.

json
{
  "status": "completed",
  "data": {
    "result": {
      "title": "A Light in the Attic",
      "price": 51.77,
      "rating": 3,
      "in_stock": true
    }
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_..."
}

The status comes back as completed, and credits_used is 5. The data object also carries the extraction's own request_id and a latency_ms, left out above. Credits are charged on success only, so if the fetch had failed you would have been billed nothing.
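
If you would rather make the call from code than from curl, here is a minimal Python sketch of the same request using only the standard library. The endpoint, header names, and body fields are copied from the curl command above; the function names and the 60-second timeout are assumptions made for illustration, not part of the API.

```python
import json
import urllib.request

API_URL = "https://api.webscrape.ai/v1/smartscraper"


def build_request(api_key: str, website_url: str, user_prompt: str) -> urllib.request.Request:
    """Build the same POST the curl command sends, without sending it."""
    body = json.dumps(
        {"website_url": website_url, "user_prompt": user_prompt}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def scrape(api_key: str, website_url: str, user_prompt: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON envelope.

    The timeout value is an assumption; tune it for your own targets.
    """
    req = build_request(api_key, website_url, user_prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling scrape(...) with your key, the product URL, and the prompt should give you the envelope shown above as a dict, so the extracted fields are at envelope["data"]["result"]. Keeping build_request separate lets you inspect or log the request before anything goes over the wire.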

Step 2: Lock the shape with a schema

A prompt is enough to get going, but in a pipeline you want the field names and types fixed so downstream code can rely on them. Keep the user_prompt and add an output_schema, a real JSON Schema the result is checked against:

bash
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "user_prompt": "Extract the book title, price, star rating, and stock status",
    "output_schema": {
      "type": "object",
      "properties": {
        "title":    { "type": "string" },
        "price":    { "type": "number" },
        "rating":   { "type": "number" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["title", "price", "in_stock"]
    }
  }'

Now every result is validated against that schema. If the first extraction comes back with price as a string, the engine gets one automatic repair pass to fix it before the call returns. You get a number where you asked for a number, or the call fails cleanly instead of handing you malformed data.
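
To make concrete what "validated against that schema" means for your own code, here is a small local check in Python. It is a sketch, not the engine's validator: it only understands the flat schema used above (the four type names, plus required), and check_result is a hypothetical helper name.

```python
from __future__ import annotations

# Maps the JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool, "object": dict}


def check_result(result: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the result fits."""
    problems = []
    for name in schema.get("required", []):
        if name not in result:
            problems.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in result:
            continue  # optional field that was not returned
        value = result[name]
        expected = JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so a bare isinstance check
        # would wrongly accept True as a "number".
        is_bool = isinstance(value, bool)
        ok = isinstance(value, expected) and (spec["type"] == "boolean" or not is_bool)
        if not ok:
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems
```

Run against the Step 1 result, this returns an empty list. Fed a result where price is the string "51.77", it reports the mismatch, which is the kind of drift the schema is there to stop before it reaches downstream code.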

Step 3: Use the fields you got back

The extracted fields live under data.result, so your code reads data.result.price and data.result.in_stock directly. credits_remaining lets you watch your balance without a second request, and the envelope's request_id is the handle to quote if you ever need to ask us what a specific call did.
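
As a sketch of that read path in Python, the helper below pulls the fields out of the envelope shown in Step 1. The ScrapeOutcome class and parse_envelope function are illustrative names, and treating any status other than completed as an error is an assumption, since only the completed case is shown here.

```python
import json
from dataclasses import dataclass


@dataclass
class ScrapeOutcome:
    result: dict            # the extracted fields from data.result
    credits_used: int
    credits_remaining: int
    request_id: str         # the envelope's request_id, worth logging


def parse_envelope(raw: str) -> ScrapeOutcome:
    """Decode a response body and pick out the parts most callers need."""
    envelope = json.loads(raw)
    # Assumption: anything other than "completed" is not safe to read from.
    if envelope.get("status") != "completed":
        raise RuntimeError(f"scrape not completed: {envelope.get('status')!r}")
    return ScrapeOutcome(
        result=envelope["data"]["result"],
        credits_used=envelope["credits_used"],
        credits_remaining=envelope["credits_remaining"],
        request_id=envelope["request_id"],
    )
```

With this in place, outcome.result["price"] and outcome.result["in_stock"] are the values your pipeline consumes, and outcome.request_id goes into your logs.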

When this breaks

Three failure modes show up often enough to plan for. None of them is a silent surprise if you know where to look.

A field comes back null. The most common cause is that the field is not on the page, or your prompt asked for something the page does not show. A prompt asking for the manufacturer's list price on a page that only displays the sale price will return null for that field, because the value genuinely is not there. Fix the description to match what the page shows:

diff
- "user_prompt": "Extract the title, the MSRP, and the current price"
+ "user_prompt": "Extract the title and the current price"
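
To catch this in code rather than by eye, you can list which of the fields you asked for came back null or missing. A minimal Python sketch; null_fields is a hypothetical helper, and the msrp key is only an example of a field the page does not show.

```python
from __future__ import annotations


def null_fields(result: dict, wanted: list[str]) -> list[str]:
    """Names from `wanted` that are absent from the result or set to null."""
    return [name for name in wanted if result.get(name) is None]


# Example: the page lists a sale price but no MSRP.
result = {"title": "A Light in the Attic", "msrp": None, "price": 51.77}
print(null_fields(result, ["title", "msrp", "price"]))  # ['msrp']
```

A non-empty list is the cue to compare your prompt against the rendered page before assuming the scrape itself failed.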

The page comes back as an empty shell. Some product pages render the price with JavaScript after the initial HTML loads. The engine escalates from a fast HTTP request to a headless browser when it detects this, so most of these resolve on their own (the tiered fetch logic covers how it decides to climb). A page can still hand the fast tier a 200 with an empty body that looks like success, and no escalation ladder catches every one. If a value you can see in your own browser is missing, that is the case to suspect.

The site blocks the request. Aggressive anti-bot targets may need the stealth browser tier. You can let the engine escalate into it automatically, or force it when you already know a target is hostile. Stealth adds 5 credits on top of the SmartScraper base, and it is billed only when the stealth tier actually runs, so a page that clears on fast HTTP stays at the base rate.
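
The billing rule is simple enough to write down as arithmetic. The Python sketch below encodes only the numbers stated in this guide (5 credits base, 5 more when stealth runs, 0 on failure); the function names are illustrative.

```python
BASE_CREDITS = 5        # SmartScraper base rate per successful call
STEALTH_SURCHARGE = 5   # added only when the stealth tier actually runs


def credits_for_call(succeeded: bool, stealth_ran: bool) -> int:
    """Credits billed for one call; failures are free."""
    if not succeeded:
        return 0
    return BASE_CREDITS + (STEALTH_SURCHARGE if stealth_ran else 0)


def calls_affordable(balance: int, stealth_ran: bool) -> int:
    """How many successful calls a credit balance covers at one rate."""
    return balance // credits_for_call(True, stealth_ran)


print(credits_for_call(True, False))   # 5
print(credits_for_call(True, True))    # 10
print(calls_affordable(495, False))    # 99
```

So the 495 credits left after the Step 1 call cover 99 more successful extractions at the base rate, or 49 if every one of them needs the stealth tier.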

For a target you will scrape on a schedule, the failure mode that bites later is the redesign you did not see coming. Pin the working extraction as a Blueprint and drift surfaces as validation_warnings with the status still completed, rather than a silent null flowing downstream. The mechanics are in why structured extraction beats CSS selectors and the field guide to why scrapers go quiet after a redesign.
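
Because drift arrives with the status still completed, a scheduled job has to look for the warnings explicitly. The Python sketch below shows one way to do that. It rests on assumptions: this guide does not say where validation_warnings sits in the response or what each entry looks like, so the helper checks both the envelope and the data object and treats the value as a list. The warning text in the example is made up.

```python
def drift_warnings(envelope: dict) -> list:
    """Collect validation_warnings from the envelope or its data object.

    Assumption: the key holds a list and lives in one of these two places.
    """
    found = []
    for container in (envelope, envelope.get("data") or {}):
        warnings = container.get("validation_warnings")
        if warnings:
            found.extend(warnings)
    return found


# Hypothetical response: completed, but the pinned extraction has drifted.
envelope = {
    "status": "completed",
    "data": {"result": {"title": "A Light in the Attic", "price": None}},
    "validation_warnings": ["price: no value found"],
}
if envelope["status"] == "completed" and drift_warnings(envelope):
    print("drift detected:", drift_warnings(envelope))
```

Wiring a check like this into the job that consumes the results turns a redesign into an alert instead of a column of nulls.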

Next step

Grab a free key and run the call above against a product page you actually need. The free credits cover real extractions, and a failed fetch costs nothing, so the only thing a wrong prompt costs you is the time to fix the wording.

Frequently asked questions

How do I scrape a product page into JSON?

Send a POST to /v1/smartscraper with website_url set to the product page and user_prompt naming the fields you want, like title, price, and availability. The extracted JSON comes back under data.result. There are no proxies to rent or selectors to write; the fetch and extraction run in one call.

Can I scrape a product page without writing CSS selectors?

Yes. You describe the fields in plain language or pass a JSON Schema, and the engine reads the cleaned page to find them. Because you name the data rather than its position in the markup, the call keeps working when a redesign moves an element or renames a class, which is where selectors usually break.

How do I get the price as a number instead of a string?

Pass an output_schema that types price as a number. Every result is validated against your schema, and if the first pass returns the wrong type, the engine gets one repair attempt before the call returns. A numeric field comes back as a number, not as text, or the call fails cleanly instead of handing you malformed data.

Why does my product scrape return null for some fields?

Usually the field is not on the page you fetched, or your prompt named something the page does not show, like an MSRP when only a sale price is listed. Check the rendered page and rewrite the field description to match what it shows. A genuinely blocked page is a different problem; the engine escalates fetch tiers automatically for that.

How much does it cost to scrape one product page?

A successful SmartScraper call costs 5 credits, billed on success only, so a failed fetch costs nothing. If a hostile page forces the stealth browser tier, that adds 5 credits on top. Proxies, browser rendering, and extraction all come out of the same credit balance, with no separate proxy bill.

Can I scrape product pages that load the price with JavaScript?

Yes. The fetch stack starts with a fast HTTP request and steps up to a headless browser when a page renders its content with JavaScript, then to a stealth browser for the hardest targets. You do not have to choose the tier: the engine escalates only as far as the page in front of it forces, though you can force the stealth tier when you already know a target is hostile.

How do I scrape many product pages on a schedule?

Once an extraction works, save it as a Blueprint so repeat runs replay selectors with no AI step, which keeps them fast and cheap. On drift the run returns validation_warnings instead of a silent null. See why structured extraction beats CSS selectors for how the deterministic replay path works.