What is structured data extraction?

Structured data extraction turns a web page into schema-validated JSON by naming the fields you want, not the CSS selectors that break when a site redesigns.

Structured data extraction at a glance
| Question | In one line |
| --- | --- |
| What is it? | A hosted service that reads a web page and returns the fields you asked for as JSON. |
| What do you send? | A URL and a plain-language prompt; optionally a JSON Schema to pin the field names and types. |
| What do you get back? | Schema-validated JSON shaped to your fields, in one API call. |
| How is it different from CSS selectors? | You describe the data, not its DOM location, so a redesign doesn't break it. |
| When is it the wrong tool? | A high-volume feed off one stable page: pin selectors there instead. |
| What does it cost? | 5 credits per successful call; a failed fetch costs 0. |

What is structured data extraction?

Structured data extraction reads a web page and returns the specific fields you asked for as JSON. You send a URL and a description of what you want, such as the product name, price, and whether it is in stock, and you get back data shaped to match. The work of fetching the page, stripping the boilerplate, and turning the content into typed fields happens for you, in one call.

The contrast that defines it is with the older way: hand-written CSS or XPath selectors that encode where data sat in one snapshot of the DOM. A selector is a bet that the page structure will not change, and that is a bet you keep losing as teams refactor components and A/B-test layouts.

Key takeaways

  • You describe the fields you want, not their position in the markup, so the request survives a redesign that would break a selector.
  • The output is validated against a schema with one automatic repair pass, which guarantees the shape of the data even though the extraction itself uses a model.
  • Fetching, cleaning, and extraction run as one pipeline behind a single API call, billed on success only.
  • For a stable page you hit at high volume, selectors are still cheaper at run time. Most pipelines use both, and we say so below.

How does structured data extraction work?

Every request runs through the same pipeline you would otherwise build and maintain yourself.

  1. Fetch the page. A tiered stack starts with a fast HTTP request and escalates to a headless or stealth browser only when a site needs JavaScript to render or actively blocks the cheaper option. The escalation logic is covered in our post on how the fetch tiers decide when to escalate.
  2. Clean the content. Navigation, ads, and other chrome are stripped before extraction runs, so the model works on the content and not the furniture around it. Less noise means faster, cheaper extraction.
  3. Extract the fields. You describe the fields in plain language or pass a JSON Schema. The engine reads the cleaned page and returns data shaped to match what you asked for.
  4. Validate the shape. The result is checked against your schema. If it does not conform, the engine gets one structured repair attempt before the call fails. This is what lets an AI step sit in a pipeline without handing you a string where you asked for a number.

None of those steps asks you to name a CSS class or count nth-child positions. You name the data, and the pipeline finds it.
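The validate-then-repair step can be sketched in a few lines. This is a minimal illustration, not the service's implementation: `validate`, `extract_with_one_repair`, and the stub `stub_extract` and `stub_repair` callables are hypothetical names, and the type check covers only a flat schema with string, number, and boolean fields.

```python
# Hypothetical sketch of "validate, repair once, then fail".
# Handles only flat schemas and three JSON types.

TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price", "in_stock"],
}


def validate(result, schema):
    """Return a list of problems; an empty list means the shape is valid."""
    errors = []
    for field, spec in schema["properties"].items():
        if field not in result:
            if field in schema.get("required", []):
                errors.append(f"{field}: missing")
            continue
        if not TYPE_CHECKS[spec["type"]](result[field]):
            errors.append(f"{field}: expected {spec['type']}")
    return errors


def extract_with_one_repair(extract, repair, schema):
    """Run extraction, allow exactly one repair attempt, then give up."""
    result = extract()
    errors = validate(result, schema)
    if errors:
        result = repair(result, errors)  # the single repair pass
        errors = validate(result, schema)
    if errors:
        raise ValueError("; ".join(errors))
    return result


def stub_extract():
    # Stand-in for the model step: price comes back as a string by mistake.
    return {"name": "Aeron Chair", "price": "1395.00", "in_stock": True}


def stub_repair(result, errors):
    # Stand-in for the repair step: coerce the price to a number.
    fixed = dict(result)
    fixed["price"] = float(fixed["price"])
    return fixed


repaired = extract_with_one_repair(stub_extract, stub_repair, SCHEMA)
print(repaired)
```

The caller either gets a result that matches the schema or an error; a wrongly typed value never passes through quietly.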

How is it different from writing CSS selectors?

A selector encodes a location: "the text in the third span." Structured extraction encodes a meaning: "the current price as a number." When a redesign moves the price or renames the class, the location breaks and the meaning still points at the same thing. That difference is the whole reason this approach exists, and it is the subject of our deeper writeup on why structured extraction beats CSS selectors.
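To make the location-versus-meaning contrast concrete, here is a small sketch using Python's standard library. The two markup snippets are invented snapshots of the same product card before and after a redesign, and `price_by_position` is a hypothetical helper standing in for a hand-written selector.

```python
import xml.etree.ElementTree as ET

# Two made-up snapshots of one product card, before and after a redesign.
BEFORE = (
    "<div class='product'>"
    "<span>Aeron Chair</span><span>In stock</span><span>1395.00</span>"
    "</div>"
)
AFTER = (
    "<div class='product'>"
    "<span>Aeron Chair</span><p class='price-v2'>1395.00</p><span>In stock</span>"
    "</div>"
)


def price_by_position(markup):
    """Location-based lookup: 'the text in the third span'."""
    node = ET.fromstring(markup).find("./span[3]")
    return node.text if node is not None else None


# A meaning-based request names the field; nothing in it refers to the markup,
# so the same body is sent before and after the redesign.
REQUEST_BODY = {
    "website_url": "https://example.com/product/42",
    "user_prompt": "Extract the current price as a number",
}

print(price_by_position(BEFORE))  # 1395.00
print(price_by_position(AFTER))   # None
```

The price is still on the page in the second snapshot; only the path to it changed, and the path is all the selector knows.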

Selectors are not useless, though. On one stable page you hit at high volume, a pinned selector is cheaper and faster at run time than a live model call, because there is no model in the loop. That is one axis on which selectors genuinely win, and it is why we ship Blueprints: save a working extraction, and the engine reverse-engineers selectors from that known-good result and replays them with no AI step. When the page later drifts, a Blueprint run does not go silent. It comes back with validation_warnings naming the fields that stopped matching, and its status stays completed.
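Because a drifted Blueprint run still reports completed, the caller has to look at validation_warnings rather than at the status alone. The sketch below shows one way to do that. The exact shape and position of validation_warnings are assumptions here (a top-level list of field-name strings), the sample response is invented, and `drifted_fields` and `next_action` are hypothetical helpers.

```python
def drifted_fields(response):
    """Return the field names flagged by a Blueprint run, or an empty list."""
    # ASSUMPTION: warnings arrive as a top-level list of field-name strings.
    return list(response.get("validation_warnings") or [])


def next_action(response):
    """Decide what to do with a Blueprint response."""
    if response.get("status") != "completed":
        return "retry"
    if drifted_fields(response):
        return "repin_or_live_call"
    return "use_result"


# Invented example of a run where the price selector stopped matching.
drifted = {
    "status": "completed",
    "data": {"result": {"name": "Aeron Chair", "price": None, "in_stock": True}},
    "validation_warnings": ["price"],
}
clean = {
    "status": "completed",
    "data": {"result": {"name": "Aeron Chair", "price": 1395.0, "in_stock": True}},
}

print(next_action(drifted))  # repin_or_live_call
print(next_action(clean))    # use_result
```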

When should you use it, and when not?

Reach for structured extraction when you are hitting many sites whose structure keeps moving, when you do not want to own selector upkeep, or when you are prototyping and the schema is still changing. Describing the fields is faster to write and survives the churn. The walkthrough for a single page is in how to turn a product page into JSON in one call.

Reach for pinned selectors, by way of a Blueprint, when you are hammering one stable site at high volume and want the cheapest, most repeatable run. The honest answer for most pipelines is both: describe the data while a page is new or volatile, pin a Blueprint once it settles, and let the warnings tell you when to revisit.

What does a request look like?

Point one call at a page and ask for the fields you care about:

```bash
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://example.com/product/42",
    "user_prompt": "Extract the product name, price, and whether it is in stock"
  }'
```

The extracted fields come back under data.result, alongside the credits spent and a request id:

```json
{
  "status": "completed",
  "data": {
    "result": {
      "name": "Aeron Chair",
      "price": 1395.00,
      "in_stock": true
    }
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_..."
}
```

The data object also carries the extraction's own request_id and a latency_ms, trimmed here for clarity. New accounts get free credits, so you can run a real extraction before picking a plan. Point a key at a page you keep re-scraping by hand and see what comes back.
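Reading that response in code is a matter of checking the status and pulling out data.result. The sketch below parses the sample response shown above with Python's standard library; `unpack` is a hypothetical helper, and treating any status other than completed as an error is an assumption, since other status values are not described here.

```python
import json

# The sample response from this guide, as the raw text a client would receive.
RAW = """
{
  "status": "completed",
  "data": {
    "result": {
      "name": "Aeron Chair",
      "price": 1395.00,
      "in_stock": true
    }
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_..."
}
"""


def unpack(raw):
    """Return (extracted fields, credits spent) from a response body."""
    body = json.loads(raw)
    # ASSUMPTION: anything other than "completed" is treated as a failure.
    if body.get("status") != "completed":
        raise RuntimeError(f"extraction not completed: {body.get('status')}")
    return body["data"]["result"], body["credits_used"]


fields, spent = unpack(RAW)
print(fields["name"], fields["price"], fields["in_stock"], spent)
```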

Frequently asked questions

What is structured data extraction?

Structured data extraction reads a web page and returns the specific fields you asked for as JSON. You describe what you want, like the product name and price, instead of pointing at where it sits in the page markup. A hosted service handles fetching, cleaning, and shaping the output.

What is the difference between structured data extraction and web scraping?

Web scraping is the broad act of pulling content off a page. Structured data extraction is the step that turns that content into typed fields you can use downstream. Scraping gets you the HTML; extraction gets you the name, price, and availability as JSON that matches a schema.

Do I need to write CSS selectors for structured data extraction?

No. You describe the fields in plain language or pass a JSON Schema, and the engine reads the cleaned page to find them. Because the request names the data rather than its position in the DOM, a moved element or renamed class does not break the call the way a hand-written selector would.

Can structured data extraction return data in a fixed schema?

Yes. Pass an output_schema with your field names and types, and every result is validated against it. If the first result does not conform, the engine gets one automatic repair pass, and the call succeeds only if the repaired result validates. That guarantees the shape of any output you receive, so you will not get a string where you asked for a number.
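A request that pins the schema might be built like this. The endpoint, header, and the website_url and user_prompt fields come from the curl example in this guide; placing output_schema as a top-level key of the request body is an assumption, and `build_request` is a hypothetical helper. The request is constructed but not sent.

```python
import json
import urllib.request

API_URL = "https://api.webscrape.ai/v1/smartscraper"

# A JSON Schema naming the fields and their types.
OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price", "in_stock"],
}


def build_request(api_key, page_url, prompt, schema):
    """Build (but do not send) a POST request that includes a schema."""
    body = {
        "website_url": page_url,
        "user_prompt": prompt,
        # ASSUMPTION: output_schema sits at the top level of the body.
        "output_schema": schema,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "wsg_live_your_key_here",
    "https://example.com/product/42",
    "Extract the product name, price, and whether it is in stock",
    OUTPUT_SCHEMA,
)
# To send it: urllib.request.urlopen(request)
print(json.loads(request.data)["output_schema"]["required"])
```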

Is AI-based extraction reliable enough for a production pipeline?

AI extraction is not deterministic by nature, so each result is validated against your schema with one repair pass before the call succeeds. For run-to-run repeatability on a stable page, save a working extraction as a Blueprint and replay it with no AI step, which makes the run fast and identical each time.

What happens when a page changes after I set up extraction?

If you run a live extraction, the request describes the fields rather than their position, so a moved price or renamed class still resolves. If you pinned a Blueprint, drift surfaces as validation_warnings naming the fields that stopped matching, and the status stays completed. It will not hand you a silent null; you re-pin or fall back to a live call.

How much does a structured data extraction request cost?

A successful SmartScraper call costs 5 credits, billed on success only, so a fetch that fails costs nothing. Proxies, browser rendering, and the extraction all come out of one credit balance. There is no separate proxy invoice or browser-minutes line item to reconcile at the end of the month.
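The billing rule is simple enough to check by hand, but a short sketch makes the arithmetic explicit. Only the 5-credit price and the bill-on-success rule come from this guide; `credits_spent` and `calls_affordable` are hypothetical helpers.

```python
CREDITS_PER_SUCCESS = 5  # a failed fetch costs 0


def credits_spent(outcomes):
    """outcomes: iterable of booleans, True for each successful call."""
    return sum(CREDITS_PER_SUCCESS for ok in outcomes if ok)


def calls_affordable(balance):
    """How many successful calls a credit balance covers."""
    return balance // CREDITS_PER_SUCCESS


# Three successes and one failed fetch.
print(credits_spent([True, True, False, True]))  # 15
# The balance shown in the sample response covers this many more calls.
print(calls_affordable(495))  # 99
```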

Can I extract data from JavaScript-rendered pages?

Yes. The fetch stack starts with a fast HTTP request and escalates to a headless or stealth browser when a page needs JavaScript to render or fights back. You do not pick the tier; the engine escalates only as far as the page forces, so most pages stay on the cheaper path.
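The escalation idea can be sketched as a loop over tiers that stops at the first one that returns content. This illustrates the behavior described above, not the engine's actual logic: the tier names mirror the description, while `fetch_with_escalation` and the stub fetchers are hypothetical and return canned strings instead of touching the network.

```python
TIERS = ["http", "headless", "stealth"]  # cheapest first


def fetch_with_escalation(url, fetchers):
    """Try each tier in order and stop at the first that returns content."""
    for tier in TIERS:
        html = fetchers[tier](url)
        if html is not None:
            return tier, html
    raise RuntimeError(f"all tiers failed for {url}")


# Stub fetchers: None means "this tier could not get usable content".
def plain_http(url):
    return "<html>static</html>" if "static" in url else None


def headless(url):
    return "<html>rendered</html>" if "blocked" not in url else None


def stealth(url):
    return "<html>rendered via stealth</html>"


FETCHERS = {"http": plain_http, "headless": headless, "stealth": stealth}

for page in ("https://example.com/static",
             "https://example.com/spa",
             "https://example.com/blocked"):
    tier, _ = fetch_with_escalation(page, FETCHERS)
    print(page, "->", tier)
```

Each page climbs only as far as it has to, which is why most pages stay on the cheaper path.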