Why structured extraction beats CSS selectors
Hand-written selectors break every time a site redesigns. Describing the data you want survives the redesign. Here's the tradeoff, and how we keep the AI version repeatable.
If you've ever maintained a scraper, you know how it ends. It runs fine for a few
weeks. Then a redesign moves the price from .product__price to .pdp-price-now,
and your pipeline starts emitting null without a single error in the logs. A
selector encodes where data sat in one snapshot of the DOM, and the DOM is the
least stable thing about a web page.
The problem with positional selectors
A CSS or XPath selector is a bet that the page's structure won't change, and it's
a bet you lose constantly. Markup is an implementation detail: teams refactor
components, swap frameworks, and A/B-test layouts without changing what the page
actually means. On top of that, positional selectors are brittle by construction.
Something like div > div:nth-child(3) > span is a fossil of one render, and a
single inserted element shifts every index after it. Worst of all, the failures
are silent: a missing selector returns nothing rather than an error, so the bad
data flows downstream until someone notices the dashboard is wrong.
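The silent failure is easy to reproduce. The sketch below uses Python's standard-library ElementTree on two hand-written page snippets. The markup is invented for illustration, and a real scraper would use a proper HTML parser. The same lookup returns a value before the redesign and None after it, and nothing is raised either time.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical markup: the same product page before and after a redesign.
OLD_PAGE = '<div><h1>Widget</h1><span class="product__price">19.99</span></div>'
NEW_PAGE = '<div><h1>Widget</h1><span class="pdp-price-now">19.99</span></div>'

# The selector is pinned to the old class name.
PRICE_PATH = ".//span[@class='product__price']"


def scrape_price(markup: str) -> Optional[str]:
    """Return the text of the first match, or None when nothing matches."""
    root = ET.fromstring(markup)
    return root.findtext(PRICE_PATH)


before = scrape_price(OLD_PAGE)  # matches the old class name
after = scrape_price(NEW_PAGE)   # no match and no exception, only None

print(before, after)
```

Nothing in the second call distinguishes "the product has no price" from "the selector is stale", which is why the bad value travels downstream unnoticed.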
Describe the data, not its location
Structured extraction works the other way around. Instead of asking for "the text in the third span", you ask for "the current price as a number". The engine reads the cleaned page and returns data that matches what you asked for:
{
  "url": "https://example.com/product/42",
  "output_schema": {
    "name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean"
  }
}
That block is simplified to make the point. The live API takes a real JSON Schema
in output_schema (or a plain-language prompt); what matters is that you name the
fields and their types, not where they sit in the DOM.
Because the request describes what you want rather than where it sits, it survives the redesign. Move the price or rename the class, and "the current price as a number" still points at the same thing.
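For comparison, here is roughly what the simplified request above might look like with a full JSON Schema in output_schema. This is a sketch under assumptions: only the url and output_schema field names come from the example above, and the choice to mark every field as required and to forbid extra properties is illustrative, not something the API is known to demand.

```python
import json

# A standard JSON Schema describing the same four fields.
# Treating every field as required is an assumption made for this sketch.
output_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price", "currency", "in_stock"],
    "additionalProperties": False,
}

# Only "url" and "output_schema" are taken from the example in the text.
request_body = {
    "url": "https://example.com/product/42",
    "output_schema": output_schema,
}

payload = json.dumps(request_body, indent=2)
print(payload)
```

Note that nothing in the payload mentions a class name, an element, or a position in the page.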
What about determinism?
The obvious objection: AI extraction isn't deterministic by nature, so how do you trust it in a pipeline? There are two answers, and it's worth being precise about what each one actually buys you.
First, schema validation with one repair pass. Every result is checked against your schema. If it doesn't conform, the engine gets one structured attempt to fix it before the request fails. This guarantees the shape of the output, so you won't get a string where you asked for a number, or a silently missing field. It does not, by itself, promise the extraction picked the right value. That's what the second part is for.
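The validate, repair once, then fail flow can be sketched in a few lines. This is not the engine's implementation: the type checker handles only the three simplified types used earlier, and violations, validate_with_one_repair, and coerce_price are hypothetical names, with coerce_price standing in for whatever the real repair attempt does.

```python
from typing import Any, Callable, Dict, List

# Checks for the simplified type names used in the example request.
# bool is excluded from "number" because Python treats bool as an int.
TYPE_CHECKS: Dict[str, Callable[[Any], bool]] = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def violations(result: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """List every missing or wrongly typed field."""
    problems = []
    for field, type_name in schema.items():
        if field not in result:
            problems.append(f"{field}: missing")
        elif not TYPE_CHECKS[type_name](result[field]):
            problems.append(f"{field}: expected {type_name}")
    return problems


def validate_with_one_repair(result, schema, repair):
    """Accept a conforming result, allow exactly one repair, else fail."""
    problems = violations(result, schema)
    if not problems:
        return result
    repaired = repair(result, problems)  # the single repair attempt
    remaining = violations(repaired, schema)
    if remaining:
        raise ValueError(f"schema validation failed after repair: {remaining}")
    return repaired


SCHEMA = {"name": "string", "price": "number", "currency": "string", "in_stock": "boolean"}


def coerce_price(result, problems):
    """Hypothetical stand-in for the repair step: parse the price string."""
    fixed = dict(result)
    fixed["price"] = float(fixed["price"])
    return fixed


raw = {"name": "Widget", "price": "19.99", "currency": "USD", "in_stock": True}
clean = validate_with_one_repair(raw, SCHEMA, coerce_price)
print(clean)
```

The check confirms that price is a number. It has no way of knowing whether 19.99 is the right number, which is the limit described above.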
Second, Blueprints. Once an extraction is working, save it as a Blueprint.
The engine reverse-engineers selectors from that known-good result, and afterward
replays them with no AI step at run time, which makes runs fast and identical
every time. This is where real run-to-run determinism comes from, because there's
no AI step left to vary. These are still selectors, so they can still drift when a
page changes enough. The difference is the failure mode. When a field stops
matching, a Blueprint run doesn't hand you a silent null: it comes back with
validation_warnings naming exactly which fields drifted, while its status stays
completed. That warning is
your signal to re-pin the Blueprint or drop back to a live SmartScraper call: the
cheap, deterministic path while the page holds, and an explicit heads-up the
moment it doesn't.
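Because a drifted run still reports completed, a caller has to look at the warnings and not only the status. The sketch below shows one way to route on that. The result shape is assumed: only the validation_warnings and status names come from the description above. Treating the warnings as a list of field names, along with the data key and the next_step helper, is hypothetical.

```python
from typing import Any, Dict, List


def drifted_fields(run: Dict[str, Any]) -> List[str]:
    """Return the fields named in validation_warnings (assumed: a list of names)."""
    return list(run.get("validation_warnings") or [])


def next_step(run: Dict[str, Any]) -> str:
    """Decide what to do with a Blueprint run result."""
    if run.get("status") != "completed":
        return "retry"     # the run itself did not finish
    if drifted_fields(run):
        return "fallback"  # re-pin the Blueprint or make a live call
    return "use"           # selectors still match, keep the cheap path


# Hypothetical results for illustration only.
healthy = {"status": "completed", "data": {"price": 19.99}, "validation_warnings": []}
drifted = {"status": "completed", "data": {"price": None}, "validation_warnings": ["price"]}

print(next_step(healthy), next_step(drifted), drifted_fields(drifted))
```

Checking status alone would treat both runs as healthy, so the warnings check is what turns drift into an explicit signal.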
When to reach for which
None of this makes selectors useless. It makes them the wrong default. If you're hammering one stable site at high volume, pin a Blueprint and let it run. When you're hitting many sites where the structure keeps moving, or you just don't want to own the upkeep, describe the data and let the engine find it. For most pipelines, the honest answer is to run both.