Why structured extraction beats CSS selectors

Hand-written selectors break every time a site redesigns. Describing the data you want survives the redesign. Here's the tradeoff, and how we keep the AI version repeatable.

If you've ever maintained a scraper, you know how it ends. It runs fine for a few weeks. Then a redesign moves the price from .product__price to .pdp-price-now, and your pipeline starts emitting null without a single error in the logs. A selector encodes where data sat in one snapshot of the DOM, and the DOM is the least stable thing about a web page.

The problem with positional selectors

A CSS or XPath selector is a bet that the page's structure won't change, and it's a bet you lose constantly. Markup is an implementation detail: teams refactor components, swap frameworks, and A/B-test layouts without changing what the page actually means. On top of that, positional selectors are brittle by construction. Something like div > div:nth-child(3) > span is a fossil of one render, and a single inserted element shifts every index after it. Worst of all, the failures are silent: a selector that no longer matches returns nothing rather than raising an error, so the bad data flows downstream until someone notices the dashboard is wrong.
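
To make the silent failure concrete, here is a minimal sketch using only Python's standard library. The markup strings are invented for illustration, and ElementTree's limited XPath stands in for whatever selector engine a real scraper would use. Neither lookup raises after the page changes: one quietly returns None, the other quietly returns the wrong text.

```python
import xml.etree.ElementTree as ET

# Hypothetical product markup before and after a redesign renames the price class.
BEFORE = "<div><h1>Widget</h1><span class='product__price'>19.99</span></div>"
AFTER = "<div><h1>Widget</h1><span class='pdp-price-now'>19.99</span></div>"

# The redesigned page once a "New" badge is inserted ahead of the price.
WITH_BADGE = (
    "<div><h1>Widget</h1><span class='badge'>New</span>"
    "<span class='pdp-price-now'>19.99</span></div>"
)


def price_by_class(markup):
    """Look the price up by the class name it had in one snapshot."""
    node = ET.fromstring(markup).find(".//span[@class='product__price']")
    return node.text if node is not None else None


def price_by_position(markup):
    """Look the price up as 'the first span under the root element'."""
    node = ET.fromstring(markup).find("./span[1]")
    return node.text if node is not None else None


print(price_by_class(BEFORE))         # 19.99
print(price_by_class(AFTER))          # None, and no exception anywhere
print(price_by_position(WITH_BADGE))  # New, which is not a price at all
```

The positional case is arguably the worse of the two: a None can at least be caught by a null check, while "New" looks like perfectly valid text until something downstream tries to treat it as a number.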

Describe the data, not its location

Structured extraction works the other way around. Instead of asking for "the text in the third span", you ask for "the current price as a number". The engine reads the cleaned page and returns data that matches what you asked for:

```json
{
  "url": "https://example.com/product/42",
  "output_schema": {
    "name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean"
  }
}
```

That block is simplified to make the point. The live API takes a real JSON Schema in output_schema (or a plain-language prompt); what matters is that you name the fields and their types, not where they sit in the DOM.
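
For a sense of what the non-simplified version might look like, the sketch below expands the same four fields into a standard JSON Schema object. The helper name to_json_schema is hypothetical, and which schema keywords the live API honors (required and additionalProperties are used here) is an assumption, not something confirmed above. Only the url and output_schema keys come from the example request.

```python
import json

# The simplified field -> type mapping from the example request.
FIELDS = {
    "name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean",
}


def to_json_schema(fields):
    """Expand a flat field -> type mapping into a JSON Schema object (hypothetical helper)."""
    return {
        "type": "object",
        "properties": {name: {"type": type_name} for name, type_name in fields.items()},
        # Assumption: every field is mandatory and no extra fields are wanted.
        "required": list(fields),
        "additionalProperties": False,
    }


payload = {
    "url": "https://example.com/product/42",
    "output_schema": to_json_schema(FIELDS),
}

print(json.dumps(payload, indent=2))
```

Nothing in the payload mentions a class name, an index, or a tag. That absence is the whole point.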

Because the request describes what you want rather than where it sits, it survives the redesign. Move the price or rename the class, and "the current price as a number" still points at the same thing.

What about determinism?

The obvious objection: AI extraction isn't deterministic by nature, so how do you trust it in a pipeline? There are two answers, and it's worth being precise about what each one actually buys you.

First, schema validation with one repair pass. Every result is checked against your schema. If it doesn't conform, the engine gets one structured attempt to fix it before the request fails. This guarantees the shape of the output, so you won't get a string where you asked for a number, or a silently missing field. It does not, by itself, promise the extraction picked the right value. That's what the second part is for.
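
The control flow of "validate, repair once, then fail" can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's implementation: the type checker only handles the flat shorthand schema from the earlier example, and coerce_numbers is a hypothetical stand-in for whatever the real repair step does.

```python
SCHEMA = {"name": "string", "price": "number", "currency": "string", "in_stock": "boolean"}

TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def violations(result, schema):
    """Return a list of human-readable problems; empty means the result conforms."""
    problems = []
    for field, type_name in schema.items():
        if field not in result:
            problems.append(f"{field}: missing")
        elif not TYPE_CHECKS[type_name](result[field]):
            problems.append(f"{field}: expected {type_name}")
    return problems


def validate_with_one_repair(result, schema, repair):
    """Accept a conforming result, allow exactly one repair attempt, otherwise fail."""
    if not violations(result, schema):
        return result
    repaired = repair(result, schema)
    remaining = violations(repaired, schema)
    if remaining:
        raise ValueError("result does not match schema: " + "; ".join(remaining))
    return repaired


def coerce_numbers(result, schema):
    """Hypothetical repair step: turn numeric strings into floats where a number is expected."""
    fixed = dict(result)
    for field, type_name in schema.items():
        if type_name == "number" and isinstance(fixed.get(field), str):
            try:
                fixed[field] = float(fixed[field])
            except ValueError:
                pass  # leave it; the second validation will reject it
    return fixed


raw = {"name": "Widget", "price": "19.99", "currency": "USD", "in_stock": True}
print(validate_with_one_repair(raw, SCHEMA, coerce_numbers))
```

Note what the sketch cannot do: if the page showed two prices and the extraction picked the crossed-out one, 24.99 is still a valid number and passes every check here.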

Second, Blueprints. Once an extraction is working, save it as a Blueprint. The engine reverse-engineers selectors from that known-good result and afterward replays them with no AI step at run time. That is where real run-to-run determinism comes from: with no AI step left to vary, runs are fast and identical every time.

These are still selectors, so they can still drift when a page changes enough. The difference is the failure mode. When a field stops matching, a Blueprint run doesn't hand you a silent null. It comes back with validation_warnings naming exactly which fields drifted, and its status stays completed. That warning is your signal to re-pin the Blueprint or drop back to a live SmartScraper call. You get the cheap, deterministic path while the page holds, and an explicit heads-up the moment it doesn't.
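
A toy model of that replay-and-warn behavior, using only the standard library. The validation_warnings and status names come from the description above. Everything else is an assumption made for illustration: the Blueprint as a field-to-selector dictionary, the shape of the returned run, the data key, and the markup itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical pinned Blueprint: field name -> selector recovered from a known-good run.
BLUEPRINT = {
    "name": ".//h1",
    "price": ".//span[@class='pdp-price-now']",
}

PINNED_PAGE = "<div><h1>Widget</h1><span class='pdp-price-now'>19.99</span></div>"
DRIFTED_PAGE = "<div><h1>Widget</h1><span class='price-v2'>19.99</span></div>"


def run_blueprint(markup, blueprint):
    """Replay the pinned selectors; report unmatched fields instead of hiding them."""
    root = ET.fromstring(markup)
    data, warnings = {}, []
    for field, selector in blueprint.items():
        node = root.find(selector)
        if node is None or node.text is None:
            data[field] = None
            warnings.append(field)  # name the field that drifted
        else:
            data[field] = node.text
    # The run still completes; drift is surfaced through the warnings list.
    return {"status": "completed", "data": data, "validation_warnings": warnings}


def needs_attention(run):
    """True when the caller should re-pin the Blueprint or fall back to a live extraction."""
    return bool(run["validation_warnings"])


for page in (PINNED_PAGE, DRIFTED_PAGE):
    run = run_blueprint(page, BLUEPRINT)
    print(run["status"], run["validation_warnings"], needs_attention(run))
```

Because the status stays completed, a pipeline that checks only the status will miss drift. The check that matters is whether validation_warnings is empty.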

When to reach for which

None of this makes selectors useless. It makes them the wrong default. If you're hammering one stable site at high volume, pin a Blueprint and let it run. When you're hitting many sites where the structure keeps moving, or you just don't want to own the upkeep, describe the data and let the engine find it. Most teams end up running both, and for most pipelines that mix is the honest answer.