Why your scraper returns null after a redesign

A catalog of the silent failure modes behind an empty result, and how to turn page drift into a warning instead.

Your dashboard has been showing zero new prices for six days, and not one line in the logs is red. The cron job ran. The requests returned 200. The pipeline wrote rows. Every row just happens to have null where the price should be, and nobody noticed until finance asked why the report looked flat.

A scraper that returns null is not broken in the way you would hope. It is running exactly as written, against a page that changed underneath it. That is the whole problem with the silent failure: the code is fine, the data is wrong, and nothing connects the two until a human goes looking.

Here is the catalog of how a scraper goes quiet, and what actually stops each one.

1. A class got renamed

The most common cause is the cheapest to explain. A redesign moves the price from .product__price to .pdp-price-now, and a selector keyed on the old class matches nothing. In most scraping libraries a missing match is not an error. It is an empty result, returned successfully, so the request looks healthy all the way down the stack.
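
To make the quiet part concrete, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for a real scraping library. The two page snippets and the extract_price helper are invented for illustration; only the class names come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical before/after snippets of the same product page.
OLD_PAGE = '<div><span class="product__price">19.99</span></div>'
NEW_PAGE = '<div><span class="pdp-price-now">19.99</span></div>'

def extract_price(html):
    """Return the price text, or None when the selector matches nothing."""
    root = ET.fromstring(html)
    # Selector keyed on the old class name.
    node = root.find(".//span[@class='product__price']")
    # find() returns None on a miss; it does not raise.
    return node.text if node is not None else None

old_price = extract_price(OLD_PAGE)  # '19.99'
new_price = extract_price(NEW_PAGE)  # None, with no exception and no log line
print(old_price, new_price)
```

Nothing in the second call distinguishes "the page changed" from "there was no price", which is why the null travels downstream untouched.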

The fix is to stop encoding the class. When you describe "the current price as a number" instead of pointing at .product__price, a renamed class still resolves, because the description names what the value means, not where it sat.

2. An inserted element shifted every index

Positional selectors fail the same way for a different reason. Someone adds a promo banner above the product grid, and div:nth-child(3) > span now lands on the element before the one it used to match. A single inserted node renumbers every index after it. Sometimes you get the wrong value, which is bad; sometimes you get nothing, which at least shows up as null.
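
The shift is easy to reproduce. The sketch below is an illustration under assumptions: the markup is made up, and ElementTree's positional predicate stands in for the CSS selector (it counts among div siblings, which matches nth-child here only because every child is a div).

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the price sits in the third <div>.
BEFORE = (
    "<main>"
    "<div><span>Home / Widgets</span></div>"
    "<div><span>Acme Widget</span></div>"
    "<div><span>24.50</span></div>"
    "</main>"
)
# The same page after a promo banner is inserted above everything else.
AFTER = BEFORE.replace("<main>", "<main><div><span>Spring sale!</span></div>", 1)

def third_div_span(html):
    """Positional lookup: the <span> inside the third <div> child."""
    root = ET.fromstring(html)
    node = root.find("./div[3]/span")
    return node.text if node is not None else None

before_value = third_div_span(BEFORE)  # '24.50'
after_value = third_div_span(AFTER)    # 'Acme Widget': the wrong field, no error
print(before_value, after_value)
```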

Describing the field sidesteps the count entirely. There is no third span to drift when you ask for the thing rather than its position.

3. A 200 with an empty shell

Not every silent failure is the selector's fault. A page can move its content behind JavaScript, so a plain HTTP fetch returns a 200 with a near-empty body. The selector runs against a shell that never had the data in it, and a 200 reads as success everywhere downstream.

This is what tiered fetching is for: when a page needs scripts to render, the stack escalates from a fast HTTP request to a browser that runs them, which we walk through in the tiered fetch writeup. The honest caveat is that an empty shell which still returns a 200 is the one case no escalation ladder catches every time. If a value you can see in your own browser comes back empty, suspect this first.
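
One cheap way to raise suspicion before the selectors run is to measure how much visible text the body actually carries. This is a rough heuristic sketch, not how any particular fetch tier decides to escalate; the 200-character threshold and the function names are assumptions, and a thin but real page will trip it too.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def visible_text_length(html):
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)

def looks_like_empty_shell(status, html, min_chars=200):
    """True when a 200 response carries almost no visible text (assumed threshold)."""
    return status == 200 and visible_text_length(html) < min_chars

SHELL = ('<html><head><script>window.__APP__ = {};</script></head>'
         '<body><div id="root"></div></body></html>')
RENDERED = "<html><body><p>" + "Acme Widget, in stock. " * 20 + "</p></body></html>"

print(looks_like_empty_shell(200, SHELL))     # True
print(looks_like_empty_shell(200, RENDERED))  # False
```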

4. A soft block you read as content

Anti-bot systems rarely announce themselves with a 403 anymore. Many serve a 200 carrying a "checking your browser" interstitial, and your selectors dutifully run against the challenge page instead of the content. You get structured nothing from a request that looked like it worked.

The answer is the same escalation ladder, climbing to a stealth browser tier on the targets that warrant it. The signal to watch for is a sudden run of empty results from a site that used to return data, which is drift of a different kind.
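
Both signals can be checked in a few lines. The sketch below is illustrative only: the marker phrases are guesses that will differ per anti-bot vendor, and the function names are hypothetical.

```python
# Assumed marker phrases; real challenge pages vary by vendor and over time.
CHALLENGE_MARKERS = ("checking your browser", "verify you are human")

def looks_like_soft_block(status, body):
    """True when a 200 response contains a known challenge phrase."""
    text = body.lower()
    return status == 200 and any(marker in text for marker in CHALLENGE_MARKERS)

def empty_streak(history):
    """Count consecutive empty (None) results at the end of a result history."""
    streak = 0
    for value in reversed(history):
        if value is not None:
            break
        streak += 1
    return streak

interstitial = "<html><title>Checking your browser...</title></html>"
print(looks_like_soft_block(200, interstitial))             # True
print(empty_streak(["9.99", "9.99", None, None, None]))     # 3
```

A streak that starts abruptly on a target with a long history of data is a better alarm than any single empty result.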

5. The field is genuinely gone

Sometimes null is correct. An out-of-stock product hides its price; a logged-out view drops a section; a regional page omits a field that exists elsewhere. The data is not there to extract, and any approach returns nothing, because nothing is the right answer.

This one is not a bug to fix but a case to handle. Describing the field makes the absence unambiguous, so your code can treat a real null as the signal it is rather than confusing it with a broken selector.
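
One way to handle the case, sketched under assumptions: the record is a plain dict, and it carries a hypothetical in_stock field alongside price. Neither name comes from a real response format; the point is only that an expected null and a suspicious null get different labels.

```python
from enum import Enum

class PriceState(Enum):
    OK = "ok"
    EXPECTED_ABSENT = "expected_absent"  # null is the right answer
    SUSPECT = "suspect"                  # null with no explanation

def classify_price(record):
    """Label a record's price; 'price' and 'in_stock' are assumed field names."""
    if record.get("price") is not None:
        return PriceState.OK
    if record.get("in_stock") is False:
        # An out-of-stock product hiding its price is a legitimate null.
        return PriceState.EXPECTED_ABSENT
    return PriceState.SUSPECT

print(classify_price({"price": 19.99, "in_stock": True}))  # PriceState.OK
print(classify_price({"price": None, "in_stock": False}))  # PriceState.EXPECTED_ABSENT
print(classify_price({"price": None, "in_stock": True}))   # PriceState.SUSPECT
```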

6. The worst case: a plausible wrong value

Most of the failures above at least give you a null to notice. The dangerous one returns something. After a redesign, a selector can still match a value, just the wrong one: the struck-through list price instead of the sale price, a placeholder, a neighboring cell. The type is right, the shape is right, and the number is wrong. Nothing flags it because nothing looks broken.

This is the failure that runs for a week. It is also the one a silent pipeline is least equipped to catch, because every automated check it would trip is satisfied.
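
A small reproduction shows why the checks stay green. The markup here is invented: before the redesign the price class marks the sale price, and afterwards the same class ends up on the struck-through list price. ElementTree again stands in for a real scraping library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup before and after a redesign.
BEFORE = ('<div><del class="was">49.00</del>'
          '<span class="price">39.00</span></div>')
AFTER = ('<div><span class="price" style="text-decoration:line-through">49.00</span>'
         '<span class="price-now">39.00</span></div>')

def extract_price(html):
    node = ET.fromstring(html).find(".//span[@class='price']")
    return float(node.text) if node is not None and node.text else None

def passes_type_check(value):
    """The kind of check a pipeline usually has: right type, positive number."""
    return isinstance(value, float) and value > 0

before_price = extract_price(BEFORE)  # 39.0, the sale price
after_price = extract_price(AFTER)    # 49.0, the list price
print(before_price, after_price, passes_type_check(after_price))
```

The second value is a well-formed, positive float, so a type or shape check accepts it without complaint.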

The pattern under all six

Five of these six are not really about which value you got. They are about the failure mode: a scraper that fails by going quiet lets bad data flow until a person notices. The fix is not a cleverer selector. It is a pipeline that turns drift into a signal you can act on, and there are two mechanisms we lean on.

Describe the data, not its location. Renames and index shifts stop mattering when the request names the meaning. The full argument is in why structured extraction beats CSS selectors.

Make drift loud. Once an extraction works and you want the cheap, repeatable path, pin it as a Blueprint. A Blueprint replays selectors with no AI step, so it is fast and identical run to run. When a field stops matching, it does not hand you a silent null. The run comes back with validation_warnings naming exactly which fields drifted, and its status stays completed. That warning is your cue to re-pin or fall back to a live call. Pairing a Blueprint with an output_schema adds the same loudness to type drift, the subject of prompt versus schema: a result that no longer matches the shape is caught instead of shipped.
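
On the consuming side, the important detail is that a completed status alone is not a clean bill of health. The sketch below assumes a run result arrives as a dict with status, data, and validation_warnings keys, and that each warning is a field name. That shape is an assumption made for illustration, not the documented response format, so check the real payload before relying on it.

```python
def triage_run(run):
    """Classify a Blueprint run result (assumed dict shape, see lead-in)."""
    if run.get("status") != "completed":
        return "failed"
    # Assumption: validation_warnings is a list naming the drifted fields.
    if run.get("validation_warnings"):
        return "drifted"
    return "ok"

clean_run = {"status": "completed", "data": {"price": 39.0}, "validation_warnings": []}
drifted_run = {"status": "completed", "data": {"price": None},
               "validation_warnings": ["price"]}

for run in (clean_run, drifted_run):
    verdict = triage_run(run)
    if verdict == "drifted":
        # Cue to re-pin the Blueprint or fall back to a live call.
        print("drift in fields:", run["validation_warnings"])
    else:
        print(verdict)
```

Routing the "drifted" verdict to an alert rather than a log line is what keeps the null out of the report.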

Where this still bites

None of this is a clean win on every axis, and pretending otherwise would be the marketing version of this post.

The empty-shell-that-returns-200 still slips through occasionally, because it is indistinguishable from a thin but real page until something downstream expects content. The plausible-wrong-value case is only partly covered: validation sees a well-typed number, not a wrong one, so a same-type substitution can still pass. And describing the field instead of pinning a selector costs more per call than a static selector does, which is the trade you make for surviving the redesign.

The goal was never a scraper that never breaks. Pages change; that is not negotiable. The goal is a scraper that tells you the moment it does, instead of letting a column of null reach a report nobody questions until the numbers look wrong.

Pin a Blueprint on a page you actually depend on and let its warnings, not your dashboard, be the thing that tells you it drifted. A failed fetch costs nothing, so watching for drift is free to set up.