Why your scraper returns null after a redesign
A catalog of the silent failure modes behind an empty result, and how to turn page drift into a warning instead.
Your dashboard has been showing zero new prices for six days, and not one line in
the logs is red. The cron job ran. The requests returned 200. The pipeline wrote
rows. Every row just happens to have null where the price should be, and nobody
noticed until finance asked why the report looked flat.
A scraper that returns null is not broken in the way you would hope. It is
running exactly as written, against a page that changed underneath it. That is the
whole problem with the silent failure: the code is fine, the data is wrong, and
nothing connects the two until a human goes looking.
Here is the catalog of how a scraper goes quiet, and what actually stops each one.
1. A class got renamed
The most common cause is the cheapest to explain. A redesign moves the price from
.product__price to .pdp-price-now, and a selector keyed on the old class
matches nothing. A missing match is not an error in most scraping libraries. It is an
empty result, returned successfully, so the request looks healthy all the way
down the stack.
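A minimal sketch of how quiet this is, using Python's standard-library xml.etree.ElementTree on a made-up, well-formed snippet. The markup and the extract_price helper are illustrative assumptions, not the page or the code behind the dashboard above:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup after the redesign: the price now sits under a new class.
REDESIGNED = (
    '<div class="pdp">'
    '<h1>Widget</h1>'
    '<span class="pdp-price-now">19.99</span>'
    '</div>'
)

def extract_price(markup: str, css_class: str):
    """Return the text of the first <span> with exactly this class, or None."""
    root = ET.fromstring(markup)
    node = root.find(f".//span[@class='{css_class}']")
    # find() does not raise when nothing matches; it returns None.
    return node.text if node is not None else None

old_selector_result = extract_price(REDESIGNED, "product__price")  # stale class
new_selector_result = extract_price(REDESIGNED, "pdp-price-now")   # current class
```

The stale selector raises no exception. It hands back None, which a pipeline will write into the price column unless something checks for it.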
The fix is to stop encoding the class. When you describe "the current price as a
number" instead of pointing at .product__price, a renamed class still resolves,
because the description names what the value means, not where it sat.
2. An inserted element shifted every index
Positional selectors fail the same way for a different reason. Someone adds a
promo banner above the product grid, and div:nth-child(3) > span now lands one
element earlier than it used to. A single inserted node renumbers every
index after it. Sometimes you get the wrong value, which is bad; sometimes you get
nothing, which at least shows up as null.
Describing the field sidesteps the count entirely. There is no third span to drift when you ask for the thing rather than its position.
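The index shift is easy to reproduce. In this sketch an XPath positional predicate (div[3]) stands in for the CSS positional selector, and both snippets are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a promo banner is inserted at the top.
BEFORE = (
    "<main>"
    "<div><span>Nav</span></div>"
    "<div><span>Widget</span></div>"
    "<div><span>19.99</span></div>"
    "</main>"
)
AFTER = (
    "<main>"
    "<div><span>Promo</span></div>"
    "<div><span>Nav</span></div>"
    "<div><span>Widget</span></div>"
    "<div><span>19.99</span></div>"
    "</main>"
)

def third_div_span(markup: str):
    """Return the text of the <span> inside the third <div> child, or None."""
    root = ET.fromstring(markup)
    node = root.find("./div[3]/span")  # XPath positions are 1-based
    return node.text if node is not None else None

before_value = third_div_span(BEFORE)
after_value = third_div_span(AFTER)
```

Before the banner the selector returns the price. After it, the same selector returns the product name: a value of the wrong kind, with no error raised.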
3. A 200 with an empty shell
Not every silent failure is the selector's fault. A page can move its content behind JavaScript, so a plain HTTP fetch returns a 200 with a near-empty body. The selector runs against a shell that never had the data in it, and a 200 reads as success everywhere downstream.
This is what tiered fetching is for: when a page needs scripts to render, the stack escalates from a fast HTTP request to a browser that runs them, which we walk through in the tiered fetch writeup. The honest caveat is that an empty shell which still returns a 200 is the one case no escalation ladder catches every time. If a value you can see in your own browser comes back empty, suspect this first.
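One rough way to flag a likely shell before running selectors is to measure how much visible text the body carries once scripts and styles are ignored. This is a heuristic sketch, not how any particular fetch stack decides to escalate; the function name and the 200-character threshold are assumptions to tune per site:

```python
from html.parser import HTMLParser

class _VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def looks_like_empty_shell(html: str, min_visible_chars: int = 200) -> bool:
    """True when the body has less visible text than the (assumed) threshold."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_visible_chars
```

A thin but genuine page trips this check too, which is the same ambiguity described above. Treat a True as a reason to retry with a rendering browser, not as proof of a shell.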
4. A soft block you read as content
Anti-bot systems rarely announce themselves with a 403 anymore. Many serve a 200 carrying a "checking your browser" interstitial, and your selectors dutifully run against the challenge page instead of the content. You get structured nothing from a request that looked like it worked.
The answer is the same escalation ladder, climbing to a stealth browser tier on the targets that warrant it. The signal to watch for is a sudden run of empty results from a site that used to return data, which is drift of a different kind.
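That signal can be tracked with very little state. The sketch below flags a site once it has returned data at least once and then produces several empty results in a row. The class name and the default threshold of three are assumptions:

```python
from collections import defaultdict

class EmptyRunDetector:
    """Flags a sudden run of empty results from a site that used to return data."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._seen_data = defaultdict(bool)  # site -> has ever returned a value
        self._empty_run = defaultdict(int)   # site -> consecutive empty results

    def record(self, site: str, value) -> bool:
        """Record one result; return True when the site looks suspicious."""
        if value is not None and value != "":
            self._seen_data[site] = True
            self._empty_run[site] = 0
            return False
        self._empty_run[site] += 1
        return self._seen_data[site] and self._empty_run[site] >= self.threshold
```

A site that has never returned a value is deliberately not flagged, since its empties may be genuine rather than a block.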
5. The field is genuinely gone
Sometimes null is correct. An out-of-stock product hides its price; a
logged-out view drops a section; a regional page omits a field that exists
elsewhere. The data is not there to extract, and any approach returns nothing,
because nothing is the right answer.
This one is not a bug to fix but a case to handle. Describing the field makes the
absence unambiguous, so your code can treat a real null as the signal it is
rather than confusing it with a broken selector.
6. The worst case: a plausible wrong value
The failures above at least give you a null to notice. The dangerous one returns
something. After a redesign, a selector can still match a value, just the wrong one:
the struck-through list price instead of the sale price, a placeholder, a
neighboring cell. The type is right, the shape is right, and the number is wrong.
Nothing flags it because nothing looks broken.
This is the failure that runs for a week. It is also the one a silent pipeline is least equipped to catch, because every automated check it might have tripped still passes.
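A small reproduction, again on invented markup: after the hypothetical redesign the list price and the sale price share a class, so a first-match selector quietly switches to the wrong one while the type check keeps passing.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: the redesign adds a struck-through list price ahead of
# the sale price, and both spans carry the same class.
OLD_PAGE = '<div class="product"><span class="price">19.99</span></div>'
NEW_PAGE = (
    '<div class="product">'
    '<span class="price" data-role="list">29.99</span>'
    '<span class="price" data-role="sale">19.99</span>'
    '</div>'
)

def first_price(markup: str):
    """Return the first span.price as a float, or None when nothing matches."""
    root = ET.fromstring(markup)
    node = root.find(".//span[@class='price']")  # first match in document order
    return float(node.text) if node is not None else None

old_value = first_price(OLD_PAGE)
new_value = first_price(NEW_PAGE)
type_check_passes = isinstance(new_value, float)
```

Both calls return a float. Only the second is wrong, and nothing in its type or shape says so.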
The pattern under all six
All six are less about which value you got than about the failure mode: a scraper that fails by going quiet lets bad data flow until a person notices. The fix is not a cleverer selector. It is a pipeline that turns drift into a signal you can act on, and there are two mechanisms we lean on.
Describe the data, not its location. Renames and index shifts stop mattering when the request names the meaning. That is the argument in full in why structured extraction beats CSS selectors.
Make drift loud. Once an extraction works and you want the cheap, repeatable
path, pin it as a Blueprint. A Blueprint replays selectors with no AI step, so it
is fast and identical run to run. When a field stops matching, it does not hand you
a silent null. The run comes back with validation_warnings naming exactly which
fields drifted, and its status stays completed. That warning is your cue to
re-pin or fall back to a live call. Pairing a Blueprint with an output_schema
adds the same loudness to type drift, the subject of
prompt versus schema: a result that no longer
matches the shape is caught instead of shipped.
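On the consuming side, the point is to branch on the warnings rather than on the status alone. The sketch below assumes the run result arrives as a plain dict whose validation_warnings is a list of field names or of dicts with a "field" key. That shape, the helper names, and the returned action strings are all assumptions for illustration; check the actual response format before relying on them.

```python
def drifted_fields(run: dict) -> list:
    """Return field names listed in validation_warnings (assumed shape)."""
    warnings = run.get("validation_warnings") or []
    return [w["field"] if isinstance(w, dict) else str(w) for w in warnings]

def next_action(run: dict) -> str:
    """Decide what to do with a run; the action names are hypothetical."""
    if run.get("status") != "completed":
        return "retry"
    # A completed run can still carry warnings, so status alone is not enough.
    return "repin_or_live_call" if drifted_fields(run) else "use_result"
```

A check that only looks at status would wave a drifted run straight through, because the status stays completed.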
Where this still bites
None of this is a clean win on every axis, and pretending otherwise would be the marketing version of this post.
The empty-shell-that-returns-200 still slips through occasionally, because it is indistinguishable from a thin but real page until something downstream expects content. The plausible-wrong-value case is only partly covered: validation sees a well-typed number, not a wrong one, so a same-type substitution can still pass. And describing the field instead of pinning a selector costs more per call than a static selector does, which is the trade you make for surviving the redesign.
The goal was never a scraper that never breaks. Pages change; that is not
negotiable. The goal is a scraper that tells you the moment it does, instead of
letting a column of null reach a report nobody questions until the numbers look
wrong.
Pin a Blueprint on a page you actually depend on and let its warnings, not your dashboard, be the thing that tells you it drifted. A failed fetch costs nothing, so watching for drift is free to set up.