How to pull structured data out of an HTML table
How to pull structured data out of an HTML table in one API call: describe the columns you want and get each row back as schema-validated JSON, no selectors.
| Step | What happens |
|---|---|
| 1. Describe the rows | POST the URL to /v1/smartscraper with the columns you want; each row comes back as an object in data.result |
| 2. Pin the row shape (optional) | Send an output_schema typed as an array of objects |
| 3. Pick the right table | Name the table in your prompt when a page has several |
| Cost | 5 credits per successful call; a failed fetch costs 0 |
By the end of this guide you will have one API call that turns an HTML table into an
array of row objects, with no <tr> and <td> walking and no selectors to keep
in sync with the page. You need an API key, an HTTP client, and a URL with a table
on it.
The examples target a Wikipedia article, whose content is openly licensed and friendly to read programmatically. Whatever target you use, respect its terms and robots.txt.
Step 1: Describe the rows you want
A table is a list of rows that share columns, so ask for exactly that. You name the columns; you do not point at cells.
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://en.wikipedia.org/wiki/List_of_Apollo_missions",
    "user_prompt": "Extract the list of crewed Apollo missions, each with its mission name, launch date, and crew members"
  }'
What just happened: the engine fetched the page, cleaned it, found the table, and
returned one object per row. The rows arrive under data.result as an array.
{
  "status": "completed",
  "data": {
    "result": [
      {
        "mission": "Apollo 11",
        "launch_date": "July 16, 1969",
        "crew": ["Neil Armstrong", "Michael Collins", "Buzz Aldrin"]
      },
      {
        "mission": "Apollo 12",
        "launch_date": "November 14, 1969",
        "crew": ["Pete Conrad", "Richard F. Gordon Jr.", "Alan Bean"]
      }
    ]
  },
  "credits_used": 5,
  "credits_remaining": 495,
  "request_id": "req_..."
}
The array is abbreviated above. Each row carries the three columns you asked for, typed the way you described them, including the crew as a list of strings.
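The same call works from any HTTP client. Here is a minimal Python sketch using only the standard library, assuming the endpoint, headers, and response shape shown above. The helper names build_request, rows_from_response, and fetch_rows are illustrative and not part of the API.

```python
import json
import urllib.request

API_URL = "https://api.webscrape.ai/v1/smartscraper"


def build_request(api_key, website_url, user_prompt):
    """Build the POST request from the curl example (nothing is sent yet)."""
    body = json.dumps({"website_url": website_url, "user_prompt": user_prompt})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def rows_from_response(response):
    """Return the row objects under data.result, or an empty list if absent."""
    data = response.get("data") or {}
    result = data.get("result")
    return result if isinstance(result, list) else []


def fetch_rows(api_key, website_url, user_prompt):
    """Send the request and return the rows. This performs a real network call."""
    request = build_request(api_key, website_url, user_prompt)
    with urllib.request.urlopen(request) as http_response:
        return rows_from_response(json.load(http_response))
```

Keeping the request builder and the response reader separate from the network call lets you test both against a saved response without spending credits.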
Step 2: Pin the row shape with a schema
In a pipeline you want every row to have the same keys and types. Pass an
output_schema that types the array and its row objects, and keep the
user_prompt alongside it:
curl https://api.webscrape.ai/v1/smartscraper \
  -H "X-API-Key: wsg_live_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://en.wikipedia.org/wiki/List_of_Apollo_missions",
    "user_prompt": "Extract the list of crewed Apollo missions with mission, launch date, and crew",
    "output_schema": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "mission": { "type": "string" },
          "launch_date": { "type": "string" },
          "crew": { "type": "array", "items": { "type": "string" } }
        },
        "required": ["mission", "launch_date"]
      }
    }
  }'
Each row is now validated against items. If a row comes back missing a required
field or with the wrong type, the engine gets one repair pass before the call
returns, so downstream code can trust the shape of every row.
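If you want a cheap guard on your side as well, a local check can mirror the schema. This is a minimal sketch, not the engine's validator: check_row is a hypothetical helper and covers only the three fields used in this example.

```python
def check_row(row):
    """Return a list of problems for one row; an empty list means it conforms."""
    if not isinstance(row, dict):
        return ["row: expected an object"]
    problems = []
    # mission and launch_date are the required string fields in the schema.
    for key in ("mission", "launch_date"):
        if not isinstance(row.get(key), str):
            problems.append(f"{key}: expected a string")
    # crew is optional, but when present it must be a list of strings.
    if "crew" in row:
        crew = row["crew"]
        if not isinstance(crew, list) or not all(isinstance(m, str) for m in crew):
            problems.append("crew: expected a list of strings")
    return problems
```

Run it over the rows before they enter the pipeline and log any non-empty result, so a shape problem is visible at the boundary rather than three steps later.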
Step 3: Pick the right table
A real page often carries several tables: a navigation box, a summary sidebar, a schedule, and the one you actually want. Name the target in your prompt so the engine knows which one to read.
- "user_prompt": "Extract the table as rows"
+ "user_prompt": "Extract the crewed-missions table, each row with mission, launch date, and crew"
Description does the disambiguating that a positional selector cannot. If two tables share a structure, narrow by a column that only the right one has.
When this breaks
It returns the wrong table. The fix is in Step 3: name the table by what it contains. A prompt that says "the specifications table" beats one that says "the second table," because a table inserted above it would renumber your target.
Merged or spanning cells come back lopsided. A rowspan header that labels
several rows, or a cell merged across columns, is the classic case that breaks an
nth-child selector. Describing the logical field handles most of these, because
the engine reads the cell's meaning rather than its index. It is still a hard
case at the edges, so check a sample row before trusting a whole table.
A huge table returns partial rows. Large pages are split into chunks, extracted in parallel, and merged, so size alone is not a blocker. The tail of a very long table is where a dropped or duplicated row is most likely, so validate a sample against your schema rather than assuming the count is right.
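One way to spot-check a long result is to look for duplicated keys and to inspect the tail. A minimal sketch, assuming each row carries a string field that should be unique (mission in the example above); both helper names are illustrative.

```python
from collections import Counter


def duplicated_keys(rows, key="mission"):
    """Return the values of `key` that appear in more than one row."""
    counts = Counter(
        row[key]
        for row in rows
        if isinstance(row, dict) and isinstance(row.get(key), str)
    )
    return sorted(value for value, n in counts.items() if n > 1)


def tail_sample(rows, size=5):
    """Return the last `size` rows, where a dropped or duplicated row is most likely."""
    return rows[-size:] if size > 0 else []
```

Compare the tail sample by eye against the bottom of the table on the page. A missing final row will not show up as a duplicate.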
The table is empty in the response. The page probably draws the table with JavaScript after the initial HTML loads. The engine escalates to a browser tier when it detects this, as covered in the tiered fetch writeup, but an empty shell that returns a 200 can still slip past. If you can see the rows in your own browser and the response is empty, that is the case to suspect.
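A completed call with no rows is worth flagging rather than passing along. A minimal sketch, assuming the response shape shown in Step 1; is_suspiciously_empty is a hypothetical helper name.

```python
def is_suspiciously_empty(response):
    """True when the call completed but data.result holds no rows."""
    data = response.get("data") or {}
    result = data.get("result")
    return response.get("status") == "completed" and not result
```

When this returns True for a page whose rows you can see in a browser, treat it as the JavaScript-rendered case rather than as a genuinely empty table.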
For a table you will re-scrape on a schedule, pin the working extraction as a
Blueprint so drift surfaces as validation_warnings with the status still
completed, rather than a silently shorter array. The reasoning is in
why structured extraction beats CSS selectors,
and the failure catalog is in why scrapers go quiet after a redesign.
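A scheduled job can then fail loudly on drift instead of writing a short array. The sketch below rests on an assumption this article does not confirm: that validation_warnings arrives as a top-level list in the response. Check the actual response before relying on it; drift_warnings is a hypothetical helper.

```python
def drift_warnings(response):
    """Return validation warnings from a response, or an empty list.

    Assumption: warnings sit at the top level under "validation_warnings".
    The real location in the response may differ.
    """
    warnings = response.get("validation_warnings") or []
    return list(warnings) if isinstance(warnings, list) else [warnings]
```

Checking the status alone is not enough here, because the status stays completed when drift is reported.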
Next step
Grab a free key and point the call above at a table you keep copying by hand. The free credits cover real runs, and a failed fetch costs nothing, so a prompt that grabs the wrong table only costs the time to reword it.
Frequently asked questions
How do I extract data from an HTML table?
Send the page URL to /v1/smartscraper and describe the rows you want in user_prompt, for example "the list of entries with their name, date, and value." The extracted rows come back under data.result as an array of objects. You never write a selector to walk the table's rows and cells.
How do I scrape an HTML table without writing CSS selectors?
Describe the columns you want in plain language instead of pointing at table cells. The engine reads the cleaned page and maps each row to the fields you named. Because you describe meaning rather than position, an inserted column or a renamed header does not shift the mapping the way a positional selector would.
How do I get each table row as a separate JSON object?
Ask for the list of rows in user_prompt, and pass an output_schema typed as an array of objects to fix the shape. Each row comes back as one object in the data.result array, with the columns you named as typed fields. The result is validated against the schema, with one repair pass if the first result does not conform.
How do I scrape one specific table when the page has several?
Name the table in your prompt: ask for the medals table rather than the schedule, or the specifications table rather than the related-products grid. Description disambiguates where a positional selector cannot. If several tables share a structure, narrow by a column name that only the one you want contains.
Can it handle tables with merged or spanning cells?
Usually, yes. Because you describe the logical field rather than a cell index, a rowspan header or a merged label that would throw off an nth-child selector is read for its meaning. Spanning cells are still a hard case in general; check the result on a sample row before trusting a whole large table.
How do I scrape a very large HTML table?
Send it the same way. Large pages are split into chunks, extracted in parallel, and merged with deduplication, so a long table does not need a different call. Validate a sample of rows against your schema before relying on the full set, since the tail of a very large table is where extraction errors tend to hide.
Does table extraction work on tables rendered by JavaScript?
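Usually, yes. If the table is drawn by JavaScript after the initial HTML loads, the fetch stack escalates from a fast HTTP request to a headless browser that runs the scripts before extraction reads the page. You do not pick the tier; the engine escalates only as far as the page requires. An empty shell that returns a 200 can still slip past detection, so if you can see the rows in your own browser and the response is empty, suspect this case.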