AvidiaExtract — Product Data Extraction
Fetch any manufacturer or supplier product page and extract clean, structured data automatically.
What AvidiaExtract Does
AvidiaExtract fetches a product URL — from a manufacturer, distributor, or supplier — and uses AI to parse the page into structured JSON fields. It handles HTML product pages, PDF datasheets, and JavaScript-rendered sites (with the browser extension).
The output is a normalized product record you can use as the input for all downstream modules: AvidiaDescribe for copy generation, AvidiaSEO for optimization, and your eCommerce platform for direct import.
Supported Input Types
- Single URL: paste any product URL into the Extract UI and hit Extract. Ideal for testing and one-off products.
- Batch URL list: paste up to 50 URLs at once using the bulk input field (newline-separated).
- CSV import: upload a CSV with a `url` column to start a bulk extraction job. See the Import guide for column requirements.
- API: submit extraction requests programmatically via `POST /api/v1/ingest`.
- PDF datasheets: upload PDF files directly for extraction. Works best with machine-readable (non-scanned) PDFs.
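For URL lists longer than the 50-URL bulk limit, the list can be split into pasteable blocks beforehand. The following is a minimal sketch, not an AvidiaTech tool: the helper name `chunk_urls` is hypothetical, and dropping duplicate URLs is an assumption made here for convenience.

```python
from typing import Iterator, List

BULK_LIMIT = 50  # per-paste limit stated above for the bulk input field


def chunk_urls(raw: str, size: int = BULK_LIMIT) -> Iterator[str]:
    """Yield newline-separated blocks of at most `size` URLs each."""
    seen = set()
    urls: List[str] = []
    for line in raw.splitlines():
        url = line.strip()
        # Skip blank lines and repeated URLs (keeps the first occurrence).
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    for start in range(0, len(urls), size):
        yield "\n".join(urls[start:start + size])
```

Each yielded string can be pasted into the bulk input field as one batch.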
Extracted Fields
AvidiaExtract attempts to populate the following fields from every product page. Fields that cannot be found are returned as null.
| Field | Example Value | Notes |
|---|---|---|
| product_name | 3M N95 Particulate Respirator 8210, 20/Box | Full product title as listed by manufacturer |
| brand | 3M | Manufacturer or brand name |
| sku | 8210 | Manufacturer part number or SKU |
| description | NIOSH-approved N95 respirator... | Raw manufacturer description text |
| specifications | {"filtration_efficiency": "≥95%", "style": "flat fold"} | Key-value pairs from spec tables |
| images | ["https://cdn.3m.com/..."] | All product image URLs found on the page |
| dimensions | {"length": "5.5in", "width": "3.5in"} | Physical size, extracted from specs or product data |
| weight | 4.5 oz | Shipping or product weight |
| price | 24.99 | Listed price if publicly available |
| availability | in_stock | in_stock, out_of_stock, or discontinued |
| categories | ["Safety", "Respiratory Protection", "N95"] | Breadcrumb or taxonomy categories from page |
| upc | 00051131070141 | UPC/EAN barcode if present |
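If you consume extraction results in code, it can help to model the record explicitly. The sketch below mirrors the field table; the Python types are assumptions inferred from the example values (for instance, `weight` is shown as a string in the table), and `missing_fields` is a hypothetical helper.

```python
from typing import Dict, List, Optional, TypedDict


class ProductRecord(TypedDict):
    # Every field may be null (None) when it cannot be found on the page.
    product_name: Optional[str]
    brand: Optional[str]
    sku: Optional[str]
    description: Optional[str]
    specifications: Optional[Dict[str, str]]
    images: Optional[List[str]]
    dimensions: Optional[Dict[str, str]]
    weight: Optional[str]
    price: Optional[float]
    availability: Optional[str]  # in_stock, out_of_stock, or discontinued
    categories: Optional[List[str]]
    upc: Optional[str]


FIELDS = tuple(ProductRecord.__annotations__)


def missing_fields(record: dict) -> List[str]:
    """Return the names of standard fields that are absent or null."""
    return [name for name in FIELDS if record.get(name) is None]
```

Running `missing_fields` over a batch of records gives a quick view of which fields most often come back null.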
How Extraction Works
1. Fetch: AvidiaTech's crawler fetches the HTML source of the URL, following redirects and handling standard HTTP headers. Pages that require JavaScript rendering fall back to a headless browser.
2. AI Parsing: The raw HTML is sent to the extraction model, which identifies product data signals such as spec tables, structured data (JSON-LD, microdata), image tags, pricing elements, and description blocks.
3. Field Normalization: Raw values are normalized. Units are standardized (e.g., "5 lbs" → `{"value": 5, "unit": "lb"}`), prices are stripped of currency symbols, and image URLs are resolved to absolute URLs.
4. JSON Output: The normalized record is stored as an ingestion and made available in the UI, via the API, and as input to downstream modules.
Handling Difficult Pages
JavaScript-heavy sites
Sites like Grainger, MSC Direct, or custom B2B portals often load product data dynamically via JavaScript. Standard HTTP fetching captures an empty or skeleton page. Use the AvidiaExtract browser extension to extract from a fully rendered page in your Chrome session.
PDF datasheets
Upload PDFs via the Extract UI or via POST /api/v1/ingest with type: "pdf". Machine-readable PDFs (text-based) extract cleanly. Scanned image-only PDFs require OCR preprocessing.
Private / login-required pages
Distributor portals and private supplier catalogs often require authentication. The browser extension can extract from authenticated sessions. Alternatively, export the product data to CSV from the portal and use AvidiaTech's CSV import to ingest it.
Understanding Extraction Quality Scores
Every extraction is assigned a quality score from 0 to 100. This score reflects how completely the system was able to populate the standard field set.
| Score | Rating |
|---|---|
| 80–100 | Excellent |
| 50–79 | Good |
| 0–49 | Review Needed |
Scores below 50 typically mean critical fields like product name, description, or specifications could not be found. Filter your product list by score using the Score < 50 filter to identify products that need manual review or reprocessing.
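If you export scores and want to apply the same bands in your own tooling, the thresholds translate directly into code. This is a sketch only: the `score` key on each product dict is an assumed field name, not a documented one.

```python
from typing import Dict, List


def score_band(score: int) -> str:
    """Map a 0-100 quality score to the rating bands described above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 80:
        return "Excellent"
    if score >= 50:
        return "Good"
    return "Review Needed"


def needs_review(products: List[Dict]) -> List[Dict]:
    """Equivalent of the Score < 50 filter, applied to exported records."""
    # "score" is a hypothetical key name for the quality score.
    return [p for p in products if p["score"] < 50]
```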
Reprocessing Failed Extractions
If an extraction fails or produces a low-quality result, you can reprocess it without consuming an additional extraction credit (first reprocess is free). From the product detail view, click Reprocess → choose whether to use the cached page or re-fetch the URL.
Via the API, set "reprocess": true inside the options object of your ingest request body to force a fresh extraction even if a cached result exists.
Rate Limits and Quotas
| Plan | Monthly Extractions | Concurrent Jobs | API Rate Limit |
|---|---|---|---|
| Starter | 100 | 2 | 60 req/min |
| Growth | 5,000 | 10 | 300 req/min |
| Scale | Unlimited | 50 | 1,000 req/min |
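A client can stay under the API rate limit by spacing requests evenly. The sketch below is one conservative approach under stated assumptions: how the server measures its rate window is not documented here, and the `Throttle` class is hypothetical.

```python
import time
from typing import Callable

# Requests per minute, taken from the plan table above.
API_RATE_LIMITS = {"Starter": 60, "Growth": 300, "Scale": 1000}


class Throttle:
    """Space calls so they never exceed the plan's requests-per-minute limit."""

    def __init__(
        self,
        plan: str,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 60.0 / API_RATE_LIMITS[plan]  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        self._next_allowed = now + self.interval
```

Call `throttle.wait()` immediately before each API request. Even spacing gives up burst capacity in exchange for never tripping the limit.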
API: Submit an Extraction
POST https://app.avidiatech.com/api/v1/ingest
Authorization: Bearer <your-api-key>
Content-Type: application/json
{
"url": "https://www.boschtools.com/us/en/boschtools-ocs/cordless-drill-drivers-gsb-18v-755-06019H3110.html",
"options": {
"reprocess": false,
"pipeline": ["extract", "describe", "seo"]
}
}
// Response (202 Accepted)
{
"id": "ing_01HX4K2QZRP7W9VMBGT3DENF8",
"status": "queued",
"created_at": "2024-03-15T14:22:00Z",
"url": "https://www.boschtools.com/..."
}
Poll the result with GET /api/v1/ingest/:id, or set up a webhook to be notified on completion. See the full API Reference.
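The submit-and-poll flow can be sketched with the Python standard library. The endpoint, headers, and request body follow the example above; the function names, the polling interval, and the `"processing"` status value are assumptions (only `"queued"` appears in this document), so check the API Reference for the actual status values.

```python
import json
import time
import urllib.request
from typing import Callable, List, Optional

BASE_URL = "https://app.avidiatech.com/api/v1/ingest"


def build_ingest_request(
    api_key: str, url: str, reprocess: bool = False, pipeline: Optional[List[str]] = None
) -> urllib.request.Request:
    """Build the POST request shown in the example above."""
    body = {"url": url, "options": {"reprocess": reprocess}}
    if pipeline is not None:
        body["options"]["pipeline"] = pipeline
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def poll_ingestion(
    api_key: str,
    ingestion_id: str,
    fetch: Callable[[urllib.request.Request], dict] = fetch_json,
    interval: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll GET /api/v1/ingest/:id until the status leaves the in-progress set."""
    request = urllib.request.Request(
        f"{BASE_URL}/{ingestion_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    for _ in range(max_attempts):
        record = fetch(request)
        # "processing" is an assumed in-progress status; only "queued" is documented here.
        if record.get("status") not in ("queued", "processing"):
            return record
        sleep(interval)
    raise TimeoutError(f"ingestion {ingestion_id} did not finish after {max_attempts} polls")
```

A typical call sequence is `created = fetch_json(build_ingest_request(key, product_url))` followed by `poll_ingestion(key, created["id"])`. For high volumes, a webhook avoids spending rate-limit budget on polling.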
Common Errors and Fixes
FETCH_BLOCKED
Cause: The page returned a 403 or CAPTCHA block.
Fix: Try the manufacturer's direct product URL rather than a category or search page. Some pages require the browser extension for JavaScript-heavy rendering.
LOW_CONTENT
Cause: Extracted fewer than 3 fields — page may be JS-rendered or behind auth.
Fix: Use the AvidiaExtract browser extension, which runs extraction in your browser session and can access pages that require login or JavaScript.
PDF_PARSE_ERROR
Cause: Uploaded PDF was image-only (scanned, not machine-readable).
Fix: Run OCR on the PDF before uploading, or request a text-based datasheet directly from your supplier.
TIMEOUT
Cause: Page took more than 30 seconds to respond.
Fix: Check that the URL is publicly accessible, then try reprocessing. Some supplier sites are intermittently slow.
DUPLICATE_URL
Cause: This URL was already extracted and has an existing ingestion record.
Fix: Find the existing record in your product list, or use the reprocess option to force a fresh extraction.
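When handling failures in bulk, the error codes above can be mapped to a next step. This is a sketch: the choice of which codes are worth an automatic reprocess is an interpretation of the fixes listed above, and how the error code is delivered in an API response is not specified here.

```python
from typing import Dict

# Suggested fixes, summarized from the error list above.
ERROR_FIXES: Dict[str, str] = {
    "FETCH_BLOCKED": "Use the manufacturer's direct product URL or the browser extension.",
    "LOW_CONTENT": "Use the browser extension for JS-rendered or login-required pages.",
    "PDF_PARSE_ERROR": "Run OCR on the PDF or request a text-based datasheet.",
    "TIMEOUT": "Check that the URL is publicly accessible, then reprocess.",
    "DUPLICATE_URL": "Open the existing record or reprocess to force a fresh extraction.",
}

# Assumption: only these codes are worth an automatic reprocess attempt.
AUTO_REPROCESS = {"TIMEOUT", "DUPLICATE_URL"}


def triage(code: str) -> Dict[str, object]:
    """Return the suggested fix and whether an automatic reprocess is reasonable."""
    return {
        "code": code,
        "fix": ERROR_FIXES.get(code, "Unknown error code; review manually."),
        "auto_reprocess": code in AUTO_REPROCESS,
    }
```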
Tips for Best Results
- Always use the manufacturer's own product page, not a retailer listing. Manufacturer pages have specification tables, official images, and authoritative data that score 10–20 points higher on average.
- Ensure the URL is publicly accessible: test it in an incognito browser window before submitting.
- For product families (e.g., Bosch 18V drill with multiple SKUs), extract the specific variant URL, not the product family overview page.
- Include the `sku` field in your CSV imports. It helps deduplicate products and match records to your existing catalog.
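A CSV for import can be generated from an existing product list. The sketch below writes only `url` and `sku` columns and drops rows that repeat a SKU. The helper name is hypothetical, and the full column requirements are in the Import guide, so treat the column set as an assumption.

```python
import csv
import io
from typing import Dict, Iterable


def build_import_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Return CSV text with url and sku columns, skipping duplicate SKUs."""
    seen = set()
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=["url", "sku"], lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Deduplicate on SKU when present, otherwise on the URL itself.
        key = row.get("sku") or row["url"]
        if key in seen:
            continue
        seen.add(key)
        writer.writerow({"url": row["url"], "sku": row.get("sku") or ""})
    return buffer.getvalue()
```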