
AvidiaExtract — Product Data Extraction

Fetch any manufacturer or supplier product page and extract clean, structured data automatically.

What AvidiaExtract Does

AvidiaExtract fetches a product URL — from a manufacturer, distributor, or supplier — and uses AI to parse the page into structured JSON fields. It handles HTML product pages, PDF datasheets, and JavaScript-rendered sites (with the browser extension).

The output is a normalized product record you can use as the input for all downstream modules: AvidiaDescribe for copy generation, AvidiaSEO for optimization, and your eCommerce platform for direct import.

Supported Input Types

  • Single URL — paste any product URL into the Extract UI and hit Extract. Ideal for testing and one-off products.
  • Batch URL list — paste up to 50 URLs at once using the bulk input field (newline-separated).
  • CSV import — upload a CSV with a url column to start a bulk extraction job. See the Import guide for column requirements.
  • API — submit extraction requests programmatically via POST /api/v1/ingest.
  • PDF datasheets — upload PDF files directly for extraction. Works best with machine-readable (non-scanned) PDFs.
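
As a rough illustration of the 50-URL limit on the bulk input field, the sketch below splits a longer list into newline-separated blocks that each fit in one paste. The helper name `chunk_urls` is hypothetical and is not part of AvidiaExtract.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 50  # bulk input limit stated in the list above


def chunk_urls(urls: List[str], size: int = MAX_BATCH_SIZE) -> Iterator[str]:
    """Yield newline-separated blocks of at most `size` URLs each."""
    # Drop blank entries and surrounding whitespace before batching.
    cleaned = [u.strip() for u in urls if u.strip()]
    for start in range(0, len(cleaned), size):
        yield "\n".join(cleaned[start:start + size])
```

Each yielded block can be pasted into the bulk input field as-is. For larger lists, the CSV import is likely the more convenient route.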

Extracted Fields

AvidiaExtract attempts to populate the following fields from every product page. Fields that cannot be found are returned as null.

Field | Example Value | Notes
product_name | 3M N95 Particulate Respirator 8210, 20/Box | Full product title as listed by manufacturer
brand | 3M | Manufacturer or brand name
sku | 8210 | Manufacturer part number or SKU
description | NIOSH-approved N95 respirator... | Raw manufacturer description text
specifications | {"filtration_efficiency": "≥95%", "style": "flat fold"} | Key-value pairs from spec tables
images | ["https://cdn.3m.com/..."] | All product image URLs found on the page
dimensions | {"length": "5.5in", "width": "3.5in"} | Physical size, extracted from specs or product data
weight | 4.5 oz | Shipping or product weight
price | 24.99 | Listed price if publicly available
availability | in_stock | in_stock, out_of_stock, or discontinued
categories | ["Safety", "Respiratory Protection", "N95"] | Breadcrumb or taxonomy categories from page
upc | 00051131070141 | UPC/EAN barcode if present
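
For client code that consumes these records, a typed container can make the null handling explicit. The following is a minimal sketch: the field names come from the table above, while the Python types and the helpers `from_json` and `missing_fields` are assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class ProductRecord:
    # Field names follow the table above; the Python types are assumptions.
    product_name: Optional[str] = None
    brand: Optional[str] = None
    sku: Optional[str] = None
    description: Optional[str] = None
    specifications: Optional[Dict[str, str]] = None
    images: Optional[List[str]] = None
    dimensions: Optional[Dict[str, str]] = None
    weight: Optional[str] = None  # shown as a string ("4.5 oz") in the table
    price: Optional[float] = None
    availability: Optional[str] = None
    categories: Optional[List[str]] = None
    upc: Optional[str] = None


def from_json(payload: dict) -> ProductRecord:
    """Build a record from a decoded JSON object, ignoring unknown keys."""
    known = {f.name for f in fields(ProductRecord)}
    return ProductRecord(**{k: v for k, v in payload.items() if k in known})


def missing_fields(record: ProductRecord) -> List[str]:
    """List the fields that came back as null (None in Python)."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]
```

Calling `missing_fields` on a freshly extracted record gives a quick view of what a manual review would need to fill in.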

How Extraction Works

  1. Fetch: AvidiaTech's crawler fetches the HTML source of the URL, following redirects and handling standard HTTP headers. Pages that require JavaScript rendering fall back to a headless browser.
  2. AI Parsing: The raw HTML is sent to the extraction model, which identifies product data signals: spec tables, structured data (JSON-LD, microdata), image tags, pricing elements, and description blocks.
  3. Field Normalization: Raw values are normalized: units are standardized (e.g., "5 lbs" → {"value": 5, "unit": "lb"}), prices are stripped of currency symbols, and image URLs are resolved to absolute paths.
  4. JSON Output: The normalized record is stored as an ingestion and made available in the UI, via the API, and as input to downstream modules.
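
The normalization step can be pictured with a small sketch. This is not AvidiaExtract's actual implementation: the unit alias table, the function names, and the assumption that prices use a dot as the decimal separator are all illustrative.

```python
import re
from typing import Optional
from urllib.parse import urljoin

# Hypothetical alias table; the real unit vocabulary is not documented here.
UNIT_ALIASES = {"lbs": "lb", "lb": "lb", "pounds": "lb",
                "oz": "oz", "ounces": "oz", "kg": "kg", "g": "g"}


def normalize_weight(raw: str) -> Optional[dict]:
    """Turn a string such as "5 lbs" into {"value": 5, "unit": "lb"}."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\.?\s*", raw)
    if not match:
        return None
    number, unit = match.groups()
    canonical = UNIT_ALIASES.get(unit.lower())
    if canonical is None:
        return None
    value = float(number)
    # Keep whole numbers as integers so "5 lbs" yields 5 rather than 5.0.
    return {"value": int(value) if value.is_integer() else value, "unit": canonical}


def normalize_price(raw: str) -> Optional[float]:
    """Strip currency symbols and thousands separators (dot-decimal assumed)."""
    cleaned = re.sub(r"[^\d.]", "", raw)
    if not re.fullmatch(r"\d+(?:\.\d+)?", cleaned):
        return None
    return float(cleaned)


def resolve_image_url(page_url: str, src: str) -> str:
    """Resolve a relative or protocol-relative image path against the page URL."""
    return urljoin(page_url, src)
```

Values that do not match a known pattern come back as `None`, which mirrors how fields that cannot be found are returned as null.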

Handling Difficult Pages

JavaScript-heavy sites

Sites like Grainger, MSC Direct, or custom B2B portals often load product data dynamically via JavaScript. Standard HTTP fetching captures an empty or skeleton page. Use the AvidiaExtract browser extension to extract from a fully rendered page in your Chrome session.

PDF datasheets

Upload PDFs via the Extract UI or via POST /api/v1/ingest with type: "pdf". Machine-readable PDFs (text-based) extract cleanly. Scanned image-only PDFs require OCR preprocessing.

Private / login-required pages

Distributor portals and private supplier catalogs often require authentication. The browser extension can extract from authenticated sessions. Alternatively, export the product data to CSV from the portal and use AvidiaTech's CSV import to ingest it.

Understanding Extraction Quality Scores

Every extraction is assigned a quality score from 0 to 100. This score reflects how completely the system was able to populate the standard field set.

  • 80–100: Excellent
  • 50–79: Good
  • 0–49: Review Needed

Scores below 50 typically mean critical fields like product name, description, or specifications could not be found. Filter your product list by score using the Score < 50 filter to identify products that need manual review or reprocessing.
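
If you export records and want to apply the same bands in your own tooling, a sketch like the following may help. The band thresholds come from the section above; the record key `quality_score` is a hypothetical name, since the exact field name is not given here.

```python
from typing import Dict, List


def score_band(score: int) -> str:
    """Map a 0-100 quality score to the band names used above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 80:
        return "Excellent"
    if score >= 50:
        return "Good"
    return "Review Needed"


def needs_review(products: List[Dict]) -> List[Dict]:
    """Equivalent of the Score < 50 filter; `quality_score` is an assumed key."""
    return [p for p in products if p["quality_score"] < 50]
```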

Reprocessing Failed Extractions

If an extraction fails or produces a low-quality result, you can reprocess it. The first reprocess is free and does not consume an additional extraction credit. From the product detail view, click Reprocess, then choose whether to use the cached page or re-fetch the URL.

Via the API, set "reprocess": true in the options object of your ingest request body to force a fresh extraction even if a cached result exists.

Rate Limits and Quotas

Plan | Monthly Extractions | Concurrent Jobs | API Rate Limit
Starter | 100 | 2 | 60 req/min
Growth | 5,000 | 10 | 300 req/min
Scale | Unlimited | 50 | 1,000 req/min
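
To stay under the API rate limit from the client side, one simple option is to space requests evenly. The sketch below assumes the per-minute figures in the table above and a hypothetical `Throttle` helper. How the server actually measures the window is not documented here, so treat this as a conservative approximation.

```python
import time

# Requests per minute by plan, taken from the table above.
PLAN_LIMITS = {"Starter": 60, "Growth": 300, "Scale": 1000}


class Throttle:
    """Space calls so that at most `requests_per_minute` are sent per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may go out

    def wait(self) -> float:
        """Block until the next request may be sent; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Call `Throttle(PLAN_LIMITS["Starter"]).wait()` before each request. The injectable `clock` and `sleep` arguments exist only to make the helper easy to test.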

API: Submit an Extraction

POST https://app.avidiatech.com/api/v1/ingest
Authorization: Bearer <your-api-key>
Content-Type: application/json

{
  "url": "https://www.boschtools.com/us/en/boschtools-ocs/cordless-drill-drivers-gsb-18v-755-06019H3110.html",
  "options": {
    "reprocess": false,
    "pipeline": ["extract", "describe", "seo"]
  }
}

// Response (202 Accepted)
{
  "id": "ing_01HX4K2QZRP7W9VMBGT3DENF8",
  "status": "queued",
  "created_at": "2024-03-15T14:22:00Z",
  "url": "https://www.boschtools.com/..."
}

Poll the result with GET /api/v1/ingest/:id or set up a webhook to be notified on completion. See the full API Reference.
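
The submit-and-poll flow might look like the following in Python, using only the standard library. The endpoint, headers, and request body mirror the example above. The polling interval, the attempt cap, and the "processing" status are assumptions, since only "queued" appears in this guide.

```python
import json
import time
import urllib.request

BASE_URL = "https://app.avidiatech.com/api/v1/ingest"


def build_ingest_request(api_key, url, reprocess=False, pipeline=("extract",)):
    """Build the POST request shown above without sending it."""
    body = {"url": url, "options": {"reprocess": reprocess, "pipeline": list(pipeline)}}
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def poll_ingestion(api_key, ingestion_id, fetch=send,
                   pending=("queued", "processing"),  # "processing" is assumed
                   interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call GET /api/v1/ingest/:id until the status leaves the pending set."""
    request = urllib.request.Request(
        f"{BASE_URL}/{ingestion_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    for _ in range(max_attempts):
        record = fetch(request)
        if record.get("status") not in pending:
            return record
        sleep(interval)
    raise TimeoutError(f"ingestion {ingestion_id} still pending after {max_attempts} polls")
```

A typical call sequence would be `created = send(build_ingest_request(key, product_url))` followed by `poll_ingestion(key, created["id"])`. For high volumes, a webhook avoids spending rate-limit budget on polling.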

Common Errors and Fixes

FETCH_BLOCKED

Cause: The page returned a 403 or CAPTCHA block.

Fix: Try the manufacturer's direct product URL rather than a category or search page. Some pages require the browser extension for JavaScript-heavy rendering.

LOW_CONTENT

Cause: Fewer than 3 fields were extracted. The page may be JavaScript-rendered or behind a login.

Fix: Use the AvidiaExtract browser extension, which runs extraction in your browser session and can access pages that require login or JavaScript.

PDF_PARSE_ERROR

Cause: Uploaded PDF was image-only (scanned, not machine-readable).

Fix: Run OCR on the PDF before uploading, or request a text-based datasheet directly from your supplier.

TIMEOUT

Cause: Page took more than 30 seconds to respond.

Fix: Check that the URL is publicly accessible, then try reprocessing. Some supplier sites are intermittently slow.

DUPLICATE_URL

Cause: This URL was already extracted and has an existing ingestion record.

Fix: Find the existing record in your product list, or use the reprocess option to force a fresh extraction.
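
In an automated pipeline, the error codes above can drive a simple triage step. The sketch below is illustrative only: the codes come from this section, but the step labels, the `auto_retry` flags, and the `triage` helper are assumptions rather than part of the API.

```python
from typing import NamedTuple


class Triage(NamedTuple):
    next_step: str    # hypothetical label for the fix described above
    auto_retry: bool  # whether a plain reprocess is a reasonable first try


# Codes come from the list above; the labels and retry flags are assumptions.
TRIAGE = {
    "FETCH_BLOCKED": Triage("use_direct_product_url_or_extension", False),
    "LOW_CONTENT": Triage("use_browser_extension", False),
    "PDF_PARSE_ERROR": Triage("run_ocr_or_request_text_pdf", False),
    "TIMEOUT": Triage("check_url_then_reprocess", True),
    "DUPLICATE_URL": Triage("find_existing_record_or_reprocess", False),
}


def triage(code: str) -> Triage:
    """Look up the suggested next step; unknown codes go to manual review."""
    return TRIAGE.get(code, Triage("manual_review", False))
```

Only TIMEOUT is marked for automatic retry here, because the other errors generally need a different input or the browser extension rather than a repeat of the same request.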

Tips for Best Results

  • Always use the manufacturer's own product page, not a retailer listing. Manufacturer pages have specification tables, official images, and authoritative data that score 10–20 points higher on average.
  • Ensure the URL is publicly accessible — test it in an incognito browser window before submitting.
  • For product families (e.g., Bosch 18V drill with multiple SKUs), extract the specific variant URL, not the product family overview page.
  • Include the sku field in your CSV imports — it helps deduplicate products and match records to your existing catalog.
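
As a small illustration of the last tip, the sketch below writes an import file with a url column and a sku column. Only these two column names appear in this guide; the helper `build_import_csv` is hypothetical, and the Import guide remains the reference for the full column requirements.

```python
import csv
import io
from typing import Dict, List


def build_import_csv(products: List[Dict[str, str]]) -> str:
    """Return CSV text with `url` and `sku` columns for a bulk extraction job."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["url", "sku"], lineterminator="\n")
    writer.writeheader()
    for product in products:
        # A missing SKU is written as an empty cell rather than dropped.
        writer.writerow({"url": product["url"], "sku": product.get("sku", "")})
    return buffer.getvalue()
```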