Index URLs

Index documents or web pages from public URLs into a collection. No cloud storage credentials required.

You can provide either:

- `url`: a single URL string
- `urls`: an array of URL strings

## Smart Content Detection

The endpoint automatically detects whether a URL points to a hosted file or a web page:

- **Hosted files** (PDF, DOCX, XLSX, CSV, TXT, images, etc.) are downloaded and processed directly through the indexing pipeline.
- **Web pages** (HTML) are automatically scraped: text content is extracted as markdown, and page images are downloaded and indexed. Bot-protected pages are handled via web unlocker technology.

## Supported Content

- **Documents**: PDF, DOCX, DOC, XLSX, XLS, CSV, TSV, TXT, MD, JSON, YAML, YML
- **Images**: PNG, JPG, JPEG, GIF, BMP, TIFF
- **Web pages**: Any public URL serving HTML (text and images are extracted automatically)

## Processing Modes for Web Pages

- **advanced**: Extracts text content as markdown AND downloads and indexes all page images
- **basic**: Extracts text content only (faster and lower cost)

Returns a `job_id` for tracking progress via `GET /v2/jobs/{job_id}`.
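The split between hosted files and web pages described above can be approximated on the client before submitting URLs. The sketch below is illustrative only: it assumes the file extension in the URL path is a good enough signal, whereas the endpoint itself may rely on other signals such as the response content type. The function name `guess_content_kind` is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the "Supported Content" list.
DOCUMENT_EXTS = {"pdf", "docx", "doc", "xlsx", "xls", "csv", "tsv",
                 "txt", "md", "json", "yaml", "yml"}
IMAGE_EXTS = {"png", "jpg", "jpeg", "gif", "bmp", "tiff"}


def guess_content_kind(url: str) -> str:
    """Guess how a URL might be treated, using only its path extension.

    Assumption: extension-based guessing is a client-side heuristic,
    not the endpoint's actual detection logic.
    """
    path = urlparse(url).path  # drops the query string and fragment
    ext = PurePosixPath(path).suffix.lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "document"
    if ext in IMAGE_EXTS:
        return "image"
    # Anything without a recognized extension is assumed to be HTML.
    return "web_page"
```

A pre-check like this could help estimate which URLs are affected by the web page processing modes before a job is submitted.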

Authentication

Authorization Bearer
Bearer token authentication using API key

Path parameters

collection_name string Required

Headers

X-Organization-ID string Required
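As a minimal sketch, the two required headers could be assembled as follows. The helper name `build_headers` is hypothetical, and the `Content-Type` header is an assumption based on the request body being an object; it is not stated in this reference.

```python
def build_headers(api_key: str, organization_id: str) -> dict:
    """Assemble the headers this endpoint requires."""
    if not api_key or not organization_id:
        raise ValueError("api_key and organization_id must be non-empty")
    return {
        # Bearer token authentication using the API key.
        "Authorization": f"Bearer {api_key}",
        "X-Organization-ID": organization_id,
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }
```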

Request

This endpoint expects an object.
processing_type enum Required

Processing mode. For hosted documents: ‘advanced’ enables AI-enhanced extraction for complex layouts, tables, figures, and charts; ‘basic’ provides standard document processing. For web pages: ‘advanced’ extracts both text content and page images; ‘basic’ extracts text content only (faster, lower cost).

Allowed values: advanced, basic
url string Optional

A single public URL to a document or web page. Hosted files (PDF, DOCX, etc.) are indexed directly. Web pages (HTML) are automatically scraped — text and images are extracted. Provide either ‘url’ or ‘urls’, not both.

urls list of strings Optional

An array of public URLs to documents or web pages. Each URL is auto-detected — hosted files are indexed directly, web pages are scraped. Provide either ‘url’ or ‘urls’, not both.
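The "either `url` or `urls`, not both" rule can be enforced on the client before a request is sent. The following is a sketch under stated assumptions: `build_index_request` is a hypothetical helper, and rejecting a request that supplies neither field is an assumption, since both fields are marked Optional here.

```python
from typing import Any, Dict, List, Optional

ALLOWED_PROCESSING_TYPES = ("advanced", "basic")


def build_index_request(
    processing_type: str,
    url: Optional[str] = None,
    urls: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a request body with exactly one of 'url' or 'urls'."""
    if processing_type not in ALLOWED_PROCESSING_TYPES:
        raise ValueError(f"processing_type must be one of {ALLOWED_PROCESSING_TYPES}")
    # True when both are given or both are missing.
    if (url is None) == (urls is None):
        raise ValueError("provide exactly one of 'url' or 'urls'")
    body: Dict[str, Any] = {"processing_type": processing_type}
    if url is not None:
        body["url"] = url
    else:
        body["urls"] = list(urls)
    return body
```

For example, `build_index_request("basic", url="https://example.com/report.pdf")` yields a body with `processing_type` and `url` only, while passing both `url` and `urls` raises `ValueError`.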

custom_metadata map from strings to any Optional

Custom metadata to attach to all indexed chunks. Keys must be strings. Each value must be a string, integer, float, boolean, or array of strings.
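The value constraints on `custom_metadata` can be checked locally before submission. This is a minimal sketch; `validate_custom_metadata` is a hypothetical helper, and how the endpoint itself reports invalid metadata is not covered by this reference.

```python
from typing import Any, Dict


def validate_custom_metadata(metadata: Dict[Any, Any]) -> None:
    """Raise TypeError if metadata breaks the documented constraints."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} is not a string")
        # Scalars: str, int, float, bool (bool is a subclass of int).
        if isinstance(value, (str, int, float, bool)):
            continue
        # The only allowed collection is a list of strings.
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        raise TypeError(f"metadata value for {key!r} has unsupported type")
```

Nested objects and lists of non-string items are rejected, since neither appears in the list of allowed value types.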

parsing_script string Optional

Relative path to a JavaScript parsing script for JSON files (e.g. ‘research/paper-parser’). When provided, .json files are processed through a sandboxed V8 isolate that executes the script to extract text and metadata. Without this parameter, .json files are indexed as raw text. Scripts are org-scoped and managed in the Parser Studio.

Response

Indexing job started
job_id string
status enum
Allowed values:
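Since the response only signals that a job has started, a client would typically poll `GET /v2/jobs/{job_id}` until the job finishes. The sketch below is deliberately transport-agnostic: `fetch_status` is a hypothetical caller-supplied function that performs the HTTP request and returns the status string, and `terminal_statuses` must be supplied by the caller because the allowed status values are not listed here.

```python
import time
from typing import Callable, Collection
from urllib.parse import quote


def job_path(job_id: str) -> str:
    """Path of the job-tracking endpoint for a given job_id."""
    return f"/v2/jobs/{quote(job_id, safe='')}"


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], str],   # hypothetical: path -> status string
    terminal_statuses: Collection[str],   # caller-supplied; not documented here
    interval_s: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or polls run out."""
    for attempt in range(max_polls):
        status = fetch_status(job_path(job_id))
        if status in terminal_statuses:
            return status
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before the next poll
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access; the polling interval and poll limit are arbitrary defaults, not recommended values.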