Index URLs | Captain Docs

Index documents from public URLs into a collection. No cloud storage credentials required.

You can provide either:

- url: a single URL string for one document
- urls: an array of URL strings for multiple documents

Supported file types: PDF, DOCX, DOC, XLSX, XLS, CSV, TSV, TXT, MD, JSON, YAML, YML, PNG, JPG, JPEG, GIF, BMP, TIFF. Documents are downloaded and processed through the same pipeline as cloud storage indexing.
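A client-side pre-check can catch obviously unsupported links before a job is submitted. The sketch below compares the extension in the URL path against the list above. The helper name `has_supported_extension` is hypothetical, and the check is only a convenience: how the service itself detects file types is not described here, so a URL without an extension may still be accepted.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions taken from the supported file types listed above.
SUPPORTED_EXTENSIONS = {
    "pdf", "docx", "doc", "xlsx", "xls", "csv", "tsv", "txt", "md",
    "json", "yaml", "yml", "png", "jpg", "jpeg", "gif", "bmp", "tiff",
}


def has_supported_extension(url: str) -> bool:
    """Return True if the URL path ends in one of the listed extensions."""
    # Only the path is inspected, so query strings and fragments are ignored.
    path = urlparse(url).path
    extension = PurePosixPath(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so `Scan.TIFF` and `scan.tiff` are treated the same way.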

The endpoint returns a job_id, which you can use to track progress via GET /v2/jobs/{job_id}.

Authentication

Authorization (Bearer)

Bearer authentication of the form Bearer <token>, where token is your auth token.

X-Organization-ID (string)

API Key authentication via header

Path parameters

collection_name (string, required)

Request

This endpoint expects an object.

processing_type (enum, required)

Document processing type. 'advanced' uses agentic OCR with AI-enhanced extraction for complex layouts, tables, figures, charts, and documents containing images. 'basic' provides reliable OCR optimized for general document indexing and high-volume processing.

Allowed values: advanced, basic

url (string, optional)

A single public URL to a hosted document. Supported types: PDF, DOCX, DOC, XLSX, XLS, CSV, TSV, TXT, MD, JSON, YAML, YML, PNG, JPG, JPEG, GIF, BMP, TIFF. Provide either 'url' or 'urls', not both.

urls (list of strings, optional)

An array of public URLs to hosted documents. Provide either 'url' or 'urls', not both.
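Because 'url' and 'urls' are mutually exclusive, it can help to build the request body in one place and reject invalid combinations before calling the API. The following is a minimal sketch: `build_index_payload` is a hypothetical helper, and treating "neither field provided" as an error is an assumption, since the reference only states that both must not be sent together.

```python
from typing import Any, Dict, Optional, Sequence


def build_index_payload(
    processing_type: str,
    url: Optional[str] = None,
    urls: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Build a request body containing exactly one of 'url' or 'urls'."""
    # Assumption: a request with neither field is also invalid.
    if (url is None) == (urls is None):
        raise ValueError("Provide exactly one of 'url' or 'urls'.")

    payload: Dict[str, Any] = {"processing_type": processing_type}
    if url is not None:
        payload["url"] = url
    else:
        # Copy into a plain list so the body serializes as a JSON array.
        payload["urls"] = list(urls)
    return payload
```

The returned dictionary can be passed as the `json=` argument of `requests.post`.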

custom_metadata (map from strings to any, optional)

Custom metadata to attach to all indexed chunks. Keys must be strings. Values: str, int, float, bool, or array of strings.
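The value rules above can be checked locally before a request is sent. This sketch assumes nothing beyond the types listed in the description. `find_metadata_problems` is a hypothetical helper, and accepting Python tuples as "arrays" is a convenience choice made here, not something the API specifies.

```python
from typing import Any, List, Mapping


def find_metadata_problems(metadata: Mapping[Any, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    for key, value in metadata.items():
        if not isinstance(key, str):
            problems.append(f"key {key!r} is not a string")
            continue
        # Scalars: str, int, float, bool (bool is a subclass of int in Python).
        if isinstance(value, (str, int, float, bool)):
            continue
        # Arrays are allowed only when every element is a string.
        if isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
            continue
        problems.append(
            f"value for {key!r} has unsupported type {type(value).__name__}"
        )
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every invalid entry at once.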

Response

Indexing job started

job_id (string)

status (enum)

Allowed values:

import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORG_ID = "your_organization_id"

# Both headers are required: the bearer token and the organization ID.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID,
    "Content-Type": "application/json",
}

# Index a single public document into the "my_documents" collection.
response = requests.post(
    f"{BASE_URL}/v2/collections/my_documents/index/url",
    headers=headers,
    json={
        "url": "https://example.com/documents/report.pdf",
        "processing_type": "advanced",
    },
    timeout=60.0,
)

if response.status_code in (200, 201):
    data = response.json()
    print(f"Job started! ID: {data['job_id']}")
else:
    print(f"Error: {response.status_code}")

{
  "job_id": "job_url_abc123",
  "status": "pending"
}
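
The returned job_id can be used to poll GET /v2/jobs/{job_id} until the job finishes. The sketch below shows one way to structure such a loop. It is built on assumptions: the helper names are hypothetical, the set of final status values is not listed in this reference and must be supplied by the caller, and the status lookup is injected as a function because the shape of the job response is not documented here.

```python
import time
from typing import Callable, Iterable


def job_status_url(base_url: str, job_id: str) -> str:
    """Build the job-tracking URL from the documented path."""
    return f"{base_url.rstrip('/')}/v2/jobs/{job_id}"


def wait_for_job(
    fetch_status: Callable[[str], str],
    job_id: str,
    terminal_statuses: Iterable[str],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until fetch_status returns one of terminal_statuses.

    fetch_status is a caller-supplied function that takes a job_id and
    returns the current status string. terminal_statuses is an assumption
    the caller provides; this page does not list the final values.
    """
    terminal = set(terminal_statuses)
    for attempt in range(max_attempts):
        status = fetch_status(job_id)
        if status in terminal:
            return status
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"Job {job_id} did not finish after {max_attempts} attempts")
```

In practice, `fetch_status` could wrap a `requests.get` call to `job_status_url(BASE_URL, job_id)` with the same headers as the indexing request, returning the status field if the job response contains one.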