Index R2 Bucket | Captain Docs

import json
import uuid

import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORGANIZATION_ID = "your_organization_id"  # placeholder organization UUID
COLLECTION_NAME = "my_documents"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORGANIZATION_ID,
    "Content-Type": "application/json",
    # Optional: UUID used for request deduplication.
    "Idempotency-Key": str(uuid.uuid4()),
}

# Start an indexing job for the R2 bucket.
response = requests.post(
    f"{BASE_URL}/v2/collections/{COLLECTION_NAME}/index/r2",
    headers=headers,
    json={
        "access_key_id": "example_access_key_id",
        "account_id": "example_account_id",
        "bucket_name": "example_bucket_name",
        "custom_metadata": {},
        "jurisdiction": "default",
        "max_files": 10,
        "processing_type": "advanced",
        "secret_access_key": "example_secret_access_key",
    },
    timeout=120.0,
)
print(json.dumps(response.json(), indent=2))

Index all files from a Cloudflare R2 bucket into a collection.

R2 is S3-compatible. Provide your R2 API token’s Access Key ID and Secret Access Key.

Headers:

Authorization: Bearer {api_key} - Captain API key for authentication
X-Organization-ID: Organization UUID
Idempotency-Key: UUID for request deduplication (optional)
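
The headers listed above could be assembled with a small helper. This is a minimal sketch: the function name build_headers and the placeholder credentials are hypothetical, and reusing one key across retries is an assumption about how deduplication behaves, not something this page states.

```python
import uuid
from typing import Dict, Optional


def build_headers(
    api_key: str,
    organization_id: str,
    idempotency_key: Optional[str] = None,
) -> Dict[str, str]:
    """Assemble the documented request headers (hypothetical helper)."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "X-Organization-ID": organization_id,
        "Content-Type": "application/json",
    }
    # Idempotency-Key is optional, so only send it when one was supplied.
    if idempotency_key is not None:
        headers["Idempotency-Key"] = idempotency_key
    return headers


# Assumption: generating the key once and reusing it on a retry is what
# allows the retried request to be recognized as a duplicate.
key = str(uuid.uuid4())
headers = build_headers("your_api_key", "your_organization_id", key)
```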

Args:

collection_name: Name of the collection (path parameter)
body: R2 bucket configuration

Returns: { job_id, status: “pending” }

Path parameters

collection_name string Required

Request

This endpoint expects an object.

access_key_id string Required

R2 S3 API token access key ID

account_id string Required

Cloudflare account ID (found in R2 dashboard URL)

bucket_name string Required

processing_type enum Required

Document processing type. ‘advanced’ uses agentic OCR with AI-enhanced extraction for complex layouts, tables, figures, charts, and documents containing images. ‘basic’ provides reliable OCR optimized for general document indexing and high-volume processing.

secret_access_key string Required

R2 S3 API token secret access key

custom_metadata map from strings to strings or integers or doubles or booleans or lists of strings or null Optional

Custom metadata to attach to all indexed chunks. Keys must be strings. Values: str, int, float, bool, or List[str].
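
A client-side check of custom_metadata before sending the request might look like the sketch below. The helper name validate_custom_metadata is hypothetical, and the check follows the description above (str, int, float, bool, or List[str]); whether the API also accepts null values inside the map is not stated here, so the sketch rejects them.

```python
from typing import Any, Dict


def validate_custom_metadata(metadata: Dict[Any, Any]) -> None:
    """Raise TypeError if a key or value falls outside the documented types."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        # Scalars: str, int, float, bool (bool is a subclass of int in Python).
        if isinstance(value, (str, int, float, bool)):
            continue
        # Lists are allowed only when every element is a string.
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        raise TypeError(
            f"metadata value for {key!r} must be str, int, float, bool, "
            f"or a list of str, got {type(value).__name__}"
        )


validate_custom_metadata({"team": "research", "year": 2024, "tags": ["a", "b"]})
```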

jurisdiction enum or null Optional Defaults to default

R2 jurisdiction. ‘default’ for global, ‘eu’ for EU-only, ‘fedramp’ for FedRAMP. Determines the S3-compatible endpoint URL.
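
To illustrate how a jurisdiction could map to an S3-compatible endpoint, the sketch below follows the endpoint pattern Cloudflare publishes for R2. It is an assumption that Captain resolves endpoints this way; the helper name r2_endpoint_url is hypothetical.

```python
# Subdomain inserted between the account ID and the R2 host,
# following Cloudflare's published jurisdictional endpoint pattern.
R2_JURISDICTION_SUFFIX = {
    "default": "",
    "eu": ".eu",
    "fedramp": ".fedramp",
}


def r2_endpoint_url(account_id: str, jurisdiction: str = "default") -> str:
    """Return the assumed S3-compatible endpoint for an account and jurisdiction."""
    if jurisdiction not in R2_JURISDICTION_SUFFIX:
        raise ValueError(f"unknown jurisdiction: {jurisdiction!r}")
    suffix = R2_JURISDICTION_SUFFIX[jurisdiction]
    return f"https://{account_id}{suffix}.r2.cloudflarestorage.com"


print(r2_endpoint_url("example_account_id", "eu"))
```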

max_files integer or null Optional

overwrite_existing boolean Optional Defaults to false

When true, files that already exist in the collection are deleted and re-indexed with the latest changes. Requires skip_existing=false; because skip_existing defaults to true, set it to false explicitly. Setting both to true returns a 400 error.

parsing_script string or null Optional

Relative path to a JS parsing script for JSON files (e.g. ‘research/paper-parser’). When provided, .json files are processed through a sandboxed V8 isolate. Without this, .json files are indexed as raw text.

skip_existing boolean Optional Defaults to true

When true, files already indexed in the collection are skipped and will not be re-indexed with incoming changes. When false, all incoming files are indexed regardless of whether they already exist.
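
Since overwrite_existing and skip_existing cannot both be true, a client may want to catch the conflict before the request is sent. The following is a minimal sketch; the helper name existing_file_flags is hypothetical, and the defaults mirror the ones documented here.

```python
from typing import Dict


def existing_file_flags(
    skip_existing: bool = True,
    overwrite_existing: bool = False,
) -> Dict[str, bool]:
    """Return the two flags for the request body, rejecting the invalid pair."""
    if skip_existing and overwrite_existing:
        # The API documents a 400 error for this combination; fail early instead.
        raise ValueError(
            "overwrite_existing=True requires skip_existing=False"
        )
    return {
        "skip_existing": skip_existing,
        "overwrite_existing": overwrite_existing,
    }


# Re-index files that already exist in the collection.
flags = existing_file_flags(skip_existing=False, overwrite_existing=True)
```

The returned dictionary can be merged into the request body alongside the bucket configuration.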

Response

Successful Response

job_id string

status string Defaults to pending

Errors

400

Bad Request Error
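
A caller might separate the successful response from the 400 case as sketched below. The helper name extract_job_id and the sample values are hypothetical, and the shape of the error body is not documented on this page, so the sketch only echoes whatever body it receives.

```python
from typing import Any, Dict


def extract_job_id(status_code: int, body: Dict[str, Any]) -> str:
    """Return job_id from a successful response, or raise on an error status."""
    if status_code == 400:
        # Error body format is an unknown here; include it verbatim for debugging.
        raise ValueError(f"Bad Request Error: {body}")
    if status_code >= 400:
        raise RuntimeError(f"request failed with status {status_code}: {body}")
    return body["job_id"]


# Hypothetical successful payload matching { job_id, status: "pending" }.
job_id = extract_job_id(200, {"job_id": "example_job_id", "status": "pending"})
```

With requests, this would be called as extract_job_id(response.status_code, response.json()).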