Index S3 Bucket

Index all files from an S3 bucket into a collection. Returns a job_id for tracking progress via GET /v2/jobs/{job_id}.

Authentication

Authorization Bearer
Bearer token authentication using an API key.

Path parameters

collection_name string Required

Headers

X-Organization-ID string Optional
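
As a minimal sketch of the authentication and header fields above, the helper below assembles the request headers. The function name `build_headers` and its parameters are hypothetical; only the `Authorization` bearer scheme and the optional `X-Organization-ID` header come from this page.

```python
from typing import Dict, Optional


def build_headers(api_key: str, organization_id: Optional[str] = None) -> Dict[str, str]:
    """Build request headers: a bearer token plus the optional organization header.

    `build_headers` is a hypothetical helper, not part of any official SDK.
    """
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    headers = {"Authorization": f"Bearer {api_key}"}
    # X-Organization-ID is documented as optional, so send it only when supplied.
    if organization_id is not None:
        headers["X-Organization-ID"] = organization_id
    return headers
```

Omitting the header entirely, rather than sending an empty value, keeps the request closest to what the reference describes for the optional case.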

Request

This endpoint expects an object.
bucket_name string Required
Name of the S3 bucket
processing_type enum Required

Document processing type. ‘advanced’ uses agentic OCR with AI-enhanced extraction for complex layouts, tables, figures, charts, and documents containing images. ‘basic’ provides reliable OCR optimized for general document indexing and high-volume processing.

Allowed values: advanced, basic
bucket_region string Optional Defaults to us-east-1
AWS region where the bucket is located
aws_access_key_id string Optional

AWS access key ID with read access to the bucket. Use this for long-lived IAM-user credentials. Omit when using the role-based ‘auth’ block.

aws_secret_access_key string Optional

AWS secret access key. Use this for long-lived IAM-user credentials. Omit when using the role-based ‘auth’ block.

auth object Optional

Cross-account role-assumption auth. When provided, Captain calls sts:AssumeRole on the supplied role_arn instead of using static IAM-user keys. Mutually exclusive with aws_access_key_id/aws_secret_access_key.
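
The mutual exclusivity between static keys and the `auth` block can be checked client-side before sending a request. The sketch below is illustrative only: `credential_fields` is a hypothetical helper, the `auth` object is assumed to carry just `role_arn` (the only field named on this page; the real object may require more), and requiring both static keys together is an assumption based on how AWS key pairs work rather than a rule stated here.

```python
from typing import Any, Dict, Optional


def credential_fields(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    role_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Return the credential portion of the request body, or raise on a conflict."""
    has_static = access_key_id is not None or secret_access_key is not None
    # Documented rule: the auth block and the static keys are mutually exclusive.
    if role_arn is not None and has_static:
        raise ValueError("auth is mutually exclusive with aws_access_key_id/aws_secret_access_key")
    if role_arn is not None:
        # Assumption: only role_arn is shown; other auth fields are not documented here.
        return {"auth": {"role_arn": role_arn}}
    if has_static:
        # Assumption: an AWS key ID is only usable together with its secret.
        if access_key_id is None or secret_access_key is None:
            raise ValueError("aws_access_key_id and aws_secret_access_key must be supplied together")
        return {
            "aws_access_key_id": access_key_id,
            "aws_secret_access_key": secret_access_key,
        }
    # All three fields are optional, so an empty result is passed through as-is.
    return {}
```

The returned dictionary can be merged into the rest of the request body, so a conflicting combination fails locally instead of on the server.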

max_files integer Optional

Maximum number of files to index.

skip_existing boolean Optional Defaults to true

When true, files already indexed in the collection are skipped, so incoming changes to those files are not picked up. When false, all incoming files are indexed regardless of whether they already exist.

custom_metadata map from strings to any Optional

Custom metadata to attach to all indexed chunks. Keys must be strings. Values must be str, int, float, bool, or an array of strings.
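
The key and value constraints above can be checked before a request is sent. This is a sketch under the assumption that the listed types map onto Python's `str`, `int`, `float`, `bool`, and `list` of `str`; the helper name `validate_custom_metadata` is hypothetical, and the server may apply further rules not described here.

```python
from typing import Any, Mapping


def validate_custom_metadata(metadata: Mapping[Any, Any]) -> None:
    """Raise TypeError if a key or value falls outside the documented types."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} must be a string")
        # Scalars: str, int, float, bool (bool is a subclass of int in Python).
        if isinstance(value, (str, int, float, bool)):
            continue
        # Arrays are allowed only when every element is a string.
        if isinstance(value, list) and all(isinstance(item, str) for item in value):
            continue
        raise TypeError(
            f"metadata value for {key!r} must be str, int, float, bool, or a list of strings"
        )
```

For example, `{"team": "research", "year": 2024, "tags": ["draft", "internal"]}` passes, while a nested object or a list of numbers is rejected.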

parsing_script string Optional

Relative path to a JavaScript parsing script for JSON files (e.g. ‘research/paper-parser’). When provided, .json files are processed through a sandboxed V8 isolate that executes the script to extract text and metadata. Without this parameter, .json files are indexed as raw text. Scripts are org-scoped and managed in the Parser Studio.

overwrite_existing boolean Optional Defaults to false

When true, files that already exist in the collection are deleted and re-indexed with the latest changes. Requires skip_existing=false; setting both to true returns a 400 error.
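
The interaction between skip_existing and overwrite_existing can be summarized as a small decision function. The sketch below mirrors the documented defaults and the documented 400 case; the function name and the returned labels (`"skip"`, `"overwrite"`, `"index_all"`) are hypothetical and exist only to name the three valid combinations.

```python
def existing_file_behavior(skip_existing: bool = True, overwrite_existing: bool = False) -> str:
    """Describe how already-indexed files are handled for a flag combination."""
    # Documented: setting both flags to true is rejected with a 400 error.
    if skip_existing and overwrite_existing:
        raise ValueError("skip_existing and overwrite_existing cannot both be true")
    if skip_existing:
        return "skip"  # existing files are left untouched
    if overwrite_existing:
        return "overwrite"  # existing files are deleted and re-indexed
    return "index_all"  # all incoming files are indexed regardless of existing ones
```

Because skip_existing defaults to true, a request that wants overwriting has to set skip_existing to false explicitly as well as overwrite_existing to true.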

Response

Indexing Job Started
job_id string
status enum
Allowed values:
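
Since the returned job_id is meant for tracking progress via GET /v2/jobs/{job_id}, a polling loop is one way to wait for completion. The sketch below is hedged in several ways: `wait_for_job` and `fetch_job` are hypothetical names, the job response is assumed to expose a `status` field like the one above, and because the allowed status values are not listed on this page the caller has to supply the set that counts as finished.

```python
import time
from typing import Any, Callable, Collection, Dict


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    terminal_statuses: Collection[str],
    interval_seconds: float = 5.0,
    max_polls: int = 120,
) -> Dict[str, Any]:
    """Poll until the job reports one of the caller-supplied terminal statuses.

    `fetch_job` is any callable that performs GET /v2/jobs/{job_id} and
    returns the decoded JSON body; it is injected so this sketch stays
    independent of a particular HTTP client or base URL.
    """
    for attempt in range(max_polls):
        job = fetch_job(job_id)
        # Assumption: the job response carries a "status" field.
        if job.get("status") in terminal_statuses:
            return job
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"job {job_id} did not reach a terminal status after {max_polls} polls")
```

Injecting `fetch_job` also makes the loop easy to exercise with a stub before pointing it at the real endpoint.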