Index S3 Bucket | Captain Docs

import requests
import time
BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORG_ID = "your_organization_id"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID,
    "Content-Type": "application/json"
}
# Start indexing job
response = requests.post(
    f"{BASE_URL}/v2/collections/my_documents/index/s3",
    headers=headers,
    json={
        "bucket_name": "my-s3-bucket",
        "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
        "aws_secret_access_key": "your_secret_key",
        "bucket_region": "us-east-1",
        "processing_type": "advanced"
    },
    timeout=60.0
)
if response.status_code in [200, 201]:
    result = response.json()
    job_id = result.get("job_id")
    print(f"Indexing started! Job ID: {job_id}")
    
    # Poll for status using v2 jobs endpoint
    while True:
        status_resp = requests.get(
            f"{BASE_URL}/v2/jobs/{job_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30.0
        )
        status_resp.raise_for_status()  # stop polling on an HTTP error instead of parsing an error body
        status = status_resp.json()
        print(f"Status: {status.get('progress_message')}")
        
        if status.get('status') in ['completed', 'failed', 'cancelled']:
            print(f"Job {status.get('status')}!")
            break
        time.sleep(5)
else:
    print(f"Error: {response.status_code}")
    print(response.json())

{
  "job_id": "abc123xyz-1234567890",
  "status": "pending"
}

Index all files from an S3 bucket into a collection. Returns a job_id for tracking progress via GET /v2/jobs/{job_id}.
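The polling loop in the example runs until the job reaches a terminal state, so a stalled job would block the caller indefinitely. Below is a minimal sketch of the same loop with a deadline. `wait_for_job` and its `fetch_status` argument are hypothetical helpers, not part of the Captain API; the terminal states are the ones used in the example above, and the full set of status values is not documented here.

```python
import time

# Terminal states as used in the example on this page (may not be exhaustive).
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, job_id, interval=5.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(job_id) until the job is terminal or the deadline passes.

    fetch_status is any callable that returns the parsed JSON body of
    GET /v2/jobs/{job_id} as a dict (hypothetical; supply your own).
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Job {job_id} did not finish within {timeout} seconds"
            )
        sleep(interval)
```

In practice `fetch_status` would wrap the `requests.get` call from the example and return `status_resp.json()`. Passing it in as a callable keeps the loop free of network code, which makes it easy to exercise with a stub.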

Path parameters

collection_name string Required

Name of the collection to index into

Request

This endpoint expects an object.

bucket_name string Required

Name of the S3 bucket

aws_access_key_id string Required

AWS access key ID with read access to the bucket

aws_secret_access_key string Required

AWS secret access key

processing_type enum Required

Document processing type. ‘advanced’ uses agentic OCR with AI-enhanced extraction for complex layouts, tables, figures, charts, and documents containing images. ‘basic’ provides reliable OCR optimized for general document indexing and high-volume processing.

Allowed values: advanced, basic

bucket_region string Optional, defaults to us-east-1

AWS region where the bucket is located

max_files integer Optional

Maximum number of files to index

skip_existing boolean Optional, defaults to true

Skip files that are already indexed in the collection. When true, only new files will be indexed. Set to false to re-index all files.
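As an illustration of how the required and optional fields fit together, here is a minimal sketch that assembles the request body. `build_s3_index_payload` is a hypothetical helper, not part of any Captain SDK. It leaves out optional fields that are not set so the server-side defaults described above apply, and it assumes the two `processing_type` values named in the description are the only ones accepted.

```python
# Values named in the processing_type description (assumed to be the full set).
PROCESSING_TYPES = ("advanced", "basic")


def build_s3_index_payload(bucket_name, aws_access_key_id, aws_secret_access_key,
                           processing_type, bucket_region=None, max_files=None,
                           skip_existing=None, custom_metadata=None):
    """Build the JSON body for POST /v2/collections/{collection_name}/index/s3."""
    if processing_type not in PROCESSING_TYPES:
        raise ValueError(f"processing_type must be one of {PROCESSING_TYPES}")

    payload = {
        "bucket_name": bucket_name,
        "aws_access_key_id": aws_access_key_id,
        "aws_secret_access_key": aws_secret_access_key,
        "processing_type": processing_type,
    }
    optional = {
        "bucket_region": bucket_region,
        "max_files": max_files,
        "skip_existing": skip_existing,
        "custom_metadata": custom_metadata,
    }
    # Only send optional fields the caller actually set.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

For example, passing `skip_existing=False` requests a full re-index, while omitting it keeps the documented default of skipping files that are already indexed.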

custom_metadata map from strings to any Optional

Custom metadata to attach to all indexed chunks. Keys must be strings; each value must be a str, int, float, bool, or an array of strings.
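The type rules for `custom_metadata` can be checked on the client before a request is sent. The sketch below is one possible check, assuming the rules are exactly as stated above. `validate_custom_metadata` is a hypothetical helper, and the server may apply further constraints that this page does not describe.

```python
def validate_custom_metadata(metadata):
    """Return metadata unchanged if it matches the documented shape, else raise TypeError."""
    if not isinstance(metadata, dict):
        raise TypeError("custom_metadata must be a mapping")

    for key, value in metadata.items():
        if not isinstance(key, str):
            raise TypeError(f"metadata key {key!r} is not a string")
        # Scalars: str, int, float, bool (bool is already a subclass of int).
        if isinstance(value, (str, int, float, bool)):
            continue
        # Arrays are allowed only when every element is a string.
        if isinstance(value, list) and all(isinstance(item, str) for item in value):
            continue
        raise TypeError(f"metadata value for {key!r} has unsupported type {type(value).__name__}")
    return metadata
```

Nested objects, `None`, and mixed-type arrays are rejected here because the description does not list them as allowed values.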

Response

Indexing Job Started

job_id string

status enum

Allowed values:
