Integrate Captain with Your Data Lake
Connect Captain to your cloud storage (AWS S3 or Google Cloud Storage) to index and query your entire data lake. This integration turns your existing file storage infrastructure into persistent, searchable databases.
Overview
Captain’s Data Lake Integration consists of three key components working together:
- Tagger - Scans your cloud storage, processes files, and extracts structured data
- DeepQuery - Executes natural language queries against indexed data
- Cloud Storage - Your AWS S3 or GCS buckets containing documents
Getting Started
Prerequisites
- Captain API key (get one at runcaptain.com/studio)
- AWS S3 bucket or Google Cloud Storage bucket
- Cloud storage credentials (see Cloud Credentials Guide)
Authentication Setup
All Captain API endpoints use header-based authentication. Set up your credentials:
Required headers for all requests:
Authorization: Bearer {your_api_key}
X-Organization-ID: {your_org_id}
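In Python, the two required headers can be built once and attached to every request. A minimal sketch (the key and org ID values are placeholders):

```python
# Build the two required Captain auth headers (values are placeholders).
def captain_headers(api_key: str, org_id: str) -> dict:
    return {
        "Authorization": f"Bearer {api_key}",
        "X-Organization-ID": org_id,
    }

headers = captain_headers("sk-example-key", "org-123")
```

Pass this dict as the `headers` argument to whatever HTTP client you use.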
Quick Start (4 Steps)
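The four steps below are assembled from the sections that follow: create a database, index a bucket, poll the job, then query. The endpoint paths and field names here are assumptions, not confirmed API details; see the API Reference for the real contract.

```python
# Hypothetical 4-step flow; paths and field names are assumptions.
BASE = "https://api.runcaptain.com"  # assumed base URL

steps = [
    # 1. Create a database to hold the index
    ("POST", f"{BASE}/v1/databases", {"name": "my-data-lake"}),
    # 2. Index an S3 (or GCS) bucket into it
    ("POST", f"{BASE}/v1/index/s3", {"bucket_name": "my-bucket", "database_name": "my-data-lake"}),
    # 3. Poll the indexing job until it completes
    ("GET", f"{BASE}/v1/index/status/{{job_id}}", None),
    # 4. Ask a natural language question with DeepQuery
    ("POST", f"{BASE}/v1/deepquery", {"query": "What were Q3 revenues?", "database_name": "my-data-lake"}),
]
```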
Tagger: Intelligent File Indexing
The Tagger service automatically scans your cloud storage, extracts content from files, and creates searchable indexes.
Supported File Types
Documents:
- PDF (.pdf) - Up to 512MB with automatic page chunking
- Microsoft Word (.docx)
- Text files (.txt)
- Markdown (.md)
- Rich Text Format (.rtf)
- OpenDocument Text (.odt)
Spreadsheets & Data:
- Microsoft Excel (.xlsx, .xls) - Row-based chunking
- CSV (.csv) - Header preservation
- JSON (.json)
Presentations:
- Microsoft PowerPoint (.pptx, .ppt)
Images (with OCR & Computer Vision):
- JPEG (.jpg, .jpeg)
- PNG (.png)
- BMP (.bmp) - Experimental
- GIF (.gif) - Experimental
- TIFF (.tiff) - Experimental
Code Files:
- Python (.py)
- TypeScript (.ts)
- JavaScript (.js)
- HTML (.html)
- CSS (.css)
- PHP (.php)
- Java (.java)
Web Content:
- XML (.xml)
How Tagger Works
1. Discovery: Scans your bucket and identifies all supported files
2. Processing: Extracts text, images, and metadata from each file
3. Chunking: Intelligently splits large files into searchable chunks
4. Tagging: Generates AI-powered tags and summaries for each chunk
5. Indexing: Stores processed data in your Captain database
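Conceptually, the chunking and tagging stages look something like the sketch below. This is illustrative only; Captain runs this pipeline server-side, and the chunk size and output shape here are assumptions.

```python
# Illustrative sketch of chunking + tagging; Captain's actual server-side
# implementation and chunk sizes are not documented here.
def chunk_and_tag(path: str, text: str, chunk_size: int = 1000) -> list:
    # Chunking: split large files into fixed-size searchable chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Tagging: attach per-chunk metadata (placeholder for the AI tagger)
    return [{"source": path, "chunk": c, "tags": []} for c in chunks]
```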
Processing Time:
- Average: 2-5 seconds per file
- Large PDFs (100+ pages): 10-30 seconds
- Entire bucket (1000 files): 45-120 seconds
AWS S3 Integration
Index Entire S3 Bucket
Index all files in an S3 bucket:
Response:
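The exact request and response bodies are not shown here; a plausible sketch, in which the field names (`bucket_name`, `job_id`, and so on) are assumptions:

```python
# Hypothetical request/response shapes for indexing a whole S3 bucket.
request_body = {
    "bucket_name": "my-company-data",
    "database_name": "data-lake",
    "aws_access_key_id": "AKIA...",       # placeholder; never hardcode real keys
    "aws_secret_access_key": "<secret>",  # placeholder
}

# A job-style response might look like:
response_body = {
    "job_id": "job_abc123",
    "status": "processing",
    "files_discovered": 1000,
}
```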
Index Single S3 File
Index a specific file from S3:
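A sketch of a single-file request body; the field names are assumptions:

```python
# Hypothetical request body for indexing one S3 object.
request_body = {
    "bucket_name": "my-company-data",
    "file_key": "reports/q3-2024.pdf",  # the object key within the bucket
    "database_name": "data-lake",
}
```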
AWS IAM Permissions
Your AWS credentials need these permissions:
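At minimum, read access to the bucket and its objects is typically required. A policy along these lines should work, but verify the exact action list against the Cloud Credentials Guide:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to the object ARN (`/*`), so both resource forms are needed.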
See: Cloud Credentials Guide for detailed AWS setup instructions.
Google Cloud Storage Integration
Index Entire GCS Bucket
Index all files in a GCS bucket:
Response:
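As with S3, the real request and response bodies are elided here; a hedged sketch, with field names assumed:

```python
# Hypothetical request/response shapes for indexing a whole GCS bucket.
request_body = {
    "bucket_name": "my-company-data",
    "database_name": "data-lake",
    # A service account credential blob (truncated placeholder):
    "gcs_credentials": {"type": "service_account", "project_id": "my-project"},
}

response_body = {
    "job_id": "job_def456",
    "status": "processing",
    "files_discovered": 250,
}
```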
Index Single GCS File
Index a specific file from GCS:
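A sketch of a single-object request body, mirroring the S3 variant (field names assumed):

```python
# Hypothetical request body for indexing one GCS object.
request_body = {
    "bucket_name": "my-company-data",
    "file_key": "contracts/msa-2024.pdf",  # object path within the bucket
    "database_name": "data-lake",
}
```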
GCS IAM Permissions
Your service account needs this role:
Or custom permissions:
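A read-only setup along these lines is typical; confirm the exact requirements in the Cloud Credentials Guide:

```
# Predefined role:
roles/storage.objectViewer

# Or equivalent custom permissions:
storage.objects.get
storage.objects.list
storage.buckets.get
```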
See: Cloud Credentials Guide for detailed GCS setup instructions.
Monitoring Indexing Jobs
Get Indexing Status
Check the progress of your indexing job:
Response:
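A status response might carry a job state plus progress counters; the shape below is an assumption:

```python
# Hypothetical status payload for a running indexing job.
status = {
    "job_id": "job_abc123",
    "status": "processing",  # e.g. queued | processing | completed | failed
    "files_processed": 412,
    "files_total": 1000,
}

progress = status["files_processed"] / status["files_total"]
```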
Get Detailed Step Function Status
Get detailed execution status with ETA:
Response:
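Since Tagger runs on AWS Step Functions (see Architecture below), a detailed status likely exposes the execution state and an ETA. All field names here are assumptions:

```python
# Hypothetical detailed Step Function status payload.
detailed = {
    "job_id": "job_abc123",
    "execution_arn": "arn:aws:states:...",  # truncated placeholder
    "execution_status": "RUNNING",
    "files_processed": 412,
    "files_total": 1000,
    "eta_seconds": 75,
}
```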
Cancel Indexing Job
Stop a running indexing job:
Response:
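A cancellation response might confirm the final job state; the verb, path, and fields are assumptions:

```python
# Hypothetical cancel request/response for a running job.
cancel_request = ("POST", "/v1/index/cancel/job_abc123")
cancel_response = {
    "job_id": "job_abc123",
    "status": "cancelled",
    "files_processed_before_cancel": 412,
}
```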
DeepQuery: Natural Language Queries
Once your data is indexed, use DeepQuery to ask questions in natural language.
Query Parameters
Execute Query with AI Inference
Response:
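The parameter names below are inferred from the surrounding sections (streaming, file filtering) and are assumptions, not the confirmed contract:

```python
# Hypothetical DeepQuery request/response shapes.
query_request = {
    "query": "What were the key findings in the Q3 compliance audit?",
    "database_name": "data-lake",
    "stream": False,            # see "Query with Streaming" below
    "file_filter": ["audits/"], # see "Use File Filtering" below
}

query_response = {
    "answer": "...",  # LLM-generated answer grounded in indexed chunks
    "sources": [{"file": "audits/q3.pdf", "chunk_id": 17}],
    "tokens_used": 1843,
}
```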
Get Raw Search Results (Custom RAG Pipelines)
Skip LLM inference and get raw vector search results for custom RAG implementations:
Response:
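When skipping inference, the response would be the ranked chunks themselves rather than a generated answer. A sketch with assumed field names:

```python
# Hypothetical raw-search request/response for a custom RAG pipeline.
raw_request = {
    "query": "termination clauses in vendor contracts",
    "database_name": "data-lake",
    "skip_inference": True,  # field name is an assumption
    "top_k": 5,
}

raw_response = {
    "results": [
        {"file": "contracts/vendor-msa.pdf", "chunk_id": 12,
         "score": 0.87, "text": "..."},
    ]
}
```

You would then feed `results` into your own context builder and LLM call.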
Query with Streaming
Enable real-time streaming for immediate results:
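If the stream is delivered as server-sent events (an assumption; the wire format is not documented here), the client side reduces to extracting `data:` lines:

```python
# Minimal SSE-style parser; the "data: " framing is an assumption.
def read_stream(lines):
    for line in lines:
        if line.startswith("data: "):
            yield line[len("data: "):]

# Example: three frames, one of which is a comment line and is skipped.
chunks = list(read_stream(["data: The", "data:  answer", ": keep-alive"]))
```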
Idempotency
Prevent duplicate processing with idempotency keys:
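A common pattern is to generate one key per logical operation and resend it on retries, so a retried request is recognized and not processed twice. The header name below is an assumption:

```python
import uuid

# One key per logical operation; reuse it verbatim on retries.
idempotency_key = str(uuid.uuid4())
headers = {"Idempotency-Key": idempotency_key}  # header name is an assumption
```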
Database Management
Create Database
List Databases
List Files in Database
Delete Single File
Wipe Database
Clear all files while keeping the database structure:
Delete Database
Permanently delete a database and all its files:
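The six management operations above plausibly map onto a REST surface like the one sketched here; every path and verb is an assumption, so check the API Reference before relying on them:

```python
# Hypothetical REST mapping for the database management operations.
BASE = "/v1/databases"

operations = {
    "create":      ("POST",   BASE,                      {"name": "data-lake"}),
    "list":        ("GET",    BASE,                      None),
    "list_files":  ("GET",    f"{BASE}/data-lake/files", None),
    # Object keys with slashes would need URL-encoding:
    "delete_file": ("DELETE", f"{BASE}/data-lake/files/reports%2Fq3.pdf", None),
    "wipe":        ("POST",   f"{BASE}/data-lake/wipe",  None),
    "delete":      ("DELETE", f"{BASE}/data-lake",       None),
}
```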
Architecture
How It All Works Together
Component Details
Tagger:
- AWS Lambda-based serverless architecture
- Parallel processing for high throughput
- Automatic retry logic for failed files
- Redis job tracking
- Average: 200-500 files/minute
DeepQuery:
- AWS Step Functions orchestration
- Multi-stage pipeline:
  - Query parsing
  - Semantic search
  - Relevance ranking
  - Context building
  - LLM generation
- Average query time: 2-5 seconds
- Supports streaming responses
Storage:
- RDS Aurora PostgreSQL for structured data
- S3 for large file storage
- Vector embeddings for semantic search
- Automatic backups and replication
Best Practices
Organizing Databases
By Project:
By Department:
Avoid:
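The original naming examples are not shown; the names below are hypothetical illustrations of the pattern:

```python
# Hypothetical database names illustrating the conventions above.
by_project = ["acme-website", "mobile-app-v2"]
by_department = ["legal-contracts", "hr-policies"]
avoid = ["everything"]  # one catch-all database mixing unrelated data
```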
Re-indexing Strategy
Files are automatically re-indexed when you run another indexing job against the same bucket:
Behavior:
- Existing files with same name are soft-deleted
- New versions are indexed
- Maintains query history
Query Optimization
Specific > General:
Include Context:
Use File Filtering:
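The documentation's own examples are not shown; the queries below are hypothetical illustrations of the three tips, and the `file_filter` field name is an assumption:

```python
# Vague vs. specific phrasing (illustrative):
vague = "Tell me about sales"
specific = "What were enterprise SaaS sales in Q3 2024, broken down by region?"

# Narrowing the search space with a file filter (field name assumed):
filtered_query = {
    "query": specific,
    "database_name": "data-lake",
    "file_filter": ["sales/2024/"],
}
```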
Environment Isolation
Use different API keys for dev/staging/prod:
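One way to wire this up is to select the key from environment variables rather than hardcoding a single key everywhere. The variable names here are conventions of this sketch, not documented requirements:

```python
import os

# Pick the API key for the current environment (variable names are
# illustrative conventions, not documented requirements).
env = os.environ.get("CAPTAIN_ENV", "dev")
api_key = {
    "dev": os.environ.get("CAPTAIN_API_KEY_DEV", ""),
    "staging": os.environ.get("CAPTAIN_API_KEY_STAGING", ""),
    "prod": os.environ.get("CAPTAIN_API_KEY_PROD", ""),
}[env]
```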
This ensures:
- Development testing doesn’t affect production
- Complete data isolation
- Separate token usage tracking
Use Cases
Enterprise Knowledge Base
Index company wikis, documentation, and internal resources for instant employee access.
Legal Document Management
Index contracts, case files, and legal research for rapid clause extraction and risk analysis.
Customer Support Automation
Index support tickets, documentation, and FAQs to power AI-driven support responses.
Compliance & Audit
Index financial reports, audit logs, and compliance documents for regulatory analysis.
Research & Development
Index scientific papers, patents, and research notes for literature review and prior art searches.
Code Repository Search
Index entire codebases for semantic code search, documentation generation, and technical debt analysis.
Error Handling
Common Errors
Invalid Credentials:
Bucket Not Found:
Database Not Found:
File Type Not Supported:
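The actual error payloads are not reproduced here; plausible shapes for the four cases above, with status codes and error identifiers assumed:

```python
# Hypothetical error payloads for the common failure cases.
common_errors = [
    {"status": 401, "error": "invalid_credentials",
     "message": "API key or cloud credentials are missing or invalid"},
    {"status": 404, "error": "bucket_not_found",
     "message": "Bucket does not exist or is not accessible"},
    {"status": 404, "error": "database_not_found",
     "message": "No database with that name in this organization"},
    {"status": 415, "error": "unsupported_file_type",
     "message": "File extension is not in the supported list"},
]
```

Checking `status` first and then branching on the `error` identifier keeps client-side handling stable even if messages change.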
Rate Limits & Quotas
Contact support@runcaptain.com to upgrade to Premium tier.
Related Documentation
- Infinite-Responses API - Process massive text inputs without indexing
- Cloud Credentials Guide - Detailed AWS & GCS setup
- API Reference - Complete endpoint documentation
- Getting Started - Quick start guide