Integrate Captain with Your Data Lake

Connect Captain to your cloud storage (AWS S3 or Google Cloud Storage) to index and query your entire data lake. This integration turns your existing file storage infrastructure into persistent, searchable databases.


Overview

Captain’s Data Lake Integration consists of three key components working together:

  1. Tagger - Scans your cloud storage, processes files, and extracts structured data
  2. DeepQuery - Executes natural language queries against indexed data
  3. Cloud Storage - Your AWS S3 or GCS buckets containing documents

┌─────────────────┐
│   Your Cloud    │
│  Storage (S3/   │
│      GCS)       │
└────────┬────────┘
         │ Tagger scans & indexes
┌─────────────────┐
│     Captain     │
│    Database     │
└────────┬────────┘
         │ DeepQuery executes
┌─────────────────┐
│  Natural Lang.  │
│     Queries     │
└─────────────────┘

Getting Started

Prerequisites

Authentication Setup

All Captain API endpoints use header-based authentication. Set up your credentials:

import requests

# Step 1: Set your credentials
API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Step 2: Create headers dictionary (reuse for all requests)
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

# Step 3: Use headers in every request
response = requests.post(
    "https://api.runcaptain.com/v1/...",
    headers=headers,
    data={...}
)

Required headers for all requests:

  • Authorization: Bearer {your_api_key}
  • X-Organization-ID: {your_org_id}
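
If you prefer not to rebuild the headers dictionary in every snippet, a requests.Session can carry both required headers for you. A minimal sketch (the list-databases call shown is simply the endpoint covered later in this guide):

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

# Attach the required headers once; every request made through
# this session inherits them automatically.
session = requests.Session()
session.headers.update({
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
})

response = session.post("https://api.runcaptain.com/v1/list-databases")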

Quick Start (4 Steps)

import requests

# Set your credentials
API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
BASE_URL = "https://api.runcaptain.com"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

# 1. Create a database
requests.post(
    f"{BASE_URL}/v1/create-database",
    headers=headers,
    data={
        'database_name': 'my_documents'
    }
)

# 2. Index your S3 bucket (Tagger will scan all files)
index_response = requests.post(
    f"{BASE_URL}/v1/index-s3",
    headers=headers,
    data={
        'database_name': 'my_documents',
        'bucket_name': 'my-s3-bucket',
        'aws_access_key_id': 'AKIA...',
        'aws_secret_access_key': '...',
        'bucket_region': 'us-east-1'
    }
)

job_id = index_response.json()['job_id']

# 3. Monitor indexing progress
status = requests.get(
    f"{BASE_URL}/v1/indexing-status/{job_id}",
    headers=headers
)

# 4. Query your data (DeepQuery executes the search)
response = requests.post(
    f"{BASE_URL}/v1/query",
    headers=headers,
    data={
        'query': 'What are the main themes in Q4 reports?',
        'database_name': 'my_documents'
    }
)

Tagger: Intelligent File Indexing

The Tagger service automatically scans your cloud storage, extracts content from files, and creates searchable indexes.

Supported File Types

Documents:

  • PDF (.pdf) - Up to 512MB with automatic page chunking
  • Microsoft Word (.docx)
  • Text files (.txt)
  • Markdown (.md)
  • Rich Text Format (.rtf)
  • OpenDocument Text (.odt)

Spreadsheets & Data:

  • Microsoft Excel (.xlsx, .xls) - Row-based chunking
  • CSV (.csv) - Header preservation
  • JSON (.json)

Presentations:

  • Microsoft PowerPoint (.pptx, .ppt)

Images (with OCR & Computer Vision):

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • BMP (.bmp) - Experimental
  • GIF (.gif) - Experimental
  • TIFF (.tiff) - Experimental

Code Files:

  • Python (.py)
  • TypeScript (.ts)
  • JavaScript (.js)
  • HTML (.html)
  • CSS (.css)
  • PHP (.php)
  • Java (.java)

Web Content:

  • XML (.xml)

How Tagger Works

  1. Discovery: Scans your bucket and identifies all supported files
  2. Processing: Extracts text, images, and metadata from each file
  3. Chunking: Intelligently splits large files into searchable chunks
  4. Tagging: Generates AI-powered tags and summaries for each chunk
  5. Indexing: Stores processed data in your Captain database

Processing Time:

  • Average: 2-5 seconds per file
  • Large PDFs (100+ pages): 10-30 seconds
  • Entire bucket (1000 files): 45-120 seconds

AWS S3 Integration

Index Entire S3 Bucket

Index all files in an S3 bucket:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/index-s3",
    headers=headers,
    data={
        'database_name': 'my_documents',
        'bucket_name': 'my-company-docs',
        'aws_access_key_id': 'AKIAIOSFODNN7EXAMPLE',
        'aws_secret_access_key': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
        'bucket_region': 'us-east-1'
    }
)

print(response.json())

Response:

{
  "status": "success",
  "message": "Indexing job started",
  "job_id": "job_1729876543_a1b2c3d4",
  "estimated_duration_seconds": 90,
  "files_found": 1247
}

Index Single S3 File

Index a specific file from S3:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/index-s3-file",
    headers=headers,
    data={
        'file_uri': 's3://my-bucket/documents/report_q4_2024.pdf',
        'database_name': 'my_documents',
        'aws_access_key_id': 'AKIAIOSFODNN7EXAMPLE',
        'aws_secret_access_key': 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
        'bucket_region': 'us-east-1'
    }
)

AWS IAM Permissions

Your AWS credentials need these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

See: Cloud Credentials Guide for detailed AWS setup instructions.
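
If you want to confirm a credential pair satisfies this policy before handing it to Captain, a quick local check with boto3 (not part of the Captain API; assumes boto3 is installed and reuses the example values above) looks roughly like this:

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='AKIAIOSFODNN7EXAMPLE',
    aws_secret_access_key='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    region_name='us-east-1'
)

# s3:ListBucket - can we enumerate objects?
objects = s3.list_objects_v2(Bucket='my-company-docs', MaxKeys=1)

# s3:GetObject - can we read an object (skipped if the bucket is empty)?
if objects.get('Contents'):
    s3.head_object(Bucket='my-company-docs', Key=objects['Contents'][0]['Key'])

print("Credentials can list and read the bucket")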


Google Cloud Storage Integration

Index Entire GCS Bucket

Index all files in a GCS bucket:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

# Load service account JSON
with open('path/to/service-account-key.json', 'r') as f:
    service_account_json = f.read()

response = requests.post(
    "https://api.runcaptain.com/v1/index-gcs",
    headers=headers,
    data={
        'database_name': 'my_documents',
        'bucket_name': 'my-gcs-bucket',
        'service_account_json': service_account_json
    }
)

print(response.json())

Response:

{
  "status": "success",
  "message": "Indexing job started",
  "job_id": "job_1729876543_x9y8z7w6",
  "estimated_duration_seconds": 85,
  "files_found": 892
}

Index Single GCS File

Index a specific file from GCS:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

with open('service-account-key.json', 'r') as f:
    service_account_json = f.read()

response = requests.post(
    "https://api.runcaptain.com/v1/index-gcs-file",
    headers=headers,
    data={
        'file_uri': 'gs://my-bucket/documents/report_q4_2024.pdf',
        'database_name': 'my_documents',
        'service_account_json': service_account_json
    }
)

GCS IAM Permissions

Your service account needs this role:

roles/storage.objectViewer

Or custom permissions:

storage.buckets.get
storage.objects.get
storage.objects.list

See: Cloud Credentials Guide for detailed GCS setup instructions.
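
A similar local pre-flight check for GCS (not part of the Captain API; assumes the google-cloud-storage package is installed) can confirm the service account has these permissions before you start indexing:

from google.cloud import storage

client = storage.Client.from_service_account_json('path/to/service-account-key.json')

# storage.objects.list - can we enumerate objects?
blobs = list(client.list_blobs('my-gcs-bucket', max_results=1))

# storage.objects.get - can we read an object (skipped if the bucket is empty)?
if blobs:
    blobs[0].download_as_bytes()

print("Service account can list and read the bucket")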


Monitoring Indexing Jobs

Get Indexing Status

Check the progress of your indexing job:

import requests

response = requests.get(
    f"https://api.runcaptain.com/v1/indexing-status/{job_id}",
    headers={
        'Authorization': 'Bearer cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    }
)

status = response.json()
print(f"Completed: {status['completed']}")
print(f"Status: {status['status']}")
print(f"Active workers: {status['active_file_processing_workers']}")

Response:

{
  "completed": false,
  "status": "RUNNING",
  "active_file_processing_workers": 12,
  "files_processed": 458,
  "files_total": 1247,
  "progress_percentage": 36.7
}

Get Detailed Step Function Status

Get detailed execution status with ETA:

import requests

response = requests.get(
    f"https://api.runcaptain.com/v1/index-status/{job_id}",
    headers={
        'Authorization': 'Bearer cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    }
)

status = response.json()
print(f"Status: {status['status']}")
print(f"Progress: {status['progress']['percentage']}%")
print(f"ETA: {status['estimated_completion_time']}")

Response:

{
  "status": "RUNNING",
  "execution_arn": "arn:aws:states:us-east-1:123456789012:execution:captain-tagger:job_1729876543",
  "start_date": "2024-10-25T10:15:30Z",
  "progress": {
    "files_processed": 458,
    "files_total": 1247,
    "percentage": 36.7
  },
  "estimated_completion_time": "2024-10-25T10:17:45Z"
}

Cancel Indexing Job

Stop a running indexing job:

import requests

response = requests.post(
    f"https://api.runcaptain.com/v1/index-stop/{job_id}",
    headers={
        'Authorization': 'Bearer cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
    }
)

print(response.json())

Response:

{
  "status": "success",
  "message": "Indexing job cancelled",
  "tasks_revoked": 789
}

DeepQuery: Natural Language Queries

Once your data is indexed, use DeepQuery to ask questions in natural language.

Query Parameters

  • query (string, required) - Natural language query
  • database_name (string, required) - Database to query
  • include_files (boolean, optional) - Include file metadata in response (default: false)
  • inference (boolean, optional) - Enable LLM inference (default: true). When false, returns raw vector search results for custom RAG pipelines
  • topK (integer, optional) - Number of results from vector search (default: 80). Works with or without inference
  • stream (boolean, optional) - Enable streaming (default: false). Automatically disabled when inference=false

Execute Query with AI Inference

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/query",
    headers=headers,
    data={
        'query': 'What are the revenue projections for Q4 2024?',
        'database_name': 'my_documents',
        'include_files': 'true',
        'inference': 'true',  # Use Captain's LLM (default)
        'topK': '20'          # Limit to top 20 results (optional)
    }
)

result = response.json()
print(result['response'])

Response:

{
  "status": "success",
  "response": "Based on the indexed documents, Q4 2024 revenue projections are $15.2M, representing a 23% increase over Q3...",
  "query": "What are the revenue projections for Q4 2024?",
  "database_name": "my_documents",
  "processing_metrics": {
    "total_files_processed": 42,
    "total_tokens": 85000,
    "execution_time_ms": 2850
  },
  "relevant_files": [
    {
      "file_name": "Q4_Financial_Forecast.xlsx",
      "relevancy_score": 0.95,
      "file_type": "xlsx",
      "file_id": "file_abc123"
    },
    {
      "file_name": "2024_Revenue_Analysis.pdf",
      "relevancy_score": 0.87,
      "file_type": "pdf",
      "file_id": "file_def456"
    }
  ]
}
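
Because include_files was set to 'true', the response also names the source documents; a short follow-up that prints them alongside the answer:

# Show which indexed files the answer was drawn from.
print(result['response'])
for f in result.get('relevant_files', []):
    print(f"  {f['file_name']} ({f['file_type']}) - relevancy {f['relevancy_score']}")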

Get Raw Search Results (Custom RAG Pipelines)

Skip LLM inference and get raw vector search results for custom RAG implementations:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/query",
    headers=headers,
    data={
        'query': 'What are the revenue projections for Q4 2024?',
        'database_name': 'my_documents',
        'inference': 'false',  # Return raw chunks instead of AI answer
        'topK': '10'           # Get top 10 chunks
    }
)

result = response.json()
for chunk in result['search_results']:
    print(f"{chunk['score']}: {chunk['text'][:100]}...")

Response:

{
  "status": "success",
  "query": "What are the revenue projections for Q4 2024?",
  "inference": false,
  "search_results": [
    {
      "text": "Q4 2024 Revenue Projections: Based on current market trends...",
      "score": 0.89,
      "file_name": "Q4_Financial_Forecast.xlsx",
      "file_id": "file_abc123",
      "chunk_id": "chunk_001",
      "tokens": 420
    },
    {
      "text": "Revenue analysis for Q4 shows projected growth of 23%...",
      "score": 0.85,
      "file_name": "2024_Revenue_Analysis.pdf",
      "file_id": "file_def456",
      "chunk_id": "chunk_042",
      "tokens": 380
    }
  ],
  "total_results": 2,
  "topK": 10,
  "search_metadata": {
    "denseResults": 50,
    "sparseResults": 30,
    "mergedResults": 60,
    "finalResults": 10,
    "searchMethod": "hybrid_vector"
  }
}
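
With inference disabled, the returned chunks can feed your own RAG pipeline. A minimal sketch that assembles the retrieved text into a prompt for whatever model you run; generate_answer is a placeholder for your own inference call, not a Captain API:

# Build a context block from the raw chunks and pass it to your own model.
chunks = result['search_results']
context = "\n\n".join(
    f"[{c['file_name']} / {c['chunk_id']}]\n{c['text']}" for c in chunks
)

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {result['query']}"
)

# answer = generate_answer(prompt)  # placeholder: your own LLM call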

Query with Streaming

Enable real-time streaming for immediate results:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/query",
    headers=headers,
    data={
        'query': 'Summarize all security vulnerabilities found in code reviews',
        'database_name': 'my_documents',
        'stream': 'true'
    },
    stream=True
)

# Process streamed chunks
for line in response.iter_lines():
    if line:
        line_text = line.decode('utf-8')
        if line_text.startswith('data: '):
            print(line_text[6:], end='', flush=True)

Idempotency

Prevent duplicate processing with idempotency keys:

import requests
import uuid

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

idempotency_key = str(uuid.uuid4())

response = requests.post(
    "https://api.runcaptain.com/v1/query",
    headers={
        'Authorization': f'Bearer {API_KEY}',
        'X-Organization-ID': ORG_ID,
        'Idempotency-Key': idempotency_key
    },
    data={
        'query': 'What are the main findings?',
        'database_name': 'my_documents'
    }
)

# Subsequent requests with same key return cached response
# No additional processing or token usage
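
Idempotency keys matter most when a request may be retried after a network failure; reusing the same key on each retry guarantees the query is processed (and billed) only once. A sketch of that pattern, reusing the variables above:

import time

for attempt in range(3):
    try:
        response = requests.post(
            "https://api.runcaptain.com/v1/query",
            headers={
                'Authorization': f'Bearer {API_KEY}',
                'X-Organization-ID': ORG_ID,
                'Idempotency-Key': idempotency_key  # same key on every attempt
            },
            data={
                'query': 'What are the main findings?',
                'database_name': 'my_documents'
            },
            timeout=60
        )
        break
    except requests.exceptions.RequestException:
        if attempt == 2:
            raise
        time.sleep(2 ** attempt)  # simple backoff before retrying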

Database Management

Create Database

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/create-database",
    headers=headers,
    data={
        'database_name': 'contracts_db'
    }
)

List Databases

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/list-databases",
    headers=headers
)

databases = response.json()['databases']
for db in databases:
    print(f"{db['database_name']} - {db['database_id']}")

List Files in Database

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/list-files",
    headers=headers,
    data={
        'database_name': 'my_documents',
        'limit': 50,
        'offset': 0
    }
)

files = response.json()['files']
for file in files:
    print(f"{file['file_name']} - {file['file_type']}")
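
The limit and offset fields suggest straightforward pagination; a sketch of walking the full listing (it assumes an empty files array marks the end, which is not spelled out above):

# Page through every file in the database, 50 at a time.
all_files = []
offset = 0
while True:
    page = requests.post(
        "https://api.runcaptain.com/v1/list-files",
        headers=headers,
        data={
            'database_name': 'my_documents',
            'limit': 50,
            'offset': offset
        }
    ).json()['files']
    if not page:
        break
    all_files.extend(page)
    offset += 50

print(f"Total files: {len(all_files)}")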

Delete Single File

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/delete-file",
    headers=headers,
    data={
        'database_name': 'my_documents',
        'file_id': 'file_abc123'
    }
)

Wipe Database

Clear all files while keeping the database structure:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/wipe-database",
    headers=headers,
    data={
        'database_name': 'my_documents'
    }
)

print(f"Wiped {response.json()['files_deleted']} files")

Delete Database

Permanently delete a database and all its files:

import requests

API_KEY = "cap_prod_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
ORG_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}

response = requests.post(
    "https://api.runcaptain.com/v1/delete-database",
    headers=headers,
    data={
        'database_name': 'my_documents'
    }
)

Architecture

How It All Works Together

┌────────────────────────────────────────────────┐
│          Your Cloud Storage (S3/GCS)           │
│   ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐       │
│   │ PDF  │  │ DOCX │  │ XLSX │  │ IMG  │       │
│   └──────┘  └──────┘  └──────┘  └──────┘       │
└────────────────┬───────────────────────────────┘
                 │ API: /index-s3 or /index-gcs
┌────────────────────────────────────────────────┐
│                 Tagger Service                 │
│   ┌──────────────────────────────────────┐     │
│   │ 1. Scan bucket (list all files)      │     │
│   │ 2. Download & extract content        │     │
│   │ 3. Intelligent chunking              │     │
│   │ 4. AI tagging & summarization        │     │
│   │ 5. Store in database                 │     │
│   └──────────────────────────────────────┘     │
└────────────────┬───────────────────────────────┘
                 │ Indexed data stored
┌────────────────────────────────────────────────┐
│             Captain Database (RDS)             │
│   ┌──────────────────────────────────────┐     │
│   │ • File metadata                      │     │
│   │ • Extracted content chunks           │     │
│   │ • AI-generated tags & summaries      │     │
│   │ • Vector embeddings                  │     │
│   └──────────────────────────────────────┘     │
└────────────────┬───────────────────────────────┘
                 │ API: /query
┌────────────────────────────────────────────────┐
│               DeepQuery Service                │
│   ┌──────────────────────────────────────┐     │
│   │ 1. Parse natural language query      │     │
│   │ 2. Search indexed data               │     │
│   │ 3. Rank relevant chunks              │     │
│   │ 4. Generate natural language answer  │     │
│   │ 5. Return results + metadata         │     │
│   └──────────────────────────────────────┘     │
└────────────────┬───────────────────────────────┘
                 │ Results returned
┌────────────────────────────────────────────────┐
│                Your Application                │
│   • Natural language answers                   │
│   • Relevant file references                   │
│   • Processing metrics                         │
└────────────────────────────────────────────────┘

Component Details

Tagger:

  • AWS Lambda-based serverless architecture
  • Parallel processing for high throughput
  • Automatic retry logic for failed files
  • Redis job tracking
  • Average: 200-500 files/minute

DeepQuery:

  • AWS Step Functions orchestration
  • Multi-stage pipeline:
    1. Query parsing
    2. Semantic search
    3. Relevance ranking
    4. Context building
    5. LLM generation
  • Average query time: 2-5 seconds
  • Supports streaming responses

Storage:

  • RDS Aurora PostgreSQL for structured data
  • S3 for large file storage
  • Vector embeddings for semantic search
  • Automatic backups and replication

Best Practices

Organizing Databases

By Project:

# Good: Separate databases per project
databases = [
    'project_alpha_docs',
    'project_beta_contracts',
    'project_gamma_reports'
]

By Department:

# Good: Separate databases per department
databases = [
    'engineering_docs',
    'legal_contracts',
    'finance_reports'
]

Avoid:

# Bad: Single database for everything
database = 'all_company_documents'  # Hard to manage, slower queries

Re-indexing Strategy

Files are automatically re-indexed when you run the indexing job again:

# First index
requests.post('.../index-s3', data={'database_name': 'docs', ...})

# Update some files in S3...

# Re-index (existing files will be updated)
requests.post('.../index-s3', data={'database_name': 'docs', ...})

Behavior:

  • Existing files with the same name are soft-deleted
  • New versions are indexed
  • Maintains query history

Query Optimization

Specific > General:

# Good
"What are the security vulnerabilities in the authentication module?"

# Less effective
"Tell me about security"

Include Context:

# Good
"What were the revenue projections mentioned in Q4 2024 reports?"

# Less effective
"What are the numbers?"

Use File Filtering:

# Include file metadata to understand sources
response = requests.post('.../query', data={
    'query': '...',
    'include_files': 'true',  # Returns which files were used
    ...
})

Environment Isolation

Use different API keys for dev/staging/prod:

# Development
API_KEY_DEV = 'cap_dev_...'
DATABASE_NAME_DEV = 'docs_dev'

# Production
API_KEY_PROD = 'cap_prod_...'
DATABASE_NAME_PROD = 'docs_prod'

This ensures:

  • Development testing doesn’t affect production
  • Complete data isolation
  • Separate token usage tracking
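
One common way to wire this up (an option, not a Captain requirement) is to read the key, organization ID, and database name from environment variables so the same code runs unchanged in every environment; the variable names below are illustrative:

import os

# Set CAPTAIN_API_KEY, CAPTAIN_ORG_ID, and CAPTAIN_DATABASE per environment.
API_KEY = os.environ["CAPTAIN_API_KEY"]         # cap_dev_... or cap_prod_...
ORG_ID = os.environ["CAPTAIN_ORG_ID"]
DATABASE_NAME = os.environ["CAPTAIN_DATABASE"]  # docs_dev or docs_prod

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID
}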

Use Cases

Enterprise Knowledge Base

Index company wikis, documentation, and internal resources for instant employee access.

Legal Document Analysis

Index contracts, case files, and legal research for rapid clause extraction and risk analysis.

Customer Support Automation

Index support tickets, documentation, and FAQs to power AI-driven support responses.

Compliance & Audit

Index financial reports, audit logs, and compliance documents for regulatory analysis.

Research & Development

Index scientific papers, patents, and research notes for literature review and prior art searches.

Code Intelligence

Index entire codebases for semantic code search, documentation generation, and technical debt analysis.


Error Handling

Common Errors

Invalid Credentials:

{
  "status": "error",
  "message": "Invalid AWS credentials",
  "error_code": "INVALID_CREDENTIALS"
}

Bucket Not Found:

{
  "status": "error",
  "message": "S3 bucket 'my-bucket' does not exist or is not accessible",
  "error_code": "BUCKET_NOT_FOUND"
}

Database Not Found:

{
  "status": "error",
  "message": "Database 'my_documents' not found",
  "error_code": "DATABASE_NOT_FOUND"
}

File Type Not Supported:

{
  "status": "warning",
  "message": "Skipped 5 unsupported files (.exe, .dll)",
  "files_indexed": 1242
}
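
A small helper can surface these error payloads instead of letting them pass silently. A sketch, assuming error responses always carry the status, message, and error_code fields shown above and reusing the headers from earlier examples:

import requests

class CaptainAPIError(Exception):
    pass

def check_response(response):
    # Raise a descriptive exception whenever the API returns an error payload.
    payload = response.json()
    if payload.get('status') == 'error':
        raise CaptainAPIError(
            f"{payload.get('error_code', 'UNKNOWN')}: {payload.get('message')}"
        )
    return payload

result = check_response(
    requests.post(
        "https://api.runcaptain.com/v1/query",
        headers=headers,
        data={'query': 'What are the main findings?', 'database_name': 'my_documents'}
    )
)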

Rate Limits & Quotas

Tier        Indexing Jobs/Hour    Queries/Minute    Max Files/Database
Standard    10                    10                50,000
Premium     Unlimited             60                Unlimited

Contact support@runcaptain.com to upgrade to Premium tier.
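
If you hit the per-minute query limit, backing off and retrying is usually enough. A sketch under the assumption that throttled requests return HTTP 429 (the exact status code is not documented here):

import time

def query_with_backoff(payload, max_attempts=5):
    # Retry with exponential backoff while the request is being throttled.
    for attempt in range(max_attempts):
        response = requests.post(
            "https://api.runcaptain.com/v1/query",
            headers=headers,
            data=payload
        )
        if response.status_code != 429:  # assumed throttling status code
            return response
        time.sleep(2 ** attempt)
    return response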