Integrate Captain with Your Data Lake
Connect Captain to your cloud storage (AWS S3 or Google Cloud Storage) to index and query your entire data lake. This integration turns your existing file storage infrastructure into persistent, searchable databases.
Overview
Captain’s Data Lake Integration consists of three key components working together:
- Tagger - Scans your cloud storage, processes files, and extracts structured data
- DeepQuery - Executes natural language queries against indexed data
- Cloud Storage - Your AWS S3 or GCS buckets containing documents
Getting Started
Prerequisites
- Captain API key (get one at runcaptain.com/studio)
- AWS S3 bucket or Google Cloud Storage bucket
- Cloud storage credentials (see Cloud Credentials Guide)
Authentication Setup
All Captain API endpoints use header-based authentication. Set up your credentials:
Required headers for all requests:
Authorization: Bearer {your_api_key}
X-Organization-ID: {your_org_id}
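In Python, the two required headers can be built once and attached to every request. A minimal sketch (the key and org ID values are placeholders):

```python
# Build the two required Captain auth headers (values are placeholders).
def captain_headers(api_key: str, org_id: str) -> dict:
    return {
        "Authorization": f"Bearer {api_key}",
        "X-Organization-ID": org_id,
    }

headers = captain_headers("sk-example-key", "org-123")
```

Pass this dict as the `headers` argument to whatever HTTP client you use.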
Quick Start (4 Steps)
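The four steps below are assembled from the sections that follow: create a database, index a bucket, poll the job, then query. The endpoint paths and field names here are assumptions, not confirmed API details; see the API Reference for the real contract.

```python
# Hypothetical 4-step flow; paths and field names are assumptions.
BASE = "https://api.runcaptain.com"  # assumed base URL

steps = [
    # 1. Create a database to hold the index
    ("POST", f"{BASE}/v1/databases", {"name": "my-data-lake"}),
    # 2. Index an S3 (or GCS) bucket into it
    ("POST", f"{BASE}/v1/index/s3", {"bucket_name": "my-bucket", "database_name": "my-data-lake"}),
    # 3. Poll the indexing job until it completes
    ("GET", f"{BASE}/v1/index/status/{{job_id}}", None),
    # 4. Ask a natural language question with DeepQuery
    ("POST", f"{BASE}/v1/deepquery", {"query": "What were Q3 revenues?", "database_name": "my-data-lake"}),
]
```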
Tagger: Intelligent File Indexing
The Tagger service automatically scans your cloud storage, extracts content from files, and creates searchable indexes.
Supported File Types
Documents:
- PDF (.pdf) - Up to 512MB with automatic page chunking
- Microsoft Word (.docx)
- Text files (.txt)
- Markdown (.md)
- Rich Text Format (.rtf)
- OpenDocument Text (.odt)
Spreadsheets & Data:
- Microsoft Excel (.xlsx, .xls) - Row-based chunking
- CSV (.csv) - Header preservation
- JSON (.json)
Presentations:
- Microsoft PowerPoint (.pptx, .ppt)
Images (with OCR & Computer Vision):
- JPEG (.jpg, .jpeg)
- PNG (.png)
- BMP (.bmp) - Experimental
- GIF (.gif) - Experimental
- TIFF (.tiff) - Experimental
Code Files:
- Python (.py)
- TypeScript (.ts)
- JavaScript (.js)
- HTML (.html)
- CSS (.css)
- PHP (.php)
- Java (.java)
Web Content:
- XML (.xml)
How Tagger Works
1. Discovery: Scans your bucket and identifies all supported files
2. Processing: Extracts text, images, and metadata from each file
3. Chunking: Intelligently splits large files into searchable chunks
4. Tagging: Generates AI-powered tags and summaries for each chunk
5. Indexing: Stores processed data in your Captain database
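Conceptually, the chunking and tagging stages look something like the sketch below. This is illustrative only; Captain runs this pipeline server-side, and the chunk size and output shape here are assumptions.

```python
# Illustrative sketch of chunking + tagging; Captain's actual server-side
# implementation and chunk sizes are not documented here.
def chunk_and_tag(path: str, text: str, chunk_size: int = 1000) -> list:
    # Chunking: split large files into fixed-size searchable chunks
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Tagging: attach per-chunk metadata (placeholder for the AI tagger)
    return [{"source": path, "chunk": c, "tags": []} for c in chunks]
```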
Processing Time:
- Average: 2-5 seconds per file
- Large PDFs (100+ pages): 10-30 seconds
- Entire bucket (1000 files): 45-120 seconds
AWS S3 Integration
Index Entire S3 Bucket
Index all files in an S3 bucket:
Response:
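The exact request and response bodies are not shown here; a plausible sketch, in which the field names (`bucket_name`, `job_id`, and so on) are assumptions:

```python
# Hypothetical request/response shapes for indexing a whole S3 bucket.
request_body = {
    "bucket_name": "my-company-data",
    "database_name": "data-lake",
    "aws_access_key_id": "AKIA...",       # placeholder; never hardcode real keys
    "aws_secret_access_key": "<secret>",  # placeholder
}

# A job-style response might look like:
response_body = {
    "job_id": "job_abc123",
    "status": "processing",
    "files_discovered": 1000,
}
```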
Index Single S3 File
Index a specific file from S3:
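A sketch of a single-file request body; the field names are assumptions:

```python
# Hypothetical request body for indexing one S3 object.
request_body = {
    "bucket_name": "my-company-data",
    "file_key": "reports/q3-2024.pdf",  # the object key within the bucket
    "database_name": "data-lake",
}
```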
AWS IAM Permissions
Your AWS credentials need these permissions:
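At minimum, read access to the bucket and its objects is typically required. A policy along these lines should work, but verify the exact action list against the Cloud Credentials Guide:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN while `s3:GetObject` applies to the object ARN (`/*`), so both resource forms are needed.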
See: Cloud Credentials Guide for detailed AWS setup instructions.
Google Cloud Storage Integration
Index Entire GCS Bucket
Index all files in a GCS bucket:
Response:
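As with S3, the real request and response bodies are elided here; a hedged sketch, with field names assumed:

```python
# Hypothetical request/response shapes for indexing a whole GCS bucket.
request_body = {
    "bucket_name": "my-company-data",
    "database_name": "data-lake",
    # A service account credential blob (truncated placeholder):
    "gcs_credentials": {"type": "service_account", "project_id": "my-project"},
}

response_body = {
    "job_id": "job_def456",
    "status": "processing",
    "files_discovered": 250,
}
```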
Index Single GCS File
Index a specific file from GCS:
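A sketch of a single-object request body, mirroring the S3 variant (field names assumed):

```python
# Hypothetical request body for indexing one GCS object.
request_body = {
    "bucket_name": "my-company-data",
    "file_key": "contracts/msa-2024.pdf",  # object path within the bucket
    "database_name": "data-lake",
}
```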
GCS IAM Permissions
Your service account needs this role:
Or custom permissions:
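A read-only setup along these lines is typical; confirm the exact requirements in the Cloud Credentials Guide:

```
# Predefined role:
roles/storage.objectViewer

# Or equivalent custom permissions:
storage.objects.get
storage.objects.list
storage.buckets.get
```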
See: Cloud Credentials Guide for detailed GCS setup instructions.
Monitoring Indexing Jobs
Get Indexing Status
Check the progress of your indexing job:
Response:
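A status response might carry a job state plus progress counters; the shape below is an assumption:

```python
# Hypothetical status payload for a running indexing job.
status = {
    "job_id": "job_abc123",
    "status": "processing",  # e.g. queued | processing | completed | failed
    "files_processed": 412,
    "files_total": 1000,
}

progress = status["files_processed"] / status["files_total"]
```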
Get Detailed Step Function Status
Get detailed execution status with ETA:
Response:
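Since Tagger runs on AWS Step Functions (see Architecture below), a detailed status likely exposes the execution state and an ETA. All field names here are assumptions:

```python
# Hypothetical detailed Step Function status payload.
detailed = {
    "job_id": "job_abc123",
    "execution_arn": "arn:aws:states:...",  # truncated placeholder
    "execution_status": "RUNNING",
    "files_processed": 412,
    "files_total": 1000,
    "eta_seconds": 75,
}
```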
Cancel Indexing Job
Stop a running indexing job:
Response:
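A cancellation response might confirm the final job state; the verb, path, and fields are assumptions:

```python
# Hypothetical cancel request/response for a running job.
cancel_request = ("POST", "/v1/index/cancel/job_abc123")
cancel_response = {
    "job_id": "job_abc123",
    "status": "cancelled",
    "files_processed_before_cancel": 412,
}
```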
DeepQuery: Natural Language Queries
Once your data is indexed, use DeepQuery to ask questions in natural language.
Query Parameters
Execute Query with AI Inference
Response:
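The parameter names below are inferred from the surrounding sections (streaming, file filtering) and are assumptions, not the confirmed contract:

```python
# Hypothetical DeepQuery request/response shapes.
query_request = {
    "query": "What were the key findings in the Q3 compliance audit?",
    "database_name": "data-lake",
    "stream": False,            # see "Query with Streaming" below
    "file_filter": ["audits/"], # see "Use File Filtering" below
}

query_response = {
    "answer": "...",  # LLM-generated answer grounded in indexed chunks
    "sources": [{"file": "audits/q3.pdf", "chunk_id": 17}],
    "tokens_used": 1843,
}
```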
Get Raw Search Results (Custom RAG Pipelines)
Skip LLM inference and get raw vector search results for custom RAG implementations:
Response:
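When skipping inference, the response would be the ranked chunks themselves rather than a generated answer. A sketch with assumed field names:

```python
# Hypothetical raw-search request/response for a custom RAG pipeline.
raw_request = {
    "query": "termination clauses in vendor contracts",
    "database_name": "data-lake",
    "skip_inference": True,  # field name is an assumption
    "top_k": 5,
}

raw_response = {
    "results": [
        {"file": "contracts/vendor-msa.pdf", "chunk_id": 12,
         "score": 0.87, "text": "..."},
    ]
}
```

You would then feed `results` into your own context builder and LLM call.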
Query with Streaming
Enable real-time streaming for immediate results:
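If the stream is delivered as server-sent events (an assumption; the wire format is not documented here), the client side reduces to extracting `data:` lines:

```python
# Minimal SSE-style parser; the "data: " framing is an assumption.
def read_stream(lines):
    for line in lines:
        if line.startswith("data: "):
            yield line[len("data: "):]

# Example: three frames, one of which is a comment line and is skipped.
chunks = list(read_stream(["data: The", "data:  answer", ": keep-alive"]))
```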
Idempotency
Prevent duplicate processing with idempotency keys:
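A common pattern is to generate one key per logical operation and resend it on retries, so a retried request is recognized and not processed twice. The header name below is an assumption:

```python
import uuid

# One key per logical operation; reuse it verbatim on retries.
idempotency_key = str(uuid.uuid4())
headers = {"Idempotency-Key": idempotency_key}  # header name is an assumption
```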
Database Management
Create Database
List Databases
List Files in Database
Delete Single File
Wipe Database
Clear all files while keeping the database structure:
Delete Database
Permanently delete a database and all its files:
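The six management operations above plausibly map onto a REST surface like the one sketched here; every path and verb is an assumption, so check the API Reference before relying on them:

```python
# Hypothetical REST mapping for the database management operations.
BASE = "/v1/databases"

operations = {
    "create":      ("POST",   BASE,                      {"name": "data-lake"}),
    "list":        ("GET",    BASE,                      None),
    "list_files":  ("GET",    f"{BASE}/data-lake/files", None),
    # Object keys with slashes would need URL-encoding:
    "delete_file": ("DELETE", f"{BASE}/data-lake/files/reports%2Fq3.pdf", None),
    "wipe":        ("POST",   f"{BASE}/data-lake/wipe",  None),
    "delete":      ("DELETE", f"{BASE}/data-lake",       None),
}
```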
Architecture
How It All Works Together
Component Details
Tagger:
- AWS Lambda-based serverless architecture
- Parallel processing for high throughput
- Automatic retry logic for failed files
- Redis job tracking
- Average: 200-500 files/minute
DeepQuery:
- AWS Step Functions orchestration
- Multi-stage pipeline:
  - Query parsing
  - Semantic search
  - Relevance ranking
  - Context building
  - LLM generation
- Average query time: 2-5 seconds
- Supports streaming responses
Storage:
- RDS Aurora PostgreSQL for structured data
- S3 for large file storage
- Vector embeddings for semantic search
- Automatic backups and replication
Best Practices
Organizing Databases
By Project:
By Department:
Avoid:
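The original naming examples are not shown; the names below are hypothetical illustrations of the pattern:

```python
# Hypothetical database names illustrating the conventions above.
by_project = ["acme-website", "mobile-app-v2"]
by_department = ["legal-contracts", "hr-policies"]
avoid = ["everything"]  # one catch-all database mixing unrelated data
```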
Re-indexing Strategy
Files are automatically re-indexed when you run another indexing job against the same bucket:
Behavior:
- Existing files with same name are soft-deleted
- New versions are indexed
- Maintains query history
Query Optimization
Specific > General:
Include Context:
Use File Filtering:
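The documentation's own examples are not shown; the queries below are hypothetical illustrations of the three tips, and the `file_filter` field name is an assumption:

```python
# Vague vs. specific phrasing (illustrative):
vague = "Tell me about sales"
specific = "What were enterprise SaaS sales in Q3 2024, broken down by region?"

# Narrowing the search space with a file filter (field name assumed):
filtered_query = {
    "query": specific,
    "database_name": "data-lake",
    "file_filter": ["sales/2024/"],
}
```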
Environment Isolation
Use different API keys for dev/staging/prod:
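One way to wire this up is to select the key from environment variables rather than hardcoding a single key everywhere. The variable names here are conventions of this sketch, not documented requirements:

```python
import os

# Pick the API key for the current environment (variable names are
# illustrative conventions, not documented requirements).
env = os.environ.get("CAPTAIN_ENV", "dev")
api_key = {
    "dev": os.environ.get("CAPTAIN_API_KEY_DEV", ""),
    "staging": os.environ.get("CAPTAIN_API_KEY_STAGING", ""),
    "prod": os.environ.get("CAPTAIN_API_KEY_PROD", ""),
}[env]
```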
This ensures:
- Development testing doesn’t affect production
- Complete data isolation
- Separate token usage tracking
Use Cases
Enterprise Knowledge Base
Index company wikis, documentation, and internal resources for instant employee access.
Legal Document Management
Index contracts, case files, and legal research for rapid clause extraction and risk analysis.
Customer Support Automation
Index support tickets, documentation, and FAQs to power AI-driven support responses.
Compliance & Audit
Index financial reports, audit logs, and compliance documents for regulatory analysis.
Research & Development
Index scientific papers, patents, and research notes for literature review and prior art searches.
Code Repository Search
Index entire codebases for semantic code search, documentation generation, and technical debt analysis.
Error Handling
Common Errors
Invalid Credentials:
Bucket Not Found:
Database Not Found:
File Type Not Supported:
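The actual error payloads are not reproduced here; plausible shapes for the four cases above, with status codes and error identifiers assumed:

```python
# Hypothetical error payloads for the common failure cases.
common_errors = [
    {"status": 401, "error": "invalid_credentials",
     "message": "API key or cloud credentials are missing or invalid"},
    {"status": 404, "error": "bucket_not_found",
     "message": "Bucket does not exist or is not accessible"},
    {"status": 404, "error": "database_not_found",
     "message": "No database with that name in this organization"},
    {"status": 415, "error": "unsupported_file_type",
     "message": "File extension is not in the supported list"},
]
```

Checking `status` first and then branching on the `error` identifier keeps client-side handling stable even if messages change.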
Rate Limits & Quotas
Contact support@runcaptain.com to upgrade to Premium tier.
Related Documentation
- Infinite-Responses API - Process massive text inputs without indexing
- Cloud Credentials Guide - Detailed AWS & GCS setup
- API Reference - Complete endpoint documentation
- Getting Started - Quick start guide