Captain API
The Captain API enables you to process large contexts (millions of tokens) and ask questions about them using the familiar OpenAI SDK. Unlike traditional LLMs with limited context windows, Captain can handle unlimited context through intelligent chunking, parallel processing, and generative merging.
Key Features
- Unlimited Context: Process millions of tokens in a single request
- Multiple Input Methods: Inline text or file upload
- Real-time Streaming: Get responses as they’re generated via Server-Sent Events
- OpenAI SDK Compatible: Drop-in replacement for OpenAI API
- Tool Calling Support: Enable LLM to call functions and tools during processing
- Automatic Optimization: Large inputs handled automatically behind the scenes
- Intelligent Processing: 30-40% chunk overlap for accuracy with parallel LLM processing
Authentication
All requests require authentication:
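For example, a minimal sketch in Python - the Bearer scheme, header name, and base URL are assumptions, so check your Captain dashboard for the exact values:

```python
import requests

# Hypothetical values: substitute your real API key and deployment URL.
API_KEY = "your-captain-api-key"
BASE_URL = "https://api.runcaptain.com"  # assumed base URL

# Assumed standard Bearer token header; include it on every request.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```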
Streaming Responses
Enable real-time streaming to receive responses as they’re generated using Server-Sent Events (SSE).
Enable Streaming
Set stream=true in your request:
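A sketch with the OpenAI Python SDK - the `base_url` is an assumption, and the `extra_body` context field is covered in the SDK Integration section below:

```python
from openai import OpenAI

# The base_url is an assumption; point the SDK at your Captain deployment.
client = OpenAI(
    api_key="your-captain-api-key",
    base_url="https://api.runcaptain.com/v1",
)

stream = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[{"role": "user", "content": "Summarize the main themes."}],
    stream=True,  # enable Server-Sent Events streaming
    extra_body={"captain": {"context": "...your large document..."}},
)

for chunk in stream:
    # Some chunks (e.g., the final one) may carry no choices or an empty delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)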
Stream Response Format
Streaming responses use Server-Sent Events (SSE) format:
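Each event is a `data:` line carrying an OpenAI-style chunk object, terminated by a `[DONE]` marker. A sketch of the wire format (field values are illustrative):

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The"}}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" answer"}}]}

data: [DONE]
```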
Stream Events
POST /v1/responses
Direct HTTP endpoint for infinite context processing. Use this endpoint when making direct HTTP requests without the OpenAI SDK.
Authentication
All requests require authentication via headers:
Parameters
Request Example (Python)
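A minimal sketch using `requests`. The exact body fields for this endpoint are not listed here, so the `input` and `captain.context` fields below are assumptions modeled on the OpenAI-compatible patterns used elsewhere in this document:

```python
import requests

response = requests.post(
    "https://api.runcaptain.com/v1/responses",  # assumed base URL
    headers={"Authorization": "Bearer your-captain-api-key"},
    json={
        "model": "captain-voyager-latest",
        "input": "What are the main conclusions?",          # hypothetical field
        "captain": {"context": open("report.txt").read()},  # hypothetical field
    },
)
print(response.json())
```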
Request Example (JavaScript)
Request Example (cURL)
Streaming Example
Response Format
Non-Streaming Response:
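A sketch assuming an OpenAI-compatible completion object (the exact shape for this endpoint may differ; values are illustrative):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "captain-voyager-latest",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The main conclusions are..."},
      "finish_reason": "stop"
    }
  ]
}
```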
Streaming Response (SSE):
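Identical to the SSE format shown under Stream Response Format above: a series of `data:` chunk events ending with `data: [DONE]`.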
SDK Integration
Captain supports multiple SDK approaches for different use cases:
Supported SDKs
- Python SDK → Official OpenAI Python SDK with `extra_body`
- JavaScript SDK → Official OpenAI JavaScript SDK with `extra_body` (⭐ Recommended for JS/TS)
- Vercel AI SDK → Vercel's AI SDK with a custom header approach
- Direct HTTP → Maximum control with fetch/requests
Python SDK
Use Captain with the official OpenAI Python SDK for a familiar developer experience.
Installation
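Captain works with the official OpenAI package:

```bash
pip install openai
```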
Using Captain with OpenAI SDK ⭐ Recommended
Captain separates your instructions from your context using the OpenAI SDK’s standard message format combined with the extra_body parameter.
Key Concepts:
- System messages: Provide instructions to the AI (e.g., “You are a helpful assistant…”)
- User messages: Contain your query or question
- `extra_body.captain.context`: Provides your large context/documents (can be millions of tokens)
Why use this approach:
- ✅ Clean separation: instructions vs. context
- ✅ Your system prompts pass directly to the AI (not replaced)
- ✅ Context properly chunked and processed for unlimited size
- ✅ Compatible with OpenAI SDK patterns
Alternative: File Upload Endpoint (for very large files)
For files so large that loading them into memory is impractical, use the dedicated multipart upload endpoint, documented under POST /v1/chat/completions/upload below.
When to use this endpoint:
- ✅ Very large files where loading into memory first is impractical
- ✅ When you want to upload files directly via multipart form data
- ✅ For workflows where you're already using the `requests` library instead of the OpenAI SDK
Note: Captain handles files of any size automatically - you only need this endpoint if you want to use multipart uploads instead of the standard OpenAI SDK approach.
Message Roles Explained
System Role ({"role": "system", ...}):
- Provides instructions and guidance to the AI (OPTIONAL)
- Sets the AI’s behavior and personality
- Examples: “You are a legal expert”, “Be concise”, “Focus on security aspects”
- Passes directly to the AI model (not replaced by Captain)
- Completely optional - omit to use Captain’s default helpful persona
- When provided, your instructions take priority over Captain’s defaults
User Role ({"role": "user", ...}):
- Contains your query or question
- What you want to know about the context
- Examples: “What are the main themes?”, “List all vulnerabilities”
Context (extra_body.captain.context):
- Your large documents, text, or data to analyze
- Can be millions of tokens (unlimited size)
- Automatically chunked and processed by Captain
- Examples: Full contracts, codebases, research papers
Example 1: Custom System Prompt (Define Your Own Role)
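A sketch (the `base_url` is an assumption; `contract.txt` is a placeholder):

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-captain-api-key",
    base_url="https://api.runcaptain.com/v1",  # assumed base URL
)

response = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[
        # Your system prompt passes directly to the AI (not replaced by Captain).
        {"role": "system", "content": "You are a legal expert. Focus on risks and obligations."},
        {"role": "user", "content": "What are the termination clauses in this contract?"},
    ],
    # The large document goes in extra_body, not in the messages.
    extra_body={"captain": {"context": open("contract.txt").read()}},
)
print(response.choices[0].message.content)
```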
Example 2: Captain’s Default Persona (No System Prompt)
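Omit the system message to use Captain's default helpful persona (a sketch, reusing the assumed client setup from Example 1):

```python
response = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[
        # No system message: Captain's default persona applies.
        {"role": "user", "content": "What are the main themes?"},
    ],
    extra_body={"captain": {"context": open("novel.txt").read()}},
)
print(response.choices[0].message.content)
```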
Important: Context must be provided via `extra_body`. Do not place large documents in system or user messages.
JavaScript SDK
Use the official OpenAI JavaScript SDK with Captain - recommended for most TypeScript/JavaScript projects.
Installation
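```bash
npm install openai
```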
Basic Example
Streaming
Vercel AI SDK
Use Vercel’s AI SDK with Captain via custom header approach or upload endpoint.
Note:
- For small contexts (<4KB): Use the base64-encoded `X-Captain-Context` header
- For large contexts (>4KB): Use the `/v1/chat/completions/upload` endpoint with FormData
Installation
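Assuming the standard packages (the provider package is an assumption based on the OpenAI-compatible custom-header approach described above):

```bash
npm install ai @ai-sdk/openai
```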
Small Context Example
Large Context Example (Upload Endpoint)
For contexts larger than ~4KB, use the upload endpoint:
For complete Vercel AI SDK documentation, see Vercel AI SDK Guide.
Direct HTTP Fetch
For maximum control, use direct HTTP requests with the captain parameter.
Basic Example
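A sketch in Python with `requests` (the base URL is an assumption). Note that in raw HTTP the `captain` object sits at the top level of the JSON body - `extra_body` is purely an SDK mechanism that merges its contents into the request body:

```python
import requests

response = requests.post(
    "https://api.runcaptain.com/v1/chat/completions",  # assumed base URL
    headers={
        "Authorization": "Bearer your-captain-api-key",
        "Content-Type": "application/json",
    },
    json={
        "model": "captain-voyager-latest",
        "messages": [{"role": "user", "content": "List all security vulnerabilities"}],
        "captain": {"context": open("codebase.txt").read()},  # top-level captain parameter
    },
)
print(response.json()["choices"][0]["message"]["content"])
```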
SDK Parameters
Common parameters across all SDKs:
POST /v1/chat/completions
Standard OpenAI-compatible chat completions endpoint. Use this for contexts that fit in memory (< 1 MB) or when using the OpenAI Python SDK.
Parameters
Captain-specific extensions (in extra_body):
Request Example (Using OpenAI SDK)
See the “OpenAI SDK Integration” section below for complete examples.
Response Format
Streaming response (SSE format):
Non-streaming response:
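Both match the OpenAI-compatible formats shown in the POST /v1/responses section above.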
POST /v1/chat/completions/upload
Note: As of the latest update, the standard /v1/chat/completions endpoint automatically handles large contexts (>250KB). This upload endpoint remains available for very large text files when you prefer to send them as multipart form data.
Upload extremely large files (1MB+) directly with your chat completion request using multipart form data.
Use this endpoint for:
- Explicit multipart file uploads
- Legacy integrations requiring direct file upload
- When you want manual control over file upload process
Parameters (Multipart Form Data)
Note: Form data values must be strings. Boolean values like stream should be sent as "true" or "false".
Request Example (HTTP Multipart Upload)
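A sketch with `requests` - the base URL and form field names are assumptions (the endpoint path and the strings-only rule come from this document):

```python
import requests

with open("bible.txt", "rb") as f:
    response = requests.post(
        "https://api.runcaptain.com/v1/chat/completions/upload",  # assumed base URL
        headers={"Authorization": "Bearer your-captain-api-key"},
        files={"file": f},                    # hypothetical field name
        data={
            "model": "captain-voyager-latest",
            "query": "Summarize this text",   # hypothetical field name
            "stream": "true",                 # booleans must be sent as strings
        },
        stream=True,
        timeout=120,  # the first chunk can take 10-30 seconds for very large files
    )

# Read the SSE stream line by line as it arrives.
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))
```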
Note: This uses direct HTTP requests, not the OpenAI SDK. For OpenAI SDK usage, use the standard /v1/chat/completions endpoint which now handles large contexts automatically.
How It Works
- File Upload: Your large file is processed automatically
- Parallel Processing: The file is split into chunks (80k tokens each)
- Worker Execution: 15+ parallel workers process chunks simultaneously
- Compression: Each worker compresses its chunk to ~8-10k tokens
- Response Streaming: The reducer streams the final response in real-time
For a 4.6MB Bible text:
- Creates 15 chunks of ~80k tokens each
- Runs 15 parallel workers
- Each worker processes one ~80k-token chunk
- Total processing time: ~10-15 seconds
- Streams response as it’s generated
Response Format
Returns streaming response in OpenAI format with [DONE] marker:
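For example (chunk shapes illustrative):

```
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"In the beginning"}}]}

data: [DONE]
```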
Important Notes
- Streaming Recommended: Always use `stream=true` for the best experience
- Wait Time: First chunk may take 10-30 seconds for very large files (worker startup time)
- File Size: Supports files up to 100MB+
- Encoding: Files must be UTF-8 encoded text
- Timeout: The stream will wait up to 60 seconds for processing to start
Error Responses
400 Bad Request
401 Unauthorized
413 Payload Too Large
500 Internal Server Error
Tool Calling
Enable the LLM to call functions and tools while processing your documents. Tool calling allows the model to request external operations (calculations, API calls, data lookups) during response generation.
Overview
Tool calling works with both streaming and non-streaming modes. When tools are provided, the LLM autonomously decides whether to use them based on your query and the available context.
Key Requirements:
- All tools must have `"strict": true` in the function definition (Cerebras requirement)
- Tools work with any context size (small or infinite)
- Compatible with both streaming (`stream=true`) and non-streaming modes
Basic Example
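A sketch with the OpenAI Python SDK - the `get_stock_price` tool and the `base_url` are hypothetical:

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-captain-api-key",
    base_url="https://api.runcaptain.com/v1",  # assumed base URL
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",  # hypothetical tool
        "description": "Get the current price for a stock ticker symbol",
        "strict": True,  # required for all Captain tools
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Ticker symbol, e.g. AAPL"}
            },
            "required": ["ticker"],
            "additionalProperties": False,
        },
    },
}]

response = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[{"role": "user", "content": "What is Apple trading at right now?"}],
    tools=tools,
)

# Tool calls are logged, not auto-executed: inspect what the model wanted to call.
print(response.choices[0].message.tool_calls)
```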
Tool Definition Format
Each tool must follow OpenAI’s function calling schema:
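The expected shape, shown as a Python dict (in OpenAI's strict mode, `additionalProperties: false` is also required):

```python
{
    "type": "function",
    "function": {
        "name": "function_name",
        "description": "What this function does and when the model should use it",
        "strict": True,  # required by Captain (Cerebras requirement)
        "parameters": {
            "type": "object",
            "properties": {
                "param_name": {
                    "type": "string",
                    "description": "What this parameter means",
                }
            },
            "required": ["param_name"],
            "additionalProperties": False,
        },
    },
}
```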
Complete Example: Multiple Tools
Tool Choice Parameter
Control when the LLM should use tools:
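A sketch assuming the standard OpenAI `tool_choice` values, continuing the basic example above:

```python
response = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[{"role": "user", "content": "What is Apple trading at?"}],
    tools=tools,
    tool_choice="auto",  # default: the model decides whether to call a tool
    # tool_choice="none"      -> never call tools
    # tool_choice="required"  -> must call at least one tool
    # tool_choice={"type": "function", "function": {"name": "get_stock_price"}}
)
```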
Streaming vs Non-Streaming with Tools
Streaming Mode:
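Assuming standard OpenAI SDK streaming semantics, tool-call arguments arrive incrementally in each chunk's delta (sketch, reusing the client and tools from above):

```python
stream = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[{"role": "user", "content": "What is Apple trading at?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.tool_calls:  # partial tool-call arguments stream in across chunks
        print(delta.tool_calls)
    elif delta.content:
        print(delta.content, end="", flush=True)
```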
Non-Streaming Mode:
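Without streaming, any tool calls arrive on the completed message object:

```python
response = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[{"role": "user", "content": "What is Apple trading at?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```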
Tool Calling with Large Context
Tool calling works seamlessly with Captain’s infinite context processing. Simply provide your large document and tools - Captain handles everything automatically:
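For example, a sketch reusing the assumed client and tools from the basic example (`annual_reports.txt` is a placeholder):

```python
response = client.chat.completions.create(
    model="captain-voyager-latest",
    messages=[{"role": "user", "content": "Compare the revenue figures across these reports"}],
    tools=tools,
    # Millions of tokens are fine here: Captain chunks and processes automatically.
    extra_body={"captain": {"context": open("annual_reports.txt").read()}},
)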
Key Points:
- Captain automatically handles large contexts (100M+ tokens)
- No special upload needed - just pass your document via `extra_body.captain.context`
- Tools + infinite context work together seamlessly
- The LLM can use tools while processing massive documents
How Tool Calling Works
- LLM Analysis: Model analyzes your query and available tools
- Tool Invocation: Model requests tool execution with specific parameters
- Result Processing: Tool results are fed back to the model
- Final Response: Model generates final answer using tool results
Important Notes:
- Tools are logged but not executed by default (you see what the LLM would call)
- The LLM autonomously decides which tools to use and when
- Multiple tool calls may occur for complex queries
- Works seamlessly with infinite context processing
Supported Models
Tool calling is supported on:
- `captain-voyager-latest` (default)
Best Practices for Tool Calling
Do:
- ✅ Provide clear, descriptive tool names and descriptions
- ✅ Use specific parameter descriptions
- ✅ Set `"strict": True` in all function definitions
- ✅ Test with streaming mode for better UX
- ✅ Keep tool parameters simple and well-defined
Don’t:
- ❌ Forget `"strict": True`
- ❌ Use vague tool descriptions
- ❌ Define too many tools (keep it focused)
- ❌ Expect tool auto-execution (currently logged only)
Troubleshooting
Tool not being called:
- Ensure `"strict": True` is set
- Make the description more specific to your use case
- Try being more explicit in your query
Invalid tool schema error:
- Verify the JSON schema format in `parameters`
- Check that all required fields are present
- Ensure `type` values are valid JSON Schema types
Best Practices
Input Size Optimization
Captain handles files of any size automatically. You have two options:
- Standard approach: Use the OpenAI SDK with `extra_body.captain.context` - works for any size
- Multipart upload: Use `/v1/chat/completions/upload` for very large files if you prefer multipart form data
Streaming vs Non-Streaming
Use Streaming When:
- You want real-time responses
- Processing very large documents (reduces perceived latency)
- Building chat interfaces
- Need to show progress to users
Use Non-Streaming When:
- You need the complete response at once
- Processing in batch jobs
- Storing responses in databases
- Simpler implementation needed
Query Optimization
Good Queries:
- “What are the main conclusions in section 3?”
- “Summarize the methodology described in the paper”
- “List all security vulnerabilities mentioned”
Avoid:
- Vague queries: “Tell me about this”
- Multiple questions: “What are the themes, findings, and recommendations?”
- Yes/no questions without context: “Is this good?”
Better Approach:
- Break complex queries into separate requests
- Be specific about what sections or topics to focus on
- Ask for structured output: “List the top 5…”
Use Cases
Legal Document Analysis
Process entire legal contracts, case files, or regulatory documents to extract key clauses, risks, and obligations.
Scientific Paper Review
Analyze research papers, extract methodologies, findings, and compare multiple papers simultaneously.
Code Repository Analysis
Upload entire codebases to understand architecture, identify patterns, or find specific implementations.
Financial Report Processing
Process annual reports, 10-Ks, earnings transcripts to extract financial metrics and strategic insights.
Customer Support Ticket Analysis
Analyze thousands of support tickets to identify common issues, trends, and resolution patterns.
Rate Limits
Contact support@runcaptain.com to upgrade to Premium tier.
Related APIs
- Integrate Captain with Datalake - Index cloud storage buckets for persistent querying
- API Reference - Complete API documentation
- Getting Started - Quick start guide
Choosing the Right Endpoint
Use this guide to select the best endpoint for your use case:
/v1/chat/completions (OpenAI SDK)
Best for:
- Standard contexts (< 1 MB)
- Contexts that fit in memory
- Using the official OpenAI Python SDK
- Familiar OpenAI-compatible interface
Limits:
- No size limits - Captain handles any file size automatically
Example use case:
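Asking targeted questions about a single contract, research paper, or report loaded from disk - for example, "What are the termination clauses in this contract?"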
/v1/chat/completions/upload (Multipart Upload)
Best for:
- Very large files (1 MB - 100 MB+)
- Files already on disk
- Maximum performance for huge contexts
- Parallel processing of massive texts
Advantages:
- Direct file upload (no encoding overhead)
- Optimized for 10 MB+ files
- Automatic parallel processing
- Streaming starts immediately after upload
Example use case:
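Processing a 4.6MB text like the full Bible (see How It Works above): the file is split into chunks, compressed by parallel workers, and the answer streams back in roughly 10-15 seconds.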