Multimodal Search

Captain supports native multimodal search across text documents, images, video, and audio files. Upload any combination of file types to a single collection, and search across all of them with one query. Captain handles format detection, media segmentation, embedding, and cross-modal ranking automatically.

On MRAG-Bench (ICLR 2025), a standardized academic benchmark for vision-centric retrieval with 16,130 images and 1,251 questions, Captain achieves 81.3% retrieval accuracy (ContentHit@5), outperforming every end-to-end RAG system tested in the paper, including GPT-4o with CLIP retrieval (68.96%).

Supported File Types

| Modality | Formats | Processing |
| --- | --- | --- |
| Documents | PDF, DOCX, DOC, TXT, MD, JSON, YAML, CSV, XLSX | Text extraction, chunking, and semantic embedding |
| Images | PNG, JPEG, GIF, BMP, TIFF, WEBP | Native multimodal embedding + visual description |
| Video | MP4, MOV, AVI, MKV, WEBM, FLV, WMV | Segmented into ≤120s clips, natively embedded |
| Audio | MP3, WAV, AAC, FLAC, M4A, OGG, WMA | Segmented into ≤80s clips, natively embedded |

Some of the accepted formats are automatically converted at ingestion time (e.g., FLAC to MP3, WebP to PNG, AVI to MP4). The original file is preserved.
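The segment limits in the table mean that a long recording becomes several searchable clips. The following is a minimal sketch of that arithmetic, assuming uniform fixed-length windows; Captain's actual cut points may differ, and `segment_bounds` is a hypothetical helper, not part of the Captain API.

```python
from typing import List, Tuple

# Maximum segment lengths in seconds, taken from the table above.
MAX_SEGMENT_SEC = {"video": 120, "audio": 80}


def segment_bounds(duration_sec: float, modality: str) -> List[Tuple[float, float]]:
    """Split [0, duration_sec] into consecutive windows no longer than the limit.

    Assumption: uniform fixed-length windows. The real segmenter may choose
    different boundaries.
    """
    limit = float(MAX_SEGMENT_SEC[modality])
    total = float(duration_sec)
    bounds = []
    start = 0.0
    while start < total:
        end = min(start + limit, total)
        bounds.append((start, end))
        start = end
    return bounds


# A 5-minute video is covered by three windows; the last one is shorter.
for index, (start, end) in enumerate(segment_bounds(300, "video")):
    print(f"segment {index}: {start:.0f}s-{end:.0f}s")
```

Under this assumption, the position of a window in the list lines up with the `chunk_index` shown for video and audio results later on this page.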

How It Works

Captain uses a dual embedding strategy for media files. Each image, video segment, or audio segment is embedded in two ways:

Native multimodal embedding (3072 dimensions): The raw bytes—image pixels, audio waveform, video frames—are embedded directly. This captures the actual visual, auditory, or temporal content.
Text embedding (1024 dimensions): A vision-language model generates a structured description of the content, which is embedded alongside your text documents. This enables keyword search on filenames and descriptions, bridging the gap between text queries and media content.

This dual approach means a query for “Bruno Mars” finds audio files both by matching the sound of the music (native embedding) and by matching the artist name in the filename (text embedding).
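To make the dual strategy concrete, here is a small illustrative sketch of a media segment that carries both embeddings and is scored in each space separately. The `MediaSegment` layout, the function names, and the tiny vectors are assumptions for illustration only; they do not describe Captain's internal storage or models.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class MediaSegment:
    """One image, video clip, or audio clip with both embeddings (hypothetical layout)."""
    filename: str
    native_vec: List[float]  # 3072 dimensions in Captain; shortened here
    text_vec: List[float]    # 1024 dimensions in Captain; shortened here


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_segment(
    segment: MediaSegment,
    query_native: Sequence[float],
    query_text: Sequence[float],
) -> Dict[str, float]:
    """Score the query in each embedding space; the two numbers are not yet comparable."""
    return {
        "native": cosine(segment.native_vec, query_native),
        "text": cosine(segment.text_vec, query_text),
    }


segment = MediaSegment("song.mp3", native_vec=[1.0, 0.0, 0.0], text_vec=[0.0, 1.0])
scores = score_segment(segment, query_native=[1.0, 0.0, 0.0], query_text=[1.0, 0.0])
print(scores)
```

A segment can match strongly in one space and not at all in the other, which is why both signals are kept.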

Why reranking is required

Relevance scores from text search and media search are produced by different models on different scales. A text reranker score of 0.7 and a media cosine similarity of 0.7 do not mean the same thing—they cannot be sorted into a single list.

Captain solves this with reranker-informed pipeline weighting: the text reranker’s scores on media descriptions are used to determine how much weight each modality should receive in the final ranking. This is why rerank=true is required for multimodal collections—without the reranker, there’s no way to produce a meaningful cross-modal ranking.
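The exact weighting formula is not documented here. As a rough sketch of the idea, the code below rescales each pipeline's scores to a common range and weights each pipeline by its best reranker score. The heuristic, the `fuse` function, and the sample numbers are all assumptions made for illustration, not Captain's algorithm.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (result id, raw score from its own pipeline)


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] within one pipeline; a constant list maps to all 1.0."""
    if not scores:
        return []
    low, high = min(scores), max(scores)
    if high == low:
        return [1.0] * len(scores)
    return [(s - low) / (high - low) for s in scores]


def fuse(
    text_hits: List[Hit],
    media_hits: List[Hit],
    media_description_rerank: List[float],
) -> List[Hit]:
    """Merge text and media hits into one ranked list.

    text_hits: (id, reranker score) for text chunks.
    media_hits: (id, cosine similarity) for media segments.
    media_description_rerank: reranker scores of the media descriptions.
    """
    # Assumed heuristic: a pipeline's confidence is its best reranker score.
    text_conf = max((score for _, score in text_hits), default=0.0)
    media_conf = max(media_description_rerank, default=0.0)
    total = text_conf + media_conf
    text_weight = text_conf / total if total > 0 else 0.5
    media_weight = 1.0 - text_weight

    fused: List[Hit] = []
    for (result_id, _), norm in zip(text_hits, min_max([s for _, s in text_hits])):
        fused.append((result_id, text_weight * norm))
    for (result_id, _), norm in zip(media_hits, min_max([s for _, s in media_hits])):
        fused.append((result_id, media_weight * norm))
    return sorted(fused, key=lambda hit: hit[1], reverse=True)


ranked = fuse(
    text_hits=[("doc.pdf", 0.9), ("faq.md", 0.3)],
    media_hits=[("demo.mp4", 0.41), ("logo.png", 0.35)],
    media_description_rerank=[0.6, 0.1],
)
print(ranked)
```

Without the reranker scores there is nothing to set `text_weight` and `media_weight` from, which mirrors why the API rejects `rerank=false` on multimodal collections.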

Text-only collections

If your collection contains only text documents, multimodal search adds zero overhead. Captain automatically detects whether a collection has media content and skips the multimodal pipeline entirely for text-only collections.

Querying Multimodal Collections

Reranking is required when a collection contains multimodal content (images, video, or audio). If you set rerank=false on a multimodal collection, the API returns a 400 error. Text-only collections work with rerank=false as before.

Example: Multimodal Search

import requests
import uuid

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORG_ID = "your_organization_id"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID,
    "Content-Type": "application/json",
    "Idempotency-Key": str(uuid.uuid4())
}

# rerank must be True because the collection contains media files
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/query",
    headers=headers,
    json={
        "query": "product demo with live dashboard",
        "inference": False,
        "rerank": True,
        "top_k": 10
    },
    timeout=120.0
)
# Raise on HTTP errors (such as the 400 for rerank=false) before parsing the body
response.raise_for_status()

data = response.json()
for result in data["search_results"]:
    modality = result["metadata"].get("modality", "text")
    print(f"[{modality}] {result['filename']} - score: {result['score']:.3f}")
    # Print segment timestamps when the result metadata includes them
    if result["metadata"].get("mediaSegmentStartSec") is not None:
        start = result["metadata"]["mediaSegmentStartSec"]
        end = result["metadata"]["mediaSegmentEndSec"]
        print(f"  Timestamp: {start}s - {end}s")
    print(f"  {result['content'][:100]}")

Response:

{
  "success": true,
  "inference": false,
  "search_results": [
    {
      "score": 0.95,
      "content": "[Video: product_demo.mp4, 120s-240s]\nIn the video, a presenter walks through the analytics dashboard showing real-time revenue metrics and user engagement charts...",
      "document_id": "doc_video_123",
      "filename": "product_demo.mp4",
      "uri": "s3://my-media-bucket/videos/product_demo.mp4",
      "chunk_index": 1,
      "metadata": {
        "modality": "video",
        "source": "multimodal"
      }
    },
    {
      "score": 0.88,
      "content": "[Image: dashboard_screenshot.png]\n### Visual Description\nA screenshot of a web application dashboard with a dark theme. The main area displays a line chart of monthly revenue trending upward...",
      "document_id": "doc_img_456",
      "filename": "dashboard_screenshot.png",
      "uri": "s3://my-media-bucket/images/dashboard_screenshot.png",
      "chunk_index": 0,
      "metadata": {
        "modality": "image",
        "source": "multimodal"
      }
    },
    {
      "score": 0.82,
      "content": "The dashboard provides real-time analytics including revenue metrics, user engagement, and conversion rates...",
      "document_id": "doc_pdf_789",
      "filename": "product_docs.pdf",
      "uri": "s3://my-company-docs/product_docs.pdf",
      "chunk_index": 15,
      "page_start": 8,
      "page_end": 8,
      "metadata": {
        "modality": "text",
        "source": "hybrid"
      }
    }
  ],
  "total_results": 3,
  "top_k": 10,
  "query": "product demo with live dashboard",
  "execution_time_ms": 1240
}

Request Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `query` | string | required | The natural language search query |
| `inference` | boolean | `false` | Enable AI-powered answers with retrieved context |
| `stream` | boolean | `false` | Enable real-time streaming (only when `inference=true`) |
| `rerank` | boolean | `true` | Enable reranking. Required when multimodal content is present |
| `top_k` | integer | `10` | Number of results to return (only when `inference=false`) |
| `include_bbox` | boolean | `false` | Include bounding box layout data (only when `inference=false`) |
| `include_documents` | boolean | `false` | Include full document text in results |
| `search_results` | boolean | `false` | Include raw search chunks when `inference=true` |
| `metadata_filter` | object | `null` | Filter expression (`$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`, `$and`, `$or`) |
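The table lists the operators `metadata_filter` accepts but not the shape of the expression. The sketch below assumes a MongoDB-style nesting (`{"field": {"$op": value}}`), which this page does not confirm, and uses hypothetical field names (`department`, `year`). It only checks a filter against the documented operator list before the filter is sent.

```python
from typing import Any, Set

# Operators documented in the Request Fields table.
ALLOWED_OPERATORS = {
    "$eq", "$ne", "$gt", "$gte", "$lt", "$lte", "$in", "$nin", "$and", "$or",
}


def unknown_operators(filter_expr: Any) -> Set[str]:
    """Recursively collect '$'-prefixed keys that are not in the documented list."""
    found: Set[str] = set()
    if isinstance(filter_expr, dict):
        for key, value in filter_expr.items():
            if key.startswith("$") and key not in ALLOWED_OPERATORS:
                found.add(key)
            found |= unknown_operators(value)
    elif isinstance(filter_expr, list):
        for item in filter_expr:
            found |= unknown_operators(item)
    return found


# Hypothetical filter: field names and nesting are assumptions, not documented API.
example_filter = {
    "$and": [
        {"department": {"$in": ["sales", "marketing"]}},
        {"year": {"$gte": 2024}},
    ]
}

print(unknown_operators(example_filter))              # empty set: all operators documented
print(unknown_operators({"year": {"$regex": "20"}}))  # flags the undocumented operator
```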

Response Fields (per search result)

| Field | Type | Description |
| --- | --- | --- |
| `score` | number | Final relevance score |
| `content` | string | Text content, or media label + VLM description (e.g., `[Video: file.mp4, 0s-120s]\nDescription...`) |
| `document_id` | string | Unique identifier of the source file |
| `filename` | string | Name of the source file |
| `uri` | string \| null | Original source URI of the file |
| `chunk_index` | integer | Index of this chunk (0 for images, segment number for video/audio) |
| `page_start` | integer \| null | Starting page number (text/PDF only) |
| `page_end` | integer \| null | Ending page number (text/PDF only) |
| `metadata.modality` | string | Content type: `text`, `pdf`, `image`, `video`, or `audio` |
| `metadata.source` | string | How this result was found: `hybrid` (text), `vector` (text), `bm25` (text), or `multimodal` (media) |
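Because media results put a bracketed label at the start of `content`, a client can recover the filename and time range from that label. The sketch below is based only on the label examples shown on this page (`[Video: ..., 120s-240s]` and `[Image: ...]`); the `Audio` prefix and the helper name `parse_media_label` are assumptions.

```python
import re
from typing import Any, Dict, Optional

# Matches "[Video: name.mp4, 120s-240s]" or "[Image: name.png]" at the start of content.
# The "Audio" prefix is assumed by analogy; this page only shows Video and Image labels.
LABEL_RE = re.compile(
    r"^\[(?P<kind>Video|Audio|Image): (?P<filename>.+?)"
    r"(?:, (?P<start>\d+(?:\.\d+)?)s-(?P<end>\d+(?:\.\d+)?)s)?\]"
)


def parse_media_label(content: str) -> Optional[Dict[str, Any]]:
    """Return kind, filename, and optional time range, or None for plain text content."""
    match = LABEL_RE.match(content)
    if match is None:
        return None
    start, end = match.group("start"), match.group("end")
    return {
        "kind": match.group("kind").lower(),
        "filename": match.group("filename"),
        "start_sec": float(start) if start is not None else None,
        "end_sec": float(end) if end is not None else None,
    }


print(parse_media_label("[Video: product_demo.mp4, 120s-240s]\nIn the video, a presenter..."))
print(parse_media_label("[Image: dashboard_screenshot.png]\n### Visual Description"))
print(parse_media_label("The dashboard provides real-time analytics..."))
```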

Error: Reranking Required

If you query a multimodal collection with rerank=false:

{
  "detail": "Reranking is required when multimodal content is present in the collection. Set rerank=true."
}

Status code: 400 Bad Request

This error only occurs for collections that contain media files. Text-only collections work with rerank=false as before.
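A client that does not know in advance whether a collection holds media can detect this error and retry with reranking enabled. The sketch below is one way to do that. It takes the HTTP call as an injected `send` function so it runs without network access, and it assumes successful queries return status 200. The helper names are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple

# A transport takes the JSON payload and returns (status_code, parsed_body).
Transport = Callable[[Dict[str, Any]], Tuple[int, Dict[str, Any]]]


def is_rerank_required_error(status_code: int, body: Dict[str, Any]) -> bool:
    """True when the response matches the documented 400 'Reranking is required' error."""
    detail = str(body.get("detail", ""))
    return status_code == 400 and "Reranking is required" in detail


def query_with_rerank_fallback(send: Transport, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Send the query; if the collection turns out to be multimodal, retry with rerank=True."""
    status, body = send(payload)
    if is_rerank_required_error(status, body):
        status, body = send({**payload, "rerank": True})
    if status != 200:  # assumption: successful queries return 200
        raise RuntimeError(f"query failed with status {status}: {body}")
    return body


# Stand-in transport that mimics the documented behavior of a multimodal collection.
calls = []


def fake_send(payload: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    calls.append(payload)
    if not payload.get("rerank", True):
        return 400, {
            "detail": "Reranking is required when multimodal content is present "
                      "in the collection. Set rerank=true."
        }
    return 200, {"success": True, "search_results": []}


result = query_with_rerank_fallback(fake_send, {"query": "demo", "rerank": False})
print(result["success"], len(calls))
```

With `requests`, `send` can be a small wrapper that calls `requests.post(...)` on the query endpoint and returns `(response.status_code, response.json())`.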

Indexing Media Files

Media files are indexed through the same endpoints as text documents. No special configuration is needed—Captain automatically detects the file type and routes it through the appropriate processing pipeline.
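Detection happens on Captain's side, but a client may still want to check before upload which pipeline a file is likely to take, or whether its format is listed at all. The sketch below maps extensions to the modalities in the Supported File Types table. The `.jpg`, `.yml`, and `.tif` aliases are assumptions (the table lists format names, not extensions), and `modality_for` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional

# Formats from the Supported File Types table, lowercased.
EXTENSIONS = {
    "document": {"pdf", "docx", "doc", "txt", "md", "json", "yaml", "csv", "xlsx"},
    "image": {"png", "jpeg", "gif", "bmp", "tiff", "webp"},
    "video": {"mp4", "mov", "avi", "mkv", "webm", "flv", "wmv"},
    "audio": {"mp3", "wav", "aac", "flac", "m4a", "ogg", "wma"},
}

# Common alternative extensions; treating them as supported is an assumption.
ALIASES = {"jpg": "jpeg", "yml": "yaml", "tif": "tiff"}


def modality_for(filename: str) -> Optional[str]:
    """Return the modality implied by the file extension, or None if it is not listed."""
    extension = PurePosixPath(filename).suffix.lower().lstrip(".")
    extension = ALIASES.get(extension, extension)
    for modality, extensions in EXTENSIONS.items():
        if extension in extensions:
            return modality
    return None


for name in ["meeting_recording.mp4", "notes.pdf", "photo.JPG", "archive.zip"]:
    print(name, "->", modality_for(name))
```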

Index from cloud storage

import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORG_ID = "your_organization_id"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID,
    "Content-Type": "application/json"
}

# Index an S3 bucket containing mixed content (PDFs, images, videos, audio)
# The AWS keys below are placeholders; substitute your own credentials
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/index/s3",
    headers=headers,
    json={
        "bucket_name": "my-media-bucket",
        "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
        "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "bucket_region": "us-east-1",
        "processing_type": "advanced"
    }
)

print(response.json())

Index from URL

# Index media files from direct URLs
# Reuses BASE_URL and headers from the cloud storage example above
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/index/url",
    headers=headers,
    json={
        "urls": [
            "https://example.com/product-demo.mp4",
            "https://example.com/screenshot.png",
            "https://example.com/podcast-episode.mp3"
        ],
        "processing_type": "basic"
    }
)

Upload files directly

# Upload files via multipart form
# Content-Type is omitted on purpose: requests sets the multipart boundary header itself
with open("meeting_recording.mp4", "rb") as video, open("notes.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/v2/collections/my_media_library/index/file",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "X-Organization-ID": ORG_ID,
        },
        files=[
            ("files", ("meeting_recording.mp4", video, "video/mp4")),
            ("files", ("notes.pdf", pdf, "application/pdf")),
        ],
        data={"processing_type": "basic"}
    )

Using with Inference (AI-Powered Answers)

When inference=true, the AI agent automatically searches across all content types. Media results are presented to the agent as text descriptions with timestamps, and the agent can cite specific media files and segments in its response.

# Reuses BASE_URL and headers from the earlier examples
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/query",
    headers=headers,
    json={
        "query": "What was discussed about the product roadmap?",
        "inference": True,
        "rerank": True,
        "search_results": True
    },
    timeout=120.0
)

data = response.json()
print(data["response"])  # AI-generated answer with citations

The agent sees media results as text labels like [Video: meeting_recording.mp4, 32s-120s] followed by the VLM-generated description, allowing it to cite specific timestamps in its response.

Limitations

Video segments: Maximum 120 seconds per segment. Longer videos are automatically split.
Audio segments: Maximum 80 seconds per segment. Longer audio files are automatically split.
Format conversion: Some formats are converted at ingestion (FLAC→MP3, WebP→PNG, AVI→MP4). The original file is preserved.
Reranking required: Multimodal collections require rerank=true. Text-only collections are unaffected.
Latency: Multimodal queries take ~1-2 seconds (text embedding + media search + reranking). Text-only queries are unchanged.

Benchmark Results

Captain’s multimodal retrieval was evaluated on MRAG-Bench (Hu et al., ICLR 2025), achieving 81.3% ContentHit@5 across 1,251 questions with a corpus of 16,130 images. This outperforms GPT-4o + CLIP retrieval (68.96%), Gemini Pro + CLIP retrieval (65.93%), and human performance with retrieved images (61.38%).

Full results, methodology, and evaluation code are available in our open-source evaluation repository.
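For readers unfamiliar with hit-rate metrics, the sketch below shows a generic hit@k computation. It assumes ContentHit@5 counts a question as a hit when at least one relevant item appears among the top 5 retrieved results. The authoritative definition is the one in the evaluation repository, and the question IDs and items here are made up.

```python
from typing import Dict, List, Set


def hit_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Fraction of questions with at least one relevant item among the top-k retrieved."""
    if not relevant:
        raise ValueError("no questions to evaluate")
    hits = 0
    for question_id, gold_items in relevant.items():
        top_k = retrieved.get(question_id, [])[:k]
        if gold_items.intersection(top_k):
            hits += 1
    return hits / len(relevant)


# Toy data: q1 is a hit (item "b" is in its top 5); q2 and q3 are misses.
retrieved = {"q1": ["a", "b"], "q2": ["x"], "q3": []}
relevant = {"q1": {"b"}, "q2": {"y"}, "q3": {"z"}}
print(hit_at_k(retrieved, relevant, k=5))
```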