Multimodal Search

Captain supports native multimodal search across text documents, images, video, and audio files. Upload any combination of file types to a single collection, and search across all of them with one query. Captain handles format detection, media segmentation, embedding, and cross-modal ranking automatically.

Multimodal search across video, images, and documents

On MRAG-Bench (ICLR 2025), a standardized academic benchmark for vision-centric retrieval with 16,130 images and 1,251 questions, Captain achieves 81.3% retrieval accuracy—outperforming every end-to-end RAG system tested in the paper, including GPT-4o with CLIP retrieval (68.96%).

Supported File Types

| Modality | Formats | Processing |
|---|---|---|
| Documents | PDF, DOCX, DOC, TXT, MD, JSON, YAML, CSV, XLSX | Text extraction, chunking, and semantic embedding |
| Images | PNG, JPEG, GIF, BMP, TIFF, WEBP | Native multimodal embedding + visual description |
| Video | MP4, MOV, AVI, MKV, WEBM, FLV, WMV | Segmented into ≤120s clips, natively embedded |
| Audio | MP3, WAV, AAC, FLAC, M4A, OGG, WMA | Segmented into ≤80s clips, natively embedded |

Some formats are automatically converted at ingestion time for processing (e.g., FLAC to MP3, WebP to PNG, AVI to MP4); the original file is preserved.
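The segmentation rule above (≤120s video clips, ≤80s audio clips) can be sketched as follows. This is an illustration of the documented behavior, not Captain's internal code; `segment_bounds` is a hypothetical helper:

```python
# Illustrative sketch of the segmentation rule described above
# (not Captain's internal implementation).

MAX_SEGMENT_SEC = {"video": 120, "audio": 80}

def segment_bounds(duration_sec, modality):
    """Split a media file into (start, end) clips no longer than the modality's cap."""
    cap = MAX_SEGMENT_SEC[modality]
    bounds = []
    start = 0.0
    while start < duration_sec:
        end = min(start + cap, duration_sec)
        bounds.append((start, end))
        start = end
    return bounds

# A 300-second video becomes three clips: 0-120s, 120-240s, 240-300s.
```

Each resulting clip is embedded as its own searchable unit, which is why video and audio results carry segment timestamps.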

How It Works

Captain uses a dual embedding strategy for media files. Each image, video segment, or audio segment is embedded in two ways:

  1. Native multimodal embedding (3072 dimensions): The raw bytes—image pixels, audio waveform, video frames—are embedded directly. This captures the actual visual, auditory, or temporal content.

  2. Text embedding (1024 dimensions): A vision-language model generates a structured description of the content, which is embedded alongside your text documents. This enables keyword search on filenames and descriptions, bridging the gap between text queries and media content.

This dual approach means a query for “Bruno Mars” finds audio files both by matching the sound of the music (native embedding) and by matching the artist name in the filename (text embedding).

Why reranking is required

Relevance scores from text search and media search are produced by different models on different scales. A text reranker score of 0.7 and a media cosine similarity of 0.7 do not mean the same thing—they cannot be sorted into a single list.

Captain solves this with reranker-informed pipeline weighting: the text reranker’s scores on media descriptions are used to determine how much weight each modality should receive in the final ranking. This is why rerank=true is required for multimodal collections—without the reranker, there’s no way to produce a meaningful cross-modal ranking.
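Captain's exact weighting algorithm is internal, but the idea can be illustrated with a toy sketch: the text reranker's score on each media item's *description* acts as a text-scale proxy, calibrating the media item's native-embedding score onto the same scale as text results. Everything here (`fuse`, the 0.5/0.5 blend) is illustrative, not Captain's actual formula:

```python
# Toy sketch of reranker-informed cross-modal fusion (illustrative only).
def fuse(text_results, media_results, desc_rerank_scores):
    """text_results / media_results: lists of (doc_id, raw_score).
    desc_rerank_scores: reranker score of each media item's text
    description, keyed by doc_id -- a text-scale relevance proxy."""
    fused = list(text_results)  # text scores are already on the target scale
    for doc_id, raw in media_results:
        # Blend the native-embedding score with the reranker's judgment
        # of the item's description to map it onto the text score scale.
        calibrated = 0.5 * raw + 0.5 * desc_rerank_scores.get(doc_id, 0.0)
        fused.append((doc_id, calibrated))
    return sorted(fused, key=lambda r: r[1], reverse=True)
```

Without the description scores there is no bridge between the two scales, which is the intuition behind the rerank=true requirement.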

Text-only collections

If your collection contains only text documents, multimodal search adds zero overhead. Captain automatically detects whether a collection has media content and skips the multimodal pipeline entirely for text-only collections.

Querying Multimodal Collections

Reranking is required when a collection contains multimodal content (images, video, or audio). If you set rerank=false on a multimodal collection, the API returns a 400 error. Text-only collections work with rerank=false as before.

```python
import requests
import uuid

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORG_ID = "your_organization_id"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID,
    "Content-Type": "application/json",
    "Idempotency-Key": str(uuid.uuid4())
}

response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/query",
    headers=headers,
    json={
        "query": "product demo with live dashboard",
        "inference": False,
        "rerank": True,
        "top_k": 10
    },
    timeout=120.0
)

data = response.json()
for result in data["search_results"]:
    modality = result["metadata"].get("modality", "text")
    print(f"[{modality}] {result['filename']} — score: {result['score']:.3f}")
    if result["metadata"].get("mediaSegmentStartSec") is not None:
        start = result["metadata"]["mediaSegmentStartSec"]
        end = result["metadata"]["mediaSegmentEndSec"]
        print(f"  Timestamp: {start}s - {end}s")
    print(f"  {result['content'][:100]}")
```

Response:

```json
{
  "success": true,
  "inference": false,
  "search_results": [
    {
      "score": 0.95,
      "content": "[Video: product_demo.mp4, 120s-240s]\nIn the video, a presenter walks through the analytics dashboard showing real-time revenue metrics and user engagement charts...",
      "document_id": "doc_video_123",
      "filename": "product_demo.mp4",
      "uri": "s3://my-media-bucket/videos/product_demo.mp4",
      "chunk_index": 1,
      "metadata": {
        "modality": "video",
        "source": "multimodal"
      }
    },
    {
      "score": 0.88,
      "content": "[Image: dashboard_screenshot.png]\n### Visual Description\nA screenshot of a web application dashboard with a dark theme. The main area displays a line chart of monthly revenue trending upward...",
      "document_id": "doc_img_456",
      "filename": "dashboard_screenshot.png",
      "uri": "s3://my-media-bucket/images/dashboard_screenshot.png",
      "chunk_index": 0,
      "metadata": {
        "modality": "image",
        "source": "multimodal"
      }
    },
    {
      "score": 0.82,
      "content": "The dashboard provides real-time analytics including revenue metrics, user engagement, and conversion rates...",
      "document_id": "doc_pdf_789",
      "filename": "product_docs.pdf",
      "uri": "s3://my-company-docs/product_docs.pdf",
      "chunk_index": 15,
      "page_start": 8,
      "page_end": 8,
      "metadata": {
        "modality": "text",
        "source": "hybrid"
      }
    }
  ],
  "total_results": 3,
  "top_k": 10,
  "query": "product demo with live dashboard",
  "execution_time_ms": 1240
}
```

Request Fields

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | required | The natural language search query |
| inference | boolean | false | Enable AI-powered answers with retrieved context |
| stream | boolean | false | Enable real-time streaming (only when inference=true) |
| rerank | boolean | true | Enable reranking. Required when multimodal content is present |
| top_k | integer | 10 | Number of results to return (only when inference=false) |
| include_bbox | boolean | false | Include bounding box layout data (only when inference=false) |
| include_documents | boolean | false | Include full document text in results |
| search_results | boolean | false | Include raw search chunks when inference=true |
| metadata_filter | object | null | Filter expression ($eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or) |
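For example, a metadata_filter can restrict a query to particular modalities using the operators listed above. The nesting shape below follows the common Mongo-style convention; the exact structure Captain expects should be confirmed against the API reference:

```python
# Example query payload restricting results to video and audio segments,
# using metadata_filter operators. The filter nesting shape is assumed
# (Mongo-style); verify against the API reference.
payload = {
    "query": "product demo with live dashboard",
    "inference": False,
    "rerank": True,  # required: the collection contains media
    "top_k": 10,
    "metadata_filter": {
        "$and": [
            {"modality": {"$in": ["video", "audio"]}},
            {"source": {"$eq": "multimodal"}}
        ]
    }
}
```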

Response Fields (per search result)

| Field | Type | Description |
|---|---|---|
| score | number | Final relevance score |
| content | string | Text content, or media label + VLM description (e.g., [Video: file.mp4, 0s-120s]\nDescription...) |
| document_id | string | Unique identifier of the source file |
| filename | string | Name of the source file |
| uri | string \| null | Original source URI of the file |
| chunk_index | integer | Index of this chunk (0 for images, segment number for video/audio) |
| page_start | integer \| null | Starting page number (text/PDF only) |
| page_end | integer \| null | Ending page number (text/PDF only) |
| metadata.modality | string | Content type: text, pdf, image, video, or audio |
| metadata.source | string | How this result was found: hybrid (text), vector (text), bm25 (text), or multimodal (media) |

Error: Reranking Required

If you query a multimodal collection with rerank=false:

```json
{
  "detail": "Reranking is required when multimodal content is present in the collection. Set rerank=true."
}
```

Status code: 400 Bad Request

This error only occurs for collections that contain media files. Text-only collections work with rerank=false as before.
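A client that queries both text-only and multimodal collections can guard against this defensively. `query_collection` and the injected `post` callable below are this document's own helpers, not part of any Captain SDK:

```python
# Illustrative client-side guard: if the API rejects the query because the
# collection holds media, retry once with rerank=True.
def query_collection(post, body):
    """post(body) -> (status_code, parsed_json); e.g. a thin requests wrapper."""
    status, data = post(body)
    if status == 400 and "Reranking is required" in data.get("detail", ""):
        status, data = post({**body, "rerank": True})
    return status, data
```

Alternatively, simply always send rerank=true; text-only collections accept it, and multimodal collections require it.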

Indexing Media Files

Media files are indexed through the same endpoints as text documents. No special configuration is needed—Captain automatically detects the file type and routes it through the appropriate processing pipeline.

Index from cloud storage

```python
import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
ORG_ID = "your_organization_id"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-Organization-ID": ORG_ID,
    "Content-Type": "application/json"
}

# Index an S3 bucket containing mixed content (PDFs, images, videos, audio)
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/index/s3",
    headers=headers,
    json={
        "bucket_name": "my-media-bucket",
        "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
        "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "bucket_region": "us-east-1",
        "processing_type": "advanced"
    }
)

print(response.json())
```

Index from URL

```python
# Index media files from direct URLs
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/index/url",
    headers=headers,
    json={
        "urls": [
            "https://example.com/product-demo.mp4",
            "https://example.com/screenshot.png",
            "https://example.com/podcast-episode.mp3"
        ],
        "processing_type": "basic"
    }
)
```

Upload files directly

```python
# Upload files via multipart form
with open("meeting_recording.mp4", "rb") as video, open("notes.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/v2/collections/my_media_library/index/file",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "X-Organization-ID": ORG_ID,
        },
        files=[
            ("files", ("meeting_recording.mp4", video, "video/mp4")),
            ("files", ("notes.pdf", pdf, "application/pdf")),
        ],
        data={"processing_type": "basic"}
    )
```

Using with Inference (AI-Powered Answers)

When inference=true, the AI agent automatically searches across all content types. Media results are presented to the agent as text descriptions with timestamps, and the agent can cite specific media files and segments in its response.

```python
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/query",
    headers=headers,
    json={
        "query": "What was discussed about the product roadmap?",
        "inference": True,
        "rerank": True,
        "search_results": True
    },
    timeout=120.0
)

data = response.json()
print(data["response"])  # AI-generated answer with citations
```

The agent sees media results as text labels like [Video: meeting_recording.mp4, 32s-120s] followed by the VLM-generated description, allowing it to cite specific timestamps in its response.
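When post-processing results yourself, the same labels can be parsed to recover the filename and segment window. The parser below is a sketch based on the label examples shown in this document; the exact label grammar is not formally specified:

```python
import re

# Illustrative parser for media labels like
# "[Video: meeting_recording.mp4, 32s-120s]" or "[Image: shot.png]".
# Label grammar inferred from the examples above -- an assumption, not a spec.
LABEL_RE = re.compile(
    r"^\[(?P<kind>Video|Audio|Image): (?P<file>[^,\]]+)"
    r"(?:, (?P<start>\d+)s-(?P<end>\d+)s)?\]"
)

def parse_label(content):
    """Return kind/filename/segment bounds from a result's content, or None."""
    m = LABEL_RE.match(content)
    if not m:
        return None
    start, end = m.group("start"), m.group("end")
    return {
        "kind": m.group("kind").lower(),
        "filename": m.group("file"),
        "start_sec": int(start) if start else None,
        "end_sec": int(end) if end else None,
    }
```

In practice, preferring the structured metadata fields (modality, mediaSegmentStartSec, mediaSegmentEndSec) over label parsing is more robust.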

Limitations

  • Video segments: Maximum 120 seconds per segment. Longer videos are automatically split.
  • Audio segments: Maximum 80 seconds per segment. Longer audio files are automatically split.
  • Format conversion: Some formats are converted at ingestion (FLAC→MP3, WebP→PNG, AVI→MP4). The original file is preserved.
  • Reranking required: Multimodal collections require rerank=true. Text-only collections are unaffected.
  • Latency: Multimodal queries take ~1-2 seconds (text embedding + media search + reranking). Text-only queries are unchanged.

Benchmark Results

Captain’s multimodal retrieval was evaluated on MRAG-Bench (Hu et al., ICLR 2025), achieving 81.3% ContentHit@5 across 1,251 questions with a corpus of 16,130 images. This outperforms GPT-4o + CLIP retrieval (68.96%), Gemini Pro + CLIP retrieval (65.93%), and human performance with retrieved images (61.38%).

Full results, methodology, and evaluation code are available in our open-source evaluation repository.