
# Multimodal Search

> Search across text, images, video, and audio in a single query. Captain uses native multimodal embeddings with cross-modal reranking for state-of-the-art retrieval quality.

## Agent Quick Reference - Multimodal Search

* **Endpoint**: `POST /v2/collections/{name}/query`
* **Key constraint**: `rerank=true` is REQUIRED when collection contains multimodal content (images, video, audio). Querying a multimodal collection with `rerank=false` fails the request.
* **Media processing**: Video segmented into ≤120s clips, audio into ≤80s clips, images natively embedded.
* **Response fields per result**: `score`, `content`, `document_id`, `filename`, `uri`, `chunk_index`, `rerank_score`, `metadata`. The modality lives at `metadata.modality` ("text"|"pdf"|"image"|"video"|"audio"). Media segment time ranges are the structured fields `metadata.startSec` / `metadata.endSec` (seconds); the `content` label is just `[Video: file.mp4]` with no timestamps.
* **Indexing**: Same endpoints as text. Captain auto-detects file type. No special configuration needed.

Captain supports native multimodal search across text documents, images, video, and audio files. Upload any combination of file types to a single collection, and search across all of them with one query. Captain handles format detection, media segmentation, embedding, and cross-modal ranking automatically.

## Benchmark Results

On [MRAG-Bench](https://github.com/runcaptain/captain-mrag-bench) (ICLR 2025), a standardized academic benchmark for vision-centric retrieval with 16,130 images and 1,251 questions, Captain achieves **81.3% retrieval accuracy**, outperforming every end-to-end RAG system tested in the paper, including GPT-4o with CLIP retrieval (68.96%). Full results, methodology, and evaluation code are available in our [open-source evaluation repository](https://github.com/runcaptain/captain-mrag-bench).

## Supported File Types

| Modality      | Formats                                        | Processing                                        |
| ------------- | ---------------------------------------------- | ------------------------------------------------- |
| **Documents** | PDF, DOCX, DOC, TXT, MD, JSON, YAML, CSV, XLSX | Text extraction, chunking, and semantic embedding |
| **Images**    | PNG, JPEG, GIF, BMP, TIFF, WEBP                | Native multimodal embedding + visual description  |
| **Video**     | MP4, MOV, AVI, MKV, WEBM, FLV, WMV             | Segmented into ≤120s clips, natively embedded     |
| **Audio**     | MP3, WAV, AAC, FLAC, M4A, OGG, WMA             | Segmented into ≤80s clips, natively embedded      |
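
The table above can be turned into a small client-side lookup, for example to decide up front whether a batch of files will make a collection multimodal (and therefore require `rerank=true`). This is a sketch, not part of the Captain API: `expected_modality` and `needs_rerank` are hypothetical helper names, and the map lists only the extensions named in the table, so aliases such as `.jpg` or `.yml` return `None` here even though Captain may accept them.

```python
from pathlib import Path
from typing import List, Optional

# Extension-to-modality map copied from the table above (assumed extensions
# match the format names; aliases like .jpg are intentionally not guessed).
MODALITY_BY_EXTENSION = {
    **dict.fromkeys(
        ["pdf", "docx", "doc", "txt", "md", "json", "yaml", "csv", "xlsx"], "document"
    ),
    **dict.fromkeys(["png", "jpeg", "gif", "bmp", "tiff", "webp"], "image"),
    **dict.fromkeys(["mp4", "mov", "avi", "mkv", "webm", "flv", "wmv"], "video"),
    **dict.fromkeys(["mp3", "wav", "aac", "flac", "m4a", "ogg", "wma"], "audio"),
}


def expected_modality(filename: str) -> Optional[str]:
    """Return the table's modality for a filename, or None if the extension is not listed."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return MODALITY_BY_EXTENSION.get(ext)


def needs_rerank(filenames: List[str]) -> bool:
    """True if any file is an image, video, or audio file."""
    return any(expected_modality(f) in {"image", "video", "audio"} for f in filenames)


print(expected_modality("product_demo.MP4"))
print(needs_rerank(["notes.pdf", "podcast-episode.mp3"]))
```

The server remains the source of truth for file-type detection; a lookup like this is only useful for pre-flight checks on the client.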

## How It Works

Captain uses a **dual embedding** strategy for media files. Each image, video segment, or audio segment is embedded in two ways:

1. **Native multimodal embedding** (3072 dimensions): The raw bytes (image pixels, audio waveform, video frames) are embedded directly. This captures the actual visual, auditory, or temporal content.

2. **Text embedding** (1024 dimensions): A model generates a structured description of what the media actually contains (a transcript for audio, a visual description for images and video), and that text is embedded alongside your text documents. Your query matches the content of the media, not just its filename.

This dual approach means a query for a song finds audio files both by matching the *sound* of the music (native embedding) and by matching the transcribed lyrics (text embedding). A search for the line "uptown funk you up" surfaces the right track even when the file is named `track03.mp3`.

### Why reranking is required

Relevance scores from text search and media search are produced by different models on different scales. A text reranker score of 0.7 and a media cosine similarity of 0.7 do not mean the same thing and cannot be sorted into a single list.

Captain solves this with **reranker-informed pipeline weighting**: the text reranker's scores on media descriptions are used to determine how much weight each modality should receive in the final ranking. This is why `rerank=true` is required for multimodal collections: without the reranker, there is no way to produce a meaningful cross-modal ranking.

### Text-only collections

If your collection contains only text documents, multimodal search adds zero overhead. Captain automatically detects whether a collection has media content and skips the multimodal pipeline entirely for text-only collections.

## Querying Multimodal Collections

**Reranking is required** when a collection contains multimodal content (images, video, or audio). If you set `rerank=false` on a multimodal collection, the query fails. Text-only collections work with `rerank=false` as before.
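
One way to avoid sending a request that is known to fail is to enforce this rule on the client before calling the endpoint. The sketch below assumes the caller already knows whether the collection holds media; `build_query_payload` and its `has_media` flag are hypothetical names for illustration, not API fields.

```python
from typing import Any, Dict


def build_query_payload(
    query: str, has_media: bool, rerank: bool = True, top_k: int = 10
) -> Dict[str, Any]:
    """Build a search-only body for POST /v2/collections/{name}/query.

    `has_media` is a caller-supplied flag (an assumption of this sketch);
    the remaining keys are request fields documented on this page.
    """
    if has_media and not rerank:
        raise ValueError(
            "rerank=False is rejected for collections containing images, video, or audio"
        )
    return {"query": query, "inference": False, "rerank": rerank, "top_k": top_k}


payload = build_query_payload("product demo with live dashboard", has_media=True)
print(payload)
```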

### Example: Multimodal Search

```python
import requests
import uuid

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
    "Idempotency-Key": str(uuid.uuid4())
}

response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/query",
    headers=headers,
    json={
        "query": "product demo with live dashboard",
        "inference": False,
        "rerank": True,
        "top_k": 10
    },
    timeout=120.0
)

response.raise_for_status()  # surface HTTP errors before reading the body
data = response.json()
for result in data["search_results"]:
    modality = result["metadata"].get("modality", "text")
    print(f"[{modality}] {result['filename']} | score: {result['score']:.3f}")
    # For video/audio results, the segment time range is in metadata.startSec /
    # metadata.endSec (seconds).
    meta = result["metadata"]
    if meta.get("startSec") is not None:
        print(f"  segment {meta['startSec']:.0f}s-{meta['endSec']:.0f}s")
    print(f"  {result['content'][:100]}")
```

**Response:**

```json
{
  "success": true,
  "inference": false,
  "search_results": [
    {
      "score": 0.95,
      "content": "[Video: product_demo.mp4]\nIn the video, a presenter walks through the analytics dashboard showing real-time revenue metrics and user engagement charts...",
      "document_id": "doc_video_123",
      "filename": "product_demo.mp4",
      "uri": "s3://my-media-bucket/videos/product_demo.mp4",
      "chunk_index": 1,
      "metadata": {
        "modality": "video",
        "source": "multimodal",
        "startSec": 120.0,
        "endSec": 240.0
      }
    },
    {
      "score": 0.88,
      "content": "[Image: dashboard_screenshot.png]\n### Visual Description\nA screenshot of a web application dashboard with a dark theme. The main area displays a line chart of monthly revenue trending upward...",
      "document_id": "doc_img_456",
      "filename": "dashboard_screenshot.png",
      "uri": "s3://my-media-bucket/images/dashboard_screenshot.png",
      "chunk_index": 0,
      "metadata": {
        "modality": "image",
        "source": "multimodal"
      }
    },
    {
      "score": 0.82,
      "content": "The dashboard provides real-time analytics including revenue metrics, user engagement, and conversion rates...",
      "document_id": "doc_pdf_789",
      "filename": "product_docs.pdf",
      "uri": "s3://my-company-docs/product_docs.pdf",
      "chunk_index": 15,
      "metadata": {
        "modality": "text",
        "source": "hybrid",
        "pageStart": 8,
        "pageEnd": 8
      }
    }
  ],
  "total_results": 3,
  "top_k": 10,
  "query": "product demo with live dashboard",
  "execution_time_ms": 1240
}
```
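
Because all modalities come back in one ranked list, a client that renders video, images, and text differently may want to bucket results by `metadata.modality`. A minimal sketch, using a trimmed copy of the response above; `group_by_modality` is a hypothetical helper, and results without a modality are treated as `"text"`, mirroring the fallback in the query example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_modality(data: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket search results by metadata.modality, keeping rank order inside each bucket."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for result in data.get("search_results", []):
        modality = result.get("metadata", {}).get("modality", "text")
        groups[modality].append(result)
    return dict(groups)


# Trimmed version of the example response above.
sample = {
    "search_results": [
        {"score": 0.95, "filename": "product_demo.mp4", "metadata": {"modality": "video"}},
        {"score": 0.88, "filename": "dashboard_screenshot.png", "metadata": {"modality": "image"}},
        {"score": 0.82, "filename": "product_docs.pdf", "metadata": {"modality": "text"}},
    ]
}

groups = group_by_modality(sample)
for modality, results in groups.items():
    print(modality, [r["filename"] for r in results])
```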

### Request Fields

| Field               | Type    | Default    | Description                                                                                  |
| ------------------- | ------- | ---------- | -------------------------------------------------------------------------------------------- |
| `query`             | string  | *required* | The natural language search query                                                            |
| `inference`         | boolean | `false`    | Enable AI-powered answers with retrieved context                                             |
| `stream`            | boolean | `false`    | Enable real-time streaming (only when `inference=true`)                                      |
| `rerank`            | boolean | `true`     | Enable reranking. **Required** when multimodal content is present                            |
| `top_k`             | integer | `10`       | Number of results to return (only when `inference=false`)                                    |
| `include_bbox`      | boolean | `false`    | Include bounding box layout data (only when `inference=false`)                               |
| `include_documents` | boolean | `false`    | Include full document text in results                                                        |
| `search_results`    | boolean | `false`    | Include raw search chunks when `inference=true`                                              |
| `metadata_filter`   | object  | `null`     | Filter expression (`$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`, `$and`, `$or`) |

### Response Fields (per search result)

| Field                | Type              | Description                                                                                                                       |
| -------------------- | ----------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| `score`              | number            | Final relevance score                                                                                                             |
| `content`            | string            | Text content, or media label + VLM description (e.g., `[Video: file.mp4]\nDescription...`)                                        |
| `document_id`        | string            | Unique identifier of the source file                                                                                              |
| `filename`           | string            | Name of the source file                                                                                                           |
| `uri`                | string \| null    | Original source URI of the file                                                                                                   |
| `chunk_index`        | integer           | Index of this chunk (0 for images, segment number for video/audio)                                                                |
| `rerank_score`       | number \| null    | Reranker score when `rerank=true`, otherwise null                                                                                 |
| `metadata.modality`  | string            | Content type: `text`, `pdf`, `image`, `video`, or `audio`                                                                         |
| `metadata.source`    | string            | How this result was found: `hybrid` (text), `vector` (text), `bm25` (text), or `multimodal` (media)                               |
| `metadata.pageStart` | integer \| null   | Starting page number (text/PDF only)                                                                                              |
| `metadata.pageEnd`   | integer \| null   | Ending page number (text/PDF only)                                                                                                |
| `metadata.startSec`  | number \| null    | Start of the segment in seconds (video/audio only)                                                                                |
| `metadata.endSec`    | number \| null    | End of the segment in seconds (video/audio only)                                                                                  |
| `metadata.sheetName` | string \| null    | Worksheet name the chunk came from (spreadsheets only)                                                                            |
| `metadata.section`   | string \| null    | Logical table section within the sheet (e.g. a repeated-header or stacked-table block); may carry a label like `section 1 (2024)` |
| `metadata.rowStart`  | integer \| null   | First worksheet row of the chunk, 1-based (spreadsheets only)                                                                     |
| `metadata.rowEnd`    | integer \| null   | Last worksheet row of the chunk, 1-based (spreadsheets only)                                                                      |
| `metadata.colStart`  | integer \| null   | First column of the chunk, 1-based (spreadsheets only)                                                                            |
| `metadata.colEnd`    | integer \| null   | Last column of the chunk, 1-based (spreadsheets only)                                                                             |
| `metadata.columns`   | string\[] \| null | Header labels for the chunk's columns, in order (spreadsheets only)                                                               |
| `metadata.rowRole`   | string \| null    | `aggregate` when the chunk's rows are totals/subtotals/averages, otherwise null (spreadsheets only)                               |

### Spreadsheet row/column provenance

For spreadsheet sources (XLSX, XLS, CSV, TSV), each chunk is a structure-aware block of rows
rendered as a markdown table, and carries first-class **row/column provenance** — the tabular
analogue of `pageStart`/`pageEnd` for PDFs. This lets you ground an answer in concrete
coordinates, e.g. *"Sheet 'YTD Sales', rows 2–13, columns Year–Ytd Total Sales"*.

```json
{
  "score": 0.81,
  "content": "# Sheet: YTD Sales — section 1 (2024)\n| Year | Month | Weekly Net Sales | Ytd Total Sales |\n| --- | --- | --- | --- |\n| 2024 | Jan | 8025 | 339987 |",
  "document_id": "doc_xyz789",
  "filename": "sales.xlsx",
  "chunk_index": 0,
  "metadata": {
    "sheetName": "YTD Sales",
    "section": "section 1 (2024)",
    "rowStart": 2,
    "rowEnd": 13,
    "colStart": 1,
    "colEnd": 4,
    "columns": ["Year", "Month", "Weekly Net Sales", "Ytd Total Sales"],
    "rowRole": null
  }
}
```

These fields are `null` for non-spreadsheet sources, exactly as `pageStart`/`pageEnd` are `null`
for spreadsheets. The chunk's `content` repeats the header row, so the table is self-describing
even when read in isolation.
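
The provenance fields can be assembled into a citation string along the lines of the example above. This is a sketch under stated assumptions: `spreadsheet_citation` is a hypothetical helper, and it treats a missing `sheetName` as "not a spreadsheet", which may need adjusting if some spreadsheet sources (for example CSV) return a null sheet name.

```python
from typing import Any, Dict, Optional


def spreadsheet_citation(metadata: Dict[str, Any]) -> Optional[str]:
    """Format sheet/row/column provenance, or return None when sheetName is absent."""
    sheet = metadata.get("sheetName")
    if sheet is None:
        return None
    parts = [f"Sheet '{sheet}'"]
    if metadata.get("rowStart") is not None and metadata.get("rowEnd") is not None:
        parts.append(f"rows {metadata['rowStart']}-{metadata['rowEnd']}")
    columns = metadata.get("columns")
    if columns:
        # Cite the first and last header label of the chunk.
        parts.append(f"columns {columns[0]}-{columns[-1]}")
    if metadata.get("rowRole") == "aggregate":
        parts.append("aggregate rows")
    return ", ".join(parts)


metadata = {
    "sheetName": "YTD Sales",
    "section": "section 1 (2024)",
    "rowStart": 2,
    "rowEnd": 13,
    "colStart": 1,
    "colEnd": 4,
    "columns": ["Year", "Month", "Weekly Net Sales", "Ytd Total Sales"],
    "rowRole": None,
}
print(spreadsheet_citation(metadata))
# Sheet 'YTD Sales', rows 2-13, columns Year-Ytd Total Sales
```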

### Media segment time range

For video and audio, each result is one **segment** of the source file. Its time range is
returned as the structured fields **`metadata.startSec`** and **`metadata.endSec`** (seconds) —
the time analogue of `pageStart`/`pageEnd` for documents. Read these directly; do not parse the
`content` string.

```json
{
  "score": 0.88,
  "content": "[Video: product_demo.mp4]\nThe presenter demonstrates the checkout flow...",
  "metadata": { "modality": "video", "startSec": 120.0, "endSec": 158.4 }
}
```

The `content` field's first line is a citation **label** (`[Video: <filename>]`) followed by the
VLM description and (for video) the spoken transcript. The label no longer carries the time range.
`startSec`/`endSec` are `null` for non-media (text/PDF) results, just as `pageStart`/`pageEnd` are
`null` for media.

| Modality | `content` label       | Example                      |
| -------- | --------------------- | ---------------------------- |
| Video    | `[Video: <filename>]` | `[Video: product_demo.mp4]`  |
| Audio    | `[Audio: <filename>]` | `[Audio: earnings_call.mp3]` |
| Image    | `[Image: <filename>]` | `[Image: architecture.png]`  |
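
Putting the label and the structured time range together, a client can render a human-readable citation for a segment. The sketch below reads the label from the first line of `content` and the times from `metadata.startSec`/`metadata.endSec`, as recommended above; `format_clock` and `media_citation` are hypothetical helper names, and truncating fractional seconds is a presentation choice of this sketch.

```python
from typing import Any, Dict, Optional


def format_clock(seconds: float) -> str:
    """Render seconds as M:SS or H:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def media_citation(result: Dict[str, Any]) -> Optional[str]:
    """Return '<label> start-end' for video/audio segments, or None when no time range is set."""
    meta = result.get("metadata", {})
    start, end = meta.get("startSec"), meta.get("endSec")
    if start is None or end is None:
        return None
    label = result.get("content", "").split("\n", 1)[0]  # first line is the citation label
    return f"{label} {format_clock(start)}-{format_clock(end)}"


result = {
    "score": 0.88,
    "content": "[Video: product_demo.mp4]\nThe presenter demonstrates the checkout flow...",
    "metadata": {"modality": "video", "startSec": 120.0, "endSec": 158.4},
}
print(media_citation(result))
# [Video: product_demo.mp4] 2:00-2:38
```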

### Querying with `rerank=false`

If you query a multimodal collection with `rerank=false`, the request fails with a generic internal error:

```json
{
  "error": "INTERNAL_ERROR",
  "message": "An internal error occurred",
  "path": "/v2/collections/{name}/query"
}
```

**Status code:** `500 Internal Server Error`

This error only occurs for collections that contain media files. Always send `rerank=true` for multimodal collections. Text-only collections work with `rerank=false` as before.
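
Since the error body is generic, a client has to infer the likely cause from what it sent. The following sketch maps a failed response to a hint; `explain_query_failure` is a hypothetical helper, and the link between `INTERNAL_ERROR` and `rerank=false` is a heuristic, since other server-side failures could presumably produce the same body.

```python
from typing import Any, Dict, Optional


def explain_query_failure(
    status_code: int, body: Dict[str, Any], rerank_sent: bool
) -> Optional[str]:
    """Return a human-readable hint for a failed query, or None if the call succeeded."""
    if status_code < 400:
        return None
    if status_code == 500 and body.get("error") == "INTERNAL_ERROR" and not rerank_sent:
        return (
            "Query failed with INTERNAL_ERROR while rerank=false was sent. "
            "If the collection contains images, video, or audio, retry with rerank=true."
        )
    return f"Query failed with HTTP {status_code}: {body.get('message', 'no message')}"


error_body = {
    "error": "INTERNAL_ERROR",
    "message": "An internal error occurred",
    "path": "/v2/collections/{name}/query",
}
print(explain_query_failure(500, error_body, rerank_sent=False))
```

In practice this would be called with `response.status_code` and `response.json()` from the query request.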

## Indexing Media Files

Media files are indexed through the same endpoints as text documents. No special configuration is needed: Captain automatically detects the file type and routes it through the appropriate processing pipeline.

### Index from cloud storage

```python
import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Index an S3 bucket containing mixed content (PDFs, images, videos, audio)
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/index/s3",
    headers=headers,
    json={
        "bucket_name": "my-media-bucket",
        "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
        "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "bucket_region": "us-east-1",
        "processing_type": "advanced"
    }
)

print(response.json())
```

### Index from URL

```python
# Index media files from direct URLs.
# Reuses `requests`, BASE_URL, and headers from the cloud storage example above.
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/index/url",
    headers=headers,
    json={
        "urls": [
            "https://example.com/product-demo.mp4",
            "https://example.com/screenshot.png",
            "https://example.com/podcast-episode.mp3"
        ],
        "processing_type": "basic"
    }
)
```

### Upload files directly

```python
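# Upload files via multipart form.
# Reuses `requests`, BASE_URL, and API_KEY from the cloud storage example above.
# Content-Type is left out on purpose: requests sets the multipart boundary itself.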
with open("meeting_recording.mp4", "rb") as video, open("notes.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/v2/collections/my_media_library/index/file",
        headers={
            "Authorization": f"Bearer {API_KEY}",
        },
        files=[
            ("files", ("meeting_recording.mp4", video, "video/mp4")),
            ("files", ("notes.pdf", pdf, "application/pdf")),
        ],
        data={"processing_type": "basic"}
    )
```

## Using with Inference (AI-Powered Answers)

When `inference=true`, the AI agent automatically searches across all content types. Media results are presented to the agent as text descriptions with timestamps, and the agent can cite specific media files and segments in its response.

```python
# Reuses `requests`, BASE_URL, and headers as defined in the earlier examples.
response = requests.post(
    f"{BASE_URL}/v2/collections/my_media_library/query",
    headers=headers,
    json={
        "query": "What was discussed about the product roadmap?",
        "inference": True,
        "rerank": True,
        "search_results": True
    },
    timeout=120.0
)

data = response.json()
print(data["response"])  # AI-generated answer with citations
```

The agent sees media results as a `[Video: meeting_recording.mp4]` label followed by the VLM-generated description, with the segment time range in `metadata.startSec`/`metadata.endSec`, allowing it to cite specific timestamps in its response.

## Limitations

* **Video segments**: Maximum 120 seconds per segment. Longer videos are automatically split.
* **Audio segments**: Maximum 80 seconds per segment. Longer audio files are automatically split.
* **Reranking required**: Multimodal collections require `rerank=true`. Text-only collections are unaffected.
* **Latency**: Multimodal queries take \~1-2 seconds (text embedding + media search + reranking). Text-only queries are unchanged. We are continuously optimizing query latency to make search as fast as possible without compromising accuracy.
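
The segment limits give a quick lower bound on how many segments, and therefore how many potential search results, a media file produces. The sketch assumes only the documented maximums (120 s for video, 80 s for audio); Captain may choose shorter segments, so the actual count can be higher. `min_segment_count` is a hypothetical helper name.

```python
import math

# Maximum segment lengths from the limits above, in seconds.
MAX_SEGMENT_SECONDS = {"video": 120, "audio": 80}


def min_segment_count(duration_sec: float, modality: str) -> int:
    """Smallest number of segments a file of this length can be split into."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return math.ceil(duration_sec / MAX_SEGMENT_SECONDS[modality])


print(min_segment_count(600, "video"))  # 10-minute video: at least 5 segments
print(min_segment_count(600, "audio"))  # 10-minute audio: at least 8 segments
```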