
# Parsers (BETA)

> Write JavaScript parsing scripts to extract searchable text from JSON files. Captain executes your script in a sandboxed V8 isolate during indexing, transforming structured JSON into markdown that gets chunked, embedded, and searched.

## Agent Quick Reference - JSON Parsers

* **Feature**: Custom JavaScript parsing scripts for JSON file indexing
* **Status**: BETA
* **How it works**: Add the `parsing_script` parameter to any indexing endpoint. JSON files route through a V8 isolate that runs your script. Other file types are unaffected.
* **Script contract**: `export default function extract(doc) { return "markdown string" }`. The function must return a string, which Captain indexes.
* **Storage**: Scripts managed via Parser Studio UI. Referenced by relative path.
* **Limits**: 10s execution timeout, 64MB memory, 500MB max JSON file size, 50MB max for V8 processing.
* **Without `parsing_script`**: JSON files are indexed as raw text (existing behavior, unchanged).

Example request:

```
POST /v2/collections/{name}/index/s3
{
  "bucket_name": "my-data",
  "aws_access_key_id": "...",
  "aws_secret_access_key": "...",
  "processing_type": "basic",
  "parsing_script": "research/paper-parser"
}
```

## What are Parsers?

Parsers let you tell Captain how to read your JSON files. Instead of indexing raw JSON as text, you write a short JavaScript function that extracts the content you actually want to search over.

**Without a parser**, Captain indexes your JSON as-is:

```
{"title":"Attention Is All You Need","authors":["Vaswani","Shazeer"],"content":"We propose a new..."}
```

**With a parser**, Captain indexes clean, searchable text:

```markdown
# Attention Is All You Need

**Authors:** Vaswani, Shazeer

We propose a new simple network architecture, the Transformer...
```

Same data. Much better search results.

## How It Works

```
You write a parsing script (JavaScript)
    |
    v
Upload it via Parser Studio (/studio/parsers)
    |
    v
Start an indexing job with parsing_script parameter
    |
    v
For each .json file:
  1. Captain downloads the file
  2. Parses it as JSON
  3. Runs your script in a sandboxed V8 engine
  4. Your script returns a string
  5. That text gets chunked, embedded, and indexed
    |
    v
Search works over the extracted text
```

Other file types (PDF, DOCX, images, etc.) are completely unaffected. The parser only applies to `.json` files.
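
The per-file steps above can be approximated locally before you upload a script. This is a minimal sketch, not Captain's implementation: `simulateIndexing` is a hypothetical helper, it runs in plain Node.js rather than the sandboxed V8 isolate, and it does not enforce the time or memory limits. The error strings are borrowed from the Error Handling table.

```javascript
// Hypothetical local stand-in for steps 2-4: parse the file, run the
// script, and check that it produced a non-empty string.
function simulateIndexing(extract, jsonText) {
  var doc;
  try {
    doc = JSON.parse(jsonText);
  } catch (err) {
    return { ok: false, error: "Invalid JSON" };
  }

  if (typeof extract !== "function") {
    return { ok: false, error: "Script must export a default function" };
  }

  var text = extract(doc);
  if (typeof text !== "string") {
    return { ok: false, error: "Must return a string" };
  }
  if (text === "") {
    return { ok: false, error: "Parsing script returned empty text" };
  }
  return { ok: true, text: text };
}

// Example: a one-line parser run against a small document.
var result = simulateIndexing(
  function extract(doc) { return "# " + doc.title; },
  '{"title": "Attention Is All You Need"}'
);
console.log(result.text); // "# Attention Is All You Need"
```

A check like this catches wrong field paths and non-string returns early, but only a real indexing job exercises the sandbox itself.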

## Writing Your First Parser

A parsing script is a JavaScript function that receives your JSON object and returns a string:

```javascript
export default function extract(doc) {
  return doc.content;
}
```

That's it. `doc` is your parsed JSON. Return the string you want Captain to index.

### The returned string

The string you return becomes the searchable content. You can format it however you want. Markdown works great because it preserves structure:

```javascript
export default function extract(doc) {
  var md = "";

  if (doc.title) {
    md += "# " + doc.title + "\n\n";
  }

  if (doc.authors) {
    md += "**Authors:** " + doc.authors.join(", ") + "\n\n";
  }

  if (doc.abstract) {
    md += doc.abstract;
  }

  return md;
}
```

### What to return

Your function must return a **string**. Any other return type fails that file at indexing time with "Must return a string".

```javascript
// GOOD: return a string
return doc.content;
return "# " + doc.title + "\n\n" + doc.content;
return parts.join("\n");
```

What does NOT work:

```javascript
// BAD: returning an object
return { text: doc.content };

// BAD: returning a number
return 42;
```
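
If a field you want to index is not already a string, convert it before returning. A minimal sketch, where `toText` is a hypothetical helper name and not part of Captain's API:

```javascript
// Hypothetical helper: turn any JSON value into a string so the parser
// never returns a number, array, or object by accident.
function toText(value) {
  if (value === null || value === undefined) return "";
  if (typeof value === "string") return value;
  if (Array.isArray(value)) {
    return value.map(function (item) { return toText(item); }).join(", ");
  }
  if (typeof value === "object") return JSON.stringify(value);
  return String(value); // numbers and booleans
}

// In a parsing script you would wrap this as:
//   export default function extract(doc) { return toText(doc.content); }
console.log(toText(42));                     // "42"
console.log(toText(["Vaswani", "Shazeer"])); // "Vaswani, Shazeer"
```

Falling back to `JSON.stringify` for nested objects keeps the contract intact, though a purpose-built formatter usually gives better search text.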

## Examples

### Research Papers

**Input JSON:**

```json
{
  "title": "Attention Is All You Need",
  "authors": ["Vaswani", "Shazeer", "Parmar", "Uszkoreit", "Jones"],
  "content": "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.",
  "metadata": {
    "doi": "10.48550/arXiv.1706.03762",
    "year": 2017,
    "journal": "NeurIPS"
  }
}
```

**Parsing script:**

```javascript
export default function extract(doc) {
  var md = "";

  if (doc.title) {
    md += "# " + doc.title + "\n\n";
  }

  if (doc.authors) {
    var authorList = Array.isArray(doc.authors)
      ? doc.authors.join(", ")
      : doc.authors;
    md += "**Authors:** " + authorList + "\n\n";
  }

  if (doc.metadata && doc.metadata.doi) {
    md += "**DOI:** " + doc.metadata.doi + "\n\n";
  }

  if (doc.content) {
    md += doc.content;
  }

  return md;
}
```

**Indexed text:**

```markdown
# Attention Is All You Need

**Authors:** Vaswani, Shazeer, Parmar, Uszkoreit, Jones

**DOI:** 10.48550/arXiv.1706.03762

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
```

### Product Catalog

**Input JSON:**

```json
{
  "sku": "WH-1000XM5",
  "name": "Sony WH-1000XM5 Wireless Headphones",
  "category": "Electronics > Audio > Headphones",
  "price": 349.99,
  "description": "Industry-leading noise cancellation with Auto NC Optimizer.",
  "specs": {
    "driver_size": "30mm",
    "battery_life": "30 hours",
    "weight": "250g",
    "connectivity": ["Bluetooth 5.2", "3.5mm", "USB-C"]
  },
  "reviews_summary": "Excellent ANC, comfortable fit, premium sound. Some users note the folding mechanism changed from XM4."
}
```

**Parsing script:**

```javascript
export default function extract(doc) {
  var parts = [];

  parts.push("# " + doc.name);
  parts.push("**SKU:** " + doc.sku);
  parts.push("**Category:** " + doc.category);
  parts.push("**Price:** $" + doc.price);

  if (doc.description) {
    parts.push("\n" + doc.description);
  }

  if (doc.specs) {
    parts.push("\n## Specifications");
    Object.keys(doc.specs).forEach(function(key) {
      var val = doc.specs[key];
      var label = key.replace(/_/g, " ");
      if (Array.isArray(val)) {
        parts.push("- **" + label + ":** " + val.join(", "));
      } else {
        parts.push("- **" + label + ":** " + val);
      }
    });
  }

  if (doc.reviews_summary) {
    parts.push("\n## Customer Reviews");
    parts.push(doc.reviews_summary);
  }

  return parts.join("\n");
}
```

**Indexed text:**

```markdown
# Sony WH-1000XM5 Wireless Headphones
**SKU:** WH-1000XM5
**Category:** Electronics > Audio > Headphones
**Price:** $349.99

Industry-leading noise cancellation with Auto NC Optimizer.

## Specifications
- **driver size:** 30mm
- **battery life:** 30 hours
- **weight:** 250g
- **connectivity:** Bluetooth 5.2, 3.5mm, USB-C

## Customer Reviews
Excellent ANC, comfortable fit, premium sound. Some users note the folding mechanism changed from XM4.
```

### Healthcare Records (FHIR)

**Input JSON:**

```json
{
  "resourceType": "Patient",
  "id": "example-001",
  "name": [{ "family": "Smith", "given": ["John", "Michael"] }],
  "birthDate": "1990-05-15",
  "gender": "male",
  "address": [{ "city": "Portland", "state": "OR" }],
  "condition": [
    { "code": { "text": "Type 2 Diabetes Mellitus" }, "onsetDateTime": "2018-03-01" },
    { "code": { "text": "Essential Hypertension" }, "onsetDateTime": "2020-07-15" }
  ],
  "medication": [
    { "code": { "text": "Metformin 500mg" }, "status": "active" },
    { "code": { "text": "Lisinopril 10mg" }, "status": "active" }
  ]
}
```

**Parsing script:**

```javascript
export default function extract(doc) {
  var parts = [];

  // Patient name
  if (doc.name && doc.name[0]) {
    var n = doc.name[0];
    var fullName = (n.given || []).join(" ") + " " + (n.family || "");
    parts.push("# Patient: " + fullName.trim());
  }

  parts.push("**DOB:** " + (doc.birthDate || "Unknown"));
  parts.push("**Gender:** " + (doc.gender || "Unknown"));

  if (doc.address && doc.address[0]) {
    var addr = doc.address[0];
    parts.push("**Location:** " + [addr.city, addr.state].filter(Boolean).join(", "));
  }

  // Conditions
  if (doc.condition && doc.condition.length > 0) {
    parts.push("\n## Active Conditions");
    doc.condition.forEach(function(c) {
      var text = c.code && c.code.text ? c.code.text : "Unknown condition";
      var onset = c.onsetDateTime ? " (onset: " + c.onsetDateTime + ")" : "";
      parts.push("- " + text + onset);
    });
  }

  // Medications
  if (doc.medication && doc.medication.length > 0) {
    parts.push("\n## Current Medications");
    doc.medication.forEach(function(m) {
      var text = m.code && m.code.text ? m.code.text : "Unknown medication";
      var status = m.status ? " [" + m.status + "]" : "";
      parts.push("- " + text + status);
    });
  }

  return parts.join("\n");
}
```

**Indexed text:**

```markdown
# Patient: John Michael Smith
**DOB:** 1990-05-15
**Gender:** male
**Location:** Portland, OR

## Active Conditions
- Type 2 Diabetes Mellitus (onset: 2018-03-01)
- Essential Hypertension (onset: 2020-07-15)

## Current Medications
- Metformin 500mg [active]
- Lisinopril 10mg [active]
```

### Generic Key-Value Flattener

Don't know your schema yet? This script handles any JSON by flattening all fields into readable text:

```javascript
export default function extract(doc) {
  var parts = [];

  function flatten(obj, prefix) {
    Object.keys(obj).forEach(function(key) {
      var val = obj[key];
      var label = prefix ? prefix + "." + key : key;

      if (val === null || val === undefined) return;

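      if (Array.isArray(val)) {
        // Arrays containing objects are flattened per element (label.0, label.1, ...);
        // joining them directly would print "[object Object]".
        var hasObjects = val.some(function(item) {
          return item !== null && typeof item === "object";
        });
        if (hasObjects) {
          flatten(val, label);
        } else {
          parts.push(label + ": " + val.join(", "));
        }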
      } else if (typeof val === "object") {
        flatten(val, label);
      } else {
        parts.push(label + ": " + String(val));
      }
    });
  }

  flatten(doc, "");
  return parts.join("\n");
}
```

This works for any JSON structure. Good for prototyping before you write a specialized parser.

## Validating Scripts Programmatically

Before uploading a script, you can validate it via the API to catch syntax errors and structural problems. Captain runs your script in the same sandboxed V8 engine that `json_handler` uses at indexing time, so a script that validates will work when you index files.

**Endpoint:** `POST /v2/parsing-scripts/validate`

**Content type:** `multipart/form-data`. Upload your `.js` file under the `file` field.

Validation does NOT run your script against real data. It only confirms that:

1. The code is syntactically valid JavaScript
2. It exports a default function

The actual execution against your JSON files happens at indexing time in `json_handler`, which enforces the return-type contract (must be a string) on real data.

```bash title="curl"
curl -X POST https://api.runcaptain.com/v2/parsing-scripts/validate \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -F "file=@./my-parser.js"
```

```python title="Python"
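import os

import requests

# Assumes the key is in the same environment variable the curl example uses
api_key = os.environ["CAPTAIN_API_KEY"]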

with open("my-parser.js", "rb") as f:
    response = requests.post(
        "https://api.runcaptain.com/v2/parsing-scripts/validate",
        headers={
            "Authorization": f"Bearer {api_key}",
        },
        files={"file": ("my-parser.js", f, "application/javascript")},
    )

result = response.json()
if result["valid"]:
    print("Script is valid")
else:
    print(f"{result['error_type']}: {result['error']}")
```

```typescript title="TypeScript"
import { readFileSync } from "fs";

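// Assumes the key is in the same environment variable the curl example uses
const apiKey = process.env.CAPTAIN_API_KEY;
const scriptContent = readFileSync("./my-parser.js");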
const formData = new FormData();
formData.append(
  "file",
  new Blob([scriptContent], { type: "application/javascript" }),
  "my-parser.js"
);

const response = await fetch(
  "https://api.runcaptain.com/v2/parsing-scripts/validate",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
    },
    body: formData,
  }
);
const result = await response.json();
if (result.valid) {
  console.log("Script is valid");
} else {
  console.log(`${result.error_type}: ${result.error}`);
}
```

**Valid response:**

```json
{
  "valid": true,
  "error": null,
  "error_type": null
}
```

**Invalid response:**

```json
{
  "valid": false,
  "error": "Unexpected token '}' at line 3",
  "error_type": "syntax_error"
}
```

**Error types:**

| `error_type`   | Meaning                                                          |
| -------------- | ---------------------------------------------------------------- |
| `syntax_error` | JavaScript fails to parse (missing braces, invalid tokens, etc.) |
| `no_export`    | Script does not export a default function                        |
| `timeout`      | Script exceeded the validation execution time limit              |

The endpoint returns HTTP 200 for both valid and invalid scripts. `valid: false` is a normal result, not an error. HTTP 4xx/5xx codes are reserved for auth failures and malformed requests.
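
Because `valid: false` arrives with HTTP 200, callers need to inspect the response body rather than the status code. A minimal sketch of doing so: `describeValidation` is a hypothetical helper, and the hint texts are suggestions rather than messages returned by Captain. Only the field names and `error_type` values come from the tables above.

```javascript
// Hypothetical helper: map a validation response body to a readable message.
function describeValidation(result) {
  if (result.valid) return "Script is valid";

  var hints = {
    syntax_error: "fix the JavaScript syntax",
    no_export: "add a default-exported function",
    timeout: "simplify the script"
  };
  // Guard against unknown or inherited keys before looking up a hint.
  var known = Object.prototype.hasOwnProperty.call(hints, result.error_type);
  var hint = known ? hints[result.error_type] : "see the error message";
  return result.error_type + ": " + result.error + " (" + hint + ")";
}

console.log(describeValidation({
  valid: false,
  error: "Unexpected token '}' at line 3",
  error_type: "syntax_error"
}));
// syntax_error: Unexpected token '}' at line 3 (fix the JavaScript syntax)
```

In a CI job, a non-valid result would typically be turned into a non-zero exit code so the pipeline stops before upload.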

Use this endpoint in CI/CD pipelines to catch bad scripts before they get uploaded, or in your own tools before calling the S3 upload proxy.

## Using Parsers with the API

Add the `parsing_script` parameter to any indexing endpoint:

```bash title="curl"
curl -X POST https://api.runcaptain.com/v2/collections/research-papers/index/s3 \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "my-research-data",
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "processing_type": "basic",
    "parsing_script": "research/paper-parser"
  }'
```

```python title="Python"
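import os

import requests

# Assumes the key is in the same environment variable the curl example uses
api_key = os.environ["CAPTAIN_API_KEY"]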

response = requests.post(
    "https://api.runcaptain.com/v2/collections/research-papers/index/s3",
    headers={
        "Authorization": f"Bearer {api_key}",
    },
    json={
        "bucket_name": "my-research-data",
        "aws_access_key_id": "...",
        "aws_secret_access_key": "...",
        "processing_type": "basic",
        "parsing_script": "research/paper-parser",
    },
)
```

```typescript title="TypeScript"
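// Assumes the key is in the same environment variable the curl example uses
const apiKey = process.env.CAPTAIN_API_KEY;

const response = await fetch(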
  "https://api.runcaptain.com/v2/collections/research-papers/index/s3",
  {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      bucket_name: "my-research-data",
      aws_access_key_id: "...",
      aws_secret_access_key: "...",
      processing_type: "basic",
      parsing_script: "research/paper-parser",
    }),
  }
);
```

The `parsing_script` parameter is the relative path to your script. Captain resolves it to your org's script storage automatically:

```
"research/paper-parser"  →  research/paper-parser.js
"legacy-parser"          →  legacy-parser.js
"healthcare/fhir-bundle" →  healthcare/fhir-bundle.js
```

You can include or omit the `.js` extension. Both work.
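
The mapping above amounts to appending `.js` when it is missing. The sketch below only illustrates that rule; `resolveScriptPath` is a hypothetical name, and Captain's actual resolution logic (case handling, path normalization) is not documented here.

```javascript
// Hypothetical illustration: append ".js" unless the reference already ends with it.
function resolveScriptPath(ref) {
  return /\.js$/.test(ref) ? ref : ref + ".js";
}

console.log(resolveScriptPath("research/paper-parser")); // "research/paper-parser.js"
console.log(resolveScriptPath("legacy-parser.js"));      // "legacy-parser.js"
```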

### Without `parsing_script`

If you don't include `parsing_script`, JSON files are indexed as raw text. This is the existing behavior and it's unchanged. Parsers are opt-in.

### Supported Endpoints

`parsing_script` works on all indexing endpoints:

| Endpoint                                              | Description          |
| ----------------------------------------------------- | -------------------- |
| `POST /v2/collections/{name}/index/s3`                | S3 bucket            |
| `POST /v2/collections/{name}/index/s3/file`           | Single S3 file       |
| `POST /v2/collections/{name}/index/s3/directory`      | S3 directory         |
| `POST /v2/collections/{name}/index/gcs`               | Google Cloud Storage |
| `POST /v2/collections/{name}/index/gcs/file`          | Single GCS file      |
| `POST /v2/collections/{name}/index/gcs/directory`     | GCS directory        |
| `POST /v2/collections/{name}/index/azure`             | Azure Blob Storage   |
| `POST /v2/collections/{name}/index/azure/file`        | Single Azure blob    |
| `POST /v2/collections/{name}/index/azure/directory`   | Azure directory      |
| `POST /v2/collections/{name}/index/r2`                | Cloudflare R2        |
| `POST /v2/collections/{name}/index/r2/file`           | Single R2 file       |
| `POST /v2/collections/{name}/index/r2/directory`      | R2 directory         |
| `POST /v2/collections/{name}/index/url`               | Public URL           |

## Tips for Writing Good Parsers

**1. Return markdown, not plain text.** Headers, bold, and lists help Captain understand document structure and improve search relevance.

**2. Put the most important content first.** Title, key identifiers, and summary should come before detailed content. Search results show previews from the beginning of each chunk.

**3. Skip noise.** Internal IDs, timestamps, and system fields don't help search. Only extract what a human would search for.

```javascript
// BAD: includes everything
return JSON.stringify(doc, null, 2);

// GOOD: extracts what matters
return "# " + doc.title + "\n\n" + doc.content;
```

**4. Handle missing fields.** Not every document has every field. Use guards:

```javascript
if (doc.authors && doc.authors.length > 0) {
  md += "**Authors:** " + doc.authors.join(", ") + "\n\n";
}
```
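
When a script has many optional fields, the guards can be folded into one small helper. A sketch under the assumption that empty strings and empty arrays should be skipped; `field` and `renderPaper` are hypothetical names:

```javascript
// Hypothetical helper: render "**Label:** value" only when the value is present.
function field(label, value) {
  if (value === null || value === undefined || value === "") return "";
  if (Array.isArray(value)) {
    if (value.length === 0) return "";
    value = value.join(", ");
  }
  return "**" + label + ":** " + value + "\n\n";
}

function renderPaper(doc) {
  var md = doc.title ? "# " + doc.title + "\n\n" : "";
  md += field("Authors", doc.authors);
  md += field("Year", doc.metadata && doc.metadata.year);
  md += doc.content || "";
  return md;
}

// In a parsing script: export default function extract(doc) { return renderPaper(doc); }
console.log(renderPaper({ title: "Untitled draft", authors: [], content: "Body text." }));
```

With the sample document, the empty `authors` array and the missing `metadata` object are skipped, leaving only the title and body.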

**5. Use `var`, not `let`/`const`.** The V8 sandbox uses a JavaScript version that works best with `var` declarations. Avoid optional chaining (`?.`) and use explicit null checks instead.

```javascript
// Avoid:
var doi = doc.metadata?.doi;

// Use:
var doi = doc.metadata && doc.metadata.doi;
```
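
For deeper paths, chained `&&` checks get long. One alternative is a small path-walking helper written in the same `var`-only style. This is a sketch; `getPath` is a hypothetical name, not something the sandbox provides.

```javascript
// Hypothetical helper: walk a list of keys and return a fallback when any
// step is missing, without optional chaining.
function getPath(obj, keys, fallback) {
  var current = obj;
  for (var i = 0; i < keys.length; i++) {
    if (current === null || current === undefined) return fallback;
    current = current[keys[i]];
  }
  return current === null || current === undefined ? fallback : current;
}

var sample = { metadata: { doi: "10.48550/arXiv.1706.03762" } };
console.log(getPath(sample, ["metadata", "doi"], ""));        // "10.48550/arXiv.1706.03762"
console.log(getPath(sample, ["metadata", "journal"], "n/a")); // "n/a"
console.log(getPath(sample, ["stats", "citations"], 0));      // 0
```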

**6. Test in the Parser Studio.** The browser-based test runner gives instant feedback. Write your script, paste sample JSON, click Run, and see the output immediately.

## Limits

| Limit                          | Value                                                                       |
| ------------------------------ | --------------------------------------------------------------------------- |
| Max JSON file size             | 500 MB                                                                      |
| Max JSON for script processing | 50 MB                                                                       |
| Script execution timeout       | 10 seconds per file                                                         |
| Script memory limit            | 64 MB heap                                                                  |
| Allowed in sandbox             | Pure JS: objects, arrays, strings, Math, Date, RegExp, JSON                 |
| NOT allowed                    | `require`, `import`, `fetch`, `fs`, `process`, `setTimeout`, network access |
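
The two size limits can be checked before a file is ever submitted for indexing. The sketch below is a client-side pre-flight check, not something Captain provides: `classifySize` is a hypothetical helper, and it assumes 1 MB means 1024 × 1024 bytes and that a file exactly at a limit is accepted, neither of which is stated above.

```javascript
// Limits from the table above, assuming 1 MB = 1024 * 1024 bytes.
var MAX_FILE_BYTES = 500 * 1024 * 1024;
var MAX_SCRIPT_BYTES = 50 * 1024 * 1024;

// Hypothetical pre-flight check on a JSON file's size in bytes.
function classifySize(byteLength) {
  if (byteLength > MAX_FILE_BYTES) return "exceeds max JSON file size";
  if (byteLength > MAX_SCRIPT_BYTES) return "exceeds script processing limit";
  return "within limits";
}

console.log(classifySize(12 * 1024 * 1024)); // "within limits"
console.log(classifySize(80 * 1024 * 1024)); // "exceeds script processing limit"
```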

## Error Handling

If your parsing script fails, that specific file is marked as failed in the indexing job. Other files in the same batch continue processing normally.

Common errors:

| Error                                   | Cause                                            | Fix                                                                 |
| --------------------------------------- | ------------------------------------------------ | ------------------------------------------------------------------- |
| "Script must export a default function" | Missing `export default function`                | Add `export default function extract(doc) { ... }`                  |
| "Must return a string"                  | Returned an object or number instead of a string | Make sure your function returns a string, e.g. `return doc.content` |
| "Parsing script returned empty text"    | Script returned `""`                             | Check your field paths match the actual JSON                        |
| "Script exceeded 10s time limit"        | Infinite loop or very complex processing         | Simplify your script logic                                          |
| "Script exceeded 64MB memory limit"     | Creating very large intermediate objects         | Avoid building huge arrays or strings in memory                     |
| "Invalid JSON"                          | The file isn't valid JSON                        | Check the file encoding and format                                  |

Check the job status for per-file error details:

```bash
curl https://api.runcaptain.com/v2/jobs/{job_id} \
  -H "Authorization: Bearer $CAPTAIN_API_KEY"
```
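
If you would rather index a degraded result than have a file marked as failed, a script can wrap its main logic in a fallback. This is one possible pattern, sketched with hypothetical names (`withFallback`, `extractSafe`). It assumes `try`/`catch` behaves in the sandbox as it does in standard JavaScript.

```javascript
// Hypothetical wrapper: run a parser and fall back to a second one when the
// first throws, returns a non-string, or returns only whitespace.
function withFallback(primary, fallback) {
  return function (doc) {
    var text;
    try {
      text = primary(doc);
    } catch (err) {
      text = "";
    }
    if (typeof text !== "string" || text.trim() === "") {
      return fallback(doc);
    }
    return text;
  };
}

var extractSafe = withFallback(
  function (doc) { return "# " + doc.title.toUpperCase(); }, // throws if title is missing
  function (doc) { return JSON.stringify(doc); }
);

// In a parsing script: export default function extract(doc) { return extractSafe(doc); }
console.log(extractSafe({ title: "Report" }));  // "# REPORT"
console.log(extractSafe({ name: "No title" })); // {"name":"No title"}
```

The trade-off is visibility: files that hit the fallback no longer show up as failed in the job status, so schema problems are easier to miss.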

## FAQ

**Can I use the same parser across multiple collections?**
Yes. Parsers are org-scoped, not collection-scoped. Any indexing job in your org can reference any parser.

**What happens to non-JSON files when I set `parsing_script`?**
Nothing. PDFs, DOCX, images, and other files are processed normally. The parser only affects `.json` files.

**Can I use TypeScript?**
Not yet. Write plain JavaScript. TypeScript support would require shipping a compiler to the sandbox.

**Can my script make API calls or read files?**
No. The sandbox is completely isolated. No network access, no filesystem, no `require`/`import`. Your script receives the JSON object and returns text. That's it.

**Can I use `async`/`await`?**
No. The sandbox runs synchronous JavaScript only. No Promises, no `setTimeout`, no async patterns.

**How do I update a parser?**
Upload a new version with the same path. The next indexing job that references it will use the updated script. There's no versioning. The latest upload is what runs.