Parsers (BETA)

What are Parsers?

Parsers let you tell Captain how to read your JSON files. Instead of indexing raw JSON as text, you write a short JavaScript function that extracts the content you actually want to search over.

Without a parser, Captain indexes your JSON as-is:

{"title":"Attention Is All You Need","authors":["Vaswani","Shazeer"],"content":"We propose a new..."}

With a parser, Captain indexes clean, searchable text:

# Attention Is All You Need

**Authors:** Vaswani, Shazeer

We propose a new simple network architecture, the Transformer...

Same data. Much better search results.

How It Works

You write a parsing script (JavaScript)
|
v
Upload it via Parser Studio (/studio/parsers)
|
v
Start an indexing job with parsing_script parameter
|
v
For each .json file:
1. Captain downloads the file
2. Parses it as JSON
3. Runs your script in a sandboxed V8 engine
4. Your script returns a string
5. That text gets chunked, embedded, and indexed
|
v
Search works over the extracted text

Other file types (PDF, DOCX, images, etc.) are completely unaffected. The parser only applies to .json files.

Writing Your First Parser

A parsing script is a JavaScript function that receives your JSON object and returns a string:

export default function extract(doc) {
  return doc.content;
}

That’s it. doc is your parsed JSON. Return the string you want Captain to index.

Formatting the text

The text you return becomes the searchable content. You can format it however you want. Markdown works great because it preserves structure:

export default function extract(doc) {
  var md = "";

  if (doc.title) {
    md += "# " + doc.title + "\n\n";
  }

  if (doc.authors) {
    md += "**Authors:** " + doc.authors.join(", ") + "\n\n";
  }

  if (doc.abstract) {
    md += doc.abstract;
  }

  return md;
}

What to return

Your function must return a string. That string becomes the searchable content.

// GOOD: return a string
return doc.content;
return "# " + doc.title + "\n\n" + doc.content;
return parts.join("\n");

What does NOT work:

// BAD: returning an object
return { text: doc.content };

// BAD: returning a number
return 42;
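When a document might be missing the field you expect, one defensive pattern is to type-check before returning so the script never returns a non-string. The field names below (content, title) are assumptions for illustration, and the function is shown without the export default wrapper so it can run standalone:

```javascript
// Sketch: always return a string, falling back through assumed fields.
function extract(doc) {
  if (typeof doc.content === "string" && doc.content.length > 0) {
    return doc.content;
  }
  if (typeof doc.title === "string") {
    return doc.title;
  }
  return "";
}

console.log(extract({ content: "body text" }));  // body text
console.log(extract({ title: "Only a title" })); // Only a title
```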

Examples

Research Papers

Input JSON:

{
  "title": "Attention Is All You Need",
  "authors": ["Vaswani", "Shazeer", "Parmar", "Uszkoreit", "Jones"],
  "content": "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.",
  "metadata": {
    "doi": "10.48550/arXiv.1706.03762",
    "year": 2017,
    "journal": "NeurIPS"
  }
}

Parsing script:

export default function extract(doc) {
  var md = "";

  if (doc.title) {
    md += "# " + doc.title + "\n\n";
  }

  if (doc.authors) {
    var authorList = Array.isArray(doc.authors)
      ? doc.authors.join(", ")
      : doc.authors;
    md += "**Authors:** " + authorList + "\n\n";
  }

  if (doc.metadata && doc.metadata.doi) {
    md += "**DOI:** " + doc.metadata.doi + "\n\n";
  }

  if (doc.content) {
    md += doc.content;
  }

  return md;
}

Indexed text:

# Attention Is All You Need

**Authors:** Vaswani, Shazeer, Parmar, Uszkoreit, Jones

**DOI:** 10.48550/arXiv.1706.03762

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Product Catalog

Input JSON:

{
  "sku": "WH-1000XM5",
  "name": "Sony WH-1000XM5 Wireless Headphones",
  "category": "Electronics > Audio > Headphones",
  "price": 349.99,
  "description": "Industry-leading noise cancellation with Auto NC Optimizer.",
  "specs": {
    "driver_size": "30mm",
    "battery_life": "30 hours",
    "weight": "250g",
    "connectivity": ["Bluetooth 5.2", "3.5mm", "USB-C"]
  },
  "reviews_summary": "Excellent ANC, comfortable fit, premium sound. Some users note the folding mechanism changed from XM4."
}

Parsing script:

export default function extract(doc) {
  var parts = [];

  parts.push("# " + doc.name);
  parts.push("**SKU:** " + doc.sku);
  parts.push("**Category:** " + doc.category);
  parts.push("**Price:** $" + doc.price);

  if (doc.description) {
    parts.push("\n" + doc.description);
  }

  if (doc.specs) {
    parts.push("\n## Specifications");
    Object.keys(doc.specs).forEach(function(key) {
      var val = doc.specs[key];
      var label = key.replace(/_/g, " ");
      if (Array.isArray(val)) {
        parts.push("- **" + label + ":** " + val.join(", "));
      } else {
        parts.push("- **" + label + ":** " + val);
      }
    });
  }

  if (doc.reviews_summary) {
    parts.push("\n## Customer Reviews");
    parts.push(doc.reviews_summary);
  }

  return parts.join("\n");
}

Indexed text:

# Sony WH-1000XM5 Wireless Headphones
**SKU:** WH-1000XM5
**Category:** Electronics > Audio > Headphones
**Price:** $349.99

Industry-leading noise cancellation with Auto NC Optimizer.

## Specifications
- **driver size:** 30mm
- **battery life:** 30 hours
- **weight:** 250g
- **connectivity:** Bluetooth 5.2, 3.5mm, USB-C

## Customer Reviews
Excellent ANC, comfortable fit, premium sound. Some users note the folding mechanism changed from XM4.

Healthcare Records (FHIR)

Input JSON:

{
  "resourceType": "Patient",
  "id": "example-001",
  "name": [{ "family": "Smith", "given": ["John", "Michael"] }],
  "birthDate": "1990-05-15",
  "gender": "male",
  "address": [{ "city": "Portland", "state": "OR" }],
  "condition": [
    { "code": { "text": "Type 2 Diabetes Mellitus" }, "onsetDateTime": "2018-03-01" },
    { "code": { "text": "Essential Hypertension" }, "onsetDateTime": "2020-07-15" }
  ],
  "medication": [
    { "code": { "text": "Metformin 500mg" }, "status": "active" },
    { "code": { "text": "Lisinopril 10mg" }, "status": "active" }
  ]
}

Parsing script:

export default function extract(doc) {
  var parts = [];

  // Patient name
  if (doc.name && doc.name[0]) {
    var n = doc.name[0];
    var fullName = (n.given || []).join(" ") + " " + (n.family || "");
    parts.push("# Patient: " + fullName.trim());
  }

  parts.push("**DOB:** " + (doc.birthDate || "Unknown"));
  parts.push("**Gender:** " + (doc.gender || "Unknown"));

  if (doc.address && doc.address[0]) {
    var addr = doc.address[0];
    parts.push("**Location:** " + [addr.city, addr.state].filter(Boolean).join(", "));
  }

  // Conditions
  if (doc.condition && doc.condition.length > 0) {
    parts.push("\n## Active Conditions");
    doc.condition.forEach(function(c) {
      var text = c.code && c.code.text ? c.code.text : "Unknown condition";
      var onset = c.onsetDateTime ? " (onset: " + c.onsetDateTime + ")" : "";
      parts.push("- " + text + onset);
    });
  }

  // Medications
  if (doc.medication && doc.medication.length > 0) {
    parts.push("\n## Current Medications");
    doc.medication.forEach(function(m) {
      var text = m.code && m.code.text ? m.code.text : "Unknown medication";
      var status = m.status ? " [" + m.status + "]" : "";
      parts.push("- " + text + status);
    });
  }

  return parts.join("\n");
}

Indexed text:

# Patient: John Michael Smith
**DOB:** 1990-05-15
**Gender:** male
**Location:** Portland, OR

## Active Conditions
- Type 2 Diabetes Mellitus (onset: 2018-03-01)
- Essential Hypertension (onset: 2020-07-15)

## Current Medications
- Metformin 500mg [active]
- Lisinopril 10mg [active]

Generic Key-Value Flattener

Don’t know your schema yet? This script handles any JSON by flattening all fields into readable text:

export default function extract(doc) {
  var parts = [];

  function flatten(obj, prefix) {
    Object.keys(obj).forEach(function(key) {
      var val = obj[key];
      var label = prefix ? prefix + "." + key : key;

      if (val === null || val === undefined) return;

      if (Array.isArray(val)) {
        parts.push(label + ": " + val.join(", "));
      } else if (typeof val === "object") {
        flatten(val, label);
      } else {
        parts.push(label + ": " + String(val));
      }
    });
  }

  flatten(doc, "");
  return parts.join("\n");
}

This works for any JSON structure. Good for prototyping before you write a specialized parser.
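To show what the flattener produces, here is the same function restated (without the export default wrapper so it runs standalone) and applied to a small made-up record; every leaf value becomes one dotted-path line:

```javascript
// The generic flattener, applied to a sample document.
function extract(doc) {
  var parts = [];

  function flatten(obj, prefix) {
    Object.keys(obj).forEach(function(key) {
      var val = obj[key];
      var label = prefix ? prefix + "." + key : key;

      if (val === null || val === undefined) return;

      if (Array.isArray(val)) {
        parts.push(label + ": " + val.join(", "));
      } else if (typeof val === "object") {
        flatten(val, label);
      } else {
        parts.push(label + ": " + String(val));
      }
    });
  }

  flatten(doc, "");
  return parts.join("\n");
}

// A made-up product record:
var sample = {
  sku: "A-1",
  specs: { weight: "250g", ports: ["USB-C", "3.5mm"] }
};

console.log(extract(sample));
// sku: A-1
// specs.weight: 250g
// specs.ports: USB-C, 3.5mm
```

Nested objects contribute their path as a prefix (specs.weight), and arrays are joined with commas rather than recursed into.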

Using Parsers with the API

Add the parsing_script parameter to any indexing endpoint:

curl -X POST https://api.runcaptain.com/v2/collections/research-papers/index/s3 \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Organization-ID: $ORG_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "my-research-data",
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "processing_type": "basic",
    "parsing_script": "research/paper-parser"
  }'

The parsing_script parameter is the relative path to your script. Captain resolves it to your org’s script storage automatically:

"research/paper-parser" → research/paper-parser.js
"legacy-parser" → legacy-parser.js
"healthcare/fhir-bundle" → healthcare/fhir-bundle.js

You can include or omit the .js extension. Both work.
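As an illustration only (a hypothetical helper, not Captain's actual implementation), the resolution rule amounts to appending .js when it is absent:

```javascript
// Hypothetical sketch of the path-resolution rule: append ".js" if missing.
function resolveScriptPath(path) {
  return path.slice(-3) === ".js" ? path : path + ".js";
}

console.log(resolveScriptPath("research/paper-parser")); // research/paper-parser.js
console.log(resolveScriptPath("legacy-parser.js"));      // legacy-parser.js
```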

Without parsing_script

If you don’t include parsing_script, JSON files are indexed as raw text. This is the existing behavior and it’s unchanged. Parsers are opt-in.

Supported Endpoints

parsing_script works on all indexing endpoints:

Endpoint                                             Description
POST /v2/collections/{name}/index/s3                 S3 bucket
POST /v2/collections/{name}/index/s3/file            Single S3 file
POST /v2/collections/{name}/index/s3/directory       S3 directory
POST /v2/collections/{name}/index/gcs                Google Cloud Storage
POST /v2/collections/{name}/index/gcs/file           Single GCS file
POST /v2/collections/{name}/index/gcs/directory      GCS directory
POST /v2/collections/{name}/index/azure              Azure Blob Storage
POST /v2/collections/{name}/index/azure/file         Single Azure blob
POST /v2/collections/{name}/index/azure/directory    Azure directory
POST /v2/collections/{name}/index/r2                 Cloudflare R2
POST /v2/collections/{name}/index/r2/file            Single R2 file
POST /v2/collections/{name}/index/r2/directory       R2 directory
POST /v2/collections/{name}/index/url                Public URL

Tips for Writing Good Parsers

1. Return markdown, not plain text. Headers, bold, and lists help Captain understand document structure and improve search relevance.

2. Put the most important content first. Title, key identifiers, and summary should come before detailed content. Search results show previews from the beginning of each chunk.

3. Skip noise. Internal IDs, timestamps, and system fields don’t help search. Only extract what a human would search for.

// BAD: includes everything
return JSON.stringify(doc, null, 2);

// GOOD: extracts what matters
return "# " + doc.title + "\n\n" + doc.content;

4. Handle missing fields. Not every document has every field. Use guards:

if (doc.authors && doc.authors.length > 0) {
  md += "**Authors:** " + doc.authors.join(", ") + "\n\n";
}

5. Use var, not let/const. The V8 sandbox uses a JavaScript version that works best with var declarations. Avoid optional chaining (?.) and use explicit null checks instead.

// Avoid:
var doi = doc.metadata?.doi;

// Use:
var doi = doc.metadata && doc.metadata.doi;

6. Test in the Parser Studio. The browser-based test runner gives instant feedback. Write your script, paste sample JSON, click Run, and see the output immediately.

Limits

Limit                             Value
Max JSON file size                500 MB
Max JSON for script processing    50 MB
Script execution timeout          10 seconds per file
Script memory limit               64 MB heap
Allowed in sandbox                Pure JS: objects, arrays, strings, Math, Date, RegExp, JSON
NOT allowed                       require, import, fetch, fs, process, setTimeout, network access

Error Handling

If your parsing script fails, that specific file is marked as failed in the indexing job. Other files in the same batch continue processing normally.

Common errors:

Error                                      Cause                                        Fix
"Script must export a default function"    Missing export default function              Add export default function extract(doc) { ... }
"Must return a string"                     Returned an object or number                 Make sure your function returns a string, e.g. return doc.content
"Parsing script returned empty text"       Script returned ""                           Check your field paths match the actual JSON
"Script exceeded 10s time limit"           Infinite loop or very complex processing     Simplify your script logic
"Script exceeded 64MB memory limit"        Creating very large intermediate objects     Avoid building huge arrays or strings in memory
"Invalid JSON"                             The file isn't valid JSON                    Check the file encoding and format

Check the job status for per-file error details:

curl https://api.runcaptain.com/v2/jobs/{job_id} \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Organization-ID: $ORG_ID"

FAQ

Can I use the same parser across multiple collections? Yes. Parsers are org-scoped, not collection-scoped. Any indexing job in your org can reference any parser.

What happens to non-JSON files when I set parsing_script? Nothing. PDFs, DOCX, images, and other files are processed normally. The parser only affects .json files.

Can I use TypeScript? Not yet. Write plain JavaScript. TypeScript support would require shipping a compiler to the sandbox.

Can my script make API calls or read files? No. The sandbox is completely isolated. No network access, no filesystem, no require/import. Your script receives the JSON object and returns text. That’s it.

Can I use async/await? No. The sandbox runs synchronous JavaScript only. No Promises, no setTimeout, no async patterns.

How do I update a parser? Upload a new version with the same path. The next indexing job that references it will use the updated script. There’s no versioning. The latest upload is what runs.