Parsers (BETA)

What are Parsers?

Parsers let you tell Captain how to read your JSON files. Instead of indexing raw JSON as text, you write a short JavaScript function that extracts the content you actually want to search over.

Without a parser, Captain indexes your JSON as-is:

{"title":"Attention Is All You Need","authors":["Vaswani","Shazeer"],"content":"We propose a new..."}

With a parser, Captain indexes clean, searchable text:

# Attention Is All You Need

**Authors:** Vaswani, Shazeer

We propose a new simple network architecture, the Transformer...

Same data. Much better search results.

How It Works

You write a parsing script (JavaScript)
|
v
Upload it via Parser Studio (/studio/parsers)
|
v
Start an indexing job with parsing_script parameter
|
v
For each .json file:
1. Captain downloads the file
2. Parses it as JSON
3. Runs your script in a sandboxed V8 engine
4. Your script returns a string
5. That text gets chunked, embedded, and indexed
|
v
Search works over the extracted text

Other file types (PDF, DOCX, images, etc.) are completely unaffected. The parser only applies to .json files.

Writing Your First Parser

A parsing script is a JavaScript function that receives your JSON object and returns a string:

export default function extract(doc) {
  return doc.content;
}

That’s it. doc is your parsed JSON. Return the string you want Captain to index.

Formatting the text

The text you return becomes the searchable content. You can format it however you want. Markdown works great because it preserves structure:

export default function extract(doc) {
  var md = "";

  if (doc.title) {
    md += "# " + doc.title + "\n\n";
  }

  if (doc.authors) {
    md += "**Authors:** " + doc.authors.join(", ") + "\n\n";
  }

  if (doc.abstract) {
    md += doc.abstract;
  }

  return md;
}

What to return

Your function must return a string. That string becomes the searchable content.

// GOOD: return a string
return doc.content;
return "# " + doc.title + "\n\n" + doc.content;
return parts.join("\n");

What does NOT work:

// BAD: returning an object
return { text: doc.content };

// BAD: returning a number
return 42;
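When a document might be missing the field you expect, one defensive pattern is to type-check before returning so the script never returns a non-string. The field names below (content, title) are assumptions for illustration, and the function is shown without the export default wrapper so it can run standalone:

```javascript
// Sketch: always return a string, falling back through assumed fields.
function extract(doc) {
  if (typeof doc.content === "string" && doc.content.length > 0) {
    return doc.content;
  }
  if (typeof doc.title === "string") {
    return doc.title;
  }
  return "";
}

console.log(extract({ content: "body text" }));  // body text
console.log(extract({ title: "Only a title" })); // Only a title
```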

Examples

Research Papers

Input JSON:

{
  "title": "Attention Is All You Need",
  "authors": ["Vaswani", "Shazeer", "Parmar", "Uszkoreit", "Jones"],
  "content": "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.",
  "metadata": {
    "doi": "10.48550/arXiv.1706.03762",
    "year": 2017,
    "journal": "NeurIPS"
  }
}

Parsing script:

export default function extract(doc) {
  var md = "";

  if (doc.title) {
    md += "# " + doc.title + "\n\n";
  }

  if (doc.authors) {
    var authorList = Array.isArray(doc.authors)
      ? doc.authors.join(", ")
      : doc.authors;
    md += "**Authors:** " + authorList + "\n\n";
  }

  if (doc.metadata && doc.metadata.doi) {
    md += "**DOI:** " + doc.metadata.doi + "\n\n";
  }

  if (doc.content) {
    md += doc.content;
  }

  return md;
}

Indexed text:

# Attention Is All You Need

**Authors:** Vaswani, Shazeer, Parmar, Uszkoreit, Jones

**DOI:** 10.48550/arXiv.1706.03762

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Product Catalog

Input JSON:

{
  "sku": "WH-1000XM5",
  "name": "Sony WH-1000XM5 Wireless Headphones",
  "category": "Electronics > Audio > Headphones",
  "price": 349.99,
  "description": "Industry-leading noise cancellation with Auto NC Optimizer.",
  "specs": {
    "driver_size": "30mm",
    "battery_life": "30 hours",
    "weight": "250g",
    "connectivity": ["Bluetooth 5.2", "3.5mm", "USB-C"]
  },
  "reviews_summary": "Excellent ANC, comfortable fit, premium sound. Some users note the folding mechanism changed from XM4."
}

Parsing script:

export default function extract(doc) {
  var parts = [];

  parts.push("# " + doc.name);
  parts.push("**SKU:** " + doc.sku);
  parts.push("**Category:** " + doc.category);
  parts.push("**Price:** $" + doc.price);

  if (doc.description) {
    parts.push("\n" + doc.description);
  }

  if (doc.specs) {
    parts.push("\n## Specifications");
    Object.keys(doc.specs).forEach(function(key) {
      var val = doc.specs[key];
      var label = key.replace(/_/g, " ");
      if (Array.isArray(val)) {
        parts.push("- **" + label + ":** " + val.join(", "));
      } else {
        parts.push("- **" + label + ":** " + val);
      }
    });
  }

  if (doc.reviews_summary) {
    parts.push("\n## Customer Reviews");
    parts.push(doc.reviews_summary);
  }

  return parts.join("\n");
}

Indexed text:

# Sony WH-1000XM5 Wireless Headphones
**SKU:** WH-1000XM5
**Category:** Electronics > Audio > Headphones
**Price:** $349.99

Industry-leading noise cancellation with Auto NC Optimizer.

## Specifications
- **driver size:** 30mm
- **battery life:** 30 hours
- **weight:** 250g
- **connectivity:** Bluetooth 5.2, 3.5mm, USB-C

## Customer Reviews
Excellent ANC, comfortable fit, premium sound. Some users note the folding mechanism changed from XM4.

Healthcare Records (FHIR)

Input JSON:

{
  "resourceType": "Patient",
  "id": "example-001",
  "name": [{ "family": "Smith", "given": ["John", "Michael"] }],
  "birthDate": "1990-05-15",
  "gender": "male",
  "address": [{ "city": "Portland", "state": "OR" }],
  "condition": [
    { "code": { "text": "Type 2 Diabetes Mellitus" }, "onsetDateTime": "2018-03-01" },
    { "code": { "text": "Essential Hypertension" }, "onsetDateTime": "2020-07-15" }
  ],
  "medication": [
    { "code": { "text": "Metformin 500mg" }, "status": "active" },
    { "code": { "text": "Lisinopril 10mg" }, "status": "active" }
  ]
}

Parsing script:

export default function extract(doc) {
  var parts = [];

  // Patient name
  if (doc.name && doc.name[0]) {
    var n = doc.name[0];
    var fullName = (n.given || []).join(" ") + " " + (n.family || "");
    parts.push("# Patient: " + fullName.trim());
  }

  parts.push("**DOB:** " + (doc.birthDate || "Unknown"));
  parts.push("**Gender:** " + (doc.gender || "Unknown"));

  if (doc.address && doc.address[0]) {
    var addr = doc.address[0];
    parts.push("**Location:** " + [addr.city, addr.state].filter(Boolean).join(", "));
  }

  // Conditions
  if (doc.condition && doc.condition.length > 0) {
    parts.push("\n## Active Conditions");
    doc.condition.forEach(function(c) {
      var text = c.code && c.code.text ? c.code.text : "Unknown condition";
      var onset = c.onsetDateTime ? " (onset: " + c.onsetDateTime + ")" : "";
      parts.push("- " + text + onset);
    });
  }

  // Medications
  if (doc.medication && doc.medication.length > 0) {
    parts.push("\n## Current Medications");
    doc.medication.forEach(function(m) {
      var text = m.code && m.code.text ? m.code.text : "Unknown medication";
      var status = m.status ? " [" + m.status + "]" : "";
      parts.push("- " + text + status);
    });
  }

  return parts.join("\n");
}

Indexed text:

# Patient: John Michael Smith
**DOB:** 1990-05-15
**Gender:** male
**Location:** Portland, OR

## Active Conditions
- Type 2 Diabetes Mellitus (onset: 2018-03-01)
- Essential Hypertension (onset: 2020-07-15)

## Current Medications
- Metformin 500mg [active]
- Lisinopril 10mg [active]

Generic Key-Value Flattener

Don’t know your schema yet? This script handles any JSON by flattening all fields into readable text:

export default function extract(doc) {
  var parts = [];

  function flatten(obj, prefix) {
    Object.keys(obj).forEach(function(key) {
      var val = obj[key];
      var label = prefix ? prefix + "." + key : key;

      if (val === null || val === undefined) return;

      if (Array.isArray(val)) {
        parts.push(label + ": " + val.join(", "));
      } else if (typeof val === "object") {
        flatten(val, label);
      } else {
        parts.push(label + ": " + String(val));
      }
    });
  }

  flatten(doc, "");
  return parts.join("\n");
}

This works for any JSON structure. Good for prototyping before you write a specialized parser.
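To show what the flattener produces, here is the same function restated (without the export default wrapper so it runs standalone) and applied to a small made-up record; every leaf value becomes one dotted-path line:

```javascript
// The generic flattener, applied to a sample document.
function extract(doc) {
  var parts = [];

  function flatten(obj, prefix) {
    Object.keys(obj).forEach(function(key) {
      var val = obj[key];
      var label = prefix ? prefix + "." + key : key;

      if (val === null || val === undefined) return;

      if (Array.isArray(val)) {
        parts.push(label + ": " + val.join(", "));
      } else if (typeof val === "object") {
        flatten(val, label);
      } else {
        parts.push(label + ": " + String(val));
      }
    });
  }

  flatten(doc, "");
  return parts.join("\n");
}

// A made-up product record:
var sample = {
  sku: "A-1",
  specs: { weight: "250g", ports: ["USB-C", "3.5mm"] }
};

console.log(extract(sample));
// sku: A-1
// specs.weight: 250g
// specs.ports: USB-C, 3.5mm
```

Nested objects contribute their path as a prefix (specs.weight), and arrays are joined with commas rather than recursed into.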

Using Parsers with the API

Add the parsing_script parameter to any indexing endpoint:

curl -X POST https://api.runcaptain.com/v2/collections/research-papers/index/s3 \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Organization-ID: $ORG_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "bucket_name": "my-research-data",
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "processing_type": "basic",
    "parsing_script": "research/paper-parser"
  }'

The parsing_script parameter is the relative path to your script. Captain resolves it to your org’s script storage automatically:

"research/paper-parser" → research/paper-parser.js
"legacy-parser" → legacy-parser.js
"healthcare/fhir-bundle" → healthcare/fhir-bundle.js

You can include or omit the .js extension. Both work.
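As an illustration only (a hypothetical helper, not Captain's actual implementation), the resolution rule amounts to appending .js when it is absent:

```javascript
// Hypothetical sketch of the path-resolution rule: append ".js" if missing.
function resolveScriptPath(path) {
  return path.slice(-3) === ".js" ? path : path + ".js";
}

console.log(resolveScriptPath("research/paper-parser")); // research/paper-parser.js
console.log(resolveScriptPath("legacy-parser.js"));      // legacy-parser.js
```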

Without parsing_script

If you don’t include parsing_script, JSON files are indexed as raw text. This is the existing behavior and it’s unchanged. Parsers are opt-in.

Supported Endpoints

parsing_script works on all indexing endpoints:

Endpoint                                             Description
POST /v2/collections/{name}/index/s3                 S3 bucket
POST /v2/collections/{name}/index/s3/file            Single S3 file
POST /v2/collections/{name}/index/s3/directory       S3 directory
POST /v2/collections/{name}/index/gcs                Google Cloud Storage
POST /v2/collections/{name}/index/gcs/file           Single GCS file
POST /v2/collections/{name}/index/gcs/directory      GCS directory
POST /v2/collections/{name}/index/azure              Azure Blob Storage
POST /v2/collections/{name}/index/azure/file         Single Azure blob
POST /v2/collections/{name}/index/azure/directory    Azure directory
POST /v2/collections/{name}/index/r2                 Cloudflare R2
POST /v2/collections/{name}/index/r2/file            Single R2 file
POST /v2/collections/{name}/index/r2/directory       R2 directory
POST /v2/collections/{name}/index/url                Public URL

Tips for Writing Good Parsers

1. Return markdown, not plain text. Headers, bold, and lists help Captain understand document structure and improve search relevance.

2. Put the most important content first. Title, key identifiers, and summary should come before detailed content. Search results show previews from the beginning of each chunk.

3. Skip noise. Internal IDs, timestamps, and system fields don’t help search. Only extract what a human would search for.

// BAD: includes everything
return JSON.stringify(doc, null, 2);

// GOOD: extracts what matters
return "# " + doc.title + "\n\n" + doc.content;

4. Handle missing fields. Not every document has every field. Use guards:

if (doc.authors && doc.authors.length > 0) {
  md += "**Authors:** " + doc.authors.join(", ") + "\n\n";
}

5. Use var, not let/const. The V8 sandbox uses a JavaScript version that works best with var declarations. Avoid optional chaining (?.) and use explicit null checks instead.

// Avoid:
var doi = doc.metadata?.doi;

// Use:
var doi = doc.metadata && doc.metadata.doi;

6. Test in the Parser Studio. The browser-based test runner gives instant feedback. Write your script, paste sample JSON, click Run, and see the output immediately.

Limits

Limit                             Value
Max JSON file size                500 MB
Max JSON for script processing    50 MB
Script execution timeout          10 seconds per file
Script memory limit               64 MB heap
Allowed in sandbox                Pure JS: objects, arrays, strings, Math, Date, RegExp, JSON
NOT allowed                       require, import, fetch, fs, process, setTimeout, network access

Error Handling

If your parsing script fails, that specific file is marked as failed in the indexing job. Other files in the same batch continue processing normally.

Common errors:

Error                                      Cause                                        Fix
"Script must export a default function"    Missing export default function              Add export default function extract(doc) { ... }
"Must return a string"                     Returned an object or number                 Make sure your function returns a string, e.g. return doc.content
"Parsing script returned empty text"       Script returned ""                           Check your field paths match the actual JSON
"Script exceeded 10s time limit"           Infinite loop or very complex processing     Simplify your script logic
"Script exceeded 64MB memory limit"        Creating very large intermediate objects     Avoid building huge arrays or strings in memory
"Invalid JSON"                             The file isn't valid JSON                    Check the file encoding and format

Check the job status for per-file error details:

curl https://api.runcaptain.com/v2/jobs/{job_id} \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  -H "X-Organization-ID: $ORG_ID"

FAQ

Can I use the same parser across multiple collections? Yes. Parsers are org-scoped, not collection-scoped. Any indexing job in your org can reference any parser.

What happens to non-JSON files when I set parsing_script? Nothing. PDFs, DOCX, images, and other files are processed normally. The parser only affects .json files.

Can I use TypeScript? Not yet. Write plain JavaScript. TypeScript support would require shipping a compiler to the sandbox.

Can my script make API calls or read files? No. The sandbox is completely isolated. No network access, no filesystem, no require/import. Your script receives the JSON object and returns text. That’s it.

Can I use async/await? No. The sandbox runs synchronous JavaScript only. No Promises, no setTimeout, no async patterns.

How do I update a parser? Upload a new version with the same path. The next indexing job that references it will use the updated script. There’s no versioning. The latest upload is what runs.