Scientific Search | Captain Docs

Scientific Search gives you a deterministic way to find scientific material and retrieve the complete document text as markdown. It does not synthesize an answer. It returns source documents that your application can inspect, store, or index into File Search.

What It Covers

Source	Search result type	Full document fetch
PubMed Central	`type: "pmc"` with `pmcid`	Open-access article body rendered from PMC JATS XML into markdown, including sections, tables, and figure captions
ClinicalTrials.gov	`type: "clinical_trial"` with `nct_id`	Complete study record rendered into markdown

PubMed abstracts and Semantic Scholar papers are not part of this deterministic workflow because they do not guarantee a complete, freely retrievable document for every result. Scientific Search is intentionally narrower: every returned result includes full_text_available: true and a download_url for the full markdown document.

How It Works

1. Search scientific documents with GET /v2/datasets/scientific/search.
2. Read results[].download_url from the response.
3. Fetch the full markdown document from that URL.
4. Index the markdown into a Captain collection if you want retrieval, filtering, relations, or RAG over those scientific documents.

All requests require Authorization: Bearer {api_key}. Include X-Organization-ID only when your key is not already scoped to an organization.
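
The header rule above could be wrapped in a small helper. This is only a sketch: the function name `build_headers` and its `organization_id` parameter are hypothetical, and only the two header names come from this page.

```python
from typing import Optional


def build_headers(api_key: str, organization_id: Optional[str] = None) -> dict:
    """Return request headers for Captain API calls (hypothetical helper)."""
    headers = {"Authorization": f"Bearer {api_key}"}
    # Send X-Organization-ID only when the key is not already scoped to an organization.
    if organization_id:
        headers["X-Organization-ID"] = organization_id
    return headers
```

Callers with an organization-scoped key would simply omit the second argument.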

Search Documents

Use q for the search query. Results are gathered from the fixed source set in a fixed order, so the same query parameters produce the same result shape.

Python

import json
import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"

headers = {"Authorization": f"Bearer {API_KEY}"}

# Search PMC and ClinicalTrials.gov for documents matching the query.
response = requests.get(
    f"{BASE_URL}/v2/datasets/scientific/search",
    headers=headers,
    params={
        "q": "PARP inhibitor BRCA1 breast cancer",
        "limit": 5,
        "recency_years": 10,
    },
    timeout=30.0,
)
# Fail fast on HTTP errors instead of parsing an error body as results.
response.raise_for_status()

print(json.dumps(response.json(), indent=2))

TypeScript

const BASE_URL = "https://api.runcaptain.com";
const API_KEY = "your_api_key";

const params = new URLSearchParams({
  q: "PARP inhibitor BRCA1 breast cancer",
  limit: "5",
  recency_years: "10",
});

const response = await fetch(`${BASE_URL}/v2/datasets/scientific/search?${params}`, {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

// Stop on HTTP errors instead of parsing an error body as results.
if (!response.ok) {
  throw new Error(`Search failed with status ${response.status}`);
}

const data = await response.json();
console.log(JSON.stringify(data, null, 2));

cURL

$ curl -G "https://api.runcaptain.com/v2/datasets/scientific/search" \
>   -H "Authorization: Bearer $CAPTAIN_API_KEY" \
>   --data-urlencode "q=PARP inhibitor BRCA1 breast cancer" \
>   --data-urlencode "limit=5" \
>   --data-urlencode "recency_years=10"

Search Response

{
  "query": "PARP inhibitor BRCA1 breast cancer",
  "results": [
    {
      "type": "pmc",
      "pmcid": "PMC6503629",
      "title": "PARP inhibitors and homologous recombination deficiency in breast cancer",
      "journal": "Cancer Research",
      "year": 2024,
      "full_text_available": true,
      "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/pmc/PMC6503629"
    },
    {
      "type": "clinical_trial",
      "nct_id": "NCT02000622",
      "title": "Olaparib as Adjuvant Treatment in Patients With Germline BRCA Mutated High Risk HER2 Negative Primary Breast Cancer",
      "phase": "PHASE3",
      "status": "ACTIVE_NOT_RECRUITING",
      "full_text_available": true,
      "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/clinicaltrials/NCT02000622"
    }
  ],
  "results_by_source": {
    "pmc": [
      {
        "type": "pmc",
        "pmcid": "PMC6503629",
        "title": "PARP inhibitors and homologous recombination deficiency in breast cancer",
        "journal": "Cancer Research",
        "year": 2024,
        "full_text_available": true,
        "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/pmc/PMC6503629"
      }
    ],
    "clinicaltrials": [
      {
        "type": "clinical_trial",
        "nct_id": "NCT02000622",
        "title": "Olaparib as Adjuvant Treatment in Patients With Germline BRCA Mutated High Risk HER2 Negative Primary Breast Cancer",
        "phase": "PHASE3",
        "status": "ACTIVE_NOT_RECRUITING",
        "full_text_available": true,
        "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/clinicaltrials/NCT02000622"
      }
    ]
  },
  "sources_searched": ["pmc", "clinicaltrials"],
  "total_results": 2,
  "errors": {},
  "execution_time_ms": 842
}
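
One way to consume this response is to collect the download URLs per result type and note which sources reported problems. The sketch below is illustrative: `summarize_search` is a hypothetical helper, and it assumes `errors` is an object keyed by source name, which the example response (an empty object) does not confirm.

```python
def summarize_search(payload: dict) -> dict:
    """Collect download URLs per result type and list sources that reported errors."""
    urls_by_type: dict = {}
    for result in payload.get("results", []):
        # Defensive check: skip any result without a fetchable full document.
        if not result.get("full_text_available") or not result.get("download_url"):
            continue
        urls_by_type.setdefault(result["type"], []).append(result["download_url"])
    return {
        "urls_by_type": urls_by_type,
        # Assumption: "errors" maps a source name to an error description.
        "failed_sources": sorted(payload.get("errors", {})),
        "total_results": payload.get("total_results", 0),
    }
```

Checking `failed_sources` before trusting an empty result list helps distinguish "no matches" from "a source could not be searched".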

Fetch Full Markdown

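Call the download_url from a search result, or construct the URL from the document type and document ID as /v2/datasets/scientific/documents/{doc_type}/{doc_id}. The path uses clinicaltrials for trials, not the clinical_trial value that appears in search results.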
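
If you build the URL yourself, the mapping from a search result to a document path might look like the sketch below. The helper `document_url` and the two lookup tables are hypothetical names; the path segments `pmc` and `clinicaltrials` are taken from the example `download_url` values on this page.

```python
from urllib.parse import quote

BASE_URL = "https://api.runcaptain.com"

# Search results use "clinical_trial", while the document path uses "clinicaltrials".
DOC_TYPE_BY_RESULT_TYPE = {"pmc": "pmc", "clinical_trial": "clinicaltrials"}
ID_FIELD_BY_RESULT_TYPE = {"pmc": "pmcid", "clinical_trial": "nct_id"}


def document_url(result: dict) -> str:
    """Prefer the download_url from the API; otherwise build it from type and ID."""
    if result.get("download_url"):
        return result["download_url"]
    result_type = result["type"]
    doc_type = DOC_TYPE_BY_RESULT_TYPE[result_type]  # KeyError for unknown types
    doc_id = quote(result[ID_FIELD_BY_RESULT_TYPE[result_type]], safe="")
    return f"{BASE_URL}/v2/datasets/scientific/documents/{doc_type}/{doc_id}"
```

Preferring the returned `download_url` keeps your code working if the path layout changes; the constructed form is mainly useful when you only stored the ID.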

Python

import json
import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
PMCID = "PMC6503629"

headers = {"Authorization": f"Bearer {API_KEY}"}

# Fetch the full document; larger articles can take longer than a search.
response = requests.get(
    f"{BASE_URL}/v2/datasets/scientific/documents/pmc/{PMCID}",
    headers=headers,
    timeout=60.0,
)
response.raise_for_status()

document = response.json()
# Preview the markdown, then print the remaining metadata fields.
print(document["markdown"][:1000])
print(json.dumps({k: v for k, v in document.items() if k != "markdown"}, indent=2))

TypeScript

const BASE_URL = "https://api.runcaptain.com";
const API_KEY = "your_api_key";
const pmcid = "PMC6503629";

const response = await fetch(`${BASE_URL}/v2/datasets/scientific/documents/pmc/${pmcid}`, {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

// Stop on HTTP errors instead of reading markdown from an error body.
if (!response.ok) {
  throw new Error(`Document fetch failed with status ${response.status}`);
}

const document = await response.json();
console.log(document.markdown.slice(0, 1000));

cURL

$ curl "https://api.runcaptain.com/v2/datasets/scientific/documents/pmc/PMC6503629" \
>   -H "Authorization: Bearer $CAPTAIN_API_KEY"

Document Response

PMC document:

{
  "type": "pmc",
  "found": true,
  "pmcid": "PMC6503629",
  "markdown": "# PARP inhibitors and homologous recombination deficiency in breast cancer\n\n**Abstract.** ...",
  "markdown_source": "pmc_jats_full_text",
  "pdf_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6503629/pdf/",
  "section_count": 12,
  "table_caption_count": 3,
  "figure_caption_count": 5
}

ClinicalTrials.gov document:

{
  "type": "clinicaltrials",
  "found": true,
  "nct_id": "NCT02000622",
  "markdown": "# Olaparib as Adjuvant Treatment in Patients With Germline BRCA Mutated High Risk HER2 Negative Primary Breast Cancer\n\n## Status\nACTIVE_NOT_RECRUITING\n\n## Brief Summary\n...",
  "markdown_source": "ctgov_record"
}

If a document cannot be retrieved, found may be false with a reason, or the endpoint may return an error for invalid IDs, unknown document types, or upstream failures.
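
A caller can treat both failure modes uniformly. The following is a minimal sketch, assuming only what is stated above: an HTTP error status, or a body with `found` set to `false` and an optional `reason`. The helper name `markdown_or_reason` is hypothetical, and the shape of error bodies is not documented here, so the sketch relies on the status code alone.

```python
from typing import Optional, Tuple


def markdown_or_reason(
    status_code: int, payload: Optional[dict]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (markdown, None) on success or (None, reason) on failure."""
    if status_code >= 400:
        # Invalid IDs, unknown document types, or upstream failures.
        return None, f"HTTP {status_code}"
    if not payload or not payload.get("found"):
        return None, (payload or {}).get("reason") or "document not found"
    markdown = payload.get("markdown")
    if not markdown:
        return None, "response contained no markdown"
    return markdown, None
```

With `requests`, this would be called as `markdown_or_reason(response.status_code, body)`, where `body` is the parsed JSON or `None` if parsing failed.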

Index Into File Search

Scientific Search returns source markdown. File Search indexes and retrieves that markdown.

Python

import requests
import uuid

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
COLLECTION_NAME = "scientific_research"

auth_headers = {"Authorization": f"Bearer {API_KEY}"}

search_response = requests.get(
    f"{BASE_URL}/v2/datasets/scientific/search",
    headers=auth_headers,
    params={"q": "PARP inhibitor BRCA1 breast cancer", "limit": 3},
    timeout=30.0,
)
search_response.raise_for_status()

for result in search_response.json()["results"]:
    # Fetch the full markdown for each search result.
    document_response = requests.get(
        result["download_url"],
        headers=auth_headers,
        timeout=60.0,
    )
    document = document_response.json()
    # Skip documents that could not be retrieved.
    if not document.get("markdown"):
        continue

    index_response = requests.post(
        f"{BASE_URL}/v2/collections/{COLLECTION_NAME}/index/text",
        headers={
            **auth_headers,
            "Content-Type": "application/json",
            # Generate a fresh key per request; reusing one key across
            # different documents could cause later requests to be deduplicated.
            "Idempotency-Key": str(uuid.uuid4()),
        },
        json={
            "text": document["markdown"],
            "document_id": result.get("pmcid") or result.get("nct_id"),
            "custom_metadata": {
                "source": result["type"],
                "title": result.get("title"),
                "pmcid": result.get("pmcid"),
                "nct_id": result.get("nct_id"),
            },
        },
        timeout=60.0,
    )
    index_response.raise_for_status()

Then query the collection with File Search or inspect the exact request shape in Query - v3.

Fields

Search Request

Field	Type	Default	Description
`q`	string	required	Scientific search query.
`limit`	integer	`10`	Maximum results per source. Valid range: 1-25.
`recency_years`	integer	`10`	Prefer documents published within the last N years, where the upstream source supports date filtering. Valid range: 1-50.
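
The ranges in this table can be checked client-side before a request is sent. This is a sketch only: `build_search_params` is a hypothetical helper, and the server remains the authority on validation.

```python
def build_search_params(q: str, limit: int = 10, recency_years: int = 10) -> dict:
    """Validate search parameters against the documented ranges and defaults."""
    if not q or not q.strip():
        raise ValueError("q is required")
    if not 1 <= limit <= 25:
        raise ValueError("limit must be between 1 and 25")
    if not 1 <= recency_years <= 50:
        raise ValueError("recency_years must be between 1 and 50")
    return {"q": q.strip(), "limit": limit, "recency_years": recency_years}
```

The returned dictionary can be passed directly as `params=` to `requests.get`.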

Search Result

Field	Description
`type`	Source type. Values are `pmc` or `clinical_trial`.
`pmcid`	PubMed Central ID. Present on `pmc` results.
`nct_id`	ClinicalTrials.gov ID. Present on `clinical_trial` results.
`title`	Article or study title.
`journal`	Journal name for PMC results.
`year`	Publication year for PMC results.
`phase`	Trial phase for ClinicalTrials.gov results.
`status`	Trial status for ClinicalTrials.gov results.
`full_text_available`	`true` for returned results. The endpoint is scoped to sources with full document retrieval.
`download_url`	API URL that fetches the full markdown document.

Document Response

Field	Description
`type`	Document source type: `pmc` or `clinicaltrials`.
`found`	Whether a full document was found for the provided ID.
`pmcid`	PubMed Central ID for PMC documents.
`nct_id`	ClinicalTrials.gov ID for trial documents.
`markdown`	Full document rendered as markdown.
`markdown_source`	Renderer used for the markdown, such as `pmc_jats_full_text` or `ctgov_record`.
`pdf_url`	Open-access PMC PDF URL when available.
`section_count`	Number of sections rendered from a PMC article.
`table_caption_count`	Number of table captions rendered from a PMC article.
`figure_caption_count`	Number of figure captions rendered from a PMC article.
`reason`	Why a full document was unavailable when `found` is `false`.
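
When logging or storing a document response, it is often convenient to separate the metadata from the large markdown body. The sketch below shows one way to do that; `document_summary` and the derived keys `document_id` and `markdown_chars` are hypothetical and not part of the API response.

```python
def document_summary(document: dict) -> dict:
    """Return the response metadata without the (potentially large) markdown body."""
    summary = {k: v for k, v in document.items() if k != "markdown" and v is not None}
    # PMC documents carry pmcid; trial documents carry nct_id.
    summary["document_id"] = document.get("pmcid") or document.get("nct_id")
    # Record the body size so an empty or truncated document is easy to spot.
    summary["markdown_chars"] = len(document.get("markdown") or "")
    return summary
```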

Source Search and Scrape

Scientific Search also has scraper-backed source search endpoints for supported web publications:

GET /v2/datasets/scientific/sources/search
GET /v2/datasets/scientific/sources/scrape

Use those only when you specifically need publication URL discovery and scraping. For deterministic scientific document retrieval, prefer GET /v2/datasets/scientific/search and GET /v2/datasets/scientific/documents/{doc_type}/{doc_id}.