Scientific Search

Scientific Search gives you a deterministic way to find scientific material and retrieve the complete document text as markdown. It does not synthesize an answer. It returns source documents that your application can inspect, store, or index into File Search.

What It Covers

| Source | Search result type | Full document fetch |
| --- | --- | --- |
| PubMed Central | type: "pmc" with pmcid | Open-access article body rendered from PMC JATS XML into markdown, including sections, tables, and figure captions |
| ClinicalTrials.gov | type: "clinical_trial" with nct_id | Complete study record rendered into markdown |

PubMed abstracts and Semantic Scholar papers are not part of this deterministic workflow because they do not guarantee a complete, freely retrievable document for every result. Scientific Search is intentionally narrower: every returned result includes full_text_available: true and a download_url for the full markdown document.

How It Works

  1. Search scientific documents with GET /v2/datasets/scientific/search.
  2. Read results[].download_url from the response.
  3. Fetch the full markdown document from that URL.
  4. Index the markdown into a Captain collection if you want retrieval, filtering, relations, or RAG over those scientific documents.

All requests require Authorization: Bearer {api_key}. Include X-Organization-ID only when your key is not already scoped to an organization.
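
The header rule above can be captured in a small helper. This is a minimal sketch: `build_headers` and its `organization_id` parameter are hypothetical names chosen for illustration, and only the two header names come from the text above.

```python
from typing import Optional


def build_headers(api_key: str, organization_id: Optional[str] = None) -> dict[str, str]:
    """Build request headers for the API (hypothetical helper, not part of any SDK)."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    headers = {"Authorization": f"Bearer {api_key}"}
    # Send X-Organization-ID only when the key is not already scoped to an organization.
    if organization_id:
        headers["X-Organization-ID"] = organization_id
    return headers
```

Passing the result as `headers=` to each `requests` call keeps the organization header out of requests made with an organization-scoped key.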

Search Documents

Use q for the search query. Results are gathered from the fixed source set in a fixed order, so the same query parameters produce the same result shape.

Python
import json
import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"

headers = {"Authorization": f"Bearer {API_KEY}"}

response = requests.get(
    f"{BASE_URL}/v2/datasets/scientific/search",
    headers=headers,
    params={
        "q": "PARP inhibitor BRCA1 breast cancer",
        "limit": 5,
        "recency_years": 10,
    },
    timeout=30.0,
)

print(json.dumps(response.json(), indent=2))
TypeScript
const BASE_URL = "https://api.runcaptain.com";
const API_KEY = "your_api_key";

const params = new URLSearchParams({
  q: "PARP inhibitor BRCA1 breast cancer",
  limit: "5",
  recency_years: "10",
});

const response = await fetch(`${BASE_URL}/v2/datasets/scientific/search?${params}`, {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

const data = await response.json();
console.log(JSON.stringify(data, null, 2));
cURL
curl -G "https://api.runcaptain.com/v2/datasets/scientific/search" \
  -H "Authorization: Bearer $CAPTAIN_API_KEY" \
  --data-urlencode "q=PARP inhibitor BRCA1 breast cancer" \
  --data-urlencode "limit=5" \
  --data-urlencode "recency_years=10"

Search Response

{
  "query": "PARP inhibitor BRCA1 breast cancer",
  "results": [
    {
      "type": "pmc",
      "pmcid": "PMC6503629",
      "title": "PARP inhibitors and homologous recombination deficiency in breast cancer",
      "journal": "Cancer Research",
      "year": 2024,
      "full_text_available": true,
      "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/pmc/PMC6503629"
    },
    {
      "type": "clinical_trial",
      "nct_id": "NCT02000622",
      "title": "Olaparib as Adjuvant Treatment in Patients With Germline BRCA Mutated High Risk HER2 Negative Primary Breast Cancer",
      "phase": "PHASE3",
      "status": "ACTIVE_NOT_RECRUITING",
      "full_text_available": true,
      "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/clinicaltrials/NCT02000622"
    }
  ],
  "results_by_source": {
    "pmc": [
      {
        "type": "pmc",
        "pmcid": "PMC6503629",
        "title": "PARP inhibitors and homologous recombination deficiency in breast cancer",
        "journal": "Cancer Research",
        "year": 2024,
        "full_text_available": true,
        "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/pmc/PMC6503629"
      }
    ],
    "clinicaltrials": [
      {
        "type": "clinical_trial",
        "nct_id": "NCT02000622",
        "title": "Olaparib as Adjuvant Treatment in Patients With Germline BRCA Mutated High Risk HER2 Negative Primary Breast Cancer",
        "phase": "PHASE3",
        "status": "ACTIVE_NOT_RECRUITING",
        "full_text_available": true,
        "download_url": "https://api.runcaptain.com/v2/datasets/scientific/documents/clinicaltrials/NCT02000622"
      }
    ]
  },
  "sources_searched": ["pmc", "clinicaltrials"],
  "total_results": 2,
  "errors": {},
  "execution_time_ms": 842
}
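
A response shaped like the one above can be reduced to the values needed for the fetch step. The following is a sketch under stated assumptions: `document_id` and `download_targets` are hypothetical helper names, and treating a non-empty `errors` object as a per-source failure is an inference from the sample, not documented behavior.

```python
from typing import Any


def document_id(result: dict[str, Any]) -> str:
    """Return the source-specific ID: pmcid for pmc results, nct_id for clinical_trial results."""
    key = "pmcid" if result["type"] == "pmc" else "nct_id"
    return result[key]


def download_targets(payload: dict[str, Any]) -> list[tuple[str, str, str]]:
    """List (type, id, download_url) for each entry in results, in response order."""
    # Assumption: a non-empty "errors" object means at least one source failed,
    # so surface it instead of silently returning a partial result list.
    if payload.get("errors"):
        raise RuntimeError(f"search reported source errors: {payload['errors']}")
    return [
        (result["type"], document_id(result), result["download_url"])
        for result in payload.get("results", [])
    ]
```

Calling `download_targets(response.json())` on the search response yields one tuple per result, which can be passed straight to the document fetch.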

Fetch Full Markdown

Call the download_url from a search result, or construct the URL from the source type and document ID.
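
For the second option, constructing the URL yourself, a sketch might look like the following. The mapping from a search result `type` to a path segment is inferred from the sample `download_url` values on this page (`clinical_trial` results point at a `clinicaltrials` path); `build_document_url` and `DOC_TYPE_SEGMENT` are hypothetical names.

```python
from urllib.parse import quote

BASE_URL = "https://api.runcaptain.com"

# Search result "type" -> path segment seen in the sample download_url values.
DOC_TYPE_SEGMENT = {"pmc": "pmc", "clinical_trial": "clinicaltrials"}


def build_document_url(result_type: str, doc_id: str, base_url: str = BASE_URL) -> str:
    """Build the document fetch URL for a search result type and document ID."""
    try:
        segment = DOC_TYPE_SEGMENT[result_type]
    except KeyError:
        raise ValueError(f"unsupported result type: {result_type!r}") from None
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{base_url}/v2/datasets/scientific/documents/{segment}/{quote(doc_id, safe='')}"
```

Using `download_url` from the search response remains the simpler choice, since it avoids depending on this inferred mapping.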

Python
import json
import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
PMCID = "PMC6503629"

headers = {"Authorization": f"Bearer {API_KEY}"}

response = requests.get(
    f"{BASE_URL}/v2/datasets/scientific/documents/pmc/{PMCID}",
    headers=headers,
    timeout=60.0,
)

document = response.json()
print(document["markdown"][:1000])
print(json.dumps({k: v for k, v in document.items() if k != "markdown"}, indent=2))
TypeScript
const BASE_URL = "https://api.runcaptain.com";
const API_KEY = "your_api_key";
const pmcid = "PMC6503629";

const response = await fetch(`${BASE_URL}/v2/datasets/scientific/documents/pmc/${pmcid}`, {
  headers: { Authorization: `Bearer ${API_KEY}` },
});

const document = await response.json();
console.log(document.markdown.slice(0, 1000));
cURL
curl "https://api.runcaptain.com/v2/datasets/scientific/documents/pmc/PMC6503629" \
  -H "Authorization: Bearer $CAPTAIN_API_KEY"

Document Response

PMC document:

{
  "type": "pmc",
  "found": true,
  "pmcid": "PMC6503629",
  "markdown": "# PARP inhibitors and homologous recombination deficiency in breast cancer\n\n**Abstract.** ...",
  "markdown_source": "pmc_jats_full_text",
  "pdf_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6503629/pdf/",
  "section_count": 12,
  "table_caption_count": 3,
  "figure_caption_count": 5
}

ClinicalTrials.gov document:

{
  "type": "clinicaltrials",
  "found": true,
  "nct_id": "NCT02000622",
  "markdown": "# Olaparib as Adjuvant Treatment in Patients With Germline BRCA Mutated High Risk HER2 Negative Primary Breast Cancer\n\n## Status\nACTIVE_NOT_RECRUITING\n\n## Brief Summary\n...",
  "markdown_source": "ctgov_record"
}

If a document cannot be retrieved, found may be false with a reason, or the endpoint may return an error for invalid IDs, unknown document types, or upstream failures.
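
One way to handle the `found: false` case is to check the parsed body before using it. This is a minimal sketch: `DocumentUnavailable` and `require_markdown` are hypothetical names, and the only response fields it relies on are `found`, `reason`, and `markdown` as described on this page.

```python
from typing import Any


class DocumentUnavailable(Exception):
    """Raised when a document response carries no usable markdown."""


def require_markdown(document: dict[str, Any]) -> str:
    """Return the markdown body, or raise with the reason reported in the response."""
    if not document.get("found", False):
        reason = document.get("reason") or "no reason given"
        raise DocumentUnavailable(f"document not found: {reason}")
    markdown = document.get("markdown")
    if not markdown:
        raise DocumentUnavailable("response has found=true but no markdown")
    return markdown
```

For the error-status case, calling `response.raise_for_status()` from `requests` before parsing the JSON turns 4xx and 5xx responses into exceptions, so `require_markdown` only sees successfully returned bodies.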

Scientific Search returns source markdown; File Search indexes and retrieves it. The example below searches, fetches the markdown for each result, and indexes it into a collection.

Python
import uuid

import requests

BASE_URL = "https://api.runcaptain.com"
API_KEY = "your_api_key"
COLLECTION_NAME = "scientific_research"

auth_headers = {"Authorization": f"Bearer {API_KEY}"}

# Search for documents.
search_response = requests.get(
    f"{BASE_URL}/v2/datasets/scientific/search",
    headers=auth_headers,
    params={"q": "PARP inhibitor BRCA1 breast cancer", "limit": 3},
    timeout=30.0,
)
search_response.raise_for_status()

for result in search_response.json()["results"]:
    # Fetch the full markdown for this result.
    document_response = requests.get(
        result["download_url"],
        headers=auth_headers,
        timeout=60.0,
    )
    document = document_response.json()
    if not document.get("markdown"):
        continue  # Skip documents that could not be retrieved.

    # Index the markdown. Generate a fresh Idempotency-Key per request:
    # reusing one key across different documents can cause later writes
    # to be treated as retries of the first.
    index_response = requests.post(
        f"{BASE_URL}/v2/collections/{COLLECTION_NAME}/index/text",
        headers={
            **auth_headers,
            "Content-Type": "application/json",
            "Idempotency-Key": str(uuid.uuid4()),
        },
        json={
            "text": document["markdown"],
            "document_id": result.get("pmcid") or result.get("nct_id"),
            "custom_metadata": {
                "source": result["type"],
                "title": result.get("title"),
                "pmcid": result.get("pmcid"),
                "nct_id": result.get("nct_id"),
            },
        },
        timeout=60.0,
    )
    index_response.raise_for_status()

Then query the collection with File Search or inspect the exact request shape in Query - v3.

Fields

Search Request

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| q | string | required | Scientific search query. |
| limit | integer | 10 | Maximum results per source. Valid range: 1-25. |
| recency_years | integer | 10 | Prefer documents from this many years back where the upstream source supports date filtering. Valid range: 1-50. |
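
The ranges in this table can be checked client-side before a request is sent. The sketch below assumes both ranges are inclusive and uses the documented defaults; `build_search_params` is a hypothetical helper, and how the server itself treats out-of-range values is not stated here.

```python
def build_search_params(q: str, limit: int = 10, recency_years: int = 10) -> dict[str, object]:
    """Validate search parameters against the documented ranges and return a params dict."""
    if not q or not q.strip():
        raise ValueError("q is required")
    # Ranges are assumed to be inclusive on both ends.
    if not 1 <= limit <= 25:
        raise ValueError("limit must be between 1 and 25")
    if not 1 <= recency_years <= 50:
        raise ValueError("recency_years must be between 1 and 50")
    return {"q": q.strip(), "limit": limit, "recency_years": recency_years}
```

The returned dict can be passed as `params=` to `requests.get`, which keeps obviously invalid requests from reaching the API.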

Search Result

| Field | Description |
| --- | --- |
| type | Source type. Values are pmc or clinical_trial. |
| pmcid | PubMed Central ID. Present on pmc results. |
| nct_id | ClinicalTrials.gov ID. Present on clinical_trial results. |
| title | Article or study title. |
| journal | Journal name for PMC results. |
| year | Publication year for PMC results. |
| phase | Trial phase for ClinicalTrials.gov results. |
| status | Trial status for ClinicalTrials.gov results. |
| full_text_available | true for returned results. The endpoint is scoped to sources with full document retrieval. |
| download_url | API URL that fetches the full markdown document. |

Document Response

| Field | Description |
| --- | --- |
| type | Document source type: pmc or clinicaltrials. |
| found | Whether a full document was found for the provided ID. |
| pmcid | PubMed Central ID for PMC documents. |
| nct_id | ClinicalTrials.gov ID for trial documents. |
| markdown | Full document rendered as markdown. |
| markdown_source | Renderer used for the markdown, such as pmc_jats_full_text or ctgov_record. |
| pdf_url | Open-access PMC PDF URL when available. |
| section_count | Number of sections rendered from a PMC article. |
| table_caption_count | Number of table captions rendered from a PMC article. |
| figure_caption_count | Number of figure captions rendered from a PMC article. |
| reason | Why a full document was unavailable when found is false. |

Source Search and Scrape

Scientific Search also has scraper-backed source search endpoints for supported web publications:

GET /v2/datasets/scientific/sources/search
GET /v2/datasets/scientific/sources/scrape

Use those only when you specifically need publication URL discovery and scraping. For deterministic scientific document retrieval, prefer GET /v2/datasets/scientific/search and GET /v2/datasets/scientific/documents/{doc_type}/{doc_id}.