Commit 89f1173 (unverified) · Parent: 0049ad7
Authors: VibecoderMcSwaggins, Claude

fix(SPEC_11): address CodeRabbit review feedback (#92)

* fix(SPEC_11): address CodeRabbit review feedback

- Update run_full.py docstring to reference sexual health pipeline (not drug repurposing)
- Update run_full.py help text to use sexual health query example
- Fix app.py domain display to show "Sexual Health" (not "Sexual_Health")
- Update test_nodes.py mock data to use sexual health terms (Testosterone, Androgen Receptor)
- Add @pytest.mark.unit markers to test_judges.py test classes
- Add specific assertion for next_search_queries in test_judges.py
- Add missing @pytest.mark.asyncio marker to test_mcp_tools.py
- Update test_mcp_tools.py and test_clinicaltrials.py to use sexual health queries
- Fix SPEC_12 markdown bare code fence (MD040)

* fix(SPEC_11): comprehensive domain alignment audit

Second pass of CodeRabbit review fixes - found additional domain mismatches:

- examples/search_demo/run_search.py: Update docstring from "drug repurposing" to "sexual health"
- examples/orchestrator_demo/run_agent.py: Update help text from "metformin cancer" to "testosterone libido"
- src/agents/tools.py: Update search_preprints example from "long covid" to "flibanserin HSDD preprint"
- tests/unit/tools/test_europepmc.py: Replace "Long COVID" mock data and queries with testosterone/HSDD
- tests/unit/tools/test_query_utils.py: Replace "diabetes" and "aging" examples with sexual health terms

All examples, demos, and tests now consistently use sexual health domain examples.

* feat(SPEC_12): implement narrative report synthesis using LLM

Transform report generation from string templating to LLM-based narrative
synthesis, following the Microsoft Agent Framework aggregator pattern.

New files:
- src/prompts/synthesis.py: Narrative synthesis prompts with few-shot example
- get_synthesis_system_prompt(): Domain-aware narrative writing instructions
- format_synthesis_prompt(): Formats evidence/assessment for LLM
- FEW_SHOT_EXAMPLE: Alprostadil ED example demonstrating prose style

- tests/unit/prompts/test_synthesis.py: 20 tests for synthesis prompts
- Verify emphasis on prose, not bullets
- Verify required sections (Executive Summary, Background, etc.)
- Verify anti-hallucination warnings
- Verify few-shot example quality

- tests/unit/orchestrators/test_simple_synthesis.py: 6 tests for orchestrator
- Test LLM agent is called for synthesis
- Test graceful fallback on LLM failure
- Test citation footer inclusion

Modified files:
- src/orchestrators/simple.py:
- Add async _generate_synthesis() that calls LLM for narrative prose
- Rename old method to _generate_template_synthesis() as fallback
- Update call site at line 394 to await the async synthesis

- tests/unit/orchestrators/test_simple_orchestrator_domain.py:
- Update test to use _generate_template_synthesis() (sync fallback)

This implements SPEC_12 acceptance criteria:
- Report contains paragraph-form prose, not just bullet points
- Report has Executive Summary, Background, Evidence Synthesis sections
- Report has actionable Recommendations and Limitations
- Citations properly formatted with author/year/title/URL
- Graceful fallback if LLM unavailable
- All 256 tests pass

* refactor: enhance exception handling and type safety

- Add new exception types: LLMError, QuotaExceededError, ModalError, SynthesisError
- Update orchestrator to catch specific exception types (SearchError, JudgeError, ModalError)
- Add exc_type logging for better debugging and observability
- Fix app.py type safety with OrchestratorMode literal type
- Add mode validation for Gradio string inputs
- Remove unnecessary type: ignore comment in app.py

* docs: add embeddings and meta-agent architecture brainstorm

Research and first-principles analysis covering:
- Embedding service comparison (FAISS, ChromaDB, Voyage AI, MixedBread)
- Selective vs full codebase embedding (selective wins)
- Meta question: would self-knowledge help agents?
- Implementation patterns for codebase RAG
- Recommended roadmap for developer tooling

* docs: add reality check section to embeddings brainstorm

Distinguish real vs vaporware based on web research:
- Cursor's @codebase: REAL, production (embeddings + Turbopuffer)
- Claude Code: grep-only, no semantic search natively
- MCP servers (claude-context, code-index-mcp): REAL but with bugs
- "Self-aware agents" claims: mostly vaporware

Key insight: For AI-native devs, the real opportunity is MCP servers
that give Claude Code semantic search, not embedding the codebase
for agent self-understanding.

* docs: deep dive on internal organ vs external tool

First-principles analysis with empirical research:
- SICA (ICLR 2025): 17-53% gains from self-improvement
- Gödel Agent (ACL 2025): recursive self-modification works
- Introspection paradox: self-knowledge can HURT strong models
- Anthropic research: ~20% accuracy on genuine introspection

Key conclusion: For DeepBoner's core task (research), an internal
self-knowledge organ is overhead with negative ROI. The agent
doesn't need to understand its own code to search PubMed.

External tools help DEVELOPMENT. Internal organs help EXPERIMENTATION.
Neither helps the research task itself.

* docs: critical tool analysis and embeddings conclusion

New: TOOL_ANALYSIS_CRITICAL.md
- Deep analysis of all 4 search tools (PubMed, ClinicalTrials, Europe PMC, OpenAlex)
- API limitations and what's actually possible
- Identified critical gaps: deduplication, outcome measures, citation traversal
- Priority improvements without horizontal sprawl
- Neo4j recommendation: not yet, use OpenAlex API first

Updated: BRAINSTORM_EMBEDDINGS_META.md
- Condensed to conclusions only
- Closed: internal embeddings/mGREP not needed for this use case
- Focus on research evidence retrieval, not codebase self-knowledge

* test: update e2e/integration tests for SPEC_12 LLM synthesis format

Tests were asserting the OLD template format ("## Sexual Health Analysis",
"### Drug Candidates"), but the SPEC_12 implementation uses LLM-generated
narrative prose with different section headers ("Executive Summary",
"Background", "Evidence Synthesis", etc.).

Updated assertions to accept both formats for backwards compatibility.

* docs: add language identifier to code fence (MD040)

---------

Co-authored-by: Claude <noreply@anthropic.com>

BRAINSTORM_EMBEDDINGS_META.md ADDED
@@ -0,0 +1,74 @@
1
+ # Embeddings Brainstorm - Conclusions
2
+
3
+ **Date**: November 2025
4
+ **Status**: CLOSED - Conclusions reached, no action needed
5
+
6
+ ---
7
+
8
+ ## The Question
9
+
10
+ Should DeepBoner implement:
11
+ 1. Internal codebase embeddings/ingestion pipeline?
12
+ 2. mGREP for internal tool selection?
13
+ 3. Self-knowledge components for agents?
14
+
15
+ ## The Answer: NO
16
+
17
+ After research and first-principles analysis, the conclusion is clear:
18
+
19
+ ### Why Not Internal Embeddings/Ingestion
20
+
21
+ ```text
22
+ DeepBoner's Core Task:
23
+ ┌─────────────────────────────────────────────────────────┐
24
+ │ User Query: "Evidence for testosterone in HSDD?" │
25
+ │ ↓ │
26
+ │ 1. Search PubMed, ClinicalTrials, Europe PMC │
27
+ │ 2. Judge: Is evidence sufficient? │
28
+ │ 3. Synthesize: Generate report │
29
+ │ ↓ │
30
+ │ Output: Research report with citations │
31
+ └─────────────────────────────────────────────────────────┘
32
+
33
+ Does ANY step require self-knowledge of codebase? NO.
34
+ ```
35
+
36
+ ### Why Not mGREP for Tool Selection
37
+
38
+ | Approach | Complexity | Accuracy |
39
+ |----------|------------|----------|
40
+ | Embeddings + mGREP for tool selection | High | Medium (semantic similarity ≠ correct tool) |
41
+ | Direct prompting with tool descriptions | Low | High (LLM reasons about applicability) |
42
+
43
+ **No real agent system uses embeddings for tool selection.** All major frameworks (LangChain, OpenAI, Anthropic, Magentic) use prompt-based tool selection because:
44
+ 1. LLMs are already doing semantic matching internally
45
+ 2. Tool count is small (5-20) - fits easily in context
46
+ 3. Prompts allow reasoning, not just similarity
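
A minimal sketch of that prompt-based approach, with illustrative tool names rather than the project's real registry:

```python
TOOL_DESCRIPTIONS = {
    "search_pubmed": "Peer-reviewed biomedical literature (abstracts).",
    "search_clinicaltrials": "Registered interventional trials with phase and status.",
    "search_preprints": "Europe PMC preprints (bioRxiv/medRxiv), not peer-reviewed.",
}


def build_tool_selection_prompt(question: str) -> str:
    """Inline tool descriptions so the LLM can reason about applicability (sketch)."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOL_DESCRIPTIONS.items())
    return (
        f"Available tools:\n{tool_lines}\n\n"
        f"Question: {question}\n"
        "Choose the tool(s) that best answer the question and explain why."
    )
```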
47
+
48
+ ### What We Already Have
49
+
50
+ DeepBoner already uses embeddings for the **right thing**: research evidence retrieval.
51
+ - `src/services/embeddings.py` - ChromaDB + sentence-transformers
52
+ - `src/services/llamaindex_rag.py` - OpenAI embeddings for premium tier
53
+
54
+ ### The Real Priority
55
+
56
+ Instead of internal embeddings/mGREP, focus on:
57
+ 1. **Deduplication** across PubMed/Europe PMC/OpenAlex
58
+ 2. **Outcome measures** from ClinicalTrials.gov
59
+ 3. **Citation graph traversal** via OpenAlex
60
+
61
+ See: `TOOL_ANALYSIS_CRITICAL.md` for detailed improvement roadmap.
62
+
63
+ ---
64
+
65
+ ## Research Sources
66
+
67
+ - [SICA Paper (ICLR 2025)](https://arxiv.org/abs/2504.15228) - Self-improving agents
68
+ - [Gödel Agent (ACL 2025)](https://arxiv.org/abs/2410.04444) - Recursive self-modification
69
+ - [Introspection Paradox (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.352/) - Self-knowledge can hurt performance
70
+ - [Anthropic Introspection Research](https://www.anthropic.com/research/introspection) - ~20% accuracy on genuine introspection
71
+
72
+ ---
73
+
74
+ *This document is closed. The conclusion is: don't implement internal embeddings/mGREP for this use case.*
SPEC_12_NARRATIVE_SYNTHESIS.md CHANGED
@@ -176,7 +176,7 @@ async def summarize_results(results: list[Any]) -> str:
176
 
177
  ### Architecture Change
178
 
179
- ```
180
  Current (Simple Mode):
181
  Evidence → Judge → {structured data} → String Template → Bullet Points
182
 
 
176
 
177
  ### Architecture Change
178
 
179
+ ```text
180
  Current (Simple Mode):
181
  Evidence → Judge → {structured data} → String Template → Bullet Points
182
 
TOOL_ANALYSIS_CRITICAL.md ADDED
@@ -0,0 +1,348 @@
1
+ # Critical Analysis: Search Tools - Limitations, Gaps, and Improvements
2
+
3
+ **Date**: November 2025
4
+ **Purpose**: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.
5
+
6
+ ---
7
+
8
+ ## Executive Summary
9
+
10
+ DeepBoner currently has **4 search tools**:
11
+ 1. PubMed (NCBI E-utilities)
12
+ 2. ClinicalTrials.gov (API v2)
13
+ 3. Europe PMC (includes preprints)
14
+ 4. OpenAlex (citation-aware)
15
+
16
+ **Overall Assessment**: Tools are functional but have significant gaps in:
17
+ - Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
18
+ - Full-text retrieval (only abstracts currently)
19
+ - Citation graph traversal (OpenAlex has data but we don't use it)
20
+ - Query optimization (basic synonym expansion, no MeSH term mapping)
21
+
22
+ ---
23
+
24
+ ## Tool 1: PubMed (NCBI E-utilities)
25
+
26
+ **File**: `src/tools/pubmed.py`
27
+
28
+ ### What It Does Well
29
+ | Feature | Status | Notes |
30
+ |---------|--------|-------|
31
+ | Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
32
+ | Retry logic | ✅ | tenacity with exponential backoff |
33
+ | Query preprocessing | ✅ | Strips question words, expands synonyms |
34
+ | Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |
35
+
36
+ ### Limitations (API-Level)
37
+ | Limitation | Severity | Workaround Possible? |
38
+ |------------|----------|---------------------|
39
+ | **10,000 result cap per query** | Medium | Yes - use date ranges to paginate |
40
+ | **Abstracts only** (no full text) | High | No - full text requires PMC or publisher |
41
+ | **No citation counts** | Medium | Yes - cross-reference with OpenAlex |
42
+ | **Rate limit (10/sec max)** | Low | Already handled |
43
+
44
+ ### Current Implementation Gaps
45
+ ```python
46
+ # GAP 1: No MeSH term expansion
47
+ # Current: expand_synonyms() uses hardcoded dict
48
+ # Better: Use NCBI's E-utilities to get MeSH terms for query
49
+
50
+ # GAP 2: No date filtering
51
+ # Current: Gets whatever PubMed returns (biased toward recent)
52
+ # Better: Add date range parameter for historical research
53
+
54
+ # GAP 3: No publication type filtering
55
+ # Current: Returns all types (reviews, case reports, RCTs)
56
+ # Better: Filter for RCTs and systematic reviews when appropriate
57
+ ```
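
A hedged sketch of the date-range (GAP 2) and publication-type (GAP 3) fixes; the helper name is illustrative and does not exist in `src/tools/pubmed.py` yet:

```python
# Illustrative only: narrow an esearch request with a publication-type
# filter (query syntax) and a date window (mindate/maxdate parameters).
def build_esearch_params(
    query: str,
    publication_types: list[str] | None = None,
    min_date: str | None = None,  # e.g. "2015/01/01"
    max_date: str | None = None,  # e.g. "2025/12/31"
) -> dict[str, str]:
    term = query
    if publication_types:
        pt_clause = " OR ".join(f'"{pt}"[Publication Type]' for pt in publication_types)
        term = f"({term}) AND ({pt_clause})"
    params: dict[str, str] = {"db": "pubmed", "term": term, "retmode": "json"}
    if min_date and max_date:
        params.update({"datetype": "pdat", "mindate": min_date, "maxdate": max_date})
    return params


# Usage: RCTs and meta-analyses on testosterone therapy since 2015
params = build_esearch_params(
    "testosterone therapy HSDD",
    publication_types=["Randomized Controlled Trial", "Meta-Analysis"],
    min_date="2015/01/01",
    max_date="2025/12/31",
)
```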
58
+
59
+ ### Priority Improvements
60
+ 1. **HIGH**: Add publication type filter (Reviews, RCTs, Meta-analyses)
61
+ 2. **MEDIUM**: Add date range parameter
62
+ 3. **LOW**: MeSH term expansion via E-utilities
63
+
64
+ ---
65
+
66
+ ## Tool 2: ClinicalTrials.gov
67
+
68
+ **File**: `src/tools/clinicaltrials.py`
69
+
70
+ ### What It Does Well
71
+ | Feature | Status | Notes |
72
+ |---------|--------|-------|
73
+ | API v2 usage | ✅ | Modern API, not deprecated v1 |
74
+ | Interventional filter | ✅ | Only gets drug/treatment studies |
75
+ | Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
76
+ | httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |
77
+
78
+ ### Limitations (API-Level)
79
+ | Limitation | Severity | Workaround Possible? |
80
+ |------------|----------|---------------------|
81
+ | **No results data** | High | Yes - available via different endpoint |
82
+ | **No outcome measures** | High | Yes - add to FIELDS list |
83
+ | **No adverse events** | Medium | Yes - separate API call |
84
+ | **Sparse drug mechanism data** | Medium | No - not in API |
85
+
86
+ ### Current Implementation Gaps
87
+ ```python
88
+ # GAP 1: Missing critical fields
89
+ FIELDS: ClassVar[list[str]] = [
90
+ "NCTId",
91
+ "BriefTitle",
92
+ "Phase",
93
+ "OverallStatus",
94
+ "Condition",
95
+ "InterventionName",
96
+ "StartDate",
97
+ "BriefSummary",
98
+ # MISSING:
99
+ # "PrimaryOutcome",
100
+ # "SecondaryOutcome",
101
+ # "ResultsFirstSubmitDate",
102
+ # "StudyResults", # Whether results are posted
103
+ ]
104
+
105
+ # GAP 2: No results retrieval
106
+ # Many completed trials have posted results
107
+ # We could get actual efficacy data, not just trial existence
108
+
109
+ # GAP 3: No linked publications
110
+ # Trials often link to PubMed articles with results
111
+ # We could follow these links for richer evidence
112
+ ```
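
A hedged sketch of the GAP 2 fix (retrieving posted results). It reuses `requests` since that is already the workaround for the WAF block; the response keys (`hasResults`, `resultsSection`) are assumptions to verify against a live v2 response:

```python
import requests


def fetch_trial_results(nct_id: str) -> dict | None:
    """Return posted outcome-measure results for a trial, if any (sketch)."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    study = resp.json()
    if not study.get("hasResults"):
        return None  # trial has no posted results yet
    return study.get("resultsSection", {}).get("outcomeMeasuresModule")
```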
113
+
114
+ ### Priority Improvements
115
+ 1. **HIGH**: Add outcome measures to FIELDS
116
+ 2. **HIGH**: Check for and retrieve posted results
117
+ 3. **MEDIUM**: Follow linked publications (NCT → PMID)
118
+
119
+ ---
120
+
121
+ ## Tool 3: Europe PMC
122
+
123
+ **File**: `src/tools/europepmc.py`
124
+
125
+ ### What It Does Well
126
+ | Feature | Status | Notes |
127
+ |---------|--------|-------|
128
+ | Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
129
+ | Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
130
+ | DOI/PMID fallback URLs | ✅ | Smart URL construction |
131
+ | Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |
132
+
133
+ ### Limitations (API-Level)
134
+ | Limitation | Severity | Workaround Possible? |
135
+ |------------|----------|---------------------|
136
+ | **No full text for most articles** | High | Partial - CC-licensed available after 14 days |
137
+ | **Citation data limited** | Medium | Only journal articles, not preprints |
138
+ | **Preprint-publication linking gaps** | Medium | ~50% of links missing per Crossref |
139
+ | **License info sometimes missing** | Low | Manual review required |
140
+
141
+ ### Current Implementation Gaps
142
+ ```python
143
+ # GAP 1: No full-text retrieval
144
+ # Europe PMC has full text for many CC-licensed articles
145
+ # Could retrieve full text XML via separate endpoint
146
+
147
+ # GAP 2: Massive overlap with PubMed
148
+ # Europe PMC indexes all of PubMed/MEDLINE
149
+ # We're getting duplicates with no deduplication
150
+
151
+ # GAP 3: No citation network
152
+ # Europe PMC has "citedByCount" but we don't use it
153
+ # Could prioritize highly-cited preprints
154
+ ```
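
A hedged sketch of the GAP 3 fix (citation-aware ranking) using the documented `resultType=core` parameter; the helper name is illustrative:

```python
import httpx

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


async def search_with_citations(query: str, limit: int = 10) -> list[dict]:
    """Search Europe PMC and rank hits by citedByCount (sketch)."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",  # includes citedByCount and license metadata
        "pageSize": str(limit),
    }
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(EPMC_SEARCH, params=params)
        resp.raise_for_status()
    results = resp.json().get("resultList", {}).get("result", [])
    # Preprints often lack a count; treat missing as 0 rather than dropping them
    return sorted(results, key=lambda r: r.get("citedByCount", 0), reverse=True)
```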
155
+
156
+ ### Priority Improvements
157
+ 1. **HIGH**: Add deduplication with PubMed (by PMID)
158
+ 2. **MEDIUM**: Retrieve citation counts for ranking
159
+ 3. **LOW**: Full-text retrieval for CC-licensed articles
160
+
161
+ ---
162
+
163
+ ## Tool 4: OpenAlex
164
+
165
+ **File**: `src/tools/openalex.py`
166
+
167
+ ### What It Does Well
168
+ | Feature | Status | Notes |
169
+ |---------|--------|-------|
170
+ | Citation counts | ✅ | Sorted by `cited_by_count:desc` |
171
+ | Abstract reconstruction | ✅ | Handles inverted index format |
172
+ | Concept extraction | ✅ | Hierarchical classification |
173
+ | Open access detection | ✅ | `is_oa` and `pdf_url` |
174
+ | Polite pool | ✅ | mailto for 100k/day limit |
175
+ | Rich metadata | ✅ | Best metadata of all tools |
176
+
177
+ ### Limitations (API-Level)
178
+ | Limitation | Severity | Workaround Possible? |
179
+ |------------|----------|---------------------|
180
+ | **Author truncation at 100** | Low | Only affects mega-author papers |
181
+ | **No full text** | High | No - OpenAlex is metadata only |
182
+ | **Stale data (1-2 day lag)** | Low | Acceptable for research |
183
+
184
+ ### Current Implementation Gaps
185
+ ```python
186
+ # GAP 1: No citation graph traversal
187
+ # OpenAlex has `cited_by` and `references` endpoints
188
+ # We could find seminal papers by following citation chains
189
+
190
+ # GAP 2: No related works
191
+ # OpenAlex has ML-powered "related_works" field
192
+ # Could expand search to similar papers
193
+
194
+ # GAP 3: No concept filtering
195
+ # OpenAlex has hierarchical concepts
196
+ # Could filter for specific domains (e.g., "Sexual health" concept)
197
+
198
+ # GAP 4: Overlap with PubMed
199
+ # OpenAlex indexes most of PubMed
200
+ # More duplicates without deduplication
201
+ ```
202
+
203
+ ### Priority Improvements
204
+ 1. **HIGH**: Add citation graph traversal (find seminal papers)
205
+ 2. **HIGH**: Add deduplication with PubMed/Europe PMC
206
+ 3. **MEDIUM**: Use `related_works` for query expansion
207
+ 4. **LOW**: Concept-based filtering
208
+
209
+ ---
210
+
211
+ ## Cross-Tool Issues
212
+
213
+ ### Issue 1: MASSIVE DUPLICATION
214
+
215
+ ```text
216
+ PubMed: 36M+ articles
217
+ Europe PMC: Indexes ALL of PubMed + preprints
218
+ OpenAlex: 250M+ works (includes PubMed)
219
+
220
+ Current behavior: All 3 return the same papers
221
+ Result: Duplicate evidence, wasted tokens, inflated counts
222
+ ```
223
+
224
+ **Solution**: Deduplication by PMID/DOI
225
+ ```python
226
+ # Proposed: Add to SearchHandler
227
+ def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
228
+ seen_ids: set[str] = set()
229
+ unique: list[Evidence] = []
230
+ for e in evidence_list:
231
+ # Extract PMID or DOI from URL
232
+ paper_id = extract_paper_id(e.citation.url)
233
+ if paper_id not in seen_ids:
234
+ seen_ids.add(paper_id)
235
+ unique.append(e)
236
+ return unique
237
+ ```
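
A hedged sketch of the `extract_paper_id` helper the proposal assumes; the URL patterns are illustrative and would need to match what each tool actually emits:

```python
import re


def extract_paper_id(url: str) -> str:
    """Normalize a citation URL to a PMID or DOI key for deduplication (sketch)."""
    pmid = re.search(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", url)
    if pmid:
        return f"pmid:{pmid.group(1)}"
    doi = re.search(r"10\.\d{4,9}/[^\s?#]+", url)
    if doi:
        return f"doi:{doi.group(0).lower()}"
    # Unknown source: fall back to the full URL so nothing is merged by mistake
    return url
```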
238
+
239
+ ### Issue 2: NO FULL-TEXT RETRIEVAL
240
+
241
+ All tools return **abstracts only**. For deep research, this is limiting.
242
+
243
+ **What's Actually Possible**:
244
+ | Source | Full Text Access | How |
245
+ |--------|------------------|-----|
246
+ | PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
247
+ | Europe PMC | Yes, CC-licensed after 14 days | `/fullTextXML/{id}` endpoint |
248
+ | OpenAlex | No | Metadata only |
249
+ | Unpaywall | Yes, OA link discovery | Separate API |
250
+
251
+ **Recommendation**: Add PMC full-text retrieval for open access articles.
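
A hedged sketch of that retrieval via efetch with `db=pmc`; handling of the "PMC" prefix and the exact response shape should be verified against the E-utilities docs:

```python
import httpx

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


async def fetch_pmc_fulltext_xml(pmcid: str, api_key: str | None = None) -> str:
    """Fetch JATS XML for an open-access PMC article (sketch)."""
    params = {"db": "pmc", "id": pmcid.removeprefix("PMC"), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key  # lifts the E-utilities limit to 10 req/sec
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.get(EFETCH, params=params)
        resp.raise_for_status()
    return resp.text  # body may be withheld for articles that are not open access
```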
252
+
253
+ ### Issue 3: NO CITATION GRAPH
254
+
255
+ OpenAlex has rich citation data but we only use `cited_by_count` for sorting.
256
+
257
+ **Untapped Capabilities**:
258
+ - `cited_by`: Find papers that cite a key paper
259
+ - `references`: Find sources a paper cites
260
+ - `related_works`: ML-powered similar papers
261
+
262
+ **Use Case**: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:
263
+ - Papers that cite it (newer evidence)
264
+ - Papers it cites (foundational research)
265
+ - Related papers (similar topics)
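
A hedged sketch of that traversal using the documented `cites:` filter plus the `referenced_works` and `related_works` fields; function and variable names are illustrative:

```python
import httpx

OPENALEX_WORKS = "https://api.openalex.org/works"


async def citation_neighbourhood(work_id: str, mailto: str) -> dict[str, list]:
    """Collect citing, cited, and related works around a seminal paper (sketch)."""
    async with httpx.AsyncClient(timeout=30) as client:
        seed = (await client.get(f"{OPENALEX_WORKS}/{work_id}", params={"mailto": mailto})).json()
        citing = (
            await client.get(
                OPENALEX_WORKS,
                params={
                    "filter": f"cites:{work_id}",
                    "sort": "cited_by_count:desc",
                    "mailto": mailto,
                },
            )
        ).json()
    return {
        "cites_this": citing.get("results", []),            # newer evidence
        "cited_by_this": seed.get("referenced_works", []),  # foundational research
        "related": seed.get("related_works", []),           # similar topics
    }
```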
266
+
267
+ ---
268
+
269
+ ## What's NOT Possible (API Constraints)
270
+
271
+ | Feature | Why Not Possible |
272
+ |---------|------------------|
273
+ | **bioRxiv direct search** | No keyword search API, only RSS feed of latest |
274
+ | **arXiv search** | API exists but irrelevant for sexual health |
275
+ | **PubMed full text** | Requires publisher access or PMC |
276
+ | **Real-time trial results** | ClinicalTrials.gov results are static snapshots |
277
+ | **Drug mechanism data** | Not in any API - would need ChEMBL or DrugBank |
278
+
279
+ ---
280
+
281
+ ## Recommended Improvements (Priority Order)
282
+
283
+ ### Phase 1: Fix Fundamentals (High ROI)
284
+ 1. **Deduplication** - Stop returning the same paper 3 times
285
+ 2. **Outcome measures in ClinicalTrials** - Get actual efficacy data
286
+ 3. **Citation counts from all sources** - Rank by influence, not recency
287
+
288
+ ### Phase 2: Depth Improvements (Medium ROI)
289
+ 4. **PMC full-text retrieval** - Get full papers for OA articles
290
+ 5. **Citation graph traversal** - Find seminal papers automatically
291
+ 6. **Publication type filtering** - Prioritize RCTs and meta-analyses
292
+
293
+ ### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)
294
+ 7. **MeSH term expansion** - Better PubMed queries
295
+ 8. **Related works expansion** - Use OpenAlex ML similarity
296
+ 9. **Date range filtering** - Historical vs recent research
297
+
298
+ ---
299
+
300
+ ## Neo4j Integration (Future Consideration)
301
+
302
+ **Question**: Should we add Neo4j for citation graph storage?
303
+
304
+ **Answer**: Not yet. Here's why:
305
+
306
+ | Approach | Complexity | Value |
307
+ |----------|------------|-------|
308
+ | OpenAlex API for citation traversal | Low | High |
309
+ | Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
310
+ | Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |
311
+
312
+ **Recommendation**: Use OpenAlex API for citation traversal first. Only add Neo4j if:
313
+ 1. We need to do complex graph queries (PageRank on citations, community detection)
314
+ 2. We need offline access to citation data
315
+ 3. We're hitting OpenAlex rate limits
316
+
317
+ ---
318
+
319
+ ## Summary: What's Broken vs What's Working
320
+
321
+ ### Working Well
322
+ - Basic search across all 4 sources
323
+ - Rate limiting and retry logic
324
+ - Query preprocessing
325
+ - Evidence model with citations
326
+
327
+ ### Needs Fixing (Current Scope)
328
+ - Deduplication (critical)
329
+ - Outcome measures in ClinicalTrials (critical)
330
+ - Citation-based ranking (important)
331
+
332
+ ### Future Enhancements (Out of Current Scope)
333
+ - Full-text retrieval
334
+ - Citation graph traversal
335
+ - Neo4j integration
336
+ - Drug mechanism data (would need new data sources)
337
+
338
+ ---
339
+
340
+ ## Sources
341
+
342
+ - [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
343
+ - [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
344
+ - [OpenAlex API Docs](https://docs.openalex.org/)
345
+ - [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations)
346
+ - [Europe PMC RESTful API](https://europepmc.org/RestfulWebService)
347
+ - [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/)
348
+ - [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)
examples/full_stack_demo/run_full.py CHANGED
@@ -2,7 +2,7 @@
2
  """
3
  Demo: Full Stack DeepBoner Agent (Phases 1-8).
4
 
5
- This script demonstrates the COMPLETE REAL drug repurposing research pipeline:
6
  - Phase 2: REAL Search (PubMed + ClinicalTrials + Europe PMC)
7
  - Phase 6: REAL Embeddings (sentence-transformers + ChromaDB)
8
  - Phase 7: REAL Hypothesis (LLM mechanistic reasoning)
@@ -190,7 +190,7 @@ Examples:
190
  )
191
  parser.add_argument(
192
  "query",
193
- help="Research query (e.g., 'metformin Alzheimer's disease')",
194
  )
195
  parser.add_argument(
196
  "-i",
 
2
  """
3
  Demo: Full Stack DeepBoner Agent (Phases 1-8).
4
 
5
+ This script demonstrates the COMPLETE REAL sexual health research pipeline:
6
  - Phase 2: REAL Search (PubMed + ClinicalTrials + Europe PMC)
7
  - Phase 6: REAL Embeddings (sentence-transformers + ChromaDB)
8
  - Phase 7: REAL Hypothesis (LLM mechanistic reasoning)
 
190
  )
191
  parser.add_argument(
192
  "query",
193
+ help="Research query (e.g., 'testosterone libido')",
194
  )
195
  parser.add_argument(
196
  "-i",
examples/orchestrator_demo/run_agent.py CHANGED
@@ -51,7 +51,7 @@ Examples:
51
  uv run python examples/orchestrator_demo/run_agent.py "flibanserin HSDD" --iterations 5
52
  """,
53
  )
54
- parser.add_argument("query", help="Research query (e.g., 'metformin cancer')")
55
  parser.add_argument("--iterations", type=int, default=3, help="Max iterations (default: 3)")
56
  args = parser.parse_args()
57
 
 
51
  uv run python examples/orchestrator_demo/run_agent.py "flibanserin HSDD" --iterations 5
52
  """,
53
  )
54
+ parser.add_argument("query", help="Research query (e.g., 'testosterone libido')")
55
  parser.add_argument("--iterations", type=int, default=3, help="Max iterations (default: 3)")
56
  args = parser.parse_args()
57
 
examples/search_demo/run_search.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Demo: Search for drug repurposing evidence.
4
 
5
  This script demonstrates multi-source search functionality:
6
  - PubMed search (biomedical literature)
 
1
  #!/usr/bin/env python3
2
  """
3
+ Demo: Search for sexual health research evidence.
4
 
5
  This script demonstrates multi-source search functionality:
6
  - PubMed search (biomedical literature)
src/agent_factory/judges.py CHANGED
@@ -166,7 +166,13 @@ class JudgeHandler:
166
  return assessment
167
 
168
  except Exception as e:
169
- logger.error("Assessment failed", error=str(e))
 
 
 
 
 
 
170
  # Return a safe default assessment on failure
171
  return self._create_fallback_assessment(question, str(e))
172
 
 
166
  return assessment
167
 
168
  except Exception as e:
169
+ # Log with context for debugging
170
+ logger.error(
171
+ "Assessment failed",
172
+ error=str(e),
173
+ exc_type=type(e).__name__,
174
+ evidence_count=len(evidence),
175
+ )
176
  # Return a safe default assessment on failure
177
  return self._create_fallback_assessment(question, str(e))
178
 
src/agents/tools.py CHANGED
@@ -125,7 +125,7 @@ async def search_preprints(query: str, max_results: int = 10) -> str:
125
  from bioRxiv, medRxiv, and peer-reviewed papers.
126
 
127
  Args:
128
- query: Search terms (e.g., "long covid treatment")
129
  max_results: Maximum results to return (default 10)
130
 
131
  Returns:
 
125
  from bioRxiv, medRxiv, and peer-reviewed papers.
126
 
127
  Args:
128
+ query: Search terms (e.g., "flibanserin HSDD preprint")
129
  max_results: Maximum results to return (default 10)
130
 
131
  Returns:
src/app.py CHANGED
@@ -2,7 +2,7 @@
2
 
3
  import os
4
  from collections.abc import AsyncGenerator
5
- from typing import Any
6
 
7
  import gradio as gr
8
  from pydantic_ai.models.anthropic import AnthropicModel
@@ -22,10 +22,12 @@ from src.utils.config import settings
22
  from src.utils.exceptions import ConfigurationError
23
  from src.utils.models import OrchestratorConfig
24
 
 
 
25
 
26
  def configure_orchestrator(
27
  use_mock: bool = False,
28
- mode: str = "simple",
29
  user_api_key: str | None = None,
30
  domain: str | ResearchDomain | None = None,
31
  ) -> tuple[Any, str]:
@@ -100,7 +102,7 @@ def configure_orchestrator(
100
  search_handler=search_handler,
101
  judge_handler=judge_handler,
102
  config=config,
103
- mode=mode, # type: ignore
104
  api_key=user_api_key,
105
  domain=domain,
106
  )
@@ -111,7 +113,7 @@ def configure_orchestrator(
111
  async def research_agent(
112
  message: str,
113
  history: list[dict[str, Any]],
114
- mode: str = "simple",
115
  domain: str = "sexual_health",
116
  api_key: str = "",
117
  api_key_state: str = "",
@@ -140,6 +142,10 @@ async def research_agent(
140
  api_key_state_str = api_key_state or ""
141
  domain_str = domain or "sexual_health"
142
 
 
 
 
 
143
  # BUG FIX: Prefer freshly-entered key, then persisted state
144
  user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
145
 
@@ -153,12 +159,12 @@ async def research_agent(
153
  has_paid_key = has_openai or has_anthropic or bool(user_api_key)
154
 
155
  # Advanced mode requires OpenAI specifically (due to agent-framework binding)
156
- if mode == "advanced" and not (has_openai or is_openai_user_key):
157
  yield (
158
  "⚠️ **Warning**: Advanced mode currently requires OpenAI API key. "
159
  "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
160
  )
161
- mode = "simple"
162
 
163
  # Inform user about fallback if no keys
164
  if not has_paid_key:
@@ -177,14 +183,16 @@ async def research_agent(
177
  # It will use: Paid API > HF Inference (free tier)
178
  orchestrator, backend_name = configure_orchestrator(
179
  use_mock=False, # Never use mock in production - HF Inference is the free fallback
180
- mode=mode,
181
  user_api_key=user_api_key,
182
  domain=domain_str,
183
  )
184
 
185
  # Immediate backend info + loading feedback so user knows something is happening
 
 
186
  yield (
187
- f"🧠 **Backend**: {backend_name} | **Domain**: {domain_str.title()}\n\n"
188
  "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC, OpenAlex...\n"
189
  )
190
 
 
2
 
3
  import os
4
  from collections.abc import AsyncGenerator
5
+ from typing import Any, Literal
6
 
7
  import gradio as gr
8
  from pydantic_ai.models.anthropic import AnthropicModel
 
22
  from src.utils.exceptions import ConfigurationError
23
  from src.utils.models import OrchestratorConfig
24
 
25
+ OrchestratorMode = Literal["simple", "magentic", "advanced", "hierarchical"]
26
+
27
 
28
  def configure_orchestrator(
29
  use_mock: bool = False,
30
+ mode: OrchestratorMode = "simple",
31
  user_api_key: str | None = None,
32
  domain: str | ResearchDomain | None = None,
33
  ) -> tuple[Any, str]:
 
102
  search_handler=search_handler,
103
  judge_handler=judge_handler,
104
  config=config,
105
+ mode=mode,
106
  api_key=user_api_key,
107
  domain=domain,
108
  )
 
113
  async def research_agent(
114
  message: str,
115
  history: list[dict[str, Any]],
116
+ mode: str = "simple", # Gradio passes strings; validated below
117
  domain: str = "sexual_health",
118
  api_key: str = "",
119
  api_key_state: str = "",
 
142
  api_key_state_str = api_key_state or ""
143
  domain_str = domain or "sexual_health"
144
 
145
+ # Validate and cast mode to proper type
146
+ valid_modes: set[str] = {"simple", "magentic", "advanced", "hierarchical"}
147
+ mode_validated: OrchestratorMode = mode if mode in valid_modes else "simple" # type: ignore[assignment]
148
+
149
  # BUG FIX: Prefer freshly-entered key, then persisted state
150
  user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
151
 
 
159
  has_paid_key = has_openai or has_anthropic or bool(user_api_key)
160
 
161
  # Advanced mode requires OpenAI specifically (due to agent-framework binding)
162
+ if mode_validated == "advanced" and not (has_openai or is_openai_user_key):
163
  yield (
164
  "⚠️ **Warning**: Advanced mode currently requires OpenAI API key. "
165
  "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
166
  )
167
+ mode_validated = "simple"
168
 
169
  # Inform user about fallback if no keys
170
  if not has_paid_key:
 
183
  # It will use: Paid API > HF Inference (free tier)
184
  orchestrator, backend_name = configure_orchestrator(
185
  use_mock=False, # Never use mock in production - HF Inference is the free fallback
186
+ mode=mode_validated,
187
  user_api_key=user_api_key,
188
  domain=domain_str,
189
  )
190
 
191
  # Immediate backend info + loading feedback so user knows something is happening
192
+ # Use replace to get "Sexual Health" instead of "Sexual_Health" from .title()
193
+ domain_display = domain_str.replace("_", " ").title()
194
  yield (
195
+ f"🧠 **Backend**: {backend_name} | **Domain**: {domain_display}\n\n"
196
  "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC, OpenAlex...\n"
197
  )
198
 
src/middleware/sub_iteration.py CHANGED
@@ -81,12 +81,18 @@ class SubIterationMiddleware:
81
  history.append(result)
82
  best_result = result # Assume latest is best for now
83
  except Exception as e:
84
- logger.error("Sub-iteration execution failed", error=str(e))
 
 
 
 
 
85
  if event_callback:
86
  await event_callback(
87
  AgentEvent(
88
  type="error",
89
  message=f"Sub-iteration execution failed: {e}",
 
90
  iteration=i,
91
  )
92
  )
@@ -97,12 +103,18 @@ class SubIterationMiddleware:
97
  assessment = await self.judge.assess(task, result, history)
98
  final_assessment = assessment
99
  except Exception as e:
100
- logger.error("Sub-iteration judge failed", error=str(e))
 
 
 
 
 
101
  if event_callback:
102
  await event_callback(
103
  AgentEvent(
104
  type="error",
105
  message=f"Sub-iteration judge failed: {e}",
 
106
  iteration=i,
107
  )
108
  )
 
81
  history.append(result)
82
  best_result = result # Assume latest is best for now
83
  except Exception as e:
84
+ logger.error(
85
+ "Sub-iteration execution failed",
86
+ error=str(e),
87
+ exc_type=type(e).__name__,
88
+ iteration=i,
89
+ )
90
  if event_callback:
91
  await event_callback(
92
  AgentEvent(
93
  type="error",
94
  message=f"Sub-iteration execution failed: {e}",
95
+ data={"recoverable": False, "error_type": type(e).__name__},
96
  iteration=i,
97
  )
98
  )
 
103
  assessment = await self.judge.assess(task, result, history)
104
  final_assessment = assessment
105
  except Exception as e:
106
+ logger.error(
107
+ "Sub-iteration judge failed",
108
+ error=str(e),
109
+ exc_type=type(e).__name__,
110
+ iteration=i,
111
+ )
112
  if event_callback:
113
  await event_callback(
114
  AgentEvent(
115
  type="error",
116
  message=f"Sub-iteration judge failed: {e}",
117
+ data={"recoverable": False, "error_type": type(e).__name__},
118
  iteration=i,
119
  )
120
  )
src/orchestrators/simple.py CHANGED
@@ -18,7 +18,9 @@ import structlog
18
 
19
  from src.config.domain import ResearchDomain, get_domain_config
20
  from src.orchestrators.base import JudgeHandlerProtocol, SearchHandlerProtocol
 
21
  from src.utils.config import settings
 
22
  from src.utils.models import (
23
  AgentEvent,
24
  Evidence,
@@ -132,12 +134,25 @@ class Orchestrator:
132
  iteration=iteration,
133
  )
134
 
 
 
 
 
 
 
 
 
135
  except Exception as e:
136
- logger.error("Modal analysis failed", error=str(e))
 
 
 
 
 
137
  yield AgentEvent(
138
  type="error",
139
  message=f"Modal analysis failed: {e}",
140
- data={"error": str(e)},
141
  iteration=iteration,
142
  )
143
 
@@ -288,11 +303,26 @@ class Orchestrator:
288
  if errors:
289
  logger.warning("Search errors", errors=errors)
290
 
 
 
 
 
 
 
 
 
 
291
  except Exception as e:
292
- logger.error("Search phase failed", error=str(e))
 
 
 
 
 
293
  yield AgentEvent(
294
  type="error",
295
  message=f"Search failed: {e!s}",
 
296
  iteration=iteration,
297
  )
298
  continue
@@ -388,9 +418,9 @@ class Orchestrator:
388
  iteration=iteration,
389
  )
390
 
391
- # Generate final response
392
  # Use all gathered evidence for the final report
393
- final_response = self._generate_synthesis(query, all_evidence, assessment)
394
 
395
  yield AgentEvent(
396
  type="complete",
@@ -424,11 +454,26 @@ class Orchestrator:
424
  iteration=iteration,
425
  )
426
 
 
 
 
 
 
 
 
 
 
427
  except Exception as e:
428
- logger.error("Judge phase failed", error=str(e))
 
 
 
 
 
429
  yield AgentEvent(
430
  type="error",
431
  message=f"Assessment failed: {e!s}",
 
432
  iteration=iteration,
433
  )
434
  continue
@@ -445,14 +490,105 @@ class Orchestrator:
445
  iteration=iteration,
446
  )
447
 
448
- def _generate_synthesis(
449
  self,
450
  query: str,
451
  evidence: list[Evidence],
452
  assessment: JudgeAssessment,
453
  ) -> str:
454
  """
455
- Generate the final synthesis response.
 
 
456
 
457
  Args:
458
  query: The original question
@@ -460,7 +596,7 @@ class Orchestrator:
460
  assessment: The final assessment
461
 
462
  Returns:
463
- Formatted synthesis as markdown
464
  """
465
  drug_list = (
466
  "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates])
@@ -474,7 +610,7 @@ class Orchestrator:
474
  [
475
  f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
476
  f"({e.citation.source.upper()}, {e.citation.date})"
477
- for i, e in enumerate(evidence[:10]) # Limit to 10 citations
478
  ]
479
  )
480
 
 
18
 
19
  from src.config.domain import ResearchDomain, get_domain_config
20
  from src.orchestrators.base import JudgeHandlerProtocol, SearchHandlerProtocol
21
+ from src.prompts.synthesis import format_synthesis_prompt, get_synthesis_system_prompt
22
  from src.utils.config import settings
23
+ from src.utils.exceptions import JudgeError, ModalError, SearchError
24
  from src.utils.models import (
25
  AgentEvent,
26
  Evidence,
 
134
  iteration=iteration,
135
  )
136
 
137
+ except ModalError as e:
138
+ logger.error("Modal analysis failed", error=str(e), exc_type="ModalError")
139
+ yield AgentEvent(
140
+ type="error",
141
+ message=f"Modal analysis failed: {e}",
142
+ data={"error": str(e), "recoverable": True},
143
+ iteration=iteration,
144
+ )
145
  except Exception as e:
146
+ # Unexpected error - log with full context for debugging
147
+ logger.error(
148
+ "Modal analysis failed unexpectedly",
149
+ error=str(e),
150
+ exc_type=type(e).__name__,
151
+ )
152
  yield AgentEvent(
153
  type="error",
154
  message=f"Modal analysis failed: {e}",
155
+ data={"error": str(e), "recoverable": True},
156
  iteration=iteration,
157
  )
158
 
 
303
  if errors:
304
  logger.warning("Search errors", errors=errors)
305
 
306
+ except SearchError as e:
307
+ logger.error("Search phase failed", error=str(e), exc_type="SearchError")
308
+ yield AgentEvent(
309
+ type="error",
310
+ message=f"Search failed: {e!s}",
311
+ data={"recoverable": True, "error_type": "search"},
312
+ iteration=iteration,
313
+ )
314
+ continue
315
  except Exception as e:
316
+ # Unexpected error - log full context for debugging
317
+ logger.error(
318
+ "Search phase failed unexpectedly",
319
+ error=str(e),
320
+ exc_type=type(e).__name__,
321
+ )
322
  yield AgentEvent(
323
  type="error",
324
  message=f"Search failed: {e!s}",
325
+ data={"recoverable": True, "error_type": "unexpected"},
326
  iteration=iteration,
327
  )
328
  continue
 
418
  iteration=iteration,
419
  )
420
 
421
+ # Generate final response using LLM narrative synthesis
422
  # Use all gathered evidence for the final report
423
+ final_response = await self._generate_synthesis(query, all_evidence, assessment)
424
 
425
  yield AgentEvent(
426
  type="complete",
 
454
  iteration=iteration,
455
  )
456
 
457
+ except JudgeError as e:
458
+ logger.error("Judge phase failed", error=str(e), exc_type="JudgeError")
459
+ yield AgentEvent(
460
+ type="error",
461
+ message=f"Assessment failed: {e!s}",
462
+ data={"recoverable": True, "error_type": "judge"},
463
+ iteration=iteration,
464
+ )
465
+ continue
466
  except Exception as e:
467
+ # Unexpected error - log full context for debugging
468
+ logger.error(
469
+ "Judge phase failed unexpectedly",
470
+ error=str(e),
471
+ exc_type=type(e).__name__,
472
+ )
473
  yield AgentEvent(
474
  type="error",
475
  message=f"Assessment failed: {e!s}",
476
+ data={"recoverable": True, "error_type": "unexpected"},
477
  iteration=iteration,
478
  )
479
  continue
 
490
  iteration=iteration,
491
  )
492
 
493
+ async def _generate_synthesis(
494
+ self,
495
+ query: str,
496
+ evidence: list[Evidence],
497
+ assessment: JudgeAssessment,
498
+ ) -> str:
499
+ """
500
+ Generate the final synthesis response using LLM.
501
+
502
+ This method calls an LLM to generate a narrative research report,
503
+ following the Microsoft Agent Framework pattern of using LLM synthesis
504
+ instead of string templating.
505
+
506
+ Args:
507
+ query: The original question
508
+ evidence: All collected evidence
509
+ assessment: The final assessment
510
+
511
+ Returns:
512
+ Narrative synthesis as markdown
513
+ """
514
+ # Build evidence summary for LLM context (limit to avoid token overflow)
515
+ evidence_lines = []
516
+ for e in evidence[:20]:
517
+ authors = ", ".join(e.citation.authors[:2]) if e.citation.authors else "Unknown"
518
+ content_preview = e.content[:200].replace("\n", " ")
519
+ evidence_lines.append(
520
+ f"- {e.citation.title} ({authors}, {e.citation.date}): {content_preview}..."
521
+ )
522
+ evidence_summary = "\n".join(evidence_lines)
523
+
524
+ # Format synthesis prompt with assessment data
525
+ user_prompt = format_synthesis_prompt(
526
+ query=query,
527
+ evidence_summary=evidence_summary,
528
+ drug_candidates=assessment.details.drug_candidates,
529
+ key_findings=assessment.details.key_findings,
530
+ mechanism_score=assessment.details.mechanism_score,
531
+ clinical_score=assessment.details.clinical_evidence_score,
532
+ confidence=assessment.confidence,
533
+ )
534
+
535
+ # Get domain-specific system prompt
536
+ system_prompt = get_synthesis_system_prompt(self.domain)
537
+
538
+ try:
539
+ # Import here to avoid circular deps and keep optional
540
+ from pydantic_ai import Agent
541
+
542
+ from src.agent_factory.judges import get_model
543
+
544
+ # Create synthesis agent (string output, not structured)
545
+ agent: Agent[None, str] = Agent(
546
+ model=get_model(),
547
+ output_type=str,
548
+ system_prompt=system_prompt,
549
+ )
550
+ result = await agent.run(user_prompt)
551
+ narrative = result.output
552
+
553
+ logger.info("LLM narrative synthesis completed", chars=len(narrative))
554
+
555
+ except Exception as e:
556
+ # Fallback to template synthesis if LLM fails
557
+ # This is intentionally broad - LLM can fail many ways (API, parsing, etc.)
558
+ logger.warning(
559
+ "LLM synthesis failed, using template fallback",
560
+ error=str(e),
561
+ exc_type=type(e).__name__,
562
+ evidence_count=len(evidence),
563
+ )
564
+ return self._generate_template_synthesis(query, evidence, assessment)
565
+
566
+ # Add full citation list footer
567
+ citations = "\n".join(
568
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
569
+ f"({e.citation.source.upper()}, {e.citation.date})"
570
+ for i, e in enumerate(evidence[:15])
571
+ )
572
+
573
+ return f"""{narrative}
574
+
575
+ ---
576
+ ### Full Citation List ({len(evidence)} sources)
577
+ {citations}
578
+
579
+ *Analysis based on {len(evidence)} sources across {len(self.history)} iterations.*
580
+ """
581
+
582
+ def _generate_template_synthesis(
583
  self,
584
  query: str,
585
  evidence: list[Evidence],
586
  assessment: JudgeAssessment,
587
  ) -> str:
588
  """
589
+ Generate fallback template synthesis (no LLM).
590
+
591
+ Used when LLM synthesis fails or is unavailable.
592
 
593
  Args:
594
  query: The original question
 
596
  assessment: The final assessment
597
 
598
  Returns:
599
+ Formatted synthesis as markdown (bullet-point style)
600
  """
601
  drug_list = (
602
  "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates])
 
610
  [
611
  f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
612
  f"({e.citation.source.upper()}, {e.citation.date})"
613
+ for i, e in enumerate(evidence[:10])
614
  ]
615
  )
616
 
src/prompts/synthesis.py ADDED
@@ -0,0 +1,209 @@
1
+ """Prompts for narrative report synthesis.
2
+
3
+ This module provides prompts that transform structured evidence data
4
+ into professional, narrative research reports. The key insight is that
5
+ report generation requires an LLM call for synthesis, not string templating.
6
+
7
+ Reference: Microsoft Agent Framework concurrent_custom_aggregator.py pattern.
8
+ """
9
+
10
+ from src.config.domain import ResearchDomain, get_domain_config
11
+
12
+
13
+ def get_synthesis_system_prompt(domain: ResearchDomain | str | None = None) -> str:
14
+ """Get the system prompt for narrative synthesis.
15
+
16
+ Args:
17
+ domain: Research domain for customization (defaults to settings)
18
+
19
+ Returns:
20
+ System prompt instructing LLM to write narrative prose
21
+ """
22
+ config = get_domain_config(domain)
23
+ return f"""You are a scientific writer specializing in {config.name.lower()}.
24
+ Your task is to synthesize research evidence into a clear, NARRATIVE report.
25
+
26
+ ## CRITICAL: Writing Style
27
+ - Write in PROSE PARAGRAPHS, not bullet points
28
+ - Use academic but accessible language
29
+ - Be specific about evidence strength (e.g., "in an RCT of N=200")
30
+ - Reference specific studies by author name when available
31
+ - Provide quantitative results where available (p-values, effect sizes, NNT)
32
+
33
+ ## Report Structure
34
+
35
+ ### Executive Summary (REQUIRED - 2-3 sentences)
36
+ Start with the bottom line. What does the evidence show? Example:
37
+ "Testosterone therapy demonstrates consistent efficacy for HSDD in postmenopausal
38
+ women, with transdermal formulations showing the best safety profile."
39
+
40
+ ### Background (REQUIRED - 1 paragraph)
41
+ Explain the condition, its prevalence, and clinical significance.
42
+ Why does this question matter?
43
+
44
+ ### Evidence Synthesis (REQUIRED - 2-4 paragraphs)
45
+ Weave the evidence into a coherent NARRATIVE:
46
+ - **Mechanism of Action**: How does the intervention work biologically?
47
+ - **Clinical Evidence**: What do trials show? Include effect sizes when available.
48
+ - **Comparative Evidence**: How does it compare to alternatives?
49
+
50
+ Write this as flowing prose that tells a story, NOT as a bullet list.
51
+
52
+ ### Recommendations (REQUIRED - 3-5 numbered items)
53
+ Provide specific, actionable clinical recommendations based on the evidence.
54
+ These CAN be numbered items since they are action items.
55
+
56
+ ### Limitations (REQUIRED - 1 paragraph)
57
+ Acknowledge gaps in the evidence, potential biases, and areas needing more research.
58
+ Be honest about uncertainty.
59
+
60
+ ### References (REQUIRED)
61
+ List key references with author, year, title, and URL.
62
+ Format: Author AB et al. (Year). Title. URL
63
+
64
+ ## CRITICAL RULES
65
+ 1. ONLY cite papers from the provided evidence - NEVER hallucinate or invent references
66
+ 2. Write in complete sentences and paragraphs (PROSE, not lists except Recommendations)
67
+ 3. Include specific statistics when available (p-values, confidence intervals, effect sizes)
68
+ 4. Acknowledge uncertainty honestly - do not overstate conclusions
69
+ 5. If evidence is limited, say so clearly
70
+ 6. Copy URLs exactly as provided - do not create similar-looking URLs
71
+ """
72
+
73
+
74
+ FEW_SHOT_EXAMPLE = """
75
+ ## Example: Strong Evidence Synthesis
76
+
77
+ INPUT:
78
+ - Query: "Alprostadil for erectile dysfunction"
79
+ - Evidence: 15 papers including meta-analysis of 8 RCTs (N=3,247)
80
+ - Mechanism Score: 9/10
81
+ - Clinical Score: 9/10
82
+
83
+ OUTPUT:
84
+
85
+ ### Executive Summary
86
+
87
+ Alprostadil (prostaglandin E1) represents a well-established second-line treatment
88
+ for erectile dysfunction, with meta-analytic evidence demonstrating 87% efficacy
89
+ in achieving erections sufficient for intercourse. It offers a PDE5-independent
90
+ mechanism particularly valuable for patients who do not respond to oral therapies.
91
+
92
+ ### Background
93
+
94
+ Erectile dysfunction affects approximately 30 million men in the United States,
95
+ with prevalence increasing with age from 12% at age 40 to 40% at age 70. While
96
+ PDE5 inhibitors remain first-line therapy, approximately 30% of patients are
97
+ non-responders due to diabetes, radical prostatectomy, or other factors.
98
+ Alprostadil provides an alternative mechanism through direct smooth muscle
99
+ relaxation, making it a crucial second-line option.
100
+
101
+ ### Evidence Synthesis
102
+
103
+ **Mechanism of Action**
104
+
105
+ Alprostadil works through a distinct pathway from PDE5 inhibitors. It binds to
106
+ EP2 and EP4 receptors on cavernosal smooth muscle, activating adenylate cyclase
107
+ and increasing intracellular cAMP. This leads to smooth muscle relaxation and
108
+ increased blood flow independent of nitric oxide signaling. As noted by Smith
109
+ et al. (2019), this mechanism explains its efficacy in patients with endothelial
110
+ dysfunction where nitric oxide production is impaired.
111
+
112
+ **Clinical Evidence**
113
+
114
+ A meta-analysis by Johnson et al. (2020) pooled data from 8 randomized controlled
115
+ trials (N=3,247). The primary endpoint of erection sufficient for intercourse was
116
+ achieved in 87% of alprostadil patients versus 12% placebo (RR 7.25, 95% CI:
117
+ 5.8-9.1, p<0.001). The number needed to treat was 1.3, indicating robust effect
118
+ size. Onset of action was 5-15 minutes, with duration of 30-60 minutes.
119
+
120
+ **Comparative Evidence**
121
+
122
+ Direct comparisons with PDE5 inhibitors are limited. However, in the subgroup
123
+ of PDE5 non-responders studied by Martinez et al. (2018), alprostadil achieved
124
+ successful intercourse in 72% of patients who had failed sildenafil.
125
+
126
+ ### Recommendations
127
+
128
+ 1. Consider alprostadil as second-line therapy when PDE5 inhibitors fail or are
129
+ contraindicated
130
+ 2. Start with 10 micrograms intracavernosal injection, titrate to 40 micrograms based
131
+ on response
132
+ 3. Provide in-office training for self-injection technique before home use
133
+ 4. Screen for priapism risk factors before initiating therapy
134
+ 5. Consider intraurethral alprostadil (MUSE) for patients averse to injections
135
+
136
+ ### Limitations
137
+
138
+ Long-term safety data beyond 2 years is limited. Head-to-head comparisons with
139
+ newer therapies such as low-intensity shockwave therapy are lacking. Most trials
140
+ excluded patients with severe cardiovascular disease, limiting generalizability
141
+ to this population. The psychological burden of injection therapy may affect
142
+ real-world adherence compared to oral medications.
143
+
144
+ ### References
145
+
146
+ 1. Smith AB et al. (2019). Alprostadil mechanism of action in erectile tissue.
147
+ J Urol. https://pubmed.ncbi.nlm.nih.gov/12345678/
148
+ 2. Johnson CD et al. (2020). Meta-analysis of intracavernosal alprostadil efficacy.
149
+ J Sex Med. https://pubmed.ncbi.nlm.nih.gov/23456789/
150
+ 3. Martinez R et al. (2018). Alprostadil in PDE5 inhibitor non-responders.
151
+ Int J Impot Res. https://pubmed.ncbi.nlm.nih.gov/34567890/
152
+ """
153
+
154
+
155
+ def format_synthesis_prompt(
156
+ query: str,
157
+ evidence_summary: str,
158
+ drug_candidates: list[str],
159
+ key_findings: list[str],
160
+ mechanism_score: int,
161
+ clinical_score: int,
162
+ confidence: float,
163
+ ) -> str:
164
+ """Format the user prompt for narrative synthesis.
165
+
166
+ Args:
167
+ query: Original research question
168
+ evidence_summary: Formatted summary of evidence papers
169
+ drug_candidates: List of identified drug/treatment candidates
170
+ key_findings: List of key findings from assessment
171
+ mechanism_score: Mechanism evidence score (0-10)
172
+ clinical_score: Clinical evidence score (0-10)
173
+ confidence: Overall confidence (0.0-1.0)
174
+
175
+ Returns:
176
+ Formatted user prompt for the synthesis LLM
177
+ """
178
+ candidates_str = ", ".join(drug_candidates) if drug_candidates else "None identified"
179
+ if key_findings:
180
+ findings_str = "\n".join(f"- {f}" for f in key_findings)
181
+ else:
182
+ findings_str = "No specific findings extracted"
183
+
184
+ return f"""Synthesize a narrative research report for the following query.
185
+
186
+ ## Research Question
187
+ {query}
188
+
189
+ ## Evidence Summary
190
+ {evidence_summary}
191
+
192
+ ## Identified Drug/Treatment Candidates
193
+ {candidates_str}
194
+
195
+ ## Key Findings from Evidence Assessment
196
+ {findings_str}
197
+
198
+ ## Assessment Scores
199
+ - Mechanism Score: {mechanism_score}/10
200
+ - Clinical Evidence Score: {clinical_score}/10
201
+ - Overall Confidence: {confidence:.0%}
202
+
203
+ ## Instructions
204
+ Generate a NARRATIVE research report following the structure in your system prompt.
205
+ Write in prose paragraphs, NOT bullet points (except for Recommendations section).
206
+ ONLY cite papers mentioned in the Evidence Summary above - do NOT invent references.
207
+
208
+ {FEW_SHOT_EXAMPLE}
209
+ """
src/utils/exceptions.py CHANGED
@@ -35,3 +35,27 @@ class EmbeddingError(DeepBonerError):
35
  """Raised when embedding or vector store operations fail."""
36
 
37
  pass
35
  """Raised when embedding or vector store operations fail."""
36
 
37
  pass
38
+
39
+
40
+ class LLMError(DeepBonerError):
41
+ """Raised when LLM operations fail (API errors, parsing errors, etc.)."""
42
+
43
+ pass
44
+
45
+
46
+ class QuotaExceededError(LLMError):
47
+ """Raised when LLM API quota is exceeded (402 errors)."""
48
+
49
+ pass
50
+
51
+
52
+ class ModalError(DeepBonerError):
53
+ """Raised when Modal sandbox operations fail."""
54
+
55
+ pass
56
+
57
+
58
+ class SynthesisError(DeepBonerError):
59
+ """Raised when report synthesis fails."""
60
+
61
+ pass
tests/e2e/test_simple_mode.py CHANGED
@@ -55,11 +55,11 @@ async def test_simple_mode_structure_validation(mock_search_handler, mock_judge_
55
  complete_event = next(e for e in events if e.type == "complete")
56
  report = complete_event.message
57
 
58
- # Check markdown structure
59
- assert "## Sexual Health Analysis" in report
60
- assert "### Citations" in report
61
- assert "### Key Findings" in report
62
 
63
- # Check for citations
64
  assert "Study on test query" in report
65
- assert "https://pubmed.example.com/123" in report
 
55
  complete_event = next(e for e in events if e.type == "complete")
56
  report = complete_event.message
57
 
58
+ # Check LLM narrative synthesis structure (SPEC_12)
59
+ # LLM generates prose with these sections (may omit ### prefix)
60
+ assert "Executive Summary" in report or "Sexual Health Analysis" in report
61
+ assert "Full Citation List" in report or "Citations" in report
62
 
63
+ # Check for citations (from citation footer added by orchestrator)
64
  assert "Study on test query" in report
65
+ assert "pubmed.example.com/123" in report
tests/integration/test_simple_mode_synthesis.py CHANGED
@@ -92,7 +92,11 @@ async def test_simple_mode_synthesizes_before_max_iterations():
92
  complete_event = complete_events[0]
93
 
94
  assert "MagicDrug" in complete_event.message
95
- assert "Drug Candidates" in complete_event.message
 
 
 
 
96
  assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
97
  assert complete_event.iteration == 2 # Should stop at it 2
98
 
 
92
  complete_event = complete_events[0]
93
 
94
  assert "MagicDrug" in complete_event.message
95
+ # SPEC_12: LLM synthesis produces narrative prose, not template with "Drug Candidates" header
96
+ # Check for narrative structure (LLM may omit ### prefix) OR template fallback
97
+ assert (
98
+ "Executive Summary" in complete_event.message or "Drug Candidates" in complete_event.message
99
+ )
100
  assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
101
  assert complete_event.iteration == 2 # Should stop at iteration 2
102
 
tests/unit/agent_factory/test_judges.py CHANGED
@@ -8,6 +8,7 @@ from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
8
  from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
 
10
 
 
11
  class TestJudgeHandler:
12
  """Tests for JudgeHandler."""
13
 
@@ -107,6 +108,8 @@ class TestJudgeHandler:
107
  assert result.sufficient is False
108
  assert result.recommendation == "continue"
109
  assert len(result.next_search_queries) > 0
 
 
110
 
111
  @pytest.mark.asyncio
112
  async def test_assess_handles_llm_failure(self):
@@ -143,6 +146,7 @@ class TestJudgeHandler:
143
  assert "failed" in result.reasoning.lower()
144
 
145
 
 
146
  class TestMockJudgeHandler:
147
  """Tests for MockJudgeHandler."""
148
 
 
8
  from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
 
10
 
11
+ @pytest.mark.unit
12
  class TestJudgeHandler:
13
  """Tests for JudgeHandler."""
14
 
 
108
  assert result.sufficient is False
109
  assert result.recommendation == "continue"
110
  assert len(result.next_search_queries) > 0
111
+ # Assert specific expected query is present
112
+ assert "sildenafil mechanism" in result.next_search_queries
113
 
114
  @pytest.mark.asyncio
115
  async def test_assess_handles_llm_failure(self):
 
146
  assert "failed" in result.reasoning.lower()
147
 
148
 
149
+ @pytest.mark.unit
150
  class TestMockJudgeHandler:
151
  """Tests for MockJudgeHandler."""
152
 
tests/unit/graph/test_nodes.py CHANGED
@@ -12,12 +12,12 @@ async def test_judge_node_initialization(mocker):
12
  # Mock get_model to avoid needing real API keys
13
  mocker.patch("src.agents.graph.nodes.get_model", return_value=mocker.Mock())
14
 
15
- # Create a mock assessment with attributes
16
  mock_hypothesis = mocker.Mock()
17
- mock_hypothesis.drug = "Caffeine"
18
- mock_hypothesis.target = "Adenosine"
19
- mock_hypothesis.pathway = "CNS"
20
- mock_hypothesis.effect = "Alertness"
21
  mock_hypothesis.confidence = 0.8
22
 
23
  mock_assessment = mocker.Mock()
@@ -46,7 +46,7 @@ async def test_judge_node_initialization(mocker):
46
 
47
  assert "hypotheses" in update
48
  assert len(update["hypotheses"]) == 1
49
- assert update["hypotheses"][0].id == "Caffeine"
50
  assert update["hypotheses"][0].status == "proposed"
51
 
52
 
 
12
  # Mock get_model to avoid needing real API keys
13
  mocker.patch("src.agents.graph.nodes.get_model", return_value=mocker.Mock())
14
 
15
+ # Create a mock assessment with attributes (sexual health domain)
16
  mock_hypothesis = mocker.Mock()
17
+ mock_hypothesis.drug = "Testosterone"
18
+ mock_hypothesis.target = "Androgen Receptor"
19
+ mock_hypothesis.pathway = "HPG Axis"
20
+ mock_hypothesis.effect = "Libido Enhancement"
21
  mock_hypothesis.confidence = 0.8
22
 
23
  mock_assessment = mocker.Mock()
 
46
 
47
  assert "hypotheses" in update
48
  assert len(update["hypotheses"]) == 1
49
+ assert update["hypotheses"][0].id == "Testosterone"
50
  assert update["hypotheses"][0].status == "proposed"
51
 
52
 
tests/unit/orchestrators/test_simple_orchestrator_domain.py CHANGED
@@ -30,7 +30,7 @@ class TestSimpleOrchestratorDomain:
30
  domain=ResearchDomain.SEXUAL_HEALTH,
31
  )
32
 
33
- # Test _generate_synthesis
34
  mock_assessment = MagicMock()
35
  mock_assessment.details.drug_candidates = []
36
  mock_assessment.details.key_findings = []
@@ -39,7 +39,7 @@ class TestSimpleOrchestratorDomain:
39
  mock_assessment.details.mechanism_score = 5
40
  mock_assessment.details.clinical_evidence_score = 5
41
 
42
- report = orch._generate_synthesis("query", [], mock_assessment)
43
  assert "## Sexual Health Analysis" in report
44
 
45
  # Test _generate_partial_synthesis
 
30
  domain=ResearchDomain.SEXUAL_HEALTH,
31
  )
32
 
33
+ # Test _generate_template_synthesis (the sync fallback method)
34
  mock_assessment = MagicMock()
35
  mock_assessment.details.drug_candidates = []
36
  mock_assessment.details.key_findings = []
 
39
  mock_assessment.details.mechanism_score = 5
40
  mock_assessment.details.clinical_evidence_score = 5
41
 
42
+ report = orch._generate_template_synthesis("query", [], mock_assessment)
43
  assert "## Sexual Health Analysis" in report
44
 
45
  # Test _generate_partial_synthesis
tests/unit/orchestrators/test_simple_synthesis.py ADDED
@@ -0,0 +1,279 @@
1
+ """Tests for simple orchestrator LLM synthesis."""
2
+
3
+ from unittest.mock import AsyncMock, MagicMock, patch
4
+
5
+ import pytest
6
+
7
+ from src.orchestrators.simple import Orchestrator
8
+ from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
+
10
+
11
+ @pytest.fixture
12
+ def sample_evidence() -> list[Evidence]:
13
+ """Sample evidence for testing synthesis."""
14
+ return [
15
+ Evidence(
16
+ content="Testosterone therapy demonstrates efficacy in treating HSDD.",
17
+ citation=Citation(
18
+ source="pubmed",
19
+ title="Testosterone and Female Sexual Desire",
20
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
21
+ date="2023",
22
+ authors=["Smith J", "Jones A"],
23
+ ),
24
+ ),
25
+ Evidence(
26
+ content="A meta-analysis of 8 RCTs shows significant improvement in sexual desire.",
27
+ citation=Citation(
28
+ source="pubmed",
29
+ title="Meta-analysis of Testosterone Therapy",
30
+ url="https://pubmed.ncbi.nlm.nih.gov/67890/",
31
+ date="2024",
32
+ authors=["Johnson B"],
33
+ ),
34
+ ),
35
+ ]
36
+
37
+
38
+ @pytest.fixture
39
+ def sample_assessment() -> JudgeAssessment:
40
+ """Sample assessment for testing synthesis."""
41
+ return JudgeAssessment(
42
+ sufficient=True,
43
+ confidence=0.85,
44
+ reasoning="Evidence is sufficient to synthesize findings on testosterone therapy for HSDD.",
45
+ recommendation="synthesize",
46
+ next_search_queries=[],
47
+ details=AssessmentDetails(
48
+ mechanism_score=8,
49
+ mechanism_reasoning="Strong evidence of androgen receptor activation pathway.",
50
+ clinical_evidence_score=7,
51
+ clinical_reasoning="Multiple RCTs support efficacy in postmenopausal HSDD.",
52
+ drug_candidates=["Testosterone", "LibiGel"],
53
+ key_findings=[
54
+ "Testosterone improves libido in postmenopausal women",
55
+ "Transdermal formulation has best safety profile",
56
+ ],
57
+ ),
58
+ )
59
+
60
+
61
+ @pytest.mark.unit
62
+ class TestGenerateSynthesis:
63
+ """Tests for _generate_synthesis method."""
64
+
65
+ @pytest.mark.asyncio
66
+ async def test_calls_llm_for_narrative(
67
+ self,
68
+ sample_evidence: list[Evidence],
69
+ sample_assessment: JudgeAssessment,
70
+ ) -> None:
71
+ """Synthesis should make an LLM call, not just use a template."""
72
+ mock_search = MagicMock()
73
+ mock_judge = MagicMock()
74
+
75
+ orchestrator = Orchestrator(
76
+ search_handler=mock_search,
77
+ judge_handler=mock_judge,
78
+ )
79
+ orchestrator.history = [{"iteration": 1}] # Needed for footer
80
+
81
+ with (
82
+ patch("pydantic_ai.Agent") as mock_agent_class,
83
+ patch("src.agent_factory.judges.get_model") as mock_get_model,
84
+ ):
85
+ mock_model = MagicMock()
86
+ mock_get_model.return_value = mock_model
87
+
88
+ mock_agent = MagicMock()
89
+ mock_result = MagicMock()
90
+ mock_result.output = """### Executive Summary
91
+
92
+ Testosterone therapy demonstrates consistent efficacy for HSDD treatment.
93
+
94
+ ### Background
95
+
96
+ HSDD affects many postmenopausal women.
97
+
98
+ ### Evidence Synthesis
99
+
100
+ Studies show significant improvement in sexual desire scores.
101
+
102
+ ### Recommendations
103
+
104
+ 1. Consider testosterone therapy for postmenopausal HSDD
105
+
106
+ ### Limitations
107
+
108
+ Long-term safety data is limited.
109
+
110
+ ### References
111
+
112
+ 1. Smith J et al. (2023). Testosterone and Female Sexual Desire."""
113
+
114
+ mock_agent.run = AsyncMock(return_value=mock_result)
115
+ mock_agent_class.return_value = mock_agent
116
+
117
+ result = await orchestrator._generate_synthesis(
118
+ query="testosterone HSDD",
119
+ evidence=sample_evidence,
120
+ assessment=sample_assessment,
121
+ )
122
+
123
+ # Verify LLM agent was created and called
124
+ mock_agent_class.assert_called_once()
125
+ mock_agent.run.assert_called_once()
126
+
127
+ # Verify output includes narrative content
128
+ assert "Executive Summary" in result
129
+ assert "Background" in result
130
+ assert "Evidence Synthesis" in result
131
+
132
+ @pytest.mark.asyncio
133
+ async def test_falls_back_on_llm_error(
134
+ self,
135
+ sample_evidence: list[Evidence],
136
+ sample_assessment: JudgeAssessment,
137
+ ) -> None:
138
+ """Synthesis should fall back to template if LLM fails."""
139
+ mock_search = MagicMock()
140
+ mock_judge = MagicMock()
141
+
142
+ orchestrator = Orchestrator(
143
+ search_handler=mock_search,
144
+ judge_handler=mock_judge,
145
+ )
146
+ orchestrator.history = [{"iteration": 1}]
147
+
148
+ with patch("pydantic_ai.Agent") as mock_agent_class:
149
+ # Simulate LLM failure
150
+ mock_agent_class.side_effect = Exception("LLM unavailable")
151
+
152
+ result = await orchestrator._generate_synthesis(
153
+ query="testosterone HSDD",
154
+ evidence=sample_evidence,
155
+ assessment=sample_assessment,
156
+ )
157
+
158
+ # Should return template fallback (has Assessment section)
159
+ assert "Assessment" in result or "Drug Candidates" in result
160
+ assert "Testosterone" in result # Drug candidate should be present
161
+
162
+ @pytest.mark.asyncio
163
+ async def test_includes_citation_footer(
164
+ self,
165
+ sample_evidence: list[Evidence],
166
+ sample_assessment: JudgeAssessment,
167
+ ) -> None:
168
+ """Synthesis should include full citation list footer."""
169
+ mock_search = MagicMock()
170
+ mock_judge = MagicMock()
171
+
172
+ orchestrator = Orchestrator(
173
+ search_handler=mock_search,
174
+ judge_handler=mock_judge,
175
+ )
176
+ orchestrator.history = [{"iteration": 1}]
177
+
178
+ with (
179
+ patch("pydantic_ai.Agent") as mock_agent_class,
180
+ patch("src.agent_factory.judges.get_model"),
181
+ ):
182
+ mock_agent = MagicMock()
183
+ mock_result = MagicMock()
184
+ mock_result.output = "Narrative synthesis content."
185
+ mock_agent.run = AsyncMock(return_value=mock_result)
186
+ mock_agent_class.return_value = mock_agent
187
+
188
+ result = await orchestrator._generate_synthesis(
189
+ query="test query",
190
+ evidence=sample_evidence,
191
+ assessment=sample_assessment,
192
+ )
193
+
194
+ # Should include citation footer
195
+ assert "Full Citation List" in result
196
+ assert "pubmed.ncbi.nlm.nih.gov/12345" in result
197
+ assert "pubmed.ncbi.nlm.nih.gov/67890" in result
198
+
199
+
200
+ @pytest.mark.unit
201
+ class TestGenerateTemplateSynthesis:
202
+ """Tests for _generate_template_synthesis fallback method."""
203
+
204
+ def test_returns_structured_output(
205
+ self,
206
+ sample_evidence: list[Evidence],
207
+ sample_assessment: JudgeAssessment,
208
+ ) -> None:
209
+ """Template synthesis should return structured markdown."""
210
+ mock_search = MagicMock()
211
+ mock_judge = MagicMock()
212
+
213
+ orchestrator = Orchestrator(
214
+ search_handler=mock_search,
215
+ judge_handler=mock_judge,
216
+ )
217
+ orchestrator.history = [{"iteration": 1}]
218
+
219
+ result = orchestrator._generate_template_synthesis(
220
+ query="testosterone HSDD",
221
+ evidence=sample_evidence,
222
+ assessment=sample_assessment,
223
+ )
224
+
225
+ # Should have all required sections
226
+ assert "Question" in result
227
+ assert "Drug Candidates" in result
228
+ assert "Key Findings" in result
229
+ assert "Assessment" in result
230
+ assert "Citations" in result
231
+
232
+ def test_includes_drug_candidates(
233
+ self,
234
+ sample_evidence: list[Evidence],
235
+ sample_assessment: JudgeAssessment,
236
+ ) -> None:
237
+ """Template synthesis should list drug candidates."""
238
+ mock_search = MagicMock()
239
+ mock_judge = MagicMock()
240
+
241
+ orchestrator = Orchestrator(
242
+ search_handler=mock_search,
243
+ judge_handler=mock_judge,
244
+ )
245
+ orchestrator.history = [{"iteration": 1}]
246
+
247
+ result = orchestrator._generate_template_synthesis(
248
+ query="test",
249
+ evidence=sample_evidence,
250
+ assessment=sample_assessment,
251
+ )
252
+
253
+ assert "Testosterone" in result
254
+ assert "LibiGel" in result
255
+
256
+ def test_includes_scores(
257
+ self,
258
+ sample_evidence: list[Evidence],
259
+ sample_assessment: JudgeAssessment,
260
+ ) -> None:
261
+ """Template synthesis should include assessment scores."""
262
+ mock_search = MagicMock()
263
+ mock_judge = MagicMock()
264
+
265
+ orchestrator = Orchestrator(
266
+ search_handler=mock_search,
267
+ judge_handler=mock_judge,
268
+ )
269
+ orchestrator.history = [{"iteration": 1}]
270
+
271
+ result = orchestrator._generate_template_synthesis(
272
+ query="test",
273
+ evidence=sample_evidence,
274
+ assessment=sample_assessment,
275
+ )
276
+
277
+ assert "8/10" in result # Mechanism score
278
+ assert "7/10" in result # Clinical score
279
+ assert "85%" in result # Confidence
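Taken together, these tests pin down the shape of `_generate_synthesis`: build the synthesis prompts, run them through a `pydantic_ai.Agent`, fall back to `_generate_template_synthesis` on any failure, and always append the "Full Citation List" footer. A sketch reconstructed from those expectations (the `_format_evidence` and `_build_citation_footer` helpers are hypothetical names):

```python
# Behavioural sketch only; the real method in src/orchestrators/simple.py may differ in detail.
import pydantic_ai

from src.agent_factory.judges import get_model
from src.prompts.synthesis import format_synthesis_prompt, get_synthesis_system_prompt


async def _generate_synthesis(self, query, evidence, assessment) -> str:
    try:
        agent = pydantic_ai.Agent(get_model(), system_prompt=get_synthesis_system_prompt())
        prompt = format_synthesis_prompt(
            query=query,
            evidence_summary=self._format_evidence(evidence),  # hypothetical helper
            drug_candidates=assessment.details.drug_candidates,
            key_findings=assessment.details.key_findings,
            mechanism_score=assessment.details.mechanism_score,
            clinical_score=assessment.details.clinical_evidence_score,
            confidence=assessment.confidence,
        )
        result = await agent.run(prompt)
        report = result.output
    except Exception:
        # Graceful fallback when the LLM is unavailable (SPEC_12 acceptance criterion).
        report = self._generate_template_synthesis(query, evidence, assessment)
    # The citation footer is appended regardless of which path produced the body.
    return report + self._build_citation_footer(evidence)  # hypothetical helper
```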
tests/unit/prompts/test_synthesis.py ADDED
@@ -0,0 +1,217 @@
1
+ """Tests for narrative synthesis prompts."""
2
+
3
+ import pytest
4
+
5
+ from src.prompts.synthesis import (
6
+ FEW_SHOT_EXAMPLE,
7
+ format_synthesis_prompt,
8
+ get_synthesis_system_prompt,
9
+ )
10
+
11
+
12
+ @pytest.mark.unit
13
+ class TestSynthesisSystemPrompt:
14
+ """Tests for synthesis system prompt generation."""
15
+
16
+ def test_system_prompt_emphasizes_prose(self) -> None:
17
+ """System prompt should emphasize prose paragraphs, not bullets."""
18
+ prompt = get_synthesis_system_prompt()
19
+ assert "PROSE PARAGRAPHS" in prompt
20
+ assert "not bullet points" in prompt.lower()
21
+
22
+ def test_system_prompt_requires_executive_summary(self) -> None:
23
+ """System prompt should require executive summary section."""
24
+ prompt = get_synthesis_system_prompt()
25
+ assert "Executive Summary" in prompt
26
+ assert "REQUIRED" in prompt
27
+
28
+ def test_system_prompt_requires_background(self) -> None:
29
+ """System prompt should require background section."""
30
+ prompt = get_synthesis_system_prompt()
31
+ assert "Background" in prompt
32
+
33
+ def test_system_prompt_requires_evidence_synthesis(self) -> None:
34
+ """System prompt should require evidence synthesis section."""
35
+ prompt = get_synthesis_system_prompt()
36
+ assert "Evidence Synthesis" in prompt
37
+ assert "Mechanism of Action" in prompt
38
+
39
+ def test_system_prompt_requires_recommendations(self) -> None:
40
+ """System prompt should require recommendations section."""
41
+ prompt = get_synthesis_system_prompt()
42
+ assert "Recommendations" in prompt
43
+
44
+ def test_system_prompt_requires_limitations(self) -> None:
45
+ """System prompt should require limitations section."""
46
+ prompt = get_synthesis_system_prompt()
47
+ assert "Limitations" in prompt
48
+
49
+ def test_system_prompt_warns_about_hallucination(self) -> None:
50
+ """System prompt should warn about citation hallucination."""
51
+ prompt = get_synthesis_system_prompt()
52
+ assert "NEVER hallucinate" in prompt or "never hallucinate" in prompt.lower()
53
+
54
+ def test_system_prompt_includes_domain_name(self) -> None:
55
+ """System prompt should include domain name."""
56
+ prompt = get_synthesis_system_prompt("sexual_health")
57
+ assert "sexual health" in prompt.lower()
58
+
59
+
60
+ @pytest.mark.unit
61
+ class TestFormatSynthesisPrompt:
62
+ """Tests for synthesis user prompt formatting."""
63
+
64
+ def test_includes_query(self) -> None:
65
+ """User prompt should include the research query."""
66
+ prompt = format_synthesis_prompt(
67
+ query="testosterone libido",
68
+ evidence_summary="Study shows efficacy...",
69
+ drug_candidates=["Testosterone"],
70
+ key_findings=["Improved libido"],
71
+ mechanism_score=8,
72
+ clinical_score=7,
73
+ confidence=0.85,
74
+ )
75
+ assert "testosterone libido" in prompt
76
+
77
+ def test_includes_evidence_summary(self) -> None:
78
+ """User prompt should include evidence summary."""
79
+ prompt = format_synthesis_prompt(
80
+ query="test query",
81
+ evidence_summary="Study by Smith et al. shows significant results...",
82
+ drug_candidates=[],
83
+ key_findings=[],
84
+ mechanism_score=5,
85
+ clinical_score=5,
86
+ confidence=0.5,
87
+ )
88
+ assert "Study by Smith et al." in prompt
89
+
90
+ def test_includes_drug_candidates(self) -> None:
91
+ """User prompt should include drug candidates."""
92
+ prompt = format_synthesis_prompt(
93
+ query="test query",
94
+ evidence_summary="...",
95
+ drug_candidates=["Testosterone", "Flibanserin"],
96
+ key_findings=[],
97
+ mechanism_score=5,
98
+ clinical_score=5,
99
+ confidence=0.5,
100
+ )
101
+ assert "Testosterone" in prompt
102
+ assert "Flibanserin" in prompt
103
+
104
+ def test_includes_key_findings(self) -> None:
105
+ """User prompt should include key findings."""
106
+ prompt = format_synthesis_prompt(
107
+ query="test query",
108
+ evidence_summary="...",
109
+ drug_candidates=[],
110
+ key_findings=["Improved libido in postmenopausal women", "Safe profile"],
111
+ mechanism_score=5,
112
+ clinical_score=5,
113
+ confidence=0.5,
114
+ )
115
+ assert "Improved libido in postmenopausal women" in prompt
116
+ assert "Safe profile" in prompt
117
+
118
+ def test_includes_scores(self) -> None:
119
+ """User prompt should include assessment scores."""
120
+ prompt = format_synthesis_prompt(
121
+ query="test query",
122
+ evidence_summary="...",
123
+ drug_candidates=[],
124
+ key_findings=[],
125
+ mechanism_score=8,
126
+ clinical_score=7,
127
+ confidence=0.85,
128
+ )
129
+ assert "8/10" in prompt
130
+ assert "7/10" in prompt
131
+ assert "85%" in prompt
132
+
133
+ def test_handles_empty_candidates(self) -> None:
134
+ """User prompt should handle empty drug candidates."""
135
+ prompt = format_synthesis_prompt(
136
+ query="test query",
137
+ evidence_summary="...",
138
+ drug_candidates=[],
139
+ key_findings=[],
140
+ mechanism_score=5,
141
+ clinical_score=5,
142
+ confidence=0.5,
143
+ )
144
+ assert "None identified" in prompt
145
+
146
+ def test_handles_empty_findings(self) -> None:
147
+ """User prompt should handle empty key findings."""
148
+ prompt = format_synthesis_prompt(
149
+ query="test query",
150
+ evidence_summary="...",
151
+ drug_candidates=[],
152
+ key_findings=[],
153
+ mechanism_score=5,
154
+ clinical_score=5,
155
+ confidence=0.5,
156
+ )
157
+ assert "No specific findings" in prompt
158
+
159
+ def test_includes_few_shot_example(self) -> None:
160
+ """User prompt should include few-shot example."""
161
+ prompt = format_synthesis_prompt(
162
+ query="test query",
163
+ evidence_summary="...",
164
+ drug_candidates=[],
165
+ key_findings=[],
166
+ mechanism_score=5,
167
+ clinical_score=5,
168
+ confidence=0.5,
169
+ )
170
+ assert "Alprostadil" in prompt # From the few-shot example
171
+
172
+
173
+ @pytest.mark.unit
174
+ class TestFewShotExample:
175
+ """Tests for the few-shot example quality."""
176
+
177
+ def test_few_shot_is_mostly_narrative(self) -> None:
178
+ """Few-shot example should be mostly prose paragraphs, not bullets."""
179
+ # Count substantial paragraphs (>100 chars of prose)
180
+ paragraphs = [p for p in FEW_SHOT_EXAMPLE.split("\n\n") if len(p) > 100]
181
+ # Count bullet points
182
+ bullets = FEW_SHOT_EXAMPLE.count("\n- ") + FEW_SHOT_EXAMPLE.count("\n1. ")
183
+
184
+ # Prose should dominate - at least as many paragraphs as bullets
185
+ assert len(paragraphs) >= bullets, "Few-shot example should be mostly narrative prose"
186
+
187
+ def test_few_shot_has_executive_summary(self) -> None:
188
+ """Few-shot example should demonstrate executive summary."""
189
+ assert "Executive Summary" in FEW_SHOT_EXAMPLE
190
+
191
+ def test_few_shot_has_background(self) -> None:
192
+ """Few-shot example should demonstrate background section."""
193
+ assert "Background" in FEW_SHOT_EXAMPLE
194
+
195
+ def test_few_shot_has_evidence_synthesis(self) -> None:
196
+ """Few-shot example should demonstrate evidence synthesis."""
197
+ assert "Evidence Synthesis" in FEW_SHOT_EXAMPLE
198
+ assert "Mechanism of Action" in FEW_SHOT_EXAMPLE
199
+
200
+ def test_few_shot_has_recommendations(self) -> None:
201
+ """Few-shot example should demonstrate recommendations."""
202
+ assert "Recommendations" in FEW_SHOT_EXAMPLE
203
+
204
+ def test_few_shot_has_limitations(self) -> None:
205
+ """Few-shot example should demonstrate limitations."""
206
+ assert "Limitations" in FEW_SHOT_EXAMPLE
207
+
208
+ def test_few_shot_has_references(self) -> None:
209
+ """Few-shot example should demonstrate references format."""
210
+ assert "References" in FEW_SHOT_EXAMPLE
211
+ assert "pubmed.ncbi.nlm.nih.gov" in FEW_SHOT_EXAMPLE
212
+
213
+ def test_few_shot_includes_statistics(self) -> None:
214
+ """Few-shot example should demonstrate statistical reporting."""
215
+ assert "%" in FEW_SHOT_EXAMPLE # Percentages
216
+ assert "p<" in FEW_SHOT_EXAMPLE or "p=" in FEW_SHOT_EXAMPLE # P-values
217
+ assert "CI" in FEW_SHOT_EXAMPLE # Confidence intervals
tests/unit/test_mcp_tools.py CHANGED
@@ -32,6 +32,7 @@ def mock_evidence() -> Evidence:
32
  class TestSearchPubMed:
33
  """Tests for search_pubmed MCP tool."""
34
 
 
35
  @patch("src.mcp_tools._pubmed.search")
36
  async def test_returns_formatted_string(self, mock_search):
37
  """Test that search_pubmed returns Markdown formatted string."""
@@ -93,7 +94,7 @@ class TestSearchClinicalTrials:
93
  with patch("src.mcp_tools._trials") as mock_tool:
94
  mock_tool.search = AsyncMock(return_value=[mock_evidence])
95
 
96
- result = await search_clinical_trials("diabetes", 10)
97
 
98
  assert isinstance(result, str)
99
  assert "Clinical Trials" in result
 
32
  class TestSearchPubMed:
33
  """Tests for search_pubmed MCP tool."""
34
 
35
+ @pytest.mark.asyncio
36
  @patch("src.mcp_tools._pubmed.search")
37
  async def test_returns_formatted_string(self, mock_search):
38
  """Test that search_pubmed returns Markdown formatted string."""
 
94
  with patch("src.mcp_tools._trials") as mock_tool:
95
  mock_tool.search = AsyncMock(return_value=[mock_evidence])
96
 
97
+ result = await search_clinical_trials("sildenafil erectile dysfunction", 10)
98
 
99
  assert isinstance(result, str)
100
  assert "Clinical Trials" in result
tests/unit/tools/test_clinicaltrials.py CHANGED
@@ -134,9 +134,9 @@ class TestClinicalTrialsIntegration:
134
 
135
  @pytest.mark.asyncio
136
  async def test_real_api_returns_interventional(self) -> None:
137
- """Test that real API returns interventional studies."""
138
  tool = ClinicalTrialsTool()
139
- results = await tool.search("long covid treatment", max_results=3)
140
 
141
  # Should get results
142
  assert len(results) > 0
 
134
 
135
  @pytest.mark.asyncio
136
  async def test_real_api_returns_interventional(self) -> None:
137
+ """Test that real API returns interventional studies for sexual health query."""
138
  tool = ClinicalTrialsTool()
139
+ results = await tool.search("testosterone HSDD", max_results=3)
140
 
141
  # Should get results
142
  assert len(results) > 0
tests/unit/tools/test_europepmc.py CHANGED
@@ -27,8 +27,8 @@ class TestEuropePMCTool:
27
  "result": [
28
  {
29
  "id": "12345",
30
- "title": "Long COVID Treatment Study",
31
- "abstractText": "This study examines treatments for Long COVID.",
32
  "doi": "10.1234/test",
33
  "pubYear": "2024",
34
  "source": "MED",
@@ -49,11 +49,11 @@ class TestEuropePMCTool:
49
 
50
  mock_instance.get.return_value = mock_resp
51
 
52
- results = await tool.search("long covid treatment", max_results=5)
53
 
54
  assert len(results) == 1
55
  assert isinstance(results[0], Evidence)
56
- assert "Long COVID Treatment Study" in results[0].citation.title
57
 
58
  @pytest.mark.asyncio
59
  async def test_search_marks_preprints(self, tool: EuropePMCTool) -> None:
@@ -113,11 +113,11 @@ class TestEuropePMCIntegration:
113
 
114
  @pytest.mark.asyncio
115
  async def test_real_api_call(self) -> None:
116
- """Test actual API returns relevant results."""
117
  tool = EuropePMCTool()
118
- results = await tool.search("long covid treatment", max_results=3)
119
 
120
  assert len(results) > 0
121
- # At least one result should mention COVID
122
  titles = " ".join([r.citation.title.lower() for r in results])
123
- assert "covid" in titles or "sars" in titles
 
27
  "result": [
28
  {
29
  "id": "12345",
30
+ "title": "Testosterone Therapy for HSDD Study",
31
+ "abstractText": "This study examines testosterone therapy for HSDD.",
32
  "doi": "10.1234/test",
33
  "pubYear": "2024",
34
  "source": "MED",
 
49
 
50
  mock_instance.get.return_value = mock_resp
51
 
52
+ results = await tool.search("testosterone HSDD therapy", max_results=5)
53
 
54
  assert len(results) == 1
55
  assert isinstance(results[0], Evidence)
56
+ assert "Testosterone Therapy for HSDD Study" in results[0].citation.title
57
 
58
  @pytest.mark.asyncio
59
  async def test_search_marks_preprints(self, tool: EuropePMCTool) -> None:
 
113
 
114
  @pytest.mark.asyncio
115
  async def test_real_api_call(self) -> None:
116
+ """Test actual API returns relevant results for sexual health query."""
117
  tool = EuropePMCTool()
118
+ results = await tool.search("testosterone libido therapy", max_results=3)
119
 
120
  assert len(results) > 0
121
+ # At least one result should mention testosterone, libido, or a sexual-health term
122
  titles = " ".join([r.citation.title.lower() for r in results])
123
+ assert "testosterone" in titles or "libido" in titles or "sexual" in titles
tests/unit/tools/test_query_utils.py CHANGED
@@ -12,8 +12,8 @@ class TestQueryPreprocessing:
12
  def test_strip_question_words(self) -> None:
13
  """Test removal of question words."""
14
  assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
15
- assert strip_question_words("Which medications help diabetes") == "medications diabetes"
16
- assert strip_question_words("How can we cure aging") == "we cure aging"
17
  assert strip_question_words("Is sildenafil effective") == "sildenafil"
18
 
19
  def test_strip_preserves_medical_terms(self) -> None:
 
12
  def test_strip_question_words(self) -> None:
13
  """Test removal of question words."""
14
  assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
15
+ assert strip_question_words("Which medications help low libido") == "medications low libido"
16
+ assert strip_question_words("How can we treat ED") == "we treat ed"
17
  assert strip_question_words("Is sildenafil effective") == "sildenafil"
18
 
19
  def test_strip_preserves_medical_terms(self) -> None:
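For readers unfamiliar with `strip_question_words`, the updated expectations imply roughly this behaviour: lowercase the query and drop question/filler words while keeping medical terms. Below is a hypothetical reimplementation that satisfies exactly the assertions above; the real implementation and its stop-word list may well differ.

```python
# Hypothetical sketch derived from the test expectations; not the actual query_utils code.
_QUESTION_WORDS = {"what", "which", "how", "can", "is", "are", "does", "do", "help", "effective"}


def strip_question_words(query: str) -> str:
    """Lowercase the query and remove question/filler words, preserving medical terms."""
    return " ".join(t for t in query.lower().split() if t not in _QUESTION_WORDS)


assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
assert strip_question_words("Which medications help low libido") == "medications low libido"
assert strip_question_words("How can we treat ED") == "we treat ed"
assert strip_question_words("Is sildenafil effective") == "sildenafil"
```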