Commit 89f1173 (unverified) · Parent: 0049ad7
Authors: VibecoderMcSwaggins, Claude

fix(SPEC_11): address CodeRabbit review feedback (#92)

* fix(SPEC_11): address CodeRabbit review feedback

- Update run_full.py docstring to reference sexual health pipeline (not drug repurposing)
- Update run_full.py help text to use sexual health query example
- Fix app.py domain display to show "Sexual Health" (not "Sexual_Health")
- Update test_nodes.py mock data to use sexual health terms (Testosterone, Androgen Receptor)
- Add @pytest.mark.unit markers to test_judges.py test classes
- Add specific assertion for next_search_queries in test_judges.py
- Add missing @pytest.mark.asyncio marker to test_mcp_tools.py
- Update test_mcp_tools.py and test_clinicaltrials.py to use sexual health queries
- Fix SPEC_12 markdown bare code fence (MD040)

* fix(SPEC_11): comprehensive domain alignment audit

Second pass of CodeRabbit review fixes - found additional domain mismatches:

- examples/search_demo/run_search.py: Update docstring from "drug repurposing" to "sexual health"
- examples/orchestrator_demo/run_agent.py: Update help text from "metformin cancer" to "testosterone libido"
- src/agents/tools.py: Update search_preprints example from "long covid" to "flibanserin HSDD preprint"
- tests/unit/tools/test_europepmc.py: Replace "Long COVID" mock data and queries with testosterone/HSDD
- tests/unit/tools/test_query_utils.py: Replace "diabetes" and "aging" examples with sexual health terms

All examples, demos, and tests now consistently use sexual health domain examples.

* feat(SPEC_12): implement narrative report synthesis using LLM

Transform report generation from string templating to LLM-based narrative
synthesis, following the Microsoft Agent Framework aggregator pattern.

New files:
- src/prompts/synthesis.py: Narrative synthesis prompts with few-shot example
- get_synthesis_system_prompt(): Domain-aware narrative writing instructions
- format_synthesis_prompt(): Formats evidence/assessment for LLM
- FEW_SHOT_EXAMPLE: Alprostadil ED example demonstrating prose style

- tests/unit/prompts/test_synthesis.py: 20 tests for synthesis prompts
- Verify emphasis on prose, not bullets
- Verify required sections (Executive Summary, Background, etc.)
- Verify anti-hallucination warnings
- Verify few-shot example quality

- tests/unit/orchestrators/test_simple_synthesis.py: 6 tests for orchestrator
- Test LLM agent is called for synthesis
- Test graceful fallback on LLM failure
- Test citation footer inclusion

Modified files:
- src/orchestrators/simple.py:
- Add async _generate_synthesis() that calls LLM for narrative prose
- Rename old method to _generate_template_synthesis() as fallback
- Update call site at line 394 to await the async synthesis

- tests/unit/orchestrators/test_simple_orchestrator_domain.py:
- Update test to use _generate_template_synthesis() (sync fallback)

This implements SPEC_12 acceptance criteria:
- Report contains paragraph-form prose, not just bullet points
- Report has Executive Summary, Background, Evidence Synthesis sections
- Report has actionable Recommendations and Limitations
- Citations properly formatted with author/year/title/URL
- Graceful fallback if LLM unavailable
- All 256 tests pass

* refactor: enhance exception handling and type safety

- Add new exception types: LLMError, QuotaExceededError, ModalError, SynthesisError
- Update orchestrator to catch specific exception types (SearchError, JudgeError, ModalError)
- Add exc_type logging for better debugging and observability
- Fix app.py type safety with OrchestratorMode literal type
- Add mode validation for Gradio string inputs
- Remove unnecessary type: ignore comment in app.py

* docs: add embeddings and meta-agent architecture brainstorm

Research and first-principles analysis covering:
- Embedding service comparison (FAISS, ChromaDB, Voyage AI, MixedBread)
- Selective vs full codebase embedding (selective wins)
- Meta question: would self-knowledge help agents?
- Implementation patterns for codebase RAG
- Recommended roadmap for developer tooling

* docs: add reality check section to embeddings brainstorm

Distinguish real vs vaporware based on web research:
- Cursor's @codebase: REAL, production (embeddings + Turbopuffer)
- Claude Code: grep-only, no semantic search natively
- MCP servers (claude-context, code-index-mcp): REAL but with bugs
- "Self-aware agents" claims: mostly vaporware

Key insight: For AI-native devs, the real opportunity is MCP servers
that give Claude Code semantic search, not embedding the codebase
for agent self-understanding.

* docs: deep dive on internal organ vs external tool

First-principles analysis with empirical research:
- SICA (ICLR 2025): 17-53% gains from self-improvement
- Gödel Agent (ACL 2025): recursive self-modification works
- Introspection paradox: self-knowledge can HURT strong models
- Anthropic research: ~20% accuracy on genuine introspection

Key conclusion: For DeepBoner's core task (research), an internal
self-knowledge organ is overhead with negative ROI. The agent
doesn't need to understand its own code to search PubMed.

External tools help DEVELOPMENT. Internal organs help EXPERIMENTATION.
Neither helps the research task itself.

* docs: critical tool analysis and embeddings conclusion

New: TOOL_ANALYSIS_CRITICAL.md
- Deep analysis of all 4 search tools (PubMed, ClinicalTrials, Europe PMC, OpenAlex)
- API limitations and what's actually possible
- Identified critical gaps: deduplication, outcome measures, citation traversal
- Priority improvements without horizontal sprawl
- Neo4j recommendation: not yet, use OpenAlex API first

Updated: BRAINSTORM_EMBEDDINGS_META.md
- Condensed to conclusions only
- Closed: internal embeddings/mGREP not needed for this use case
- Focus on research evidence retrieval, not codebase self-knowledge

* test: update e2e/integration tests for SPEC_12 LLM synthesis format

Tests were asserting the OLD template format ("## Sexual Health Analysis",
"### Drug Candidates"), but the SPEC_12 implementation uses LLM-generated
narrative prose with different section headers ("Executive Summary",
"Background", "Evidence Synthesis", etc.).

Updated assertions to accept both formats for backwards compatibility.

* docs: add language identifier to code fence (MD040)

---------

Co-authored-by: Claude <noreply@anthropic.com>

BRAINSTORM_EMBEDDINGS_META.md ADDED
@@ -0,0 +1,74 @@
1
+ # Embeddings Brainstorm - Conclusions
2
+
3
+ **Date**: November 2025
4
+ **Status**: CLOSED - Conclusions reached, no action needed
5
+
6
+ ---
7
+
8
+ ## The Question
9
+
10
+ Should DeepBoner implement:
11
+ 1. Internal codebase embeddings/ingestion pipeline?
12
+ 2. mGREP for internal tool selection?
13
+ 3. Self-knowledge components for agents?
14
+
15
+ ## The Answer: NO
16
+
17
+ After research and first-principles analysis, the conclusion is clear:
18
+
19
+ ### Why Not Internal Embeddings/Ingestion
20
+
21
+ ```text
22
+ DeepBoner's Core Task:
23
+ ┌─────────────────────────────────────────────────────────┐
24
+ │ User Query: "Evidence for testosterone in HSDD?" │
25
+ │ ↓ │
26
+ │ 1. Search PubMed, ClinicalTrials, Europe PMC │
27
+ │ 2. Judge: Is evidence sufficient? │
28
+ │ 3. Synthesize: Generate report │
29
+ │ ↓ │
30
+ │ Output: Research report with citations │
31
+ └─────────────────────────────────────────────────────────┘
32
+
33
+ Does ANY step require self-knowledge of codebase? NO.
34
+ ```
35
+
36
+ ### Why Not mGREP for Tool Selection
37
+
38
+ | Approach | Complexity | Accuracy |
39
+ |----------|------------|----------|
40
+ | Embeddings + mGREP for tool selection | High | Medium (semantic similarity ≠ correct tool) |
41
+ | Direct prompting with tool descriptions | Low | High (LLM reasons about applicability) |
42
+
43
+ **No real agent system uses embeddings for tool selection.** All major frameworks (LangChain, OpenAI, Anthropic, Magentic) use prompt-based tool selection because:
44
+ 1. LLMs are already doing semantic matching internally
45
+ 2. Tool count is small (5-20) - fits easily in context
46
+ 3. Prompts allow reasoning, not just similarity
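
A minimal sketch of that prompt-based approach, with illustrative tool names rather than the project's real registry:

```python
TOOL_DESCRIPTIONS = {
    "search_pubmed": "Peer-reviewed biomedical literature (abstracts).",
    "search_clinicaltrials": "Registered interventional trials with phase and status.",
    "search_preprints": "Europe PMC preprints (bioRxiv/medRxiv), not peer-reviewed.",
}


def build_tool_selection_prompt(question: str) -> str:
    """Inline tool descriptions so the LLM can reason about applicability (sketch)."""
    tool_lines = "\n".join(f"- {name}: {desc}" for name, desc in TOOL_DESCRIPTIONS.items())
    return (
        f"Available tools:\n{tool_lines}\n\n"
        f"Question: {question}\n"
        "Choose the tool(s) that best answer the question and explain why."
    )
```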
47
+
48
+ ### What We Already Have
49
+
50
+ DeepBoner already uses embeddings for the **right thing**: research evidence retrieval.
51
+ - `src/services/embeddings.py` - ChromaDB + sentence-transformers
52
+ - `src/services/llamaindex_rag.py` - OpenAI embeddings for premium tier
53
+
54
+ ### The Real Priority
55
+
56
+ Instead of internal embeddings/mGREP, focus on:
57
+ 1. **Deduplication** across PubMed/Europe PMC/OpenAlex
58
+ 2. **Outcome measures** from ClinicalTrials.gov
59
+ 3. **Citation graph traversal** via OpenAlex
60
+
61
+ See: `TOOL_ANALYSIS_CRITICAL.md` for detailed improvement roadmap.
62
+
63
+ ---
64
+
65
+ ## Research Sources
66
+
67
+ - [SICA Paper (ICLR 2025)](https://arxiv.org/abs/2504.15228) - Self-improving agents
68
+ - [Gödel Agent (ACL 2025)](https://arxiv.org/abs/2410.04444) - Recursive self-modification
69
+ - [Introspection Paradox (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.352/) - Self-knowledge can hurt performance
70
+ - [Anthropic Introspection Research](https://www.anthropic.com/research/introspection) - ~20% accuracy on genuine introspection
71
+
72
+ ---
73
+
74
+ *This document is closed. The conclusion is: don't implement internal embeddings/mGREP for this use case.*
SPEC_12_NARRATIVE_SYNTHESIS.md CHANGED
@@ -176,7 +176,7 @@ async def summarize_results(results: list[Any]) -> str:
176
 
177
  ### Architecture Change
178
 
179
- ```
180
  Current (Simple Mode):
181
  Evidence → Judge → {structured data} → String Template → Bullet Points
182
 
 
176
 
177
  ### Architecture Change
178
 
179
+ ```text
180
  Current (Simple Mode):
181
  Evidence → Judge → {structured data} → String Template → Bullet Points
182
 
TOOL_ANALYSIS_CRITICAL.md ADDED
@@ -0,0 +1,348 @@
1
+ # Critical Analysis: Search Tools - Limitations, Gaps, and Improvements
2
+
3
+ **Date**: November 2025
4
+ **Purpose**: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.
5
+
6
+ ---
7
+
8
+ ## Executive Summary
9
+
10
+ DeepBoner currently has **4 search tools**:
11
+ 1. PubMed (NCBI E-utilities)
12
+ 2. ClinicalTrials.gov (API v2)
13
+ 3. Europe PMC (includes preprints)
14
+ 4. OpenAlex (citation-aware)
15
+
16
+ **Overall Assessment**: Tools are functional but have significant gaps in:
17
+ - Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
18
+ - Full-text retrieval (only abstracts currently)
19
+ - Citation graph traversal (OpenAlex has data but we don't use it)
20
+ - Query optimization (basic synonym expansion, no MeSH term mapping)
21
+
22
+ ---
23
+
24
+ ## Tool 1: PubMed (NCBI E-utilities)
25
+
26
+ **File**: `src/tools/pubmed.py`
27
+
28
+ ### What It Does Well
29
+ | Feature | Status | Notes |
30
+ |---------|--------|-------|
31
+ | Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
32
+ | Retry logic | ✅ | tenacity with exponential backoff |
33
+ | Query preprocessing | ✅ | Strips question words, expands synonyms |
34
+ | Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |
35
+
36
+ ### Limitations (API-Level)
37
+ | Limitation | Severity | Workaround Possible? |
38
+ |------------|----------|---------------------|
39
+ | **10,000 result cap per query** | Medium | Yes - use date ranges to paginate |
40
+ | **Abstracts only** (no full text) | High | No - full text requires PMC or publisher |
41
+ | **No citation counts** | Medium | Yes - cross-reference with OpenAlex |
42
+ | **Rate limit (10/sec max)** | Low | Already handled |
43
+
44
+ ### Current Implementation Gaps
45
+ ```python
46
+ # GAP 1: No MeSH term expansion
47
+ # Current: expand_synonyms() uses hardcoded dict
48
+ # Better: Use NCBI's E-utilities to get MeSH terms for query
49
+
50
+ # GAP 2: No date filtering
51
+ # Current: Gets whatever PubMed returns (biased toward recent)
52
+ # Better: Add date range parameter for historical research
53
+
54
+ # GAP 3: No publication type filtering
55
+ # Current: Returns all types (reviews, case reports, RCTs)
56
+ # Better: Filter for RCTs and systematic reviews when appropriate
57
+ ```
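
A hedged sketch of the date-range (GAP 2) and publication-type (GAP 3) fixes; the helper name is illustrative and does not exist in `src/tools/pubmed.py` yet:

```python
# Illustrative only: narrow an esearch request with a publication-type
# filter (query syntax) and a date window (mindate/maxdate parameters).
def build_esearch_params(
    query: str,
    publication_types: list[str] | None = None,
    min_date: str | None = None,  # e.g. "2015/01/01"
    max_date: str | None = None,  # e.g. "2025/12/31"
) -> dict[str, str]:
    term = query
    if publication_types:
        pt_clause = " OR ".join(f'"{pt}"[Publication Type]' for pt in publication_types)
        term = f"({term}) AND ({pt_clause})"
    params: dict[str, str] = {"db": "pubmed", "term": term, "retmode": "json"}
    if min_date and max_date:
        params.update({"datetype": "pdat", "mindate": min_date, "maxdate": max_date})
    return params


# Usage: RCTs and meta-analyses on testosterone therapy since 2015
params = build_esearch_params(
    "testosterone therapy HSDD",
    publication_types=["Randomized Controlled Trial", "Meta-Analysis"],
    min_date="2015/01/01",
    max_date="2025/12/31",
)
```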
58
+
59
+ ### Priority Improvements
60
+ 1. **HIGH**: Add publication type filter (Reviews, RCTs, Meta-analyses)
61
+ 2. **MEDIUM**: Add date range parameter
62
+ 3. **LOW**: MeSH term expansion via E-utilities
63
+
64
+ ---
65
+
66
+ ## Tool 2: ClinicalTrials.gov
67
+
68
+ **File**: `src/tools/clinicaltrials.py`
69
+
70
+ ### What It Does Well
71
+ | Feature | Status | Notes |
72
+ |---------|--------|-------|
73
+ | API v2 usage | ✅ | Modern API, not deprecated v1 |
74
+ | Interventional filter | ✅ | Only gets drug/treatment studies |
75
+ | Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
76
+ | httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |
77
+
78
+ ### Limitations (API-Level)
79
+ | Limitation | Severity | Workaround Possible? |
80
+ |------------|----------|---------------------|
81
+ | **No results data** | High | Yes - available via different endpoint |
82
+ | **No outcome measures** | High | Yes - add to FIELDS list |
83
+ | **No adverse events** | Medium | Yes - separate API call |
84
+ | **Sparse drug mechanism data** | Medium | No - not in API |
85
+
86
+ ### Current Implementation Gaps
87
+ ```python
88
+ # GAP 1: Missing critical fields
89
+ FIELDS: ClassVar[list[str]] = [
90
+ "NCTId",
91
+ "BriefTitle",
92
+ "Phase",
93
+ "OverallStatus",
94
+ "Condition",
95
+ "InterventionName",
96
+ "StartDate",
97
+ "BriefSummary",
98
+ # MISSING:
99
+ # "PrimaryOutcome",
100
+ # "SecondaryOutcome",
101
+ # "ResultsFirstSubmitDate",
102
+ # "StudyResults", # Whether results are posted
103
+ ]
104
+
105
+ # GAP 2: No results retrieval
106
+ # Many completed trials have posted results
107
+ # We could get actual efficacy data, not just trial existence
108
+
109
+ # GAP 3: No linked publications
110
+ # Trials often link to PubMed articles with results
111
+ # We could follow these links for richer evidence
112
+ ```
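
A hedged sketch of the GAP 2 fix (retrieving posted results). It reuses `requests` since that is already the workaround for the WAF block; the response keys (`hasResults`, `resultsSection`) are assumptions to verify against a live v2 response:

```python
import requests


def fetch_trial_results(nct_id: str) -> dict | None:
    """Return posted outcome-measure results for a trial, if any (sketch)."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    study = resp.json()
    if not study.get("hasResults"):
        return None  # trial has no posted results yet
    return study.get("resultsSection", {}).get("outcomeMeasuresModule")
```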
113
+
114
+ ### Priority Improvements
115
+ 1. **HIGH**: Add outcome measures to FIELDS
116
+ 2. **HIGH**: Check for and retrieve posted results
117
+ 3. **MEDIUM**: Follow linked publications (NCT → PMID)
118
+
119
+ ---
120
+
121
+ ## Tool 3: Europe PMC
122
+
123
+ **File**: `src/tools/europepmc.py`
124
+
125
+ ### What It Does Well
126
+ | Feature | Status | Notes |
127
+ |---------|--------|-------|
128
+ | Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
129
+ | Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
130
+ | DOI/PMID fallback URLs | ✅ | Smart URL construction |
131
+ | Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |
132
+
133
+ ### Limitations (API-Level)
134
+ | Limitation | Severity | Workaround Possible? |
135
+ |------------|----------|---------------------|
136
+ | **No full text for most articles** | High | Partial - CC-licensed available after 14 days |
137
+ | **Citation data limited** | Medium | Only journal articles, not preprints |
138
+ | **Preprint-publication linking gaps** | Medium | ~50% of links missing per Crossref |
139
+ | **License info sometimes missing** | Low | Manual review required |
140
+
141
+ ### Current Implementation Gaps
142
+ ```python
143
+ # GAP 1: No full-text retrieval
144
+ # Europe PMC has full text for many CC-licensed articles
145
+ # Could retrieve full text XML via separate endpoint
146
+
147
+ # GAP 2: Massive overlap with PubMed
148
+ # Europe PMC indexes all of PubMed/MEDLINE
149
+ # We're getting duplicates with no deduplication
150
+
151
+ # GAP 3: No citation network
152
+ # Europe PMC has "citedByCount" but we don't use it
153
+ # Could prioritize highly-cited preprints
154
+ ```
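
A hedged sketch of the GAP 3 fix (citation-aware ranking) using the documented `resultType=core` parameter; the helper name is illustrative:

```python
import httpx

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


async def search_with_citations(query: str, limit: int = 10) -> list[dict]:
    """Search Europe PMC and rank hits by citedByCount (sketch)."""
    params = {
        "query": query,
        "format": "json",
        "resultType": "core",  # includes citedByCount and license metadata
        "pageSize": str(limit),
    }
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.get(EPMC_SEARCH, params=params)
        resp.raise_for_status()
    results = resp.json().get("resultList", {}).get("result", [])
    # Preprints often lack a count; treat missing as 0 rather than dropping them
    return sorted(results, key=lambda r: r.get("citedByCount", 0), reverse=True)
```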
155
+
156
+ ### Priority Improvements
157
+ 1. **HIGH**: Add deduplication with PubMed (by PMID)
158
+ 2. **MEDIUM**: Retrieve citation counts for ranking
159
+ 3. **LOW**: Full-text retrieval for CC-licensed articles
160
+
161
+ ---
162
+
163
+ ## Tool 4: OpenAlex
164
+
165
+ **File**: `src/tools/openalex.py`
166
+
167
+ ### What It Does Well
168
+ | Feature | Status | Notes |
169
+ |---------|--------|-------|
170
+ | Citation counts | ✅ | Sorted by `cited_by_count:desc` |
171
+ | Abstract reconstruction | ✅ | Handles inverted index format |
172
+ | Concept extraction | ✅ | Hierarchical classification |
173
+ | Open access detection | ✅ | `is_oa` and `pdf_url` |
174
+ | Polite pool | ✅ | mailto for 100k/day limit |
175
+ | Rich metadata | ✅ | Best metadata of all tools |
176
+
177
+ ### Limitations (API-Level)
178
+ | Limitation | Severity | Workaround Possible? |
179
+ |------------|----------|---------------------|
180
+ | **Author truncation at 100** | Low | Only affects mega-author papers |
181
+ | **No full text** | High | No - OpenAlex is metadata only |
182
+ | **Stale data (1-2 day lag)** | Low | Acceptable for research |
183
+
184
+ ### Current Implementation Gaps
185
+ ```python
186
+ # GAP 1: No citation graph traversal
187
+ # OpenAlex has `cited_by` and `references` endpoints
188
+ # We could find seminal papers by following citation chains
189
+
190
+ # GAP 2: No related works
191
+ # OpenAlex has ML-powered "related_works" field
192
+ # Could expand search to similar papers
193
+
194
+ # GAP 3: No concept filtering
195
+ # OpenAlex has hierarchical concepts
196
+ # Could filter for specific domains (e.g., "Sexual health" concept)
197
+
198
+ # GAP 4: Overlap with PubMed
199
+ # OpenAlex indexes most of PubMed
200
+ # More duplicates without deduplication
201
+ ```
202
+
203
+ ### Priority Improvements
204
+ 1. **HIGH**: Add citation graph traversal (find seminal papers)
205
+ 2. **HIGH**: Add deduplication with PubMed/Europe PMC
206
+ 3. **MEDIUM**: Use `related_works` for query expansion
207
+ 4. **LOW**: Concept-based filtering
208
+
209
+ ---
210
+
211
+ ## Cross-Tool Issues
212
+
213
+ ### Issue 1: MASSIVE DUPLICATION
214
+
215
+ ```text
216
+ PubMed: 36M+ articles
217
+ Europe PMC: Indexes ALL of PubMed + preprints
218
+ OpenAlex: 250M+ works (includes PubMed)
219
+
220
+ Current behavior: All 3 return the same papers
221
+ Result: Duplicate evidence, wasted tokens, inflated counts
222
+ ```
223
+
224
+ **Solution**: Deduplication by PMID/DOI
225
+ ```python
226
+ # Proposed: Add to SearchHandler
227
+ def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
228
+ seen_ids: set[str] = set()
229
+ unique: list[Evidence] = []
230
+ for e in evidence_list:
231
+ # Extract PMID or DOI from URL
232
+ paper_id = extract_paper_id(e.citation.url)
233
+ if paper_id not in seen_ids:
234
+ seen_ids.add(paper_id)
235
+ unique.append(e)
236
+ return unique
237
+ ```
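
A hedged sketch of the `extract_paper_id` helper the proposal assumes; the URL patterns are illustrative and would need to match what each tool actually emits:

```python
import re


def extract_paper_id(url: str) -> str:
    """Normalize a citation URL to a PMID or DOI key for deduplication (sketch)."""
    pmid = re.search(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", url)
    if pmid:
        return f"pmid:{pmid.group(1)}"
    doi = re.search(r"10\.\d{4,9}/[^\s?#]+", url)
    if doi:
        return f"doi:{doi.group(0).lower()}"
    # Unknown source: fall back to the full URL so nothing is merged by mistake
    return url
```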
238
+
239
+ ### Issue 2: NO FULL-TEXT RETRIEVAL
240
+
241
+ All tools return **abstracts only**. For deep research, this is limiting.
242
+
243
+ **What's Actually Possible**:
244
+ | Source | Full Text Access | How |
245
+ |--------|------------------|-----|
246
+ | PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
247
+ | Europe PMC | Yes, CC-licensed after 14 days | `/fullTextXML/{id}` endpoint |
248
+ | OpenAlex | No | Metadata only |
249
+ | Unpaywall | Yes, OA link discovery | Separate API |
250
+
251
+ **Recommendation**: Add PMC full-text retrieval for open access articles.
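
A hedged sketch of that retrieval via efetch with `db=pmc`; handling of the "PMC" prefix and the exact response shape should be verified against the E-utilities docs:

```python
import httpx

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


async def fetch_pmc_fulltext_xml(pmcid: str, api_key: str | None = None) -> str:
    """Fetch JATS XML for an open-access PMC article (sketch)."""
    params = {"db": "pmc", "id": pmcid.removeprefix("PMC"), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key  # lifts the E-utilities limit to 10 req/sec
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.get(EFETCH, params=params)
        resp.raise_for_status()
    return resp.text  # body may be withheld for articles that are not open access
```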
252
+
253
+ ### Issue 3: NO CITATION GRAPH
254
+
255
+ OpenAlex has rich citation data but we only use `cited_by_count` for sorting.
256
+
257
+ **Untapped Capabilities**:
258
+ - `cited_by`: Find papers that cite a key paper
259
+ - `references`: Find sources a paper cites
260
+ - `related_works`: ML-powered similar papers
261
+
262
+ **Use Case**: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:
263
+ - Papers that cite it (newer evidence)
264
+ - Papers it cites (foundational research)
265
+ - Related papers (similar topics)
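
A hedged sketch of that traversal using the documented `cites:` filter plus the `referenced_works` and `related_works` fields; function and variable names are illustrative:

```python
import httpx

OPENALEX_WORKS = "https://api.openalex.org/works"


async def citation_neighbourhood(work_id: str, mailto: str) -> dict[str, list]:
    """Collect citing, cited, and related works around a seminal paper (sketch)."""
    async with httpx.AsyncClient(timeout=30) as client:
        seed = (await client.get(f"{OPENALEX_WORKS}/{work_id}", params={"mailto": mailto})).json()
        citing = (
            await client.get(
                OPENALEX_WORKS,
                params={
                    "filter": f"cites:{work_id}",
                    "sort": "cited_by_count:desc",
                    "mailto": mailto,
                },
            )
        ).json()
    return {
        "cites_this": citing.get("results", []),            # newer evidence
        "cited_by_this": seed.get("referenced_works", []),  # foundational research
        "related": seed.get("related_works", []),           # similar topics
    }
```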
266
+
267
+ ---
268
+
269
+ ## What's NOT Possible (API Constraints)
270
+
271
+ | Feature | Why Not Possible |
272
+ |---------|------------------|
273
+ | **bioRxiv direct search** | No keyword search API, only RSS feed of latest |
274
+ | **arXiv search** | API exists but irrelevant for sexual health |
275
+ | **PubMed full text** | Requires publisher access or PMC |
276
+ | **Real-time trial results** | ClinicalTrials.gov results are static snapshots |
277
+ | **Drug mechanism data** | Not in any API - would need ChEMBL or DrugBank |
278
+
279
+ ---
280
+
281
+ ## Recommended Improvements (Priority Order)
282
+
283
+ ### Phase 1: Fix Fundamentals (High ROI)
284
+ 1. **Deduplication** - Stop returning the same paper 3 times
285
+ 2. **Outcome measures in ClinicalTrials** - Get actual efficacy data
286
+ 3. **Citation counts from all sources** - Rank by influence, not recency
287
+
288
+ ### Phase 2: Depth Improvements (Medium ROI)
289
+ 4. **PMC full-text retrieval** - Get full papers for OA articles
290
+ 5. **Citation graph traversal** - Find seminal papers automatically
291
+ 6. **Publication type filtering** - Prioritize RCTs and meta-analyses
292
+
293
+ ### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)
294
+ 7. **MeSH term expansion** - Better PubMed queries
295
+ 8. **Related works expansion** - Use OpenAlex ML similarity
296
+ 9. **Date range filtering** - Historical vs recent research
297
+
298
+ ---
299
+
300
+ ## Neo4j Integration (Future Consideration)
301
+
302
+ **Question**: Should we add Neo4j for citation graph storage?
303
+
304
+ **Answer**: Not yet. Here's why:
305
+
306
+ | Approach | Complexity | Value |
307
+ |----------|------------|-------|
308
+ | OpenAlex API for citation traversal | Low | High |
309
+ | Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
310
+ | Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |
311
+
312
+ **Recommendation**: Use OpenAlex API for citation traversal first. Only add Neo4j if:
313
+ 1. We need to do complex graph queries (PageRank on citations, community detection)
314
+ 2. We need offline access to citation data
315
+ 3. We're hitting OpenAlex rate limits
316
+
317
+ ---
318
+
319
+ ## Summary: What's Broken vs What's Working
320
+
321
+ ### Working Well
322
+ - Basic search across all 4 sources
323
+ - Rate limiting and retry logic
324
+ - Query preprocessing
325
+ - Evidence model with citations
326
+
327
+ ### Needs Fixing (Current Scope)
328
+ - Deduplication (critical)
329
+ - Outcome measures in ClinicalTrials (critical)
330
+ - Citation-based ranking (important)
331
+
332
+ ### Future Enhancements (Out of Current Scope)
333
+ - Full-text retrieval
334
+ - Citation graph traversal
335
+ - Neo4j integration
336
+ - Drug mechanism data (would need new data sources)
337
+
338
+ ---
339
+
340
+ ## Sources
341
+
342
+ - [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
343
+ - [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
344
+ - [OpenAlex API Docs](https://docs.openalex.org/)
345
+ - [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations)
346
+ - [Europe PMC RESTful API](https://europepmc.org/RestfulWebService)
347
+ - [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/)
348
+ - [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)
examples/full_stack_demo/run_full.py CHANGED
@@ -2,7 +2,7 @@
2
  """
3
  Demo: Full Stack DeepBoner Agent (Phases 1-8).
4
 
5
- This script demonstrates the COMPLETE REAL drug repurposing research pipeline:
6
  - Phase 2: REAL Search (PubMed + ClinicalTrials + Europe PMC)
7
  - Phase 6: REAL Embeddings (sentence-transformers + ChromaDB)
8
  - Phase 7: REAL Hypothesis (LLM mechanistic reasoning)
@@ -190,7 +190,7 @@ Examples:
190
  )
191
  parser.add_argument(
192
  "query",
193
- help="Research query (e.g., 'metformin Alzheimer's disease')",
194
  )
195
  parser.add_argument(
196
  "-i",
 
2
  """
3
  Demo: Full Stack DeepBoner Agent (Phases 1-8).
4
 
5
+ This script demonstrates the COMPLETE REAL sexual health research pipeline:
6
  - Phase 2: REAL Search (PubMed + ClinicalTrials + Europe PMC)
7
  - Phase 6: REAL Embeddings (sentence-transformers + ChromaDB)
8
  - Phase 7: REAL Hypothesis (LLM mechanistic reasoning)
 
190
  )
191
  parser.add_argument(
192
  "query",
193
+ help="Research query (e.g., 'testosterone libido')",
194
  )
195
  parser.add_argument(
196
  "-i",
examples/orchestrator_demo/run_agent.py CHANGED
@@ -51,7 +51,7 @@ Examples:
51
  uv run python examples/orchestrator_demo/run_agent.py "flibanserin HSDD" --iterations 5
52
  """,
53
  )
54
- parser.add_argument("query", help="Research query (e.g., 'metformin cancer')")
55
  parser.add_argument("--iterations", type=int, default=3, help="Max iterations (default: 3)")
56
  args = parser.parse_args()
57
 
 
51
  uv run python examples/orchestrator_demo/run_agent.py "flibanserin HSDD" --iterations 5
52
  """,
53
  )
54
+ parser.add_argument("query", help="Research query (e.g., 'testosterone libido')")
55
  parser.add_argument("--iterations", type=int, default=3, help="Max iterations (default: 3)")
56
  args = parser.parse_args()
57
 
examples/search_demo/run_search.py CHANGED
@@ -1,6 +1,6 @@
1
  #!/usr/bin/env python3
2
  """
3
- Demo: Search for drug repurposing evidence.
4
 
5
  This script demonstrates multi-source search functionality:
6
  - PubMed search (biomedical literature)
 
1
  #!/usr/bin/env python3
2
  """
3
+ Demo: Search for sexual health research evidence.
4
 
5
  This script demonstrates multi-source search functionality:
6
  - PubMed search (biomedical literature)
src/agent_factory/judges.py CHANGED
@@ -166,7 +166,13 @@ class JudgeHandler:
166
  return assessment
167
 
168
  except Exception as e:
169
- logger.error("Assessment failed", error=str(e))
 
 
 
 
 
 
170
  # Return a safe default assessment on failure
171
  return self._create_fallback_assessment(question, str(e))
172
 
 
166
  return assessment
167
 
168
  except Exception as e:
169
+ # Log with context for debugging
170
+ logger.error(
171
+ "Assessment failed",
172
+ error=str(e),
173
+ exc_type=type(e).__name__,
174
+ evidence_count=len(evidence),
175
+ )
176
  # Return a safe default assessment on failure
177
  return self._create_fallback_assessment(question, str(e))
178
 
src/agents/tools.py CHANGED
@@ -125,7 +125,7 @@ async def search_preprints(query: str, max_results: int = 10) -> str:
125
  from bioRxiv, medRxiv, and peer-reviewed papers.
126
 
127
  Args:
128
- query: Search terms (e.g., "long covid treatment")
129
  max_results: Maximum results to return (default 10)
130
 
131
  Returns:
 
125
  from bioRxiv, medRxiv, and peer-reviewed papers.
126
 
127
  Args:
128
+ query: Search terms (e.g., "flibanserin HSDD preprint")
129
  max_results: Maximum results to return (default 10)
130
 
131
  Returns:
src/app.py CHANGED
@@ -2,7 +2,7 @@
2
 
3
  import os
4
  from collections.abc import AsyncGenerator
5
- from typing import Any
6
 
7
  import gradio as gr
8
  from pydantic_ai.models.anthropic import AnthropicModel
@@ -22,10 +22,12 @@ from src.utils.config import settings
22
  from src.utils.exceptions import ConfigurationError
23
  from src.utils.models import OrchestratorConfig
24
 
 
 
25
 
26
  def configure_orchestrator(
27
  use_mock: bool = False,
28
- mode: str = "simple",
29
  user_api_key: str | None = None,
30
  domain: str | ResearchDomain | None = None,
31
  ) -> tuple[Any, str]:
@@ -100,7 +102,7 @@ def configure_orchestrator(
100
  search_handler=search_handler,
101
  judge_handler=judge_handler,
102
  config=config,
103
- mode=mode, # type: ignore
104
  api_key=user_api_key,
105
  domain=domain,
106
  )
@@ -111,7 +113,7 @@ def configure_orchestrator(
111
  async def research_agent(
112
  message: str,
113
  history: list[dict[str, Any]],
114
- mode: str = "simple",
115
  domain: str = "sexual_health",
116
  api_key: str = "",
117
  api_key_state: str = "",
@@ -140,6 +142,10 @@ async def research_agent(
140
  api_key_state_str = api_key_state or ""
141
  domain_str = domain or "sexual_health"
142
 
 
 
 
 
143
  # BUG FIX: Prefer freshly-entered key, then persisted state
144
  user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
145
 
@@ -153,12 +159,12 @@ async def research_agent(
153
  has_paid_key = has_openai or has_anthropic or bool(user_api_key)
154
 
155
  # Advanced mode requires OpenAI specifically (due to agent-framework binding)
156
- if mode == "advanced" and not (has_openai or is_openai_user_key):
157
  yield (
158
  "⚠️ **Warning**: Advanced mode currently requires OpenAI API key. "
159
  "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
160
  )
161
- mode = "simple"
162
 
163
  # Inform user about fallback if no keys
164
  if not has_paid_key:
@@ -177,14 +183,16 @@ async def research_agent(
177
  # It will use: Paid API > HF Inference (free tier)
178
  orchestrator, backend_name = configure_orchestrator(
179
  use_mock=False, # Never use mock in production - HF Inference is the free fallback
180
- mode=mode,
181
  user_api_key=user_api_key,
182
  domain=domain_str,
183
  )
184
 
185
  # Immediate backend info + loading feedback so user knows something is happening
 
 
186
  yield (
187
- f"🧠 **Backend**: {backend_name} | **Domain**: {domain_str.title()}\n\n"
188
  "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC, OpenAlex...\n"
189
  )
190
 
 
2
 
3
  import os
4
  from collections.abc import AsyncGenerator
5
+ from typing import Any, Literal
6
 
7
  import gradio as gr
8
  from pydantic_ai.models.anthropic import AnthropicModel
 
22
  from src.utils.exceptions import ConfigurationError
23
  from src.utils.models import OrchestratorConfig
24
 
25
+ OrchestratorMode = Literal["simple", "magentic", "advanced", "hierarchical"]
26
+
27
 
28
  def configure_orchestrator(
29
  use_mock: bool = False,
30
+ mode: OrchestratorMode = "simple",
31
  user_api_key: str | None = None,
32
  domain: str | ResearchDomain | None = None,
33
  ) -> tuple[Any, str]:
 
102
  search_handler=search_handler,
103
  judge_handler=judge_handler,
104
  config=config,
105
+ mode=mode,
106
  api_key=user_api_key,
107
  domain=domain,
108
  )
 
113
  async def research_agent(
114
  message: str,
115
  history: list[dict[str, Any]],
116
+ mode: str = "simple", # Gradio passes strings; validated below
117
  domain: str = "sexual_health",
118
  api_key: str = "",
119
  api_key_state: str = "",
 
142
  api_key_state_str = api_key_state or ""
143
  domain_str = domain or "sexual_health"
144
 
145
+ # Validate and cast mode to proper type
146
+ valid_modes: set[str] = {"simple", "magentic", "advanced", "hierarchical"}
147
+ mode_validated: OrchestratorMode = mode if mode in valid_modes else "simple" # type: ignore[assignment]
148
+
149
  # BUG FIX: Prefer freshly-entered key, then persisted state
150
  user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None
151
 
 
159
  has_paid_key = has_openai or has_anthropic or bool(user_api_key)
160
 
161
  # Advanced mode requires OpenAI specifically (due to agent-framework binding)
162
+ if mode_validated == "advanced" and not (has_openai or is_openai_user_key):
163
  yield (
164
  "⚠️ **Warning**: Advanced mode currently requires OpenAI API key. "
165
  "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
166
  )
167
+ mode_validated = "simple"
168
 
169
  # Inform user about fallback if no keys
170
  if not has_paid_key:
 
183
  # It will use: Paid API > HF Inference (free tier)
184
  orchestrator, backend_name = configure_orchestrator(
185
  use_mock=False, # Never use mock in production - HF Inference is the free fallback
186
+ mode=mode_validated,
187
  user_api_key=user_api_key,
188
  domain=domain_str,
189
  )
190
 
191
  # Immediate backend info + loading feedback so user knows something is happening
192
+ # Use replace to get "Sexual Health" instead of "Sexual_Health" from .title()
193
+ domain_display = domain_str.replace("_", " ").title()
194
  yield (
195
+ f"🧠 **Backend**: {backend_name} | **Domain**: {domain_display}\n\n"
196
  "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC, OpenAlex...\n"
197
  )
198
 
src/middleware/sub_iteration.py CHANGED
@@ -81,12 +81,18 @@ class SubIterationMiddleware:
81
  history.append(result)
82
  best_result = result # Assume latest is best for now
83
  except Exception as e:
84
- logger.error("Sub-iteration execution failed", error=str(e))
 
 
 
 
 
85
  if event_callback:
86
  await event_callback(
87
  AgentEvent(
88
  type="error",
89
  message=f"Sub-iteration execution failed: {e}",
 
90
  iteration=i,
91
  )
92
  )
@@ -97,12 +103,18 @@ class SubIterationMiddleware:
97
  assessment = await self.judge.assess(task, result, history)
98
  final_assessment = assessment
99
  except Exception as e:
100
- logger.error("Sub-iteration judge failed", error=str(e))
 
 
 
 
 
101
  if event_callback:
102
  await event_callback(
103
  AgentEvent(
104
  type="error",
105
  message=f"Sub-iteration judge failed: {e}",
 
106
  iteration=i,
107
  )
108
  )
 
81
  history.append(result)
82
  best_result = result # Assume latest is best for now
83
  except Exception as e:
84
+ logger.error(
85
+ "Sub-iteration execution failed",
86
+ error=str(e),
87
+ exc_type=type(e).__name__,
88
+ iteration=i,
89
+ )
90
  if event_callback:
91
  await event_callback(
92
  AgentEvent(
93
  type="error",
94
  message=f"Sub-iteration execution failed: {e}",
95
+ data={"recoverable": False, "error_type": type(e).__name__},
96
  iteration=i,
97
  )
98
  )
 
103
  assessment = await self.judge.assess(task, result, history)
104
  final_assessment = assessment
105
  except Exception as e:
106
+ logger.error(
107
+ "Sub-iteration judge failed",
108
+ error=str(e),
109
+ exc_type=type(e).__name__,
110
+ iteration=i,
111
+ )
112
  if event_callback:
113
  await event_callback(
114
  AgentEvent(
115
  type="error",
116
  message=f"Sub-iteration judge failed: {e}",
117
+ data={"recoverable": False, "error_type": type(e).__name__},
118
  iteration=i,
119
  )
120
  )
src/orchestrators/simple.py CHANGED
@@ -18,7 +18,9 @@ import structlog
18
 
19
  from src.config.domain import ResearchDomain, get_domain_config
20
  from src.orchestrators.base import JudgeHandlerProtocol, SearchHandlerProtocol
 
21
  from src.utils.config import settings
 
22
  from src.utils.models import (
23
  AgentEvent,
24
  Evidence,
@@ -132,12 +134,25 @@ class Orchestrator:
132
  iteration=iteration,
133
  )
134
 
 
 
 
 
 
 
 
 
135
  except Exception as e:
136
- logger.error("Modal analysis failed", error=str(e))
 
 
 
 
 
137
  yield AgentEvent(
138
  type="error",
139
  message=f"Modal analysis failed: {e}",
140
- data={"error": str(e)},
141
  iteration=iteration,
142
  )
143
 
@@ -288,11 +303,26 @@ class Orchestrator:
288
  if errors:
289
  logger.warning("Search errors", errors=errors)
290
 
 
 
 
 
 
 
 
 
 
291
  except Exception as e:
292
- logger.error("Search phase failed", error=str(e))
 
 
 
 
 
293
  yield AgentEvent(
294
  type="error",
295
  message=f"Search failed: {e!s}",
 
296
  iteration=iteration,
297
  )
298
  continue
@@ -388,9 +418,9 @@ class Orchestrator:
388
  iteration=iteration,
389
  )
390
 
391
- # Generate final response
392
  # Use all gathered evidence for the final report
393
- final_response = self._generate_synthesis(query, all_evidence, assessment)
394
 
395
  yield AgentEvent(
396
  type="complete",
@@ -424,11 +454,26 @@ class Orchestrator:
424
  iteration=iteration,
425
  )
426
 
 
 
 
 
 
 
 
 
 
427
  except Exception as e:
428
- logger.error("Judge phase failed", error=str(e))
 
 
 
 
 
429
  yield AgentEvent(
430
  type="error",
431
  message=f"Assessment failed: {e!s}",
 
432
  iteration=iteration,
433
  )
434
  continue
@@ -445,14 +490,105 @@ class Orchestrator:
445
  iteration=iteration,
446
  )
447
 
448
- def _generate_synthesis(
449
  self,
450
  query: str,
451
  evidence: list[Evidence],
452
  assessment: JudgeAssessment,
453
  ) -> str:
454
  """
455
- Generate the final synthesis response.
 
 
456
 
457
  Args:
458
  query: The original question
@@ -460,7 +596,7 @@ class Orchestrator:
460
  assessment: The final assessment
461
 
462
  Returns:
463
- Formatted synthesis as markdown
464
  """
465
  drug_list = (
466
  "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates])
@@ -474,7 +610,7 @@ class Orchestrator:
474
  [
475
  f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
476
  f"({e.citation.source.upper()}, {e.citation.date})"
477
- for i, e in enumerate(evidence[:10]) # Limit to 10 citations
478
  ]
479
  )
480
 
 
18
 
19
  from src.config.domain import ResearchDomain, get_domain_config
20
  from src.orchestrators.base import JudgeHandlerProtocol, SearchHandlerProtocol
21
+ from src.prompts.synthesis import format_synthesis_prompt, get_synthesis_system_prompt
22
  from src.utils.config import settings
23
+ from src.utils.exceptions import JudgeError, ModalError, SearchError
24
  from src.utils.models import (
25
  AgentEvent,
26
  Evidence,
 
134
  iteration=iteration,
135
  )
136
 
137
+ except ModalError as e:
138
+ logger.error("Modal analysis failed", error=str(e), exc_type="ModalError")
139
+ yield AgentEvent(
140
+ type="error",
141
+ message=f"Modal analysis failed: {e}",
142
+ data={"error": str(e), "recoverable": True},
143
+ iteration=iteration,
144
+ )
145
  except Exception as e:
146
+ # Unexpected error - log with full context for debugging
147
+ logger.error(
148
+ "Modal analysis failed unexpectedly",
149
+ error=str(e),
150
+ exc_type=type(e).__name__,
151
+ )
152
  yield AgentEvent(
153
  type="error",
154
  message=f"Modal analysis failed: {e}",
155
+ data={"error": str(e), "recoverable": True},
156
  iteration=iteration,
157
  )
158
 
 
303
  if errors:
304
  logger.warning("Search errors", errors=errors)
305
 
306
+ except SearchError as e:
307
+ logger.error("Search phase failed", error=str(e), exc_type="SearchError")
308
+ yield AgentEvent(
309
+ type="error",
310
+ message=f"Search failed: {e!s}",
311
+ data={"recoverable": True, "error_type": "search"},
312
+ iteration=iteration,
313
+ )
314
+ continue
315
  except Exception as e:
316
+ # Unexpected error - log full context for debugging
317
+ logger.error(
318
+ "Search phase failed unexpectedly",
319
+ error=str(e),
320
+ exc_type=type(e).__name__,
321
+ )
322
  yield AgentEvent(
323
  type="error",
324
  message=f"Search failed: {e!s}",
325
+ data={"recoverable": True, "error_type": "unexpected"},
326
  iteration=iteration,
327
  )
328
  continue
 
418
  iteration=iteration,
419
  )
420
 
421
+ # Generate final response using LLM narrative synthesis
422
  # Use all gathered evidence for the final report
423
+ final_response = await self._generate_synthesis(query, all_evidence, assessment)
424
 
425
  yield AgentEvent(
426
  type="complete",
 
454
  iteration=iteration,
455
  )
456
 
457
+ except JudgeError as e:
458
+ logger.error("Judge phase failed", error=str(e), exc_type="JudgeError")
459
+ yield AgentEvent(
460
+ type="error",
461
+ message=f"Assessment failed: {e!s}",
462
+ data={"recoverable": True, "error_type": "judge"},
463
+ iteration=iteration,
464
+ )
465
+ continue
466
  except Exception as e:
467
+ # Unexpected error - log full context for debugging
468
+ logger.error(
469
+ "Judge phase failed unexpectedly",
470
+ error=str(e),
471
+ exc_type=type(e).__name__,
472
+ )
473
  yield AgentEvent(
474
  type="error",
475
  message=f"Assessment failed: {e!s}",
476
+ data={"recoverable": True, "error_type": "unexpected"},
477
  iteration=iteration,
478
  )
479
  continue
 
490
  iteration=iteration,
491
  )
492
 
493
+ async def _generate_synthesis(
494
+ self,
495
+ query: str,
496
+ evidence: list[Evidence],
497
+ assessment: JudgeAssessment,
498
+ ) -> str:
499
+ """
500
+ Generate the final synthesis response using LLM.
501
+
502
+ This method calls an LLM to generate a narrative research report,
503
+ following the Microsoft Agent Framework pattern of using LLM synthesis
504
+ instead of string templating.
505
+
506
+ Args:
507
+ query: The original question
508
+ evidence: All collected evidence
509
+ assessment: The final assessment
510
+
511
+ Returns:
512
+ Narrative synthesis as markdown
513
+ """
514
+ # Build evidence summary for LLM context (limit to avoid token overflow)
515
+ evidence_lines = []
516
+ for e in evidence[:20]:
517
+ authors = ", ".join(e.citation.authors[:2]) if e.citation.authors else "Unknown"
518
+ content_preview = e.content[:200].replace("\n", " ")
519
+ evidence_lines.append(
520
+ f"- {e.citation.title} ({authors}, {e.citation.date}): {content_preview}..."
521
+ )
522
+ evidence_summary = "\n".join(evidence_lines)
523
+
524
+ # Format synthesis prompt with assessment data
525
+ user_prompt = format_synthesis_prompt(
526
+ query=query,
527
+ evidence_summary=evidence_summary,
528
+ drug_candidates=assessment.details.drug_candidates,
529
+ key_findings=assessment.details.key_findings,
530
+ mechanism_score=assessment.details.mechanism_score,
531
+ clinical_score=assessment.details.clinical_evidence_score,
532
+ confidence=assessment.confidence,
533
+ )
534
+
535
+ # Get domain-specific system prompt
536
+ system_prompt = get_synthesis_system_prompt(self.domain)
537
+
538
+ try:
539
+ # Import here to avoid circular deps and keep optional
540
+ from pydantic_ai import Agent
541
+
542
+ from src.agent_factory.judges import get_model
543
+
544
+ # Create synthesis agent (string output, not structured)
545
+ agent: Agent[None, str] = Agent(
546
+ model=get_model(),
547
+ output_type=str,
548
+ system_prompt=system_prompt,
549
+ )
550
+ result = await agent.run(user_prompt)
551
+ narrative = result.output
552
+
553
+ logger.info("LLM narrative synthesis completed", chars=len(narrative))
554
+
555
+ except Exception as e:
556
+ # Fallback to template synthesis if LLM fails
557
+ # This is intentionally broad - LLM can fail many ways (API, parsing, etc.)
558
+ logger.warning(
559
+ "LLM synthesis failed, using template fallback",
560
+ error=str(e),
561
+ exc_type=type(e).__name__,
562
+ evidence_count=len(evidence),
563
+ )
564
+ return self._generate_template_synthesis(query, evidence, assessment)
565
+
566
+ # Add full citation list footer
567
+ citations = "\n".join(
568
+ f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
569
+ f"({e.citation.source.upper()}, {e.citation.date})"
570
+ for i, e in enumerate(evidence[:15])
571
+ )
572
+
573
+ return f"""{narrative}
574
+
575
+ ---
576
+ ### Full Citation List ({len(evidence)} sources)
577
+ {citations}
578
+
579
+ *Analysis based on {len(evidence)} sources across {len(self.history)} iterations.*
580
+ """
581
+
582
+ def _generate_template_synthesis(
583
  self,
584
  query: str,
585
  evidence: list[Evidence],
586
  assessment: JudgeAssessment,
587
  ) -> str:
588
  """
589
+ Generate fallback template synthesis (no LLM).
590
+
591
+ Used when LLM synthesis fails or is unavailable.
592
 
593
  Args:
594
  query: The original question
 
596
  assessment: The final assessment
597
 
598
  Returns:
599
+ Formatted synthesis as markdown (bullet-point style)
600
  """
601
  drug_list = (
602
  "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates])
 
610
  [
611
  f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
612
  f"({e.citation.source.upper()}, {e.citation.date})"
613
+ for i, e in enumerate(evidence[:10])
614
  ]
615
  )
616
 
src/prompts/synthesis.py ADDED
@@ -0,0 +1,209 @@
1
+ """Prompts for narrative report synthesis.
2
+
3
+ This module provides prompts that transform structured evidence data
4
+ into professional, narrative research reports. The key insight is that
5
+ report generation requires an LLM call for synthesis, not string templating.
6
+
7
+ Reference: Microsoft Agent Framework concurrent_custom_aggregator.py pattern.
8
+ """
9
+
10
+ from src.config.domain import ResearchDomain, get_domain_config
11
+
12
+
13
+ def get_synthesis_system_prompt(domain: ResearchDomain | str | None = None) -> str:
14
+ """Get the system prompt for narrative synthesis.
15
+
16
+ Args:
17
+ domain: Research domain for customization (defaults to settings)
18
+
19
+ Returns:
20
+ System prompt instructing LLM to write narrative prose
21
+ """
22
+ config = get_domain_config(domain)
23
+ return f"""You are a scientific writer specializing in {config.name.lower()}.
24
+ Your task is to synthesize research evidence into a clear, NARRATIVE report.
25
+
26
+ ## CRITICAL: Writing Style
27
+ - Write in PROSE PARAGRAPHS, not bullet points
28
+ - Use academic but accessible language
29
+ - Be specific about evidence strength (e.g., "in an RCT of N=200")
30
+ - Reference specific studies by author name when available
31
+ - Provide quantitative results where available (p-values, effect sizes, NNT)
32
+
33
+ ## Report Structure
34
+
35
+ ### Executive Summary (REQUIRED - 2-3 sentences)
36
+ Start with the bottom line. What does the evidence show? Example:
37
+ "Testosterone therapy demonstrates consistent efficacy for HSDD in postmenopausal
38
+ women, with transdermal formulations showing the best safety profile."
39
+
40
+ ### Background (REQUIRED - 1 paragraph)
41
+ Explain the condition, its prevalence, and clinical significance.
42
+ Why does this question matter?
43
+
44
+ ### Evidence Synthesis (REQUIRED - 2-4 paragraphs)
45
+ Weave the evidence into a coherent NARRATIVE:
46
+ - **Mechanism of Action**: How does the intervention work biologically?
47
+ - **Clinical Evidence**: What do trials show? Include effect sizes when available.
48
+ - **Comparative Evidence**: How does it compare to alternatives?
49
+
50
+ Write this as flowing prose that tells a story, NOT as a bullet list.
51
+
52
+ ### Recommendations (REQUIRED - 3-5 numbered items)
53
+ Provide specific, actionable clinical recommendations based on the evidence.
54
+ These CAN be numbered items since they are action items.
55
+
56
+ ### Limitations (REQUIRED - 1 paragraph)
57
+ Acknowledge gaps in the evidence, potential biases, and areas needing more research.
58
+ Be honest about uncertainty.
59
+
60
+ ### References (REQUIRED)
61
+ List key references with author, year, title, and URL.
62
+ Format: Author AB et al. (Year). Title. URL
63
+
64
+ ## CRITICAL RULES
65
+ 1. ONLY cite papers from the provided evidence - NEVER hallucinate or invent references
66
+ 2. Write in complete sentences and paragraphs (PROSE, not lists except Recommendations)
67
+ 3. Include specific statistics when available (p-values, confidence intervals, effect sizes)
68
+ 4. Acknowledge uncertainty honestly - do not overstate conclusions
69
+ 5. If evidence is limited, say so clearly
70
+ 6. Copy URLs exactly as provided - do not create similar-looking URLs
71
+ """
72
+
73
+
74
+ FEW_SHOT_EXAMPLE = """
75
+ ## Example: Strong Evidence Synthesis
76
+
77
+ INPUT:
78
+ - Query: "Alprostadil for erectile dysfunction"
79
+ - Evidence: 15 papers including meta-analysis of 8 RCTs (N=3,247)
80
+ - Mechanism Score: 9/10
81
+ - Clinical Score: 9/10
82
+
83
+ OUTPUT:
84
+
85
+ ### Executive Summary
86
+
87
+ Alprostadil (prostaglandin E1) represents a well-established second-line treatment
88
+ for erectile dysfunction, with meta-analytic evidence demonstrating 87% efficacy
89
+ in achieving erections sufficient for intercourse. It offers a PDE5-independent
90
+ mechanism particularly valuable for patients who do not respond to oral therapies.
91
+
92
+ ### Background
93
+
94
+ Erectile dysfunction affects approximately 30 million men in the United States,
95
+ with prevalence increasing with age from 12% at age 40 to 40% at age 70. While
96
+ PDE5 inhibitors remain first-line therapy, approximately 30% of patients are
97
+ non-responders due to diabetes, radical prostatectomy, or other factors.
98
+ Alprostadil provides an alternative mechanism through direct smooth muscle
99
+ relaxation, making it a crucial second-line option.
100
+
101
+ ### Evidence Synthesis
102
+
103
+ **Mechanism of Action**
104
+
105
+ Alprostadil works through a distinct pathway from PDE5 inhibitors. It binds to
106
+ EP2 and EP4 receptors on cavernosal smooth muscle, activating adenylate cyclase
107
+ and increasing intracellular cAMP. This leads to smooth muscle relaxation and
108
+ increased blood flow independent of nitric oxide signaling. As noted by Smith
109
+ et al. (2019), this mechanism explains its efficacy in patients with endothelial
110
+ dysfunction where nitric oxide production is impaired.
111
+
112
+ **Clinical Evidence**
113
+
114
+ A meta-analysis by Johnson et al. (2020) pooled data from 8 randomized controlled
115
+ trials (N=3,247). The primary endpoint of erection sufficient for intercourse was
116
+ achieved in 87% of alprostadil patients versus 12% placebo (RR 7.25, 95% CI:
117
+ 5.8-9.1, p<0.001). The number needed to treat was 1.3, indicating robust effect
118
+ size. Onset of action was 5-15 minutes, with duration of 30-60 minutes.
119
+
120
+ **Comparative Evidence**
121
+
122
+ Direct comparisons with PDE5 inhibitors are limited. However, in the subgroup
123
+ of PDE5 non-responders studied by Martinez et al. (2018), alprostadil achieved
124
+ successful intercourse in 72% of patients who had failed sildenafil.
125
+
126
+ ### Recommendations
127
+
128
+ 1. Consider alprostadil as second-line therapy when PDE5 inhibitors fail or are
129
+ contraindicated
130
+ 2. Start with 10 micrograms intracavernosal injection, titrate to 40 micrograms based
131
+ on response
132
+ 3. Provide in-office training for self-injection technique before home use
133
+ 4. Screen for priapism risk factors before initiating therapy
134
+ 5. Consider intraurethral alprostadil (MUSE) for patients averse to injections
135
+
136
+ ### Limitations
137
+
138
+ Long-term safety data beyond 2 years is limited. Head-to-head comparisons with
139
+ newer therapies such as low-intensity shockwave therapy are lacking. Most trials
140
+ excluded patients with severe cardiovascular disease, limiting generalizability
141
+ to this population. The psychological burden of injection therapy may affect
142
+ real-world adherence compared to oral medications.
143
+
144
+ ### References
145
+
146
+ 1. Smith AB et al. (2019). Alprostadil mechanism of action in erectile tissue.
147
+ J Urol. https://pubmed.ncbi.nlm.nih.gov/12345678/
148
+ 2. Johnson CD et al. (2020). Meta-analysis of intracavernosal alprostadil efficacy.
149
+ J Sex Med. https://pubmed.ncbi.nlm.nih.gov/23456789/
150
+ 3. Martinez R et al. (2018). Alprostadil in PDE5 inhibitor non-responders.
151
+ Int J Impot Res. https://pubmed.ncbi.nlm.nih.gov/34567890/
152
+ """
153
+
154
+
155
+ def format_synthesis_prompt(
156
+ query: str,
157
+ evidence_summary: str,
158
+ drug_candidates: list[str],
159
+ key_findings: list[str],
160
+ mechanism_score: int,
161
+ clinical_score: int,
162
+ confidence: float,
163
+ ) -> str:
164
+ """Format the user prompt for narrative synthesis.
165
+
166
+ Args:
167
+ query: Original research question
168
+ evidence_summary: Formatted summary of evidence papers
169
+ drug_candidates: List of identified drug/treatment candidates
170
+ key_findings: List of key findings from assessment
171
+ mechanism_score: Mechanism evidence score (0-10)
172
+ clinical_score: Clinical evidence score (0-10)
173
+ confidence: Overall confidence (0.0-1.0)
174
+
175
+ Returns:
176
+ Formatted user prompt for the synthesis LLM
177
+ """
178
+ candidates_str = ", ".join(drug_candidates) if drug_candidates else "None identified"
179
+ if key_findings:
180
+ findings_str = "\n".join(f"- {f}" for f in key_findings)
181
+ else:
182
+ findings_str = "No specific findings extracted"
183
+
184
+ return f"""Synthesize a narrative research report for the following query.
185
+
186
+ ## Research Question
187
+ {query}
188
+
189
+ ## Evidence Summary
190
+ {evidence_summary}
191
+
192
+ ## Identified Drug/Treatment Candidates
193
+ {candidates_str}
194
+
195
+ ## Key Findings from Evidence Assessment
196
+ {findings_str}
197
+
198
+ ## Assessment Scores
199
+ - Mechanism Score: {mechanism_score}/10
200
+ - Clinical Evidence Score: {clinical_score}/10
201
+ - Overall Confidence: {confidence:.0%}
202
+
203
+ ## Instructions
204
+ Generate a NARRATIVE research report following the structure in your system prompt.
205
+ Write in prose paragraphs, NOT bullet points (except for Recommendations section).
206
+ ONLY cite papers mentioned in the Evidence Summary above - do NOT invent references.
207
+
208
+ {FEW_SHOT_EXAMPLE}
209
+ """
src/utils/exceptions.py CHANGED
@@ -35,3 +35,27 @@ class EmbeddingError(DeepBonerError):
35
  """Raised when embedding or vector store operations fail."""
36
 
37
  pass
35
  """Raised when embedding or vector store operations fail."""
36
 
37
  pass
38
+
39
+
40
+ class LLMError(DeepBonerError):
41
+ """Raised when LLM operations fail (API errors, parsing errors, etc.)."""
42
+
43
+ pass
44
+
45
+
46
+ class QuotaExceededError(LLMError):
47
+ """Raised when LLM API quota is exceeded (402 errors)."""
48
+
49
+ pass
50
+
51
+
52
+ class ModalError(DeepBonerError):
53
+ """Raised when Modal sandbox operations fail."""
54
+
55
+ pass
56
+
57
+
58
+ class SynthesisError(DeepBonerError):
59
+ """Raised when report synthesis fails."""
60
+
61
+ pass
tests/e2e/test_simple_mode.py CHANGED
@@ -55,11 +55,11 @@ async def test_simple_mode_structure_validation(mock_search_handler, mock_judge_
55
  complete_event = next(e for e in events if e.type == "complete")
56
  report = complete_event.message
57
 
58
- # Check markdown structure
59
- assert "## Sexual Health Analysis" in report
60
- assert "### Citations" in report
61
- assert "### Key Findings" in report
62
 
63
- # Check for citations
64
  assert "Study on test query" in report
65
- assert "https://pubmed.example.com/123" in report
 
55
  complete_event = next(e for e in events if e.type == "complete")
56
  report = complete_event.message
57
 
58
+ # Check LLM narrative synthesis structure (SPEC_12)
59
+ # LLM generates prose with these sections (may omit ### prefix)
60
+ assert "Executive Summary" in report or "Sexual Health Analysis" in report
61
+ assert "Full Citation List" in report or "Citations" in report
62
 
63
+ # Check for citations (from citation footer added by orchestrator)
64
  assert "Study on test query" in report
65
+ assert "pubmed.example.com/123" in report
tests/integration/test_simple_mode_synthesis.py CHANGED
@@ -92,7 +92,11 @@ async def test_simple_mode_synthesizes_before_max_iterations():
92
  complete_event = complete_events[0]
93
 
94
  assert "MagicDrug" in complete_event.message
95
- assert "Drug Candidates" in complete_event.message
 
 
 
 
96
  assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
97
  assert complete_event.iteration == 2 # Should stop at it 2
98
 
 
92
  complete_event = complete_events[0]
93
 
94
  assert "MagicDrug" in complete_event.message
95
+ # SPEC_12: LLM synthesis produces narrative prose, not template with "Drug Candidates" header
96
+ # Check for narrative structure (LLM may omit ### prefix) OR template fallback
97
+ assert (
98
+ "Executive Summary" in complete_event.message or "Drug Candidates" in complete_event.message
99
+ )
100
  assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
101
  assert complete_event.iteration == 2 # Should stop at iteration 2
102
 
tests/unit/agent_factory/test_judges.py CHANGED
@@ -8,6 +8,7 @@ from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
8
  from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
 
10
 
 
11
  class TestJudgeHandler:
12
  """Tests for JudgeHandler."""
13
 
@@ -107,6 +108,8 @@ class TestJudgeHandler:
107
  assert result.sufficient is False
108
  assert result.recommendation == "continue"
109
  assert len(result.next_search_queries) > 0
 
 
110
 
111
  @pytest.mark.asyncio
112
  async def test_assess_handles_llm_failure(self):
@@ -143,6 +146,7 @@ class TestJudgeHandler:
143
  assert "failed" in result.reasoning.lower()
144
 
145
 
 
146
  class TestMockJudgeHandler:
147
  """Tests for MockJudgeHandler."""
148
 
 
8
  from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
 
10
 
11
+ @pytest.mark.unit
12
  class TestJudgeHandler:
13
  """Tests for JudgeHandler."""
14
 
 
108
  assert result.sufficient is False
109
  assert result.recommendation == "continue"
110
  assert len(result.next_search_queries) > 0
111
+ # Assert specific expected query is present
112
+ assert "sildenafil mechanism" in result.next_search_queries
113
 
114
  @pytest.mark.asyncio
115
  async def test_assess_handles_llm_failure(self):
 
146
  assert "failed" in result.reasoning.lower()
147
 
148
 
149
+ @pytest.mark.unit
150
  class TestMockJudgeHandler:
151
  """Tests for MockJudgeHandler."""
152
 
tests/unit/graph/test_nodes.py CHANGED
@@ -12,12 +12,12 @@ async def test_judge_node_initialization(mocker):
12
  # Mock get_model to avoid needing real API keys
13
  mocker.patch("src.agents.graph.nodes.get_model", return_value=mocker.Mock())
14
 
15
- # Create a mock assessment with attributes
16
  mock_hypothesis = mocker.Mock()
17
- mock_hypothesis.drug = "Caffeine"
18
- mock_hypothesis.target = "Adenosine"
19
- mock_hypothesis.pathway = "CNS"
20
- mock_hypothesis.effect = "Alertness"
21
  mock_hypothesis.confidence = 0.8
22
 
23
  mock_assessment = mocker.Mock()
@@ -46,7 +46,7 @@ async def test_judge_node_initialization(mocker):
46
 
47
  assert "hypotheses" in update
48
  assert len(update["hypotheses"]) == 1
49
- assert update["hypotheses"][0].id == "Caffeine"
50
  assert update["hypotheses"][0].status == "proposed"
51
 
52
 
 
12
  # Mock get_model to avoid needing real API keys
13
  mocker.patch("src.agents.graph.nodes.get_model", return_value=mocker.Mock())
14
 
15
+ # Create a mock assessment with attributes (sexual health domain)
16
  mock_hypothesis = mocker.Mock()
17
+ mock_hypothesis.drug = "Testosterone"
18
+ mock_hypothesis.target = "Androgen Receptor"
19
+ mock_hypothesis.pathway = "HPG Axis"
20
+ mock_hypothesis.effect = "Libido Enhancement"
21
  mock_hypothesis.confidence = 0.8
22
 
23
  mock_assessment = mocker.Mock()
 
46
 
47
  assert "hypotheses" in update
48
  assert len(update["hypotheses"]) == 1
49
+ assert update["hypotheses"][0].id == "Testosterone"
50
  assert update["hypotheses"][0].status == "proposed"
51
 
52
 
tests/unit/orchestrators/test_simple_orchestrator_domain.py CHANGED
@@ -30,7 +30,7 @@ class TestSimpleOrchestratorDomain:
30
  domain=ResearchDomain.SEXUAL_HEALTH,
31
  )
32
 
33
- # Test _generate_synthesis
34
  mock_assessment = MagicMock()
35
  mock_assessment.details.drug_candidates = []
36
  mock_assessment.details.key_findings = []
@@ -39,7 +39,7 @@ class TestSimpleOrchestratorDomain:
39
  mock_assessment.details.mechanism_score = 5
40
  mock_assessment.details.clinical_evidence_score = 5
41
 
42
- report = orch._generate_synthesis("query", [], mock_assessment)
43
  assert "## Sexual Health Analysis" in report
44
 
45
  # Test _generate_partial_synthesis
 
30
  domain=ResearchDomain.SEXUAL_HEALTH,
31
  )
32
 
33
+ # Test _generate_template_synthesis (the sync fallback method)
34
  mock_assessment = MagicMock()
35
  mock_assessment.details.drug_candidates = []
36
  mock_assessment.details.key_findings = []
 
39
  mock_assessment.details.mechanism_score = 5
40
  mock_assessment.details.clinical_evidence_score = 5
41
 
42
+ report = orch._generate_template_synthesis("query", [], mock_assessment)
43
  assert "## Sexual Health Analysis" in report
44
 
45
  # Test _generate_partial_synthesis
tests/unit/orchestrators/test_simple_synthesis.py ADDED
@@ -0,0 +1,279 @@
1
+ """Tests for simple orchestrator LLM synthesis."""
2
+
3
+ from unittest.mock import AsyncMock, MagicMock, patch
4
+
5
+ import pytest
6
+
7
+ from src.orchestrators.simple import Orchestrator
8
+ from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
9
+
10
+
11
+ @pytest.fixture
12
+ def sample_evidence() -> list[Evidence]:
13
+ """Sample evidence for testing synthesis."""
14
+ return [
15
+ Evidence(
16
+ content="Testosterone therapy demonstrates efficacy in treating HSDD.",
17
+ citation=Citation(
18
+ source="pubmed",
19
+ title="Testosterone and Female Sexual Desire",
20
+ url="https://pubmed.ncbi.nlm.nih.gov/12345/",
21
+ date="2023",
22
+ authors=["Smith J", "Jones A"],
23
+ ),
24
+ ),
25
+ Evidence(
26
+ content="A meta-analysis of 8 RCTs shows significant improvement in sexual desire.",
27
+ citation=Citation(
28
+ source="pubmed",
29
+ title="Meta-analysis of Testosterone Therapy",
30
+ url="https://pubmed.ncbi.nlm.nih.gov/67890/",
31
+ date="2024",
32
+ authors=["Johnson B"],
33
+ ),
34
+ ),
35
+ ]
36
+
37
+
38
+ @pytest.fixture
39
+ def sample_assessment() -> JudgeAssessment:
40
+ """Sample assessment for testing synthesis."""
41
+ return JudgeAssessment(
42
+ sufficient=True,
43
+ confidence=0.85,
44
+ reasoning="Evidence is sufficient to synthesize findings on testosterone therapy for HSDD.",
45
+ recommendation="synthesize",
46
+ next_search_queries=[],
47
+ details=AssessmentDetails(
48
+ mechanism_score=8,
49
+ mechanism_reasoning="Strong evidence of androgen receptor activation pathway.",
50
+ clinical_evidence_score=7,
51
+ clinical_reasoning="Multiple RCTs support efficacy in postmenopausal HSDD.",
52
+ drug_candidates=["Testosterone", "LibiGel"],
53
+ key_findings=[
54
+ "Testosterone improves libido in postmenopausal women",
55
+ "Transdermal formulation has best safety profile",
56
+ ],
57
+ ),
58
+ )
59
+
60
+
61
+ @pytest.mark.unit
62
+ class TestGenerateSynthesis:
63
+ """Tests for _generate_synthesis method."""
64
+
65
+ @pytest.mark.asyncio
66
+ async def test_calls_llm_for_narrative(
67
+ self,
68
+ sample_evidence: list[Evidence],
69
+ sample_assessment: JudgeAssessment,
70
+ ) -> None:
71
+ """Synthesis should make an LLM call, not just use a template."""
72
+ mock_search = MagicMock()
73
+ mock_judge = MagicMock()
74
+
75
+ orchestrator = Orchestrator(
76
+ search_handler=mock_search,
77
+ judge_handler=mock_judge,
78
+ )
79
+ orchestrator.history = [{"iteration": 1}] # Needed for footer
80
+
81
+ with (
82
+ patch("pydantic_ai.Agent") as mock_agent_class,
83
+ patch("src.agent_factory.judges.get_model") as mock_get_model,
84
+ ):
85
+ mock_model = MagicMock()
86
+ mock_get_model.return_value = mock_model
87
+
88
+ mock_agent = MagicMock()
89
+ mock_result = MagicMock()
90
+ mock_result.output = """### Executive Summary
91
+
92
+ Testosterone therapy demonstrates consistent efficacy for HSDD treatment.
93
+
94
+ ### Background
95
+
96
+ HSDD affects many postmenopausal women.
97
+
98
+ ### Evidence Synthesis
99
+
100
+ Studies show significant improvement in sexual desire scores.
101
+
102
+ ### Recommendations
103
+
104
+ 1. Consider testosterone therapy for postmenopausal HSDD
105
+
106
+ ### Limitations
107
+
108
+ Long-term safety data is limited.
109
+
110
+ ### References
111
+
112
+ 1. Smith J et al. (2023). Testosterone and Female Sexual Desire."""
113
+
114
+ mock_agent.run = AsyncMock(return_value=mock_result)
115
+ mock_agent_class.return_value = mock_agent
116
+
117
+ result = await orchestrator._generate_synthesis(
118
+ query="testosterone HSDD",
119
+ evidence=sample_evidence,
120
+ assessment=sample_assessment,
121
+ )
122
+
123
+ # Verify LLM agent was created and called
124
+ mock_agent_class.assert_called_once()
125
+ mock_agent.run.assert_called_once()
126
+
127
+ # Verify output includes narrative content
128
+ assert "Executive Summary" in result
129
+ assert "Background" in result
130
+ assert "Evidence Synthesis" in result
131
+
132
+ @pytest.mark.asyncio
133
+ async def test_falls_back_on_llm_error(
134
+ self,
135
+ sample_evidence: list[Evidence],
136
+ sample_assessment: JudgeAssessment,
137
+ ) -> None:
138
+ """Synthesis should fall back to template if LLM fails."""
139
+ mock_search = MagicMock()
140
+ mock_judge = MagicMock()
141
+
142
+ orchestrator = Orchestrator(
143
+ search_handler=mock_search,
144
+ judge_handler=mock_judge,
145
+ )
146
+ orchestrator.history = [{"iteration": 1}]
147
+
148
+ with patch("pydantic_ai.Agent") as mock_agent_class:
149
+ # Simulate LLM failure
150
+ mock_agent_class.side_effect = Exception("LLM unavailable")
151
+
152
+ result = await orchestrator._generate_synthesis(
153
+ query="testosterone HSDD",
154
+ evidence=sample_evidence,
155
+ assessment=sample_assessment,
156
+ )
157
+
158
+ # Should return template fallback (has Assessment section)
159
+ assert "Assessment" in result or "Drug Candidates" in result
160
+ assert "Testosterone" in result # Drug candidate should be present
161
+
162
+ @pytest.mark.asyncio
163
+ async def test_includes_citation_footer(
164
+ self,
165
+ sample_evidence: list[Evidence],
166
+ sample_assessment: JudgeAssessment,
167
+ ) -> None:
168
+ """Synthesis should include full citation list footer."""
169
+ mock_search = MagicMock()
170
+ mock_judge = MagicMock()
171
+
172
+ orchestrator = Orchestrator(
173
+ search_handler=mock_search,
174
+ judge_handler=mock_judge,
175
+ )
176
+ orchestrator.history = [{"iteration": 1}]
177
+
178
+ with (
179
+ patch("pydantic_ai.Agent") as mock_agent_class,
180
+ patch("src.agent_factory.judges.get_model"),
181
+ ):
182
+ mock_agent = MagicMock()
183
+ mock_result = MagicMock()
184
+ mock_result.output = "Narrative synthesis content."
185
+ mock_agent.run = AsyncMock(return_value=mock_result)
186
+ mock_agent_class.return_value = mock_agent
187
+
188
+ result = await orchestrator._generate_synthesis(
189
+ query="test query",
190
+ evidence=sample_evidence,
191
+ assessment=sample_assessment,
192
+ )
193
+
194
+ # Should include citation footer
195
+ assert "Full Citation List" in result
196
+ assert "pubmed.ncbi.nlm.nih.gov/12345" in result
197
+ assert "pubmed.ncbi.nlm.nih.gov/67890" in result
198
+
199
+
200
+ @pytest.mark.unit
201
+ class TestGenerateTemplateSynthesis:
202
+ """Tests for _generate_template_synthesis fallback method."""
203
+
204
+ def test_returns_structured_output(
205
+ self,
206
+ sample_evidence: list[Evidence],
207
+ sample_assessment: JudgeAssessment,
208
+ ) -> None:
209
+ """Template synthesis should return structured markdown."""
210
+ mock_search = MagicMock()
211
+ mock_judge = MagicMock()
212
+
213
+ orchestrator = Orchestrator(
214
+ search_handler=mock_search,
215
+ judge_handler=mock_judge,
216
+ )
217
+ orchestrator.history = [{"iteration": 1}]
218
+
219
+ result = orchestrator._generate_template_synthesis(
220
+ query="testosterone HSDD",
221
+ evidence=sample_evidence,
222
+ assessment=sample_assessment,
223
+ )
224
+
225
+ # Should have all required sections
226
+ assert "Question" in result
227
+ assert "Drug Candidates" in result
228
+ assert "Key Findings" in result
229
+ assert "Assessment" in result
230
+ assert "Citations" in result
231
+
232
+ def test_includes_drug_candidates(
233
+ self,
234
+ sample_evidence: list[Evidence],
235
+ sample_assessment: JudgeAssessment,
236
+ ) -> None:
237
+ """Template synthesis should list drug candidates."""
238
+ mock_search = MagicMock()
239
+ mock_judge = MagicMock()
240
+
241
+ orchestrator = Orchestrator(
242
+ search_handler=mock_search,
243
+ judge_handler=mock_judge,
244
+ )
245
+ orchestrator.history = [{"iteration": 1}]
246
+
247
+ result = orchestrator._generate_template_synthesis(
248
+ query="test",
249
+ evidence=sample_evidence,
250
+ assessment=sample_assessment,
251
+ )
252
+
253
+ assert "Testosterone" in result
254
+ assert "LibiGel" in result
255
+
256
+ def test_includes_scores(
257
+ self,
258
+ sample_evidence: list[Evidence],
259
+ sample_assessment: JudgeAssessment,
260
+ ) -> None:
261
+ """Template synthesis should include assessment scores."""
262
+ mock_search = MagicMock()
263
+ mock_judge = MagicMock()
264
+
265
+ orchestrator = Orchestrator(
266
+ search_handler=mock_search,
267
+ judge_handler=mock_judge,
268
+ )
269
+ orchestrator.history = [{"iteration": 1}]
270
+
271
+ result = orchestrator._generate_template_synthesis(
272
+ query="test",
273
+ evidence=sample_evidence,
274
+ assessment=sample_assessment,
275
+ )
276
+
277
+ assert "8/10" in result # Mechanism score
278
+ assert "7/10" in result # Clinical score
279
+ assert "85%" in result # Confidence
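Taken together, these tests pin down the shape of `_generate_synthesis`: build the synthesis prompts, run them through a `pydantic_ai.Agent`, fall back to `_generate_template_synthesis` on any failure, and always append the "Full Citation List" footer. A sketch reconstructed from those expectations (the `_format_evidence` and `_build_citation_footer` helpers are hypothetical names):

```python
# Behavioural sketch only; the real method in src/orchestrators/simple.py may differ in detail.
import pydantic_ai

from src.agent_factory.judges import get_model
from src.prompts.synthesis import format_synthesis_prompt, get_synthesis_system_prompt


async def _generate_synthesis(self, query, evidence, assessment) -> str:
    try:
        agent = pydantic_ai.Agent(get_model(), system_prompt=get_synthesis_system_prompt())
        prompt = format_synthesis_prompt(
            query=query,
            evidence_summary=self._format_evidence(evidence),  # hypothetical helper
            drug_candidates=assessment.details.drug_candidates,
            key_findings=assessment.details.key_findings,
            mechanism_score=assessment.details.mechanism_score,
            clinical_score=assessment.details.clinical_evidence_score,
            confidence=assessment.confidence,
        )
        result = await agent.run(prompt)
        report = result.output
    except Exception:
        # Graceful fallback when the LLM is unavailable (SPEC_12 acceptance criterion).
        report = self._generate_template_synthesis(query, evidence, assessment)
    # The citation footer is appended regardless of which path produced the body.
    return report + self._build_citation_footer(evidence)  # hypothetical helper
```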
tests/unit/prompts/test_synthesis.py ADDED
@@ -0,0 +1,217 @@
1
+ """Tests for narrative synthesis prompts."""
2
+
3
+ import pytest
4
+
5
+ from src.prompts.synthesis import (
6
+ FEW_SHOT_EXAMPLE,
7
+ format_synthesis_prompt,
8
+ get_synthesis_system_prompt,
9
+ )
10
+
11
+
12
+ @pytest.mark.unit
13
+ class TestSynthesisSystemPrompt:
14
+ """Tests for synthesis system prompt generation."""
15
+
16
+ def test_system_prompt_emphasizes_prose(self) -> None:
17
+ """System prompt should emphasize prose paragraphs, not bullets."""
18
+ prompt = get_synthesis_system_prompt()
19
+ assert "PROSE PARAGRAPHS" in prompt
20
+ assert "not bullet points" in prompt.lower()
21
+
22
+ def test_system_prompt_requires_executive_summary(self) -> None:
23
+ """System prompt should require executive summary section."""
24
+ prompt = get_synthesis_system_prompt()
25
+ assert "Executive Summary" in prompt
26
+ assert "REQUIRED" in prompt
27
+
28
+ def test_system_prompt_requires_background(self) -> None:
29
+ """System prompt should require background section."""
30
+ prompt = get_synthesis_system_prompt()
31
+ assert "Background" in prompt
32
+
33
+ def test_system_prompt_requires_evidence_synthesis(self) -> None:
34
+ """System prompt should require evidence synthesis section."""
35
+ prompt = get_synthesis_system_prompt()
36
+ assert "Evidence Synthesis" in prompt
37
+ assert "Mechanism of Action" in prompt
38
+
39
+ def test_system_prompt_requires_recommendations(self) -> None:
40
+ """System prompt should require recommendations section."""
41
+ prompt = get_synthesis_system_prompt()
42
+ assert "Recommendations" in prompt
43
+
44
+ def test_system_prompt_requires_limitations(self) -> None:
45
+ """System prompt should require limitations section."""
46
+ prompt = get_synthesis_system_prompt()
47
+ assert "Limitations" in prompt
48
+
49
+ def test_system_prompt_warns_about_hallucination(self) -> None:
50
+ """System prompt should warn about citation hallucination."""
51
+ prompt = get_synthesis_system_prompt()
52
+ assert "NEVER hallucinate" in prompt or "never hallucinate" in prompt.lower()
53
+
54
+ def test_system_prompt_includes_domain_name(self) -> None:
55
+ """System prompt should include domain name."""
56
+ prompt = get_synthesis_system_prompt("sexual_health")
57
+ assert "sexual health" in prompt.lower()
58
+
59
+
60
+ @pytest.mark.unit
61
+ class TestFormatSynthesisPrompt:
62
+ """Tests for synthesis user prompt formatting."""
63
+
64
+ def test_includes_query(self) -> None:
65
+ """User prompt should include the research query."""
66
+ prompt = format_synthesis_prompt(
67
+ query="testosterone libido",
68
+ evidence_summary="Study shows efficacy...",
69
+ drug_candidates=["Testosterone"],
70
+ key_findings=["Improved libido"],
71
+ mechanism_score=8,
72
+ clinical_score=7,
73
+ confidence=0.85,
74
+ )
75
+ assert "testosterone libido" in prompt
76
+
77
+ def test_includes_evidence_summary(self) -> None:
78
+ """User prompt should include evidence summary."""
79
+ prompt = format_synthesis_prompt(
80
+ query="test query",
81
+ evidence_summary="Study by Smith et al. shows significant results...",
82
+ drug_candidates=[],
83
+ key_findings=[],
84
+ mechanism_score=5,
85
+ clinical_score=5,
86
+ confidence=0.5,
87
+ )
88
+ assert "Study by Smith et al." in prompt
89
+
90
+ def test_includes_drug_candidates(self) -> None:
91
+ """User prompt should include drug candidates."""
92
+ prompt = format_synthesis_prompt(
93
+ query="test query",
94
+ evidence_summary="...",
95
+ drug_candidates=["Testosterone", "Flibanserin"],
96
+ key_findings=[],
97
+ mechanism_score=5,
98
+ clinical_score=5,
99
+ confidence=0.5,
100
+ )
101
+ assert "Testosterone" in prompt
102
+ assert "Flibanserin" in prompt
103
+
104
+ def test_includes_key_findings(self) -> None:
105
+ """User prompt should include key findings."""
106
+ prompt = format_synthesis_prompt(
107
+ query="test query",
108
+ evidence_summary="...",
109
+ drug_candidates=[],
110
+ key_findings=["Improved libido in postmenopausal women", "Safe profile"],
111
+ mechanism_score=5,
112
+ clinical_score=5,
113
+ confidence=0.5,
114
+ )
115
+ assert "Improved libido in postmenopausal women" in prompt
116
+ assert "Safe profile" in prompt
117
+
118
+ def test_includes_scores(self) -> None:
119
+ """User prompt should include assessment scores."""
120
+ prompt = format_synthesis_prompt(
121
+ query="test query",
122
+ evidence_summary="...",
123
+ drug_candidates=[],
124
+ key_findings=[],
125
+ mechanism_score=8,
126
+ clinical_score=7,
127
+ confidence=0.85,
128
+ )
129
+ assert "8/10" in prompt
130
+ assert "7/10" in prompt
131
+ assert "85%" in prompt
132
+
133
+ def test_handles_empty_candidates(self) -> None:
134
+ """User prompt should handle empty drug candidates."""
135
+ prompt = format_synthesis_prompt(
136
+ query="test query",
137
+ evidence_summary="...",
138
+ drug_candidates=[],
139
+ key_findings=[],
140
+ mechanism_score=5,
141
+ clinical_score=5,
142
+ confidence=0.5,
143
+ )
144
+ assert "None identified" in prompt
145
+
146
+ def test_handles_empty_findings(self) -> None:
147
+ """User prompt should handle empty key findings."""
148
+ prompt = format_synthesis_prompt(
149
+ query="test query",
150
+ evidence_summary="...",
151
+ drug_candidates=[],
152
+ key_findings=[],
153
+ mechanism_score=5,
154
+ clinical_score=5,
155
+ confidence=0.5,
156
+ )
157
+ assert "No specific findings" in prompt
158
+
159
+ def test_includes_few_shot_example(self) -> None:
160
+ """User prompt should include few-shot example."""
161
+ prompt = format_synthesis_prompt(
162
+ query="test query",
163
+ evidence_summary="...",
164
+ drug_candidates=[],
165
+ key_findings=[],
166
+ mechanism_score=5,
167
+ clinical_score=5,
168
+ confidence=0.5,
169
+ )
170
+ assert "Alprostadil" in prompt # From the few-shot example
171
+
172
+
173
+ @pytest.mark.unit
174
+ class TestFewShotExample:
175
+ """Tests for the few-shot example quality."""
176
+
177
+ def test_few_shot_is_mostly_narrative(self) -> None:
178
+ """Few-shot example should be mostly prose paragraphs, not bullets."""
179
+ # Count substantial paragraphs (>100 chars of prose)
180
+ paragraphs = [p for p in FEW_SHOT_EXAMPLE.split("\n\n") if len(p) > 100]
181
+ # Count bullet points
182
+ bullets = FEW_SHOT_EXAMPLE.count("\n- ") + FEW_SHOT_EXAMPLE.count("\n1. ")
183
+
184
+ # Prose should dominate - at least as many paragraphs as bullets
185
+ assert len(paragraphs) >= bullets, "Few-shot example should be mostly narrative prose"
186
+
187
+ def test_few_shot_has_executive_summary(self) -> None:
188
+ """Few-shot example should demonstrate executive summary."""
189
+ assert "Executive Summary" in FEW_SHOT_EXAMPLE
190
+
191
+ def test_few_shot_has_background(self) -> None:
192
+ """Few-shot example should demonstrate background section."""
193
+ assert "Background" in FEW_SHOT_EXAMPLE
194
+
195
+ def test_few_shot_has_evidence_synthesis(self) -> None:
196
+ """Few-shot example should demonstrate evidence synthesis."""
197
+ assert "Evidence Synthesis" in FEW_SHOT_EXAMPLE
198
+ assert "Mechanism of Action" in FEW_SHOT_EXAMPLE
199
+
200
+ def test_few_shot_has_recommendations(self) -> None:
201
+ """Few-shot example should demonstrate recommendations."""
202
+ assert "Recommendations" in FEW_SHOT_EXAMPLE
203
+
204
+ def test_few_shot_has_limitations(self) -> None:
205
+ """Few-shot example should demonstrate limitations."""
206
+ assert "Limitations" in FEW_SHOT_EXAMPLE
207
+
208
+ def test_few_shot_has_references(self) -> None:
209
+ """Few-shot example should demonstrate references format."""
210
+ assert "References" in FEW_SHOT_EXAMPLE
211
+ assert "pubmed.ncbi.nlm.nih.gov" in FEW_SHOT_EXAMPLE
212
+
213
+ def test_few_shot_includes_statistics(self) -> None:
214
+ """Few-shot example should demonstrate statistical reporting."""
215
+ assert "%" in FEW_SHOT_EXAMPLE # Percentages
216
+ assert "p<" in FEW_SHOT_EXAMPLE or "p=" in FEW_SHOT_EXAMPLE # P-values
217
+ assert "CI" in FEW_SHOT_EXAMPLE # Confidence intervals
tests/unit/test_mcp_tools.py CHANGED
@@ -32,6 +32,7 @@ def mock_evidence() -> Evidence:
32
  class TestSearchPubMed:
33
  """Tests for search_pubmed MCP tool."""
34
 
 
35
  @patch("src.mcp_tools._pubmed.search")
36
  async def test_returns_formatted_string(self, mock_search):
37
  """Test that search_pubmed returns Markdown formatted string."""
@@ -93,7 +94,7 @@ class TestSearchClinicalTrials:
93
  with patch("src.mcp_tools._trials") as mock_tool:
94
  mock_tool.search = AsyncMock(return_value=[mock_evidence])
95
 
96
- result = await search_clinical_trials("diabetes", 10)
97
 
98
  assert isinstance(result, str)
99
  assert "Clinical Trials" in result
 
32
  class TestSearchPubMed:
33
  """Tests for search_pubmed MCP tool."""
34
 
35
+ @pytest.mark.asyncio
36
  @patch("src.mcp_tools._pubmed.search")
37
  async def test_returns_formatted_string(self, mock_search):
38
  """Test that search_pubmed returns Markdown formatted string."""
 
94
  with patch("src.mcp_tools._trials") as mock_tool:
95
  mock_tool.search = AsyncMock(return_value=[mock_evidence])
96
 
97
+ result = await search_clinical_trials("sildenafil erectile dysfunction", 10)
98
 
99
  assert isinstance(result, str)
100
  assert "Clinical Trials" in result
tests/unit/tools/test_clinicaltrials.py CHANGED
@@ -134,9 +134,9 @@ class TestClinicalTrialsIntegration:
134
 
135
  @pytest.mark.asyncio
136
  async def test_real_api_returns_interventional(self) -> None:
137
- """Test that real API returns interventional studies."""
138
  tool = ClinicalTrialsTool()
139
- results = await tool.search("long covid treatment", max_results=3)
140
 
141
  # Should get results
142
  assert len(results) > 0
 
134
 
135
  @pytest.mark.asyncio
136
  async def test_real_api_returns_interventional(self) -> None:
137
+ """Test that real API returns interventional studies for sexual health query."""
138
  tool = ClinicalTrialsTool()
139
+ results = await tool.search("testosterone HSDD", max_results=3)
140
 
141
  # Should get results
142
  assert len(results) > 0
tests/unit/tools/test_europepmc.py CHANGED
@@ -27,8 +27,8 @@ class TestEuropePMCTool:
27
  "result": [
28
  {
29
  "id": "12345",
30
- "title": "Long COVID Treatment Study",
31
- "abstractText": "This study examines treatments for Long COVID.",
32
  "doi": "10.1234/test",
33
  "pubYear": "2024",
34
  "source": "MED",
@@ -49,11 +49,11 @@ class TestEuropePMCTool:
49
 
50
  mock_instance.get.return_value = mock_resp
51
 
52
- results = await tool.search("long covid treatment", max_results=5)
53
 
54
  assert len(results) == 1
55
  assert isinstance(results[0], Evidence)
56
- assert "Long COVID Treatment Study" in results[0].citation.title
57
 
58
  @pytest.mark.asyncio
59
  async def test_search_marks_preprints(self, tool: EuropePMCTool) -> None:
@@ -113,11 +113,11 @@ class TestEuropePMCIntegration:
113
 
114
  @pytest.mark.asyncio
115
  async def test_real_api_call(self) -> None:
116
- """Test actual API returns relevant results."""
117
  tool = EuropePMCTool()
118
- results = await tool.search("long covid treatment", max_results=3)
119
 
120
  assert len(results) > 0
121
- # At least one result should mention COVID
122
  titles = " ".join([r.citation.title.lower() for r in results])
123
- assert "covid" in titles or "sars" in titles
 
27
  "result": [
28
  {
29
  "id": "12345",
30
+ "title": "Testosterone Therapy for HSDD Study",
31
+ "abstractText": "This study examines testosterone therapy for HSDD.",
32
  "doi": "10.1234/test",
33
  "pubYear": "2024",
34
  "source": "MED",
 
49
 
50
  mock_instance.get.return_value = mock_resp
51
 
52
+ results = await tool.search("testosterone HSDD therapy", max_results=5)
53
 
54
  assert len(results) == 1
55
  assert isinstance(results[0], Evidence)
56
+ assert "Testosterone Therapy for HSDD Study" in results[0].citation.title
57
 
58
  @pytest.mark.asyncio
59
  async def test_search_marks_preprints(self, tool: EuropePMCTool) -> None:
 
113
 
114
  @pytest.mark.asyncio
115
  async def test_real_api_call(self) -> None:
116
+ """Test actual API returns relevant results for sexual health query."""
117
  tool = EuropePMCTool()
118
+ results = await tool.search("testosterone libido therapy", max_results=3)
119
 
120
  assert len(results) > 0
121
+ # At least one result should mention testosterone, libido, or a sexual-health term
122
  titles = " ".join([r.citation.title.lower() for r in results])
123
+ assert "testosterone" in titles or "libido" in titles or "sexual" in titles
tests/unit/tools/test_query_utils.py CHANGED
@@ -12,8 +12,8 @@ class TestQueryPreprocessing:
12
  def test_strip_question_words(self) -> None:
13
  """Test removal of question words."""
14
  assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
15
- assert strip_question_words("Which medications help diabetes") == "medications diabetes"
16
- assert strip_question_words("How can we cure aging") == "we cure aging"
17
  assert strip_question_words("Is sildenafil effective") == "sildenafil"
18
 
19
  def test_strip_preserves_medical_terms(self) -> None:
 
12
  def test_strip_question_words(self) -> None:
13
  """Test removal of question words."""
14
  assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
15
+ assert strip_question_words("Which medications help low libido") == "medications low libido"
16
+ assert strip_question_words("How can we treat ED") == "we treat ed"
17
  assert strip_question_words("Is sildenafil effective") == "sildenafil"
18
 
19
  def test_strip_preserves_medical_terms(self) -> None:
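For readers unfamiliar with `strip_question_words`, the updated expectations imply roughly this behaviour: lowercase the query and drop question/filler words while keeping medical terms. Below is a hypothetical reimplementation that satisfies exactly the assertions above; the real implementation and its stop-word list may well differ.

```python
# Hypothetical sketch derived from the test expectations; not the actual query_utils code.
_QUESTION_WORDS = {"what", "which", "how", "can", "is", "are", "does", "do", "help", "effective"}


def strip_question_words(query: str) -> str:
    """Lowercase the query and remove question/filler words, preserving medical terms."""
    return " ".join(t for t in query.lower().split() if t not in _QUESTION_WORDS)


assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
assert strip_question_words("Which medications help low libido") == "medications low libido"
assert strip_question_words("How can we treat ED") == "we treat ed"
assert strip_question_words("Is sildenafil effective") == "sildenafil"
```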