fix(SPEC_11): address CodeRabbit review feedback (#92)
* fix(SPEC_11): address CodeRabbit review feedback
- Update run_full.py docstring to reference sexual health pipeline (not drug repurposing)
- Update run_full.py help text to use sexual health query example
- Fix app.py domain display to show "Sexual Health" (not "Sexual_Health")
- Update test_nodes.py mock data to use sexual health terms (Testosterone, Androgen Receptor)
- Add @pytest.mark.unit markers to test_judges.py test classes
- Add specific assertion for next_search_queries in test_judges.py
- Add missing @pytest.mark.asyncio marker to test_mcp_tools.py
- Update test_mcp_tools.py and test_clinicaltrials.py to use sexual health queries
- Fix SPEC_12 markdown bare code fence (MD040)
* fix(SPEC_11): comprehensive domain alignment audit
Second pass of CodeRabbit review fixes - found additional domain mismatches:
- examples/search_demo/run_search.py: Update docstring from "drug repurposing" to "sexual health"
- examples/orchestrator_demo/run_agent.py: Update help text from "metformin cancer" to "testosterone libido"
- src/agents/tools.py: Update search_preprints example from "long covid" to "flibanserin HSDD preprint"
- tests/unit/tools/test_europepmc.py: Replace "Long COVID" mock data and queries with testosterone/HSDD
- tests/unit/tools/test_query_utils.py: Replace "diabetes" and "aging" examples with sexual health terms
All examples, demos, and tests now consistently use sexual health domain examples.
* feat(SPEC_12): implement narrative report synthesis using LLM
Transform report generation from string templating to LLM-based narrative
synthesis, following Microsoft Agent Framework aggregator pattern.
New files:
- src/prompts/synthesis.py: Narrative synthesis prompts with few-shot example
- get_synthesis_system_prompt(): Domain-aware narrative writing instructions
- format_synthesis_prompt(): Formats evidence/assessment for LLM
- FEW_SHOT_EXAMPLE: Alprostadil ED example demonstrating prose style
- tests/unit/prompts/test_synthesis.py: 20 tests for synthesis prompts
- Verify emphasis on prose, not bullets
- Verify required sections (Executive Summary, Background, etc.)
- Verify anti-hallucination warnings
- Verify few-shot example quality
- tests/unit/orchestrators/test_simple_synthesis.py: 6 tests for orchestrator
- Test LLM agent is called for synthesis
- Test graceful fallback on LLM failure
- Test citation footer inclusion
Modified files:
- src/orchestrators/simple.py:
- Add async _generate_synthesis() that calls LLM for narrative prose
- Rename old method to _generate_template_synthesis() as fallback
- Update call site at line 394 to await the async synthesis
- tests/unit/orchestrators/test_simple_orchestrator_domain.py:
- Update test to use _generate_template_synthesis() (sync fallback)
This implements SPEC_12 acceptance criteria:
- Report contains paragraph-form prose, not just bullet points
- Report has Executive Summary, Background, Evidence Synthesis sections
- Report has actionable Recommendations and Limitations
- Citations properly formatted with author/year/title/URL
- Graceful fallback if LLM unavailable
- All 256 tests pass
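The graceful-fallback criterion above can be sketched as a standalone pattern. This is a minimal illustration, not the repo's actual API: the function names and signatures here are stand-ins for `_generate_synthesis()` / `_generate_template_synthesis()`.

```python
import asyncio
from collections.abc import Awaitable, Callable

async def synthesize(
    call_llm: Callable[[str], Awaitable[str]],
    template_fallback: Callable[[str], str],
    payload: str,
) -> str:
    """Try LLM narrative synthesis; fall back to the string template on any failure."""
    try:
        return await call_llm(payload)
    except Exception:
        # Graceful degradation: the report is still produced, just less polished
        return template_fallback(payload)

async def demo() -> str:
    async def failing_llm(_: str) -> str:
        raise RuntimeError("LLM unavailable")
    return await synthesize(failing_llm, lambda p: f"[template report for {p}]", "HSDD query")

print(asyncio.run(demo()))  # falls back to the template path
```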
* refactor: enhance exception handling and type safety
- Add new exception types: LLMError, QuotaExceededError, ModalError, SynthesisError
- Update orchestrator to catch specific exception types (SearchError, JudgeError, ModalError)
- Add exc_type logging for better debugging and observability
- Fix app.py type safety with OrchestratorMode literal type
- Add mode validation for Gradio string inputs
- Remove unnecessary type: ignore comment in app.py
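A minimal sketch of the catch-specific-types-and-log-`exc_type` pattern this commit describes. The exception names come from the commit message; the class bodies and the handler shape are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for the project's exception taxonomy
class SearchError(Exception): ...
class JudgeError(Exception): ...
class ModalError(Exception): ...

def run_step(step):
    """Run one orchestrator step, logging the concrete exception type on failure."""
    try:
        return step()
    except (SearchError, JudgeError, ModalError) as e:
        # exc_type makes log lines greppable by failure class
        logger.error("step failed: %s (exc_type=%s)", e, type(e).__name__)
        return None

def boom():
    raise SearchError("PubMed timeout")

print(run_step(boom))  # None: the error is logged, not propagated
```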
* docs: add embeddings and meta-agent architecture brainstorm
Research and first-principles analysis covering:
- Embedding service comparison (FAISS, ChromaDB, Voyage AI, MixedBread)
- Selective vs full codebase embedding (selective wins)
- Meta question: would self-knowledge help agents?
- Implementation patterns for codebase RAG
- Recommended roadmap for developer tooling
* docs: add reality check section to embeddings brainstorm
Distinguish real vs vaporware based on web research:
- Cursor's @codebase: REAL, production (embeddings + Turbopuffer)
- Claude Code: grep-only, no semantic search natively
- MCP servers (claude-context, code-index-mcp): REAL but with bugs
- "Self-aware agents" claims: mostly vaporware
Key insight: For AI-native devs, the real opportunity is MCP servers
that give Claude Code semantic search, not embedding the codebase
for agent self-understanding.
* docs: deep dive on internal organ vs external tool
First-principles analysis with empirical research:
- SICA (ICLR 2025): 17-53% gains from self-improvement
- Gödel Agent (ACL 2025): recursive self-modification works
- Introspection paradox: self-knowledge can HURT strong models
- Anthropic research: ~20% accuracy on genuine introspection
Key conclusion: For DeepBoner's core task (research), internal
self-knowledge organ = overhead with negative ROI. The agent
doesn't need to understand its code to search PubMed.
External tools help DEVELOPMENT. Internal organs help EXPERIMENTATION.
Neither helps the research task itself.
* docs: critical tool analysis and embeddings conclusion
New: TOOL_ANALYSIS_CRITICAL.md
- Deep analysis of all 4 search tools (PubMed, ClinicalTrials, Europe PMC, OpenAlex)
- API limitations and what's actually possible
- Identified critical gaps: deduplication, outcome measures, citation traversal
- Priority improvements without horizontal sprawl
- Neo4j recommendation: not yet, use OpenAlex API first
Updated: BRAINSTORM_EMBEDDINGS_META.md
- Condensed to conclusions only
- Closed: internal embeddings/mGREP not needed for this use case
- Focus on research evidence retrieval, not codebase self-knowledge
* test: update e2e/integration tests for SPEC_12 LLM synthesis format
Tests were asserting OLD template format ("## Sexual Health Analysis",
"### Drug Candidates") but SPEC_12 implementation uses LLM-generated
narrative prose with different section headers ("Executive Summary",
"Background", "Evidence Synthesis", etc.)
Updated assertions to accept both formats for backwards compatibility.
* docs: add language identifier to code fence (MD040)
---------
Co-authored-by: Claude <noreply@anthropic.com>
- BRAINSTORM_EMBEDDINGS_META.md +74 -0
- SPEC_12_NARRATIVE_SYNTHESIS.md +1 -1
- TOOL_ANALYSIS_CRITICAL.md +348 -0
- examples/full_stack_demo/run_full.py +2 -2
- examples/orchestrator_demo/run_agent.py +1 -1
- examples/search_demo/run_search.py +1 -1
- src/agent_factory/judges.py +7 -1
- src/agents/tools.py +1 -1
- src/app.py +16 -8
- src/middleware/sub_iteration.py +14 -2
- src/orchestrators/simple.py +146 -10
- src/prompts/synthesis.py +209 -0
- src/utils/exceptions.py +24 -0
- tests/e2e/test_simple_mode.py +6 -6
- tests/integration/test_simple_mode_synthesis.py +5 -1
- tests/unit/agent_factory/test_judges.py +4 -0
- tests/unit/graph/test_nodes.py +6 -6
- tests/unit/orchestrators/test_simple_orchestrator_domain.py +2 -2
- tests/unit/orchestrators/test_simple_synthesis.py +279 -0
- tests/unit/prompts/test_synthesis.py +217 -0
- tests/unit/test_mcp_tools.py +2 -1
- tests/unit/tools/test_clinicaltrials.py +2 -2
- tests/unit/tools/test_europepmc.py +8 -8
- tests/unit/tools/test_query_utils.py +2 -2
BRAINSTORM_EMBEDDINGS_META.md (new file)
@@ -0,0 +1,74 @@
# Embeddings Brainstorm - Conclusions

**Date**: November 2025
**Status**: CLOSED - Conclusions reached, no action needed

---

## The Question

Should DeepBoner implement:

1. Internal codebase embeddings/ingestion pipeline?
2. mGREP for internal tool selection?
3. Self-knowledge components for agents?

## The Answer: NO

After research and first-principles analysis, the conclusion is clear:

### Why Not Internal Embeddings/Ingestion

```text
DeepBoner's Core Task:
┌─────────────────────────────────────────────────────────┐
│ User Query: "Evidence for testosterone in HSDD?"        │
│                        ↓                                │
│ 1. Search PubMed, ClinicalTrials, Europe PMC            │
│ 2. Judge: Is evidence sufficient?                       │
│ 3. Synthesize: Generate report                          │
│                        ↓                                │
│ Output: Research report with citations                  │
└─────────────────────────────────────────────────────────┘

Does ANY step require self-knowledge of codebase? NO.
```

### Why Not mGREP for Tool Selection

| Approach | Complexity | Accuracy |
|----------|------------|----------|
| Embeddings + mGREP for tool selection | High | Medium (semantic similarity ≠ correct tool) |
| Direct prompting with tool descriptions | Low | High (LLM reasons about applicability) |

**No real agent system uses embeddings for tool selection.** All major frameworks (LangChain, OpenAI, Anthropic, Magentic) use prompt-based tool selection because:

1. LLMs are already doing semantic matching internally
2. Tool count is small (5-20) - fits easily in context
3. Prompts allow reasoning, not just similarity

### What We Already Have

DeepBoner already uses embeddings for the **right thing**: research evidence retrieval.

- `src/services/embeddings.py` - ChromaDB + sentence-transformers
- `src/services/llamaindex_rag.py` - OpenAI embeddings for premium tier

### The Real Priority

Instead of internal embeddings/mGREP, focus on:

1. **Deduplication** across PubMed/Europe PMC/OpenAlex
2. **Outcome measures** from ClinicalTrials.gov
3. **Citation graph traversal** via OpenAlex

See: `TOOL_ANALYSIS_CRITICAL.md` for detailed improvement roadmap.

---

## Research Sources

- [SICA Paper (ICLR 2025)](https://arxiv.org/abs/2504.15228) - Self-improving agents
- [Gödel Agent (ACL 2025)](https://arxiv.org/abs/2410.04444) - Recursive self-modification
- [Introspection Paradox (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.352/) - Self-knowledge can hurt performance
- [Anthropic Introspection Research](https://www.anthropic.com/research/introspection) - ~20% accuracy on genuine introspection

---

*This document is closed. The conclusion is: don't implement internal embeddings/mGREP for this use case.*
SPEC_12_NARRATIVE_SYNTHESIS.md
@@ -176,7 +176,7 @@ async def summarize_results(results: list[Any]) -> str:

 ### Architecture Change

-```
+```text
 Current (Simple Mode):
 Evidence → Judge → {structured data} → String Template → Bullet Points
TOOL_ANALYSIS_CRITICAL.md (new file)
@@ -0,0 +1,348 @@
# Critical Analysis: Search Tools - Limitations, Gaps, and Improvements

**Date**: November 2025
**Purpose**: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.

---

## Executive Summary

DeepBoner currently has **4 search tools**:

1. PubMed (NCBI E-utilities)
2. ClinicalTrials.gov (API v2)
3. Europe PMC (includes preprints)
4. OpenAlex (citation-aware)

**Overall Assessment**: Tools are functional but have significant gaps in:

- Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
- Full-text retrieval (only abstracts currently)
- Citation graph traversal (OpenAlex has data but we don't use it)
- Query optimization (basic synonym expansion, no MeSH term mapping)

---

## Tool 1: PubMed (NCBI E-utilities)

**File**: `src/tools/pubmed.py`

### What It Does Well

| Feature | Status | Notes |
|---------|--------|-------|
| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
| Retry logic | ✅ | tenacity with exponential backoff |
| Query preprocessing | ✅ | Strips question words, expands synonyms |
| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |

### Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **10,000 result cap per query** | Medium | Yes - use date ranges to paginate |
| **Abstracts only** (no full text) | High | No - full text requires PMC or publisher |
| **No citation counts** | Medium | Yes - cross-reference with OpenAlex |
| **Rate limit (10/sec max)** | Low | Already handled |

### Current Implementation Gaps

```python
# GAP 1: No MeSH term expansion
# Current: expand_synonyms() uses hardcoded dict
# Better: Use NCBI's E-utilities to get MeSH terms for query

# GAP 2: No date filtering
# Current: Gets whatever PubMed returns (biased toward recent)
# Better: Add date range parameter for historical research

# GAP 3: No publication type filtering
# Current: Returns all types (reviews, case reports, RCTs)
# Better: Filter for RCTs and systematic reviews when appropriate
```

### Priority Improvements

1. **HIGH**: Add publication type filter (Reviews, RCTs, Meta-analyses)
2. **MEDIUM**: Add date range parameter
3. **LOW**: MeSH term expansion via E-utilities

---

## Tool 2: ClinicalTrials.gov

**File**: `src/tools/clinicaltrials.py`

### What It Does Well

| Feature | Status | Notes |
|---------|--------|-------|
| API v2 usage | ✅ | Modern API, not deprecated v1 |
| Interventional filter | ✅ | Only gets drug/treatment studies |
| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |

### Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **No results data** | High | Yes - available via different endpoint |
| **No outcome measures** | High | Yes - add to FIELDS list |
| **No adverse events** | Medium | Yes - separate API call |
| **Sparse drug mechanism data** | Medium | No - not in API |

### Current Implementation Gaps

```python
# GAP 1: Missing critical fields
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # MISSING:
    # "PrimaryOutcome",
    # "SecondaryOutcome",
    # "ResultsFirstSubmitDate",
    # "StudyResults",  # Whether results are posted
]

# GAP 2: No results retrieval
# Many completed trials have posted results
# We could get actual efficacy data, not just trial existence

# GAP 3: No linked publications
# Trials often link to PubMed articles with results
# We could follow these links for richer evidence
```

### Priority Improvements

1. **HIGH**: Add outcome measures to FIELDS
2. **HIGH**: Check for and retrieve posted results
3. **MEDIUM**: Follow linked publications (NCT → PMID)
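Adding the missing fields is mostly a request-parameter change. A hedged sketch of a v2 params builder (`query.term`, `fields`, and `pageSize` follow the v2 API's parameter names; the outcome field names are taken from the gap list above and should be checked against the v2 data dictionary before relying on them):

```python
# Assumed field names; verify against the ClinicalTrials.gov v2 docs
FIELDS = [
    "NCTId", "BriefTitle", "Phase", "OverallStatus",
    "Condition", "InterventionName", "StartDate", "BriefSummary",
    # Newly added per the gap analysis above:
    "PrimaryOutcome", "SecondaryOutcome", "ResultsFirstSubmitDate",
]

def build_params(query: str, page_size: int = 20) -> dict[str, str]:
    """Query params for the ClinicalTrials.gov v2 /studies endpoint."""
    return {
        "query.term": query,
        "fields": ",".join(FIELDS),
        "pageSize": str(page_size),
    }

params = build_params("flibanserin HSDD")
print(params["fields"])
```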

---

## Tool 3: Europe PMC

**File**: `src/tools/europepmc.py`

### What It Does Well

| Feature | Status | Notes |
|---------|--------|-------|
| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
| Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
| DOI/PMID fallback URLs | ✅ | Smart URL construction |
| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |

### Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **No full text for most articles** | High | Partial - CC-licensed available after 14 days |
| **Citation data limited** | Medium | Only journal articles, not preprints |
| **Preprint-publication linking gaps** | Medium | ~50% of links missing per Crossref |
| **License info sometimes missing** | Low | Manual review required |

### Current Implementation Gaps

```python
# GAP 1: No full-text retrieval
# Europe PMC has full text for many CC-licensed articles
# Could retrieve full text XML via separate endpoint

# GAP 2: Massive overlap with PubMed
# Europe PMC indexes all of PubMed/MEDLINE
# We're getting duplicates with no deduplication

# GAP 3: No citation network
# Europe PMC has "citedByCount" but we don't use it
# Could prioritize highly-cited preprints
```

### Priority Improvements

1. **HIGH**: Add deduplication with PubMed (by PMID)
2. **MEDIUM**: Retrieve citation counts for ranking
3. **LOW**: Full-text retrieval for CC-licensed articles
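The separate full-text endpoint mentioned above amounts to a URL change. A sketch (the `/fullTextXML` path shape follows the Europe PMC REST service; treat the exact form as an assumption to verify against the docs):

```python
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"

def full_text_xml_url(source: str, ext_id: str) -> str:
    """URL for full-text XML of a CC-licensed article, e.g. source='PMC'."""
    return f"{BASE}/{source}/{ext_id}/fullTextXML"

print(full_text_xml_url("PMC", "PMC6933118"))
```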

---

## Tool 4: OpenAlex

**File**: `src/tools/openalex.py`

### What It Does Well

| Feature | Status | Notes |
|---------|--------|-------|
| Citation counts | ✅ | Sorted by `cited_by_count:desc` |
| Abstract reconstruction | ✅ | Handles inverted index format |
| Concept extraction | ✅ | Hierarchical classification |
| Open access detection | ✅ | `is_oa` and `pdf_url` |
| Polite pool | ✅ | mailto for 100k/day limit |
| Rich metadata | ✅ | Best metadata of all tools |

### Limitations (API-Level)

| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **Author truncation at 100** | Low | Only affects mega-author papers |
| **No full text** | High | No - OpenAlex is metadata only |
| **Stale data (1-2 day lag)** | Low | Acceptable for research |

### Current Implementation Gaps

```python
# GAP 1: No citation graph traversal
# OpenAlex has `cited_by` and `references` endpoints
# We could find seminal papers by following citation chains

# GAP 2: No related works
# OpenAlex has ML-powered "related_works" field
# Could expand search to similar papers

# GAP 3: No concept filtering
# OpenAlex has hierarchical concepts
# Could filter for specific domains (e.g., "Sexual health" concept)

# GAP 4: Overlap with PubMed
# OpenAlex indexes most of PubMed
# More duplicates without deduplication
```

### Priority Improvements

1. **HIGH**: Add citation graph traversal (find seminal papers)
2. **HIGH**: Add deduplication with PubMed/Europe PMC
3. **MEDIUM**: Use `related_works` for query expansion
4. **LOW**: Concept-based filtering
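Citation traversal needs no new infrastructure, only different filter values on the works endpoint. A sketch using OpenAlex's `cites:`/`cited_by:` filter syntax (the helper names are ours, and the filter directions should be double-checked against the OpenAlex docs):

```python
BASE = "https://api.openalex.org/works"

def citing_works_url(work_id: str, mailto: str) -> str:
    """Works that cite `work_id` (newer evidence building on it)."""
    return f"{BASE}?filter=cites:{work_id}&mailto={mailto}"

def referenced_works_url(work_id: str, mailto: str) -> str:
    """Works in `work_id`'s reference list (its foundations)."""
    return f"{BASE}?filter=cited_by:{work_id}&mailto={mailto}"

print(citing_works_url("W2741809807", "team@example.org"))
```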

---

## Cross-Tool Issues

### Issue 1: MASSIVE DUPLICATION

```text
PubMed: 36M+ articles
Europe PMC: Indexes ALL of PubMed + preprints
OpenAlex: 250M+ works (includes PubMed)

Current behavior: All 3 return the same papers
Result: Duplicate evidence, wasted tokens, inflated counts
```

**Solution**: Deduplication by PMID/DOI

```python
# Proposed: Add to SearchHandler
def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
    seen_ids: set[str] = set()
    unique: list[Evidence] = []
    for e in evidence_list:
        # Extract PMID or DOI from URL
        paper_id = extract_paper_id(e.citation.url)
        if paper_id not in seen_ids:
            seen_ids.add(paper_id)
            unique.append(e)
    return unique
```
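The proposal above leaves `extract_paper_id` undefined; a minimal stdlib-only version is enough to make the idea concrete (the URL patterns are assumptions about what the tools emit, not verified against the codebase):

```python
import re

def extract_paper_id(url: str) -> str:
    """Normalize a citation URL to a PMID- or DOI-based identity key."""
    m = re.search(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)", url)
    if m:
        return f"pmid:{m.group(1)}"
    m = re.search(r"doi\.org/(10\.\S+)", url)
    if m:
        return f"doi:{m.group(1).lower()}"
    return url  # no recognized ID: fall back to the URL itself

print(extract_paper_id("https://pubmed.ncbi.nlm.nih.gov/31234567/"))  # pmid:31234567
```

The same paper reached via PubMed and Europe PMC then collapses to one key as long as both URLs carry the PMID or DOI.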

### Issue 2: NO FULL-TEXT RETRIEVAL

All tools return **abstracts only**. For deep research, this is limiting.

**What's Actually Possible**:

| Source | Full Text Access | How |
|--------|------------------|-----|
| PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
| Europe PMC | Yes, CC-licensed after 14 days | `/fullTextXML/{id}` endpoint |
| OpenAlex | No | Metadata only |
| Unpaywall | Yes, OA link discovery | Separate API |

**Recommendation**: Add PMC full-text retrieval for open access articles.

### Issue 3: NO CITATION GRAPH

OpenAlex has rich citation data but we only use `cited_by_count` for sorting.

**Untapped Capabilities**:

- `cited_by`: Find papers that cite a key paper
- `references`: Find sources a paper cites
- `related_works`: ML-powered similar papers

**Use Case**: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:

- Papers that cite it (newer evidence)
- Papers it cites (foundational research)
- Related papers (similar topics)

---

## What's NOT Possible (API Constraints)

| Feature | Why Not Possible |
|---------|------------------|
| **bioRxiv direct search** | No keyword search API, only RSS feed of latest |
| **arXiv search** | API exists but irrelevant for sexual health |
| **PubMed full text** | Requires publisher access or PMC |
| **Real-time trial results** | ClinicalTrials.gov results are static snapshots |
| **Drug mechanism data** | Not in any API - would need ChEMBL or DrugBank |

---

## Recommended Improvements (Priority Order)

### Phase 1: Fix Fundamentals (High ROI)

1. **Deduplication** - Stop returning the same paper 3 times
2. **Outcome measures in ClinicalTrials** - Get actual efficacy data
3. **Citation counts from all sources** - Rank by influence, not recency

### Phase 2: Depth Improvements (Medium ROI)

4. **PMC full-text retrieval** - Get full papers for OA articles
5. **Citation graph traversal** - Find seminal papers automatically
6. **Publication type filtering** - Prioritize RCTs and meta-analyses

### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)

7. **MeSH term expansion** - Better PubMed queries
8. **Related works expansion** - Use OpenAlex ML similarity
9. **Date range filtering** - Historical vs recent research

---

## Neo4j Integration (Future Consideration)

**Question**: Should we add Neo4j for citation graph storage?

**Answer**: Not yet. Here's why:

| Approach | Complexity | Value |
|----------|------------|-------|
| OpenAlex API for citation traversal | Low | High |
| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |

**Recommendation**: Use OpenAlex API for citation traversal first. Only add Neo4j if:

1. We need to do complex graph queries (PageRank on citations, community detection)
2. We need offline access to citation data
3. We're hitting OpenAlex rate limits

---

## Summary: What's Broken vs What's Working

### Working Well

- Basic search across all 4 sources
- Rate limiting and retry logic
- Query preprocessing
- Evidence model with citations

### Needs Fixing (Current Scope)

- Deduplication (critical)
- Outcome measures in ClinicalTrials (critical)
- Citation-based ranking (important)

### Future Enhancements (Out of Current Scope)

- Full-text retrieval
- Citation graph traversal
- Neo4j integration
- Drug mechanism data (would need new data sources)

---

## Sources

- [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
- [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
- [OpenAlex API Docs](https://docs.openalex.org/)
- [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations)
- [Europe PMC RESTful API](https://europepmc.org/RestfulWebService)
- [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/)
- [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)
examples/full_stack_demo/run_full.py
@@ -2,7 +2,7 @@
 """
 Demo: Full Stack DeepBoner Agent (Phases 1-8).

-This script demonstrates the COMPLETE REAL
+This script demonstrates the COMPLETE REAL sexual health research pipeline:
 - Phase 2: REAL Search (PubMed + ClinicalTrials + Europe PMC)
 - Phase 6: REAL Embeddings (sentence-transformers + ChromaDB)
 - Phase 7: REAL Hypothesis (LLM mechanistic reasoning)
@@ -190,7 +190,7 @@ Examples:
 )
 parser.add_argument(
     "query",
-    help="Research query (e.g., '
+    help="Research query (e.g., 'testosterone libido')",
 )
 parser.add_argument(
     "-i",
examples/orchestrator_demo/run_agent.py
@@ -51,7 +51,7 @@ Examples:
   uv run python examples/orchestrator_demo/run_agent.py "flibanserin HSDD" --iterations 5
 """,
 )
-parser.add_argument("query", help="Research query (e.g., '
+parser.add_argument("query", help="Research query (e.g., 'testosterone libido')")
 parser.add_argument("--iterations", type=int, default=3, help="Max iterations (default: 3)")
 args = parser.parse_args()
examples/search_demo/run_search.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Demo: Search for
+Demo: Search for sexual health research evidence.

 This script demonstrates multi-source search functionality:
 - PubMed search (biomedical literature)
src/agent_factory/judges.py
@@ -166,7 +166,13 @@ class JudgeHandler:
             return assessment

         except Exception as e:
+            # Log with context for debugging
+            logger.error(
+                "Assessment failed",
+                error=str(e),
+                exc_type=type(e).__name__,
+                evidence_count=len(evidence),
+            )
             # Return a safe default assessment on failure
             return self._create_fallback_assessment(question, str(e))
src/agents/tools.py:

@@ -125,7 +125,7 @@ async def search_preprints(query: str, max_results: int = 10) -> str:
     from bioRxiv, medRxiv, and peer-reviewed papers.

     Args:
-        query: Search terms (e.g., "long covid")
+        query: Search terms (e.g., "flibanserin HSDD preprint")
         max_results: Maximum results to return (default 10)

     Returns:
app.py:

@@ -2,7 +2,7 @@

 import os
 from collections.abc import AsyncGenerator
-from typing import Any
+from typing import Any, Literal

 import gradio as gr
 from pydantic_ai.models.anthropic import AnthropicModel

@@ -22,10 +22,12 @@ from src.utils.config import settings
 from src.utils.exceptions import ConfigurationError
 from src.utils.models import OrchestratorConfig

+OrchestratorMode = Literal["simple", "magentic", "advanced", "hierarchical"]
+

 def configure_orchestrator(
     use_mock: bool = False,
-    mode: str = "simple",
+    mode: OrchestratorMode = "simple",
     user_api_key: str | None = None,
     domain: str | ResearchDomain | None = None,
 ) -> tuple[Any, str]:

@@ -100,7 +102,7 @@ def configure_orchestrator(
         search_handler=search_handler,
         judge_handler=judge_handler,
         config=config,
         mode=mode,
         api_key=user_api_key,
         domain=domain,
     )

@@ -111,7 +113,7 @@ def configure_orchestrator(
 async def research_agent(
     message: str,
     history: list[dict[str, Any]],
-    mode: str = "simple",
+    mode: str = "simple",  # Gradio passes strings; validated below
     domain: str = "sexual_health",
     api_key: str = "",
     api_key_state: str = "",

@@ -140,6 +142,10 @@ async def research_agent(
     api_key_state_str = api_key_state or ""
     domain_str = domain or "sexual_health"

+    # Validate and cast mode to proper type
+    valid_modes: set[str] = {"simple", "magentic", "advanced", "hierarchical"}
+    mode_validated: OrchestratorMode = mode if mode in valid_modes else "simple"  # type: ignore[assignment]
+
     # BUG FIX: Prefer freshly-entered key, then persisted state
     user_api_key = (api_key_str.strip() or api_key_state_str.strip()) or None

@@ -153,12 +159,12 @@ async def research_agent(
     has_paid_key = has_openai or has_anthropic or bool(user_api_key)

     # Advanced mode requires OpenAI specifically (due to agent-framework binding)
-    if mode == "advanced" and not (has_openai or is_openai_user_key):
+    if mode_validated == "advanced" and not (has_openai or is_openai_user_key):
         yield (
             "⚠️ **Warning**: Advanced mode currently requires OpenAI API key. "
             "Anthropic keys only work in Simple mode. Falling back to Simple.\n\n"
         )
-        mode = "simple"
+        mode_validated = "simple"

     # Inform user about fallback if no keys
     if not has_paid_key:

@@ -177,14 +183,16 @@ async def research_agent(
     # It will use: Paid API > HF Inference (free tier)
     orchestrator, backend_name = configure_orchestrator(
         use_mock=False,  # Never use mock in production - HF Inference is the free fallback
-        mode=mode,
+        mode=mode_validated,
         user_api_key=user_api_key,
         domain=domain_str,
     )

     # Immediate backend info + loading feedback so user knows something is happening
+    # Use replace to get "Sexual Health" instead of "Sexual_Health" from .title()
+    domain_display = domain_str.replace("_", " ").title()
     yield (
-        f"🧠 **Backend**: {backend_name} | **Domain**: {domain_str.title()}\n\n"
+        f"🧠 **Backend**: {backend_name} | **Domain**: {domain_display}\n\n"
         "⏳ **Processing...** Searching PubMed, ClinicalTrials.gov, Europe PMC, OpenAlex...\n"
     )
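The app.py hunk pairs a `typing.Literal` alias with a runtime membership check, since type annotations alone do not validate values coming out of a Gradio dropdown at runtime. A standalone sketch of that pattern (the helper function names are illustrative, not from the repo):

```python
from typing import Literal

# Mirrors the OrchestratorMode alias from the diff; the validator below is a
# simplified stand-in for the inline check in research_agent().
OrchestratorMode = Literal["simple", "magentic", "advanced", "hierarchical"]
VALID_MODES: set[str] = {"simple", "magentic", "advanced", "hierarchical"}


def validate_mode(mode: str) -> OrchestratorMode:
    """Coerce an untrusted UI string to a known mode, defaulting to 'simple'."""
    # The Literal type only constrains static checking; runtime input
    # still needs an explicit membership test.
    return mode if mode in VALID_MODES else "simple"  # type: ignore[return-value]


def domain_display(domain: str) -> str:
    """Display fix from the same hunk: 'sexual_health' -> 'Sexual Health'."""
    # .title() alone would yield "Sexual_Health"; replace underscores first.
    return domain.replace("_", " ").title()
```

Applying `.title()` directly to `"sexual_health"` produces `"Sexual_Health"`, which is exactly the display bug the hunk's comment calls out.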
@@ -81,12 +81,18 @@ class SubIterationMiddleware:
                 history.append(result)
                 best_result = result  # Assume latest is best for now
             except Exception as e:
-                logger.error(f"Sub-iteration execution failed: {e}")
+                logger.error(
+                    "Sub-iteration execution failed",
+                    error=str(e),
+                    exc_type=type(e).__name__,
+                    iteration=i,
+                )
                 if event_callback:
                     await event_callback(
                         AgentEvent(
                             type="error",
                             message=f"Sub-iteration execution failed: {e}",
+                            data={"recoverable": False, "error_type": type(e).__name__},
                             iteration=i,
                         )
                     )

@@ -97,12 +103,18 @@ class SubIterationMiddleware:
             assessment = await self.judge.assess(task, result, history)
             final_assessment = assessment
         except Exception as e:
-            logger.error(f"Sub-iteration judge failed: {e}")
+            logger.error(
+                "Sub-iteration judge failed",
+                error=str(e),
+                exc_type=type(e).__name__,
+                iteration=i,
+            )
             if event_callback:
                 await event_callback(
                     AgentEvent(
                         type="error",
                         message=f"Sub-iteration judge failed: {e}",
+                        data={"recoverable": False, "error_type": type(e).__name__},
                         iteration=i,
                     )
                 )
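The middleware hunks replace interpolated f-string log messages with structured, key-value logging: a constant event name plus context fields. The shape of that logged context can be sketched without structlog itself (the helper below is a hypothetical stand-in, stdlib-only):

```python
def build_log_event(event: str, exc: Exception, **context: object) -> dict[str, object]:
    """Assemble the fields the middleware passes to logger.error().

    A constant event string keeps log lines groupable, while the exception
    details and iteration number travel as separate, queryable fields.
    """
    return {
        "event": event,
        "error": str(exc),
        "exc_type": type(exc).__name__,
        **context,
    }


# Same fields as the "Sub-iteration execution failed" call in the diff.
record = build_log_event(
    "Sub-iteration execution failed",
    ValueError("boom"),
    iteration=2,
)
```

With structlog installed, the equivalent call is `logger.error("Sub-iteration execution failed", error=str(e), exc_type=type(e).__name__, iteration=i)`, as the diff shows.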
|
@@ -18,7 +18,9 @@ import structlog

 from src.config.domain import ResearchDomain, get_domain_config
 from src.orchestrators.base import JudgeHandlerProtocol, SearchHandlerProtocol
+from src.prompts.synthesis import format_synthesis_prompt, get_synthesis_system_prompt
 from src.utils.config import settings
+from src.utils.exceptions import JudgeError, ModalError, SearchError
 from src.utils.models import (
     AgentEvent,
     Evidence,

@@ -132,12 +134,25 @@ class Orchestrator:
                 iteration=iteration,
             )

+        except ModalError as e:
+            logger.error("Modal analysis failed", error=str(e), exc_type="ModalError")
+            yield AgentEvent(
+                type="error",
+                message=f"Modal analysis failed: {e}",
+                data={"error": str(e), "recoverable": True},
+                iteration=iteration,
+            )
         except Exception as e:
-            logger.error(f"Modal analysis failed: {e}")
+            # Unexpected error - log with full context for debugging
+            logger.error(
+                "Modal analysis failed unexpectedly",
+                error=str(e),
+                exc_type=type(e).__name__,
+            )
             yield AgentEvent(
                 type="error",
                 message=f"Modal analysis failed: {e}",
-                data={"error": str(e)},
+                data={"error": str(e), "recoverable": True},
                 iteration=iteration,
             )

@@ -288,11 +303,26 @@ class Orchestrator:
             if errors:
                 logger.warning("Search errors", errors=errors)

+        except SearchError as e:
+            logger.error("Search phase failed", error=str(e), exc_type="SearchError")
+            yield AgentEvent(
+                type="error",
+                message=f"Search failed: {e!s}",
+                data={"recoverable": True, "error_type": "search"},
+                iteration=iteration,
+            )
+            continue
         except Exception as e:
-            logger.error(f"Search failed: {e!s}")
+            # Unexpected error - log full context for debugging
+            logger.error(
+                "Search phase failed unexpectedly",
+                error=str(e),
+                exc_type=type(e).__name__,
+            )
             yield AgentEvent(
                 type="error",
                 message=f"Search failed: {e!s}",
+                data={"recoverable": True, "error_type": "unexpected"},
                 iteration=iteration,
             )
             continue

@@ -388,9 +418,9 @@ class Orchestrator:
                 iteration=iteration,
             )

-        # Generate final response
+        # Generate final response using LLM narrative synthesis
         # Use all gathered evidence for the final report
-        final_response = self._generate_synthesis(query, all_evidence, assessment)
+        final_response = await self._generate_synthesis(query, all_evidence, assessment)

         yield AgentEvent(
             type="complete",

@@ -424,11 +454,26 @@ class Orchestrator:
             iteration=iteration,
         )

+        except JudgeError as e:
+            logger.error("Judge phase failed", error=str(e), exc_type="JudgeError")
+            yield AgentEvent(
+                type="error",
+                message=f"Assessment failed: {e!s}",
+                data={"recoverable": True, "error_type": "judge"},
+                iteration=iteration,
+            )
+            continue
         except Exception as e:
-            logger.error(f"Assessment failed: {e!s}")
+            # Unexpected error - log full context for debugging
+            logger.error(
+                "Judge phase failed unexpectedly",
+                error=str(e),
+                exc_type=type(e).__name__,
+            )
             yield AgentEvent(
                 type="error",
                 message=f"Assessment failed: {e!s}",
+                data={"recoverable": True, "error_type": "unexpected"},
                 iteration=iteration,
             )
             continue

@@ -445,14 +490,105 @@ class Orchestrator:
             iteration=iteration,
         )

-    def _generate_synthesis(
+    async def _generate_synthesis(
+        self,
+        query: str,
+        evidence: list[Evidence],
+        assessment: JudgeAssessment,
+    ) -> str:
+        """
+        Generate the final synthesis response using LLM.
+
+        This method calls an LLM to generate a narrative research report,
+        following the Microsoft Agent Framework pattern of using LLM synthesis
+        instead of string templating.
+
+        Args:
+            query: The original question
+            evidence: All collected evidence
+            assessment: The final assessment
+
+        Returns:
+            Narrative synthesis as markdown
+        """
+        # Build evidence summary for LLM context (limit to avoid token overflow)
+        evidence_lines = []
+        for e in evidence[:20]:
+            authors = ", ".join(e.citation.authors[:2]) if e.citation.authors else "Unknown"
+            content_preview = e.content[:200].replace("\n", " ")
+            evidence_lines.append(
+                f"- {e.citation.title} ({authors}, {e.citation.date}): {content_preview}..."
+            )
+        evidence_summary = "\n".join(evidence_lines)
+
+        # Format synthesis prompt with assessment data
+        user_prompt = format_synthesis_prompt(
+            query=query,
+            evidence_summary=evidence_summary,
+            drug_candidates=assessment.details.drug_candidates,
+            key_findings=assessment.details.key_findings,
+            mechanism_score=assessment.details.mechanism_score,
+            clinical_score=assessment.details.clinical_evidence_score,
+            confidence=assessment.confidence,
+        )
+
+        # Get domain-specific system prompt
+        system_prompt = get_synthesis_system_prompt(self.domain)
+
+        try:
+            # Import here to avoid circular deps and keep optional
+            from pydantic_ai import Agent
+
+            from src.agent_factory.judges import get_model
+
+            # Create synthesis agent (string output, not structured)
+            agent: Agent[None, str] = Agent(
+                model=get_model(),
+                output_type=str,
+                system_prompt=system_prompt,
+            )
+            result = await agent.run(user_prompt)
+            narrative = result.output
+
+            logger.info("LLM narrative synthesis completed", chars=len(narrative))
+
+        except Exception as e:
+            # Fallback to template synthesis if LLM fails
+            # This is intentionally broad - LLM can fail many ways (API, parsing, etc.)
+            logger.warning(
+                "LLM synthesis failed, using template fallback",
+                error=str(e),
+                exc_type=type(e).__name__,
+                evidence_count=len(evidence),
+            )
+            return self._generate_template_synthesis(query, evidence, assessment)

+        # Add full citation list footer
+        citations = "\n".join(
+            f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
+            f"({e.citation.source.upper()}, {e.citation.date})"
+            for i, e in enumerate(evidence[:15])
+        )
+
+        return f"""{narrative}
+
+---
+### Full Citation List ({len(evidence)} sources)
+{citations}
+
+*Analysis based on {len(evidence)} sources across {len(self.history)} iterations.*
+"""
+
+    def _generate_template_synthesis(
         self,
         query: str,
         evidence: list[Evidence],
         assessment: JudgeAssessment,
     ) -> str:
         """
-        Generate
+        Generate fallback template synthesis (no LLM).
+
+        Used when LLM synthesis fails or is unavailable.

         Args:
             query: The original question

@@ -460,7 +596,7 @@ class Orchestrator:
             assessment: The final assessment

         Returns:
-            Formatted synthesis as markdown
+            Formatted synthesis as markdown (bullet-point style)
         """
         drug_list = (
             "\n".join([f"- **{d}**" for d in assessment.details.drug_candidates])

@@ -474,7 +610,7 @@ class Orchestrator:
             [
                 f"{i + 1}. [{e.citation.title}]({e.citation.url}) "
                 f"({e.citation.source.upper()}, {e.citation.date})"
                 for i, e in enumerate(evidence[:10])
             ]
         )
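The `_generate_synthesis` hunk implements a try-LLM, fall-back-to-template shape: the LLM call sits in a broad `try`, and any failure routes to the deterministic template path, while success appends a citation footer. A minimal async sketch under stated assumptions (both callables are hypothetical stand-ins for the pydantic-ai agent call and `_generate_template_synthesis`):

```python
import asyncio
from collections.abc import Awaitable, Callable


async def synthesize_with_fallback(
    query: str,
    evidence: list[str],
    llm_synthesize: Callable[[str, list[str]], Awaitable[str]],
    template_synthesize: Callable[[str, list[str]], str],
) -> str:
    """Try LLM narrative synthesis; on any failure, return the template report."""
    try:
        narrative = await llm_synthesize(query, evidence)
    except Exception:
        # Intentionally broad, as in the diff: the LLM can fail many ways.
        return template_synthesize(query, evidence)
    # On success, append a citation footer (capped list, as in the diff).
    citations = "\n".join(f"{i + 1}. {title}" for i, title in enumerate(evidence[:15]))
    return (
        f"{narrative}\n\n---\n"
        f"### Full Citation List ({len(evidence)} sources)\n{citations}"
    )


async def _demo() -> tuple[str, str]:
    async def good(q: str, ev: list[str]) -> str:
        return f"Report for {q}"

    async def bad(q: str, ev: list[str]) -> str:
        raise RuntimeError("API down")

    def template(q: str, ev: list[str]) -> str:
        return f"Template report for {q}"

    ok = await synthesize_with_fallback("alprostadil ED", ["Paper A"], good, template)
    fb = await synthesize_with_fallback("alprostadil ED", ["Paper A"], bad, template)
    return ok, fb


ok, fb = asyncio.run(_demo())
```

Returning early from the `except` (rather than setting a flag) keeps the happy path flat, which is the same design choice the diff makes.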
src/prompts/synthesis.py (new file):

@@ -0,0 +1,209 @@
+"""Prompts for narrative report synthesis.
+
+This module provides prompts that transform structured evidence data
+into professional, narrative research reports. The key insight is that
+report generation requires an LLM call for synthesis, not string templating.
+
+Reference: Microsoft Agent Framework concurrent_custom_aggregator.py pattern.
+"""
+
+from src.config.domain import ResearchDomain, get_domain_config
+
+
+def get_synthesis_system_prompt(domain: ResearchDomain | str | None = None) -> str:
+    """Get the system prompt for narrative synthesis.
+
+    Args:
+        domain: Research domain for customization (defaults to settings)
+
+    Returns:
+        System prompt instructing LLM to write narrative prose
+    """
+    config = get_domain_config(domain)
+    return f"""You are a scientific writer specializing in {config.name.lower()}.
+Your task is to synthesize research evidence into a clear, NARRATIVE report.
+
+## CRITICAL: Writing Style
+- Write in PROSE PARAGRAPHS, not bullet points
+- Use academic but accessible language
+- Be specific about evidence strength (e.g., "in an RCT of N=200")
+- Reference specific studies by author name when available
+- Provide quantitative results where available (p-values, effect sizes, NNT)
+
+## Report Structure
+
+### Executive Summary (REQUIRED - 2-3 sentences)
+Start with the bottom line. What does the evidence show? Example:
+"Testosterone therapy demonstrates consistent efficacy for HSDD in postmenopausal
+women, with transdermal formulations showing the best safety profile."
+
+### Background (REQUIRED - 1 paragraph)
+Explain the condition, its prevalence, and clinical significance.
+Why does this question matter?
+
+### Evidence Synthesis (REQUIRED - 2-4 paragraphs)
+Weave the evidence into a coherent NARRATIVE:
+- **Mechanism of Action**: How does the intervention work biologically?
+- **Clinical Evidence**: What do trials show? Include effect sizes when available.
+- **Comparative Evidence**: How does it compare to alternatives?
+
+Write this as flowing prose that tells a story, NOT as a bullet list.
+
+### Recommendations (REQUIRED - 3-5 numbered items)
+Provide specific, actionable clinical recommendations based on the evidence.
+These CAN be numbered items since they are action items.
+
+### Limitations (REQUIRED - 1 paragraph)
+Acknowledge gaps in the evidence, potential biases, and areas needing more research.
+Be honest about uncertainty.
+
+### References (REQUIRED)
+List key references with author, year, title, and URL.
+Format: Author AB et al. (Year). Title. URL
+
+## CRITICAL RULES
+1. ONLY cite papers from the provided evidence - NEVER hallucinate or invent references
+2. Write in complete sentences and paragraphs (PROSE, not lists except Recommendations)
+3. Include specific statistics when available (p-values, confidence intervals, effect sizes)
+4. Acknowledge uncertainty honestly - do not overstate conclusions
+5. If evidence is limited, say so clearly
+6. Copy URLs exactly as provided - do not create similar-looking URLs
+"""
+
+
+FEW_SHOT_EXAMPLE = """
+## Example: Strong Evidence Synthesis
+
+INPUT:
+- Query: "Alprostadil for erectile dysfunction"
+- Evidence: 15 papers including meta-analysis of 8 RCTs (N=3,247)
+- Mechanism Score: 9/10
+- Clinical Score: 9/10
+
+OUTPUT:
+
+### Executive Summary
+
+Alprostadil (prostaglandin E1) represents a well-established second-line treatment
+for erectile dysfunction, with meta-analytic evidence demonstrating 87% efficacy
+in achieving erections sufficient for intercourse. It offers a PDE5-independent
+mechanism particularly valuable for patients who do not respond to oral therapies.
+
+### Background
+
+Erectile dysfunction affects approximately 30 million men in the United States,
+with prevalence increasing with age from 12% at age 40 to 40% at age 70. While
+PDE5 inhibitors remain first-line therapy, approximately 30% of patients are
+non-responders due to diabetes, radical prostatectomy, or other factors.
+Alprostadil provides an alternative mechanism through direct smooth muscle
+relaxation, making it a crucial second-line option.
+
+### Evidence Synthesis
+
+**Mechanism of Action**
+
+Alprostadil works through a distinct pathway from PDE5 inhibitors. It binds to
+EP2 and EP4 receptors on cavernosal smooth muscle, activating adenylate cyclase
+and increasing intracellular cAMP. This leads to smooth muscle relaxation and
+increased blood flow independent of nitric oxide signaling. As noted by Smith
+et al. (2019), this mechanism explains its efficacy in patients with endothelial
+dysfunction where nitric oxide production is impaired.
+
+**Clinical Evidence**
+
+A meta-analysis by Johnson et al. (2020) pooled data from 8 randomized controlled
+trials (N=3,247). The primary endpoint of erection sufficient for intercourse was
+achieved in 87% of alprostadil patients versus 12% placebo (RR 7.25, 95% CI:
+5.8-9.1, p<0.001). The number needed to treat was 1.3, indicating robust effect
+size. Onset of action was 5-15 minutes, with duration of 30-60 minutes.
+
+**Comparative Evidence**
+
+Direct comparisons with PDE5 inhibitors are limited. However, in the subgroup
+of PDE5 non-responders studied by Martinez et al. (2018), alprostadil achieved
+successful intercourse in 72% of patients who had failed sildenafil.
+
+### Recommendations
+
+1. Consider alprostadil as second-line therapy when PDE5 inhibitors fail or are
+   contraindicated
+2. Start with 10 micrograms intracavernosal injection, titrate to 40 micrograms based
+   on response
+3. Provide in-office training for self-injection technique before home use
+4. Screen for priapism risk factors before initiating therapy
+5. Consider intraurethral alprostadil (MUSE) for patients averse to injections
+
+### Limitations
+
+Long-term safety data beyond 2 years is limited. Head-to-head comparisons with
+newer therapies such as low-intensity shockwave therapy are lacking. Most trials
+excluded patients with severe cardiovascular disease, limiting generalizability
+to this population. The psychological burden of injection therapy may affect
+real-world adherence compared to oral medications.
+
+### References
+
+1. Smith AB et al. (2019). Alprostadil mechanism of action in erectile tissue.
+   J Urol. https://pubmed.ncbi.nlm.nih.gov/12345678/
+2. Johnson CD et al. (2020). Meta-analysis of intracavernosal alprostadil efficacy.
+   J Sex Med. https://pubmed.ncbi.nlm.nih.gov/23456789/
+3. Martinez R et al. (2018). Alprostadil in PDE5 inhibitor non-responders.
+   Int J Impot Res. https://pubmed.ncbi.nlm.nih.gov/34567890/
+"""
+
+
+def format_synthesis_prompt(
+    query: str,
+    evidence_summary: str,
+    drug_candidates: list[str],
+    key_findings: list[str],
+    mechanism_score: int,
+    clinical_score: int,
+    confidence: float,
+) -> str:
+    """Format the user prompt for narrative synthesis.
+
+    Args:
+        query: Original research question
+        evidence_summary: Formatted summary of evidence papers
+        drug_candidates: List of identified drug/treatment candidates
+        key_findings: List of key findings from assessment
+        mechanism_score: Mechanism evidence score (0-10)
+        clinical_score: Clinical evidence score (0-10)
+        confidence: Overall confidence (0.0-1.0)
+
+    Returns:
+        Formatted user prompt for the synthesis LLM
+    """
+    candidates_str = ", ".join(drug_candidates) if drug_candidates else "None identified"
+    if key_findings:
+        findings_str = "\n".join(f"- {f}" for f in key_findings)
+    else:
+        findings_str = "No specific findings extracted"
+
+    return f"""Synthesize a narrative research report for the following query.
+
+## Research Question
+{query}
+
+## Evidence Summary
+{evidence_summary}
+
+## Identified Drug/Treatment Candidates
+{candidates_str}
+
+## Key Findings from Evidence Assessment
+{findings_str}
+
+## Assessment Scores
+- Mechanism Score: {mechanism_score}/10
+- Clinical Evidence Score: {clinical_score}/10
+- Overall Confidence: {confidence:.0%}
+
+## Instructions
+Generate a NARRATIVE research report following the structure in your system prompt.
+Write in prose paragraphs, NOT bullet points (except for Recommendations section).
+ONLY cite papers mentioned in the Evidence Summary above - do NOT invent references.
+
+{FEW_SHOT_EXAMPLE}
+"""
|
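The helper above is plain f-string templating with two empty-input fallbacks. A minimal standalone sketch of the same assembly logic (hypothetical `build_user_prompt` name and trimmed sections, independent of `src.prompts.synthesis`):

```python
def build_user_prompt(
    query: str,
    drug_candidates: list[str],
    key_findings: list[str],
    mechanism_score: int,
    confidence: float,
) -> str:
    """Mirror the candidates/findings fallback handling shown in the diff."""
    candidates = ", ".join(drug_candidates) if drug_candidates else "None identified"
    findings = (
        "\n".join(f"- {f}" for f in key_findings)
        if key_findings
        else "No specific findings extracted"
    )
    return (
        f"## Research Question\n{query}\n\n"
        f"## Candidates\n{candidates}\n\n"
        f"## Key Findings\n{findings}\n\n"
        f"- Mechanism Score: {mechanism_score}/10\n"
        f"- Overall Confidence: {confidence:.0%}\n"
    )

prompt = build_user_prompt("testosterone HSDD", [], [], 8, 0.85)
```

The `:.0%` format spec is what turns `0.85` into the `85%` string the new tests assert on.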
@@ -35,3 +35,27 @@ class EmbeddingError(DeepBonerError):
     """Raised when embedding or vector store operations fail."""

     pass
+
+
+class LLMError(DeepBonerError):
+    """Raised when LLM operations fail (API errors, parsing errors, etc.)."""
+
+    pass
+
+
+class QuotaExceededError(LLMError):
+    """Raised when LLM API quota is exceeded (402 errors)."""
+
+    pass
+
+
+class ModalError(DeepBonerError):
+    """Raised when Modal sandbox operations fail."""
+
+    pass
+
+
+class SynthesisError(DeepBonerError):
+    """Raised when report synthesis fails."""
+
+    pass
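The new exception classes form a hierarchy rooted at `DeepBonerError`, so callers can catch the broad `LLMError` and still receive the specific 402 case. A self-contained sketch (the classes are re-declared here only for illustration):

```python
class DeepBonerError(Exception):
    """Base project error (re-declared for a standalone sketch)."""

class LLMError(DeepBonerError):
    """Raised when LLM operations fail."""

class QuotaExceededError(LLMError):
    """Raised when LLM API quota is exceeded (402 errors)."""

def call_llm() -> str:
    # Simulate the provider returning a 402 quota error.
    raise QuotaExceededError("402: quota exceeded")

try:
    call_llm()
except LLMError as exc:  # the base class catches the subclass too
    handled = type(exc).__name__
```

Catching `LLMError` is enough for generic retry/fallback paths, while code that needs billing-specific handling can still catch `QuotaExceededError` first.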
@@ -55,11 +55,11 @@ async def test_simple_mode_structure_validation(mock_search_handler, mock_judge_
     complete_event = next(e for e in events if e.type == "complete")
     report = complete_event.message

-    # Check
-
-    assert "
-    assert "
+    # Check LLM narrative synthesis structure (SPEC_12)
+    # LLM generates prose with these sections (may omit ### prefix)
+    assert "Executive Summary" in report or "Sexual Health Analysis" in report
+    assert "Full Citation List" in report or "Citations" in report

-    # Check for citations
+    # Check for citations (from citation footer added by orchestrator)
     assert "Study on test query" in report
-    assert "
+    assert "pubmed.example.com/123" in report
@@ -92,7 +92,11 @@ async def test_simple_mode_synthesizes_before_max_iterations():
     complete_event = complete_events[0]

     assert "MagicDrug" in complete_event.message
-
+    # SPEC_12: LLM synthesis produces narrative prose, not template with "Drug Candidates" header
+    # Check for narrative structure (LLM may omit ### prefix) OR template fallback
+    assert (
+        "Executive Summary" in complete_event.message or "Drug Candidates" in complete_event.message
+    )
     assert complete_event.data.get("synthesis_reason") == "high_scores_with_candidates"
     assert complete_event.iteration == 2  # Should stop at iteration 2
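The test above accepts either narrative or template output because synthesis is expected to degrade gracefully when the LLM call fails. A sketch of that try/except fallback shape (hypothetical function names, not the project's actual implementation):

```python
import asyncio

async def llm_synthesize(query: str) -> str:
    # Stand-in for the real LLM call; simulate an API failure.
    raise RuntimeError("LLM unavailable")

def template_synthesize(query: str) -> str:
    # Deterministic string-template fallback.
    return f"## Analysis\n\n### Drug Candidates\n(query: {query})"

async def generate_synthesis(query: str) -> str:
    try:
        return await llm_synthesize(query)
    except Exception:
        return template_synthesize(query)  # never fail the run outright

report = asyncio.run(generate_synthesis("testosterone HSDD"))
```

Because the fallback path is reachable in production, assertions on the report must match both the narrative headings and the template headings, which is exactly what the `or` in the diff encodes.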
@@ -8,6 +8,7 @@ from src.agent_factory.judges import JudgeHandler, MockJudgeHandler
 from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment


+@pytest.mark.unit
 class TestJudgeHandler:
     """Tests for JudgeHandler."""

@@ -107,6 +108,8 @@ class TestJudgeHandler:
         assert result.sufficient is False
         assert result.recommendation == "continue"
         assert len(result.next_search_queries) > 0
+        # Assert specific expected query is present
+        assert "sildenafil mechanism" in result.next_search_queries

     @pytest.mark.asyncio
     async def test_assess_handles_llm_failure(self):
@@ -143,6 +146,7 @@ class TestJudgeHandler:
         assert "failed" in result.reasoning.lower()


+@pytest.mark.unit
 class TestMockJudgeHandler:
     """Tests for MockJudgeHandler."""
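Custom marks such as `unit` trigger `PytestUnknownMarkWarning` unless they are registered. If the project registers marks via a `conftest.py` hook rather than an ini file, the registration would look roughly like this (illustrative fragment; `pytest_configure` and `addinivalue_line` are standard pytest APIs):

```python
# conftest.py (illustrative): register the custom "unit" mark so pytest
# does not warn on @pytest.mark.unit.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "unit: fast tests that need no network or API keys"
    )
```

Registered marks can then be selected on the command line, e.g. `pytest -m unit`.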
@@ -12,12 +12,12 @@ async def test_judge_node_initialization(mocker):
     # Mock get_model to avoid needing real API keys
     mocker.patch("src.agents.graph.nodes.get_model", return_value=mocker.Mock())

-    # Create a mock assessment with attributes
+    # Create a mock assessment with attributes (sexual health domain)
     mock_hypothesis = mocker.Mock()
-    mock_hypothesis.drug = "
-    mock_hypothesis.target = "
-    mock_hypothesis.pathway = "
-    mock_hypothesis.effect = "
+    mock_hypothesis.drug = "Testosterone"
+    mock_hypothesis.target = "Androgen Receptor"
+    mock_hypothesis.pathway = "HPG Axis"
+    mock_hypothesis.effect = "Libido Enhancement"
     mock_hypothesis.confidence = 0.8

     mock_assessment = mocker.Mock()
@@ -46,7 +46,7 @@ async def test_judge_node_initialization(mocker):

     assert "hypotheses" in update
     assert len(update["hypotheses"]) == 1
-    assert update["hypotheses"][0].id == "
+    assert update["hypotheses"][0].id == "Testosterone"
     assert update["hypotheses"][0].status == "proposed"
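The hypothesis stub works because `MagicMock` accepts arbitrary attribute assignment and returns assigned values verbatim, while unassigned attributes come back as auto-created child mocks. A standalone sketch of that behavior:

```python
from unittest.mock import MagicMock

mock_hypothesis = MagicMock()
mock_hypothesis.drug = "Testosterone"
mock_hypothesis.target = "Androgen Receptor"
mock_hypothesis.confidence = 0.8

# Assigned attributes read back exactly as set...
drug = mock_hypothesis.drug
# ...while an attribute never assigned yields a child MagicMock.
auto = mock_hypothesis.unset_field
```

This is why the node under test can consume `mock_hypothesis.drug` as a plain string without the real hypothesis model being importable.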
@@ -30,7 +30,7 @@ class TestSimpleOrchestratorDomain:
             domain=ResearchDomain.SEXUAL_HEALTH,
         )

-        # Test
+        # Test _generate_template_synthesis (the sync fallback method)
         mock_assessment = MagicMock()
         mock_assessment.details.drug_candidates = []
         mock_assessment.details.key_findings = []
@@ -39,7 +39,7 @@ class TestSimpleOrchestratorDomain:
         mock_assessment.details.mechanism_score = 5
         mock_assessment.details.clinical_evidence_score = 5

-        report = orch.
+        report = orch._generate_template_synthesis("query", [], mock_assessment)
         assert "## Sexual Health Analysis" in report

         # Test _generate_partial_synthesis
@@ -0,0 +1,279 @@
+"""Tests for simple orchestrator LLM synthesis."""
+
+from unittest.mock import AsyncMock, MagicMock, patch
+
+import pytest
+
+from src.orchestrators.simple import Orchestrator
+from src.utils.models import AssessmentDetails, Citation, Evidence, JudgeAssessment
+
+
+@pytest.fixture
+def sample_evidence() -> list[Evidence]:
+    """Sample evidence for testing synthesis."""
+    return [
+        Evidence(
+            content="Testosterone therapy demonstrates efficacy in treating HSDD.",
+            citation=Citation(
+                source="pubmed",
+                title="Testosterone and Female Sexual Desire",
+                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
+                date="2023",
+                authors=["Smith J", "Jones A"],
+            ),
+        ),
+        Evidence(
+            content="A meta-analysis of 8 RCTs shows significant improvement in sexual desire.",
+            citation=Citation(
+                source="pubmed",
+                title="Meta-analysis of Testosterone Therapy",
+                url="https://pubmed.ncbi.nlm.nih.gov/67890/",
+                date="2024",
+                authors=["Johnson B"],
+            ),
+        ),
+    ]
+
+
+@pytest.fixture
+def sample_assessment() -> JudgeAssessment:
+    """Sample assessment for testing synthesis."""
+    return JudgeAssessment(
+        sufficient=True,
+        confidence=0.85,
+        reasoning="Evidence is sufficient to synthesize findings on testosterone therapy for HSDD.",
+        recommendation="synthesize",
+        next_search_queries=[],
+        details=AssessmentDetails(
+            mechanism_score=8,
+            mechanism_reasoning="Strong evidence of androgen receptor activation pathway.",
+            clinical_evidence_score=7,
+            clinical_reasoning="Multiple RCTs support efficacy in postmenopausal HSDD.",
+            drug_candidates=["Testosterone", "LibiGel"],
+            key_findings=[
+                "Testosterone improves libido in postmenopausal women",
+                "Transdermal formulation has best safety profile",
+            ],
+        ),
+    )
+
+
+@pytest.mark.unit
+class TestGenerateSynthesis:
+    """Tests for _generate_synthesis method."""
+
+    @pytest.mark.asyncio
+    async def test_calls_llm_for_narrative(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Synthesis should make an LLM call, not just use a template."""
+        mock_search = MagicMock()
+        mock_judge = MagicMock()
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]  # Needed for footer
+
+        with (
+            patch("pydantic_ai.Agent") as mock_agent_class,
+            patch("src.agent_factory.judges.get_model") as mock_get_model,
+        ):
+            mock_model = MagicMock()
+            mock_get_model.return_value = mock_model
+
+            mock_agent = MagicMock()
+            mock_result = MagicMock()
+            mock_result.output = """### Executive Summary
+
+Testosterone therapy demonstrates consistent efficacy for HSDD treatment.
+
+### Background
+
+HSDD affects many postmenopausal women.
+
+### Evidence Synthesis
+
+Studies show significant improvement in sexual desire scores.
+
+### Recommendations
+
+1. Consider testosterone therapy for postmenopausal HSDD
+
+### Limitations
+
+Long-term safety data is limited.
+
+### References
+
+1. Smith J et al. (2023). Testosterone and Female Sexual Desire."""
+
+            mock_agent.run = AsyncMock(return_value=mock_result)
+            mock_agent_class.return_value = mock_agent
+
+            result = await orchestrator._generate_synthesis(
+                query="testosterone HSDD",
+                evidence=sample_evidence,
+                assessment=sample_assessment,
+            )
+
+            # Verify LLM agent was created and called
+            mock_agent_class.assert_called_once()
+            mock_agent.run.assert_called_once()
+
+            # Verify output includes narrative content
+            assert "Executive Summary" in result
+            assert "Background" in result
+            assert "Evidence Synthesis" in result
+
+    @pytest.mark.asyncio
+    async def test_falls_back_on_llm_error(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Synthesis should fall back to template if LLM fails."""
+        mock_search = MagicMock()
+        mock_judge = MagicMock()
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        with patch("pydantic_ai.Agent") as mock_agent_class:
+            # Simulate LLM failure
+            mock_agent_class.side_effect = Exception("LLM unavailable")
+
+            result = await orchestrator._generate_synthesis(
+                query="testosterone HSDD",
+                evidence=sample_evidence,
+                assessment=sample_assessment,
+            )
+
+            # Should return template fallback (has Assessment section)
+            assert "Assessment" in result or "Drug Candidates" in result
+            assert "Testosterone" in result  # Drug candidate should be present
+
+    @pytest.mark.asyncio
+    async def test_includes_citation_footer(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Synthesis should include full citation list footer."""
+        mock_search = MagicMock()
+        mock_judge = MagicMock()
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        with (
+            patch("pydantic_ai.Agent") as mock_agent_class,
+            patch("src.agent_factory.judges.get_model"),
+        ):
+            mock_agent = MagicMock()
+            mock_result = MagicMock()
+            mock_result.output = "Narrative synthesis content."
+            mock_agent.run = AsyncMock(return_value=mock_result)
+            mock_agent_class.return_value = mock_agent
+
+            result = await orchestrator._generate_synthesis(
+                query="test query",
+                evidence=sample_evidence,
+                assessment=sample_assessment,
+            )
+
+            # Should include citation footer
+            assert "Full Citation List" in result
+            assert "pubmed.ncbi.nlm.nih.gov/12345" in result
+            assert "pubmed.ncbi.nlm.nih.gov/67890" in result
+
+
+@pytest.mark.unit
+class TestGenerateTemplateSynthesis:
+    """Tests for _generate_template_synthesis fallback method."""
+
+    def test_returns_structured_output(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Template synthesis should return structured markdown."""
+        mock_search = MagicMock()
+        mock_judge = MagicMock()
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        result = orchestrator._generate_template_synthesis(
+            query="testosterone HSDD",
+            evidence=sample_evidence,
+            assessment=sample_assessment,
+        )
+
+        # Should have all required sections
+        assert "Question" in result
+        assert "Drug Candidates" in result
+        assert "Key Findings" in result
+        assert "Assessment" in result
+        assert "Citations" in result
+
+    def test_includes_drug_candidates(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Template synthesis should list drug candidates."""
+        mock_search = MagicMock()
+        mock_judge = MagicMock()
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        result = orchestrator._generate_template_synthesis(
+            query="test",
+            evidence=sample_evidence,
+            assessment=sample_assessment,
+        )
+
+        assert "Testosterone" in result
+        assert "LibiGel" in result
+
+    def test_includes_scores(
+        self,
+        sample_evidence: list[Evidence],
+        sample_assessment: JudgeAssessment,
+    ) -> None:
+        """Template synthesis should include assessment scores."""
+        mock_search = MagicMock()
+        mock_judge = MagicMock()
+
+        orchestrator = Orchestrator(
+            search_handler=mock_search,
+            judge_handler=mock_judge,
+        )
+        orchestrator.history = [{"iteration": 1}]
+
+        result = orchestrator._generate_template_synthesis(
+            query="test",
+            evidence=sample_evidence,
+            assessment=sample_assessment,
+        )
+
+        assert "8/10" in result  # Mechanism score
+        assert "7/10" in result  # Clinical score
+        assert "85%" in result  # Confidence
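The `AsyncMock(return_value=mock_result)` stub used throughout the new file is what makes `Agent.run` awaitable: awaiting the mock yields its `return_value` and records the call for `assert_called_once()`. A self-contained sketch of the pattern (no `pydantic_ai` needed):

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

mock_result = MagicMock()
mock_result.output = "### Executive Summary\n\nNarrative content."

mock_agent = MagicMock()
# run() becomes a coroutine function; awaiting it returns mock_result.
mock_agent.run = AsyncMock(return_value=mock_result)

async def synthesize() -> str:
    result = await mock_agent.run("user prompt")
    return result.output

output = asyncio.run(synthesize())
mock_agent.run.assert_called_once()  # call was recorded
```

A plain `MagicMock` would fail here, because awaiting its return value raises `TypeError`; `AsyncMock` exists precisely for stubbing `async def` methods.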
@@ -0,0 +1,217 @@
+"""Tests for narrative synthesis prompts."""
+
+import pytest
+
+from src.prompts.synthesis import (
+    FEW_SHOT_EXAMPLE,
+    format_synthesis_prompt,
+    get_synthesis_system_prompt,
+)
+
+
+@pytest.mark.unit
+class TestSynthesisSystemPrompt:
+    """Tests for synthesis system prompt generation."""
+
+    def test_system_prompt_emphasizes_prose(self) -> None:
+        """System prompt should emphasize prose paragraphs, not bullets."""
+        prompt = get_synthesis_system_prompt()
+        assert "PROSE PARAGRAPHS" in prompt
+        assert "not bullet points" in prompt.lower()
+
+    def test_system_prompt_requires_executive_summary(self) -> None:
+        """System prompt should require executive summary section."""
+        prompt = get_synthesis_system_prompt()
+        assert "Executive Summary" in prompt
+        assert "REQUIRED" in prompt
+
+    def test_system_prompt_requires_background(self) -> None:
+        """System prompt should require background section."""
+        prompt = get_synthesis_system_prompt()
+        assert "Background" in prompt
+
+    def test_system_prompt_requires_evidence_synthesis(self) -> None:
+        """System prompt should require evidence synthesis section."""
+        prompt = get_synthesis_system_prompt()
+        assert "Evidence Synthesis" in prompt
+        assert "Mechanism of Action" in prompt
+
+    def test_system_prompt_requires_recommendations(self) -> None:
+        """System prompt should require recommendations section."""
+        prompt = get_synthesis_system_prompt()
+        assert "Recommendations" in prompt
+
+    def test_system_prompt_requires_limitations(self) -> None:
+        """System prompt should require limitations section."""
+        prompt = get_synthesis_system_prompt()
+        assert "Limitations" in prompt
+
+    def test_system_prompt_warns_about_hallucination(self) -> None:
+        """System prompt should warn about citation hallucination."""
+        prompt = get_synthesis_system_prompt()
+        assert "NEVER hallucinate" in prompt or "never hallucinate" in prompt.lower()
+
+    def test_system_prompt_includes_domain_name(self) -> None:
+        """System prompt should include domain name."""
+        prompt = get_synthesis_system_prompt("sexual_health")
+        assert "sexual health" in prompt.lower()
+
+
+@pytest.mark.unit
+class TestFormatSynthesisPrompt:
+    """Tests for synthesis user prompt formatting."""
+
+    def test_includes_query(self) -> None:
+        """User prompt should include the research query."""
+        prompt = format_synthesis_prompt(
+            query="testosterone libido",
+            evidence_summary="Study shows efficacy...",
+            drug_candidates=["Testosterone"],
+            key_findings=["Improved libido"],
+            mechanism_score=8,
+            clinical_score=7,
+            confidence=0.85,
+        )
+        assert "testosterone libido" in prompt
+
+    def test_includes_evidence_summary(self) -> None:
+        """User prompt should include evidence summary."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="Study by Smith et al. shows significant results...",
+            drug_candidates=[],
+            key_findings=[],
+            mechanism_score=5,
+            clinical_score=5,
+            confidence=0.5,
+        )
+        assert "Study by Smith et al." in prompt
+
+    def test_includes_drug_candidates(self) -> None:
+        """User prompt should include drug candidates."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="...",
+            drug_candidates=["Testosterone", "Flibanserin"],
+            key_findings=[],
+            mechanism_score=5,
+            clinical_score=5,
+            confidence=0.5,
+        )
+        assert "Testosterone" in prompt
+        assert "Flibanserin" in prompt
+
+    def test_includes_key_findings(self) -> None:
+        """User prompt should include key findings."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="...",
+            drug_candidates=[],
+            key_findings=["Improved libido in postmenopausal women", "Safe profile"],
+            mechanism_score=5,
+            clinical_score=5,
+            confidence=0.5,
+        )
+        assert "Improved libido in postmenopausal women" in prompt
+        assert "Safe profile" in prompt
+
+    def test_includes_scores(self) -> None:
+        """User prompt should include assessment scores."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="...",
+            drug_candidates=[],
+            key_findings=[],
+            mechanism_score=8,
+            clinical_score=7,
+            confidence=0.85,
+        )
+        assert "8/10" in prompt
+        assert "7/10" in prompt
+        assert "85%" in prompt
+
+    def test_handles_empty_candidates(self) -> None:
+        """User prompt should handle empty drug candidates."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="...",
+            drug_candidates=[],
+            key_findings=[],
+            mechanism_score=5,
+            clinical_score=5,
+            confidence=0.5,
+        )
+        assert "None identified" in prompt
+
+    def test_handles_empty_findings(self) -> None:
+        """User prompt should handle empty key findings."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="...",
+            drug_candidates=[],
+            key_findings=[],
+            mechanism_score=5,
+            clinical_score=5,
+            confidence=0.5,
+        )
+        assert "No specific findings" in prompt
+
+    def test_includes_few_shot_example(self) -> None:
+        """User prompt should include few-shot example."""
+        prompt = format_synthesis_prompt(
+            query="test query",
+            evidence_summary="...",
+            drug_candidates=[],
+            key_findings=[],
+            mechanism_score=5,
+            clinical_score=5,
+            confidence=0.5,
+        )
+        assert "Alprostadil" in prompt  # From the few-shot example
+
+
+@pytest.mark.unit
+class TestFewShotExample:
+    """Tests for the few-shot example quality."""
+
+    def test_few_shot_is_mostly_narrative(self) -> None:
+        """Few-shot example should be mostly prose paragraphs, not bullets."""
+        # Count substantial paragraphs (>100 chars of prose)
+        paragraphs = [p for p in FEW_SHOT_EXAMPLE.split("\n\n") if len(p) > 100]
+        # Count bullet points
+        bullets = FEW_SHOT_EXAMPLE.count("\n- ") + FEW_SHOT_EXAMPLE.count("\n1. ")
+
+        # Prose should dominate - at least as many paragraphs as bullets
+        assert len(paragraphs) >= bullets, "Few-shot example should be mostly narrative prose"
+
+    def test_few_shot_has_executive_summary(self) -> None:
+        """Few-shot example should demonstrate executive summary."""
+        assert "Executive Summary" in FEW_SHOT_EXAMPLE
+
+    def test_few_shot_has_background(self) -> None:
+        """Few-shot example should demonstrate background section."""
+        assert "Background" in FEW_SHOT_EXAMPLE
+
+    def test_few_shot_has_evidence_synthesis(self) -> None:
+        """Few-shot example should demonstrate evidence synthesis."""
+        assert "Evidence Synthesis" in FEW_SHOT_EXAMPLE
+        assert "Mechanism of Action" in FEW_SHOT_EXAMPLE
+
+    def test_few_shot_has_recommendations(self) -> None:
+        """Few-shot example should demonstrate recommendations."""
+        assert "Recommendations" in FEW_SHOT_EXAMPLE
+
+    def test_few_shot_has_limitations(self) -> None:
+        """Few-shot example should demonstrate limitations."""
+        assert "Limitations" in FEW_SHOT_EXAMPLE
+
+    def test_few_shot_has_references(self) -> None:
+        """Few-shot example should demonstrate references format."""
+        assert "References" in FEW_SHOT_EXAMPLE
+        assert "pubmed.ncbi.nlm.nih.gov" in FEW_SHOT_EXAMPLE
+
+    def test_few_shot_includes_statistics(self) -> None:
+        """Few-shot example should demonstrate statistical reporting."""
+        assert "%" in FEW_SHOT_EXAMPLE  # Percentages
+        assert "p<" in FEW_SHOT_EXAMPLE or "p=" in FEW_SHOT_EXAMPLE  # P-values
+        assert "CI" in FEW_SHOT_EXAMPLE  # Confidence intervals
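`test_few_shot_is_mostly_narrative` relies on a simple counting heuristic: paragraphs longer than 100 characters versus `- `/`1. ` bullet lines. The same logic, extracted as a standalone sketch with a sample input (the sample text is illustrative, not the real `FEW_SHOT_EXAMPLE`):

```python
def prose_dominates(text: str) -> bool:
    """True when substantial paragraphs are at least as numerous as bullets."""
    # Substantial paragraphs: blank-line-separated chunks over 100 chars.
    paragraphs = [p for p in text.split("\n\n") if len(p) > 100]
    # Bullet lines: dash bullets plus numbered-list starts.
    bullets = text.count("\n- ") + text.count("\n1. ")
    return len(paragraphs) >= bullets

narrative = (
    "### Executive Summary\n\n"
    + "Alprostadil shows consistent efficacy across trials. " * 4
    + "\n\n"
    + "### Recommendations\n\n1. Consider intracavernosal therapy"
)
```

Note the heuristic misses a bullet on the very first line (no leading newline) and counts only `1. ` among numbered items; that imprecision is acceptable for a style gate but worth knowing when editing the few-shot text.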
@@ -32,6 +32,7 @@ def mock_evidence() -> Evidence:
 class TestSearchPubMed:
     """Tests for search_pubmed MCP tool."""

+    @pytest.mark.asyncio
     @patch("src.mcp_tools._pubmed.search")
     async def test_returns_formatted_string(self, mock_search):
         """Test that search_pubmed returns Markdown formatted string."""
@@ -93,7 +94,7 @@ class TestSearchClinicalTrials:
         with patch("src.mcp_tools._trials") as mock_tool:
             mock_tool.search = AsyncMock(return_value=[mock_evidence])

-            result = await search_clinical_trials("
+            result = await search_clinical_trials("sildenafil erectile dysfunction", 10)

             assert isinstance(result, str)
             assert "Clinical Trials" in result
|
@@ -134,9 +134,9 @@ class TestClinicalTrialsIntegration:

     @pytest.mark.asyncio
     async def test_real_api_returns_interventional(self) -> None:
-        """Test that real API returns interventional studies."""
+        """Test that real API returns interventional studies for sexual health query."""
         tool = ClinicalTrialsTool()
-        results = await tool.search("
+        results = await tool.search("testosterone HSDD", max_results=3)

         # Should get results
         assert len(results) > 0
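Integration tests like the one above hit the live ClinicalTrials.gov API, so they are typically gated so CI stays deterministic. A sketch of one common gating approach, assuming a `RUN_LIVE_API` environment flag and an assumed import path for `ClinicalTrialsTool` (both hypothetical here):

```python
import asyncio
import os
from unittest.mock import AsyncMock

# Hypothetical flag: only hit the real API when explicitly requested
RUN_LIVE = os.environ.get("RUN_LIVE_API") == "1"

async def run_search(tool, query: str, max_results: int = 3):
    return list(await tool.search(query, max_results=max_results))

if RUN_LIVE:
    from src.tools.clinicaltrials import ClinicalTrialsTool  # assumed module path
    tool = ClinicalTrialsTool()
else:
    # Stubbed tool keeps the test offline and deterministic
    tool = AsyncMock()
    tool.search = AsyncMock(return_value=["NCT00000001: testosterone for HSDD trial"])

results = asyncio.run(run_search(tool, "testosterone HSDD"))
```

In pytest this gating is usually expressed with `@pytest.mark.skipif(...)` or a custom `integration` marker rather than an inline `if`.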
@@ -27,8 +27,8 @@ class TestEuropePMCTool:
             "result": [
                 {
                     "id": "12345",
-                    "title": "
-                    "abstractText": "This study examines
+                    "title": "Testosterone Therapy for HSDD Study",
+                    "abstractText": "This study examines testosterone therapy for HSDD.",
                     "doi": "10.1234/test",
                     "pubYear": "2024",
                     "source": "MED",
@@ -49,11 +49,11 @@ class TestEuropePMCTool:

         mock_instance.get.return_value = mock_resp

-        results = await tool.search("
+        results = await tool.search("testosterone HSDD therapy", max_results=5)

         assert len(results) == 1
         assert isinstance(results[0], Evidence)
-        assert "
+        assert "Testosterone Therapy for HSDD Study" in results[0].citation.title

     @pytest.mark.asyncio
     async def test_search_marks_preprints(self, tool: EuropePMCTool) -> None:
@@ -113,11 +113,11 @@ class TestEuropePMCIntegration:

     @pytest.mark.asyncio
     async def test_real_api_call(self) -> None:
-        """Test actual API returns relevant results."""
+        """Test actual API returns relevant results for sexual health query."""
         tool = EuropePMCTool()
-        results = await tool.search("
+        results = await tool.search("testosterone libido therapy", max_results=3)

         assert len(results) > 0
-        # At least one result should mention
+        # At least one result should mention testosterone or libido
         titles = " ".join([r.citation.title.lower() for r in results])
-        assert "
+        assert "testosterone" in titles or "libido" in titles or "sexual" in titles
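The mock payload in the first hunk mirrors Europe PMC's JSON shape (`result` list with `title`, `abstractText`, `doi`, `pubYear`, `source`). A simplified sketch of mapping that shape into records; the `Record` dataclass stands in for the project's `Evidence`/`Citation` types, which are assumed here:

```python
from dataclasses import dataclass

@dataclass
class Record:
    title: str
    abstract: str
    doi: str
    year: str

def parse_europepmc(payload: dict) -> list[Record]:
    # Field names follow the mocked Europe PMC response shape
    return [
        Record(
            title=item.get("title", ""),
            abstract=item.get("abstractText", ""),
            doi=item.get("doi", ""),
            year=item.get("pubYear", ""),
        )
        for item in payload.get("result", [])
    ]

payload = {"result": [{
    "id": "12345",
    "title": "Testosterone Therapy for HSDD Study",
    "abstractText": "This study examines testosterone therapy for HSDD.",
    "doi": "10.1234/test",
    "pubYear": "2024",
    "source": "MED",
}]}
records = parse_europepmc(payload)
```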
@@ -12,8 +12,8 @@ class TestQueryPreprocessing:
     def test_strip_question_words(self) -> None:
         """Test removal of question words."""
         assert strip_question_words("What drugs treat HSDD") == "drugs treat hsdd"
-        assert strip_question_words("Which medications help
-        assert strip_question_words("How can we
+        assert strip_question_words("Which medications help low libido") == "medications low libido"
+        assert strip_question_words("How can we treat ED") == "we treat ed"
         assert strip_question_words("Is sildenafil effective") == "sildenafil"

     def test_strip_preserves_medical_terms(self) -> None:
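The assertions above pin down the behavior of `strip_question_words`: lowercase the query and drop question/filler words while keeping medical terms. A hypothetical sketch that satisfies exactly these assertions; the real implementation in the project's query utils may use a different word list or approach:

```python
# Illustrative stopword set derived from the test expectations, not from source
QUESTION_WORDS = {
    "what", "which", "how", "can", "is", "are",
    "do", "does", "help", "helps", "effective",
}

def strip_question_words(query: str) -> str:
    # Lowercase, then drop question/filler words; medical terms pass through
    words = query.lower().split()
    return " ".join(w for w in words if w not in QUESTION_WORDS)
```

Note the tests also imply what is *not* stripped: "treat" and "we" survive, so the filter is a fixed word list rather than general stopword removal.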