# AGENTS.md
This file provides guidance to AI agents when working with code in this repository.
## Project Overview
DeepBoner is an AI-native sexual health research agent. It uses a search-and-judge loop to autonomously search biomedical databases (PubMed, ClinicalTrials.gov, Europe PMC) and synthesize evidence for queries like "What drugs improve female libido post-menopause?" or "Evidence for testosterone therapy in women with HSDD?".
**Current Status:** Phases 1-14 COMPLETE (Foundation through Demo Submission).
## Development Commands
```bash
# Install all dependencies (including dev)
make install        # or: uv sync --all-extras && uv run pre-commit install

# Run all quality checks (lint + typecheck + test) - MUST PASS BEFORE COMMIT
make check

# Individual commands
make test           # uv run pytest tests/unit/ -v
make lint           # uv run ruff check src tests
make format         # uv run ruff format src tests
make typecheck      # uv run mypy src
make test-cov       # uv run pytest --cov=src --cov-report=term-missing

# Run single test
uv run pytest tests/unit/utils/test_config.py::TestSettings::test_default_max_iterations -v

# Integration tests (real APIs)
uv run pytest -m integration
```
## Architecture
**Pattern:** Search-and-judge loop with multi-tool orchestration.
```
User Question → Orchestrator
        ↓
  Search Loop:
    1. Query PubMed, ClinicalTrials.gov, Europe PMC
    2. Gather evidence
    3. Judge quality ("Do we have enough?")
    4. If NO  → Refine query, search more
    5. If YES → Synthesize findings (+ optional Modal analysis)
        ↓
  Research Report with Citations
```
**Key Components:**
- `src/orchestrators/` - Orchestrator package (simple, advanced, langgraph modes)
  - `simple.py` - Main search-and-judge loop
  - `advanced.py` - Multi-agent Magentic mode
  - `langgraph_orchestrator.py` - LangGraph-based workflow
- `src/tools/pubmed.py` - PubMed E-utilities search
- `src/tools/clinicaltrials.py` - ClinicalTrials.gov API
- `src/tools/europepmc.py` - Europe PMC search
- `src/tools/code_execution.py` - Modal sandbox execution
- `src/tools/search_handler.py` - Scatter-gather orchestration (sketched below)
- `src/services/embeddings.py` - Local embeddings (sentence-transformers, in-memory)
- `src/services/llamaindex_rag.py` - Premium embeddings (OpenAI, persistent ChromaDB)
- `src/services/embedding_protocol.py` - Protocol interface for embedding services
- `src/services/research_memory.py` - Shared memory layer for research state
- `src/services/statistical_analyzer.py` - Statistical analysis via Modal
- `src/utils/service_loader.py` - Tiered service selection (free vs premium)
- `src/agent_factory/judges.py` - LLM-based evidence assessment
- `src/agents/` - Magentic multi-agent mode (SearchAgent, JudgeAgent, etc.)
- `src/mcp_tools.py` - MCP tool wrappers for Claude Desktop
- `src/utils/config.py` - Pydantic Settings (loads from `.env`)
- `src/utils/models.py` - Evidence, Citation, SearchResult models
- `src/utils/exceptions.py` - Exception hierarchy
- `src/app.py` - Gradio UI with MCP server (HuggingFace Spaces)
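The scatter-gather step fans one query out to all three sources concurrently and keeps whatever comes back. A minimal sketch of the idea, assuming hypothetical `search(query)` coroutines on each tool (the real signatures in `src/tools/` may differ):

```python
import asyncio


async def gather_evidence(query: str, tools: list) -> list:
    """Fan the query out to every search tool; collect what succeeds."""
    results = await asyncio.gather(
        *(tool.search(query) for tool in tools),
        return_exceptions=True,  # one failing source shouldn't sink the batch
    )
    evidence = []
    for result in results:
        if isinstance(result, Exception):
            continue  # e.g. a RateLimitError from a single backend
        evidence.extend(result)
    return evidence
```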
**Break Conditions:** Judge approval, token budget (50K max), or max iterations (default 10).
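A compressed sketch of how those break conditions interact in the loop; `judge`, `synthesize`, and the `tools` argument are hypothetical stand-ins, not the actual orchestrator API:

```python
async def run_research(question: str, settings, tools) -> str:
    """Illustrative search-and-judge loop ending on any of the three break conditions."""
    query, evidence, tokens_used = question, [], 0
    for _ in range(settings.max_iterations):          # break 1: max iterations (default 10)
        evidence.extend(await gather_evidence(query, tools))  # scatter-gather step above
        verdict = await judge(question, evidence)     # hypothetical LLM judge call
        tokens_used += verdict.tokens_spent
        if verdict.sufficient:                        # break 2: judge approval
            break
        if tokens_used >= 50_000:                     # break 3: token budget
            break
        query = verdict.refined_query                 # refine, then search again
    return await synthesize(question, evidence)       # report with citations
```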
## Configuration
Settings are loaded via pydantic-settings from `.env`:

- `LLM_PROVIDER`: "openai" or "anthropic"
- `OPENAI_API_KEY` / `ANTHROPIC_API_KEY`: LLM keys
- `NCBI_API_KEY`: Optional, for higher PubMed rate limits
- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: For Modal sandbox (optional)
- `MAX_ITERATIONS`: 1-50, default 10
- `LOG_LEVEL`: DEBUG, INFO, WARNING, ERROR
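A minimal sketch of what the corresponding `Settings` class in `src/utils/config.py` might look like (pydantic-settings v2 style; the exact field names and defaults in the repo may differ):

```python
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    llm_provider: str = "openai"          # "openai" or "anthropic"
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    ncbi_api_key: str | None = None       # optional: higher PubMed rate limits
    modal_token_id: str | None = None
    modal_token_secret: str | None = None
    max_iterations: int = Field(default=10, ge=1, le=50)
    log_level: str = "INFO"
```

Instantiating `Settings()` pulls values from the environment and `.env`, with env var names matched case-insensitively.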
## Exception Hierarchy
```
DeepBonerError (base)
├── SearchError
│   └── RateLimitError
├── JudgeError
├── ConfigurationError
└── EmbeddingError
```
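In code, the hierarchy maps to plain exception subclasses, roughly as follows (docstrings here are illustrative):

```python
class DeepBonerError(Exception):
    """Base error for the project."""


class SearchError(DeepBonerError):
    """A biomedical search backend failed."""


class RateLimitError(SearchError):
    """A backend rejected the request for exceeding its rate limit."""


class JudgeError(DeepBonerError):
    """Evidence assessment by the judge LLM failed."""


class ConfigurationError(DeepBonerError):
    """Invalid or missing settings."""


class EmbeddingError(DeepBonerError):
    """Embedding service failure."""
```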
## LLM Model Defaults (November 2025)
As of November 29, 2025, DeepBoner uses the following default LLM models in its configuration (`src/utils/config.py`):

- **OpenAI:** `gpt-5` - Current flagship model (November 2025). Requires Tier 5 access.
- **Anthropic:** `claude-sonnet-4-5-20250929` - Mid-range Claude 4.5 model, released September 29, 2025. The flagship Claude Opus 4.5 (released November 24, 2025) is also available and can be configured by advanced users for enhanced capabilities.
- **HuggingFace (Free Tier):** `meta-llama/Llama-3.1-70B-Instruct` - Default for the free tier, subject to quota limits.
Keep these defaults current as the LLM landscape evolves.
## Testing
- **TDD:** Write tests first in `tests/unit/`, implement in `src/`
- **Markers:** `unit`, `integration`, `slow`
- **Mocking:** `respx` for httpx, `pytest-mock` for general mocking (example below)
- **Fixtures:** `tests/conftest.py` has `mock_httpx_client`, `mock_llm_response`
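For instance, a unit test that stubs an httpx call with `respx` might look like this; the endpoint is PubMed's real esearch URL, but the test itself is illustrative rather than copied from the repo:

```python
import httpx
import respx

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


@respx.mock
def test_esearch_is_stubbed() -> None:
    # Register a stub so no real network call happens under the decorator.
    respx.get(ESEARCH).mock(
        return_value=httpx.Response(200, json={"esearchresult": {"idlist": ["123"]}})
    )
    response = httpx.get(ESEARCH)
    assert response.json()["esearchresult"]["idlist"] == ["123"]
```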
## Coding Standards
- Python 3.11+, strict mypy, ruff (100-char lines)
- Type all functions, use Pydantic models for data
- Use `structlog` for logging, not `print` (example below)
- Conventional commits: `feat(scope):`, `fix:`, `docs:`
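A small example of the expected style — fully typed, Pydantic for data, `structlog` instead of `print` (the `Citation` fields here are illustrative; the real model lives in `src/utils/models.py`):

```python
import structlog
from pydantic import BaseModel

logger = structlog.get_logger()


class Citation(BaseModel):
    pmid: str
    title: str


def log_citation(citation: Citation) -> None:
    """Emit a structured log event instead of printing."""
    logger.info("citation_collected", pmid=citation.pmid, title=citation.title)
```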
## Git Workflow
- `main`: Production-ready (GitHub)
- `dev`: Development integration (GitHub)
- Remote `origin`: GitHub (source of truth for PRs/code review)
- Remote `huggingface-upstream`: HuggingFace Spaces (deployment target)
**HuggingFace Spaces Collaboration:**
- Each contributor should use their own dev branch: `yourname-dev` (e.g., `vcms-dev`, `mario-dev`)
- DO NOT push directly to `main` or `dev` on HuggingFace - these can be overwritten easily
- GitHub is the source of truth; HuggingFace is for deployment/demo
- Consider using git hooks to prevent accidental pushes to protected branches (a sketch follows)
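For example, a pre-push hook could refuse pushes of protected branches to the HuggingFace remote. This is a sketch in Python to match the rest of the repo (save as `.git/hooks/pre-push` and `chmod +x` it), not a vetted hook; git passes the remote name and URL as arguments and one `<local ref> <local sha> <remote ref> <remote sha>` line per ref on stdin:

```python
#!/usr/bin/env python3
"""Sketch: block pushes of main/dev to the HuggingFace remote."""
import sys

PROTECTED = {"refs/heads/main", "refs/heads/dev"}
remote_name = sys.argv[1]  # e.g. "huggingface-upstream"

if remote_name == "huggingface-upstream":
    for line in sys.stdin:
        parts = line.split()
        if len(parts) == 4 and parts[2] in PROTECTED:
            sys.stderr.write(
                f"Refusing to push {parts[2]} to {remote_name}; "
                "use your own yourname-dev branch instead.\n"
            )
            sys.exit(1)
sys.exit(0)
```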