---
title: Clinical Trial Inspector
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.1
app_file: ct_agent_app.py
pinned: false
---
# Clinical Trial Inspector Agent 🕵️‍♂️🔬
Clinical Trial Inspector is an advanced AI agent designed to revolutionize how researchers, clinicians, and analysts explore clinical trial data. By combining Semantic Search, Retrieval-Augmented Generation (RAG), and Visual Analytics, it transforms raw data from ClinicalTrials.gov into actionable insights.
Built with LangChain, LlamaIndex, Streamlit, Altair, Streamlit-Agraph, and Google Gemini, this tool goes beyond simple keyword search. It understands natural language, generates inline visualizations, performs complex multi-dimensional analysis, and visualizes relationships in an interactive knowledge graph.
## ✨ Key Features
### 🧠 Intelligent Search & Retrieval
- Hybrid Search: Combines Semantic Search (vector similarity) with BM25 Keyword Search (sparse retrieval) using LanceDB's Native Hybrid Search. This ensures you find studies that match both the meaning (e.g., "kidney cancer" -> "renal cell carcinoma") and exact terms (e.g., "NCT04589845", "Teclistamab").
- Smart Filtering:
  - Strict Pre-Filtering: For specific sponsors (e.g., "Pfizer"), it forces the engine to look only at that sponsor's studies first, ensuring 100% recall.
  - Strict Keyword Filtering (Analytics Only): For counting questions (e.g., "How many studies..."), the Analytics Engine (`get_study_analytics`) prioritizes studies where the query explicitly appears in the Title or Conditions, ensuring high precision and accurate counts.
  - Sponsor Alias Support: Intelligently maps aliases (e.g., "J&J", "MSD") to their canonical sponsor names ("Janssen", "Merck Sharp & Dohme") for accurate aggregation.
- Smart Summary: Returns a clean, concise list of relevant studies.
- Query Expansion: Automatically expands your search terms with medical synonyms (e.g., "Heart Attack" -> "Myocardial Infarction").
- Re-Ranking: Uses a Cross-Encoder (`ms-marco-MiniLM`) to re-score results for maximum relevance.
- Query Decomposition: Breaks down complex multi-part questions (e.g., "Compare the primary outcomes of Keytruda vs Opdivo") into sub-questions for precise answers.
- Cohort SQL Generation: Translates eligibility criteria into standard SQL queries (OMOP CDM) for patient cohort identification.
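The sponsor-alias mapping described above can be sketched as a simple normalization lookup. The aliases below are illustrative stand-ins, not the app's actual table:

```python
# Illustrative sponsor-alias table (hypothetical entries, not the app's full mapping).
SPONSOR_ALIASES = {
    "j&j": "Janssen",
    "johnson & johnson": "Janssen",
    "msd": "Merck Sharp & Dohme",
}

def normalize_sponsor(name: str) -> str:
    """Map a sponsor alias to its canonical name; pass unknown names through unchanged."""
    return SPONSOR_ALIASES.get(name.strip().lower(), name.strip())
```

Normalizing before aggregation ensures "J&J" and "Janssen" studies are counted under one sponsor.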
### 📊 Visual Analytics & Insights
- Inline Charts (Contextual): The agent automatically generates Bar Charts and Line Charts directly in the chat stream when you ask aggregation questions (e.g., "Top sponsors for Multiple Myeloma").
- Analytics Dashboard (Global): A dedicated dashboard to analyze trends across the entire dataset (60,000+ studies), independent of your chat session.
- Interactive Knowledge Graph: Visualize connections between Studies, Sponsors, and Conditions in a dynamic, interactive network graph.
### 🌍 Geospatial Dashboard
- Global Trial Map: Visualize the geographic distribution of clinical trials on an interactive world map.
- Region Toggle: Switch between World View (Country-level aggregation) and USA View (State-level aggregation).
- Dot Visualization: Uses dynamic CircleMarkers (dots) sized by trial count to show density.
- Interactive Filters: Filter the map by Phase, Status, Sponsor, Start Year, and Study Type.
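The count-based dot sizing might look like the following (an illustrative square-root rule, not the app's exact formula; the square root keeps high-density hubs from dwarfing the rest of the map):

```python
import math

def dot_radius(trial_count: int, min_radius: float = 4.0, scale: float = 1.5) -> float:
    """Size a CircleMarker by trial count (illustrative sizing rule)."""
    return min_radius + scale * math.sqrt(trial_count)
```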
### 🔍 Multi-Filter Analysis
- Complex Filtering: Answer sophisticated questions by applying multiple filters simultaneously.
- Example: "What are Pfizer's most common study indications for Phase 2?"
- Full Dataset Scope: General analytics questions analyze the entire database, not just a sample.
- Smart Retrieval: Retrieves up to 5,000 relevant studies for comprehensive analysis.
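A multi-filter question like the Pfizer example above boils down to applying several field filters at once and then aggregating. A minimal sketch over toy records (field names here are illustrative):

```python
from collections import Counter

# Toy records standing in for retrieved study metadata.
studies = [
    {"sponsor": "Pfizer", "phase": "Phase 2", "condition": "Breast Cancer"},
    {"sponsor": "Pfizer", "phase": "Phase 2", "condition": "Breast Cancer"},
    {"sponsor": "Pfizer", "phase": "Phase 3", "condition": "Migraine"},
    {"sponsor": "Novartis", "phase": "Phase 2", "condition": "Breast Cancer"},
]

def top_conditions(records, sponsor, phase, n=5):
    """Apply sponsor and phase filters simultaneously, then rank conditions by count."""
    matches = (r for r in records if r["sponsor"] == sponsor and r["phase"] == phase)
    return Counter(r["condition"] for r in matches).most_common(n)
```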
### ⚡ High-Performance Ingestion
- Parallel Processing: Uses multi-core processing to ingest and embed thousands of studies per minute.
- LanceDB Integration: Uses LanceDB for high-performance vector storage and native hybrid search.
- Idempotent Updates: Smartly updates existing records without duplication, allowing for seamless data refreshes.
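The idempotent-update behavior can be sketched as an upsert keyed on NCT ID. This is a simplified in-memory version; the real pipeline persists to LanceDB:

```python
def upsert_studies(store: dict, batch: list[dict]) -> dict:
    """Idempotent update: re-ingesting the same NCT ID overwrites the old
    record instead of duplicating it (in-memory sketch, not the app's code)."""
    for study in batch:
        store[study["nct_id"]] = study
    return store
```

Because the key is the NCT ID, re-running ingestion over overlapping date ranges never creates duplicate records.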
## 🤖 Agent Capabilities & Tools
The agent is equipped with specialized tools to handle different types of requests:
1. `search_trials`
- Purpose: Finds specific clinical trials based on natural language queries.
- Capabilities: Semantic Search, Smart Filtering (Phase, Status, Sponsor, Intervention), Query Expansion, Hybrid Search, Re-Ranking.
2. `get_study_analytics`
- Purpose: Aggregates data to reveal trends and insights.
- Capabilities: Multi-Filtering, Grouping (Phase, Status, Sponsor, Year, Condition), Full Dataset Access, Inline Visualization.
3. `compare_studies`
- Purpose: Handles complex comparison or multi-part questions.
- Capabilities: Uses Query Decomposition to break a complex query into sub-queries, executes them against the database, and synthesizes the results.
4. `find_similar_studies`
- Purpose: Discovers studies that are semantically similar to a specific trial.
- Capabilities:
- NCT Lookup: Automatically fetches content if queried with an NCT ID.
- Self-Exclusion: Filters out the reference study from results.
- Scoring: Returns similarity scores for transparency.
5. `get_study_details`
- Purpose: Fetches the full text content of a specific study by NCT ID.
- Capabilities: Retrieves all chunks of a study to provide comprehensive details (Criteria, Summary, Protocol).
6. `get_cohort_sql`
- Purpose: Translates clinical trial eligibility criteria into standard SQL queries for claims data analysis.
- Capabilities:
- Extraction: Parses text into structured inclusion/exclusion rules (Concepts, Codes).
- SQL Generation: Generates OMOP-compatible SQL queries targeting `medical_claims` and `pharmacy_claims`.
- Logic Enforcement: Applies temporal logic (e.g., "2 diagnoses > 30 days apart") for chronic conditions.
## ⚙️ How It Works (RAG Pipeline)
### 🏗️ Ingestion Pipeline
- Ingestion: `ingest_ct.py` fetches study data from ClinicalTrials.gov. It extracts rich text (including Eligibility Criteria and Interventions) and structured metadata, using multiprocessing for speed.
- Embedding: Text is converted into vector embeddings using `PubMedBERT` and stored in LanceDB.
- Retrieval:
  - Query Transformation: Synonyms are injected via LLM.
  - Pre-Filtering: Strict filters (Status, Year, Sponsor) reduce the search scope.
  - Hybrid Search: Parallel Vector Search (Semantic) and BM25 (Keyword) combined via LanceDB Native Hybrid Search.
  - Post-Filtering: Additional metadata checks (Phase, Intervention) on retrieved candidates.
  - Re-Ranking: Cross-Encoder re-scoring (cached for performance).
- Synthesis: Google Gemini synthesizes the final answer.
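The step that merges the vector and BM25 result lists can be sketched with standard Reciprocal Rank Fusion. LanceDB applies its own fusion internally; this is just the textbook formula for reference:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each doc scores sum(1 / (k + rank))
    over every list it appears in, so docs ranked well by both retrievers win."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document found by both searches (even at modest ranks) outscores one found by only a single search.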
## 🐳 Docker Deployment Structure
The application is containerized for easy deployment to Hugging Face Spaces or any Docker-compatible environment.
### Dockerfile Breakdown
- Base Image: `python:3.10-slim` (lightweight and secure).
- Dependencies: Installs system tools (`build-essential`, `git`) and Python packages from `requirements.txt`.
- Port: Exposes port `8501` for Streamlit.
- Entrypoint: Runs `streamlit run ct_agent_app.py`.
## 🚀 Recent Updates
- RAG Optimization: Implemented a cached reranker and reduced retrieval candidates (`TOP_K=200`) for 2-3x faster search performance.
- Enhanced Analytics: Added support for grouping by Country and State in the Analytics Engine.
- Dynamic Configuration: Improved API key handling for secure, multi-user sessions.
### 🏗️ Ingestion Flow

```mermaid
graph TD
    API[ClinicalTrials.gov API] -->|Fetch Batches| Script[ingest_ct.py]
    Script -->|Process & Embed| LanceDB[(LanceDB)]
```
### 🧠 RAG Retrieval Flow
```mermaid
graph TD
    User[User Query] -->|Expand| Synonyms[Synonym Injection]
    Synonyms -->|Pre-Filter| PreFilter[Pre-Retrieval Filters]
    PreFilter -->|Filtered Scope| Hybrid[Hybrid Search]
    Hybrid -->|Parallel Search| Vector[Vector Search] & BM25[BM25 Keyword Search]
    Vector & BM25 -->|Reciprocal Rank Fusion| Fusion[Merged Candidates]
    Fusion -->|Candidates| PostFilter[Post-Retrieval Filters]
    PostFilter -->|Top N| ReRank[Cross-Encoder Re-Ranking]
    ReRank -->|Context| LLM[Google Gemini]
    LLM -->|Answer| Response[Final Response]
```
### 🕸️ Knowledge Graph
```mermaid
graph TD
    LanceDB[(LanceDB)] -->|Metadata| GraphBuilder[build_graph]
    GraphBuilder -->|Nodes & Edges| Agraph[Streamlit Agraph]
```
## 🛠️ Tech Stack
- Frontend: Streamlit, Altair, Streamlit-Agraph
- LLM: Google Gemini (`gemini-2.5-flash`)
- Orchestration: LangChain (Agents, Tool Calling)
- Retrieval (RAG): LlamaIndex (VectorStoreIndex, SubQuestionQueryEngine)
- Vector Database: LanceDB (Local)
- Embeddings: HuggingFace (`pritamdeka/S-PubMedBert-MS-MARCO`)
## 🚀 Getting Started
### Prerequisites
- Python 3.10+
- A Google Cloud API Key with access to Gemini
### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd clinical_trial_agent
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables. Create a `.env` file in the root directory and add your Google API Key:

   ```bash
   GOOGLE_API_KEY=your_google_api_key_here
   ```
## 📋 Usage
### 1. Ingest Data
Populate the local database. The script uses parallel processing for speed.
```bash
# Recommended: Ingest 5000 recent studies
python scripts/ingest_ct.py --limit 5000 --years 5

# Ingest ALL studies (Warning: Large download!)
python scripts/ingest_ct.py --limit -1 --years 10
```
### 2. Run the Agent
Launch the Streamlit application:
```bash
streamlit run ct_agent_app.py
```
### 3. Ask Questions!
- Search: "Find studies for Multiple Myeloma.", "What are the top countries for Multiple Myeloma studies?", "What are the most common drugs for Multiple Myeloma studies?", "Which organizations are the most common for Breast cancer?"
- Similarity Search: "Find studies similar to NCT04567890."
- Temporal Search: "Find Migraine studies from 2020."
- Comparison: "Compare the primary outcomes of Keytruda vs Opdivo."
- Cohort Creation: "Create a SQL Query for the NCT04567890 study."
- Analytics: "Who are the top sponsors for Breast Cancer?" (Now supports grouping by Intervention and Study Type!)
- Graph: Go to the Knowledge Graph tab to visualize connections.
## 🧪 Testing & Quality
- Unit Tests: Run `python -m pytest tests/test_unit.py` to verify core logic.
- Hybrid Search Tests: Run `python -m pytest tests/test_hybrid_search.py` to verify the search engine's precision and recall.
- Data Integrity: Run `python -m unittest tests/test_data_integrity.py` to verify database content against known ground truths.
- Sponsor Normalization: Run `python -m pytest tests/test_sponsor_normalization.py` to verify alias mapping logic.
- Linting: Codebase is formatted with `black` and linted with `flake8`.
## 📁 Project Structure
- `ct_agent_app.py`: Main application logic.
- `modules/`:
  - `utils.py`: Configuration, Normalization, Custom Filters.
  - `constants.py`: Static data (Coordinates, Mappings).
  - `tools.py`: Tool definitions (`search_trials`, `compare_studies`, etc.).
  - `cohort_tools.py`: SQL generation logic (`get_cohort_sql`).
  - `graph_viz.py`: Knowledge Graph logic.
- `scripts/`:
  - `ingest_ct.py`: Parallel data ingestion pipeline.
  - `analyze_db.py`: Database inspection.
- `ct_gov_lancedb/`: Persisted LanceDB vector store.
- `tests/`:
  - `test_unit.py`: Core logic tests.
  - `test_hybrid_search.py`: Integration tests for search engine.