---
title: Clinical Trial Inspector
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.1
app_file: ct_agent_app.py
pinned: false
---
# Clinical Trial Inspector Agent 🕵️‍♂️🔬
Clinical Trial Inspector is an advanced AI agent designed to revolutionize how researchers, clinicians, and analysts explore clinical trial data. By combining Semantic Search, Retrieval-Augmented Generation (RAG), and Visual Analytics, it transforms raw data from ClinicalTrials.gov into actionable insights.
Built with LangChain, LlamaIndex, Streamlit, Altair, Streamlit-Agraph, and Google Gemini, this tool goes beyond simple keyword search. It understands natural language, generates inline visualizations, performs complex multi-dimensional analysis, and visualizes relationships in an interactive knowledge graph.
## ✨ Key Features
### 🧠 Intelligent Search & Retrieval
- Hybrid Search: Combines Semantic Search (vector similarity) with BM25 Keyword Search (sparse retrieval) using LanceDB's Native Hybrid Search. This ensures you find studies that match both the meaning (e.g., "kidney cancer" -> "renal cell carcinoma") and exact terms (e.g., "NCT04589845", "Teclistamab").
- Smart Filtering:
  - Strict Pre-Filtering: For specific sponsors (e.g., "Pfizer"), it forces the engine to look only at that sponsor's studies first, ensuring 100% recall.
  - Strict Keyword Filtering (Analytics Only): For counting questions (e.g., "How many studies..."), the Analytics Engine (`get_study_analytics`) prioritizes studies where the query explicitly appears in the Title or Conditions, ensuring high precision and accurate counts.
  - Sponsor Alias Support: Intelligently maps aliases (e.g., "J&J", "MSD") to their canonical sponsor names ("Janssen", "Merck Sharp & Dohme") for accurate aggregation.
- Smart Summary: Returns a clean, concise list of relevant studies.
- Query Expansion: Automatically expands your search terms with medical synonyms (e.g., "Heart Attack" -> "Myocardial Infarction").
- Re-Ranking: Uses a Cross-Encoder (`ms-marco-MiniLM`) to re-score results for maximum relevance.
- Query Decomposition: Breaks down complex multi-part questions (e.g., "Compare the primary outcomes of Keytruda vs Opdivo") into sub-questions for precise answers.
- Cohort SQL Generation: Translates eligibility criteria into standard SQL queries (OMOP CDM) for patient cohort identification.
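The sponsor-alias mapping described above can be sketched as a simple normalization lookup. The aliases below are illustrative stand-ins, not the app's actual table:

```python
# Illustrative sponsor-alias table (hypothetical entries, not the app's full mapping).
SPONSOR_ALIASES = {
    "j&j": "Janssen",
    "johnson & johnson": "Janssen",
    "msd": "Merck Sharp & Dohme",
}

def normalize_sponsor(name: str) -> str:
    """Map a sponsor alias to its canonical name; pass unknown names through unchanged."""
    return SPONSOR_ALIASES.get(name.strip().lower(), name.strip())
```

Normalizing before aggregation ensures "J&J" and "Janssen" studies are counted under one sponsor.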
### 📊 Visual Analytics & Insights
- Inline Charts (Contextual): The agent automatically generates Bar Charts and Line Charts directly in the chat stream when you ask aggregation questions (e.g., "Top sponsors for Multiple Myeloma").
- Analytics Dashboard (Global): A dedicated dashboard to analyze trends across the entire dataset (60,000+ studies), independent of your chat session.
- Interactive Knowledge Graph: Visualize connections between Studies, Sponsors, and Conditions in a dynamic, interactive network graph.
### 🌍 Geospatial Dashboard
- Global Trial Map: Visualize the geographic distribution of clinical trials on an interactive world map.
- Region Toggle: Switch between World View (Country-level aggregation) and USA View (State-level aggregation).
- Dot Visualization: Uses dynamic CircleMarkers (dots) sized by trial count to show density.
- Interactive Filters: Filter the map by Phase, Status, Sponsor, Start Year, and Study Type.
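The count-based dot sizing might look like the following (an illustrative square-root rule, not the app's exact formula; the square root keeps high-density hubs from dwarfing the rest of the map):

```python
import math

def dot_radius(trial_count: int, min_radius: float = 4.0, scale: float = 1.5) -> float:
    """Size a CircleMarker by trial count (illustrative sizing rule)."""
    return min_radius + scale * math.sqrt(trial_count)
```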
### 🔍 Multi-Filter Analysis
- Complex Filtering: Answer sophisticated questions by applying multiple filters simultaneously.
- Example: "What are Pfizer's most common study indications for Phase 2?"
- Full Dataset Scope: General analytics questions analyze the entire database, not just a sample.
- Smart Retrieval: Retrieves up to 5,000 relevant studies for comprehensive analysis.
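A multi-filter question like the Pfizer example above boils down to applying several field filters at once and then aggregating. A minimal sketch over toy records (field names here are illustrative):

```python
from collections import Counter

# Toy records standing in for retrieved study metadata.
studies = [
    {"sponsor": "Pfizer", "phase": "Phase 2", "condition": "Breast Cancer"},
    {"sponsor": "Pfizer", "phase": "Phase 2", "condition": "Breast Cancer"},
    {"sponsor": "Pfizer", "phase": "Phase 3", "condition": "Migraine"},
    {"sponsor": "Novartis", "phase": "Phase 2", "condition": "Breast Cancer"},
]

def top_conditions(records, sponsor, phase, n=5):
    """Apply sponsor and phase filters simultaneously, then rank conditions by count."""
    matches = (r for r in records if r["sponsor"] == sponsor and r["phase"] == phase)
    return Counter(r["condition"] for r in matches).most_common(n)
```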
### ⚡ High-Performance Ingestion
- Parallel Processing: Uses multi-core processing to ingest and embed thousands of studies per minute.
- LanceDB Integration: Uses LanceDB for high-performance vector storage and native hybrid search.
- Idempotent Updates: Smartly updates existing records without duplication, allowing for seamless data refreshes.
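The idempotent-update behavior can be sketched as an upsert keyed on NCT ID. This is a simplified in-memory version; the real pipeline persists to LanceDB:

```python
def upsert_studies(store: dict, batch: list[dict]) -> dict:
    """Idempotent update: re-ingesting the same NCT ID overwrites the old
    record instead of duplicating it (in-memory sketch, not the app's code)."""
    for study in batch:
        store[study["nct_id"]] = study
    return store
```

Because the key is the NCT ID, re-running ingestion over overlapping date ranges never creates duplicate records.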
## 🤖 Agent Capabilities & Tools
The agent is equipped with specialized tools to handle different types of requests:
1. `search_trials`
- Purpose: Finds specific clinical trials based on natural language queries.
- Capabilities: Semantic Search, Smart Filtering (Phase, Status, Sponsor, Intervention), Query Expansion, Hybrid Search, Re-Ranking.
2. `get_study_analytics`
- Purpose: Aggregates data to reveal trends and insights.
- Capabilities: Multi-Filtering, Grouping (Phase, Status, Sponsor, Year, Condition), Full Dataset Access, Inline Visualization.
3. `compare_studies`
- Purpose: Handles complex comparison or multi-part questions.
- Capabilities: Uses Query Decomposition to break a complex query into sub-queries, executes them against the database, and synthesizes the results.
4. `find_similar_studies`
- Purpose: Discovers studies that are semantically similar to a specific trial.
- Capabilities:
- NCT Lookup: Automatically fetches content if queried with an NCT ID.
- Self-Exclusion: Filters out the reference study from results.
- Scoring: Returns similarity scores for transparency.
5. `get_study_details`
- Purpose: Fetches the full text content of a specific study by NCT ID.
- Capabilities: Retrieves all chunks of a study to provide comprehensive details (Criteria, Summary, Protocol).
6. `get_cohort_sql`
- Purpose: Translates clinical trial eligibility criteria into standard SQL queries for claims data analysis.
- Capabilities:
- Extraction: Parses text into structured inclusion/exclusion rules (Concepts, Codes).
- SQL Generation: Generates OMOP-compatible SQL queries targeting `medical_claims` and `pharmacy_claims`.
- Logic Enforcement: Applies temporal logic (e.g., "2 diagnoses > 30 days apart") for chronic conditions.
## ⚙️ How It Works (RAG Pipeline)
### 🏗️ Ingestion Pipeline
- Ingestion: `ingest_ct.py` fetches study data from ClinicalTrials.gov. It extracts rich text (including Eligibility Criteria and Interventions) and structured metadata, using multiprocessing for speed.
- Embedding: Text is converted into vector embeddings using `PubMedBERT` and stored in LanceDB.
- Retrieval:
  - Query Transformation: Synonyms are injected via LLM.
  - Pre-Filtering: Strict filters (Status, Year, Sponsor) reduce the search scope.
  - Hybrid Search: Parallel Vector Search (Semantic) and BM25 (Keyword) combined via LanceDB Native Hybrid Search.
  - Post-Filtering: Additional metadata checks (Phase, Intervention) on retrieved candidates.
  - Re-Ranking: Cross-Encoder re-scoring (cached for performance).
- Synthesis: Google Gemini synthesizes the final answer.
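The step that merges the vector and BM25 result lists can be sketched with standard Reciprocal Rank Fusion. LanceDB applies its own fusion internally; this is just the textbook formula for reference:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: each doc scores sum(1 / (k + rank))
    over every list it appears in, so docs ranked well by both retrievers win."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document found by both searches (even at modest ranks) outscores one found by only a single search.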
## 🐳 Docker Deployment Structure
The application is containerized for easy deployment to Hugging Face Spaces or any Docker-compatible environment.
### Dockerfile Breakdown
- Base Image: `python:3.10-slim` (lightweight and secure).
- Dependencies: Installs system tools (`build-essential`, `git`) and Python packages from `requirements.txt`.
- Port: Exposes port `8501` for Streamlit.
- Entrypoint: Runs `streamlit run ct_agent_app.py`.
## 🚀 Recent Updates
- RAG Optimization: Implemented a cached reranker and reduced retrieval candidates (`TOP_K=200`) for 2-3x faster search performance.
- Enhanced Analytics: Added support for grouping by Country and State in the Analytics Engine.
- Dynamic Configuration: Improved API key handling for secure, multi-user sessions.
### 🏗️ Ingestion Flow

```mermaid
graph TD
    API[ClinicalTrials.gov API] -->|Fetch Batches| Script[ingest_ct.py]
    Script -->|Process & Embed| LanceDB[(LanceDB)]
```
### 🧠 RAG Retrieval Flow
```mermaid
graph TD
    User[User Query] -->|Expand| Synonyms[Synonym Injection]
    Synonyms -->|Pre-Filter| PreFilter[Pre-Retrieval Filters]
    PreFilter -->|Filtered Scope| Hybrid[Hybrid Search]
    Hybrid -->|Parallel Search| Vector[Vector Search] & BM25[BM25 Keyword Search]
    Vector & BM25 -->|Reciprocal Rank Fusion| Fusion[Merged Candidates]
    Fusion -->|Candidates| PostFilter[Post-Retrieval Filters]
    PostFilter -->|Top N| ReRank[Cross-Encoder Re-Ranking]
    ReRank -->|Context| LLM[Google Gemini]
    LLM -->|Answer| Response[Final Response]
```
### 🕸️ Knowledge Graph
```mermaid
graph TD
    LanceDB[(LanceDB)] -->|Metadata| GraphBuilder[build_graph]
    GraphBuilder -->|Nodes & Edges| Agraph[Streamlit Agraph]
```
## 🛠️ Tech Stack
- Frontend: Streamlit, Altair, Streamlit-Agraph
- LLM: Google Gemini (`gemini-2.5-flash`)
- Orchestration: LangChain (Agents, Tool Calling)
- Retrieval (RAG): LlamaIndex (VectorStoreIndex, SubQuestionQueryEngine)
- Vector Database: LanceDB (Local)
- Embeddings: HuggingFace (`pritamdeka/S-PubMedBert-MS-MARCO`)
## 🚀 Getting Started
### Prerequisites
- Python 3.10+
- A Google Cloud API Key with access to Gemini
### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd clinical_trial_agent
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables. Create a `.env` file in the root directory and add your Google API Key:

   ```bash
   GOOGLE_API_KEY=your_google_api_key_here
   ```
## 📋 Usage
### 1. Ingest Data
Populate the local database. The script uses parallel processing for speed.
```bash
# Recommended: Ingest 5000 recent studies
python scripts/ingest_ct.py --limit 5000 --years 5

# Ingest ALL studies (Warning: Large download!)
python scripts/ingest_ct.py --limit -1 --years 10
```
### 2. Run the Agent
Launch the Streamlit application:
```bash
streamlit run ct_agent_app.py
```
### 3. Ask Questions!
- Search: "Find studies for Multiple Myeloma.", "What are the top countries for Multiple Myeloma studies?", "What are the most common drugs for Multiple Myeloma studies?", "Which organizations are the most common for Breast cancer?"
- Similarity Search: "Find studies similar to NCT04567890."
- Temporal Search: "Find Migraine studies from 2020."
- Comparison: "Compare the primary outcomes of Keytruda vs Opdivo."
- Cohort Creation: "Create a SQL Query for the NCT04567890 study."
- Analytics: "Who are the top sponsors for Breast Cancer?" (Now supports grouping by Intervention and Study Type!)
- Graph: Go to the Knowledge Graph tab to visualize connections.
## 🧪 Testing & Quality
- Unit Tests: Run `python -m pytest tests/test_unit.py` to verify core logic.
- Hybrid Search Tests: Run `python -m pytest tests/test_hybrid_search.py` to verify the search engine's precision and recall.
- Data Integrity: Run `python -m unittest tests/test_data_integrity.py` to verify database content against known ground truths.
- Sponsor Normalization: Run `python -m pytest tests/test_sponsor_normalization.py` to verify alias mapping logic.
- Linting: Codebase is formatted with `black` and linted with `flake8`.
## 📁 Project Structure
- `ct_agent_app.py`: Main application logic.
- `modules/`:
  - `utils.py`: Configuration, Normalization, Custom Filters.
  - `constants.py`: Static data (Coordinates, Mappings).
  - `tools.py`: Tool definitions (`search_trials`, `compare_studies`, etc.).
  - `cohort_tools.py`: SQL generation logic (`get_cohort_sql`).
  - `graph_viz.py`: Knowledge Graph logic.
- `scripts/`:
  - `ingest_ct.py`: Parallel data ingestion pipeline.
  - `analyze_db.py`: Database inspection.
- `ct_gov_lancedb/`: Persisted LanceDB vector store.
- `tests/`:
  - `test_unit.py`: Core logic tests.
  - `test_hybrid_search.py`: Integration tests for search engine.