---
title: Website Downloader/Cloner
emoji: 📉
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.
---

# HTTrack Website Scraper - Gradio App

A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.

## Features

✨ **Key Features:**

- 🌐 Full website mirroring with HTTrack
- 📊 Real-time progress tracking
- 📝 Detailed logging with timestamps
- 📦 Automatic ZIP archive creation
- 🎨 Clean, intuitive Gradio interface
- 🐳 Docker containerized for easy deployment
- ⚙️ Configurable scraping parameters

## Prerequisites

### Option 1: Docker (Recommended)

- Docker installed on your system
- Docker Compose (optional)

### Option 2: Local Installation

- Python 3.11+
- HTTrack installed on your system
- pip for Python package management

## Installation

### Using Docker (Recommended)

1. **Build the Docker image:**

   ```bash
   docker build -t httrack-scraper .
   ```

2. **Run the container:**

   ```bash
   docker run -p 7860:7860 httrack-scraper
   ```

3. **Access the app:** Open your browser to `http://localhost:7860`

### Using Docker Compose

Create a `docker-compose.yml`:

```yaml
version: '3.8'

services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
```

Then run:

```bash
docker-compose up -d
```

### Local Installation

1. **Install HTTrack:**

   **Ubuntu/Debian:**

   ```bash
   sudo apt-get update
   sudo apt-get install httrack
   ```

   **macOS:**

   ```bash
   brew install httrack
   ```

   **Fedora/RHEL:**

   ```bash
   sudo dnf install httrack
   ```

2. **Install Python dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Run the application:**

   ```bash
   python app.py
   ```

## Usage

### Basic Workflow

1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
2. **Configure Settings**:
   - **Max Depth**: How many link levels to follow (1-10)
   - **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
   - **Respect robots.txt**: Honor the website's scraping preferences
3. **Start Scraping**: Click the "Start Scraping" button
4. **Monitor Progress**: Watch real-time logs and progress bar
5. **Download Archive**: Get the ZIP file when complete

### Recommended Settings

| Site Type | Max Depth | Max Rate | Notes |
|-----------|-----------|----------|-------|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |

### Advanced HTTrack Options

The app uses these HTTrack parameters:

- `-r`: Recursion depth
- `-v`: Verbose logging
- `-c8`: 8 simultaneous connections
- `-%P`: Extended HTML parsing
- `-A`: Max transfer rate limiting
- `-F`: Custom User-Agent string

## File Structure

```
.
├── app.py             # Main Gradio application
├── Dockerfile         # Docker container definition
├── requirements.txt   # Python dependencies
└── README.md          # This file
```

## Legal & Ethical Considerations

⚠️ **IMPORTANT**:

- **Permission Required**: Only scrape websites you have explicit permission to download
- **Respect robots.txt**: Honor website preferences for crawlers
- **Terms of Service**: Check the site's ToS before scraping
- **Copyright**: Be aware of copyright restrictions on content
- **Server Load**: Use rate limiting to avoid overwhelming servers
- **Personal Use**: Scraped content is typically for personal/research use only

**Legitimate Use Cases:**

- Backing up your own website
- Research and analysis (with permission)
- Archiving public domain content
- Educational purposes
- Offline browsing of authorized content

## Troubleshooting

### HTTrack Not Found

**Error**: `httrack: command not found`

**Solution**: Install HTTrack:

```bash
# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack
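# Quick sanity check (an added suggestion, not part of the original
# app): prints the httrack binary's path if the install succeeded,
# or a hint if the binary is still missing from PATH.
command -v httrack || echo "httrack not found on PATH"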
```

### Permission Denied

**Error**: Permission issues when creating files

**Solution**: Ensure the app has write permissions:

```bash
chmod 777 /tmp/httrack_temp
```

### Large Site Timeout

**Issue**: Scraping large sites takes too long

**Solution**:

- Reduce max depth (try 2-3 instead of 5+)
- Increase max rate if bandwidth allows
- Run in a persistent environment (not locally)

### Docker Port Conflict

**Error**: Port 7860 already in use

**Solution**: Use a different port:

```bash
docker run -p 8080:7860 httrack-scraper
```

## Performance Tips

1. **Start Small**: Test with depth 1-2 before full scrapes
2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
3. **Disk Space**: Ensure adequate storage for large sites
4. **Network**: Stable internet connection recommended
5. **Resources**: HTTrack can be CPU/memory intensive

## Development

### Running in Development Mode

```bash
# Install development dependencies
pip install -r requirements.txt

# Run with auto-reload
python app.py
```

### Customizing the UI

Edit `app.py` to modify:

- Theme: Change `gr.themes.Soft()` to other themes
- Colors: Modify the custom CSS
- Layout: Adjust the Gradio Blocks structure

### Adding Features

The modular design allows easy extension:

- Add new HTTrack options in `scrape_website()`
- Enhance logging in `add_log()`
- Improve progress parsing in `parse_httrack_output()`

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
| `GRADIO_SERVER_PORT` | `7860` | Server port |
| `PYTHONUNBUFFERED` | `1` | Disables Python output buffering |

## Docker Image Details

**Base Image**: `python:3.11-slim`

**Installed Packages**:

- Python 3.11
- HTTrack
- Gradio
- Required system libraries

**Image Size**: ~400MB

**Security**: Runs as non-root user (configurable)

## Contributing

Contributions welcome!
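For contributors interested in the progress-parsing hook, here is a minimal sketch of what a `parse_httrack_output()` helper could look like. It is an illustration, not the app's actual implementation, and it assumes HTTrack's verbose log contains `scanned/total` pairs such as `12/340: https://example.com/a.html`; adjust the regex to the output of your HTTrack build.

```python
import re
from typing import Optional

# Hypothetical helper, mirroring the parse_httrack_output() hook
# mentioned under "Adding Features". It assumes verbose log lines
# carry a "done/total:" pair; lines without one yield None.
_PROGRESS_RE = re.compile(r"(\d+)/(\d+):")

def parse_httrack_output(line: str) -> Optional[float]:
    """Return a 0.0-1.0 progress fraction for a log line, or None."""
    match = _PROGRESS_RE.search(line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None
    return min(done / total, 1.0)
```

A returned fraction can be fed straight into Gradio's progress bar; returning `None` for unmatched lines lets the caller keep the previous value instead of flickering backwards.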
Areas for improvement:

- Additional HTTrack options
- Better progress parsing
- Multi-URL batch processing
- Resume capability for interrupted downloads
- Custom file filtering

## License

This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.

## Acknowledgments

- **HTTrack**: Xavier Roche and contributors
- **Gradio**: Hugging Face team
- **Community**: Various open-source contributors that inspired this project

## Support

For issues:

1. Check the troubleshooting section
2. Review HTTrack documentation: https://www.httrack.com/
3. Check Gradio docs: https://www.gradio.app/docs

---

**Version**: 0.0.1
**Last Updated**: December 2025