---
title: Website Downloader/Cloner
emoji: π
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.
---
# HTTrack Website Scraper - Gradio App

A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.

## Features

✨ **Key Features:**

- Full website mirroring with HTTrack
- Real-time progress tracking
- Detailed logging with timestamps
- Automatic ZIP archive creation
- Clean, intuitive Gradio interface
- Docker containerized for easy deployment
- Configurable scraping parameters
## Prerequisites

### Option 1: Docker (Recommended)

- Docker installed on your system
- Docker Compose (optional)

### Option 2: Local Installation

- Python 3.11+
- HTTrack installed on your system
- pip for Python package management

## Installation

### Using Docker (Recommended)

1. **Build the Docker image:**

   ```bash
   docker build -t httrack-scraper .
   ```

2. **Run the container:**

   ```bash
   docker run -p 7860:7860 httrack-scraper
   ```

3. **Access the app:**

   Open your browser to `http://localhost:7860`
### Using Docker Compose

Create a `docker-compose.yml`:

```yaml
version: '3.8'
services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
```

Then run:

```bash
docker-compose up -d
```
### Local Installation

1. **Install HTTrack:**

   **Ubuntu/Debian:**

   ```bash
   sudo apt-get update
   sudo apt-get install httrack
   ```

   **macOS:**

   ```bash
   brew install httrack
   ```

   **Fedora/RHEL:**

   ```bash
   sudo dnf install httrack
   ```

2. **Install Python dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Run the application:**

   ```bash
   python app.py
   ```
## Usage

### Basic Workflow

1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
2. **Configure Settings**:
   - **Max Depth**: How many link levels to follow (1-10)
   - **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
   - **Respect robots.txt**: Honor the website's scraping preferences
3. **Start Scraping**: Click the "Start Scraping" button
4. **Monitor Progress**: Watch real-time logs and progress bar
5. **Download Archive**: Get the ZIP file when complete
### Recommended Settings

| Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
|-----------|-----------|----------------------|-------|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |
### Advanced HTTrack Options

The app uses these HTTrack parameters (sketched below):

- `-r<N>`: Recursion depth
- `-v`: Verbose logging
- `-c8`: 8 simultaneous connections
- `-%P`: Extended HTML parsing
- `-A<N>`: Maximum transfer rate (bytes/sec)
- `-F`: Custom User-Agent string
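
A minimal sketch of how these flags could be assembled into an `httrack` invocation from Python. The output path and User-Agent string are placeholders, and the actual command built in `scrape_website()` may differ:

```python
# Sketch: building an httrack command line from the options listed above.
# Placeholder output path and User-Agent; not the exact app.py implementation.
import subprocess

def build_httrack_command(url: str, out_dir: str, depth: int = 3,
                          max_rate: int = 500_000) -> list[str]:
    """Return an httrack command line using the options listed above."""
    return [
        "httrack", url,
        "-O", out_dir,              # output directory
        f"-r{depth}",               # recursion depth
        "-v",                       # verbose logging
        "-c8",                      # 8 simultaneous connections
        "-%P",                      # extended HTML parsing
        f"-A{max_rate}",            # max transfer rate (bytes/sec)
        "-F", "Mozilla/5.0 (compatible; httrack-scraper)",  # custom User-Agent
    ]

if __name__ == "__main__":
    cmd = build_httrack_command("https://example.com", "/tmp/httrack_temp/example")
    subprocess.run(cmd, check=True)
```

With depth 3 and a 500,000 bytes/sec cap, this corresponds to the "Small Blog" row in the table above.
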
## File Structure

```
.
├── app.py              # Main Gradio application
├── Dockerfile          # Docker container definition
├── requirements.txt    # Python dependencies
└── README.md           # This file
```
## Legal & Ethical Considerations

⚠️ **IMPORTANT**:

- **Permission Required**: Only scrape websites you have explicit permission to download
- **Respect robots.txt**: Honor website preferences for crawlers
- **Terms of Service**: Check the site's ToS before scraping
- **Copyright**: Be aware of copyright restrictions on content
- **Server Load**: Use rate limiting to avoid overwhelming servers
- **Personal Use**: Scraped content is typically for personal/research use only

**Legitimate Use Cases:**

- Backing up your own website
- Research and analysis (with permission)
- Archiving public domain content
- Educational purposes
- Offline browsing of authorized content
## Troubleshooting

### HTTrack Not Found

**Error**: `httrack: command not found`

**Solution**: Install HTTrack:

```bash
# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack
```

### Permission Denied

**Error**: Permission issues when creating files

**Solution**: Ensure the app has write permissions:

```bash
chmod 777 /tmp/httrack_temp
```
### Large Site Timeout

**Issue**: Scraping large sites takes too long

**Solution**:

- Reduce max depth (try 2-3 instead of 5+)
- Increase max rate if bandwidth allows
- Run in a persistent environment (a server or hosted Space rather than a short-lived local session)
### Docker Port Conflict

**Error**: Port 7860 already in use

**Solution**: Map a different host port:

```bash
docker run -p 8080:7860 httrack-scraper
```
## Performance Tips

1. **Start Small**: Test with depth 1-2 before full scrapes
2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
3. **Disk Space**: Ensure adequate storage for large sites
4. **Network**: Stable internet connection recommended
5. **Resources**: HTTrack can be CPU/memory intensive
## Development

### Running in Development Mode

```bash
# Install development dependencies
pip install -r requirements.txt

# Run with auto-reload (Gradio reload mode)
gradio app.py
```
### Customizing the UI

Edit `app.py` to modify (see the sketch below):

- Theme: Change `gr.themes.Soft()` to another built-in theme
- Colors: Modify the custom CSS
- Layout: Adjust the Gradio Blocks structure
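
For example, a minimal sketch of swapping the theme and CSS on a `gr.Blocks` app; the component names and CSS below are placeholders, not the actual layout in `app.py`:

```python
# Illustrative only: switching the theme and custom CSS of a gr.Blocks app.
# Component names and CSS are placeholders, not the real app.py layout.
import gradio as gr

CUSTOM_CSS = ".log-box textarea { font-family: monospace; }"

with gr.Blocks(theme=gr.themes.Monochrome(), css=CUSTOM_CSS) as demo:
    gr.Markdown("# HTTrack Website Scraper")
    url = gr.Textbox(label="Website URL", placeholder="https://example.com")
    logs = gr.Textbox(label="Logs", lines=10, elem_classes=["log-box"])
    start = gr.Button("Start Scraping")

if __name__ == "__main__":
    demo.launch()
```
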
### Adding Features

The modular design allows easy extension (see the sketch below):

- Add new HTTrack options in `scrape_website()`
- Enhance logging in `add_log()`
- Improve progress parsing in `parse_httrack_output()`
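
As an illustration, a timestamped logging helper in the spirit of `add_log()` might look like this (a sketch only; the actual signature in `app.py` may differ):

```python
# Hypothetical shape of a timestamped log helper like add_log();
# the real signature and storage in app.py may differ.
from datetime import datetime

def add_log(logs: list[str], message: str) -> str:
    """Append a timestamped entry and return the joined log text for the UI."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    logs.append(f"[{timestamp}] {message}")
    return "\n".join(logs)
```
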
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GRADIO_SERVER_NAME` | `0.0.0.0` | Host the Gradio server binds to |
| `GRADIO_SERVER_PORT` | `7860` | Port the Gradio server listens on |
| `PYTHONUNBUFFERED` | `1` | Disables Python output buffering |
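
Gradio reads the two `GRADIO_*` variables itself, but they can also be resolved explicitly; a minimal sketch using the defaults from the table above:

```python
# Sketch: resolving server settings from the environment with the
# defaults listed in the table above.
import os

server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))
print(f"Gradio will listen on {server_name}:{server_port}")
# demo.launch(server_name=server_name, server_port=server_port)
```
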
## Docker Image Details

**Base Image**: `python:3.11-slim`

**Installed Packages**:

- Python 3.11
- HTTrack
- Gradio
- Required system libraries

**Image Size**: ~400MB

**Security**: Runs as non-root user (configurable)
## Contributing

Contributions welcome! Areas for improvement:

- Additional HTTrack options
- Better progress parsing
- Multi-URL batch processing
- Resume capability for interrupted downloads
- Custom file filtering
## License

This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.

## Acknowledgments

- **HTTrack**: Xavier Roche and contributors
- **Gradio**: Hugging Face team
- **Community**: Various open-source contributors who inspired this project
## Support

For issues:

1. Check the troubleshooting section
2. Review HTTrack documentation: https://www.httrack.com/
3. Check Gradio docs: https://www.gradio.app/docs

---

**Version**: 0.0.1

**Last Updated**: December 2025