---
title: Website Downloader/Cloner
emoji: 🌐
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.
---
# HTTrack Website Scraper - Gradio App

A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.
## Features

- Full website mirroring with HTTrack
- Real-time progress tracking
- Detailed logging with timestamps
- Automatic ZIP archive creation
- Clean, intuitive Gradio interface
- Docker containerized for easy deployment
- Configurable scraping parameters
## Prerequisites

### Option 1: Docker (Recommended)

- Docker installed on your system
- Docker Compose (optional)

### Option 2: Local Installation

- Python 3.11+
- HTTrack installed on your system
- pip for Python package management
## Installation

### Using Docker (Recommended)

1. Build the Docker image:

   ```bash
   docker build -t httrack-scraper .
   ```

2. Run the container:

   ```bash
   docker run -p 7860:7860 httrack-scraper
   ```

3. Access the app: open your browser to http://localhost:7860
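To confirm the app is actually serving before opening a browser, a quick check like the one below can help. It is not part of the project; it simply assumes the default port mapping above and that the `requests` package is available on the host.

```python
import requests

# Hit the Gradio app on the host port mapped above (assumed: 7860).
resp = requests.get("http://localhost:7860", timeout=10)
print(resp.status_code)  # expect 200 once the container has finished starting
```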
### Using Docker Compose

Create a docker-compose.yml:

```yaml
version: '3.8'
services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
```

Then run:

```bash
docker-compose up -d
```
### Local Installation

1. Install HTTrack:

   ```bash
   # Ubuntu/Debian
   sudo apt-get update
   sudo apt-get install httrack

   # macOS
   brew install httrack

   # Fedora/RHEL
   sudo dnf install httrack
   ```

2. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run the application:

   ```bash
   python app.py
   ```
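For orientation, a stripped-down Gradio Blocks entry point is sketched below. It is illustrative only: the real app.py wires up depth, rate, and robots.txt controls, progress tracking, and ZIP packaging, and `scrape_placeholder` is a made-up stand-in for the project's `scrape_website()`.

```python
import gradio as gr

# Illustrative stand-in for the project's scrape_website() pipeline.
def scrape_placeholder(url: str) -> str:
    return f"Would mirror: {url}"

with gr.Blocks() as demo:
    url_box = gr.Textbox(label="Website URL")
    status = gr.Textbox(label="Status")
    gr.Button("Start Scraping").click(scrape_placeholder, inputs=url_box, outputs=status)

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
```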
## Usage

### Basic Workflow

1. Enter Website URL: input the complete URL (e.g., https://example.com)
2. Configure Settings:
   - Max Depth: how many link levels to follow (1-10)
   - Max Rate: download speed limit in bytes/sec (0 = unlimited)
   - Respect robots.txt: honor the website's scraping preferences
3. Start Scraping: click the "Start Scraping" button
4. Monitor Progress: watch the real-time logs and progress bar
5. Download Archive: get the ZIP file when complete
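The same workflow can also be driven from Python with `gradio_client` once the app is running. The endpoint names and argument order depend on how app.py registers its event handlers, so inspect them first rather than guessing:

```python
from gradio_client import Client

# Connect to the locally running app (assumed: default port 7860).
client = Client("http://localhost:7860")

# Print the callable endpoints and their parameters; which endpoint maps to
# "Start Scraping" depends on how app.py wires its button handlers.
client.view_api()
```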
### Recommended Settings

| Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
|---|---|---|---|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |
### Advanced HTTrack Options

The app uses these HTTrack parameters (a wrapper sketch follows the list):

- `-r<N>`: recursion depth
- `-v`: verbose logging
- `-c8`: 8 simultaneous connections
- `-%P`: extended HTML parsing
- `-A<N>`: max transfer rate limiting
- `-F`: custom User-Agent string
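As a rough illustration of how these flags combine, the sketch below builds an equivalent command with `subprocess`. It is not the project's `scrape_website()` implementation; the output directory and User-Agent string are example values only.

```python
import subprocess

def run_httrack(url: str, out_dir: str, depth: int = 3, max_rate: int = 500_000) -> int:
    """Mirror a site with HTTrack using the flags listed above (illustrative wrapper)."""
    cmd = [
        "httrack", url,
        "-O", out_dir,      # output directory (example path chosen by the caller)
        f"-r{depth}",       # recursion depth
        "-v",               # verbose logging
        "-c8",              # 8 simultaneous connections
        "-%P",              # extended HTML parsing
        f"-A{max_rate}",    # max transfer rate in bytes/sec
        "-F", "Mozilla/5.0 (compatible; HTTrackScraper)",  # custom User-Agent (example string)
    ]
    return subprocess.run(cmd).returncode

# Example: run_httrack("https://example.com", "/tmp/httrack_temp/example")
```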
## File Structure

```text
.
├── app.py            # Main Gradio application
├── Dockerfile        # Docker container definition
├── requirements.txt  # Python dependencies
└── README.md         # This file
```
## Legal & Ethical Considerations

⚠️ IMPORTANT:

- Permission Required: only scrape websites you have explicit permission to download
- Respect robots.txt: honor website preferences for crawlers
- Terms of Service: check the site's ToS before scraping
- Copyright: be aware of copyright restrictions on content
- Server Load: use rate limiting to avoid overwhelming servers
- Personal Use: scraped content is typically for personal/research use only

Legitimate use cases:

- Backing up your own website
- Research and analysis (with permission)
- Archiving public domain content
- Educational purposes
- Offline browsing of authorized content
## Troubleshooting

### HTTrack Not Found

Error: `httrack: command not found`

Solution: install HTTrack:

```bash
# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack
```
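If it is unclear whether HTTrack is actually on the PATH the app sees, a quick check like this (a convenience snippet, not part of the shipped app) shows what would be found:

```python
import shutil

# shutil.which returns the full path to the httrack binary, or None if it is missing.
path = shutil.which("httrack")
print(path or "httrack not found - install it with your package manager")
```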
### Permission Denied

Error: permission issues when creating files

Solution: ensure the app has write permissions on its temp directory:

```bash
chmod 777 /tmp/httrack_temp
```

### Large Site Timeout

Issue: scraping large sites takes too long

Solution:

- Reduce max depth (try 2-3 instead of 5+)
- Increase max rate if bandwidth allows
- Run in a persistent environment (not locally)

### Docker Port Conflict

Error: port 7860 is already in use

Solution: map a different host port to the container's port 7860:

```bash
docker run -p 8080:7860 httrack-scraper
```
## Performance Tips

- Start Small: test with depth 1-2 before full scrapes
- Rate Limiting: use reasonable rate limits (roughly 500 KB/s to 1 MB/s)
- Disk Space: ensure adequate storage for large sites
- Network: a stable internet connection is recommended
- Resources: HTTrack can be CPU- and memory-intensive
## Development

### Running in Development Mode

```bash
# Install development dependencies
pip install -r requirements.txt

# Run the app (for auto-reload, Gradio's reload mode `gradio app.py` can be used)
python app.py
```
### Customizing the UI

Edit app.py to modify:

- Theme: change `gr.themes.Soft()` to another theme (see the sketch after this list)
- Colors: modify the custom CSS
- Layout: adjust the Gradio Blocks structure
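Swapping the theme is a one-line change where the Blocks container is created; the CSS string below is a placeholder for the app's own custom CSS, not the project's actual styling.

```python
import gradio as gr

custom_css = ".gradio-container { max-width: 960px; }"  # placeholder for the app's CSS

# Swap gr.themes.Soft() for another built-in theme, e.g. Monochrome or Glass.
with gr.Blocks(theme=gr.themes.Monochrome(), css=custom_css) as demo:
    gr.Markdown("HTTrack Website Scraper")

demo.launch()
```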
### Adding Features

The modular design allows easy extension:

- Add new HTTrack options in `scrape_website()`
- Enhance logging in `add_log()`
- Improve progress parsing in `parse_httrack_output()` (sketched below)
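As an example of the kind of extension the last item describes, a parser for HTTrack's log lines might look like the sketch below. The real `parse_httrack_output()` is defined in app.py and may work quite differently; the regex and the log format it assumes are illustrative.

```python
import re

# Illustrative parser: pull "N/M" style progress counts out of an HTTrack log line.
# The actual format emitted by HTTrack varies with verbosity, so treat this as a sketch.
PROGRESS_RE = re.compile(r"(\d+)/(\d+)")

def parse_progress(line: str) -> float | None:
    """Return completion as a fraction in [0, 1], or None if the line has no counts."""
    match = PROGRESS_RE.search(line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    return done / total if total else None

# Example: parse_progress("12/48 links scanned") -> 0.25
```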
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
| `GRADIO_SERVER_PORT` | `7860` | Server port |
| `PYTHONUNBUFFERED` | `1` | Disables Python output buffering |
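One common way the first two variables reach Gradio is shown below; whether app.py reads them explicitly like this or relies on Gradio's own handling of them is an assumption here.

```python
import os
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("HTTrack Website Scraper")

# Fall back to the defaults from the table above if the variables are unset.
demo.launch(
    server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
    server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
)
```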
## Docker Image Details

Base image: `python:3.11-slim`

Installed packages:

- Python 3.11
- HTTrack
- Gradio
- Required system libraries

Image size: ~400 MB

Security: runs as a non-root user (configurable)
## Contributing

Contributions welcome! Areas for improvement:

- Additional HTTrack options
- Better progress parsing
- Multi-URL batch processing
- Resume capability for interrupted downloads
- Custom file filtering
## License

This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.
## Acknowledgments

- HTTrack: Xavier Roche and contributors
- Gradio: the Hugging Face team
- Community: the open-source contributors who inspired this project
## Support

For issues:

- Check the troubleshooting section above
- Review the HTTrack documentation: https://www.httrack.com/
- Check the Gradio docs: https://www.gradio.app/docs

Version: 0.0.1
Last Updated: December 2025