---
title: Website Downloader/Cloner
emoji: πŸ“‰
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.
---
# HTTrack Website Scraper - Gradio App
A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.
## Features
- 🌐 Full website mirroring with HTTrack
- πŸ“Š Real-time progress tracking
- πŸ“ Detailed logging with timestamps
- πŸ“¦ Automatic ZIP archive creation
- 🎨 Clean, intuitive Gradio interface
- 🐳 Docker containerized for easy deployment
- βš™οΈ Configurable scraping parameters
## Prerequisites
### Option 1: Docker (Recommended)
- Docker installed on your system
- Docker Compose (optional)
### Option 2: Local Installation
- Python 3.11+
- HTTrack installed on your system
- pip for Python package management
## Installation
### Using Docker (Recommended)
1. **Build the Docker image:**
```bash
docker build -t httrack-scraper .
```
2. **Run the container:**
```bash
docker run -p 7860:7860 httrack-scraper
```
3. **Access the app:**
Open your browser to `http://localhost:7860`
### Using Docker Compose
Create a `docker-compose.yml`:
```yaml
version: '3.8'
services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
```
Then run:
```bash
docker-compose up -d
```
### Local Installation
1. **Install HTTrack:**
**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install httrack
```
**macOS:**
```bash
brew install httrack
```
**Fedora/RHEL:**
```bash
sudo dnf install httrack
```
2. **Install Python dependencies:**
```bash
pip install -r requirements.txt
```
3. **Run the application:**
```bash
python app.py
```
## Usage
### Basic Workflow
1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
2. **Configure Settings**:
- **Max Depth**: How many link levels to follow (1-10)
- **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
- **Respect robots.txt**: Honor the website's scraping preferences
3. **Start Scraping**: Click the "Start Scraping" button
4. **Monitor Progress**: Watch real-time logs and progress bar
5. **Download Archive**: Get the ZIP file when complete
### Recommended Settings
| Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
|-----------|-----------|----------|-------|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |
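The presets above can be restated as a small lookup table. The names and structure here are purely illustrative (the app does not actually define such a table):

```python
# Hypothetical presets mirroring the table above:
# site type -> (max_depth, max_rate_bytes_per_sec)
PRESETS = {
    "small_blog": (3, 500_000),
    "medium_site": (5, 500_000),
    "large_site": (7, 1_000_000),
    "full_archive": (10, 1_000_000),
}

depth, rate = PRESETS["small_blog"]
print(depth, rate)  # settings for a quick scrape of a small blog
```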
### Advanced HTTrack Options
The app uses these HTTrack parameters:
- `-r<N>`: Recursion depth
- `-v`: Verbose logging
- `-c8`: 8 simultaneous connections
- `-%P`: Extended HTML parsing
- `-A<N>`: Max transfer rate limiting
- `-F`: Custom User-Agent string
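Taken together, these flags might be assembled into an `httrack` invocation roughly like the sketch below. The function name, defaults, and argument order are illustrative assumptions, not the app's actual code:

```python
def build_httrack_cmd(url, out_dir, depth=3, max_rate=500_000,
                      user_agent="Mozilla/5.0"):
    """Assemble an httrack command line from the options listed above.

    Illustrative sketch only; the real app's argument handling may differ.
    """
    return [
        "httrack", url,
        "-O", out_dir,       # output directory
        f"-r{depth}",        # recursion depth
        "-v",                # verbose logging
        "-c8",               # 8 simultaneous connections
        "-%P",               # extended HTML parsing
        f"-A{max_rate}",     # max transfer rate (bytes/sec)
        "-F", user_agent,    # custom User-Agent string
    ]

print(build_httrack_cmd("https://example.com", "/tmp/mirror"))
```

Building the command as a list (rather than a shell string) avoids quoting issues when the URL or User-Agent contains special characters.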
## File Structure
```
.
β”œβ”€β”€ app.py # Main Gradio application
β”œβ”€β”€ Dockerfile # Docker container definition
β”œβ”€β”€ requirements.txt # Python dependencies
└── README.md # This file
```
## Legal & Ethical Considerations
⚠️ **IMPORTANT**:
- **Permission Required**: Only scrape websites you have explicit permission to download
- **Respect robots.txt**: Honor website preferences for crawlers
- **Terms of Service**: Check the site's ToS before scraping
- **Copyright**: Be aware of copyright restrictions on content
- **Server Load**: Use rate limiting to avoid overwhelming servers
- **Personal Use**: Scraped content is typically for personal/research use only
**Legitimate Use Cases:**
- Backing up your own website
- Research and analysis (with permission)
- Archiving public domain content
- Educational purposes
- Offline browsing of authorized content
## Troubleshooting
### HTTrack Not Found
**Error**: `httrack: command not found`
**Solution**: Install HTTrack:
```bash
# Ubuntu/Debian
sudo apt-get install httrack
# macOS
brew install httrack
```
### Permission Denied
**Error**: Permission issues when creating files
**Solution**: Give the app's user write access to the working directory (prefer this over a world-writable `chmod 777` on shared systems):
```bash
chmod u+rwx /tmp/httrack_temp
```
### Large Site Timeout
**Issue**: Scraping large sites takes too long
**Solution**:
- Reduce max depth (try 2-3 instead of 5+)
- Increase max rate if bandwidth allows
- Run in a persistent environment (e.g., a server or always-on container rather than a laptop that may sleep)
### Docker Port Conflict
**Error**: Port 7860 already in use
**Solution**: Use a different port:
```bash
docker run -p 8080:7860 httrack-scraper
```
## Performance Tips
1. **Start Small**: Test with depth 1-2 before full scrapes
2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
3. **Disk Space**: Ensure adequate storage for large sites
4. **Network**: Stable internet connection recommended
5. **Resources**: HTTrack can be CPU/memory intensive
## Development
### Running in Development Mode
```bash
# Install development dependencies
pip install -r requirements.txt
# Run in Gradio's auto-reload mode
gradio app.py
```
### Customizing the UI
Edit `app.py` to modify:
- Theme: Change `gr.themes.Soft()` to other themes
- Colors: Modify the custom CSS
- Layout: Adjust the Gradio Blocks structure
### Adding Features
The modular design allows easy extension:
- Add new HTTrack options in `scrape_website()`
- Enhance logging in `add_log()`
- Improve progress parsing in `parse_httrack_output()`
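As a starting point for improving progress parsing, here is a minimal sketch of extracting a completion estimate from a "saved/total" counter. The function name and regex are assumptions; the real `parse_httrack_output()` in `app.py` may use different patterns, and HTTrack's verbose output format varies:

```python
import re

def parse_progress(line):
    """Return a 0.0-1.0 progress estimate from a 'saved/total' counter
    in an output line, or None if no usable counter is present."""
    match = re.search(r"(\d+)/(\d+)", line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None  # avoid division by zero on empty counters
    return min(done / total, 1.0)

print(parse_progress("12/48 files"))  # 0.25
```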
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
| `GRADIO_SERVER_PORT` | `7860` | Server port |
| `PYTHONUNBUFFERED` | `1` | Disable Python output buffering |
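A sketch of how `app.py` might read these variables, falling back to the defaults above (the variable names match the table; the exact code in the app may differ):

```python
import os

# Read server settings from the environment, with the documented defaults
server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))

print(server_name, server_port)
```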
## Docker Image Details
**Base Image**: `python:3.11-slim`
**Installed Packages**:
- Python 3.11
- HTTrack
- Gradio
- Required system libraries
**Image Size**: ~400MB
**Security**: Runs as non-root user (configurable)
## Contributing
Contributions welcome! Areas for improvement:
- Additional HTTrack options
- Better progress parsing
- Multi-URL batch processing
- Resume capability for interrupted downloads
- Custom file filtering
## License
This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.
## Acknowledgments
- **HTTrack**: Xavier Roche and contributors
- **Gradio**: Hugging Face team
- **Community**: Open-source contributors whose work inspired this project
## Support
For issues:
1. Check the troubleshooting section
2. Review HTTrack documentation: https://www.httrack.com/
3. Check Gradio docs: https://www.gradio.app/docs
---
**Version**: 0.0.1
**Last Updated**: December 2025