---
title: Website Downloader/Cloner
emoji: 📉
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.
---

# HTTrack Website Scraper - Gradio App

A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.

## Features

✨ **Key Features:**

- 🌐 Full website mirroring with HTTrack
- 📊 Real-time progress tracking
- 📝 Detailed logging with timestamps
- 📦 Automatic ZIP archive creation
- 🎨 Clean, intuitive Gradio interface
- 🐳 Docker containerized for easy deployment
- ⚙️ Configurable scraping parameters

## Prerequisites

### Option 1: Docker (Recommended)

- Docker installed on your system
- Docker Compose (optional)

### Option 2: Local Installation

- Python 3.11+
- HTTrack installed on your system
- pip for Python package management

## Installation

### Using Docker (Recommended)

1. **Build the Docker image:**

   ```bash
   docker build -t httrack-scraper .
   ```

2. **Run the container:**

   ```bash
   docker run -p 7860:7860 httrack-scraper
   ```

3. **Access the app:** Open your browser to `http://localhost:7860`

### Using Docker Compose

Create a `docker-compose.yml`:

```yaml
version: '3.8'

services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
```

Then run:

```bash
docker-compose up -d
```

### Local Installation

1. **Install HTTrack:**

   **Ubuntu/Debian:**

   ```bash
   sudo apt-get update
   sudo apt-get install httrack
   ```

   **macOS:**

   ```bash
   brew install httrack
   ```

   **Fedora/RHEL:**

   ```bash
   sudo dnf install httrack
   ```

2. **Install Python dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Run the application:**

   ```bash
   python app.py
   ```

## Usage

### Basic Workflow

1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
2. **Configure Settings**:
   - **Max Depth**: How many link levels to follow (1-10)
   - **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
   - **Respect robots.txt**: Honor the website's scraping preferences
3. **Start Scraping**: Click the "Start Scraping" button
4. **Monitor Progress**: Watch real-time logs and progress bar
5. **Download Archive**: Get the ZIP file when complete

### Recommended Settings

| Site Type | Max Depth | Max Rate | Notes |
|-----------|-----------|----------|-------|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |

### Advanced HTTrack Options

The app uses these HTTrack parameters:

- `-r`: Recursion depth
- `-v`: Verbose logging
- `-c8`: 8 simultaneous connections
- `-%P`: Extended HTML parsing
- `-A`: Max transfer rate limiting
- `-F`: Custom User-Agent string

## File Structure

```
.
├── app.py             # Main Gradio application
├── Dockerfile         # Docker container definition
├── requirements.txt   # Python dependencies
└── README.md          # This file
```

## Legal & Ethical Considerations

⚠️ **IMPORTANT**:

- **Permission Required**: Only scrape websites you have explicit permission to download
- **Respect robots.txt**: Honor website preferences for crawlers
- **Terms of Service**: Check the site's ToS before scraping
- **Copyright**: Be aware of copyright restrictions on content
- **Server Load**: Use rate limiting to avoid overwhelming servers
- **Personal Use**: Scraped content is typically for personal/research use only

**Legitimate Use Cases:**

- Backing up your own website
- Research and analysis (with permission)
- Archiving public domain content
- Educational purposes
- Offline browsing of authorized content

## Troubleshooting

### HTTrack Not Found

**Error**: `httrack: command not found`

**Solution**: Install HTTrack:

```bash
# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack
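# Quick sanity check (an added suggestion, not part of the original
# app): prints the httrack binary's path if the install succeeded,
# or a hint if the binary is still missing from PATH.
command -v httrack || echo "httrack not found on PATH"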
```

### Permission Denied

**Error**: Permission issues when creating files

**Solution**: Ensure the app has write permissions:

```bash
chmod 777 /tmp/httrack_temp
```

### Large Site Timeout

**Issue**: Scraping large sites takes too long

**Solution**:

- Reduce max depth (try 2-3 instead of 5+)
- Increase max rate if bandwidth allows
- Run in a persistent environment (not locally)

### Docker Port Conflict

**Error**: Port 7860 already in use

**Solution**: Use a different port:

```bash
docker run -p 8080:7860 httrack-scraper
```

## Performance Tips

1. **Start Small**: Test with depth 1-2 before full scrapes
2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
3. **Disk Space**: Ensure adequate storage for large sites
4. **Network**: Stable internet connection recommended
5. **Resources**: HTTrack can be CPU/memory intensive

## Development

### Running in Development Mode

```bash
# Install development dependencies
pip install -r requirements.txt

# Run with auto-reload
python app.py
```

### Customizing the UI

Edit `app.py` to modify:

- Theme: Change `gr.themes.Soft()` to other themes
- Colors: Modify the custom CSS
- Layout: Adjust the Gradio Blocks structure

### Adding Features

The modular design allows easy extension:

- Add new HTTrack options in `scrape_website()`
- Enhance logging in `add_log()`
- Improve progress parsing in `parse_httrack_output()`

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
| `GRADIO_SERVER_PORT` | `7860` | Server port |
| `PYTHONUNBUFFERED` | `1` | Disables Python output buffering |

## Docker Image Details

**Base Image**: `python:3.11-slim`

**Installed Packages**:

- Python 3.11
- HTTrack
- Gradio
- Required system libraries

**Image Size**: ~400MB

**Security**: Runs as non-root user (configurable)

## Contributing

Contributions welcome!
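For contributors interested in the progress-parsing hook, here is a minimal sketch of what a `parse_httrack_output()` helper could look like. It is an illustration, not the app's actual implementation, and it assumes HTTrack's verbose log contains `scanned/total` pairs such as `12/340: https://example.com/a.html`; adjust the regex to the output of your HTTrack build.

```python
import re
from typing import Optional

# Hypothetical helper, mirroring the parse_httrack_output() hook
# mentioned under "Adding Features". It assumes verbose log lines
# carry a "done/total:" pair; lines without one yield None.
_PROGRESS_RE = re.compile(r"(\d+)/(\d+):")

def parse_httrack_output(line: str) -> Optional[float]:
    """Return a 0.0-1.0 progress fraction for a log line, or None."""
    match = _PROGRESS_RE.search(line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None
    return min(done / total, 1.0)
```

A returned fraction can be fed straight into Gradio's progress bar; returning `None` for unmatched lines lets the caller keep the previous value instead of flickering backwards.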
Areas for improvement:

- Additional HTTrack options
- Better progress parsing
- Multi-URL batch processing
- Resume capability for interrupted downloads
- Custom file filtering

## License

This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.

## Acknowledgments

- **HTTrack**: Xavier Roche and contributors
- **Gradio**: Hugging Face team
- **Community**: Various open-source contributors that inspired this project

## Support

For issues:

1. Check the troubleshooting section
2. Review HTTrack documentation: https://www.httrack.com/
3. Check Gradio docs: https://www.gradio.app/docs

---

**Version**: 0.0.1
**Last Updated**: December 2025