title: Website Downloader/Cloner
emoji: 📉
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.

HTTrack Website Scraper - Gradio App

A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.

Features

✨ Key Features:

  • 🌐 Full website mirroring with HTTrack
  • πŸ“Š Real-time progress tracking
  • πŸ“ Detailed logging with timestamps
  • πŸ“¦ Automatic ZIP archive creation
  • 🎨 Clean, intuitive Gradio interface
  • 🐳 Docker containerized for easy deployment
  • βš™οΈ Configurable scraping parameters

Prerequisites

Option 1: Docker (Recommended)

  • Docker installed on your system
  • Docker Compose (optional)

Option 2: Local Installation

  • Python 3.11+
  • HTTrack installed on your system
  • pip for Python package management

Installation

Using Docker (Recommended)

  1. Build the Docker image:
docker build -t httrack-scraper .
  2. Run the container:
docker run -p 7860:7860 httrack-scraper
  3. Access the app: Open your browser to http://localhost:7860

Using Docker Compose

Create a docker-compose.yml:

version: '3.8'

services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped

Then run:

docker-compose up -d

Local Installation

  1. Install HTTrack:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install httrack

macOS:

brew install httrack

Fedora/RHEL:

sudo dnf install httrack
  2. Install Python dependencies:
pip install -r requirements.txt
  3. Run the application:
python app.py

Usage

Basic Workflow

  1. Enter Website URL: Input the complete URL (e.g., https://example.com)
  2. Configure Settings:
    • Max Depth: How many link levels to follow (1-10)
    • Max Rate: Download speed limit in bytes/sec (0 = unlimited)
    • Respect robots.txt: Honor the website's scraping preferences
  3. Start Scraping: Click the "Start Scraping" button
  4. Monitor Progress: Watch real-time logs and progress bar
  5. Download Archive: Get the ZIP file when complete
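
The settings above can be checked before a scrape starts. The following is a minimal sketch, not the app's actual code: `validate_settings` is a hypothetical helper, and the rules it enforces (an http/https URL with a host, depth 1-10, a non-negative rate) are taken from the workflow description.

```python
from urllib.parse import urlparse


def validate_settings(url: str, max_depth: int, max_rate: int) -> list[str]:
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    parsed = urlparse(url.strip())
    # A complete URL needs both a scheme and a host, e.g. https://example.com
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL must start with http:// or https:// and include a host")
    # The documented depth range is 1-10
    if not 1 <= max_depth <= 10:
        problems.append("Max Depth must be between 1 and 10")
    # 0 means unlimited; negative rates make no sense
    if max_rate < 0:
        problems.append("Max Rate must be 0 (unlimited) or a positive number of bytes/sec")
    return problems


if __name__ == "__main__":
    for problem in validate_settings("example.com", 12, 500000):
        print(problem)
```

Returning a list of messages rather than raising on the first error lets a UI show every problem at once.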

Recommended Settings

| Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
|---|---|---|---|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |
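
To get a feel for how long a rate-limited scrape can take, the rate cap gives a simple lower bound: total bytes divided by bytes per second. The sketch below assumes you already have a rough idea of the site's total size. `estimate_minutes` is a hypothetical helper and the sizes are illustrative, not measurements.

```python
def estimate_minutes(total_bytes: int, max_rate: int) -> float:
    """Lower bound on transfer time in minutes at a given rate cap (bytes/sec)."""
    if max_rate <= 0:
        # A rate of 0 means unlimited, so the cap gives no bound
        raise ValueError("max_rate must be positive to estimate a duration")
    return total_bytes / max_rate / 60


if __name__ == "__main__":
    # Illustrative sizes only
    for label, size in [("600 MB site", 600_000_000), ("20 GB site", 20_000_000_000)]:
        print(label, round(estimate_minutes(size, 1_000_000), 1), "minutes at 1000000 bytes/sec")
```

At 1000000 bytes/sec, a 600 MB site needs at least 10 minutes of pure transfer time. Real runs take longer because of latency, parsing, and server response times.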

Advanced HTTrack Options

The app uses these HTTrack parameters:

  • -r<N>: Recursion depth
  • -v: Verbose logging
  • -c8: 8 simultaneous connections
  • -%P: Extended HTML parsing
  • -A<N>: Max transfer rate limiting
  • -F: Custom User-Agent string

File Structure

.
├── app.py              # Main Gradio application
├── Dockerfile          # Docker container definition
├── requirements.txt    # Python dependencies
└── README.md           # This file

Legal & Ethical Considerations

⚠️ IMPORTANT:

  • Permission Required: Only scrape websites you have explicit permission to download
  • Respect robots.txt: Honor website preferences for crawlers
  • Terms of Service: Check the site's ToS before scraping
  • Copyright: Be aware of copyright restrictions on content
  • Server Load: Use rate limiting to avoid overwhelming servers
  • Personal Use: Scraped content is typically for personal/research use only

Legitimate Use Cases:

  • Backing up your own website
  • Research and analysis (with permission)
  • Archiving public domain content
  • Educational purposes
  • Offline browsing of authorized content

Troubleshooting

HTTrack Not Found

Error: httrack: command not found

Solution: Install HTTrack:

# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack
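
To confirm the install from Python, a quick check of the PATH is enough. This is a small sketch using the standard library. `check_httrack` is a hypothetical helper, not a function from app.py.

```python
import shutil


def check_httrack(binary: str = "httrack") -> str:
    """Report whether the given binary can be found on PATH."""
    path = shutil.which(binary)
    if path is None:
        return f"{binary} not found on PATH; install it first"
    return f"{binary} found at {path}"


if __name__ == "__main__":
    print(check_httrack())
```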

Permission Denied

Error: Permission issues when creating files

Solution: Ensure the app has write permissions:

chmod 777 /tmp/httrack_temp
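
Before changing permissions, it can help to confirm that the directory really is the problem. The sketch below tries to create a temporary file in the target directory. `is_writable` is a hypothetical helper, and the path in the usage example is the one from the command above.

```python
import os
import tempfile


def is_writable(directory: str) -> bool:
    """Return True if a file can actually be created inside `directory`."""
    try:
        os.makedirs(directory, exist_ok=True)
        # Creating a real temp file is more reliable than inspecting mode bits
        with tempfile.TemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


if __name__ == "__main__":
    print("writable" if is_writable("/tmp/httrack_temp") else "not writable")
```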

Large Site Timeout

Issue: Scraping large sites takes too long

Solution:

  • Reduce max depth (try 2-3 instead of 5+)
  • Increase max rate if bandwidth allows
  • Run in a persistent environment (for example, a server or long-running container) rather than on a local machine that may sleep or disconnect

Docker Port Conflict

Error: Port 7860 already in use

Solution: Use a different port:

docker run -p 8080:7860 httrack-scraper
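
To find a free host port before picking one by hand, you can probe with a socket. This is a sketch under the assumption that you are checking ports on the local machine. Both helper names are hypothetical.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if host:port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def first_free_port(start: int = 7860, attempts: int = 20) -> int:
    """Return the first bindable port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"no free port found in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    print(first_free_port())
```

The result can be used as the host side of the `-p` mapping, keeping 7860 as the container port. The check is a snapshot, so another process could still claim the port before the container starts.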

Performance Tips

  1. Start Small: Test with depth 1-2 before full scrapes
  2. Rate Limiting: Use reasonable rate limits (500 KB/s to 1 MB/s, i.e. a Max Rate of 500000 to 1000000)
  3. Disk Space: Ensure adequate storage for large sites
  4. Network: Stable internet connection recommended
  5. Resources: HTTrack can be CPU/memory intensive

Development

Running in Development Mode

# Install development dependencies
pip install -r requirements.txt

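# Run with auto-reload using the Gradio CLI (plain `python app.py` does not reload on changes).
# The CLI looks for a Blocks object named `demo` by default.
gradio app.py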

Customizing the UI

Edit app.py to modify:

  • Theme: Change gr.themes.Soft() to other themes
  • Colors: Modify the custom CSS
  • Layout: Adjust the Gradio Blocks structure

Adding Features

The modular design allows easy extension:

  • Add new HTTrack options in scrape_website()
  • Enhance logging in add_log()
  • Improve progress parsing in parse_httrack_output()
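
As an example of what enhanced logging could look like, here is a sketch of a timestamped logger. Only the name `add_log` comes from this README. The signature, the log level argument, and the choice to return the joined log text are assumptions, so the real function in app.py may differ.

```python
from datetime import datetime


def add_log(logs: list[str], message: str, level: str = "INFO") -> str:
    """Append a timestamped entry to `logs` and return the full log text."""
    timestamp = datetime.now().strftime("%H:%M:%S")
    logs.append(f"[{timestamp}] {level}: {message}")
    # Returning the joined text makes it easy to feed a Gradio textbox
    return "\n".join(logs)


if __name__ == "__main__":
    history: list[str] = []
    add_log(history, "Starting scrape")
    print(add_log(history, "Rate limit reached", level="WARN"))
```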

Environment Variables

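| Variable | Default | Description |
|---|---|---|
| GRADIO_SERVER_NAME | 0.0.0.0 | Host address the server binds to |
| GRADIO_SERVER_PORT | 7860 | Port the server listens on |
| PYTHONUNBUFFERED | 1 | Disables Python output buffering so logs appear immediately |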
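
If you want to read these values yourself, for example to print the address at startup, a sketch like the following works. `server_settings` is a hypothetical helper, and the fallbacks mirror the defaults documented above rather than anything confirmed in app.py.

```python
import os


def server_settings(env=None) -> tuple[str, int]:
    """Return (host, port) from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    host = env.get("GRADIO_SERVER_NAME", "0.0.0.0")
    port = int(env.get("GRADIO_SERVER_PORT", "7860"))
    return host, port


if __name__ == "__main__":
    host, port = server_settings()
    print(f"Serving on {host}:{port}")
```

Accepting an optional mapping instead of always reading `os.environ` keeps the function easy to test.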

Docker Image Details

Base Image: python:3.11-slim

Installed Packages:

  • Python 3.11
  • HTTrack
  • Gradio
  • Required system libraries

Image Size: ~400MB

Security: Runs as non-root user (configurable)

Contributing

Contributions welcome! Areas for improvement:

  • Additional HTTrack options
  • Better progress parsing
  • Multi-URL batch processing
  • Resume capability for interrupted downloads
  • Custom file filtering

License

This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.

Acknowledgments

  • HTTrack: Xavier Roche and contributors
  • Gradio: Hugging Face team
  • Community: Various open-source contributors that inspired this project

Support

For issues:

  1. Check the troubleshooting section
  2. Review HTTrack documentation: https://www.httrack.com/
  3. Check Gradio docs: https://www.gradio.app/docs

Version: 0.0.1
Last Updated: December 2025