title: Website Downloader/Cloner
emoji: 📉
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.

HTTrack Website Scraper - Gradio App

A web scraping tool built with Gradio and HTTrack that downloads and archives websites for offline browsing.

Features

✨ Key Features:

🌐 Full website mirroring with HTTrack
📊 Real-time progress tracking
📝 Detailed logging with timestamps
📦 Automatic ZIP archive creation
🎨 Clean, intuitive Gradio interface
🐳 Docker containerized for easy deployment
⚙️ Configurable scraping parameters

Prerequisites

Option 1: Docker (Recommended)

Docker installed on your system
Docker Compose (optional)

Option 2: Local Installation

Python 3.11+
HTTrack installed on your system
pip for Python package management

Installation

Using Docker (Recommended)

Build the Docker image:

docker build -t httrack-scraper .

Run the container:

docker run -p 7860:7860 httrack-scraper

Access the app: Open your browser to http://localhost:7860

Using Docker Compose

Create a docker-compose.yml:

version: '3.8'

services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped

Then run:

docker-compose up -d

Local Installation

Install HTTrack:

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install httrack

macOS:

brew install httrack

Fedora/RHEL:

sudo dnf install httrack

Install Python dependencies:

pip install -r requirements.txt

Run the application:

python app.py

Usage

Basic Workflow

1. Enter Website URL: Input the complete URL (e.g., https://example.com)
2. Configure Settings:
   - Max Depth: How many link levels to follow (1-10)
   - Max Rate: Download speed limit in bytes/sec (0 = unlimited)
   - Respect robots.txt: Honor the website's scraping preferences
3. Start Scraping: Click the "Start Scraping" button
4. Monitor Progress: Watch real-time logs and progress bar
5. Download Archive: Get the ZIP file when complete
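
The settings in this workflow have documented limits (a complete URL, a depth of 1-10, a rate of 0 or more), so it can help to check them before a scrape starts. The following is a minimal sketch of such a check; `validate_inputs` is a hypothetical helper and is not taken from app.py, which may validate its inputs differently.

```python
from urllib.parse import urlparse


def validate_inputs(url: str, max_depth: int, max_rate: int) -> tuple:
    """Check the three form values before a scrape starts (hypothetical helper)."""
    parsed = urlparse(url.strip())
    # Require a complete URL such as https://example.com
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a complete http(s) URL: {url!r}")
    # The UI documents a depth range of 1-10
    if not 1 <= max_depth <= 10:
        raise ValueError("Max Depth must be between 1 and 10")
    # 0 means unlimited; negative rates are rejected
    if max_rate < 0:
        raise ValueError("Max Rate must be 0 (unlimited) or a positive number of bytes/sec")
    return parsed.geturl(), max_depth, max_rate


print(validate_inputs("https://example.com", 3, 500000))
```

Rejecting bad values up front means HTTrack is never launched with arguments it cannot use.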

Recommended Settings

Site Type	Max Depth	Max Rate (bytes/sec)	Notes
Small Blog	2-3	500000	Quick scrape
Medium Site	3-5	500000	Balanced approach
Large Site	5-7	1000000	May take hours
Full Archive	7-10	1000000	Complete download
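
To see why the larger presets "may take hours", the rate cap can be turned into a rough lower bound on transfer time. The sketch below is simple arithmetic; the site sizes are made-up illustrative numbers, not measurements, and real scrapes will be slower because of latency and server limits.

```python
def estimate_download_seconds(total_bytes: int, max_rate: int) -> float:
    """Lower bound on transfer time for a rate cap given in bytes/sec."""
    if max_rate <= 0:
        raise ValueError("a rate of 0 means unlimited, so no bound can be computed")
    return total_bytes / max_rate


# Hypothetical site sizes paired with rates from the table above
examples = [
    ("Small Blog", 50_000_000, 500_000),
    ("Large Site", 5_000_000_000, 1_000_000),
]

for label, size_bytes, rate in examples:
    seconds = estimate_download_seconds(size_bytes, rate)
    print(f"{label}: at least {seconds / 60:.1f} minutes")
```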

Advanced HTTrack Options

The app uses these HTTrack parameters:

-r<N>: Recursion depth
-v: Verbose logging
-c8: 8 simultaneous connections
-%P: Extended HTML parsing
-A<N>: Max transfer rate limiting
-F: Custom User-Agent string

File Structure

.
├── app.py              # Main Gradio application
├── Dockerfile          # Docker container definition
├── requirements.txt    # Python dependencies
└── README.md          # This file

Legal & Ethical Considerations

⚠️ IMPORTANT:

Permission Required: Only scrape websites you have explicit permission to download
Respect robots.txt: Honor website preferences for crawlers
Terms of Service: Check the site's ToS before scraping
Copyright: Be aware of copyright restrictions on content
Server Load: Use rate limiting to avoid overwhelming servers
Personal Use: Scraped content is typically for personal/research use only

Legitimate Use Cases:

Backing up your own website
Research and analysis (with permission)
Archiving public domain content
Educational purposes
Offline browsing of authorized content

Troubleshooting

HTTrack Not Found

Error: httrack: command not found

Solution: Install HTTrack:

# Ubuntu/Debian
sudo apt-get install httrack

# macOS
brew install httrack
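
The same condition can be detected from Python before a scrape is attempted. This is a small sketch using the standard library's `shutil.which`; `find_httrack` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_httrack(binary_name: str = "httrack") -> Optional[str]:
    """Return the full path of the HTTrack binary, or None if it is not on PATH."""
    return shutil.which(binary_name)


path = find_httrack()
if path is None:
    print("httrack not found on PATH; install it with your package manager")
else:
    print(f"httrack found at {path}")
```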

Permission Denied

Error: Permission issues when creating files

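Solution: Ensure the user running the app can write to the working directory. The command below grants access to every user on the machine, so prefer a narrower mode or `chown` on shared hosts:

chmod 777 /tmp/httrack_temp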
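
To confirm whether permissions are really the problem, the directory can be probed by creating a throwaway file in it. The sketch below assumes nothing about app.py; `is_writable_dir` is a hypothetical helper.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if a file can actually be created inside the directory."""
    if not os.path.isdir(path):
        return False
    try:
        # The temporary file is removed automatically when the block exits
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


print(is_writable_dir(tempfile.gettempdir()))
```

Creating a real file is more reliable than inspecting mode bits, because it also accounts for read-only mounts and container volume ownership.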

Large Site Timeout

Issue: Scraping large sites takes too long

Solution:

Reduce max depth (try 2-3 instead of 5+)
Increase max rate if bandwidth allows
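Run on a host that stays online for the whole job (for example a server or a long-lived container) rather than a machine that may sleep or disconnect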

Docker Port Conflict

Error: Port 7860 already in use

Solution: Use a different port:

docker run -p 8080:7860 httrack-scraper
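
Before choosing a host port, you can check whether it is already taken. The following sketch tries to bind the port on the loopback interface; `port_is_free` is a hypothetical helper, and the result only reflects the moment of the check.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can be bound on the given host right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


for candidate in (7860, 8080):
    state = "free" if port_is_free(candidate) else "in use"
    print(f"port {candidate}: {state}")
```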

Performance Tips

Start Small: Test with depth 1-2 before full scrapes
Rate Limiting: Use reasonable rate limits (500 KB/s to 1 MB/s, i.e. 500000 to 1000000 in the Max Rate field)
Disk Space: Ensure adequate storage for large sites
Network: Stable internet connection recommended
Resources: HTTrack can be CPU/memory intensive
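
The disk space tip can be checked programmatically before a large scrape. This is a minimal sketch using `shutil.disk_usage`; the helper name and the 2 GB threshold are illustrative assumptions.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Example: require roughly 2 GB free in the current directory's filesystem
required = 2 * 1000**3
print("enough space" if has_free_space(".", required) else "not enough space")
```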

Development

Running in Development Mode

# Install development dependencies
pip install -r requirements.txt

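# Run normally
python app.py

# Or run with auto-reload using Gradio's CLI (it looks for a Blocks object named `demo` by default)
gradio app.py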

Customizing the UI

Edit app.py to modify:

Theme: Change gr.themes.Soft() to other themes
Colors: Modify the custom CSS
Layout: Adjust the Gradio Blocks structure

Adding Features

The modular design allows easy extension:

Add new HTTrack options in scrape_website()
Enhance logging in add_log()
Improve progress parsing in parse_httrack_output()
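
As a starting point for the logging extension, here is a sketch of what a timestamped log helper could look like. Only the name `add_log` comes from this README; the signature, the list-based storage, and the timestamp format are assumptions, and the real function in app.py may differ.

```python
from datetime import datetime
from typing import Optional


def add_log(logs: list, message: str, now: Optional[datetime] = None) -> list:
    """Append a timestamped line to the log list and return the list."""
    # Allow the caller to pass a fixed time, which makes the helper testable
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    logs.append(f"[{stamp}] {message}")
    return logs


logs = []
add_log(logs, "Scrape started")
add_log(logs, "Scrape finished")
print("\n".join(logs))
```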

Environment Variables

Variable	Default	Description
`GRADIO_SERVER_NAME`	`0.0.0.0`	Server host
`GRADIO_SERVER_PORT`	`7860`	Server port
`PYTHONUNBUFFERED`	`1`	Disables Python output buffering so logs appear immediately
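
Gradio reads the two server variables itself, but if you want to resolve them explicitly (for example to log the address at startup), a sketch like the following applies the defaults from the table. `server_settings` is a hypothetical helper and is not part of app.py or Gradio.

```python
import os


def server_settings(env=None):
    """Resolve host and port from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    host = env.get("GRADIO_SERVER_NAME", "0.0.0.0")
    port = int(env.get("GRADIO_SERVER_PORT", "7860"))
    return host, port


host, port = server_settings()
print(f"Serving on {host}:{port}")
```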

Docker Image Details

Base Image: python:3.11-slim

Installed Packages:

Python 3.11
HTTrack
Gradio
Required system libraries

Image Size: ~400MB

Security: Runs as non-root user (configurable)

Contributing

Contributions welcome! Areas for improvement:

Additional HTTrack options
Better progress parsing
Multi-URL batch processing
Resume capability for interrupted downloads
Custom file filtering

License

This project is licensed under Apache 2.0, as declared in the metadata at the top of this file, and is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.

Acknowledgments

HTTrack: Xavier Roche and contributors
Gradio: Hugging Face team
Community: Various open-source contributors that inspired this project

Support

For issues:

Check the troubleshooting section
Review HTTrack documentation: https://www.httrack.com/
Check Gradio docs: https://www.gradio.app/docs

Version: 0.0.1
Last Updated: December 2025