---
title: Website Downloader/Cloner
emoji: πŸ“‰
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
license: apache-2.0
short_description: A lightweight Gradio app for mirroring websites.
---
# HTTrack Website Scraper - Gradio App
A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.
## Features
- 🌐 Full website mirroring with HTTrack
- πŸ“Š Real-time progress tracking
- πŸ“ Detailed logging with timestamps
- πŸ“¦ Automatic ZIP archive creation
- 🎨 Clean, intuitive Gradio interface
- 🐳 Docker containerized for easy deployment
- βš™οΈ Configurable scraping parameters
## Prerequisites
### Option 1: Docker (Recommended)
- Docker installed on your system
- Docker Compose (optional)
### Option 2: Local Installation
- Python 3.11+
- HTTrack installed on your system
- pip for Python package management
## Installation
### Using Docker (Recommended)
1. **Build the Docker image:**
```bash
docker build -t httrack-scraper .
```
2. **Run the container:**
```bash
docker run -p 7860:7860 httrack-scraper
```
3. **Access the app:**
Open your browser to `http://localhost:7860`
### Using Docker Compose
Create a `docker-compose.yml`:
```yaml
version: '3.8'
services:
  httrack-scraper:
    build: .
    ports:
      - "7860:7860"
    volumes:
      - ./downloads:/tmp/httrack_temp
    environment:
      - GRADIO_SERVER_NAME=0.0.0.0
      - GRADIO_SERVER_PORT=7860
    restart: unless-stopped
```
Then run:
```bash
docker-compose up -d
```
### Local Installation
1. **Install HTTrack:**
**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install httrack
```
**macOS:**
```bash
brew install httrack
```
**Fedora/RHEL:**
```bash
sudo dnf install httrack
```
2. **Install Python dependencies:**
```bash
pip install -r requirements.txt
```
3. **Run the application:**
```bash
python app.py
```
## Usage
### Basic Workflow
1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
2. **Configure Settings**:
- **Max Depth**: How many link levels to follow (1-10)
- **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
- **Respect robots.txt**: Honor the website's scraping preferences
3. **Start Scraping**: Click the "Start Scraping" button
4. **Monitor Progress**: Watch real-time logs and progress bar
5. **Download Archive**: Get the ZIP file when complete
### Recommended Settings
| Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
|-----------|-----------|----------|-------|
| Small Blog | 2-3 | 500000 | Quick scrape |
| Medium Site | 3-5 | 500000 | Balanced approach |
| Large Site | 5-7 | 1000000 | May take hours |
| Full Archive | 7-10 | 1000000 | Complete download |
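The presets above can be restated as a small lookup table. The names and structure here are purely illustrative (the app does not actually define such a table):

```python
# Hypothetical presets mirroring the table above:
# site type -> (max_depth, max_rate_bytes_per_sec)
PRESETS = {
    "small_blog": (3, 500_000),
    "medium_site": (5, 500_000),
    "large_site": (7, 1_000_000),
    "full_archive": (10, 1_000_000),
}

depth, rate = PRESETS["small_blog"]
print(depth, rate)  # settings for a quick scrape of a small blog
```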
### Advanced HTTrack Options
The app uses these HTTrack parameters:
- `-r<N>`: Recursion depth
- `-v`: Verbose logging
- `-c8`: 8 simultaneous connections
- `-%P`: Extended HTML parsing
- `-A<N>`: Max transfer rate limiting
- `-F`: Custom User-Agent string
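Taken together, these flags might be assembled into an `httrack` invocation roughly like the sketch below. The function name, defaults, and argument order are illustrative assumptions, not the app's actual code:

```python
def build_httrack_cmd(url, out_dir, depth=3, max_rate=500_000,
                      user_agent="Mozilla/5.0"):
    """Assemble an httrack command line from the options listed above.

    Illustrative sketch only; the real app's argument handling may differ.
    """
    return [
        "httrack", url,
        "-O", out_dir,       # output directory
        f"-r{depth}",        # recursion depth
        "-v",                # verbose logging
        "-c8",               # 8 simultaneous connections
        "-%P",               # extended HTML parsing
        f"-A{max_rate}",     # max transfer rate (bytes/sec)
        "-F", user_agent,    # custom User-Agent string
    ]

print(build_httrack_cmd("https://example.com", "/tmp/mirror"))
```

Building the command as a list (rather than a shell string) avoids quoting issues when the URL or User-Agent contains special characters.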
## File Structure
```
.
β”œβ”€β”€ app.py # Main Gradio application
β”œβ”€β”€ Dockerfile # Docker container definition
β”œβ”€β”€ requirements.txt # Python dependencies
└── README.md # This file
```
## Legal & Ethical Considerations
⚠️ **IMPORTANT**:
- **Permission Required**: Only scrape websites you have explicit permission to download
- **Respect robots.txt**: Honor website preferences for crawlers
- **Terms of Service**: Check the site's ToS before scraping
- **Copyright**: Be aware of copyright restrictions on content
- **Server Load**: Use rate limiting to avoid overwhelming servers
- **Personal Use**: Scraped content is typically for personal/research use only
**Legitimate Use Cases:**
- Backing up your own website
- Research and analysis (with permission)
- Archiving public domain content
- Educational purposes
- Offline browsing of authorized content
## Troubleshooting
### HTTrack Not Found
**Error**: `httrack: command not found`
**Solution**: Install HTTrack:
```bash
# Ubuntu/Debian
sudo apt-get install httrack
# macOS
brew install httrack
```
### Permission Denied
**Error**: Permission issues when creating files
**Solution**: Give the app's user write access to the working directory (prefer this over a world-writable `chmod 777` on shared systems):
```bash
chmod u+rwx /tmp/httrack_temp
```
### Large Site Timeout
**Issue**: Scraping large sites takes too long
**Solution**:
- Reduce max depth (try 2-3 instead of 5+)
- Increase max rate if bandwidth allows
- Run in a persistent environment (e.g., a server or always-on container rather than a laptop that may sleep)
### Docker Port Conflict
**Error**: Port 7860 already in use
**Solution**: Use a different port:
```bash
docker run -p 8080:7860 httrack-scraper
```
## Performance Tips
1. **Start Small**: Test with depth 1-2 before full scrapes
2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
3. **Disk Space**: Ensure adequate storage for large sites
4. **Network**: Stable internet connection recommended
5. **Resources**: HTTrack can be CPU/memory intensive
## Development
### Running in Development Mode
```bash
# Install development dependencies
pip install -r requirements.txt
# Run in Gradio's auto-reload mode
gradio app.py
```
### Customizing the UI
Edit `app.py` to modify:
- Theme: Change `gr.themes.Soft()` to other themes
- Colors: Modify the custom CSS
- Layout: Adjust the Gradio Blocks structure
### Adding Features
The modular design allows easy extension:
- Add new HTTrack options in `scrape_website()`
- Enhance logging in `add_log()`
- Improve progress parsing in `parse_httrack_output()`
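As a starting point for improving progress parsing, here is a minimal sketch of extracting a completion estimate from a "saved/total" counter. The function name and regex are assumptions; the real `parse_httrack_output()` in `app.py` may use different patterns, and HTTrack's verbose output format varies:

```python
import re

def parse_progress(line):
    """Return a 0.0-1.0 progress estimate from a 'saved/total' counter
    in an output line, or None if no usable counter is present."""
    match = re.search(r"(\d+)/(\d+)", line)
    if not match:
        return None
    done, total = int(match.group(1)), int(match.group(2))
    if total == 0:
        return None  # avoid division by zero on empty counters
    return min(done / total, 1.0)

print(parse_progress("12/48 files"))  # 0.25
```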
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
| `GRADIO_SERVER_PORT` | `7860` | Server port |
| `PYTHONUNBUFFERED` | `1` | Disable Python output buffering |
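A sketch of how `app.py` might read these variables, falling back to the defaults above (the variable names match the table; the exact code in the app may differ):

```python
import os

# Read server settings from the environment, with the documented defaults
server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))

print(server_name, server_port)
```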
## Docker Image Details
**Base Image**: `python:3.11-slim`
**Installed Packages**:
- Python 3.11
- HTTrack
- Gradio
- Required system libraries
**Image Size**: ~400MB
**Security**: Runs as non-root user (configurable)
## Contributing
Contributions welcome! Areas for improvement:
- Additional HTTrack options
- Better progress parsing
- Multi-URL batch processing
- Resume capability for interrupted downloads
- Custom file filtering
## License
This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.
## Acknowledgments
- **HTTrack**: Xavier Roche and contributors
- **Gradio**: Hugging Face team
- **Community**: Open-source contributors whose work inspired this project
## Support
For issues:
1. Check the troubleshooting section
2. Review HTTrack documentation: https://www.httrack.com/
3. Check Gradio docs: https://www.gradio.app/docs
---
**Version**: 0.0.1
**Last Updated**: December 2025