namelessai committed
Commit bc8dcba · verified · 1 Parent(s): ad4094f

Update README.md

Files changed (1)
  1. README.md +285 -4
README.md CHANGED
@@ -1,12 +1,293 @@
  ---
- title: Webdl
  emoji: 📉
- colorFrom: blue
  colorTo: blue
  sdk: docker
- pinned: false
  license: apache-2.0
  short_description: A lightweight Gradio app for mirroring websites.
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Website Downloader/Cloner
  emoji: 📉
+ colorFrom: purple
  colorTo: blue
  sdk: docker
+ pinned: true
  license: apache-2.0
  short_description: A lightweight Gradio app for mirroring websites.
  ---

+ # HTTrack Website Scraper - Gradio App
+
+ A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.
+
+ ## Features
+
+ ✨ **Key Features:**
+ - 🌐 Full website mirroring with HTTrack
+ - 📊 Real-time progress tracking
+ - 📝 Detailed logging with timestamps
+ - 📦 Automatic ZIP archive creation
+ - 🎨 Clean, intuitive Gradio interface
+ - 🐳 Docker containerized for easy deployment
+ - ⚙️ Configurable scraping parameters
+
+ ## Prerequisites
+
+ ### Option 1: Docker (Recommended)
+ - Docker installed on your system
+ - Docker Compose (optional)
+
+ ### Option 2: Local Installation
+ - Python 3.11+
+ - HTTrack installed on your system
+ - pip for Python package management
+
+ ## Installation
+
+ ### Using Docker (Recommended)
+
+ 1. **Build the Docker image:**
+ ```bash
+ docker build -t httrack-scraper .
+ ```
+
+ 2. **Run the container:**
+ ```bash
+ docker run -p 7860:7860 httrack-scraper
+ ```
+
+ 3. **Access the app:**
+ Open your browser to `http://localhost:7860`
+
+ ### Using Docker Compose
+
+ Create a `docker-compose.yml`:
+
+ ```yaml
+ version: '3.8'
+
+ services:
+   httrack-scraper:
+     build: .
+     ports:
+       - "7860:7860"
+     volumes:
+       - ./downloads:/tmp/httrack_temp
+     environment:
+       - GRADIO_SERVER_NAME=0.0.0.0
+       - GRADIO_SERVER_PORT=7860
+     restart: unless-stopped
+ ```
+
+ Then run:
+ ```bash
+ docker-compose up -d
+ ```
+
+ ### Local Installation
+
+ 1. **Install HTTrack:**
+
+ **Ubuntu/Debian:**
+ ```bash
+ sudo apt-get update
+ sudo apt-get install httrack
+ ```
+
+ **macOS:**
+ ```bash
+ brew install httrack
+ ```
+
+ **Fedora/RHEL:**
+ ```bash
+ sudo dnf install httrack
+ ```
+
+ 2. **Install Python dependencies:**
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. **Run the application:**
+ ```bash
+ python app.py
+ ```
+
+ ## Usage
+
+ ### Basic Workflow
+
+ 1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
+ 2. **Configure Settings**:
+    - **Max Depth**: How many link levels to follow (1-10)
+    - **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
+    - **Respect robots.txt**: Honor the website's scraping preferences
+ 3. **Start Scraping**: Click the "Start Scraping" button
+ 4. **Monitor Progress**: Watch real-time logs and progress bar
+ 5. **Download Archive**: Get the ZIP file when complete
+
+ ### Recommended Settings
+
+ | Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
+ |-----------|-----------|----------------------|-------|
+ | Small Blog | 2-3 | 500000 | Quick scrape |
+ | Medium Site | 3-5 | 500000 | Balanced approach |
+ | Large Site | 5-7 | 1000000 | May take hours |
+ | Full Archive | 7-10 | 1000000 | Complete download |
+
+ ### Advanced HTTrack Options
+
+ The app uses these HTTrack parameters (combined roughly as in the sketch below):
+ - `-r<N>`: Recursion depth
+ - `-v`: Verbose logging
+ - `-c8`: 8 simultaneous connections
+ - `-%P`: Extended HTML parsing
+ - `-A<N>`: Maximum transfer rate (bytes/sec)
+ - `-F`: Custom User-Agent string
+
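As a rough illustration of how these flags fit together, here is a minimal Python sketch of an HTTrack invocation via `subprocess`. The helper name, output path, and User-Agent string are illustrative assumptions; the exact command assembled in `scrape_website()` may differ.

```python
# Illustrative sketch only: how the flags above might be combined into an
# httrack command. Names, paths, and the User-Agent string are assumptions,
# not the app's actual code.
import subprocess

def build_httrack_command(url: str, out_dir: str, depth: int = 3,
                          max_rate: int = 500000) -> list[str]:
    """Assemble an httrack argument list using the options listed above."""
    return [
        "httrack", url,
        "-O", out_dir,      # mirror into this directory
        f"-r{depth}",       # recursion depth
        "-v",               # verbose logging
        "-c8",              # 8 simultaneous connections
        "-%P",              # extended HTML parsing
        f"-A{max_rate}",    # max transfer rate in bytes/sec
        "-F", "Mozilla/5.0 (compatible; httrack-scraper)",  # custom User-Agent
    ]

# Example: mirror a site two levels deep at roughly 500 KB/s
subprocess.run(
    build_httrack_command("https://example.com", "/tmp/httrack_temp/example", depth=2),
    check=True,
)
```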
+ ## File Structure
+
+ ```
+ .
+ ├── app.py             # Main Gradio application
+ ├── Dockerfile         # Docker container definition
+ ├── requirements.txt   # Python dependencies
+ └── README.md          # This file
+ ```
+
+ ## Legal & Ethical Considerations
+
+ ⚠️ **IMPORTANT**:
+
+ - **Permission Required**: Only scrape websites you have explicit permission to download
+ - **Respect robots.txt**: Honor website preferences for crawlers
+ - **Terms of Service**: Check the site's ToS before scraping
+ - **Copyright**: Be aware of copyright restrictions on content
+ - **Server Load**: Use rate limiting to avoid overwhelming servers
+ - **Personal Use**: Scraped content is typically for personal/research use only
+
+ **Legitimate Use Cases:**
+ - Backing up your own website
+ - Research and analysis (with permission)
+ - Archiving public domain content
+ - Educational purposes
+ - Offline browsing of authorized content
+
+ ## Troubleshooting
+
+ ### HTTrack Not Found
+ **Error**: `httrack: command not found`
+
+ **Solution**: Install HTTrack:
+ ```bash
+ # Ubuntu/Debian
+ sudo apt-get install httrack
+
+ # macOS
+ brew install httrack
+ ```
+
+ ### Permission Denied
+ **Error**: Permission issues when creating files
+
+ **Solution**: Ensure the app has write permissions:
+ ```bash
+ chmod 777 /tmp/httrack_temp
+ ```
+
+ ### Large Site Timeout
+ **Issue**: Scraping large sites takes too long
+
+ **Solution**:
+ - Reduce max depth (try 2-3 instead of 5+)
+ - Increase max rate if bandwidth allows
+ - Run in a persistent environment (not locally)
+
+ ### Docker Port Conflict
+ **Error**: Port 7860 already in use
+
+ **Solution**: Use a different port:
+ ```bash
+ docker run -p 8080:7860 httrack-scraper
+ ```
+
+ ## Performance Tips
+
+ 1. **Start Small**: Test with depth 1-2 before full scrapes
+ 2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
+ 3. **Disk Space**: Ensure adequate storage for large sites
+ 4. **Network**: Stable internet connection recommended
+ 5. **Resources**: HTTrack can be CPU/memory intensive
+
+ ## Development
+
+ ### Running in Development Mode
+
+ ```bash
+ # Install development dependencies
+ pip install -r requirements.txt
+
+ # Run the app (for auto-reload, use `gradio app.py` instead)
+ python app.py
+ ```
+
+ ### Customizing the UI
+
+ Edit `app.py` to modify:
+ - Theme: Change `gr.themes.Soft()` to other themes
+ - Colors: Modify the custom CSS
+ - Layout: Adjust the Gradio Blocks structure (see the sketch below)
+
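For orientation, here is a minimal, self-contained sketch of the kind of change this involves. The layout, the `custom_css` value, and the component names are placeholders and are much simpler than the real `app.py`.

```python
# Minimal sketch of swapping the Gradio theme and CSS. The components and
# `custom_css` value below are placeholders, not the app's actual layout.
import gradio as gr

custom_css = ".gradio-container {max-width: 900px; margin: auto;}"

# e.g. swap gr.themes.Soft() for gr.themes.Monochrome()
with gr.Blocks(theme=gr.themes.Monochrome(), css=custom_css) as demo:
    gr.Markdown("# HTTrack Website Scraper")
    url = gr.Textbox(label="Website URL", placeholder="https://example.com")
    start_button = gr.Button("Start Scraping")

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
```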
+ ### Adding Features
+
+ The modular design allows easy extension (a filter example follows below):
+ - Add new HTTrack options in `scrape_website()`
+ - Enhance logging in `add_log()`
+ - Improve progress parsing in `parse_httrack_output()`
+
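As one hypothetical example of a new HTTrack option, URL filters could be appended to the command line before it is run. The helper below is illustrative only and does not reflect the actual signature of `scrape_website()`.

```python
# Hypothetical sketch of adding HTTrack URL filters ("+pattern" includes a
# pattern, "-pattern" excludes it). Function and variable names are
# illustrative, not taken from app.py.
def add_url_filters(cmd: list[str], include=None, exclude=None) -> list[str]:
    """Append HTTrack filter rules to an existing argument list."""
    for pattern in include or []:
        cmd.append(f"+{pattern}")
    for pattern in exclude or []:
        cmd.append(f"-{pattern}")
    return cmd

# Example: mirror the pages but skip large archive files
cmd = add_url_filters(
    ["httrack", "https://example.com", "-O", "/tmp/httrack_temp/example"],
    exclude=["*.zip", "*.iso"],
)
```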
+ ## Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
+ | `GRADIO_SERVER_PORT` | `7860` | Server port |
+ | `PYTHONUNBUFFERED` | `1` | Disables Python output buffering |
+
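As a rough sketch of how the two Gradio variables are typically consumed, the snippet below shows one way an app might read them; the exact wiring in `app.py` may differ, and Gradio's `launch()` also picks these variables up on its own.

```python
# Illustrative only: reading the server host/port from the environment and
# passing them to Gradio. app.py may wire this up differently; launch() also
# honors GRADIO_SERVER_NAME/GRADIO_SERVER_PORT by itself.
import os
import gradio as gr

server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))

with gr.Blocks() as demo:
    gr.Markdown("HTTrack Website Scraper")

if __name__ == "__main__":
    demo.launch(server_name=server_name, server_port=server_port)
```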
+ ## Docker Image Details
+
+ **Base Image**: `python:3.11-slim`
+
+ **Installed Packages**:
+ - Python 3.11
+ - HTTrack
+ - Gradio
+ - Required system libraries
+
+ **Image Size**: ~400MB
+
+ **Security**: Runs as non-root user (configurable)
+
+ ## Contributing
+
+ Contributions welcome! Areas for improvement:
+ - Additional HTTrack options
+ - Better progress parsing
+ - Multi-URL batch processing
+ - Resume capability for interrupted downloads
+ - Custom file filtering
+
+ ## License
+
+ This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.
+
+ ## Acknowledgments
+
+ - **HTTrack**: Xavier Roche and contributors
+ - **Gradio**: Hugging Face team
+ - **Community**: Various open-source contributors who inspired this project
+
+ ## Support
+
+ For issues:
+ 1. Check the troubleshooting section
+ 2. Review the HTTrack documentation: https://www.httrack.com/
+ 3. Check the Gradio docs: https://www.gradio.app/docs
+
+ ---
+
+ **Version**: 0.0.1
+ **Last Updated**: December 2025