namelessai committed
Commit bc8dcba · verified · 1 Parent(s): ad4094f

Update README.md

Files changed (1)
  1. README.md +285 -4
README.md CHANGED
@@ -1,12 +1,293 @@
  ---
- title: Webdl
  emoji: 📉
- colorFrom: blue
  colorTo: blue
  sdk: docker
- pinned: false
  license: apache-2.0
  short_description: A lightweight Gradio app for mirroring websites.
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Website Downloader/Cloner
  emoji: 📉
+ colorFrom: purple
  colorTo: blue
  sdk: docker
+ pinned: true
  license: apache-2.0
  short_description: A lightweight Gradio app for mirroring websites.
  ---

+ # HTTrack Website Scraper - Gradio App
+
+ A professional web scraping tool built with Gradio and HTTrack for downloading and archiving websites for offline browsing.
+
+ ## Features
+
+ ✨ **Key Features:**
+ - 🌐 Full website mirroring with HTTrack
+ - 📊 Real-time progress tracking
+ - 📝 Detailed logging with timestamps
+ - 📦 Automatic ZIP archive creation
+ - 🎨 Clean, intuitive Gradio interface
+ - 🐳 Docker containerized for easy deployment
+ - ⚙️ Configurable scraping parameters
+
+ ## Prerequisites
+
+ ### Option 1: Docker (Recommended)
+ - Docker installed on your system
+ - Docker Compose (optional)
+
+ ### Option 2: Local Installation
+ - Python 3.11+
+ - HTTrack installed on your system
+ - pip for Python package management
+
+ ## Installation
+
+ ### Using Docker (Recommended)
+
+ 1. **Build the Docker image:**
+ ```bash
+ docker build -t httrack-scraper .
+ ```
+
+ 2. **Run the container:**
+ ```bash
+ docker run -p 7860:7860 httrack-scraper
+ ```
+
+ 3. **Access the app:**
+ Open your browser to `http://localhost:7860`
+
+ ### Using Docker Compose
+
+ Create a `docker-compose.yml`:
+
+ ```yaml
+ version: '3.8'
+
+ services:
+   httrack-scraper:
+     build: .
+     ports:
+       - "7860:7860"
+     volumes:
+       - ./downloads:/tmp/httrack_temp
+     environment:
+       - GRADIO_SERVER_NAME=0.0.0.0
+       - GRADIO_SERVER_PORT=7860
+     restart: unless-stopped
+ ```
+
+ Then run:
+ ```bash
+ docker-compose up -d
+ ```
+
+ ### Local Installation
+
+ 1. **Install HTTrack:**
+
+ **Ubuntu/Debian:**
+ ```bash
+ sudo apt-get update
+ sudo apt-get install httrack
+ ```
+
+ **macOS:**
+ ```bash
+ brew install httrack
+ ```
+
+ **Fedora/RHEL:**
+ ```bash
+ sudo dnf install httrack
+ ```
+
+ 2. **Install Python dependencies:**
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. **Run the application:**
+ ```bash
+ python app.py
+ ```
+
+ ## Usage
+
+ ### Basic Workflow
+
+ 1. **Enter Website URL**: Input the complete URL (e.g., `https://example.com`)
+ 2. **Configure Settings**:
+    - **Max Depth**: How many link levels to follow (1-10)
+    - **Max Rate**: Download speed limit in bytes/sec (0 = unlimited)
+    - **Respect robots.txt**: Honor the website's scraping preferences
+ 3. **Start Scraping**: Click the "Start Scraping" button
+ 4. **Monitor Progress**: Watch real-time logs and progress bar
+ 5. **Download Archive**: Get the ZIP file when complete
+
+ ### Recommended Settings
+
+ | Site Type | Max Depth | Max Rate (bytes/sec) | Notes |
+ |-----------|-----------|----------------------|-------|
+ | Small Blog | 2-3 | 500000 | Quick scrape |
+ | Medium Site | 3-5 | 500000 | Balanced approach |
+ | Large Site | 5-7 | 1000000 | May take hours |
+ | Full Archive | 7-10 | 1000000 | Complete download |
+
+ ### Advanced HTTrack Options
+
+ The app uses these HTTrack parameters (combined roughly as in the sketch below):
+ - `-r<N>`: Recursion depth
+ - `-v`: Verbose logging
+ - `-c8`: 8 simultaneous connections
+ - `-%P`: Extended HTML parsing
+ - `-A<N>`: Maximum transfer rate (bytes/sec)
+ - `-F`: Custom User-Agent string
+
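As a rough illustration of how these flags fit together, here is a minimal Python sketch of an HTTrack invocation via `subprocess`. The helper name, output path, and User-Agent string are illustrative assumptions; the exact command assembled in `scrape_website()` may differ.

```python
# Illustrative sketch only: how the flags above might be combined into an
# httrack command. Names, paths, and the User-Agent string are assumptions,
# not the app's actual code.
import subprocess

def build_httrack_command(url: str, out_dir: str, depth: int = 3,
                          max_rate: int = 500000) -> list[str]:
    """Assemble an httrack argument list using the options listed above."""
    return [
        "httrack", url,
        "-O", out_dir,      # mirror into this directory
        f"-r{depth}",       # recursion depth
        "-v",               # verbose logging
        "-c8",              # 8 simultaneous connections
        "-%P",              # extended HTML parsing
        f"-A{max_rate}",    # max transfer rate in bytes/sec
        "-F", "Mozilla/5.0 (compatible; httrack-scraper)",  # custom User-Agent
    ]

# Example: mirror a site two levels deep at roughly 500 KB/s
subprocess.run(
    build_httrack_command("https://example.com", "/tmp/httrack_temp/example", depth=2),
    check=True,
)
```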
+ ## File Structure
+
+ ```
+ .
+ ├── app.py             # Main Gradio application
+ ├── Dockerfile         # Docker container definition
+ ├── requirements.txt   # Python dependencies
+ └── README.md          # This file
+ ```
+
+ ## Legal & Ethical Considerations
+
+ ⚠️ **IMPORTANT**:
+
+ - **Permission Required**: Only scrape websites you have explicit permission to download
+ - **Respect robots.txt**: Honor website preferences for crawlers
+ - **Terms of Service**: Check the site's ToS before scraping
+ - **Copyright**: Be aware of copyright restrictions on content
+ - **Server Load**: Use rate limiting to avoid overwhelming servers
+ - **Personal Use**: Scraped content is typically for personal/research use only
+
+ **Legitimate Use Cases:**
+ - Backing up your own website
+ - Research and analysis (with permission)
+ - Archiving public domain content
+ - Educational purposes
+ - Offline browsing of authorized content
+
+ ## Troubleshooting
+
+ ### HTTrack Not Found
+ **Error**: `httrack: command not found`
+
+ **Solution**: Install HTTrack:
+ ```bash
+ # Ubuntu/Debian
+ sudo apt-get install httrack
+
+ # macOS
+ brew install httrack
+ ```
+
+ ### Permission Denied
+ **Error**: Permission issues when creating files
+
+ **Solution**: Ensure the app has write permissions:
+ ```bash
+ chmod 777 /tmp/httrack_temp
+ ```
+
+ ### Large Site Timeout
+ **Issue**: Scraping large sites takes too long
+
+ **Solution**:
+ - Reduce max depth (try 2-3 instead of 5+)
+ - Increase max rate if bandwidth allows
+ - Run in a persistent environment (not locally)
+
+ ### Docker Port Conflict
+ **Error**: Port 7860 already in use
+
+ **Solution**: Use a different port:
+ ```bash
+ docker run -p 8080:7860 httrack-scraper
+ ```
+
+ ## Performance Tips
+
+ 1. **Start Small**: Test with depth 1-2 before full scrapes
+ 2. **Rate Limiting**: Use reasonable rate limits (500KB-1MB/s)
+ 3. **Disk Space**: Ensure adequate storage for large sites
+ 4. **Network**: Stable internet connection recommended
+ 5. **Resources**: HTTrack can be CPU/memory intensive
+
+ ## Development
+
+ ### Running in Development Mode
+
+ ```bash
+ # Install development dependencies
+ pip install -r requirements.txt
+
+ # Run the app (for auto-reload, use `gradio app.py` instead)
+ python app.py
+ ```
+
+ ### Customizing the UI
+
+ Edit `app.py` to modify:
+ - Theme: Change `gr.themes.Soft()` to other themes
+ - Colors: Modify the custom CSS
+ - Layout: Adjust the Gradio Blocks structure (see the sketch below)
+
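For orientation, here is a minimal, self-contained sketch of the kind of change this involves. The layout, the `custom_css` value, and the component names are placeholders and are much simpler than the real `app.py`.

```python
# Minimal sketch of swapping the Gradio theme and CSS. The components and
# `custom_css` value below are placeholders, not the app's actual layout.
import gradio as gr

custom_css = ".gradio-container {max-width: 900px; margin: auto;}"

# e.g. swap gr.themes.Soft() for gr.themes.Monochrome()
with gr.Blocks(theme=gr.themes.Monochrome(), css=custom_css) as demo:
    gr.Markdown("# HTTrack Website Scraper")
    url = gr.Textbox(label="Website URL", placeholder="https://example.com")
    start_button = gr.Button("Start Scraping")

if __name__ == "__main__":
    demo.launch(server_name="0.0.0.0", server_port=7860)
```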
+ ### Adding Features
+
+ The modular design allows easy extension (a filter example follows below):
+ - Add new HTTrack options in `scrape_website()`
+ - Enhance logging in `add_log()`
+ - Improve progress parsing in `parse_httrack_output()`
+
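As one hypothetical example of a new HTTrack option, URL filters could be appended to the command line before it is run. The helper below is illustrative only and does not reflect the actual signature of `scrape_website()`.

```python
# Hypothetical sketch of adding HTTrack URL filters ("+pattern" includes a
# pattern, "-pattern" excludes it). Function and variable names are
# illustrative, not taken from app.py.
def add_url_filters(cmd: list[str], include=None, exclude=None) -> list[str]:
    """Append HTTrack filter rules to an existing argument list."""
    for pattern in include or []:
        cmd.append(f"+{pattern}")
    for pattern in exclude or []:
        cmd.append(f"-{pattern}")
    return cmd

# Example: mirror the pages but skip large archive files
cmd = add_url_filters(
    ["httrack", "https://example.com", "-O", "/tmp/httrack_temp/example"],
    exclude=["*.zip", "*.iso"],
)
```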
+ ## Environment Variables
+
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `GRADIO_SERVER_NAME` | `0.0.0.0` | Server host |
+ | `GRADIO_SERVER_PORT` | `7860` | Server port |
+ | `PYTHONUNBUFFERED` | `1` | Disables Python output buffering |
+
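As a rough sketch of how the two Gradio variables are typically consumed, the snippet below shows one way an app might read them; the exact wiring in `app.py` may differ, and Gradio's `launch()` also picks these variables up on its own.

```python
# Illustrative only: reading the server host/port from the environment and
# passing them to Gradio. app.py may wire this up differently; launch() also
# honors GRADIO_SERVER_NAME/GRADIO_SERVER_PORT by itself.
import os
import gradio as gr

server_name = os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0")
server_port = int(os.environ.get("GRADIO_SERVER_PORT", "7860"))

with gr.Blocks() as demo:
    gr.Markdown("HTTrack Website Scraper")

if __name__ == "__main__":
    demo.launch(server_name=server_name, server_port=server_port)
```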
+ ## Docker Image Details
+
+ **Base Image**: `python:3.11-slim`
+
+ **Installed Packages**:
+ - Python 3.11
+ - HTTrack
+ - Gradio
+ - Required system libraries
+
+ **Image Size**: ~400MB
+
+ **Security**: Runs as non-root user (configurable)
+
+ ## Contributing
+
+ Contributions welcome! Areas for improvement:
+ - Additional HTTrack options
+ - Better progress parsing
+ - Multi-URL batch processing
+ - Resume capability for interrupted downloads
+ - Custom file filtering
+
+ ## License
+
+ This project is provided as-is for educational and personal use. Please respect copyright laws and terms of service when scraping websites.
+
+ ## Acknowledgments
+
+ - **HTTrack**: Xavier Roche and contributors
+ - **Gradio**: Hugging Face team
+ - **Community**: Various open-source contributors who inspired this project
+
+ ## Support
+
+ For issues:
+ 1. Check the troubleshooting section
+ 2. Review the HTTrack documentation: https://www.httrack.com/
+ 3. Check the Gradio docs: https://www.gradio.app/docs
+
+ ---
+
+ **Version**: 0.0.1
+ **Last Updated**: December 2025