# Quick Start Guide

Get the HTTrack Website Scraper running in under 5 minutes!

## 🚀 Fastest Way (Docker)

```bash
# 1. Make the deploy script executable
chmod +x deploy.sh

# 2. Deploy everything
./deploy.sh deploy

# 3. Open your browser
# Go to: http://localhost:7860
```

That's it! 🎉
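To confirm the deployment actually came up, you can check the container and probe the port. The container name below matches the `docker` commands used elsewhere in this guide; adjust it if your `deploy.sh` uses a different one:

```bash
# Confirm the container is running
docker ps --filter "name=httrack-scraper"

# Probe the web UI (should return an HTTP success status)
curl -I http://localhost:7860
```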
## 🐳 Manual Docker Setup

If you prefer step-by-step:

```bash
# Build the image
docker build -t httrack-scraper .

# Run the container
docker run -d -p 7860:7860 --name httrack-scraper httrack-scraper

# Access at http://localhost:7860
```
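If you want scraped sites to survive container restarts, you can mount a host directory over the app's download location. The in-container path below (`/app/output`) is an assumption; verify the real path against the Dockerfile:

```bash
# Persist downloads to ./output on the host
# NOTE: /app/output is an assumed path -- check the Dockerfile
docker run -d -p 7860:7860 \
  -v "$PWD/output:/app/output" \
  --name httrack-scraper httrack-scraper
```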
## 🐍 Local Python Setup

Without Docker:

```bash
# 1. Install HTTrack
sudo apt-get install httrack   # Ubuntu/Debian
brew install httrack           # macOS

# 2. Install Python packages
pip install -r requirements.txt

# 3. Run the app
python app.py

# 4. Open http://localhost:7860
```
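To keep the app's dependencies separate from your system Python, it's worth doing steps 2-3 inside a virtual environment:

```bash
# Create and activate an isolated environment first
python3 -m venv .venv
source .venv/bin/activate

# Then install and run as above
pip install -r requirements.txt
python app.py
```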
## 🎯 First Scrape

1. Enter a URL: `https://example.com`
2. Set Max Depth: `2` (a good starting point)
3. Click "Start Scraping"
4. Watch the progress and logs
5. Download your ZIP file when complete
## ⚙️ Recommended Settings for First Time

| Setting | Value | Why |
|---------|-------|-----|
| Max Depth | 2-3 | Fast; won't download too much |
| Max Rate | 500000 (bytes/sec) | Respectful to servers |
| Respect robots.txt | ✅ Checked | Ethical scraping |
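For reference, these settings correspond to HTTrack's own command-line flags, so a roughly equivalent direct invocation would look like the following (the output directory is illustrative, and the app's exact flags may differ):

```bash
# -r3: mirror depth 3
# -A500000: cap transfer rate at 500,000 bytes/sec
# -s2: always obey robots.txt
httrack https://example.com -O ./mirror -r3 -A500000 -s2
```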
## 🛠️ Common Commands

```bash
# View logs
docker logs -f httrack-scraper

# Stop the app
docker stop httrack-scraper

# Restart the app
docker restart httrack-scraper

# Or use the deploy script
./deploy.sh logs     # View logs
./deploy.sh stop     # Stop app
./deploy.sh restart  # Restart app
./deploy.sh clean    # Remove everything
```
## ⚠️ Before You Start

**Important Reminders:**
- ✅ Only scrape websites you have permission to access
- ✅ Respect robots.txt files
- ✅ Be aware of copyright laws
- ✅ Use reasonable rate limits
- ✅ Check the website's Terms of Service

**Good Use Cases:**
- Backing up your own website
- Archiving with permission
- Research projects (with authorization)
- Personal offline browsing (legally obtained content)
## 🔧 Quick Troubleshooting

### Port Already in Use
```bash
# Use a different port
docker run -d -p 8080:7860 --name httrack-scraper httrack-scraper
# Then visit http://localhost:8080
```
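If you'd rather free port 7860 than move to another one, first see what's holding it:

```bash
# Identify the process bound to port 7860 (Linux/macOS)
lsof -i :7860

# or, on Linux:
ss -ltnp | grep 7860
```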
### HTTrack Not Found (Local Installation)
```bash
# Install it first
sudo apt-get update
sudo apt-get install httrack
```
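You can confirm the binary is on your `PATH` afterwards; if this prints nothing, the app won't find it either:

```bash
# Should print something like /usr/bin/httrack
which httrack
```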
### Container Won't Start
```bash
# Check the logs
docker logs httrack-scraper

# Try rebuilding
docker stop httrack-scraper
docker rm httrack-scraper
docker build -t httrack-scraper .
docker run -d -p 7860:7860 --name httrack-scraper httrack-scraper
```
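If the logs are empty, the container's exit code often narrows things down:

```bash
# Show the container's state and exit code
docker ps -a --filter "name=httrack-scraper"
docker inspect --format '{{.State.ExitCode}}' httrack-scraper
```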
## 📚 Next Steps

- Read the full [README.md](README.md) for detailed documentation
- Experiment with different depth settings
- Try scraping progressively larger sites
- Check out the logs to understand what's happening
## 💡 Tips for Better Results

1. **Start Small**: Test with depth 1-2 first
2. **Increase Gradually**: Only go deeper if needed
3. **Monitor Logs**: Watch for errors or issues
4. **Be Patient**: Large sites take time
5. **Check Size**: Monitor disk space for big scrapes (see the snippet below)
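A quick way to keep an eye on tip 5 while a scrape runs; the `output/` directory name is an assumption, so point it at wherever your downloads actually land:

```bash
# Re-check the scrape's size and free disk space every 30 seconds
watch -n 30 'du -sh output/ && df -h .'
```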
## 📖 Learning Resources

- HTTrack Documentation: https://www.httrack.com/html/
- Gradio Documentation: https://www.gradio.app/docs/
- Docker Documentation: https://docs.docker.com/

---

**Need Help?** Check the troubleshooting section in the main README or review the logs for error messages.

**Ready to scrape?** Run `./deploy.sh deploy` and start downloading! 🚀