# Timeout vs Memory Diagnostic Tools
## Overview
When working with heavy models in HF Spaces, you may hit failures caused by:

- **Timeout**: the model takes too long to load (>5 minutes)
- **Memory**: the system runs out of RAM
- **Both**: a combination of the two

This toolkit helps you identify and fix the exact problem.
## Files Added
### 1. `diagnostic_tool.py`

**Purpose:** identify whether the problem is timeout or memory.

**Usage:**

```bash
python hf-spaces/diagnostic_tool.py
```
**What it does:**

- Monitors system memory in real time
- Tracks model loading time
- Detects the exact failure point
- Provides specific recommendations
**Output:**

```text
MODEL LOADING DIAGNOSTIC: meta-llama/Llama-3.2-1B

INITIAL SYSTEM STATE:
- Available memory: 12.50 GB
- Used memory: 3.45 GB (21.6%)

⏳ Starting model loading (timeout: 300s)...

[1/2] Loading tokenizer...
✅ Tokenizer loaded in 2.31s

[2/2] Loading model...
✅ Model loaded in 45.67s

✅ LOADING SUCCESSFUL in 47.98s

💡 RECOMMENDATIONS
✅ Model loaded successfully.
```
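If you are curious how such a check works, the core idea is a memory snapshot followed by a timed load. Here is a minimal sketch using `psutil` (illustrative only; this is not the actual `diagnostic_tool.py` code):

```python
# Minimal sketch of a timeout-vs-memory check (illustrative; not the
# actual diagnostic_tool.py implementation).
import time

import psutil
from transformers import AutoModel, AutoTokenizer

def diagnose(model_name: str, timeout_s: int = 300) -> str:
    mem = psutil.virtual_memory()
    print(f"Available memory: {mem.available / 1e9:.2f} GB ({mem.percent}% used)")

    start = time.time()
    try:
        AutoTokenizer.from_pretrained(model_name)
        AutoModel.from_pretrained(model_name, low_cpu_mem_usage=True)
    except MemoryError:
        # Note: a hard OOM usually kills the process before this is raised,
        # which is why a crash without a clear error points to memory.
        return "MEMORY_ERROR"

    elapsed = time.time() - start
    if elapsed > timeout_s:
        return "TIMEOUT_ERROR"  # loaded, but slower than the request timeout
    print(f"✅ Loaded in {elapsed:.2f}s")
    return "SUCCESS"
```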
### 2. `config_optimized.py`

**Purpose:** smart configuration based on model size.

**Features:**

- Auto-detects the model size category (small/medium/large)
- Provides optimized timeout settings
- Recommends an appropriate HF Spaces tier
- Warns about memory issues before loading
**Usage:**

```python
import requests

from config_optimized import HFSpacesConfig, get_optimized_request_config

# Get the optimal timeout for a model
timeout = HFSpacesConfig.get_timeout_for_model("meta-llama/Llama-3.2-1B")

# Get a full request config
config = get_optimized_request_config("meta-llama/Llama-3.2-1B")
response = requests.post(url, json=payload, **config)

# Check if the model is recommended for your tier
is_ok = HFSpacesConfig.is_model_recommended("meta-llama/Llama-3.2-1B", tier="free")
```
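The module's internals are not shown here, but the core idea can be replicated with a simple size-to-timeout mapping. A hypothetical sketch (the bucket thresholds and timeout values below are assumptions, not the module's actual numbers):

```python
# Hypothetical sketch of the size-to-timeout idea behind config_optimized.py;
# thresholds and timeouts are illustrative assumptions.
import re

TIMEOUTS = {"small": 120, "medium": 300, "large": 600}  # seconds (assumed)

def size_category(model_name: str) -> str:
    """Guess the size bucket from the parameter count in the name, e.g. '1B'."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*[bB]", model_name)
    params_b = float(match.group(1)) if match else 1.0
    if params_b <= 2:
        return "small"
    if params_b <= 4:
        return "medium"
    return "large"

def get_timeout(model_name: str) -> int:
    return TIMEOUTS[size_category(model_name)]

print(get_timeout("meta-llama/Llama-3.2-1B"))  # 120
print(get_timeout("meta-llama/Llama-3-8B"))    # 600
```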
### 3. `DIAGNOSTIC_README.md`

**Purpose:** complete guide with solutions.

**Contents:**

- How to identify timeout vs memory issues
- Step-by-step solutions for each problem
- Model size comparison table
- Code examples for fixes
- Best practices
### 4. Improved error messages in `optipfair_frontend.py`

**What changed:**

- More informative timeout error messages
- Explicit memory error detection
- Actionable recommendations in errors
- All messages in English
**Example:**

```text
❌ **Timeout Error:**
The request exceeded 5 minutes (300s).

**Possible causes:**
1. The model is very large and takes long to load
2. The server is processing many requests

**Solutions:**
• Use a smaller model (1B parameters)
• Wait and try again (the model may be caching)
• If it persists, run `diagnostic_tool.py` for more information
```
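As a rough illustration, error handling of this kind can be built around the exception types that `requests` raises; a simplified sketch (not the actual `optipfair_frontend.py` code):

```python
# Simplified sketch of timeout-aware error handling (illustrative;
# not the actual optipfair_frontend.py code).
import requests

def call_backend(url: str, payload: dict, timeout_s: int = 300) -> str:
    try:
        response = requests.post(url, json=payload, timeout=timeout_s)
        response.raise_for_status()
        return response.text
    except requests.exceptions.Timeout:
        return (
            f"❌ **Timeout Error:** the request exceeded {timeout_s}s.\n"
            "• Use a smaller model (1B parameters)\n"
            "• Wait and try again (the model may be caching)\n"
            "• If it persists, run `diagnostic_tool.py`"
        )
    except requests.exceptions.HTTPError as e:
        # A 5xx right after a long load often means the backend was OOM-killed.
        return f"❌ **Server Error:** {e}. If this follows a crash, suspect memory."
```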
## Quick Start Guide

### Step 1: Diagnose the Problem

```bash
cd hf-spaces
python diagnostic_tool.py
```
### Step 2: Read the Output

The tool will tell you:

- ✅ Success: the model loads fine
- ❌ MEMORY_ERROR: you need more RAM or a smaller model
- ⏰ TIMEOUT_ERROR: you need more time or a faster model
### Step 3: Apply the Solution

**For TIMEOUT problems:**

```python
import requests

# Option 1: Increase the timeout in optipfair_frontend.py
response = requests.post(
    url,
    json=payload,
    timeout=600,  # change from 300 to 600 seconds
)

# Option 2: Use config_optimized.py
from config_optimized import get_optimized_request_config

config = get_optimized_request_config(model_name)
response = requests.post(url, json=payload, **config)
```
**For MEMORY problems:**

```python
# Option 1: Use a smaller model
AVAILABLE_MODELS = [
    "meta-llama/Llama-3.2-1B",       # ✅ Works on free tier
    "oopere/pruned40-llama-3.2-1B",  # ✅ Works on free tier
]

# Option 2: Use quantization (in the backend)
from transformers import AutoModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
)

# Option 3: Upgrade your HF Spaces tier
# Free: 16GB RAM → PRO: 32GB RAM → Enterprise: 64GB RAM
```
## Model Recommendations by Tier

### Free Tier (16GB RAM)

✅ Recommended:

- meta-llama/Llama-3.2-1B (~4 GB, ~30s load)
- oopere/pruned40-llama-3.2-1B (~4 GB, ~30s load)
- google/gemma-3-1b-pt (~4 GB, ~30s load)
- Qwen/Qwen3-1.7B (~6 GB, ~45s load)

⚠️ May work with optimization:

- meta-llama/Llama-3.2-3B (~12 GB, ~90s load)

❌ Won't work:

- meta-llama/Llama-3-8B (~32 GB)
- meta-llama/Llama-3-70B (~280 GB)
### PRO Tier (32GB RAM)

✅ Additional models:

- meta-llama/Llama-3.2-3B
- meta-llama/Llama-3-8B (with quantization)

### Enterprise Tier (64GB RAM)

✅ Additional models:

- meta-llama/Llama-3-8B (full precision)
- Larger models with quantization
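The RAM figures above follow a simple rule of thumb: parameter count × bytes per parameter (~4 bytes in fp32, ~2 in fp16, ~1 with int8 quantization), plus runtime overhead. For example, 1B params × 4 bytes ≈ 4 GB, and 70B × 4 ≈ 280 GB. A quick back-of-the-envelope helper:

```python
# Back-of-the-envelope RAM estimate: params × bytes per parameter.
# Overhead (activations, KV cache, framework buffers) comes on top.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimated_ram_gb(params_billion: float, dtype: str = "fp32") -> float:
    return params_billion * BYTES_PER_PARAM[dtype]

print(estimated_ram_gb(1))          # ≈ 4 GB  -> fits the free tier
print(estimated_ram_gb(8))          # ≈ 32 GB -> needs PRO + quantization
print(estimated_ram_gb(8, "int8"))  # ≈ 8 GB
```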
## Common Scenarios

### Scenario 1: "My model times out after 5 minutes"

**Diagnosis:** TIMEOUT_ERROR

**Solution:**

- Check whether the model is too large for your tier
- Increase the timeout to 600s (10 minutes)
- Consider pre-loading models at startup (see the Advanced section below)
### Scenario 2: "Process crashes without a clear error"

**Diagnosis:** likely MEMORY_ERROR (an out-of-memory kill terminates the process)

**Solution:**

- Run `diagnostic_tool.py` to confirm
- Use a smaller model (1B parameters)
- Use int8 quantization
- Upgrade to the PRO tier
### Scenario 3: "Sometimes works, sometimes doesn't"

**Diagnosis:** memory pressure or concurrent requests

**Solution:**

- Implement model caching (see the sketch below)
- Add memory monitoring
- Use a smaller default model
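A minimal sketch of the caching-plus-monitoring idea (the threshold and names here are illustrative, not from the project):

```python
# Minimal sketch of model caching plus a memory guard (illustrative).
import psutil
from transformers import AutoModel

MODEL_CACHE: dict = {}
MAX_USED_PERCENT = 85  # illustrative threshold; tune for your Space

def get_model(model_name: str):
    """Return a cached model, refusing to load when memory is already tight."""
    if model_name in MODEL_CACHE:
        return MODEL_CACHE[model_name]
    if psutil.virtual_memory().percent > MAX_USED_PERCENT:
        raise RuntimeError(
            f"Memory above {MAX_USED_PERCENT}%; refusing to load {model_name}"
        )
    MODEL_CACHE[model_name] = AutoModel.from_pretrained(
        model_name, low_cpu_mem_usage=True
    )
    return MODEL_CACHE[model_name]
```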
## Advanced: Pre-loading Models

To avoid a timeout on the first request, pre-load models at startup:

```python
# In hf-spaces/app.py
from transformers import AutoModel, AutoTokenizer

MODEL_CACHE = {}

def preload_models():
    """Pre-load common models at startup."""
    models = ["meta-llama/Llama-3.2-1B"]
    for model_name in models:
        try:
            print(f"Pre-loading {model_name}...")
            MODEL_CACHE[model_name] = {
                "model": AutoModel.from_pretrained(
                    model_name,
                    low_cpu_mem_usage=True,
                ),
                "tokenizer": AutoTokenizer.from_pretrained(model_name),
            }
            print(f"✅ {model_name} ready")
        except Exception as e:
            print(f"❌ Could not pre-load {model_name}: {e}")

def main():
    preload_models()  # load models before starting services
    # ... rest of startup code
```
## Support

If you still have issues after trying these solutions:

- Check the full diagnostic output
- Review the HF Spaces logs
- Verify your HF Spaces tier and limits
- Consider using a different model architecture
## Summary
| Issue | Symptom | Solution |
|---|---|---|
| Timeout | Request > 5 min | Increase timeout, use cache |
| Memory | Process crashes/kills | Smaller model, quantization, upgrade tier |
| Both | Slow + crashes | Smaller model + longer timeout |
All tools are designed to help you quickly identify and fix the exact problem without guessing.