# Troubleshooting TRL Training Jobs

Common issues and solutions when training with TRL on Hugging Face Jobs.

## Training Hangs at "Starting training..." Step

**Problem:** Job starts but hangs at the training step - never progresses, never times out, just sits there.

**Root Cause:** Using `eval_strategy="steps"` or `eval_strategy="epoch"` without providing an `eval_dataset` to the trainer.

**Solution:**

**Option A: Provide eval_dataset (recommended)**
```python
# Create train/eval split
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
    args=SFTConfig(
        eval_strategy="steps",
        eval_steps=50,
        ...
    ),
)
```

**Option B: Disable evaluation**
```python
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset
    args=SFTConfig(
        eval_strategy="no",  # ← Explicitly disable
        ...
    ),
)
```

**Prevention:**
- Always create train/eval split for better monitoring
- Use `dataset.train_test_split(test_size=0.1, seed=42)`
- Check example scripts: `scripts/train_sft_example.py` includes proper eval setup

## Job Times Out

**Problem:** Job terminates before training completes and all progress is lost.

**Solutions:**
- Increase the `timeout` parameter (e.g., `"timeout": "4h"`); see the sketch below
- Reduce `num_train_epochs` or use a smaller dataset slice
- Use a smaller model or enable LoRA/PEFT to speed up training
- Add a 20-30% buffer to the estimated time for loading/saving overhead
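
A resubmission with a longer timeout might look like this (a minimal sketch following the `hf_jobs` pattern used elsewhere in this guide; the script URL and the `flavor` key are placeholders/assumptions):

```python
hf_jobs("uv", {
    "script": "https://huggingface.co/datasets/username/scripts/raw/main/train.py",  # placeholder URL
    "flavor": "a10g-large",                # hardware flavor (assumed key name)
    "timeout": "4h",                       # hard cap on wall-clock time
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
})
```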

**Prevention:**
- Always start with a quick demo run to estimate timing
- Use `scripts/estimate_cost.py` to get time estimates
- Monitor first runs closely via Trackio or logs

## Model Not Saved to Hub

**Problem:** Training completes but model doesn't appear on Hub - all work lost.

**Check:**
- [ ] `push_to_hub=True` in training config
- [ ] `hub_model_id` specified with username (e.g., `"username/model-name"`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job submission
- [ ] User has write access to target repo
- [ ] Token has write permissions (check at https://huggingface.co/settings/tokens)
- [ ] Training script calls `trainer.push_to_hub()` at the end

**See:** `references/hub_saving.md` for detailed Hub authentication troubleshooting
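
A setup that satisfies the checklist might look like this (a sketch; the model and repo names are placeholders):

```python
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="outputs",
    push_to_hub=True,                    # enable Hub upload
    hub_model_id="username/model-name",  # placeholder: include your username or org
)

trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B", train_dataset=dataset, args=args)
trainer.train()
trainer.push_to_hub()  # explicit final push at the end of the script
```

The job submission itself must still pass `secrets={"HF_TOKEN": "$HF_TOKEN"}` so the push is authenticated inside the job.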

## Out of Memory (OOM)

**Problem:** Job fails with CUDA out of memory error.

**Solutions (in order of preference; a combined sketch follows this list):**
1. **Reduce batch size:** Lower `per_device_train_batch_size` (try 4 → 2 → 1)
2. **Increase gradient accumulation:** Raise `gradient_accumulation_steps` to maintain the effective batch size
3. **Disable evaluation:** Remove `eval_dataset` and `eval_strategy` (saves ~40% memory, good for demos)
4. **Enable LoRA/PEFT:** Use `peft_config=LoraConfig(r=8, lora_alpha=16)` to train adapters only (smaller rank = less memory)
5. **Use a larger GPU:** Switch from `t4-small` → `l4x1` → `a10g-large` → `a100-large`
6. **Enable gradient checkpointing:** Set `gradient_checkpointing=True` in the config (slower but saves memory)
7. **Use a smaller model:** Try a smaller variant (e.g., 0.5B instead of 3B)
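
Several of these mitigations compose. A minimal sketch combining solutions 1, 2, 3, 4, and 6 (hyperparameter values are illustrative):

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=1,   # smallest batch (solution 1)
    gradient_accumulation_steps=8,   # keep effective batch size at 8 (solution 2)
    eval_strategy="no",              # skip evaluation to save memory (solution 3)
    gradient_checkpointing=True,     # trade compute for memory (solution 6)
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=args,
    peft_config=LoraConfig(r=8, lora_alpha=16),  # train adapters only (solution 4)
)
```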

**Memory guidelines:**
- T4 (16GB): <1B models with LoRA
- A10G (24GB): 1-3B models with LoRA, <1B full fine-tune
- A100 (40GB/80GB): 7B+ models with LoRA, 3B full fine-tune

## Parameter Naming Issues

**Problem:** `TypeError: SFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'`

**Cause:** TRL config classes use `max_length`, not `max_seq_length`.

**Solution:**
```python
# ✅ CORRECT - TRL uses max_length
SFTConfig(max_length=512)
DPOConfig(max_length=512)

# ❌ WRONG - This will fail
SFTConfig(max_seq_length=512)
```

**Note:** Most TRL configs don't require an explicit `max_length` - the default (1024) works well. Only set it if you need a specific value.

## Dataset Format Error

**Problem:** Training fails with dataset format errors or missing fields.

**Solutions:**
1. **Check format documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
   ```

2. **Validate dataset before training:**
   ```bash
   uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
     --dataset <dataset-name> --split train
   ```
   Or via hf_jobs:
   ```python
   hf_jobs("uv", {
       "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
       "script_args": ["--dataset", "dataset-name", "--split", "train"]
   })
   ```

3. **Verify field names** (minimal example records follow this list):
   - **SFT:** Needs "messages" field (conversational), OR "text" field, OR "prompt"/"completion"
   - **DPO:** Needs "chosen" and "rejected" fields
   - **GRPO:** Needs prompt-only format

4. **Check dataset split:**
   - Ensure split exists (e.g., `split="train"`)
   - Preview dataset: `load_dataset("name", split="train[:5]")`
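
For reference, a single record in each format might look like this (toy examples written to match the TRL dataset-format docs linked above):

```python
# SFT, conversational: a "messages" list of chat turns
sft_conversational = {"messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]}

# SFT, plain text: a single "text" field
sft_text = {"text": "The quick brown fox jumps over the lazy dog."}

# DPO: a prompt with a preferred and a rejected response
dpo_record = {
    "prompt": "What is 2 + 2?",
    "chosen": "4",
    "rejected": "5",
}

# GRPO: prompt-only; completions are generated during training
grpo_record = {"prompt": "What is 2 + 2?"}
```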

## Import/Module Errors

**Problem:** Job fails with "ModuleNotFoundError" or import errors.

**Solutions:**
1. **Add PEP 723 header with dependencies:**
   ```python
   # /// script
   # dependencies = [
   #     "trl>=0.12.0",
   #     "peft>=0.7.0",
   #     "transformers>=4.36.0",
   # ]
   # ///
   ```

2. **Verify exact format:**
   - Must have `# ///` delimiters (with space after `#`)
   - Dependencies must be valid PyPI package names
   - Check spelling and version constraints

3. **Test locally first:**
   ```bash
   uv run train.py  # Tests if dependencies are correct
   ```

## Authentication Errors

**Problem:** Job fails with authentication or permission errors when pushing to Hub.

**Solutions:**
1. **Verify authentication:**
   ```python
   mcp__huggingface__hf_whoami()  # Check who's authenticated
   ```

2. **Check token permissions:**
   - Go to https://huggingface.co/settings/tokens
   - Ensure token has "write" permission
   - Token must not be "read-only"

3. **Verify token in job:**
   ```python
   "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Must be in job config
   ```

4. **Check repo permissions** (a quick local check follows this list):
   - User must have write access to target repo
   - If org repo, user must be member with write access
   - Repo must exist or user must have permission to create
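
The same checks can be run from a local Python session (a sketch using `huggingface_hub`; the repo id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment or your cached login

# Confirm the expected account is authenticated
print(api.whoami()["name"])

# Creating the target repo (a no-op if it already exists) fails fast
# when the token lacks write permission
api.create_repo("username/model-name", exist_ok=True)
```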

## Job Stuck or Not Starting

**Problem:** Job shows "pending" or "starting" for an extended period.

**Solutions:**
- Check Jobs dashboard for status: https://huggingface.co/jobs
- Verify hardware availability (some GPU types may have queues)
- Try different hardware flavor if one is heavily utilized
- Check for account billing issues (Jobs requires paid plan)

**Typical startup times:**
- CPU jobs: 10-30 seconds
- GPU jobs: 30-90 seconds
- If >3 minutes: likely queued or stuck

## Training Loss Not Decreasing

**Problem:** Training runs but loss stays flat or doesn't improve.

**Solutions:**
1. **Check learning rate:** May be too low (try 2e-5 to 5e-5) or too high (drop toward 1e-6); see the sketch after this list
2. **Verify dataset quality:** Inspect examples to ensure they're reasonable
3. **Check model size:** Very small models may lack the capacity for the task
4. **Increase training steps:** May need more epochs or a larger dataset
5. **Verify dataset format:** A wrong format may cause degraded training
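
When sweeping the learning rate (item 1), a starting configuration might look like this (a minimal sketch; values besides the learning-rate range are illustrative):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="outputs",
    learning_rate=2e-5,   # common starting point; sweep roughly 1e-6 to 5e-5
    warmup_ratio=0.1,     # assumption: a short warmup often stabilizes early loss
    logging_steps=10,     # log often so a flat loss curve shows up early
)
```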

## Logs Not Appearing

**Problem:** Cannot see training logs or progress.

**Solutions:**
1. **Wait 30-60 seconds:** Initial logs can be delayed
2. **Check logs via MCP tool:**
   ```python
   hf_jobs("logs", {"job_id": "your-job-id"})
   ```
3. **Use Trackio for real-time monitoring:** See `references/trackio_guide.md`
4. **Verify job is actually running:**
   ```python
   hf_jobs("inspect", {"job_id": "your-job-id"})
   ```

## Checkpoint/Resume Issues

**Problem:** Cannot resume from checkpoint or checkpoint not saved.

**Solutions:**
1. **Enable checkpoint saving:**
   ```python
   SFTConfig(
       save_strategy="steps",
       save_steps=100,
       hub_strategy="every_save",  # Push each checkpoint
   )
   ```

2. **Verify checkpoints pushed to Hub:** Check model repo for checkpoint folders
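
   One way to check from Python (a sketch; the repo id is a placeholder):
   ```python
   from huggingface_hub import list_repo_files

   # Checkpoint folders appear as "checkpoint-<step>/..." paths in the repo
   files = list_repo_files("username/model-name")
   print([f for f in files if f.startswith("checkpoint-")])
   ```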

3. **Resume from checkpoint:** Pass `resume_from_checkpoint` to `trainer.train()`, not to the trainer constructor:
   ```python
   trainer = SFTTrainer(
       model="Qwen/Qwen2.5-0.5B",
       train_dataset=dataset,
       args=SFTConfig(output_dir="outputs", save_strategy="steps", save_steps=100),
   )
   # Resume from the latest checkpoint in output_dir...
   trainer.train(resume_from_checkpoint=True)
   # ...or from a specific local checkpoint directory:
   # trainer.train(resume_from_checkpoint="outputs/checkpoint-1000")
   ```

## Getting Help

If issues persist:

1. **Check TRL documentation:**
   ```python
   hf_doc_search("your issue", product="trl")
   ```

2. **Check Jobs documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/huggingface_hub/guides/jobs")
   ```

3. **Review related guides:**
   - `references/hub_saving.md` - Hub authentication issues
   - `references/hardware_guide.md` - Hardware selection and specs
   - `references/training_patterns.md` - Eval dataset requirements
   - SKILL.md "Working with Scripts" section - Script format and URL issues

4. **Ask in HF forums:** https://discuss.huggingface.co/