---
license: mit
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- gpt2
- historical
- london
- text-generation
- history
- english
- safetensors
- large-language-model
- llm
---

# London Historical LLM

A custom GPT-2 model **trained from scratch** on historical London texts from 1500-1850. It runs quickly on CPU and supports NVIDIA (CUDA) and AMD (ROCm) GPUs.

> **Note**: This model was **trained from scratch**, not fine-tuned from an existing model.

> This page includes simple **virtual-env setup**, **install choices for CPU/CUDA/ROCm**, and an **auto-device inference** example so anyone can get going quickly.

---

## 🔎 Model Description

This is a **regular language model** (the full-size counterpart to the SLM version) built from scratch on the GPT-2 architecture and trained on a comprehensive collection of historical London documents spanning 1500-1850, including:
- Parliamentary records and debates
- Historical newspapers and journals
- Literary works and correspondence
- Government documents and reports
- Personal letters and diaries

### Key Features
- **~354M parameters** (vs. ~117M in the SLM version)
- **Custom historical tokenizer** (~30k vocab) with London-specific tokens (see the quick check after this list)
- **London-specific context awareness** and historical language patterns
- **Trained from scratch**, not fine-tuned from existing models
- **Optimized for historical text generation** (1500-1850)

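You can verify the custom tokenizer after downloading it; the vocabulary size should come out at roughly the ~30k noted above:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bahree/london-historical-llm")
print("vocab size:", len(tok))  # expect roughly 30,000 for the custom historical tokenizer
```
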
---

## 🧪 Intended Use & Limitations

**Use cases:** historical-style narrative generation, prompt-based exploration of London themes (1500-1850), creative writing aids.
**Limitations:** may produce anachronisms or historically inaccurate statements; aggressive sampling parameters may produce gibberish given the historical nature of the training data. Validate outputs before downstream use.

---

## 🐍 Set up a virtual environment (Linux/macOS/Windows)

> Virtual environments isolate project dependencies. Official Python docs: `venv`.

**Check Python & pip**

```bash
# Linux/macOS
python3 --version && python3 -m pip --version
```

```powershell
# Windows (PowerShell)
python --version; python -m pip --version
```

**Create the env**

```bash
# Linux/macOS
python3 -m venv helloLondon
```

```powershell
# Windows (PowerShell)
python -m venv helloLondon
```

```cmd
:: Windows (Command Prompt)
python -m venv helloLondon
```

> **Note**: You can name your virtual environment anything you like, e.g., `.venv`, `my_env`, `london_env`.

**Activate**

```bash
# Linux/macOS
source helloLondon/bin/activate
```

```powershell
# Windows (PowerShell)
.\helloLondon\Scripts\Activate.ps1
```

```cmd
:: Windows (CMD)
helloLondon\Scripts\activate.bat
```

> If PowerShell blocks activation (*"running scripts is disabled"*), set the policy, then retry activation:

```powershell
Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned
# or just for this session:
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
```

---

## 📦 Install libraries

Upgrade the basics, then install the Hugging Face libraries:

```bash
python -m pip install -U pip setuptools wheel
python -m pip install transformers accelerate safetensors
```
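
A quick import check confirms the installs (the printed version is just informational, not a pinned requirement):

```bash
python -c "import transformers, accelerate, safetensors; print('transformers', transformers.__version__)"
```
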
---

## ⚙️ Install **one** PyTorch variant (CPU / NVIDIA / AMD)

Use **one** of the commands below. For the most accurate command per OS/accelerator and version, prefer PyTorch's **Get Started** selector.

### A) CPU-only (Linux/Windows/macOS)

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

### B) NVIDIA GPU (CUDA)

Pick the CUDA series that matches your system (examples below):

```bash
# CUDA 12.6
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# CUDA 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
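
If you're unsure which CUDA series your driver supports, `nvidia-smi` prints the driver version and the highest compatible CUDA runtime in its header:

```bash
nvidia-smi
```
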
### C) AMD GPU (ROCm, **Linux-only**)

Install the ROCm build matching your ROCm runtime (examples):

```bash
# ROCm 6.3
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

# ROCm 6.2 (incl. 6.2.x)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4

# ROCm 6.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1
```

**Quick sanity check**

```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
PY
```
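
ROCm builds reuse PyTorch's `cuda` device API, so `torch.cuda.is_available()` returns `True` on a working ROCm install. To tell the two apart, `torch.version.hip` is a version string on ROCm builds and `None` on CUDA builds:

```bash
python -c "import torch; print('hip:', torch.version.hip)"
```
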
---

## 🚀 Inference (auto-detect device)

This snippet picks the best device (CUDA or ROCm if available, else CPU) and uses sensible generation defaults for this model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "bahree/london-historical-llm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# ROCm builds also report as "cuda" here
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,  # passes input_ids and attention_mask
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
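
Sampling is stochastic, so each run differs. For repeatable output while testing, you can seed the relevant RNGs before calling `generate`; `set_seed` is part of the public `transformers` API:

```python
from transformers import set_seed

set_seed(42)  # same seed + same environment -> same sampled continuation
```
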
## 📖 **Sample Output**

**Prompt:** "In the year 1834, I walked through the streets of London and witnessed"

**Generated Text:**
> "In the year 1834, I walked through the streets of London and witnessed a scene in which some of those who had no inclination to come in contact with him took part in his discourse. It was on this occasion that I perceived that he had been engaged in some new business connected with the house, but for some days it had not taken place, nor did he appear so desirous of pursuing any further display of interest. The result was, however, that if he came in contact witli any one else in company with him he must be regarded as an old acquaintance or companion, and when he came to the point of leaving, I had no leisure to take up his abode. The same evening, having ram ##bled about the streets, I observed that the young man who had just arrived from a neighbouring village at the time, was enjoying himself at a certain hour, and I thought that he would sleep quietly until morning, when he said in a low voice — " You are coming. Miss — I have come from the West Indies . " Then my father bade me go into the shop, and bid me put on his spectacles, which he had in his hand; but he replied no: the room was empty, and he did not want to see what had passed. When I asked him the cause of all this conversation, he answered in the affirmative, and turned away, saying that as soon as the lad could recover, the sight of him might be renewed. " Well, Mr. , " said I, " you have got a little more of your wages, do you ? " " No, sir, thank ' ee kindly, " returned the boy, " but we don ' t want to pay the poor rates . We"

**Notice how the model captures:**
- **Period-appropriate language** ("thank 'ee kindly," "bade me go," "spectacles")
- **Historical dialogue patterns** (formal speech, period-appropriate contractions)
- **Historical context** (the West Indies, the poor rates)
- **Authentic historical narrative** (detailed scene-setting, period-appropriate social interactions)

## 🧪 **Testing Your Model**

### **Quick Testing (10 Automated Prompts)**
```bash
# Test with 10 automated historical prompts
python 06_inference/test_published_models.py --model_type regular
```

**Expected Output:**
```
🧪 Testing Regular Model: bahree/london-historical-llm
============================================================
📂 Loading model...
✅ Model loaded in 12.5 seconds
📊 Model Info:
   Type: REGULAR
   Description: Regular Language Model (354M parameters)
   Device: cuda
   Vocabulary size: 30,000
   Max length: 1024

🎯 Testing generation with 10 prompts...
[10 automated tests with historical text generation]
```

### **Interactive Testing**
```bash
# Interactive mode for custom prompts
python 06_inference/inference_unified.py --published --model_type regular --interactive

# Single prompt test
python 06_inference/inference_unified.py --published --model_type regular --prompt "In the year 1834, I walked through the streets of London and witnessed"
```

**Need more headroom later?** Load with 🤗 Accelerate and `device_map="auto"` to spread layers across available devices and CPU automatically.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "bahree/london-historical-llm"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```

---

## 🪟 Windows Terminal one-liners

**PowerShell**

```powershell
python -c "from transformers import AutoTokenizer,AutoModelForCausalLM; m='bahree/london-historical-llm'; t=AutoTokenizer.from_pretrained(m); model=AutoModelForCausalLM.from_pretrained(m); p='Today I walked through the streets of London and witnessed'; i=t(p,return_tensors='pt'); print(t.decode(model.generate(i['input_ids'],max_new_tokens=50,do_sample=True)[0],skip_special_tokens=True))"
```

**Command Prompt (CMD)**

```cmd
python -c "from transformers import AutoTokenizer,AutoModelForCausalLM; m='bahree/london-historical-llm'; t=AutoTokenizer.from_pretrained(m); model=AutoModelForCausalLM.from_pretrained(m); p='Today I walked through the streets of London and witnessed'; i=t(p,return_tensors='pt'); print(t.decode(model.generate(i['input_ids'],max_new_tokens=50,do_sample=True)[0],skip_special_tokens=True))"
```

---

## 💡 Basic Usage (Python)

⚠️ **Important**: This model works best with **greedy decoding** for historical text generation; complex sampling parameters may produce gibberish due to the historical nature of the training data. The example below uses conservative sampling; a greedy variant follows it.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-llm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-llm")

# GPT-2-style models have no pad token by default; reuse EOS
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "Today I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=30,
    repetition_penalty=1.25,
    no_repeat_ngram_size=4,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
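
If sampling degrades into gibberish, here is a minimal greedy variant, reusing `tokenizer`, `model`, and `inputs` from above:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=False,          # greedy decoding
    repetition_penalty=1.3,   # mild guard against loops, common with greedy GPT-2
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
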
---

## 🧰 Example Prompts

* **Tudor (1558):** "On this day in 1558, Queen Mary has died and …"
* **Stuart (1666):** "The Great Fire of London has consumed much of the city, and …"
* **Georgian/Victorian:** "As I journeyed through the streets of London, I observed …"
* **London specifics:** "Parliament sat in Westminster Hall …", "The Thames flowed dark and mysterious …"
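
A small sketch to run several of these prompts in one pass, reusing `tokenizer` and `model` from the Basic Usage example (drop the trailing "…", which only marks where the model continues):

```python
prompts = [
    "On this day in 1558, Queen Mary has died and",
    "The Great Fire of London has consumed much of the city, and",
    "As I journeyed through the streets of London, I observed",
]
for p in prompts:
    enc = tokenizer(p, return_tensors="pt")
    out = model.generate(**enc, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.pad_token_id)
    print(tokenizer.decode(out[0], skip_special_tokens=True), end="\n\n")
```
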
---

## 🛠️ Training Details

* **Architecture:** Custom GPT-2 (built from scratch)
* **Parameters:** ~354M
* **Tokenizer:** Custom historical tokenizer (~30k vocab) with London-specific and historical tokens
* **Data:** Historical London corpus (1500-1850) with proper segmentation
* **Steps:** 60,000+ (extended training for better convergence)
* **Final Training Loss:** ~2.78
* **Final Validation Loss:** ~3.62
* **Training Time:** ~13 hours
* **Hardware:** 2× GPUs with Distributed Data Parallel
* **Training Method:** **Trained from scratch**, not fine-tuned
* **Context Length:** 1024 tokens (optimized for historical text segments)
* **Status:** ✅ **Published and tested**, ready for production use

---

## ⚠️ Troubleshooting

* **`ImportError: AutoModelForCausalLM requires the PyTorch library`**
  → Install PyTorch with the correct accelerator variant (see CPU/CUDA/ROCm above, or use the official selector).

* **AMD GPU not used**
  → Ensure you installed a ROCm build and are on Linux (`pip install ... --index-url https://download.pytorch.org/whl/rocmX.Y`). Verify with `torch.cuda.is_available()` and check the device name. ROCm wheels are Linux-only.

* **Running out of VRAM**
  → Try smaller batch sizes or sequence lengths, load in half precision (see the sketch after this list), or load with `device_map="auto"` via 🤗 Accelerate to offload layers to CPU/disk.

* **Gibberish output with historical text**
  → Use greedy decoding (`do_sample=False`) and avoid complex sampling parameters. This model works best with simple generation settings due to the historical nature of the training data.
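
A minimal half-precision load for the VRAM case (assumes a GPU with fp16 support; roughly halves weight memory for this ~354M model):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bahree/london-historical-llm",
    torch_dtype=torch.float16,  # half-precision weights
    device_map="auto",          # let 🤗 Accelerate place layers
)
```
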
---

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{london-historical-llm,
  title  = {London Historical LLM: A Custom GPT-2 for Historical Text Generation},
  author = {Amit Bahree},
  year   = {2025},
  url    = {https://huggingface.co/bahree/london-historical-llm}
}
```

---

## Repository

The complete source code, training scripts, and documentation for this model are available on GitHub:

**🔗 [https://github.com/bahree/helloLondon](https://github.com/bahree/helloLondon)**

This repository includes:
- Complete data collection pipeline for 1500-1850 historical English
- Custom tokenizer optimized for historical text
- Training infrastructure with GPU optimization
- Evaluation and deployment tools
- Comprehensive documentation and examples

### Quick Start with Repository
```bash
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python 06_inference/test_published_models.py --model_type regular
```

---

## 🧾 License

MIT (see [LICENSE](https://github.com/bahree/helloLondon/blob/main/LICENSE) in the repo).