---
library_name: transformers
language: ["khm"]
license: mit
tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"]
---

# πŸ‡°πŸ‡­ KM Improved 32K Tokenizer

The **KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage
across diverse domains including technical, cultural, historical, and academic texts.
It aims to reduce subword fragmentation and improve contextual understanding for large-scale
Khmer and multilingual language models.

---

## 🧠 Model Details

### Model Description
- **Developer:** Sok Meas (@Msok99)
- **Model Type:** SentencePiece Unigram
- **Language:** Khmer (khm)
- **License:** MIT
- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4)
- **Vocabulary Size:** 32,000
- **Goal:** Maximize coverage and minimize over-segmentation

### Model Sources
- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k)

---

## βš™οΈ Key Features

| Feature | Description |
|----------|-------------|
| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
| **Improved Context Retention** | Keeps compound and rare words intact |
| **Reduced Fragmentation** | Fewer subword splits across long sentences |
| **Perfect Decode Fidelity** | 100% reversible encoding/decoding |
| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |

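The decode-fidelity claim in the table above is easy to spot-check with a round trip. A minimal sketch (the sample sentence is the same one used in the quick-start below; `add_special_tokens=False` keeps any special tokens out of the comparison):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War influenced world politics."
text = "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”"

# Encode without special tokens so the decoded string can match exactly.
ids = tokenizer.encode(text, add_special_tokens=False)
assert tokenizer.decode(ids) == text, "round trip altered the text"
print(f"round trip OK across {len(ids)} tokens")
```
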
---

## πŸ“Š Performance Overview

| Category | Avg Tokens | Chars/Token |
|-----------|-------------|-------------|
| **Formal News** | 13.6 | 4.19 |
| **Technology / Scientific** | 10.8 | 5.32 |
| **Culture / History** | 11.0 | 4.58 |
| **Education / Academic** | 9.4 | 5.44 |
| **Mixed Texts** | 12.2 | 3.86 |
| **Overall Efficiency** | β€” | **β‰ˆ4.0 chars/token** |

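The chars/token figures above are total characters divided by total tokens over a test set. A minimal sketch of that measurement, assuming the same definition (the sample text is a placeholder, not the original benchmark corpus):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Placeholder sample; substitute your own domain texts.
samples = [
    "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
]

total_chars = sum(len(s) for s in samples)
total_tokens = sum(len(tokenizer.tokenize(s)) for s in samples)
print(f"chars/token β‰ˆ {total_chars / total_tokens:.2f}")
```
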
---

## 🧩 Use Cases

### Direct Use
- Pretraining and fine-tuning Khmer LLMs
- Large-scale corpus tokenization for RAG or embedding generation (see the sketch after this list)
- Tokenization for Khmer–English mixed datasets (with limited English words)

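For the corpus-tokenization use case above, documents go through the tokenizer's standard batch `__call__`. A minimal sketch (the `max_length` value and the sample documents are illustrative, not prescribed by this card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War influenced world politics." / "Phnom Penh is the capital of Cambodia."
docs = [
    "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”",
    "αž‘αžΈαž€αŸ’αžšαž»αž„αž—αŸ’αž“αŸ†αž–αŸαž‰αž‡αžΆαžšαžΆαž‡αž’αžΆαž“αžΈαž“αŸƒαž”αŸ’αžšαž‘αŸαžŸαž€αž˜αŸ’αž–αž»αž‡αžΆαŸ”",
]

# Batch-encode for an embedding model or RAG indexer.
batch = tokenizer(docs, truncation=True, max_length=512)
for ids in batch["input_ids"]:
    print(len(ids), "tokens")
```
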
### Downstream Use
- RAG systems and document retrieval
- Knowledge base construction and summarization pipelines
- Academic and research-oriented text analysis

### Out-of-Scope Use
- Mobile or latency-sensitive applications (consider the `18k` or `22k` models)
- Tokenizing purely English text

---

## βš–οΈ Bias, Risks, and Limitations
- A larger vocabulary slightly increases downstream model size (~5–8%, mostly in the embedding layers)
- Some rare or domain-specific words might be underrepresented in informal text
- Heavier memory usage during training and inference

### Recommendations
For smaller models or chatbots prioritizing speed, use
[`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4).
For mixed Khmer–English systems, use
[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).

---

## πŸš€ How to Get Started

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# "The Cold War influenced world politics."
text = "αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”"

tokens = tokenizer.tokenize(text)
print(tokens)

# Round trip: decoding the encoded ids should reproduce the input.
print(tokenizer.decode(tokenizer.encode(text)))
```
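
For pretraining-scale corpora, the same tokenizer plugs into a πŸ€— Datasets `map` pipeline. A hedged sketch: the `datasets` dependency and the tiny in-memory corpus are illustrative assumptions, not part of this card.

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")

# Tiny in-memory stand-in for a real Khmer corpus.
corpus = Dataset.from_dict(
    {"text": ["αžŸαž„αŸ’αž‚αŸ’αžšαžΆαž˜αžαŸ’αžšαž‡αžΆαž€αŸ‹αž˜αžΆαž“αž₯αž‘αŸ’αž’αž·αž–αž›αžŠαž›αŸ‹αž“αž™αŸ„αž”αžΆαž™αž–αž·αž—αž–αž›αŸ„αž€αŸ”"]}
)

# Batched map keeps tokenization fast on large corpora.
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)
print(tokenized[0]["input_ids"])
```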