+---
+library_name: transformers
+language: ["khm"]
+license: mit
+tags: ["tokenizer", "khmer", "unigram", "sentencepiece", "large", "coverage", "high-resource"]
+---
+# 🇰🇭 KM Improved 32K Tokenizer
+**KM Improved 32K** is a high-capacity **Khmer tokenizer** designed to maximize word coverage
+across diverse domains, including technical, cultural, historical, and academic texts.
+It aims to reduce subword fragmentation and improve contextual understanding in large-scale
+Khmer and multilingual language models.
+---
+## 🧠 Model Details
+### Model Description
+- **Developer:** Sok Meas (@Msok99)
+- **Model Type:** SentencePiece Unigram
+- **Language:** Khmer (khm)
+- **License:** MIT
+- **Base Version:** [`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4)
+- **Vocabulary Size:** 32,000
+- **Goal:** Maximize coverage and minimize over-segmentation
+### Model Sources
+- **Repository:** [https://huggingface.co/Msok99/km-improved-32k](https://huggingface.co/Msok99/km-improved-32k)
+---
+## ⚙️ Key Features
+| Feature | Description |
+|----------|-------------|
+| **Extended Vocabulary** | 32,000 tokens for higher domain coverage |
+| **Improved Context Retention** | Keeps compound and rare words intact |
+| **Reduced Fragmentation** | Fewer subword splits across long sentences |
+| **Perfect Decode Fidelity** | 100% reversible encoding/decoding |
+| **Broad Domain Corpus** | Includes academic, scientific, literary, and technical texts |
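
The reversibility claim in the table can be checked on your own corpus. The following is a minimal sketch, not the author's test suite: `roundtrip_failures` is a hypothetical helper that takes any `encode`/`decode` pair and returns the texts that do not come back unchanged. A toy code-point codec stands in for the tokenizer so the example runs offline.

```python
from typing import Callable, Iterable, List


def roundtrip_failures(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> List[str]:
    """Return the texts that do not survive encode -> decode unchanged."""
    return [text for text in texts if decode(encode(text)) != text]


# Stand-in codec (assumption for illustration only): one ID per code point.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


if __name__ == "__main__":
    samples = ["សួស្តី", "ពិភពលោក។"]
    # An empty list means every sample round-tripped exactly.
    print(roundtrip_failures(toy_encode, toy_decode, samples))
```

With the real tokenizer loaded through `AutoTokenizer.from_pretrained("Msok99/km-improved-32k")`, one option is to pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.decode`. A tokenizer that normalizes its input (Unicode normalization, whitespace handling) can report mismatches that are not real information loss, so inspect any failures before drawing conclusions.
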
+---
+## 📊 Performance Overview
+| Category | Avg Tokens | Chars/Token |
+|-----------|-------------|-------------|
+| **Formal News** | 13.6 | 4.19 |
+| **Technology / Scientific** | 10.8 | 5.32 |
+| **Culture / History** | 11.0 | 4.58 |
+| **Education / Academic** | 9.4 | 5.44 |
+| **Mixed Texts** | 12.2 | 3.86 |
+| **Overall Efficiency** | — | **≈4.0 chars/token** |
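
The two metrics in this table can be computed on any sample of text. The sketch below is an illustration under stated assumptions, not the evaluation script behind the numbers above: `token_stats` is a hypothetical helper, "Avg Tokens" is taken to mean mean tokens per sentence, "Chars/Token" is taken to mean total characters divided by total tokens, and characters are counted as Unicode code points. It accepts any tokenize callable, so it runs without downloading the tokenizer.

```python
from typing import Callable, Iterable, List, Tuple


def token_stats(
    tokenize: Callable[[str], List[str]],
    sentences: Iterable[str],
) -> Tuple[float, float]:
    """Return (average tokens per sentence, characters per token).

    Assumption: characters are counted with len(), i.e. Unicode code points,
    which may differ from how the table above was produced.
    """
    total_tokens = 0
    total_chars = 0
    n_sentences = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_chars += len(sentence)
        n_sentences += 1
    if n_sentences == 0 or total_tokens == 0:
        raise ValueError("need at least one sentence that yields tokens")
    return total_tokens / n_sentences, total_chars / total_tokens


if __name__ == "__main__":
    # Whitespace splitting is a stand-in so the example runs offline.
    avg_tokens, chars_per_token = token_stats(str.split, ["ab cd", "efgh"])
    print(avg_tokens, chars_per_token)
```

To measure this tokenizer, replace `str.split` with the `tokenize` method of `AutoTokenizer.from_pretrained("Msok99/km-improved-32k")` and pass sentences from the domain of interest.
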
+---
+## 🧩 Use Cases
+### Direct Use
+- Pretraining and fine-tuning Khmer LLMs
+- Large-scale corpus tokenization for RAG or embedding generation
+- Tokenizing mixed Khmer–English datasets in which English words appear only occasionally
+### Downstream Use
+- RAG systems and document retrieval
+- Knowledge base construction and summarization pipelines
+- Academic and research-oriented text analysis
+### Out-of-Scope Use
+- Mobile or latency-sensitive applications (consider the smaller `18k` or `22k` tokenizers instead)
+- Tokenizing purely English text
+---
+## ⚖️ Bias, Risks, and Limitations
+- The larger vocabulary enlarges the embedding table, which may increase model size slightly (~5–8%)
+- Some rare or domain-specific words may still be underrepresented, particularly in informal text
+- Higher memory usage during training and inference
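
To see where a figure like the ~5–8% above can come from, the sketch below works through the embedding arithmetic. Only the two vocabulary sizes (22,000 and 32,000) come from this card. The hidden size of 1024, the 300M-parameter transformer body, and the untied input/output embeddings are hypothetical values chosen for illustration; the real increase depends on the model being trained.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters in the input embedding, plus the output projection if untied."""
    matrices = 1 if tied else 2
    return vocab_size * hidden_size * matrices


def relative_growth(
    old_vocab: int,
    new_vocab: int,
    hidden_size: int,
    body_params: int,
    tied: bool = False,
) -> float:
    """Fractional increase in total parameters when the vocabulary grows."""
    old_total = body_params + embedding_params(old_vocab, hidden_size, tied)
    new_total = body_params + embedding_params(new_vocab, hidden_size, tied)
    return (new_total - old_total) / old_total


if __name__ == "__main__":
    # Hypothetical model: hidden size 1024, 300M non-embedding parameters.
    growth = relative_growth(22_000, 32_000, hidden_size=1024, body_params=300_000_000)
    print(f"{growth:.1%}")
```

Under these assumptions the move from 22k to 32k adds about 20.5M parameters, roughly 5.9% of the total. Smaller models or larger hidden sizes push the percentage up, and tied embeddings halve the added parameters.
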
+### Recommendations
+For smaller models or chatbots prioritizing speed, use
+[`Msok99/km-improved-22k-v4`](https://huggingface.co/Msok99/km-improved-22k-v4).
+For mixed Khmer–English systems, use
+[`Msok99/lfm2-khmer-merged-18k`](https://huggingface.co/Msok99/lfm2-khmer-merged-18k).
+---
+## 🚀 How to Get Started
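+```python
+from transformers import AutoTokenizer
+
+# Load the tokenizer from the Hugging Face Hub.
+tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-32k")
+text = "សង្គ្រាមត្រជាក់មានឥទ្ធិពលដល់នយោបាយពិភពលោក។"
+
+# Inspect the subword pieces produced for the sentence.
+tokens = tokenizer.tokenize(text)
+print(tokens)
+
+# Round-trip check: encode to IDs, then decode back to text without special tokens.
+ids = tokenizer.encode(text)
+print(tokenizer.decode(ids, skip_special_tokens=True))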