---
language:
- ne
- en
license: apache-2.0
tags:
- transliteration
- nepali
- devanagari
- romanization
- seq2seq
- text2text-generation
datasets:
- custom
metrics:
- bleu
- accuracy
- character_error_rate
widget:
- text: "namaste"
  example_title: "English to Nepali"
- text: "नमस्ते"
  example_title: "Nepali to English"
- text: "kathmandu"
  example_title: "Place Name"
- text: "धन्यवाद"
  example_title: "Common Phrase"
model-index:
- name: nepali-transliteration-model
  results:
  - task:
      type: text2text-generation
      name: Transliteration
    metrics:
    - type: bleu
      value: 0.85
      name: BLEU Score
    - type: accuracy
      value: 0.92
      name: Word Accuracy
---

# Nepali Transliteration Model

## Model Description

This model performs bidirectional transliteration between Nepali (Devanagari script) and English (Latin script). It can convert:

- English text to Nepali Devanagari script
- Nepali Devanagari text to English romanization

The model is fine-tuned for accurate transliteration of Nepali names, places, and common vocabulary.
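Because the model is bidirectional, the caller must tell it which direction to run via the `en2ne:` / `ne2en:` task prefixes used in the usage examples below. A minimal sketch of auto-detecting the direction from the input script; `task_prefix` is an illustrative helper, not part of the model itself:

```python
def is_devanagari(text):
    # The Devanagari Unicode block spans U+0900 through U+097F
    return any("\u0900" <= ch <= "\u097f" for ch in text)

def task_prefix(text):
    # Route Devanagari input to ne2en, Latin input to en2ne
    return "ne2en: " if is_devanagari(text) else "en2ne: "

print(task_prefix("नमस्ते") + "नमस्ते")    # ne2en: नमस्ते
print(task_prefix("namaste") + "namaste")  # en2ne: namaste
```

In practice, mixed-script input (noted as a limitation below) would need a more careful policy than this per-character check.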
## Model Details

- **Model Type:** Sequence-to-sequence text generation
- **Language(s):** Nepali (ne), English (en)
- **License:** Apache 2.0
- **Base Model:** google/mt5-small
- **Training Data:** Custom Nepali-English transliteration dataset
- **Training Steps:** 34,000
- **Model Size:** ~400 MB

## Intended Use

### Primary Use Cases

- Converting English names and words to Nepali Devanagari script
- Romanizing Nepali text for international audiences
- Supporting multilingual applications and keyboards
- Academic research in computational linguistics
- Cultural preservation and digital humanities projects

### Out-of-Scope Use Cases

- Machine translation (this model only handles transliteration, not translation)
- Text generation beyond transliteration
- Processing languages other than Nepali and English

## How to Use

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
model_name = "nirajan111/nepali-transliteration"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# English to Nepali
def transliterate_en_to_ne(text):
    inputs = tokenizer(f"en2ne: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Nepali to English
def transliterate_ne_to_en(text):
    inputs = tokenizer(f"ne2en: {text}", return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Examples
print(transliterate_en_to_ne("namaste"))   # Expected: नमस्ते
print(transliterate_ne_to_en("काठमाडौं"))  # Expected: kathmandu
```

### Advanced Usage

```python
# Batch processing
texts = ["namaste", "dhanyabad", "kathmandu"]
inputs = tokenizer([f"en2ne: {text}" for text in texts], return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_length=128, num_beams=4)
results = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

## Training Data

The model was trained on a custom dataset containing:

- **Size:** [Update with dataset size, e.g., 50,000 transliteration pairs]
- **Sources:**
  - Nepali names and places
  - Common vocabulary
  - Cultural terms
  - Government documents
  - Educational materials
- **Preprocessing:** Text normalization, duplicate removal, quality filtering
- **Split:** 80% training, 10% validation, 10% testing

## Training Procedure

### Training Hyperparameters

- **Batch Size:** 64 (training), 16 (evaluation)
- **Learning Rate:** 5e-5
- **Epochs:** 10
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
- **Warmup Steps:** 500
- **Max Sequence Length:** 128

### Training Infrastructure

- **Hardware:** Kaggle A100 GPU
- **Framework:** PyTorch, Transformers
- **Training Time:** ~12 hours

## Evaluation

### Metrics

- **BLEU Score:** 0.85
- **Word Accuracy:** 0.92
- **Character Error Rate:** 0.138
- **Exact Match:** 0.78

### Test Results

| Direction | CER  |
| --------- | ---- |
| EN → NE   | 0.13 |
| NE → EN   | 0.10 |

## Limitations and Bias

### Known Limitations

- Performance may vary with proper nouns not seen during training
- Limited handling of mixed-script text
- May struggle with very long compound words
- Accuracy depends on text quality and standardization

### Potential Biases

- Training data may over-represent certain regions or dialects of Nepali
- Model may have better performance on formal/literary Nepali vs.
colloquial forms
- Potential bias toward more common transliteration patterns

## Ethical Considerations

- This model supports language preservation and digital inclusion for Nepali speakers
- Care should be taken when using it for official documents or names
- Users should verify outputs for critical applications
- The model should not be used to misrepresent or appropriate Nepali culture

## Citation

```bibtex
@misc{nepali-transliteration-2024,
  title={Nepali Transliteration Model},
  author={Nirajan Sah},
  year={2025},
  url={https://huggingface.co/nirajan1111/nepali-transliteration-model}
}
```

## Model Card Contact

For questions or feedback about this model, please contact: [nirajansah1111@gmail.com]

## Acknowledgments

- Thanks to the Nepali language community for providing linguistic insights

---
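For reference, the character error rate (CER) reported in the evaluation section is Levenshtein edit distance normalized by the reference length. A minimal sketch of the metric (illustrative only, not the evaluation script used to produce the scores in this card):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, start=1):
        curr = [i]
        for j, hc in enumerate(hypothesis, start=1):
            cost = 0 if rc == hc else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)

print(round(cer("kathmandu", "katmandu"), 3))  # 0.111 — one dropped character out of nine
```

Because Python strings iterate by code point, this counts Devanagari characters (including combining matras) individually, which matters when comparing CER figures across tools.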