KothaGPT Model Collection Update

📦 Model Collection Overview

This repository contains the complete collection of KothaGPT bilingual language models and tools for Bangla (Bengali) and English. All models have been updated and published to the Hugging Face Hub.

Last Updated: January 2026
Organization: KothaGPT
License: Apache 2.0

🚀 Available Models

Core Language Models

  • KothaGPT/bilingual-lm: general bilingual text generation
  • KothaGPT/literary-lm: literary text analysis

Classification Models

  • KothaGPT/readability-classifier: text readability assessment
  • KothaGPT/sentiment-tone-classifier: sentiment and tone analysis
  • KothaGPT/text-complexity-predictor: text complexity scoring

Specialized Models

  • KothaGPT/poetic-meter-detector: poetry analysis
  • KothaGPT/metaphor-simile-detector: literary device detection
  • KothaGPT/named-entity-recognizer: entity extraction
  • KothaGPT/cross-lingual-embed: cross-lingual embeddings
  • KothaGPT/style-transfer-gpt: style transfer
  • KothaGPT/tokenizer: shared bilingual tokenizer

🔄 Update Process

Automated Publishing

All models are published using the automated script:

HF_TOKEN=your_token bash scripts/huggingface/publish_all.sh false

Script Features

  • Modern Commands: Uses hf upload-large-folder for better large file handling (a Python equivalent is sketched after this list)
  • Error Recovery: Resumable uploads for large models
  • Validation: Pre-upload validation checks
  • Progress Tracking: Detailed progress bars and status reports
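
The upload step can also be driven from Python through huggingface_hub. A minimal sketch of what the publish script does for a single model, assuming the local layout models/bilingual-lm/ (the path is an assumption):

from huggingface_hub import HfApi

api = HfApi()  # reads HF_TOKEN from the environment
api.create_repo("KothaGPT/bilingual-lm", repo_type="model", exist_ok=True)
api.upload_large_folder(
    repo_id="KothaGPT/bilingual-lm",
    repo_type="model",
    folder_path="models/bilingual-lm",  # assumed local path
)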

📊 Model Statistics

| Model | Parameters | Files | Size | Use Case |
|---|---|---|---|---|
| bilingual-lm | ~125M | 42 | ~500MB | General text generation |
| literary-lm | ~125M | 2 | ~5MB | Literary text analysis |
| readability-classifier | - | 5 | ~2MB | Text assessment |
| sentiment-tone-classifier | - | 2 | ~1MB | Sentiment analysis |
| text-complexity-predictor | - | 1 | ~505KB | Complexity scoring |
| poetic-meter-detector | - | 2 | ~1MB | Poetry analysis |
| metaphor-simile-detector | - | 2 | ~1MB | Literary analysis |
| named-entity-recognizer | - | 2 | ~1MB | Entity extraction |
| cross-lingual-embed | - | 1 | ~1MB | Embeddings |
| style-transfer-gpt | - | 2 | ~1MB | Style transfer |
| tokenizer | - | 2 | ~262KB | Tokenization |

🛠️ Usage Examples

Loading Multiple Models

from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Load main bilingual model
tokenizer = AutoTokenizer.from_pretrained("KothaGPT/bilingual-lm")
model = AutoModelForCausalLM.from_pretrained("KothaGPT/bilingual-lm")

# Load the readability classifier
classifier = AutoModelForSequenceClassification.from_pretrained("KothaGPT/readability-classifier")
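
A quick generation check with the model loaded above (the prompt is illustrative):

# "কেমন আছো?" means "How are you?" in Bangla
inputs = tokenizer("কেমন আছো?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))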

Batch Processing

from transformers import pipeline

models = {
    "sentiment": "KothaGPT/sentiment-tone-classifier",
    "readability": "KothaGPT/readability-classifier",
    "complexity": "KothaGPT/text-complexity-predictor",
}

text = "এই বইটা পড়তে খুব সহজ।"  # illustrative input: "This book is very easy to read."

for task, model_name in models.items():
    # Assumes each repo ships a sequence-classification head;
    # swap the pipeline task if a model uses a different head.
    clf = pipeline("text-classification", model=model_name)
    print(task, clf(text))

📈 Performance Metrics

Language Support

  • Bangla (Bengali): Full support with native tokenizer
  • English: Full support with standard tokenizer
  • Code-switching: Handles mixed Bangla-English text (see the tokenizer check below)
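
A hedged check of code-switched handling with the shared tokenizer (the sentence is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KothaGPT/bilingual-lm")
# Mixed Bangla-English input: "I bought a new laptop today."
print(tokenizer.tokenize("আমি আজ একটা নতুন laptop কিনেছি।"))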

Benchmark Results

  • Perplexity: < 25 on the bilingual test set (measurement sketch after this list)
  • Accuracy: > 85% on classification tasks
  • Inference Speed: ~50 tokens/second on CPU
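
Perplexity can be spot-checked from the model's language-modeling loss. A minimal sketch; the sample sentence is illustrative, not the actual test set:

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KothaGPT/bilingual-lm")
model = AutoModelForCausalLM.from_pretrained("KothaGPT/bilingual-lm")
model.eval()

inputs = tokenizer("আমি বই পড়তে ভালোবাসি। I love reading books.", return_tensors="pt")
with torch.no_grad():
    # With labels supplied, the returned loss is the mean token cross-entropy
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"perplexity: {math.exp(loss.item()):.2f}")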

🔧 Technical Details

Training Infrastructure

  • Framework: PyTorch + Transformers
  • Hardware: GPU training on T4/V100
  • Optimization: AdamW with cosine learning-rate scheduling (setup sketched after this list)
  • Evaluation: Comprehensive test suite
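
A minimal sketch of the optimizer and schedule named above; the learning rate and step counts are illustrative, not the actual training values:

import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("KothaGPT/bilingual-lm")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative LR
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,       # illustrative
    num_training_steps=10_000,  # illustrative
)
# Inside the training loop, call optimizer.step() then scheduler.step()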

Model Architecture

  • Base: GPT-2 style transformer
  • Tokenizer: SentencePiece with bilingual vocabulary
  • Embeddings: Cross-lingual shared space
  • Layers: 12 transformer layers, 12 attention heads (see the config sketch below)
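
The description above maps onto a GPT-2 style configuration. A hedged sketch: the hidden size is assumed from the standard ~125M-parameter GPT-2 recipe, and the vocabulary size is a placeholder for the published tokenizer's actual value:

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_layer=12,        # 12 transformer layers
    n_head=12,         # 12 attention heads
    n_embd=768,        # assumption: GPT-2 small hidden size for ~125M params
    vocab_size=32000,  # placeholder; the bilingual SentencePiece vocab sets this
)
model = GPT2LMHeadModel(config)  # randomly initialized, for illustration only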

📚 Documentation

🤝 Contributing

Model Updates

  1. Train/improve model locally
  2. Update model files in models/ directory
  3. Run validation tests
  4. Publish with: bash scripts/huggingface/publish_all.sh false

Quality Assurance

  • All models pass automated tests
  • Manual review of model cards
  • Performance benchmarking
  • Documentation updates

📄 License

All models in this collection are licensed under Apache 2.0. See individual model repositories for specific usage terms.

📞 Support


Note: This collection represents the complete suite of KothaGPT bilingual models. Models are regularly updated with new training data and improved architectures.
