Bantam Language Model
This model card provides a detailed overview of the BantamForCausalLM model, a transformer-based architecture designed for adaptive, efficient language modeling through hybrid dense and sparse computation.
Model Details
Model Description
The Bantam model is a 20-layer causal language model combining dense Transformer blocks with Mixture-of-Experts (MoE) layers. It features layer-wise dynamic attention, progressive context scaling, and grouped multi-query attention, designed for efficient large-scale language modeling with balanced compute utilization.
- Developed by: Theoistic
- Lead Developer: Theodor Solbjorg (theo@theoistic.com)
- Funded by: Theoistic
- Shared by: Theoistic
- Model type: Causal Language Model (Transformer-based)
- Language(s): Multilingual (55 languages, see dataset summary below)
Model Sources
- Paper: Pending publication
- Demo: Coming soon
Uses
Direct Use
The Bantam model can be used directly for text generation, completion, summarization, and instruction following. It supports context windows up to 2048 tokens and operates efficiently on GPUs using bfloat16 precision.
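For example, a minimal sketch of loading the model in bfloat16 on a GPU, using the same registration imports shown in the getting-started section below (the dtype and device choices here are illustrative):

```python
import torch
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam      # registers config/model with AutoConfig/AutoModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained(
    "Theoistic/Bantam-285m",
    torch_dtype=torch.bfloat16,  # matches the training precision
).to("cuda")  # assumes a CUDA-capable GPU is available
```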
Downstream Use
Bantam can be fine-tuned for downstream NLP tasks, such as translation, dialogue modeling, educational content generation, or creative writing. Its multilingual and mixed-domain dataset allows flexible adaptation.
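As a rough starting point, the sketch below fine-tunes the model with the Hugging Face Trainer on a tiny in-memory dataset; the dataset, batch size, learning rate, and epoch count are placeholder assumptions, not recommended settings:

```python
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam      # registers config/model with AutoConfig/AutoModel
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")
if tokenizer.pad_token is None:  # ensure a pad token exists for the collator
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder training texts; replace with your own corpus.
texts = ["A short example sentence.", "Another short training example."]
train_ds = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="bantam-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```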
Out-of-Scope Use
The model is not intended for high-stakes or safety-critical domains, such as legal, medical, or financial decision-making. It should not be used for generating misinformation or biased outputs without human oversight.
Bias, Risks, and Limitations
Bantam inherits biases from its multilingual datasets, which include content from the internet, curated knowledge bases, and open-source text corpora. It may underperform on underrepresented languages or dialects.
Additionally, because Bantam is a relatively small model (~285M parameters), hallucinations and factual inaccuracies are expected, especially when reasoning beyond the scope of the training data.
Recommendations
Users should implement output filtering, content moderation, and continuous evaluation on domain-specific benchmarks to identify and mitigate bias or performance issues.
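For instance, a minimal post-generation filter along these lines can serve as a placeholder until a proper moderation pipeline is in place (the denylist and helper names here are hypothetical):

```python
DENYLIST = {"example_blocked_term"}  # hypothetical; populate per deployment policy

def is_allowed(text: str) -> bool:
    """Return False if the generated text contains any denylisted term."""
    lowered = text.lower()
    return not any(term in lowered for term in DENYLIST)

def filtered_generate(model, tokenizer, prompt: str, **gen_kwargs) -> str:
    """Generate text and withhold it if the simple filter rejects it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs.pop("token_type_ids", None)  # key not used by the model
    output = model.generate(**inputs, **gen_kwargs)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text if is_allowed(text) else "[output withheld by filter]"
```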
Model Capabilities
Bantam demonstrates strong multilingual competence across 55 languages and is capable of generating informative, coherent, and contextually aware text in each of them.
The model was designed to use many small attention heads in early layers to capture linguistic and grammatical structure, transitioning to larger heads and more abstract reasoning in later layers. This design improves logical coherence and narrative flow across diverse languages despite the model's compact size.
How to Get Started with the Model
Before loading the model, install the Bantam CLI:
pip install bantam-cli
To run inference directly via the Bantam CLI:
bantam-cli chat --model Theoistic/Bantam-285m
Or initialize the model in Python:
import bantam # lazy imports
import bantam.tokenization_bantam # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam # registers config/model with AutoConfig/AutoModel
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")
prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt")
# Remove unsupported keys (like token_type_ids) before generation
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained on the Bantam Dataset, a multilingual and multi-domain collection of JSONL files designed to support general-purpose language modeling. It includes content from knowledge bases, refined educational text, tiny fictional stories, and curated data for linguistic diversity.
Languages Covered
The dataset spans 55 languages, including:
- English
- Chinese (Mandarin, Wu, Cantonese)
- Romance: Spanish, French, Portuguese, Italian, Romanian, Catalan
- Indic & Dravidian: Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati
- Slavic & Germanic: Russian, Polish, Czech, German, Danish, Swedish, Norwegian, Dutch, Faroese
- Others: Arabic, Hebrew, Amharic, Turkish, Finnish, Korean, Japanese, Swahili, Vietnamese, Thai, Greek, Persian, and more.
Note: The dataset itself is not publicly released; this summary represents the linguistic and structural diversity of the data used for training.
Training Procedure
Preprocessing
A large portion of the training data was deduplicated, normalized, and categorized by field of study and language, then preprocessed into feature-rich, dense articles: larger models were used to reformat the source material into concise, detailed Markdown articles. The resulting millions of articles were shuffled across languages so that no large domain or high-resource language overshadowed the rest, and so low-resource languages were not degraded by catastrophic forgetting.
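The snippet below is an illustrative sketch of language-balanced interleaving of JSONL shards in the spirit described above; it is not the actual preprocessing pipeline, and the per-language file paths are assumptions:

```python
import json
import random

def balanced_stream(jsonl_path_by_language, seed=0):
    """Yield records round-robin across languages so that no single language
    dominates any contiguous stretch of the training stream."""
    rng = random.Random(seed)
    readers = {lang: open(path, encoding="utf-8")
               for lang, path in jsonl_path_by_language.items()}
    active = list(readers)
    while active:
        rng.shuffle(active)  # vary the per-pass language order
        for lang in list(active):
            line = readers[lang].readline()
            if not line:          # shard exhausted: drop this language
                readers[lang].close()
                active.remove(lang)
                continue
            yield lang, json.loads(line)

# Usage (paths are hypothetical):
# for lang, record in balanced_stream({"en": "en.jsonl", "is": "is.jsonl"}):
#     ...
```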
Training Hyperparameters
- Parameters: 285 million
- Precision: bfloat16 mixed precision
- Optimizer: AdamW with weight decay
- Batch size: 2048 tokens per GPU
- Learning rate schedule: Cosine decay with warmup (see the sketch below)
- Context length: 2048 tokens
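A minimal sketch of this optimizer and schedule setup, reusing `model` from the Python example above; the learning rate, weight decay, and step counts are illustrative assumptions, not the published training configuration:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,      # illustrative warmup length
    num_training_steps=200_000,  # illustrative total step count
)
# During training, call optimizer.step() and then scheduler.step() once per batch.
```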
Speeds, Sizes, Times
- Training hardware: NVIDIA RTX 5090
- Training duration: ~50 hours
- Checkpoint size: ~285M parameters
Evaluation
Bantam is a pretrained base model, not fine-tuned or benchmarked with external metrics. Qualitatively, it exhibits:
- Strong multilingual understanding and generation across 55 languages.
- Coherent reasoning and informative responses.
- Expected hallucinations due to small model size.
No quantitative metrics or interpretability visualizations (e.g., heatmaps, probing, or evaluation suites) have been produced yet.
Environmental Impact
- Hardware Type: NVIDIA RTX 5090
- Hours used: ~50
- Cloud Provider: Local compute
- Compute Region: N/A (local training)
- Carbon Emitted: Estimated <0.05 tCO₂eq (see the rough check below)
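A back-of-the-envelope check of this estimate, assuming roughly 0.6 kW average board power for the GPU and a grid intensity of about 0.4 kgCO₂eq/kWh (both assumptions, not measured values):

```python
energy_kwh = 0.6 * 50                  # kW x hours of training = 30 kWh
emissions_t = energy_kwh * 0.4 / 1000  # kgCO2eq converted to tonnes
print(emissions_t)                     # ~0.012 tCO2eq, consistent with <0.05
```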
Technical Specifications
Model Architecture and Objective
| Layer | Type | Query Heads | KV Heads | Head Dim | Groups | Intermediate Size | Window | MoE | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dense | 12 | 3 | 64 | 1 | 2304 | 128 | ✗ | Dense local linguistic encoding |
| 1 | Dense | 12 | 3 | 64 | 1 | 2304 | 128 | ✗ | Dense local linguistic encoding |
| 2 | Dense | 12 | 3 | 64 | 1 | 2368 | 128 | ✗ | Dense local attention |
| 3 | Dense | 12 | 3 | 64 | 1 | 2400 | Global | ✗ | Transition layer |
| 4 | MoE | (6+3) | 3 | (80/96) | 2 | 2432 | 256 | ✓ | 6 experts, top-2 routing |
| 5 | Dense | (6+3) | 3 | (80/96) | 2 | 2368 | 256 | ✗ | Hybrid attention |
| 6 | Dense | (6+3) | 3 | (80/96) | 2 | 2432 | 256 | ✗ | Hybrid attention |
| 7 | Dense | (6+3) | 3 | (80/96) | 2 | 2368 | Global | ✗ | Expanding context |
| 8 | Dense | 9 | 3 | 64/128 | 2 | 2304 | 256 | ✗ | Default grouped attention |
| 9 | Dense | 9 | 3 | 64/128 | 2 | 2368 | 256 | ✗ | Default grouped attention |
| 10 | Dense | 9 | 3 | 64/128 | 2 | 2400 | 256 | ✗ | Default grouped attention |
| 11 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 256 | ✗ | Default grouped attention |
| 12 | Dense | 9 | 3 | 64/128 | 2 | 2432 | Global | ✗ | Expanding context |
| 13 | Dense | 9 | 3 | 64/128 | 2 | 2400 | 512 | ✗ | Logical attention expansion |
| 14 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✗ | Logical attention expansion |
| 15 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✗ | Logical attention expansion |
| 16 | MoE | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✓ | 8 experts, top-2 routing |
| 17 | MoE | 9 | 3 | 64/128 | 2 | 2432 | Global | ✓ | 8 experts, top-2 routing |
| 18 | Dense | 9 | 3 | 64/128 | 2 | 2368 | 512 | ✗ | Output stabilization |
| 19 | Dense | 9 | 3 | 64/128 | 2 | 2400 | Global | ✗ | Final dense layer |
Attention Group Defaults
| Group | Query Heads | KV Heads | Head Dim |
|---|---|---|---|
| Group 1 | 3 | 1 | 128 |
| Group 2 | 6 | 2 | 64 |
These defaults apply to all layers unless explicitly overridden in layer-specific configurations.
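As an illustration of the grouped-query attention these defaults describe, the sketch below shares each KV head across several query heads (shapes follow the Group 2 defaults: 6 query heads, 2 KV heads, head dim 64); it is a standalone example, not the model's internal implementation:

```python
import torch
import torch.nn.functional as F

batch, seq_len = 1, 16
n_q_heads, n_kv_heads, head_dim = 6, 2, 64  # Group 2 defaults

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Each KV head serves n_q_heads // n_kv_heads = 3 query heads.
repeat = n_q_heads // n_kv_heads
k = k.repeat_interleave(repeat, dim=1)
v = v.repeat_interleave(repeat, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 6, 16, 64])
```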
- Objective: Causal next-token prediction
- Routing: Top-2 expert routing with a load-balancing loss coefficient of 0.01
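A minimal sketch of top-2 routing with a Switch-style load-balancing auxiliary loss using the 0.01 coefficient noted above; the hidden size, token count, and exact loss formulation are illustrative assumptions rather than the model's implementation:

```python
import torch
import torch.nn.functional as F

tokens, hidden, n_experts, top_k, aux_coef = 8, 16, 8, 2, 0.01

x = torch.randn(tokens, hidden)
router = torch.nn.Linear(hidden, n_experts)

logits = router(x)                               # (tokens, n_experts)
probs = F.softmax(logits, dim=-1)
top_p, top_idx = probs.topk(top_k, dim=-1)       # each token picks its top-2 experts
gates = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalized mixing weights

# Load-balancing term: fraction of tokens routed to each expert times the
# mean router probability for that expert, summed over experts.
dispatch = F.one_hot(top_idx, n_experts).float().sum(dim=1)  # (tokens, n_experts)
frac_tokens = dispatch.mean(dim=0) / top_k
mean_probs = probs.mean(dim=0)
aux_loss = aux_coef * n_experts * (frac_tokens * mean_probs).sum()
```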
Compute Infrastructure
Hardware
- 1 Γ NVIDIA RTX 5090 GPU
Software
- PyTorch 2.8
- Transformers >=4.41
- Bantam CLI (required for import registration)
Model Card Authors
- Theodor Solbjorg β Lead Developer, Theoistic
Model Card Contact
For inquiries: theo@theoistic.com