Bantam Language Model
This model card provides a detailed overview of the BantamForCausalLM model, a transformer-based architecture designed for adaptive, efficient language modeling through hybrid dense and sparse computation.
Model Details
Model Description
The Bantam model is a 20-layer causal language model combining dense Transformer blocks with Mixture-of-Experts (MoE) layers. It features layer-wise dynamic attention, progressive context scaling, and grouped multi-query attention, designed for efficient large-scale language modeling with balanced compute utilization.
- Developed by: Theoistic
- Lead Developer: Theodor Solbjorg (theo@theoistic.com)
- Funded by: Theoistic
- Shared by: Theoistic
- Model type: Causal Language Model (Transformer-based)
- Language(s): Multilingual (55 languages, see dataset summary below)
Model Sources
- Paper: Pending publication
- Demo: Coming soon
Uses
Direct Use
The Bantam model can be used directly for text generation, completion, summarization, and instruction following. It supports context windows up to 2048 tokens and operates efficiently on GPUs using bfloat16 precision.
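For example, a minimal sketch of loading the model in bfloat16 on a GPU, using the same registration imports shown in the getting-started section below (the dtype and device choices here are illustrative):

```python
import torch
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam      # registers config/model with AutoConfig/AutoModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained(
    "Theoistic/Bantam-285m",
    torch_dtype=torch.bfloat16,  # matches the training precision
).to("cuda")  # assumes a CUDA-capable GPU is available
```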
Downstream Use
Bantam can be fine-tuned for downstream NLP tasks, such as translation, dialogue modeling, educational content generation, or creative writing. Its multilingual and mixed-domain dataset allows flexible adaptation.
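As a rough starting point, the sketch below fine-tunes the model with the Hugging Face Trainer on a tiny in-memory dataset; the dataset, batch size, learning rate, and epoch count are placeholder assumptions, not recommended settings:

```python
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam      # registers config/model with AutoConfig/AutoModel
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")
if tokenizer.pad_token is None:  # ensure a pad token exists for the collator
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder training texts; replace with your own corpus.
texts = ["A short example sentence.", "Another short training example."]
train_ds = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="bantam-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```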
Out-of-Scope Use
The model is not intended for high-stakes or safety-critical domains, such as legal, medical, or financial decision-making. It should not be used for generating misinformation or biased outputs without human oversight.
Bias, Risks, and Limitations
Bantam inherits biases from its multilingual datasets, which include content from the internet, curated knowledge bases, and open-source text corpora. It may underperform on underrepresented languages or dialects.
Additionally, because Bantam is a relatively small model (~285M parameters), hallucinations and factual inaccuracies are expected, especially when reasoning beyond the scope of the training data.
Recommendations
Users should implement output filtering, content moderation, and continuous evaluation on domain-specific benchmarks to identify and mitigate bias or performance issues.
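For instance, a minimal post-generation filter along these lines can serve as a placeholder until a proper moderation pipeline is in place (the denylist and helper names here are hypothetical):

```python
DENYLIST = {"example_blocked_term"}  # hypothetical; populate per deployment policy

def is_allowed(text: str) -> bool:
    """Return False if the generated text contains any denylisted term."""
    lowered = text.lower()
    return not any(term in lowered for term in DENYLIST)

def filtered_generate(model, tokenizer, prompt: str, **gen_kwargs) -> str:
    """Generate text and withhold it if the simple filter rejects it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs.pop("token_type_ids", None)  # key not used by the model
    output = model.generate(**inputs, **gen_kwargs)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text if is_allowed(text) else "[output withheld by filter]"
```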
Model Capabilities
Bantam demonstrates strong multilingual competence across 55 languages and is capable of generating informative, coherent, and contextually aware text in each of them.
The model was designed to use many small attention heads in early layers to capture linguistic and grammatical structure, transitioning to larger heads and more abstract reasoning in later layers. This design improves logical coherence and narrative flow across diverse languages despite the model's compact size.
How to Get Started with the Model
Before loading the model, install the Bantam CLI:
pip install bantam-cli
To run inference directly via the Bantam CLI:
bantam-cli chat --model Theoistic/Bantam-285m
Or initialize the model in Python:
import bantam # lazy imports
import bantam.tokenization_bantam # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam # registers config/model with AutoConfig/AutoModel
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")
prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt")
# Remove unsupported keys (like token_type_ids) before generation
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained on the Bantam Dataset, a multilingual and multi-domain collection of JSONL files designed to support general-purpose language modeling. It includes content from knowledge bases, refined educational text, tiny fictional stories, and curated data for linguistic diversity.
Languages Covered
The dataset spans 55 languages, including:
- English
- Chinese (Mandarin, Wu, Cantonese)
- Romance: Spanish, French, Portuguese, Italian, Romanian, Catalan
- Indic & Dravidian: Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati
- Slavic & Germanic: Russian, Polish, Czech, German, Danish, Swedish, Norwegian, Dutch, Faroese
- Others: Arabic, Hebrew, Amharic, Turkish, Finnish, Korean, Japanese, Swahili, Vietnamese, Thai, Greek, Persian, and more.
Note: The dataset itself is not publicly released; this summary represents the linguistic and structural diversity of the data used for training.
Training Procedure
Preprocessing
A large portion of the training data was deduplicated, normalized, and categorized by field of study and language, then preprocessed into feature-rich, dense articles: larger models were used to reformat the source material into concise, detailed Markdown articles. The resulting millions of articles were shuffled across languages so that no large domain or high-resource language overshadowed the rest, and so low-resource languages were not degraded by catastrophic forgetting.
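The snippet below is an illustrative sketch of language-balanced interleaving of JSONL shards in the spirit described above; it is not the actual preprocessing pipeline, and the per-language file paths are assumptions:

```python
import json
import random

def balanced_stream(jsonl_path_by_language, seed=0):
    """Yield records round-robin across languages so that no single language
    dominates any contiguous stretch of the training stream."""
    rng = random.Random(seed)
    readers = {lang: open(path, encoding="utf-8")
               for lang, path in jsonl_path_by_language.items()}
    active = list(readers)
    while active:
        rng.shuffle(active)  # vary the per-pass language order
        for lang in list(active):
            line = readers[lang].readline()
            if not line:          # shard exhausted: drop this language
                readers[lang].close()
                active.remove(lang)
                continue
            yield lang, json.loads(line)

# Usage (paths are hypothetical):
# for lang, record in balanced_stream({"en": "en.jsonl", "is": "is.jsonl"}):
#     ...
```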
Training Hyperparameters
- Parameters: 285 million
- Precision: bfloat16 mixed precision
- Optimizer: AdamW with weight decay
- Batch size: 2048 tokens per GPU
- Learning rate schedule: Cosine decay with warmup (see the sketch below)
- Context length: 2048 tokens
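A minimal sketch of this optimizer and schedule setup, reusing `model` from the Python example above; the learning rate, weight decay, and step counts are illustrative assumptions, not the published training configuration:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,      # illustrative warmup length
    num_training_steps=200_000,  # illustrative total step count
)
# During training, call optimizer.step() and then scheduler.step() once per batch.
```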
Speeds, Sizes, Times
- Training hardware: NVIDIA RTX 5090
- Training duration: ~50 hours
- Checkpoint size: ~285M parameters
Evaluation
Bantam is a pretrained base model, not fine-tuned or benchmarked with external metrics. Qualitatively, it exhibits:
- Strong multilingual understanding and generation across 55 languages.
- Coherent reasoning and informative responses.
- Expected hallucinations due to small model size.
No quantitative metrics or interpretability visualizations (e.g., heatmaps, probing, or evaluation suites) have been produced yet.
Environmental Impact
- Hardware Type: NVIDIA RTX 5090
- Hours used: ~50
- Cloud Provider: Local compute
- Compute Region: N/A (local training)
- Carbon Emitted: Estimated <0.05 tCO₂eq (see the rough check below)
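A back-of-the-envelope check of this estimate, assuming roughly 0.6 kW average board power for the GPU and a grid intensity of about 0.4 kgCO₂eq/kWh (both assumptions, not measured values):

```python
energy_kwh = 0.6 * 50                  # kW x hours of training = 30 kWh
emissions_t = energy_kwh * 0.4 / 1000  # kgCO2eq converted to tonnes
print(emissions_t)                     # ~0.012 tCO2eq, consistent with <0.05
```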
Technical Specifications
Model Architecture and Objective
| Layer | Type | Query Heads | KV Heads | Head Dim | Groups | Intermediate Size | Window | MoE | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Dense | 12 | 3 | 64 | 1 | 2304 | 128 | ✗ | Dense local linguistic encoding |
| 1 | Dense | 12 | 3 | 64 | 1 | 2304 | 128 | ✗ | Dense local linguistic encoding |
| 2 | Dense | 12 | 3 | 64 | 1 | 2368 | 128 | ✗ | Dense local attention |
| 3 | Dense | 12 | 3 | 64 | 1 | 2400 | Global | ✗ | Transition layer |
| 4 | MoE | (6+3) | 3 | (80/96) | 2 | 2432 | 256 | ✓ | 6 experts, top-2 routing |
| 5 | Dense | (6+3) | 3 | (80/96) | 2 | 2368 | 256 | ✗ | Hybrid attention |
| 6 | Dense | (6+3) | 3 | (80/96) | 2 | 2432 | 256 | ✗ | Hybrid attention |
| 7 | Dense | (6+3) | 3 | (80/96) | 2 | 2368 | Global | ✗ | Expanding context |
| 8 | Dense | 9 | 3 | 64/128 | 2 | 2304 | 256 | ✗ | Default grouped attention |
| 9 | Dense | 9 | 3 | 64/128 | 2 | 2368 | 256 | ✗ | Default grouped attention |
| 10 | Dense | 9 | 3 | 64/128 | 2 | 2400 | 256 | ✗ | Default grouped attention |
| 11 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 256 | ✗ | Default grouped attention |
| 12 | Dense | 9 | 3 | 64/128 | 2 | 2432 | Global | ✗ | Expanding context |
| 13 | Dense | 9 | 3 | 64/128 | 2 | 2400 | 512 | ✗ | Logical attention expansion |
| 14 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✗ | Logical attention expansion |
| 15 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✗ | Logical attention expansion |
| 16 | MoE | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✓ | 8 experts, top-2 routing |
| 17 | MoE | 9 | 3 | 64/128 | 2 | 2432 | Global | ✓ | 8 experts, top-2 routing |
| 18 | Dense | 9 | 3 | 64/128 | 2 | 2368 | 512 | ✗ | Output stabilization |
| 19 | Dense | 9 | 3 | 64/128 | 2 | 2400 | Global | ✗ | Final dense layer |
Attention Group Defaults
| Group | Query Heads | KV Heads | Head Dim |
|---|---|---|---|
| Group 1 | 3 | 1 | 128 |
| Group 2 | 6 | 2 | 64 |
These defaults apply to all layers unless explicitly overridden in layer-specific configurations.
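As an illustration of the grouped-query attention these defaults describe, the sketch below shares each KV head across several query heads (shapes follow the Group 2 defaults: 6 query heads, 2 KV heads, head dim 64); it is a standalone example, not the model's internal implementation:

```python
import torch
import torch.nn.functional as F

batch, seq_len = 1, 16
n_q_heads, n_kv_heads, head_dim = 6, 2, 64  # Group 2 defaults

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Each KV head serves n_q_heads // n_kv_heads = 3 query heads.
repeat = n_q_heads // n_kv_heads
k = k.repeat_interleave(repeat, dim=1)
v = v.repeat_interleave(repeat, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 6, 16, 64])
```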
- Objective: Causal next-token prediction
- Routing: Top-2 expert routing with a load-balancing loss coefficient of 0.01
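A minimal sketch of top-2 routing with a Switch-style load-balancing auxiliary loss using the 0.01 coefficient noted above; the hidden size, token count, and exact loss formulation are illustrative assumptions rather than the model's implementation:

```python
import torch
import torch.nn.functional as F

tokens, hidden, n_experts, top_k, aux_coef = 8, 16, 8, 2, 0.01

x = torch.randn(tokens, hidden)
router = torch.nn.Linear(hidden, n_experts)

logits = router(x)                               # (tokens, n_experts)
probs = F.softmax(logits, dim=-1)
top_p, top_idx = probs.topk(top_k, dim=-1)       # each token picks its top-2 experts
gates = top_p / top_p.sum(dim=-1, keepdim=True)  # renormalized mixing weights

# Load-balancing term: fraction of tokens routed to each expert times the
# mean router probability for that expert, summed over experts.
dispatch = F.one_hot(top_idx, n_experts).float().sum(dim=1)  # (tokens, n_experts)
frac_tokens = dispatch.mean(dim=0) / top_k
mean_probs = probs.mean(dim=0)
aux_loss = aux_coef * n_experts * (frac_tokens * mean_probs).sum()
```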
Compute Infrastructure
Hardware
- 1 Γ NVIDIA RTX 5090 GPU
Software
- PyTorch 2.8
- Transformers >=4.41
- Bantam CLI (required for import registration)
Model Card Authors
- Theodor Solbjorg β Lead Developer, Theoistic
Model Card Contact
For inquiries: theo@theoistic.com