Bantam Language Model

This model card provides a detailed overview of the BantamForCausalLM model, a transformer-based architecture designed for adaptive, efficient language modeling through hybrid dense and sparse computation.

Model Details

Model Description

The Bantam model is a 20-layer causal language model combining dense Transformer blocks with Mixture-of-Experts (MoE) layers. It features layer-wise dynamic attention, progressive context scaling, and grouped multi-query attention, designed for efficient large-scale language modeling with balanced compute utilization.

  • Developed by: Theoistic
  • Lead Developer: Theodor Solbjorg (theo@theoistic.com)
  • Funded by: Theoistic
  • Shared by: Theoistic
  • Model type: Causal Language Model (Transformer-based)
  • Language(s): Multilingual (55 languages, see dataset summary below)

Model Sources

  • Paper: Pending publication
  • Demo: Coming soon

Uses

Direct Use

The Bantam model can be used directly for text generation, completion, summarization, and instruction following. It supports context windows up to 2048 tokens and operates efficiently on GPUs using bfloat16 precision.
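
As a minimal, hedged sketch (assuming a CUDA device and that the bantam registration imports shown in the getting-started section below have already been run), the weights can be loaded in bfloat16 like this:

import torch
from transformers import AutoModelForCausalLM

# Load the checkpoint in bfloat16 and place it on the GPU.
# Requires the bantam imports that register the custom architecture (see getting-started below).
model = AutoModelForCausalLM.from_pretrained(
    "Theoistic/Bantam-285m", torch_dtype=torch.bfloat16
).to("cuda")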

Downstream Use

Bantam can be fine-tuned for downstream NLP tasks, such as translation, dialogue modeling, educational content generation, or creative writing. Its multilingual and mixed-domain dataset allows flexible adaptation.
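
A minimal fine-tuning sketch using the standard transformers Trainer API is shown below; the dataset file, batch size, and epoch count are illustrative placeholders, not values used for Bantam:

import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam  # registers config/model with AutoConfig/AutoModel
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding is needed for batched training
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")

# Hypothetical downstream corpus; replace with your own plain-text dataset.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bantam-finetuned", bf16=True,
                           per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()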

Out-of-Scope Use

The model is not intended for high-stakes or safety-critical domains such as legal, medical, or financial decision-making. It should not be used to generate misinformation, and its outputs should not be published or acted on without human oversight.

Bias, Risks, and Limitations

Bantam inherits biases from its multilingual datasets, which include content from the internet, curated knowledge bases, and open-source text corpora. It may underperform on underrepresented languages or dialects.

Additionally, because Bantam is a relatively small model (≈285M parameters), hallucinations and factual inaccuracies are expected, especially when it reasons beyond the scope of its training data.

Recommendations

Users should implement output filtering, content moderation, and continuous evaluation on domain-specific benchmarks to identify and mitigate bias or performance issues.

Model Capabilities

Bantam demonstrates strong multilingual competence across 55 languages and is capable of generating informative, coherent, and contextually aware text in each of them.

The model was designed to leverage many small attention heads in early layers to capture linguistic and grammatical structures, transitioning to larger, more abstract reasoning in later layers. This design improves logical coherence and narrative flow across diverse languages despite the model's compact size.

How to Get Started with the Model

Before loading the model, install the Bantam CLI:

pip install bantam-cli

If you want to run inference directly via the bantam-cli, you can run:

bantam-cli chat --model Theoistic/Bantam-285m

or initialize the model in Python:

import bantam  # lazy imports
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam  # registers config/model with AutoConfig/AutoModel

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")

prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt")

# Remove unsupported keys (like token_type_ids) before generation
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

The model was trained on the Bantam Dataset, a multilingual and multi-domain collection of JSONL files designed to support general-purpose language modeling. It includes content from knowledge bases, refined educational text, tiny fictional stories, and curated data for linguistic diversity.

Languages Covered

The dataset spans 55 languages, including:

  • English
  • Chinese (Mandarin, Wu, Cantonese)
  • Romance: Spanish, French, Portuguese, Italian, Romanian, Catalan
  • Indic & Dravidian: Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati
  • Slavic & Germanic: Russian, Polish, Czech, German, Danish, Swedish, Norwegian, Dutch, Faroese
  • Others: Arabic, Hebrew, Amharic, Turkish, Finnish, Korean, Japanese, Swahili, Vietnamese, Thai, Greek, Persian, and more.

Note: The dataset itself is not publicly released; this summary represents the linguistic and structural diversity of the data used for training.

Training Procedure

Preprocessing

A large portion of the training data was deduplicated, normalized, and categorized by field of study and language, then preprocessed into feature-rich, dense articles, with larger models used to reformat the source material into concise, detailed markdown. The resulting millions of articles were shuffled across languages so that no single domain or high-resource language would overshadow the rest and so that low-resource languages would not suffer from catastrophic forgetting.
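
A rough, hypothetical sketch of the kind of exact-deduplication and language-balanced shuffling described above (the real pipeline and its field names are not released; the `text` and `lang` keys are assumptions):

import hashlib
import json
import random
from collections import defaultdict

def dedup_and_shuffle(jsonl_paths, seed=0):
    """Drop exact-duplicate articles, then interleave records across languages
    so no single high-resource language dominates any stretch of the stream."""
    seen, by_lang = set(), defaultdict(list)
    for path in jsonl_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
                if digest in seen:
                    continue  # skip exact duplicates
                seen.add(digest)
                by_lang[record.get("lang", "und")].append(record)

    rng = random.Random(seed)
    for records in by_lang.values():
        rng.shuffle(records)

    # Round-robin over languages so low-resource languages stay interleaved.
    iters = [iter(v) for v in by_lang.values()]
    while iters:
        for it in list(iters):
            record = next(it, None)
            if record is None:
                iters.remove(it)
            else:
                yield record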

Training Hyperparameters

  • Parameters: 285 million
  • Precision: bfloat16 mixed precision
  • Optimizer: AdamW with weight decay
  • Batch size: 2048 tokens per GPU
  • Learning rate schedule: Cosine decay with warmup
  • Context length: 2048 tokens
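
The optimizer and learning-rate schedule listed above correspond to a standard setup along the following lines; this is only a sketch, and the peak learning rate, weight-decay value, and warmup/total step counts are illustrative assumptions (`model` and `dataloader` are assumed to already exist):

import torch
from transformers import get_cosine_schedule_with_warmup

# Hypothetical values: peak LR, weight decay, warmup and total steps are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=200_000
)

for step, batch in enumerate(dataloader):
    # bfloat16 mixed precision via autocast
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()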

Speeds, Sizes, Times

  • Training hardware: NVIDIA RTX 5090
  • Training duration: ~50 hours
  • Checkpoint size: ~285M parameters

Evaluation

Bantam is a pretrained base model, not fine-tuned or benchmarked with external metrics. Qualitatively, it exhibits:

  • Strong multilingual understanding and generation across 55 languages.
  • Coherent reasoning and informative responses.
  • Expected hallucinations due to small model size.

No quantitative metrics or interpretability visualizations (e.g., heatmaps, probing, or evaluation suites) have been produced yet.

Environmental Impact

  • Hardware Type: NVIDIA RTX 5090
  • Hours used: ~50
  • Cloud Provider: Local compute
  • Compute Region: N/A (local training)
  • Carbon Emitted: Estimated <0.05 tCO₂eq

Technical Specifications

Model Architecture and Objective

| Layer | Type | Query Heads | KV Heads | Head Dim | Groups | Intermediate Size | Window | MoE | Notes |
|-------|------|-------------|----------|----------|--------|-------------------|--------|-----|-------|
| 0 | Dense | 12 | 3 | 64 | 1 | 2304 | 128 | ❌ | Dense local linguistic encoding |
| 1 | Dense | 12 | 3 | 64 | 1 | 2304 | 128 | ❌ | Dense local linguistic encoding |
| 2 | Dense | 12 | 3 | 64 | 1 | 2368 | 128 | ❌ | Dense local attention |
| 3 | Dense | 12 | 3 | 64 | 1 | 2400 | – | ❌ | Transition layer |
| 4 | MoE | (6+3) | 3 | (80/96) | 2 | 2432 | 256 | ✅ | 6 experts, top-2 routing |
| 5 | Dense | (6+3) | 3 | (80/96) | 2 | 2368 | 256 | ❌ | Hybrid attention |
| 6 | Dense | (6+3) | 3 | (80/96) | 2 | 2432 | 256 | ❌ | Hybrid attention |
| 7 | Dense | (6+3) | 3 | (80/96) | 2 | 2368 | – | ❌ | Expanding context |
| 8 | Dense | 9 | 3 | 64/128 | 2 | 2304 | 256 | ❌ | Default grouped attention |
| 9 | Dense | 9 | 3 | 64/128 | 2 | 2368 | 256 | ❌ | Default grouped attention |
| 10 | Dense | 9 | 3 | 64/128 | 2 | 2400 | 256 | ❌ | Default grouped attention |
| 11 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 256 | ❌ | Default grouped attention |
| 12 | Dense | 9 | 3 | 64/128 | 2 | 2432 | – | ❌ | Expanding context |
| 13 | Dense | 9 | 3 | 64/128 | 2 | 2400 | 512 | ❌ | Logical attention expansion |
| 14 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 512 | ❌ | Logical attention expansion |
| 15 | Dense | 9 | 3 | 64/128 | 2 | 2432 | 512 | ❌ | Logical attention expansion |
| 16 | MoE | 9 | 3 | 64/128 | 2 | 2432 | 512 | ✅ | 8 experts, top-2 routing |
| 17 | MoE | 9 | 3 | 64/128 | 2 | 2432 | – | ✅ | 8 experts, top-2 routing |
| 18 | Dense | 9 | 3 | 64/128 | 2 | 2368 | 512 | ❌ | Output stabilization |
| 19 | Dense | 9 | 3 | 64/128 | 2 | 2400 | – | ❌ | Final dense layer |

Attention Group Defaults

| Group | Query Heads | KV Heads | Head Dim |
|---------|-------------|----------|----------|
| Group 1 | 3 | 1 | 128 |
| Group 2 | 6 | 2 | 64 |

These defaults apply to all layers unless explicitly overridden in layer-specific configurations.
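
To illustrate how grouped attention shares key/value heads across query heads, here is a hedged, self-contained sketch (not Bantam's actual implementation; causal masking is omitted and the hidden size of 384 is only an example) using Group 2's shape of 6 query heads, 2 KV heads, and head dimension 64:

import torch

def grouped_attention(x, wq, wk, wv, n_q_heads=6, n_kv_heads=2, head_dim=64):
    """Grouped-query attention: each KV head is shared by n_q_heads // n_kv_heads query heads."""
    bsz, seq, _ = x.shape
    q = (x @ wq).view(bsz, seq, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(bsz, seq, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(bsz, seq, n_kv_heads, head_dim).transpose(1, 2)

    # Repeat each KV head so it lines up with its group of query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(bsz, seq, n_q_heads * head_dim)
    return out

# Example shapes: 6*64 = 384 query channels, but only 2*64 = 128 key/value channels.
x = torch.randn(1, 16, 384)
wq = torch.randn(384, 6 * 64)
wk = torch.randn(384, 2 * 64)
wv = torch.randn(384, 2 * 64)
print(grouped_attention(x, wq, wk, wv).shape)  # torch.Size([1, 16, 384])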

  • Objective: Causal next-token prediction
  • Routing: Top-2 expert routing with a load-balancing loss coefficient of 0.01
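
A hedged sketch of top-2 expert routing with an auxiliary load-balancing loss of the kind referenced above (a generic Switch/Mixtral-style formulation, not Bantam's exact implementation; the 0.01 coefficient matches the value listed):

import torch
import torch.nn.functional as F
from torch import nn

class Top2MoE(nn.Module):
    def __init__(self, hidden, ffn, n_experts=8, aux_coef=0.01):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        ])
        self.n_experts, self.aux_coef = n_experts, aux_coef

    def forward(self, x):                      # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)
        weight, idx = probs.topk(2, dim=-1)    # top-2 routing
        weight = weight / weight.sum(-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(2):
            for e in range(self.n_experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weight[mask, slot, None] * self.experts[e](x[mask])

        # Load-balancing loss: expert usage frequency times mean router probability.
        load = F.one_hot(idx, self.n_experts).float().sum(1).mean(0)
        importance = probs.mean(0)
        aux_loss = self.aux_coef * self.n_experts * (load * importance).sum()
        return out, aux_loss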

Compute Infrastructure

Hardware

  • 1 Γ— NVIDIA RTX 5090 GPU

Software

  • PyTorch 2.8
  • Transformers >=4.41
  • Bantam CLI (required for import registration)

Model Card Authors

  • Theodor Solbjorg – Lead Developer, Theoistic

Model Card Contact

For inquiries: theo@theoistic.com
