arxiv:2509.05668

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Published on Sep 6
· Submitted by Stefan Schweter on Sep 9
Abstract

AI-generated summary: Llama-GENBA-10B, a trilingual foundation model, addresses English-centric bias by balancing English, German, and Bavarian training, achieving strong cross-lingual performance and setting new benchmarks for Bavarian.

We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.

Community



I am thinking about running it. Can you make any suggestions on how to avoid buying an H100?

Hey @sels-qvest ! Great question — avoiding an H100 is totally doable. Here are some practical suggestions depending on your use case:

Cloud-Based Solutions (No Upfront Cost)
RunPod / Vast.ai / Paperspace: Rent A100s, A6000s, or even multiple RTX 4090s by the hour

AWS / GCP / Azure: Use A100 or V100 instances only when needed

Hugging Face Inference Endpoints: Deploy models without managing hardware

Consumer GPU Alternatives
RTX 4090 (~$1.5-2k): Excellent for inference and fine-tuning with LoRA

Used RTX 3090 (~$800): Still great for many models, 24GB VRAM

Multi-GPU Setup: Combine 2x RTX 4090s for ~48GB effective VRAM

Optimization Strategies
Quantization: Run models in 4-bit or 8-bit (GPTQ, AWQ); see the sketch after this list

LoRA/QLoRA: Fine-tune large models on single consumer GPUs

Model Pruning & Distillation: Use smaller, efficient versions

Offloading: CPU/RAM offload for layers that don't fit
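
For the quantization and offloading points, here is a minimal sketch, assuming an NVIDIA GPU with recent transformers, accelerate and bitsandbytes installed; the model id is only an example placeholder, not a specific recommendation:

```python
# Minimal sketch: load a model in 4-bit and let accelerate offload to CPU RAM
# whatever doesn't fit in VRAM. Assumes transformers + accelerate + bitsandbytes
# on an NVIDIA GPU; the model id below is an example placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # swap in whichever checkpoint you want to run

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit NF4 weights cut VRAM use roughly 4x
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on the GPU first, spills the rest to CPU RAM
)

inputs = tokenizer("Servus, wia geht's da heid?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same loading code is also the usual starting point for QLoRA-style fine-tuning: you attach LoRA adapters (e.g. with the peft library) on top of the frozen 4-bit base instead of training the full weights.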

Questions to Consider:
Are you doing training or just inference?

What model size are you planning to run?

Do you need low latency or is batch processing okay?

My Recommendation:
Start with a cloud A100 or RTX 4090 rental to test your workflow, then decide if you need dedicated hardware.

What's your specific use case? That would help narrow down the best approach!

I have a chatbot running qwen-4.5:7b locally with some MCP services (scrapers, lookups etc.) on a 2022 MacBook Pro with 24GB RAM. Everything works fine (actually my whole setup was generated by GitHub Copilot, which worked pretty well to my surprise), with some Python + FastMCP glue. My idea is to implement a hot-swappable model feature that would save the history and context, unload Qwen, load another model from Hugging Face, ask that "specialized" model, stop that model, funnel the reply back to the main Qwen model, and restore its history/context. My chatbot can already search for models on Hugging Face (and paper abstracts on arXiv, plus similar lookups via MCP). It would be pretty neat to enhance my "ask an expert" function so it not only uses the services I expose via API, but also lets Qwen use a model it searched for on Hugging Face: unload itself, start the model it found, run the query, start Qwen again with the result, and give an answer to the user. The model above would be fun (I am German and it speaks Bavarian).

Thanks for the detailed explanation — now I understand your setup much better!

Your idea of a hot-swappable model system is really interesting. Running Qwen-4.5 7B locally on a 24GB MacBook Pro is already quite impressive, and your plan to temporarily unload Qwen, load a specialized model, get its answer, and then restore Qwen's context is totally possible, but there are a few technical challenges.

Here are some thoughts and suggestions:

• Model loading time
Loading a model from Hugging Face each time will be slow, especially on CPU/MPS. It may take several seconds or even minutes depending on the size. If you want smooth UX, you might consider keeping the “expert” models running as separate processes or using small quantized versions.

• Context and tokenizer compatibility
Different models use different tokenizers, so you should always restore Qwen’s context as plain text (not tokens) before re-encoding it. Otherwise you’ll get inconsistencies.

• Your architecture idea is valid
Qwen as the main “orchestrator” + temporary expert model = a solid design. You just need good state management (history saved as JSON) and careful handling of the startup/shutdown logic.

• Alternative option
Instead of unloading Qwen, you can simply call the expert model through:
– a local micro-service (FastAPI / MCP), as in the sketch after this list
– or a Hugging Face Inference Endpoint
That way Qwen stays in memory and your latency becomes much lower.

• Your idea is definitely doable
The concept of letting the chatbot search for a model on Hugging Face, load it, ask it something “in its specialty”, and then hand the answer back to Qwen is very cool — especially for dialect-specific models like the Bavarian one!
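
To illustrate the micro-service option above: a minimal sketch of a local "expert" service, assuming fastapi, uvicorn and transformers are installed (the route, model id and file name are illustrative placeholders, not anything defined by the paper):

```python
# expert_service.py - minimal sketch of a local "expert" micro-service.
# Your main Qwen process stays loaded and only POSTs to this endpoint.
# Assumes fastapi, uvicorn and transformers; the model id is a placeholder.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
expert = pipeline("text-generation", model="gpt2")  # placeholder: load your expert model here


class Query(BaseModel):
    prompt: str
    max_new_tokens: int = 200


@app.post("/generate")
def generate(q: Query):
    out = expert(q.prompt, max_new_tokens=q.max_new_tokens, return_full_text=False)
    return {"reply": out[0]["generated_text"]}

# Run with: uvicorn expert_service:app --port 8001
```

Your orchestrator (or an MCP tool wrapping it) then just sends the sub-question to /generate and folds the reply back into Qwen's context, so nothing has to be unloaded.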

If you need, I can share an example architecture or a small FastAPI/MCP snippet showing how to orchestrate the model swap.
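
To make that concrete right away, here is a rough, untested sketch of the swap loop itself, assuming transformers is installed and keeping the history as plain-text chat messages so each model re-encodes it with its own chat template (model ids and helper names are illustrative placeholders, not a fixed API):

```python
# Rough sketch of a hot-swap loop: persist history as plain text, unload the
# main model, query the "expert", then reload the main model with the expert's
# answer appended. Model ids and helpers here are placeholders.
import gc
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def load(model_id):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    return tok, model


def free_memory():
    # call after del'ing the last reference to a model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()


def ask(tok, model, messages, max_new_tokens=300):
    # messages is a list of {"role": ..., "content": ...} dicts (plain text),
    # so any model's own chat template can re-encode it without token mismatches.
    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


MAIN_ID = "Qwen/Qwen2.5-7B-Instruct"    # adjust to the Qwen checkpoint you actually run
EXPERT_ID = "some-org/bavarian-expert"  # whatever the Hugging Face search step found (placeholder)

history = [{"role": "user", "content": "Explain this in Bavarian, please."}]
with open("history.json", "w") as f:    # persist state before the swap
    json.dump(history, f)

main_tok, main_model = load(MAIN_ID)
# ... normal chat with the main model ...
del main_model
free_memory()                            # free RAM/VRAM for the expert

expert_tok, expert_model = load(EXPERT_ID)
expert_reply = ask(expert_tok, expert_model, history)
del expert_model
free_memory()

with open("history.json") as f:          # restore the saved context as plain text
    history = json.load(f)
history.append({"role": "assistant", "content": expert_reply})  # funnel the answer back
main_tok, main_model = load(MAIN_ID)     # bring the main model back with the updated history
```

Whether this beats the micro-service route mostly comes down to swap frequency: reloading a 7-10B model from disk on every "ask an expert" call will dominate your latency.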

