# ModernBERT Chunker Base
This model is a fine-tuned version of ModernBERT-base, specialized in semantic boundary detection. It is designed to be used with the `fine-chunker` library for high-quality text segmentation in RAG applications.
## Model Highlights
- Context Length: 8192 tokens (full ModernBERT capacity).
- Architecture: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
- Training Strategy: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
- Languages: Bilingual support for Polish and English.
## Usage
The easiest way to use this model is through the official library:
```python
from fine_chunker import Chunker

# Load the model (runs optimally on CUDA or CPU)
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)

text = "Your long multi-topic document..."
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")
```
## Training Details

### Dataset
The model was trained on the Wikipedia 20231101 dump in both Polish and English.
- Preprocessing: Full articles were cleaned of wiki noise (references, external links, metadata). Additionally, the first letter of 40% of chunk starts was lowercased, and the trailing period of 40% of chunks was removed.
- Ground Truth: Segmentation labels were derived from natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
- Packing: Multiple articles were packed into single 8192-token sequences to maximize training efficiency.
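The two chunk augmentations described above can be sketched as follows. This is a minimal illustration based on the percentages stated in this card; the actual training pipeline code is not published, so function and parameter names here are assumptions:

```python
import random

def augment_chunk(chunk: str, rng: random.Random) -> str:
    """Illustrative augmentation: with 40% probability lowercase the first
    letter of a chunk, and with 40% probability drop its trailing period."""
    if rng.random() < 0.4 and chunk and chunk[0].isupper():
        chunk = chunk[0].lower() + chunk[1:]
    if rng.random() < 0.4 and chunk.endswith("."):
        chunk = chunk[:-1]
    return chunk
```

Augmentations like these discourage the model from relying on superficial cues (capital letters, sentence-final periods) and push it toward genuinely semantic boundary signals.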
### Training Configuration
- Hardware: 4x NVIDIA A100-SXM4-40GB.
- Duration: 1 day, 6 hours, 1 minute.
- Precision: `bfloat16` with Flash Attention 2.
- Epochs: 1.
- Optimization:
  - Loss Function: Weighted Cross-Entropy (`[1.0, 7.0]`) to address boundary sparsity.
  - Gradient Accumulation: 8 steps.
  - Dropout: 0.1.
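The weighted cross-entropy setup can be sketched in a few lines of PyTorch. The class weights `[1.0, 7.0]` come from this card; the toy logits and labels are purely illustrative:

```python
import torch
import torch.nn as nn

# Boundary tokens (class 1) are rare in packed sequences, so their errors
# are up-weighted 7x relative to non-boundary tokens (class 0).
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 7.0]))

logits = torch.randn(8192, 2)          # per-token logits for one packed sequence
labels = torch.randint(0, 2, (8192,))  # 1 = boundary, 0 = non-boundary
loss = loss_fn(logits, labels)
```

Without such weighting, a token classifier on packed 8192-token sequences could reach a low loss by predicting "non-boundary" almost everywhere.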
## Architecture Details
Unlike standard token classifiers that use a single linear layer, this model uses a deep classification head:
`Linear(hidden_size, hidden_size)` → `ReLU` → `Dropout(0.1)` → `Linear(hidden_size, 2)` (boundary vs. non-boundary)
This allows the model to learn more complex semantic cues for segmentation.
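The head described above corresponds to a small PyTorch module like the following. This is a sketch, not the published implementation; `hidden_size=768` is ModernBERT-base's hidden dimension:

```python
import torch
import torch.nn as nn

class DeepClassificationHead(nn.Module):
    """Sketch of the Linear-ReLU-Dropout-Linear head described above."""
    def __init__(self, hidden_size: int = 768, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 2),  # boundary vs. non-boundary
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        return self.net(hidden_states)  # (batch, seq_len, 2)
```

The extra non-linear layer gives the head more capacity than a single linear projection, at a negligible cost relative to the encoder itself.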
## Intended Use
- RAG Pipelines: Generating semantic chunks that preserve context better than fixed-size splitting.
- Long Document Analysis: Segmenting reports, legal documents, or books into logical chapters/sections.
- Pre-processing for LLMs: Ensuring input fragments are semantically complete.
## Limitations & Future Work
- Training Data Focus: The current version was trained exclusively on Wikipedia datasets (English and Polish). While it excels at structured, informative prose, it hasn't been exposed to noisy data, conversational text, or specific journalistic styles (news).
- Base Model Version: This is a general-purpose base model. While it performs excellently on standard structured text, specialized domains (e.g., legal contracts, medical records, or minified code) might require additional fine-tuning for optimal boundary detection.
- Logical Structure: Performance is best on documents with clear paragraph breaks and logical flow, similar to the encyclopedic style of its training data.
- Niche Domains: If you're working with datasets far removed from Wikipedia's structure, feel free to reach out or share your feedback; we're looking into domain-specific refinements.
## Evaluation
Status: Under Development

> Systematic evaluation of the model's performance across different domains and languages is currently in progress.
## Author
Developed by Jerzy Boksa.
- Contact: devjerzy@gmail.com
- GitHub: fine-chunker
## Acknowledgements
This model was trained using infrastructure provided by Cyfronet (Academic Computer Centre Cyfronet AGH) as part of an educational grant.
## Citation
If you use this model or the fine-chunker library in your research or project, please cite it as follows:
```bibtex
@misc{boksa2026modernbertchunker,
  author       = {Jerzy Boksa},
  title        = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
}
```
Base model: answerdotai/ModernBERT-base