ModernBERT Chunker Base πŸš€

This model is a fine-tuned version of ModernBERT-base, specialized for semantic boundary detection. It is designed to be used with the fine-chunker library for high-quality text segmentation in RAG (retrieval-augmented generation) applications.

Model Highlights

  • Context Length: 8192 tokens (full ModernBERT capacity).
  • Architecture: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
  • Training Strategy: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
  • Languages: Bilingual support for Polish and English.

Usage

The easiest way to use this model is through the official library:

from fine_chunker import Chunker

# Load the model (device="cuda" is also supported)
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)

text = "Your long multi-topic document..."
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")

Training Details

Dataset

The model was trained on Wikipedia (20231101 version) for both Polish and English.

  • Preprocessing: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter of 40% of chunks was lowercased and the trailing period of 40% of chunks was removed, so the model cannot rely on capitalization or punctuation alone to spot boundaries.
  • Ground Truth: Segmentation was based on natural paragraph boundaries (\n\n) found in well-structured Wikipedia articles.
  • Packing: Multiple articles were packed into single 8192 token sequences to maximize training efficiency.
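
The boundary-noise augmentation described above can be sketched as follows. The function name and exact implementation are illustrative, not the project's actual preprocessing code; only the two 40%-probability perturbations come from the description.

```python
import random

def augment_chunk(chunk: str, rng: random.Random, p: float = 0.4) -> str:
    """Randomly remove surface cues that give away chunk boundaries."""
    # With probability p, lowercase the chunk's first letter.
    if chunk and rng.random() < p:
        chunk = chunk[0].lower() + chunk[1:]
    # With probability p, drop the chunk's trailing period.
    if chunk.endswith(".") and rng.random() < p:
        chunk = chunk[:-1]
    return chunk

rng = random.Random(0)
print(augment_chunk("The quick brown fox jumps.", rng))
```

Applying both perturbations independently means roughly 16% of chunks lose both cues, forcing the classifier to rely on semantic context.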

Training Configuration

  • Hardware: 4x NVIDIA A100-SXM4-40GB.
  • Duration: 1 day, 6 hours, 1 minute.
  • Precision: bfloat16 with Flash Attention 2.
  • Epochs: 1.
  • Optimization:
    • Loss Function: Weighted Cross-Entropy ([1.0, 7.0]) to address boundary sparsity.
    • Gradient Accumulation: 8 steps.
    • Dropout: 0.1.
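
The weighted loss above compensates for the fact that boundary tokens are far rarer than non-boundary tokens. A minimal PyTorch sketch of this setup (fake logits and labels, seed fixed for reproducibility):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Class weights [non-boundary, boundary] as stated in the training config:
# boundary errors are penalized 7x to counteract boundary sparsity.
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 7.0]))

# Fake per-token logits for 6 tokens (batch flattened): [non-boundary, boundary].
logits = torch.randn(6, 2)
labels = torch.tensor([0, 0, 0, 1, 0, 0])  # a single boundary token

loss = loss_fn(logits, labels)
print(loss.item())
```

With `weight`, `CrossEntropyLoss` averages per-token losses using the weight of each token's true class, so the lone boundary token contributes much more to the gradient than any individual non-boundary token.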

Architecture Details

Unlike standard token classifiers that use a single linear layer, this model uses a deep classification head:

  1. Linear(hidden_size, hidden_size)
  2. ReLU
  3. Dropout(0.1)
  4. Linear(hidden_size, 2) (Boundary vs. Non-boundary)

This allows the model to learn more complex semantic cues for segmentation.
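
The head listed above can be sketched in PyTorch as follows (the 768 hidden size is ModernBERT-base's dimension; the encoder itself is omitted):

```python
import torch
import torch.nn as nn

hidden_size = 768  # ModernBERT-base hidden dimension

# Deep classification head: Linear -> ReLU -> Dropout -> Linear.
head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(hidden_size, 2),  # boundary vs. non-boundary logits
)

# One 8192-token sequence of encoder hidden states -> per-token logits.
hidden_states = torch.randn(1, 8192, hidden_size)
logits = head(hidden_states)
print(logits.shape)
```

Because `nn.Linear` operates on the last dimension, the same head is applied independently at every token position, producing one boundary decision per token.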

Intended Use

  • RAG Pipelines: Generating semantic chunks that preserve context better than fixed-size splitting.
  • Long Document Analysis: Segmenting reports, legal documents, or books into logical chapters/sections.
  • Pre-processing for LLMs: Ensuring input fragments are semantically complete.
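
To illustrate the RAG use case above, here is a toy retrieval index over pre-segmented chunks. The bag-of-words "embedding" and hard-coded chunks are stand-ins for demonstration only; a real pipeline would embed the chunker's output with a sentence-embedding model.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" (illustration only).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunks as a semantic chunker might produce them (hard-coded for the demo).
chunks = [
    "The Vistula is the longest river in Poland.",
    "ModernBERT extends BERT with an 8192-token context window.",
]
index = [(c, embed(c)) for c in chunks]

query = embed("longest river in Poland")
best = max(index, key=lambda item: cosine(query, item[1]))
print(best[0])
```

Semantically complete chunks matter here because each chunk is embedded and retrieved as a unit: a boundary placed mid-thought degrades both the embedding and the retrieved context.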

Limitations & Future Work

  • Training Data Focus: The current version was trained exclusively on Wikipedia datasets (English and Polish). While it excels at structured, informative prose, it hasn't been exposed to noisy data, conversational text, or specific journalistic styles (news).
  • Base Model Version: This is a general-purpose base model. While it performs excellently on standard structured text, specialized domains (e.g., legal contracts, medical records, or minified code) might require additional fine-tuning for optimal boundary detection.
  • Logical Structure: Performance is best on documents with clear paragraph breaks and logical flow, similar to the encyclopedic style of its training data.
  • Niche Domains: If you're working with datasets far removed from Wikipedia's structure, feel free to reach out or share your feedbackβ€”we're looking into domain-specific refinements.

Evaluation

Status: Under Development. Systematic evaluation of the model's performance across different domains and languages is currently in progress.

Author

Developed by Jerzy Boksa. Contact: devjerzy@gmail.com. GitHub: fine-chunker.

Acknowledgements

This model was trained using the infrastructure provided by Cyfronet (Academic Computer Centre Cyfronet AGH) as part of an educational grant.

Citation

If you use this model or the fine-chunker library in your research or project, please cite it as follows:

@misc{boksa2026modernbertchunker,
  author = {Jerzy Boksa},
  title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
}