Reasoning Complexity Classifier

A ModernBERT-base model fine-tuned to predict the reasoning complexity of educational text on a continuous 1–4 scale. Trained on FineWeb-Edu documents labeled by GPT-5-nano via the OpenAI Batch API (~$20 in credits).

Model Description

This is a regression model (num_labels=1, problem_type="regression") that outputs a continuous score. The score can be rounded to the nearest integer to obtain a discrete complexity level. Level 5 (Formal/Abstract reasoning) was excluded from training due to data scarcity; the model's effective range is 1.0–4.0.

Complexity Levels

| Level | Name | Description | Example |
|-------|------|-------------|---------|
| 1 | Factual/Declarative | States facts with no reasoning | "The Pacific Ocean covers ~165 million km²." |
| 2 | Single-step reasoning | One inference or comparison | "Because boiling point decreases at altitude, water boils faster in Denver than in Miami." |
| 3 | Multi-step reasoning | 2–4 chained logical steps | "Demand rose while supply held fixed → prices rose → consumer spending fell → GDP slowed." |
| 4 | Complex reasoning | 5+ steps, conditionals, competing factors | Medical differential diagnosis with branching conditions and exclusion criteria. |

Training Details

Data

  • Source: FineWeb-Edu, a curated subset of Common Crawl filtered for educational content.
  • Labeling: ~100,000 documents reservoir-sampled from ~6,000 records per subject category, then labeled with GPT-5-nano via the OpenAI Batch API using structured output (integer 1–5).
  • Splits: 80% train / 10% validation / 10% test (stratified by integer complexity level).
  • Preprocessing: Texts truncated to 8,000 characters before labeling; tokenized to 512 tokens during training with dynamic padding.
  • Level 5 exclusion: Rows labeled as level 5 were excluded from the training set.
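The split procedure above can be sketched in plain Python. This is a minimal illustration of an 80/10/10 split stratified by integer label, not the project's actual pipeline; the `rows` structure, `key` callable, and seed are assumptions:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/val/test, preserving the distribution of key(row).

    Mirrors the card's 80/10/10 split stratified by integer complexity level.
    """
    by_level = defaultdict(list)
    for row in rows:
        by_level[key(row)].append(row)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for level_rows in by_level.values():
        rng.shuffle(level_rows)  # shuffle within each level, then slice
        n = len(level_rows)
        n_train = int(n * ratios[0])
        n_val = int(n * ratios[1])
        train.extend(level_rows[:n_train])
        val.extend(level_rows[n_train:n_train + n_val])
        test.extend(level_rows[n_train + n_val:])
    return train, val, test

# Per the card, level-5 rows are dropped before splitting:
# rows = [r for r in rows if r["label"] != 5]
```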

Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | answerdotai/ModernBERT-base |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (CUDA) |
| Loss | MSE |
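The warmup ratio translates into a concrete step count for the linear scheduler. A rough calculation, assuming ~80,000 training rows (80% of the ~100k labeled documents; the exact count after dropping level-5 rows is not stated):

```python
import math

train_rows = 80_000   # assumed: ~100k labeled docs * 0.8, before level-5 removal
batch_size = 32
epochs = 3
warmup_ratio = 0.1

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)  # LR ramps 0 -> 2e-5 here, then decays linearly

print(steps_per_epoch, total_steps, warmup_steps)  # 2500 7500 750
```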

Training History

| Epoch | Train Loss | Val MAE | Val Acc (rounded) | Val Spearman r |
|-------|-----------|---------|-------------------|----------------|
| 1 | 0.6002 | 0.5190 | 56.98% | 0.7533 |
| 2 | 0.3631 | 0.5040 | 58.43% | 0.7597 |
| 3 | 0.2040 | 0.5114 | 58.19% | 0.7485 |

The best checkpoint (by validation MAE) was saved at epoch 2.

Evaluation Results

Evaluated on a held-out test set:

| Metric | Value |
|--------|-------|
| MSE | 0.4388 |
| MAE | 0.5063 |
| Rounded accuracy | 58.6% |
| Spearman r | 0.7527 |

Interpretation: The model achieves a Spearman correlation of ~0.75 with the gold labels, indicating strong ordinal ranking ability. The MAE of ~0.51 means predictions deviate from the true score by about half a level on average when the output is treated as a continuous signal.
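These metrics are easy to reproduce from raw predictions. A pure-Python sketch (the evaluation code itself is not published, so this is an illustration; Spearman r would come from e.g. scipy.stats.spearmanr and is omitted here):

```python
def regression_metrics(preds, golds):
    """MSE, MAE, and rounded accuracy for continuous 1-4 predictions."""
    n = len(preds)
    mse = sum((p - g) ** 2 for p, g in zip(preds, golds)) / n
    mae = sum(abs(p - g) for p, g in zip(preds, golds)) / n
    clip = lambda x: min(max(x, 1.0), 4.0)  # keep predictions in the effective range
    acc = sum(round(clip(p)) == round(g) for p, g in zip(preds, golds)) / n
    return {"mse": mse, "mae": mae, "rounded_acc": acc}

# Toy example with four predictions against gold levels 1-4:
print(regression_metrics([1.2, 2.6, 2.9, 3.8], [1.0, 2.0, 3.0, 4.0]))
```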

Output Interpretation

| Raw score | Meaning |
|-----------|---------|
| ~1.0 | Factual/Declarative |
| ~2.0 | Single-step reasoning |
| ~3.0 | Multi-step reasoning |
| ~4.0 | Complex reasoning |

Clip and round the raw float output to [1, 4] for a discrete level.
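Putting that together, a minimal inference sketch (the model id matches this repository; model loading is deferred into the function so the helpers stay importable without downloading weights):

```python
def to_level(score: float) -> int:
    """Clip the raw regression output to [1, 4], then round to a discrete level."""
    return round(min(max(score, 1.0), 4.0))

def score_texts(texts, model_id="mdonigian/fineweb-edu-complexity-classifier"):
    """Return continuous complexity scores for a list of texts."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(texts, truncation=True, max_length=512,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (batch, 1): raw regression scores
    return logits.squeeze(-1).tolist()

# Usage (downloads weights on first call):
# scores = score_texts(["The Pacific Ocean covers ~165 million km²."])
# levels = [to_level(s) for s in scores]
```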

Architecture

Based on answerdotai/ModernBERT-base:

  • Layers: 22 transformer layers (alternating full and sliding attention)
  • Hidden size: 768
  • Attention heads: 12
  • Intermediate size: 1,152
  • Max position embeddings: 8,192
  • Classifier pooling: mean
  • Classifier activation: GELU

Limitations

  • Labels are silver-standard (GPT-5-nano), not human-annotated; label noise is most likely in the roughly 1.5% of texts that are ambiguous.
  • Texts are truncated to 512 tokens; very long documents are judged on their first ~512 tokens only.
  • Trained primarily on English educational web text; performance may degrade on other domains or languages.
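One workaround for the 512-token truncation is to split long documents into overlapping windows and average the per-window scores. A sketch, using whitespace-separated words as a rough token proxy and accepting any scorer callable; the window and stride values are assumptions, not tuned settings:

```python
def chunked_score(text, scorer, window=350, stride=300):
    """Average a scorer over overlapping word windows of a long document.

    window=350 words keeps each chunk under the 512-token limit for typical
    English text; stride < window gives overlap between adjacent chunks.
    """
    words = text.split()
    if len(words) <= window:
        return scorer(" ".join(words))
    scores = []
    for start in range(0, len(words) - window + stride, stride):
        chunk = " ".join(words[start:start + window])
        scores.append(scorer(chunk))
    return sum(scores) / len(scores)
```

Averaging blurs within-document variation; taking the max per-window score is an alternative when the goal is to surface any complex passage.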

Intended Use

Designed for data curation pipelines that need to filter or balance training corpora by reasoning complexity, for example constructing curriculum-ordered datasets for language model training.
