CEFR Prototype-based Model (k=3)

This is a prototype-based classifier for estimating the CEFR level of Swedish text. It ships with AutoConfig/AutoModel support and is compatible with Hugging Face Transformers.

Model Details

Architecture

  • Base Model: KB/bert-base-swedish-cased
  • Prototypes: 3 prototypes per CEFR level
  • Total Prototypes: 18 (6 levels × 3 prototypes)
  • Classification: Cosine similarity with temperature scaling

Key Features

  • Mean pooling on BERT layer -2 (11th layer for BERT-base)
  • Temperature scaling: 10.0
  • L2-normalized embeddings and prototypes
  • Prototypes averaged per class at inference time (see the scoring sketch below)
  • SafeTensors format for efficient loading
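
Putting the features above together, here is a minimal sketch of the scoring step. The function name and shapes are illustrative assumptions; the actual implementation lives in CEFRPrototypeModel.

import torch
import torch.nn.functional as F

def score(sentence_embeddings, prototypes, temperature=10.0, num_labels=6, k=3):
    # sentence_embeddings: (B, H) mean-pooled hidden states from BERT layer -2
    # prototypes: (num_labels * k, H) learned prototype vectors
    emb = F.normalize(sentence_embeddings, dim=-1)            # L2-normalize embeddings
    protos = F.normalize(prototypes, dim=-1).view(num_labels, k, -1)
    class_protos = F.normalize(protos.mean(dim=1), dim=-1)    # average the k prototypes per class
    logits = temperature * emb @ class_protos.T               # scaled cosine similarity
    return logits                                             # (B, num_labels)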

CEFR Levels

  • 0: A1 (Beginner)
  • 1: A2 (Elementary)
  • 2: B1 (Intermediate)
  • 3: B2 (Upper Intermediate)
  • 4: C1 (Advanced)
  • 5: C2 (Proficient)

Usage

Installation

pip install torch transformers

Quick Start

import torch
from transformers import AutoTokenizer

# Load model and tokenizer
model_name = "fffffwl/swe-cefr-sp"

# If you have the model class locally:
from convert_proto_model_to_hf import CEFRPrototypeModel
model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example text
text = "Jag heter Anna och jag kommer från Sverige."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model(**inputs)

# Get predictions
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

# Map to CEFR level
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")

Model Implementation

Custom Classes

import torch
from transformers import PretrainedConfig, PreTrainedModel


class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs
    ):
        super().__init__(num_labels=num_labels, **kwargs)
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size


class CEFRPrototypeModel(PreTrainedModel):
    config_class = CEFRProtoConfig

    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling over the hidden states of BERT layer -2,
        # followed by L2 normalization of the sentence embedding.
        ...

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity between the sentence embedding and the
        # class-averaged prototypes, scaled by the temperature, gives the logits.
        ...
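
Because the model card advertises AutoConfig/AutoModel support, the custom classes can be registered with the Auto API so that the usual from_pretrained calls resolve them. A sketch, assuming CEFRProtoConfig and CEFRPrototypeModel are importable as above:

from transformers import AutoConfig, AutoModel, AutoTokenizer

AutoConfig.register("cefr_prototype", CEFRProtoConfig)
AutoModel.register(CEFRProtoConfig, CEFRPrototypeModel)

config = AutoConfig.from_pretrained("fffffwl/swe-cefr-sp")
model = AutoModel.from_pretrained("fffffwl/swe-cefr-sp", config=config)
tokenizer = AutoTokenizer.from_pretrained("fffffwl/swe-cefr-sp")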

Performance

On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):

  • Macro-F1: 84.1%
  • Quadratic Weighted Kappa (QWK): 94.6%
  • Outperforms the fine-tuned BERT baseline by 12.1% in macro-F1 (see the evaluation sketch below)
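
For reference, both metrics can be computed with scikit-learn. The label arrays below are placeholders, not the actual evaluation data:

from sklearn.metrics import cohen_kappa_score, f1_score

# Integer CEFR labels in {0, ..., 5}
y_true = [0, 1, 2, 3, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 5, 5, 1, 3]

macro_f1 = f1_score(y_true, y_pred, average="macro")
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"Macro-F1: {macro_f1:.3f}, QWK: {qwk:.3f}")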

Training Details

Dataset

  • Swedish CEFR-annotated sentences
  • Multi-level annotations (low/high boundaries)
  • Sentence-level classification

Training Configuration

  • Optimizer: AdamW
  • Loss: Cross-entropy with class weighting
  • Prototype initialization: mean of class embeddings plus orthogonalization (see the sketch below)
  • Temperature: 10.0 (trainable during fine-tuning)
  • Layer: -2 (11th BERT layer)
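
A minimal sketch of what "mean of class embeddings + orthogonalization" could look like. The helper name, the noise term, and the use of a QR decomposition for orthogonalization are illustrative assumptions, not the actual training code:

import torch
import torch.nn.functional as F

def init_prototypes(embeddings, labels, num_labels=6, k=3):
    # embeddings: (N, H) L2-normalized sentence embeddings
    # labels: (N,) integer CEFR labels
    protos = []
    for c in range(num_labels):
        class_mean = embeddings[labels == c].mean(dim=0)
        # Start the k prototypes of each class near the class centroid.
        protos.append(class_mean.repeat(k, 1) + 0.01 * torch.randn(k, embeddings.size(1)))
    protos = torch.cat(protos, dim=0)          # (num_labels * k, H)
    # Orthogonalize the prototypes (QR as one option), then re-normalize.
    q, _ = torch.linalg.qr(protos.T)           # columns of q are orthonormal
    return F.normalize(q.T, dim=-1)            # (num_labels * k, H)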

Model Files

  • model.safetensors - Model weights (476MB)
  • config.json - Model configuration
  • tokenizer.json - Tokenizer vocabulary
  • tokenizer_config.json - Tokenizer configuration

Limitations

  • Model is trained specifically for Swedish text
  • Sentence-level classification (not document-level)
  • Requires sentences of reasonable length (recommended: 8-128 tokens); see the length-check sketch below
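
A simple way to enforce the recommended length range before prediction (the 8-128 bounds come from the recommendation above; the helper itself is illustrative):

def is_reasonable_length(text, tokenizer, min_tokens=8, max_tokens=128):
    # Count tokens without special tokens so the bounds refer to the sentence itself.
    n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return min_tokens <= n_tokens <= max_tokens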

Citations

If you use this model in your research, please cite:

@misc{fan2024swedish,
  title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
  author={Fan, Wenlin},
  year={2024},
  howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}

Or as part of the broader project:

@misc{fan2024swecefrsp,
  title={Swedish CEFR Sentence-level Assessment using Large Language Models},
  author={Fan, Wenlin},
  year={2024},
  publisher={GitHub},
  howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
  note={Dataset, LLM annotation code, and sentence-level assessment code available}
}

Project Links

  • GitHub Repository: https://github.com/fanwenlin/swe-cefr-sp
  • Hugging Face Space: Available with interactive demo
  • Dataset: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
  • Main Model: This prototype-based model (k=3) with Swedish BERT

Related Work

This work builds upon:

  • Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
  • Volodina et al. (2016): Swedish passage readability assessment
  • Scarton et al. (2018): Controllable text simplification

License

This model is released under the MIT License. See LICENSE file for details.

Related Models

This repository also contains:

  • Original k=1 checkpoint: metric-proto-k1.pt
  • Original k=3 checkpoint: metric-proto-k3.pt (this model)
  • Original k=5 checkpoint: metric-proto-k5.pt
  • BERT baseline: bert-baseline.pt
  • Megatron version: metric-proto-megatron-k3.pt
  • Traditional ML models: linear_regression.joblib, logreg.joblib, svm.joblib, mlp.joblib, tree.joblib

For more details, visit the project repository.
