CEFR Prototype-based Model (k=3)
This is a prototype-based classifier for estimating the CEFR level of Swedish text. It ships with AutoConfig/AutoModel support and is compatible with Hugging Face Transformers.
Model Details
Architecture
- Base Model: KB/bert-base-swedish-cased
- Prototypes: 3 prototypes per CEFR level
- Total Prototypes: 18 (6 levels × 3 prototypes)
- Classification: Cosine similarity with temperature scaling
Key Features
- Mean pooling on BERT layer -2 (11th layer for BERT-base)
- Temperature scaling: 10.0
- L2-normalized embeddings and prototypes
- Prototypes averaged per class during inference (see the scoring sketch after this list)
- SafeTensors format for efficient loading
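In plain terms, the score for class c is the cosine similarity between the L2-normalized sentence embedding and the L2-normalized (averaged) prototype of class c, scaled by the temperature. A toy illustration of that scoring step (random tensors, not the model's actual code):

import torch
import torch.nn.functional as F

emb = F.normalize(torch.randn(1, 768), dim=-1)      # pooled sentence embedding (toy)
protos = F.normalize(torch.randn(6, 768), dim=-1)   # one averaged prototype per CEFR level (toy)
logits = 10.0 * emb @ protos.T                      # temperature-scaled cosine similarities
probs = logits.softmax(dim=-1)                      # distribution over the 6 levels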
CEFR Levels
- 0: A1 (Beginner)
- 1: A2 (Elementary)
- 2: B1 (Intermediate)
- 3: B2 (Upper Intermediate)
- 4: C1 (Advanced)
- 5: C2 (Proficient)
Usage
Installation
pip install torch transformers
Quick Start
import torch
from transformers import AutoTokenizer
# Load model and tokenizer
model_name = "fffffwl/swe-cefr-sp"
# If you have the model class locally:
from convert_proto_model_to_hf import CEFRPrototypeModel
model = CEFRPrototypeModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Example text
text = "Jag heter Anna och jag kommer från Sverige."
# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
# Get predictions
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()
# Map to CEFR level
cefr_labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
print(f"Text: {text}")
print(f"Predicted CEFR level: {cefr_labels[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
Model Implementation
Custom Classes
import torch
from transformers import PretrainedConfig, PreTrainedModel

class CEFRProtoConfig(PretrainedConfig):
    model_type = "cefr_prototype"

    def __init__(
        self,
        encoder_name: str = "KB/bert-base-swedish-cased",
        num_labels: int = 6,
        prototypes_per_class: int = 3,
        temperature: float = 10.0,
        layer_index: int = -2,
        hidden_size: int = 768,
        **kwargs,
    ):
        # Store the prototype hyperparameters on the config
        super().__init__(num_labels=num_labels, **kwargs)
        self.encoder_name = encoder_name
        self.prototypes_per_class = prototypes_per_class
        self.temperature = temperature
        self.layer_index = layer_index
        self.hidden_size = hidden_size

class CEFRPrototypeModel(PreTrainedModel):
    config_class = CEFRProtoConfig

    def encode(self, input_ids, attention_mask, token_type_ids=None) -> torch.Tensor:
        # Mean pooling over BERT layer -2, followed by L2 normalization
        pass

    def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
        # Cosine similarity against the class prototypes, scaled by the temperature
        pass
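The stubs above can be fleshed out from the description in this card. Below is a minimal sketch of how encode and forward could look (written as free functions for readability; in the model they are methods, and attribute names such as self.encoder and self.prototypes are assumptions about the actual implementation):

import torch.nn.functional as F
from transformers.modeling_outputs import SequenceClassifierOutput

def encode(self, input_ids, attention_mask, token_type_ids=None):
    # Run the Swedish BERT backbone and keep all hidden states
    outputs = self.encoder(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        output_hidden_states=True,
    )
    hidden = outputs.hidden_states[self.config.layer_index]          # layer -2
    # Mean pooling over non-padding tokens
    mask = attention_mask.unsqueeze(-1).type_as(hidden)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(pooled, dim=-1)                               # L2-normalized embedding

def forward(self, input_ids, attention_mask, token_type_ids=None, labels=None):
    emb = self.encode(input_ids, attention_mask, token_type_ids)     # (batch, hidden)
    # Average the k prototypes of each class, then L2-normalize
    protos = F.normalize(self.prototypes.mean(dim=1), dim=-1)        # (num_labels, hidden)
    # Temperature-scaled cosine similarities as logits
    logits = self.config.temperature * emb @ protos.T
    loss = None
    if labels is not None:
        loss = F.cross_entropy(logits, labels)
    return SequenceClassifierOutput(loss=loss, logits=logits)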
Performance
On the Swedish CEFR sentence dataset (10k sentences from COCTAILL, 8 Sidor, and SUC3):
- Macro-F1: 84.1%
- Quadratic Weighted Kappa (QWK): 94.6%
- Comparison: outperforms the BERT baseline by 12.1% in macro-F1 (see the metric sketch below)
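For reference, both metrics can be computed with scikit-learn; the snippet below uses toy labels purely to show the calls, not the project's evaluation script:

from sklearn.metrics import cohen_kappa_score, f1_score

# Integer CEFR labels (0=A1 ... 5=C2); toy values for illustration only
y_true = [0, 1, 2, 3, 4, 5, 2, 3]
y_pred = [0, 1, 2, 3, 4, 4, 2, 2]

macro_f1 = f1_score(y_true, y_pred, average="macro")
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"Macro-F1: {macro_f1:.3f}  QWK: {qwk:.3f}")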
Training Details
Dataset
- Swedish CEFR-annotated sentences
- Multi-level annotations (low/high boundaries)
- Sentence-level classification
Training Configuration
- Optimizer: AdamW
- Loss: Cross-entropy with class weighting
- Prototype initialization: mean of class embeddings + orthogonalization (see the sketch after this list)
- Temperature: 10.0 (trainable during fine-tuning)
- Layer: -2 (11th BERT layer)
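A rough sketch of that initialization, assuming each class's k prototypes start from the class-mean embedding and are then orthogonalized (the helper name, noise scale, and QR-based orthogonalization are illustrative choices, not the project's exact procedure):

import torch
import torch.nn.functional as F

def init_prototypes(embeddings, labels, num_labels=6, prototypes_per_class=3):
    # embeddings: (N, hidden) L2-normalized sentence embeddings; labels: (N,) int tensor
    hidden = embeddings.size(-1)
    protos = torch.empty(num_labels, prototypes_per_class, hidden)
    for c in range(num_labels):
        mean = embeddings[labels == c].mean(dim=0)
        # Start all k prototypes of the class at the mean, perturb slightly,
        # then orthogonalize with QR so they are not collinear
        noisy = mean.unsqueeze(0) + 0.01 * torch.randn(prototypes_per_class, hidden)
        q, _ = torch.linalg.qr(noisy.T)              # (hidden, k) orthonormal columns
        protos[c] = q.T
    return F.normalize(protos, dim=-1)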
Model Files
- `model.safetensors` - Model weights (476 MB)
- `config.json` - Model configuration
- `tokenizer.json` - Tokenizer vocabulary
- `tokenizer_config.json` - Tokenizer configuration
Limitations
- Model is trained specifically for Swedish text
- Sentence-level classification (not document-level)
- Works best on sentences of reasonable length (recommended: 8-128 tokens; see the check below)
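A small convenience check for the recommended length range (not part of the released code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fffffwl/swe-cefr-sp")
n_tokens = len(tokenizer("Jag heter Anna och jag kommer från Sverige.")["input_ids"])
if not 8 <= n_tokens <= 128:
    print(f"Warning: {n_tokens} tokens; the prediction may be less reliable at this length.")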
Citations
If you use this model in your research, please cite:
@misc{fan2024swedish,
title={Swedish Sentence-Level CEFR Classification with LLM Annotations},
author={Fan, Wenlin},
year={2024},
howpublished={\url{https://huggingface.co/fffffwl/swe-cefr-sp}}
}
Or as part of the broader project:
@misc{fan2024swecefrsp,
title={Swedish CEFR Sentence-level Assessment using Large Language Models},
author={Fan, Wenlin},
year={2024},
publisher={GitHub},
howpublished={\url{https://github.com/fanwenlin/swe-cefr-sp}},
note={Dataset, LLM annotating codes and sentence-level assessment codes available}
}
Project Links
- GitHub Repository: https://github.com/fanwenlin/swe-cefr-sp
- Hugging Face Space: Available with interactive demo
- Dataset: 10k Swedish sentences annotated from COCTAILL, 8 Sidor, and SUC3
- Main Model: This prototype-based model (k=3) with Swedish BERT
Related Work
This work builds upon:
- Yoshioka et al. (2022): CEFR-based Sentence Profile (CEFR-SP) and prototype-based metric learning
- Volodina et al. (2016): Swedish passage readability assessment
- Scarton et al. (2018): Controllable text simplification
License
This model is released under the MIT License. See LICENSE file for details.
Related Models
This repository also contains:
- Original k=1 checkpoint: `metric-proto-k1.pt`
- Original k=3 checkpoint: `metric-proto-k3.pt` (this model)
- Original k=5 checkpoint: `metric-proto-k5.pt`
- BERT baseline: `bert-baseline.pt`
- Megatron version: `metric-proto-megatron-k3.pt`
- Traditional ML models: `linear_regression.joblib`, `logreg.joblib`, `svm.joblib`, `mlp.joblib`, `tree.joblib`
For more details, visit the project repository.