English Accent Classifier (6 Classes)
A fine-tuned Wav2Vec2 model that classifies English speech into six accent classes.
Model Description
This model is a fine-tuned version of dima806/english_accents_classification on the Mozilla Common Voice 23.0 dataset. It classifies English speech into one of six accent categories:
- us - United States English
- england - British English
- indian - Indian English
- australia - Australian English
- canada - Canadian English
- latin - Latin American Spanish-influenced English
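The string labels above map to integer class ids through the model config; the exact mapping can be read from the checkpoint (the ordering shown in the comment is illustrative only):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("MilesPurvis/english-accent-classifier")
print(config.id2label)  # e.g. {0: 'us', 1: 'england', ...}; actual id order may differ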
Intended Uses & Limitations
Intended Uses
- Accent classification for English speech
- Voice profile analysis
- Linguistic research
- Accent-aware speech processing pipelines
Limitations
- Trained primarily on native speakers
- Performance may vary with background noise
- Very short audio clips (<3 seconds) may be unreliable
- Latin accent classification is limited to Spanish-influenced English
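One way to guard against the short-clip limitation is a duration check before inference; a minimal sketch using the 3-second cutoff noted above:

import librosa

audio, sr = librosa.load("clip.wav", sr=16000)
duration = len(audio) / sr
if duration < 3.0:
    print(f"Warning: clip is {duration:.1f}s; predictions on clips under 3s may be unreliable")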
Training Data
This model builds upon dima806/english_accents_classification, which was trained on ~577k samples from Mozilla Common Voice across 5 accent classes (US, England, Indian, Australia, Canada).
Fine-tuning Details:
- Dataset: Mozilla Common Voice 23.0 (English)
- Fine-tuning Samples: 100 per accent class (600 total; see the sampling sketch below)
- Task: Extended from 5 to 6 accent classes
- Added Class: Latin American English (100 samples from 261 available)
Available samples in Common Voice 23.0:
- US: ~301k samples
- England: ~98k samples
- Indian: ~83k samples
- Australia: ~36k samples
- Canada: ~59k samples
- Latin: 261 samples
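The exact data-preparation script is not published. One plausible way to draw the 100-per-class balanced subset from a Common Voice metadata file, assuming its accents column has been normalized to the six labels above (both the file name and the normalization step are assumptions), is:

import pandas as pd

ACCENTS = ["us", "england", "indian", "australia", "canada", "latin"]

df = pd.read_csv("validated.tsv", sep="\t")  # Common Voice clip metadata (assumed path)
balanced = (
    df[df["accents"].isin(ACCENTS)]     # keep only the six target labels
      .groupby("accents")
      .sample(n=100, random_state=42)   # 100 clips per accent, 600 total
)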
Training Procedure
Training Hyperparameters
- Base Model: dima806/english_accents_classification (pre-trained on ~577k samples)
- Fine-tuning Task: Extended classification head from 5 to 6 classes
- Fine-tuning Samples: 100 balanced samples per accent (600 total)
- Training Data: Common Voice 23.0
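The fine-tuning script itself is not published. A common Transformers pattern for extending a 5-class checkpoint to 6 labels, consistent with the description above, is to reload the base model with a resized classification head:

from transformers import Wav2Vec2ForSequenceClassification

# Reload the 5-class base checkpoint with a fresh 6-way classifier head.
# ignore_mismatched_sizes lets the head change shape; the new layer is
# randomly initialized and must then be trained on the balanced subset.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "dima806/english_accents_classification",
    num_labels=6,
    ignore_mismatched_sizes=True,
)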
Framework Versions
- Transformers: 4.x
- PyTorch: 2.x
- Datasets: 2.x
Model Architecture
- Base Architecture: Wav2Vec2ForSequenceClassification
- Hidden Size: 1024
- Number of Layers: 24
- Attention Heads: 16
- Classification Head: 6 classes
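These dimensions can be verified directly against the released checkpoint:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("MilesPurvis/english-accent-classifier")
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads, cfg.num_labels)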
How to Use
Installation
pip install transformers librosa torch
Basic Usage
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
# Load model
model_name = "MilesPurvis/english-accent-classifier"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()
# Load audio
audio, sr = librosa.load("audio.wav", sr=16000)
# Process
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
# Predict
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)[0]
pred_id = torch.argmax(probs).item()
# Get accent and confidence from model config
predicted_accent = model.config.id2label[pred_id]
confidence = probs[pred_id].item()
print(f"Predicted Accent: {predicted_accent}")
print(f"Confidence: {confidence:.1%}")
Advanced Usage with Segmentation
For longer audio files, segment the audio and aggregate per-segment predictions. The segment_audio helper below is a minimal ffmpeg-based sketch (it assumes ffmpeg is on the PATH); any splitter that yields 16 kHz chunks works:
import os
import glob
import subprocess
import tempfile

def segment_audio(audio_path, segment_duration=10):
    """Split an audio file into fixed-length 16 kHz mono WAV chunks with ffmpeg."""
    out_dir = tempfile.mkdtemp()
    pattern = os.path.join(out_dir, "segment_%03d.wav")
    subprocess.run(
        ["ffmpeg", "-i", audio_path, "-ar", "16000", "-ac", "1",
         "-f", "segment", "-segment_time", str(segment_duration), pattern],
        check=True, capture_output=True,
    )
    return sorted(glob.glob(os.path.join(out_dir, "segment_*.wav")))
# Process each segment and aggregate results
segments = segment_audio("long_audio.wav")
predictions = []
for segment in segments:
    audio, sr = librosa.load(segment, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    predictions.append(probs)
# Aggregate predictions across segments
avg_probs = torch.stack(predictions).mean(dim=0)
final_pred_id = torch.argmax(avg_probs).item()
final_accent = model.config.id2label[final_pred_id]
final_confidence = avg_probs[final_pred_id].item()
print(f"Final Predicted Accent: {final_accent}")
print(f"Confidence: {final_confidence:.1%}")
Performance
On Common Voice 23.0 (English), the model reaches a self-reported accuracy of 0.550 across the six classes. Prediction confidence can be read as a rough reliability signal:
- High Confidence (>70%): Strong accent indicators present
- Medium Confidence (50-70%): Mixed or subtle accent features
- Low Confidence (<50%): Multiple competing accents or unclear audio
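These bands are heuristics rather than calibrated probabilities; a small helper that applies them:

def confidence_band(confidence):
    """Map a softmax confidence score to the heuristic bands above."""
    if confidence > 0.70:
        return "high"
    if confidence >= 0.50:
        return "medium"
    return "low"

print(confidence_band(0.83))  # high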
Best Practices
- Use clear audio without background noise
- Segment long recordings (5-10 second chunks)
- Aggregate predictions across multiple segments
- Ensure audio is at 16kHz sample rate
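For the last point, librosa resamples at load time when sr=16000 is passed (as in the examples above); to resample an already-loaded signal explicitly:

import librosa

audio, sr = librosa.load("input.wav", sr=None)  # keep the file's native rate
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)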
Citation
If you use this model in your research, please cite:
@misc{accent-classifier-6class,
  author    = {Miles Purvis},
  title     = {English Accent Classifier (6 Classes)},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/MilesPurvis/english-accent-classifier}
}
Acknowledgments
- Base Model: dima806/english_accents_classification
- Foundation Model: facebook/wav2vec2-base-960h
- Dataset: Mozilla Common Voice 23.0
License
This model is licensed under the MIT License. Please ensure compliance with the licenses of the base model and training data.
Model Card Contact
For questions or issues, please open an issue in the model repository.