English Accent Classifier (6 Classes)

Fine-tuned Wav2Vec2 model for classifying English accents across 6 accent classes.

Model Description

This model is a fine-tuned version of dima806/english_accents_classification on the Mozilla Common Voice 23.0 dataset. It classifies English speech into one of six accent categories (the snippet after this list shows how to read the id-to-label mapping from the model config):

  • us - United States English
  • england - British English
  • indian - Indian English
  • australia - Australian English
  • canada - Canadian English
  • latin - Latin American Spanish-influenced English
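
The integer ids the model outputs map to these labels via model.config.id2label. The index order shown in the comment below is illustrative; it should be read from the config rather than assumed:

from transformers import Wav2Vec2ForSequenceClassification

model = Wav2Vec2ForSequenceClassification.from_pretrained("MilesPurvis/english-accent-classifier")

# Inspect the id-to-label mapping used by the classification head
print(model.config.id2label)
# e.g. {0: 'us', 1: 'england', 2: 'indian', 3: 'australia', 4: 'canada', 5: 'latin'}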

Intended Uses & Limitations

Intended Uses

  • Accent classification for English speech
  • Voice profile analysis
  • Linguistic research
  • Accent-aware speech processing pipelines

Limitations

  • Trained primarily on native speakers
  • Performance may vary with background noise
  • Very short audio clips (<3 seconds) may be unreliable (see the guard sketched after this list)
  • Latin accent classification is limited to Spanish-influenced English
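
Given the short-clip caveat above, a simple guard before inference (a minimal sketch using the 3-second threshold noted in the list):

import librosa

audio, sr = librosa.load("audio.wav", sr=16000)

# Flag clips below the ~3 s threshold, where predictions may be unreliable
duration = len(audio) / sr
if duration < 3.0:
    print(f"Warning: clip is only {duration:.1f}s; prediction may be unreliable")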

Training Data

This model builds upon dima806/english_accents_classification, which was trained on ~577k samples from Mozilla Common Voice across 5 accent classes (US, England, Indian, Australia, Canada).

Fine-tuning Details:

  • Dataset: Mozilla Common Voice 23.0 (English)
  • Fine-tuning Samples: 100 per accent class (600 total)
  • Task: Extended from 5 to 6 accent classes
  • Added Class: Latin American English (100 samples from 261 available; see the sampling sketch below)

Available samples in Common Voice 23.0:

  • US: ~301k samples
  • England: ~98k samples
  • Indian: ~83k samples
  • Australia: ~36k samples
  • Canada: ~59k samples
  • Latin: 261 samples
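
For reference, a minimal sketch of how a balanced 100-per-class subset could be drawn with the datasets library. The dataset id, the accent column name, and the label values are assumptions (Common Voice releases on the Hub are gated and column names vary between releases); the actual preprocessing used for this model may differ:

from datasets import load_dataset, concatenate_datasets

# Hypothetical dataset id; Common Voice on the Hub requires accepting the dataset terms
cv = load_dataset("mozilla-foundation/common_voice_23_0", "en", split="train")

accents = ["us", "england", "indian", "australia", "canada", "latin"]
subsets = []
for accent in accents:
    # Column may be named "accent" or "accents" depending on the release
    matches = cv.filter(lambda row, a=accent: row["accent"] == a)
    subsets.append(matches.shuffle(seed=42).select(range(100)))

balanced = concatenate_datasets(subsets)  # 600 samples, 100 per class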

Training Procedure

Training Hyperparameters

  • Base Model: dima806/english_accents_classification (pre-trained on ~577k samples)
  • Fine-tuning Task: Extended classification head from 5 to 6 classes (sketched after this list)
  • Fine-tuning Samples: 100 balanced samples per accent (600 total)
  • Training Data: Common Voice 23.0
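
One common way to extend a 5-class head to 6 classes is to reload the base checkpoint with a larger head and let the mismatched classifier layer be reinitialized before fine-tuning. A sketch of that general technique (not necessarily the exact procedure used for this model):

from transformers import Wav2Vec2ForSequenceClassification

labels = ["us", "england", "indian", "australia", "canada", "latin"]
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "dima806/english_accents_classification",
    num_labels=6,
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
    ignore_mismatched_sizes=True,  # the 5-class head is dropped; a 6-class head is freshly initialized
)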

Framework Versions

  • Transformers: 4.x
  • PyTorch: 2.x
  • Datasets: 2.x

Model Architecture

  • Base Architecture: Wav2Vec2ForSequenceClassification
  • Hidden Size: 768
  • Number of Layers: 12
  • Attention Heads: 12
  • Parameters: ~94.6M (F32 safetensors)
  • Classification Head: 6 classes

How to Use

Installation

pip install transformers librosa torch

Basic Usage

import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

# Load model
model_name = "MilesPurvis/english-accent-classifier"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

# Load audio
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Predict
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    pred_id = torch.argmax(probs).item()

# Get accent and confidence from model config
predicted_accent = model.config.id2label[pred_id]
confidence = probs[pred_id].item()

print(f"Predicted Accent: {predicted_accent}")
print(f"Confidence: {confidence:.1%}")

Advanced Usage with Segmentation

For longer audio files, segment the audio for better accuracy:

import os
import glob
import subprocess
import tempfile

def segment_audio(audio_path, segment_duration=10):
    """Split the input into fixed-length chunks with ffmpeg; returns segment file paths."""
    out_dir = tempfile.mkdtemp()
    pattern = os.path.join(out_dir, "segment_%03d.wav")
    subprocess.run(
        ["ffmpeg", "-i", audio_path, "-f", "segment",
         "-segment_time", str(segment_duration), "-c", "copy", pattern],
        check=True, capture_output=True,
    )
    return sorted(glob.glob(os.path.join(out_dir, "segment_*.wav")))

# Process each segment and aggregate results
segments = segment_audio("long_audio.wav")
predictions = []

for segment in segments:
    audio, sr = librosa.load(segment, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)[0]

    predictions.append(probs)

# Aggregate predictions across segments
avg_probs = torch.stack(predictions).mean(dim=0)
final_pred_id = torch.argmax(avg_probs).item()
final_accent = model.config.id2label[final_pred_id]
final_confidence = avg_probs[final_pred_id].item()

print(f"Final Predicted Accent: {final_accent}")
print(f"Confidence: {final_confidence:.1%}")

Performance

Prediction confidence can be read as a rough guide to reliability (a helper mirroring these bands follows the list):

  • High Confidence (>70%): Strong accent indicators present
  • Medium Confidence (50-70%): Mixed or subtle accent features
  • Low Confidence (<50%): Multiple competing accents or unclear audio
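
A small helper that applies these bands to the confidence value computed in the usage examples above:

def confidence_band(confidence):
    """Map a softmax probability to the rough bands described above."""
    if confidence > 0.70:
        return "high"
    if confidence >= 0.50:
        return "medium"
    return "low"

print(confidence_band(0.83))  # high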

Best Practices

  1. Use clear audio without background noise
  2. Segment long recordings (5-10 second chunks)
  3. Aggregate predictions across multiple segments
  4. Ensure audio is at a 16 kHz sample rate (resampling sketch below)
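
librosa.load(..., sr=16000), as used in the examples above, resamples automatically. If the audio comes from another source, resample explicitly (a minimal sketch):

import librosa

audio, sr = librosa.load("audio.wav", sr=None)  # keep the native sample rate
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sr = 16000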

Citation

If you use this model in your research, please cite:

@misc{accent-classifier-6class,
  author = {Miles Purvis},
  title = {English Accent Classifier (6 Classes)},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/MilesPurvis/english-accent-classifier}
}

Acknowledgments

Thanks to dima806 for the base model, dima806/english_accents_classification, and to the Mozilla Common Voice project and its contributors for the training data.

License

This model is licensed under the MIT License. Please ensure compliance with the licenses of the base model and training data.

Model Card Contact

For questions or issues, please open an issue in the model repository.
