MilesPurvis committed
Commit 9656b4b · verified · 1 Parent(s): e552c06

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +205 -3
  2. config.json +125 -0
  3. label_encoder.pkl +3 -0
  4. model.safetensors +3 -0
  5. preprocessor_config.json +9 -0
README.md CHANGED
@@ -1,3 +1,205 @@
- ---
- license: mit
- ---
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - audio
+ - accent-classification
+ - wav2vec2
+ - speech
+ - english-accents
+ datasets:
+ - mozilla-foundation/common_voice_17_0
+ metrics:
+ - accuracy
+ model-index:
+ - name: accent-classifier-wav2vec2-6class
+   results:
+   - task:
+       type: audio-classification
+       name: Accent Classification
+     dataset:
+       name: Common Voice 17.0 (English)
+       type: mozilla-foundation/common_voice_17_0
+     metrics:
+     - type: accuracy
+       name: Accuracy
+       value: 0.55
+ base_model: dima806/english_accents_classification
+ ---
+
+ # English Accent Classifier (6 Classes)
+
+ A fine-tuned Wav2Vec2 model for classifying English speech into one of six accent classes.
+
+ ## Model Description
+
+ This model is a fine-tuned version of [dima806/english_accents_classification](https://huggingface.co/dima806/english_accents_classification) on the Mozilla Common Voice 17.0 dataset. It classifies English speech into one of six accent categories:
+
+ - **us** - United States English
+ - **england** - British English
+ - **indian** - Indian English
+ - **australia** - Australian English
+ - **canada** - Canadian English
+ - **latin** - Latin American Spanish-influenced English
+
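+ Class indices follow the alphabetical `id2label` mapping in this repo's `config.json`; a quick way to confirm it (the repo id is a placeholder, as in the usage examples below):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("YOUR_USERNAME/accent-classifier-wav2vec2-6class")
+ print(config.id2label)
+ # {0: 'australia', 1: 'canada', 2: 'england', 3: 'indian', 4: 'latin', 5: 'us'}
+ ```
+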
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+
+ - Accent classification for English speech
+ - Voice profile analysis
+ - Linguistic research
+ - Accent-aware speech processing pipelines
+
+ ### Limitations
+
+ - Trained primarily on native speakers
+ - Performance may vary with background noise
+ - Very short audio clips (<3 seconds) may yield unreliable predictions
+ - Latin accent classification is limited to Spanish-influenced English
+
+ ## Training Data
+
+ - **Dataset**: Mozilla Common Voice 17.0 (English)
+ - **Samples**: 100 per accent class (600 total)
+ - **Base Model**: dima806/english_accents_classification (5 accents)
+ - **Added Class**: Latin American English
+
+ ## Training Procedure
+
+ ### Training Hyperparameters
+
+ - **Base Model**: `dima806/english_accents_classification`
+ - **Fine-tuning Task**: Extended from 5 to 6 accent classes
+ - **Samples per Accent**: 100
+ - **Training Data**: Balanced samples from Common Voice 17.0 (see the sketch below)
+
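+ The balanced draw is not shipped with this repo; the sketch below shows one way it could be reproduced with the `datasets` library. The `accents` field name and the raw label strings in `wanted` are assumptions (Common Voice accent annotations are free-form), and the dataset is gated, so a Hugging Face login is required:
+
+ ```python
+ from collections import Counter
+ from datasets import load_dataset
+
+ # Assumption: raw Common Voice accent strings -> this model's labels.
+ # The two entries shown are illustrative; fill in one per accent class.
+ wanted = {
+     "United States English": "us",
+     "England English": "england",
+ }
+
+ # Streaming avoids downloading the full English split up front.
+ cv = load_dataset("mozilla-foundation/common_voice_17_0", "en",
+                   split="train", streaming=True)
+
+ counts, samples = Counter(), []
+ for row in cv:
+     label = wanted.get(row.get("accents", ""))
+     if label is not None and counts[label] < 100:  # 100 clips per class
+         counts[label] += 1
+         samples.append({"audio": row["audio"], "label": label})
+     if sum(counts.values()) == 100 * len(wanted):
+         break
+ ```
+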
+ ### Framework Versions
+
+ - Transformers: 4.57.1
+ - PyTorch: 2.x
+ - Datasets: 2.x
+
+ ## Model Architecture
+
+ - **Base Architecture**: Wav2Vec2ForSequenceClassification
+ - **Hidden Size**: 768
+ - **Number of Layers**: 12
+ - **Attention Heads**: 12
+ - **Classification Head**: 6 classes
+
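+ These dimensions come from the `config.json` in this commit (wav2vec2-base sizing), and the ~378 MB `model.safetensors` is consistent with them at float32 precision. A quick sanity check:
+
+ ```python
+ from transformers import Wav2Vec2ForSequenceClassification
+
+ model = Wav2Vec2ForSequenceClassification.from_pretrained(
+     "YOUR_USERNAME/accent-classifier-wav2vec2-6class"  # placeholder repo id
+ )
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e6:.1f}M parameters")  # ~94.6M, i.e. ~378 MB / 4 bytes
+ ```
+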
+ ## How to Use
+
+ ### Installation
+
+ ```bash
+ pip install transformers librosa torch
+ ```
+
+ ### Basic Usage
+
+ ```python
+ import torch
+ import librosa
+ from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
+
+ # Load model
+ model_name = "YOUR_USERNAME/accent-classifier-wav2vec2-6class"
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
+ model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ # Load audio
+ audio, sr = librosa.load("audio.wav", sr=16000)
+
+ # Process
+ inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+
+ # Predict
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ probs = torch.softmax(logits, dim=-1)[0]
+ pred_id = torch.argmax(probs).item()
+
+ # Get accent
+ accents = ["australia", "canada", "england", "indian", "latin", "us"]
+ predicted_accent = accents[pred_id]
+ confidence = probs[pred_id].item()
+
+ print(f"Predicted Accent: {predicted_accent}")
+ print(f"Confidence: {confidence:.1%}")
+ ```
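+
+ Alternatively, the high-level `pipeline` API wraps the same steps and maps logits to the label names in `config.json` (the printed scores are illustrative; ffmpeg is used to decode the file):
+
+ ```python
+ from transformers import pipeline
+
+ clf = pipeline("audio-classification",
+                model="YOUR_USERNAME/accent-classifier-wav2vec2-6class")
+ print(clf("audio.wav"))
+ # e.g. [{'label': 'us', 'score': 0.81}, {'label': 'canada', 'score': 0.09}, ...]
+ ```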
+
+ ### Advanced Usage with Segmentation
+
+ For longer audio files, segment the audio and average the per-segment probabilities for better accuracy (this continues from the Basic Usage example above):
+
+ ```python
+ import glob
+ import os
+ import subprocess
+ import tempfile
+
+ def segment_audio(audio_path, segment_duration=10):
+     """Split audio into fixed-length 16 kHz mono WAV segments using ffmpeg."""
+     out_dir = tempfile.mkdtemp()
+     pattern = os.path.join(out_dir, "segment_%03d.wav")
+     subprocess.run(
+         ["ffmpeg", "-i", audio_path, "-f", "segment",
+          "-segment_time", str(segment_duration),
+          "-ar", "16000", "-ac", "1", pattern],
+         check=True, capture_output=True,
+     )
+     return sorted(glob.glob(os.path.join(out_dir, "segment_*.wav")))
+
+ # Process each segment and aggregate results
+ segments = segment_audio("long_audio.wav")
+ predictions = []
+
+ for segment in segments:
+     audio, sr = librosa.load(segment, sr=16000)
+     inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
+
+     with torch.no_grad():
+         logits = model(**inputs).logits
+     probs = torch.softmax(logits, dim=-1)[0]
+
+     predictions.append(probs)
+
+ # Aggregate by averaging class probabilities across segments
+ avg_probs = torch.stack(predictions).mean(dim=0)
+ final_accent = accents[torch.argmax(avg_probs).item()]
+ ```
+
+ ## Performance
+
+ Reported accuracy is 0.55 on Common Voice 17.0 (see the model index in the header). Prediction confidence can be interpreted as follows:
+
+ - **High Confidence (>70%)**: Strong accent indicators present
+ - **Medium Confidence (50-70%)**: Mixed or subtle accent features
+ - **Low Confidence (<50%)**: Multiple competing accents or unclear audio
+
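+ A small helper makes these bands explicit in code (`predicted_accent` and `confidence` come from the Basic Usage example; the function name is illustrative):
+
+ ```python
+ def confidence_band(confidence):
+     """Map a softmax confidence to the bands described above."""
+     if confidence > 0.70:
+         return "high"
+     if confidence >= 0.50:
+         return "medium"
+     return "low"
+
+ print(f"{predicted_accent} ({confidence_band(confidence)} confidence)")
+ ```
+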
+ ### Best Practices
+
+ 1. Use clear audio without background noise
+ 2. Segment long recordings (5-10 second chunks)
+ 3. Aggregate predictions across multiple segments
+ 4. Ensure audio is at a 16 kHz sample rate (see the resampling snippet below)
+
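+ On point 4: `librosa.load(..., sr=16000)` resamples on load to match the `sampling_rate` in `preprocessor_config.json`; `soundfile` (a librosa dependency) can report a file's native rate first:
+
+ ```python
+ import librosa
+ import soundfile as sf
+
+ print(sf.info("recording.wav").samplerate)  # native sample rate
+
+ # Resampled to 16 kHz regardless of the native rate
+ audio, sr = librosa.load("recording.wav", sr=16000)
+ assert sr == 16000
+ ```
+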
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{accent-classifier-6class,
+   author = {Miles Purvis},
+   title = {English Accent Classifier (6 Classes)},
+   year = {2024},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/YOUR_USERNAME/accent-classifier-wav2vec2-6class}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - **Base Model**: [dima806/english_accents_classification](https://huggingface.co/dima806/english_accents_classification)
+ - **Foundation Model**: [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
+ - **Dataset**: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
+
+ ## License
+
+ This model is licensed under Apache 2.0. Please ensure compliance with the licenses of the base model and training data.
+
+ ## Model Card Contact
+
+ For questions or issues, please open an issue in the model repository.
config.json ADDED
@@ -0,0 +1,125 @@
+ {
+   "activation_dropout": 0.1,
+   "adapter_attn_dim": null,
+   "adapter_kernel_size": 3,
+   "adapter_stride": 2,
+   "add_adapter": false,
+   "apply_spec_augment": true,
+   "architectures": [
+     "Wav2Vec2ForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 1,
+   "classifier_proj_size": 256,
+   "codevector_dim": 256,
+   "contrastive_logits_temperature": 0.1,
+   "conv_bias": false,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "sum",
+   "ctc_zero_infinity": false,
+   "diversity_loss_weight": 0.1,
+   "do_stable_layer_norm": false,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_dropout": 0.0,
+   "feat_extract_norm": "group",
+   "feat_proj_dropout": 0.1,
+   "feat_quantizer_dropout": 0.0,
+   "final_dropout": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.1,
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "australia",
+     "1": "canada",
+     "2": "england",
+     "3": "indian",
+     "4": "latin",
+     "5": "us"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "australia": 0,
+     "canada": 1,
+     "england": 2,
+     "indian": 3,
+     "latin": 4,
+     "us": 5
+   },
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.1,
+   "mask_feature_length": 10,
+   "mask_feature_min_masks": 0,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_min_masks": 2,
+   "mask_time_prob": 0.05,
+   "model_type": "wav2vec2",
+   "num_adapter_layers": 3,
+   "num_attention_heads": 12,
+   "num_codevector_groups": 2,
+   "num_codevectors_per_group": 320,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 12,
+   "num_negatives": 100,
+   "output_hidden_size": 768,
+   "pad_token_id": 0,
+   "proj_codevector_dim": 256,
+   "tdnn_dilation": [
+     1,
+     2,
+     3,
+     1,
+     1
+   ],
+   "tdnn_dim": [
+     512,
+     512,
+     512,
+     512,
+     1500
+   ],
+   "tdnn_kernel": [
+     5,
+     3,
+     3,
+     1,
+     1
+   ],
+   "transformers_version": "4.57.1",
+   "use_weighted_layer_sum": false,
+   "vocab_size": 32,
+   "xvector_output_dim": 512
+ }
label_encoder.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:48d2fb4b3d625a7f0c07d917ef4bfc1de6d9ec44adf859f53a43e97ff3480b2a
+ size 452
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a09ca735804973ae7b958ebd8121cfbc749f4e41a62dc5ef0de282ace862109
+ size 378306480
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": false,
+   "sampling_rate": 16000
+ }