MilesPurvis committed
Commit 9656b4b · verified · 1 Parent(s): e552c06

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +205 -3
  2. config.json +125 -0
  3. label_encoder.pkl +3 -0
  4. model.safetensors +3 -0
  5. preprocessor_config.json +9 -0
README.md CHANGED
@@ -1,3 +1,205 @@
- ---
- license: mit
- ---
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - audio
+ - accent-classification
+ - wav2vec2
+ - speech
+ - english-accents
+ datasets:
+ - mozilla-foundation/common_voice_17_0
+ metrics:
+ - accuracy
+ model-index:
+ - name: accent-classifier-wav2vec2-6class
+   results:
+   - task:
+       type: audio-classification
+       name: Accent Classification
+     dataset:
+       name: Common Voice 17.0 (English)
+       type: mozilla-foundation/common_voice_17_0
+     metrics:
+     - type: accuracy
+       name: Accuracy
+       value: 0.55
+ base_model: dima806/english_accents_classification
+ ---
+
+ # English Accent Classifier (6 Classes)
+
+ A fine-tuned Wav2Vec2 model for classifying English speech into one of six accent classes.
+
+ ## Model Description
+
+ This model is a fine-tuned version of [dima806/english_accents_classification](https://huggingface.co/dima806/english_accents_classification) on the Mozilla Common Voice 17.0 dataset. It classifies English speech into one of six accent categories:
+
+ - **us** - United States English
+ - **england** - British English
+ - **indian** - Indian English
+ - **australia** - Australian English
+ - **canada** - Canadian English
+ - **latin** - Latin American Spanish-influenced English
+
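+ Class indices follow the alphabetical `id2label` mapping in this repo's `config.json`; a quick way to confirm it (the repo id is a placeholder, as in the usage examples below):
+
+ ```python
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("YOUR_USERNAME/accent-classifier-wav2vec2-6class")
+ print(config.id2label)
+ # {0: 'australia', 1: 'canada', 2: 'england', 3: 'indian', 4: 'latin', 5: 'us'}
+ ```
+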
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+
+ - Accent classification for English speech
+ - Voice profile analysis
+ - Linguistic research
+ - Accent-aware speech processing pipelines
+
+ ### Limitations
+
+ - Trained primarily on native speakers
+ - Performance may vary with background noise
+ - Very short audio clips (<3 seconds) may yield unreliable predictions
+ - Latin accent classification is limited to Spanish-influenced English
+
+ ## Training Data
+
+ - **Dataset**: Mozilla Common Voice 17.0 (English)
+ - **Samples**: 100 per accent class (600 total)
+ - **Base Model**: dima806/english_accents_classification (5 accents)
+ - **Added Class**: Latin American English
+
+ ## Training Procedure
+
+ ### Training Hyperparameters
+
+ - **Base Model**: `dima806/english_accents_classification`
+ - **Fine-tuning Task**: Extended from 5 to 6 accent classes
+ - **Samples per Accent**: 100
+ - **Training Data**: Balanced samples from Common Voice 17.0 (see the sketch below)
+
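+ The balanced draw is not shipped with this repo; the sketch below shows one way it could be reproduced with the `datasets` library. The `accents` field name and the raw label strings in `wanted` are assumptions (Common Voice accent annotations are free-form), and the dataset is gated, so a Hugging Face login is required:
+
+ ```python
+ from collections import Counter
+ from datasets import load_dataset
+
+ # Assumption: raw Common Voice accent strings -> this model's labels.
+ # The two entries shown are illustrative; fill in one per accent class.
+ wanted = {
+     "United States English": "us",
+     "England English": "england",
+ }
+
+ # Streaming avoids downloading the full English split up front.
+ cv = load_dataset("mozilla-foundation/common_voice_17_0", "en",
+                   split="train", streaming=True)
+
+ counts, samples = Counter(), []
+ for row in cv:
+     label = wanted.get(row.get("accents", ""))
+     if label is not None and counts[label] < 100:  # 100 clips per class
+         counts[label] += 1
+         samples.append({"audio": row["audio"], "label": label})
+     if sum(counts.values()) == 100 * len(wanted):
+         break
+ ```
+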
+ ### Framework Versions
+
+ - Transformers: 4.57.1
+ - PyTorch: 2.x
+ - Datasets: 2.x
+
+ ## Model Architecture
+
+ - **Base Architecture**: Wav2Vec2ForSequenceClassification
+ - **Hidden Size**: 768
+ - **Number of Layers**: 12
+ - **Attention Heads**: 12
+ - **Classification Head**: 6 classes
+
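+ These dimensions come from the `config.json` in this commit (wav2vec2-base sizing), and the ~378 MB `model.safetensors` is consistent with them at float32 precision. A quick sanity check:
+
+ ```python
+ from transformers import Wav2Vec2ForSequenceClassification
+
+ model = Wav2Vec2ForSequenceClassification.from_pretrained(
+     "YOUR_USERNAME/accent-classifier-wav2vec2-6class"  # placeholder repo id
+ )
+ n_params = sum(p.numel() for p in model.parameters())
+ print(f"{n_params / 1e6:.1f}M parameters")  # ~94.6M, i.e. ~378 MB / 4 bytes
+ ```
+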
+ ## How to Use
+
+ ### Installation
+
+ ```bash
+ pip install transformers librosa torch
+ ```
+
+ ### Basic Usage
+
+ ```python
+ import torch
+ import librosa
+ from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
+
+ # Load model
+ model_name = "YOUR_USERNAME/accent-classifier-wav2vec2-6class"
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
+ model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
+ model.eval()
+
+ # Load audio
+ audio, sr = librosa.load("audio.wav", sr=16000)
+
+ # Process
+ inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
+
+ # Predict
+ with torch.no_grad():
+     logits = model(**inputs).logits
+ probs = torch.softmax(logits, dim=-1)[0]
+ pred_id = torch.argmax(probs).item()
+
+ # Get accent
+ accents = ["australia", "canada", "england", "indian", "latin", "us"]
+ predicted_accent = accents[pred_id]
+ confidence = probs[pred_id].item()
+
+ print(f"Predicted Accent: {predicted_accent}")
+ print(f"Confidence: {confidence:.1%}")
+ ```
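+
+ Alternatively, the high-level `pipeline` API wraps the same steps and maps logits to the label names in `config.json` (the printed scores are illustrative; ffmpeg is used to decode the file):
+
+ ```python
+ from transformers import pipeline
+
+ clf = pipeline("audio-classification",
+                model="YOUR_USERNAME/accent-classifier-wav2vec2-6class")
+ print(clf("audio.wav"))
+ # e.g. [{'label': 'us', 'score': 0.81}, {'label': 'canada', 'score': 0.09}, ...]
+ ```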
+
+ ### Advanced Usage with Segmentation
+
+ For longer audio files, segment the audio and average the per-segment probabilities for better accuracy (this continues from the Basic Usage example above):
+
+ ```python
+ import glob
+ import os
+ import subprocess
+ import tempfile
+
+ def segment_audio(audio_path, segment_duration=10):
+     """Split audio into fixed-length 16 kHz mono WAV segments using ffmpeg."""
+     out_dir = tempfile.mkdtemp()
+     pattern = os.path.join(out_dir, "segment_%03d.wav")
+     subprocess.run(
+         ["ffmpeg", "-i", audio_path, "-f", "segment",
+          "-segment_time", str(segment_duration),
+          "-ar", "16000", "-ac", "1", pattern],
+         check=True, capture_output=True,
+     )
+     return sorted(glob.glob(os.path.join(out_dir, "segment_*.wav")))
+
+ # Process each segment and aggregate results
+ segments = segment_audio("long_audio.wav")
+ predictions = []
+
+ for segment in segments:
+     audio, sr = librosa.load(segment, sr=16000)
+     inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
+
+     with torch.no_grad():
+         logits = model(**inputs).logits
+     probs = torch.softmax(logits, dim=-1)[0]
+
+     predictions.append(probs)
+
+ # Aggregate by averaging class probabilities across segments
+ avg_probs = torch.stack(predictions).mean(dim=0)
+ final_accent = accents[torch.argmax(avg_probs).item()]
+ ```
+
+ ## Performance
+
+ Reported accuracy is 0.55 on Common Voice 17.0 (see the model index in the header). Prediction confidence can be interpreted as follows:
+
+ - **High Confidence (>70%)**: Strong accent indicators present
+ - **Medium Confidence (50-70%)**: Mixed or subtle accent features
+ - **Low Confidence (<50%)**: Multiple competing accents or unclear audio
+
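+ A small helper makes these bands explicit in code (`predicted_accent` and `confidence` come from the Basic Usage example; the function name is illustrative):
+
+ ```python
+ def confidence_band(confidence):
+     """Map a softmax confidence to the bands described above."""
+     if confidence > 0.70:
+         return "high"
+     if confidence >= 0.50:
+         return "medium"
+     return "low"
+
+ print(f"{predicted_accent} ({confidence_band(confidence)} confidence)")
+ ```
+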
+ ### Best Practices
+
+ 1. Use clear audio without background noise
+ 2. Segment long recordings (5-10 second chunks)
+ 3. Aggregate predictions across multiple segments
+ 4. Ensure audio is at a 16 kHz sample rate (see the resampling snippet below)
+
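+ On point 4: `librosa.load(..., sr=16000)` resamples on load to match the `sampling_rate` in `preprocessor_config.json`; `soundfile` (a librosa dependency) can report a file's native rate first:
+
+ ```python
+ import librosa
+ import soundfile as sf
+
+ print(sf.info("recording.wav").samplerate)  # native sample rate
+
+ # Resampled to 16 kHz regardless of the native rate
+ audio, sr = librosa.load("recording.wav", sr=16000)
+ assert sr == 16000
+ ```
+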
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{accent-classifier-6class,
+   author = {Miles Purvis},
+   title = {English Accent Classifier (6 Classes)},
+   year = {2024},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/YOUR_USERNAME/accent-classifier-wav2vec2-6class}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - **Base Model**: [dima806/english_accents_classification](https://huggingface.co/dima806/english_accents_classification)
+ - **Foundation Model**: [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)
+ - **Dataset**: [Mozilla Common Voice 17.0](https://commonvoice.mozilla.org/)
+
+ ## License
+
+ This model is licensed under Apache 2.0. Please ensure compliance with the licenses of the base model and training data.
+
+ ## Model Card Contact
+
+ For questions or issues, please open an issue in the model repository.
config.json ADDED
@@ -0,0 +1,125 @@
+ {
+   "activation_dropout": 0.1,
+   "adapter_attn_dim": null,
+   "adapter_kernel_size": 3,
+   "adapter_stride": 2,
+   "add_adapter": false,
+   "apply_spec_augment": true,
+   "architectures": [
+     "Wav2Vec2ForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 1,
+   "classifier_proj_size": 256,
+   "codevector_dim": 256,
+   "contrastive_logits_temperature": 0.1,
+   "conv_bias": false,
+   "conv_dim": [
+     512,
+     512,
+     512,
+     512,
+     512,
+     512,
+     512
+   ],
+   "conv_kernel": [
+     10,
+     3,
+     3,
+     3,
+     3,
+     2,
+     2
+   ],
+   "conv_stride": [
+     5,
+     2,
+     2,
+     2,
+     2,
+     2,
+     2
+   ],
+   "ctc_loss_reduction": "sum",
+   "ctc_zero_infinity": false,
+   "diversity_loss_weight": 0.1,
+   "do_stable_layer_norm": false,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "feat_extract_activation": "gelu",
+   "feat_extract_dropout": 0.0,
+   "feat_extract_norm": "group",
+   "feat_proj_dropout": 0.1,
+   "feat_quantizer_dropout": 0.0,
+   "final_dropout": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout": 0.1,
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "australia",
+     "1": "canada",
+     "2": "england",
+     "3": "indian",
+     "4": "latin",
+     "5": "us"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "australia": 0,
+     "canada": 1,
+     "england": 2,
+     "indian": 3,
+     "latin": 4,
+     "us": 5
+   },
+   "layer_norm_eps": 1e-05,
+   "layerdrop": 0.1,
+   "mask_feature_length": 10,
+   "mask_feature_min_masks": 0,
+   "mask_feature_prob": 0.0,
+   "mask_time_length": 10,
+   "mask_time_min_masks": 2,
+   "mask_time_prob": 0.05,
+   "model_type": "wav2vec2",
+   "num_adapter_layers": 3,
+   "num_attention_heads": 12,
+   "num_codevector_groups": 2,
+   "num_codevectors_per_group": 320,
+   "num_conv_pos_embedding_groups": 16,
+   "num_conv_pos_embeddings": 128,
+   "num_feat_extract_layers": 7,
+   "num_hidden_layers": 12,
+   "num_negatives": 100,
+   "output_hidden_size": 768,
+   "pad_token_id": 0,
+   "proj_codevector_dim": 256,
+   "tdnn_dilation": [
+     1,
+     2,
+     3,
+     1,
+     1
+   ],
+   "tdnn_dim": [
+     512,
+     512,
+     512,
+     512,
+     1500
+   ],
+   "tdnn_kernel": [
+     5,
+     3,
+     3,
+     1,
+     1
+   ],
+   "transformers_version": "4.57.1",
+   "use_weighted_layer_sum": false,
+   "vocab_size": 32,
+   "xvector_output_dim": 512
+ }
label_encoder.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:48d2fb4b3d625a7f0c07d917ef4bfc1de6d9ec44adf859f53a43e97ff3480b2a
+ size 452
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a09ca735804973ae7b958ebd8121cfbc749f4e41a62dc5ef0de282ace862109
+ size 378306480
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0.0,
+   "return_attention_mask": false,
+   "sampling_rate": 16000
+ }