---
language:
- bm
library_name: nemo
datasets:
- RobotsMali/afvoices
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- CTC
- QuartzNet
- pytorch
- Bambara
- NeMo
license: cc-by-4.0
base_model: RobotsMali/stt-bm-quartznet15x5-v0
model-index:
- name: stt-bm-quartznet15x5-v2
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: African Next Voices
type: RobotsMali/afvoices
split: test
args:
language: bm
metrics:
- name: Test WER
type: wer
value: 42.57205678504852
- name: Test CER
type: cer
value: 18.708413949318106
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Nyana Eval
type: RobotsMali/nyana-eval
split: test
args:
language: bm
metrics:
- name: Test WER
type: wer
value: 48.97
- name: Test CER
type: cer
value: 24.22
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---
# QuartzNet 15x5 CTC Series
`stt-bm-quartznet15x5-v2` is a fine-tuned version of [`RobotsMali/stt-bm-quartznet15x5-v0`](https://huggingface.co/RobotsMali/stt-bm-quartznet15x5-v0). This model does not produce **punctuation or capitalization**: it uses a character-level encoding scheme and transcribes text using the standard character set provided in its training data.
The model was fine-tuned using **NVIDIA NeMo** and is trained with **CTC (Connectionist Temporal Classification) Loss**.
## **🚨 Important Note**
This model, along with its associated resources, is part of an **ongoing research effort**; improvements and refinements are expected in future versions. Users should be aware that:
- **The model may not generalize well across all speaking conditions and dialects.**
- **Community feedback is welcome, and contributions are encouraged to refine the model further.**
## NVIDIA NeMo: Training
To fine-tune or use the model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after setting up the latest PyTorch version.
```bash
pip install "nemo_toolkit[asr]"
```
## How to Use This Model
### Load Model with NeMo
```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="RobotsMali/stt-bm-quartznet15x5-v2")
```
### Transcribe Audio
```python
# Assuming you have a test audio file named sample_audio.wav
hypotheses = asr_model.transcribe(['sample_audio.wav'])
print(hypotheses[0].text)  # the transcription string
```
### Input
This model accepts **16 kHz mono-channel audio (WAV files)** as input. Since it is equipped with its own preprocessor, which handles resampling, you may also provide audio at higher sampling rates.
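If you prefer to convert recordings to 16 kHz mono yourself before transcription, a minimal sketch using `numpy` and `scipy` (neither is required by the model, and the function below is illustrative, not part of this repository) could look like this:

```python
import numpy as np
from scipy.signal import resample_poly

def to_16k_mono(audio: np.ndarray, sr: int) -> np.ndarray:
    """Down-mix to mono and resample a waveform to 16 kHz."""
    if audio.ndim > 1:            # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != 16000:               # polyphase resampling by a rational factor
        g = np.gcd(sr, 16000)
        audio = resample_poly(audio, 16000 // g, sr // g)
    return audio.astype(np.float32)

# Example: a 1-second 440 Hz tone recorded at 44.1 kHz
t = np.linspace(0, 1, 44100, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t).astype(np.float32)
resampled = to_16k_mono(tone, 44100)
print(resampled.shape)  # (16000,)
```

The resampled array can then be written to a WAV file (for example with `soundfile`) and passed to `asr_model.transcribe`.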
### Output
This model returns the transcribed speech as a hypothesis object whose `text` attribute contains the transcription string for a given speech sample.
## Model Architecture
QuartzNet is a convolutional architecture, which consists of **1D time-channel separable convolutions** optimized for speech recognition. More information on QuartzNet can be found here: [QuartzNet Model](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#quartznet).
## Training
The NeMo toolkit was used to fine-tune this model for **62,976 steps**, starting from the `RobotsMali/stt-bm-quartznet15x5-v0` checkpoint. The fine-tuning code and configurations can be found at [RobotsMali-AI/bambara-asr](https://github.com/RobotsMali-AI/bambara-asr/).
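NeMo training scripts are driven by YAML configs; the fragment below sketches the general shape of a CTC fine-tuning config. It is illustrative only: the paths, batch size, learning rate, and epoch count are assumptions, not the values used for this model (see the linked repository for the actual configurations).

```yaml
# Illustrative fine-tuning config fragment -- values are assumptions
init_from_nemo_model: /path/to/stt-bm-quartznet15x5-v0.nemo
model:
  train_ds:
    manifest_filepath: /path/to/train_manifest.json
    sample_rate: 16000
    batch_size: 32
  validation_ds:
    manifest_filepath: /path/to/val_manifest.json
    sample_rate: 16000
  optim:
    name: novograd
    lr: 0.01
trainer:
  devices: 1
  max_epochs: 100
```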
## Dataset
This model was fine-tuned on a 100-hour pre-completion subset of the [African Next Voices](https://huggingface.co/datasets/RobotsMali/afvoices) dataset. You can reconstruct that subset with these [manifest files](https://github.com/RobotsMali-AI/bambara-asr/afvoices/pre-manifests).
## Performance
The performance of automatic speech recognition models is measured using Word Error Rate (WER) and Character Error Rate (CER), two edit-distance-based metrics.
| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ |
|---------------|----------|-----------------|-----------------|
| African Next Voices (afvoices) | CTC | 42.57 | 18.70 |
| Nyana Eval | CTC | 48.97 | 24.22 |
These are **greedy decoding scores, without an external language model**.
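For reference, WER and CER are both normalized Levenshtein distances, computed over words and characters respectively. The self-contained sketch below illustrates the computation (it is not the evaluation code used for this card, and the Bambara sample strings are only examples):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, with O(len(hyp)) memory."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100 * edit_distance(ref, hyp) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent."""
    return 100 * edit_distance(reference, hypothesis) / len(reference)

print(round(wer("i ni ce", "i ni se"), 2))  # 33.33 (1 of 3 words wrong)
```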
## License
This model is released under the **CC-BY-4.0** license. By using this model, you agree to the terms of the license.
---
Feel free to open a discussion on Hugging Face or [file an issue](https://github.com/RobotsMali-AI/bambara-asr/issues) on GitHub for help or contributions.