stt-bm-quartznet15x5-v2 / README.md

metadata

language:
  - bm
library_name: nemo
datasets:
  - RobotsMali/afvoices
thumbnail: null
tags:
  - automatic-speech-recognition
  - speech
  - audio
  - CTC
  - QuartzNet
  - pytorch
  - Bambara
  - NeMo
license: cc-by-4.0
base_model: RobotsMali/stt-bm-quartznet15x5-v0
model-index:
  - name: stt-bm-quartznet15x5-v2
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: African Next Voices
          type: RobotsMali/afvoices
          split: test
          args:
            language: bm
        metrics:
          - name: Test WER
            type: wer
            value: 42.57205678504852
          - name: Test CER
            type: cer
            value: 18.708413949318107
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Nyana Eval
          type: RobotsMali/nyana-eval
          split: test
          args:
            language: bm
        metrics:
          - name: Test WER
            type: wer
            value: 48.97
          - name: Test CER
            type: cer
            value: 24.22
metrics:
  - wer
  - cer
pipeline_tag: automatic-speech-recognition

QuartzNet 15x5 CTC Series

stt-bm-quartznet15x5-v2 is a fine-tuned version of RobotsMali/stt-bm-quartznet15x5-v0. This model does not produce punctuation or capitalization. It uses a character-level encoding scheme and transcribes text in the standard character set found in its training data.

The model was fine-tuned using NVIDIA NeMo and is trained with CTC (Connectionist Temporal Classification) Loss.

🚨 Important Note

This model, along with its associated resources, is part of an ongoing research effort; improvements and refinements are expected in future versions. Users should be aware that:

The model may not generalize well across all speaking conditions and dialects.
Community feedback is welcome, and contributions are encouraged to refine the model further.

NVIDIA NeMo: Training

To fine-tune or use the model, install NVIDIA NeMo. We recommend installing it after setting up the latest PyTorch version.

pip install "nemo-toolkit[asr]"

How to Use This Model

Load Model with NeMo

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="RobotsMali/stt-bm-quartznet15x5-v2")

Transcribe Audio

# Assuming you have a test audio file named sample_audio.wav
asr_model.transcribe(['sample_audio.wav'])

Input

This model expects 16 kHz mono-channel audio (WAV files) as input. Its built-in preprocessor handles resampling, so you may also provide audio recorded at higher sampling rates.
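To check a file against the format described above before transcribing it, one option is to read its header with Python's standard `wave` module. This is a minimal sketch: the helper names `describe_wav` and `matches_expected_input` are illustrative and not part of NeMo, and the `wave` module only reads uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, n_channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def matches_expected_input(path, expected_rate=16000, expected_channels=1):
    """Return True if the file is already 16 kHz mono, the format named above."""
    rate, channels = describe_wav(path)
    return rate == expected_rate and channels == expected_channels
```

A file that fails this check is not necessarily unusable, since the preprocessor is described as resampling; the check only tells you whether the file already matches the native input format.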

Output

For each speech sample, this model returns a hypothesis object whose text attribute contains the transcription string.
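A small helper can collect the transcription strings from the returned list. This is a sketch under assumptions: `extract_texts` is a hypothetical name, and the fallback to plain strings is there because some NeMo releases are assumed to return strings rather than hypothesis objects.

```python
def extract_texts(results):
    """Collect transcription strings from a list of transcribe() results."""
    texts = []
    for item in results:
        # Hypothesis-like objects expose a .text attribute;
        # anything else is treated as an already-decoded string (assumption).
        texts.append(item.text if hasattr(item, "text") else str(item))
    return texts


# Intended usage, assuming asr_model was loaded as shown earlier:
# texts = extract_texts(asr_model.transcribe(['sample_audio.wav']))
```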

Model Architecture

QuartzNet is a convolutional architecture, which consists of 1D time-channel separable convolutions optimized for speech recognition. More information on QuartzNet can be found here: QuartzNet Model.
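To illustrate why time-channel separable convolutions are compact, the sketch below compares weight counts for a regular 1D convolution and a separable one, that is, a depthwise convolution over time followed by a pointwise (1x1) convolution across channels. The function names are illustrative, biases and normalization layers are ignored, and the kernel size and channel width in the usage note are example values rather than a statement of this checkpoint's configuration.

```python
def conv1d_weights(kernel_size, in_channels, out_channels):
    """Weight count of a regular 1D convolution (no bias)."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_weights(kernel_size, in_channels, out_channels):
    """Weight count of a depthwise + pointwise 1D convolution (no bias)."""
    depthwise = kernel_size * in_channels   # one temporal filter per input channel
    pointwise = in_channels * out_channels  # 1x1 convolution mixing channels
    return depthwise + pointwise
```

With a kernel of 33 and 256 channels in and out, `conv1d_weights(33, 256, 256)` gives 2,162,688 weights, while `separable_conv1d_weights(33, 256, 256)` gives 73,984.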

Training

The NeMo toolkit was used to fine-tune this model for 62,976 steps, starting from the RobotsMali/stt-bm-quartznet15x5-v0 model. The fine-tuning code and configurations can be found at RobotsMali-AI/bambara-asr.

Dataset

This model was fine-tuned on a 100-hour pre-completion subset of the African Next Voices dataset. You can reconstitute that subset with these manifest files.
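Once a manifest is downloaded, its total audio duration can be checked with a few lines of Python. This sketch assumes the usual NeMo manifest layout of one JSON object per line with a `duration` field in seconds; `total_hours` is a hypothetical helper and the manifest path is whatever file you obtained.

```python
import json


def total_hours(manifest_path):
    """Sum the 'duration' field (seconds) of a JSON-lines manifest, in hours."""
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            total_seconds += float(json.loads(line)["duration"])
    return total_seconds / 3600.0
```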

Performance

The performance of Automatic Speech Recognition models is measured using Word Error Rate (WER, %) and Character Error Rate (CER, %), two edit-distance metrics.
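Both metrics divide the Levenshtein distance between reference and hypothesis by the reference length, counted in words for WER and in characters for CER. The sketch below computes them for a single utterance with a non-empty reference; it is an illustration of the definitions, not the evaluation script behind the reported numbers, and how those numbers were aggregated over the test sets is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent; assumes a non-empty reference."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; assumes a non-empty reference."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, one substituted word in a four-word reference gives a WER of 25%.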

| Benchmark | Decoding | WER (%) ↓ | CER (%) ↓ |
|---|---|---|---|
| African Next Voices (afvoices) | CTC | 42.57 | 18.71 |
| Nyana Eval | CTC | 48.97 | 24.22 |

These numbers were obtained with greedy CTC decoding, without an external language model.
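Greedy CTC decoding takes the most likely token at each frame, collapses consecutive repeats, and drops the blank token. The sketch below shows that collapse step on per-frame token IDs. The function name, the toy vocabulary, and the choice of the blank index are assumptions for illustration; NeMo performs this step internally when transcribing.

```python
def ctc_greedy_decode(frame_ids, vocabulary, blank_id):
    """Collapse repeated frame predictions and remove blanks."""
    decoded = []
    previous = blank_id
    for token_id in frame_ids:
        # Emit a character only when the prediction changes and is not blank.
        if token_id != previous and token_id != blank_id:
            decoded.append(vocabulary[token_id])
        previous = token_id
    return "".join(decoded)


# Toy example: three-symbol vocabulary, blank placed after the last symbol.
toy_vocabulary = ["a", "b", " "]
toy_frames = [0, 0, 3, 0, 1, 1, 3]
decoded_text = ctc_greedy_decode(toy_frames, toy_vocabulary, blank_id=3)
```

The blank between the two runs of `a` is what lets a doubled letter survive the collapse, so `decoded_text` is `"aab"`.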

License

This model is released under the CC-BY-4.0 license. By using this model, you agree to the terms of the license.

Feel free to open a discussion on Hugging Face or file an issue on GitHub for help or contributions.