It seems that the model does not support overlapped speech recognition?

#5
by scutrandom - opened

I have tried several audio clips and found that this model struggles to recognize overlapping speech.

Hello, you might consider trying SE-DiCoW, which is specifically designed to handle overlapping speech.
We recently released a new version here: https://huggingface.co/BUT-FIT/SE-DiCoW
The code and instructions are available at: https://github.com/BUTSpeechFIT/DiCoW and https://github.com/BUTSpeechFIT/TS-ASR-Whisper/tree/se_dicow
It outperforms this model on several benchmarks.

Thanks, I will try it. By the way, how many languages are supported?

90+, all that are supported by Whisper v3-turbo

@scutrandom you can also try our demo at https://pcspeech-demo.fit.vutbr.cz/dicow/

Hi, I've tested the demo and overall it works quite well. However, when I tried it with Chinese audio, some sentences were incorrectly transcribed in English. Is there a way to configure the model to prevent this or lock it to Chinese?

Glad to hear that. You can enforce the language via forced_decoder_ids (see here: https://github.com/BUTSpeechFIT/TS-ASR-Whisper/blob/e869be0e7b70d1600d777041bd95d99ad54bc1ed/src/data/collators.py#L171). I can add that feature sometime this weekend, but feel free to open a PR if you want to try it yourself.
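For anyone who wants to try this before the feature lands, here is a rough sketch of what `forced_decoder_ids` looks like for Whisper-style decoders: a list of `(position, token_id)` pairs that pin the language and task tokens right after the start-of-transcript token. The helper name `build_forced_decoder_ids` and the token ids in `STUB_VOCAB` are made up for illustration; real ids must come from the model's tokenizer, and how the list is passed into DiCoW is an assumption that depends on the collator linked above.

```python
from typing import Callable, List, Tuple


def build_forced_decoder_ids(
    token_to_id: Callable[[str], int],
    language: str = "zh",
    task: str = "transcribe",
    timestamps: bool = False,
) -> List[Tuple[int, int]]:
    """Build Whisper-style forced decoder ids (hypothetical helper).

    Position 0 is the start-of-transcript token, so forced tokens
    begin at position 1: language, then task, then optionally
    the no-timestamps marker.
    """
    tokens = [f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return [(pos, token_to_id(tok)) for pos, tok in enumerate(tokens, start=1)]


# Placeholder vocabulary: these ids are NOT the real Whisper ids.
STUB_VOCAB = {
    "<|zh|>": 101,
    "<|en|>": 102,
    "<|transcribe|>": 201,
    "<|translate|>": 202,
    "<|notimestamps|>": 301,
}

forced = build_forced_decoder_ids(STUB_VOCAB.__getitem__, language="zh")
print(forced)  # [(1, 101), (2, 201), (3, 301)]
```

With a real tokenizer you would swap `STUB_VOCAB.__getitem__` for something like `tokenizer.convert_tokens_to_ids`. In Hugging Face `transformers`, `WhisperProcessor.get_decoder_prompt_ids(language=..., task=...)` returns a list of the same shape, so that may be the simpler route if it works with this model.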
