Pisets: A Robust Speech Recognition System for Lectures and Interviews
Abstract
A three-component speech-to-text system combines Wav2Vec2, AST, and Whisper models with curriculum learning and uncertainty modeling to improve transcription accuracy and reduce hallucinations in Russian speech recognition.
This work presents a speech-to-text system, "Pisets", for scientists and journalists, based on a three-component architecture aimed at improving speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false-positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The use of curriculum learning methods and diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcription of long audio recordings across various acoustic conditions compared to WhisperX and the standard Whisper model. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
Community
This paper presents Pisets, an offline ASR system designed for long-form audio such as lectures and interviews, where standard end-to-end models often hallucinate or degrade.
Key idea:
Pisets uses a multi-stage pipeline instead of a single monolithic model (a rough sketch in code follows this list):
- Wav2Vec2-based speech detection to over-segment audio with high recall
- AST (Audio Spectrogram Transformer) to filter false positives
- Whisper for final transcription on cleaned segments
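Here is a minimal sketch of that pipeline in Python. It assumes candidate segments have already been cut from the long recording as 16 kHz mono audio; the checkpoint names are illustrative stand-ins (the actual fine-tuned Pisets components live in the authors' repository), not the exact models from the paper.

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Stage 1: Wav2Vec2 CTC model used as a high-recall speech detector.
# "bond005/wav2vec2-large-ru-golos" is an assumed illustrative checkpoint.
detector = pipeline("automatic-speech-recognition",
                    model="bond005/wav2vec2-large-ru-golos", device=device)

# Stage 2: AST audio classifier to reject non-speech false positives.
ast_filter = pipeline("audio-classification",
                      model="MIT/ast-finetuned-audioset-10-10-0.4593", device=device)

# Stage 3: Whisper for final transcription of the accepted segments.
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v3", device=device)

def transcribe_segments(segments):
    """segments: candidate speech chunks as 16 kHz mono float32 numpy arrays."""
    pieces = []
    for audio in segments:
        # Gate 1: keep only segments where the CTC model decodes some text.
        if not detector(audio)["text"].strip():
            continue
        # Gate 2: keep only segments whose top AST label looks like speech.
        top = ast_filter(audio, top_k=1)[0]
        if "speech" not in top["label"].lower():
            continue
        # Final pass: transcribe the cleaned segment with Whisper.
        pieces.append(whisper(audio)["text"].strip())
    return " ".join(pieces)
```

The design point is that Whisper only ever sees segments that two cheaper models have already vouched for, which is what suppresses its tendency to hallucinate text on silence, music, or noise.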
Fine-tuning matters:
Rather than relying purely on off-the-shelf models, the authors fine-tune their own components, especially the speech detection and filtering stages, using curriculum learning and diverse real-world data. This targeted fine-tuning is key to reducing hallucinations and improving robustness on noisy, long recordings (a toy curriculum schedule is sketched below).
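For readers unfamiliar with the technique, here is a toy sketch of curriculum scheduling, assuming each training example carries a difficulty proxy such as duration or a weak baseline's WER; the names are illustrative and the paper's exact recipe is not reproduced.

```python
def curriculum_batches(examples, difficulty, num_stages=3, batch_size=8):
    """Yield batches from a gradually growing pool: stage k trains on the
    easiest k/num_stages fraction, so the model sees clean, easy speech
    first and noisy, hard speech only in later stages."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        pool = ordered[: max(1, len(ordered) * stage // num_stages)]
        for start in range(0, len(pool), batch_size):
            yield pool[start:start + batch_size]

# Example: treat shorter utterances as easier.
# batches = curriculum_batches(dataset, difficulty=lambda ex: ex["duration"])
```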
Why it’s interesting:
- Shows that careful fine-tuning of intermediate models can outperform larger end-to-end setups
- Emphasizes pipeline design + training strategy over simply scaling model size
- Evaluated with both WER and semantic metrics, highlighting transcription quality beyond surface accuracy (a minimal example of both metric types follows this list)
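As a concrete illustration of the two metric families, here is a minimal sketch assuming the jiwer and sentence-transformers packages; the embedding model is an illustrative choice, not necessarily the one used in the paper.

```python
from jiwer import wer
from sentence_transformers import SentenceTransformer, util

reference = "the lecture covers deep learning for speech recognition"
hypothesis = "the lecture covers deep learning for speech recognitions"

# Surface accuracy: word error rate over aligned word sequences.
print("WER:", wer(reference, hypothesis))

# Semantic quality: cosine similarity of sentence embeddings, which tolerates
# small surface errors that leave the meaning intact.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
ref_emb, hyp_emb = encoder.encode([reference, hypothesis], convert_to_tensor=True)
print("semantic similarity:", util.cos_sim(ref_emb, hyp_emb).item())
```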
Takeaway:
Pisets argues that reliable ASR for real-world, long-form audio still benefits from modular systems and task-specific fine-tuning, rather than relying solely on large general-purpose models.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Stuttering-Aware Automatic Speech Recognition for Indonesian Language (2026)
- Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition (2025)
- SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition (2026)
- VIBEVOICE-ASR Technical Report (2026)
- Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition (2025)
- Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers (2026)
- SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models (2025)
