Pisets: A Robust Speech Recognition System for Lectures and Interviews
Abstract
A three-component speech-to-text system combines Wav2Vec2, AST, and Whisper models with curriculum learning and uncertainty modeling to improve transcription accuracy and reduce hallucinations in Russian speech recognition.
This work presents a speech-to-text system, "Pisets", for scientists and journalists, based on a three-component architecture aimed at improving speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false-positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The use of curriculum learning methods and diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcription of long audio recordings across various acoustic conditions compared to WhisperX and the standard Whisper model. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
Community
This paper presents Pisets, an offline ASR system designed for long-form audio such as lectures and interviews, where standard end-to-end models often hallucinate or degrade.
Key idea:
Pisets uses a multi-stage pipeline instead of a single monolithic model (a rough sketch in code follows this list):
- Wav2Vec2-based speech detection to over-segment audio with high recall
- AST (Audio Spectrogram Transformer) to filter false positives
- Whisper for final transcription on cleaned segments
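Here is a minimal sketch of that pipeline in Python. It assumes candidate segments have already been cut from the long recording as 16 kHz mono audio; the checkpoint names are illustrative stand-ins (the actual fine-tuned Pisets components live in the authors' repository), not the exact models from the paper.

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# Stage 1: Wav2Vec2 CTC model used as a high-recall speech detector.
# "bond005/wav2vec2-large-ru-golos" is an assumed illustrative checkpoint.
detector = pipeline("automatic-speech-recognition",
                    model="bond005/wav2vec2-large-ru-golos", device=device)

# Stage 2: AST audio classifier to reject non-speech false positives.
ast_filter = pipeline("audio-classification",
                      model="MIT/ast-finetuned-audioset-10-10-0.4593", device=device)

# Stage 3: Whisper for final transcription of the accepted segments.
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v3", device=device)

def transcribe_segments(segments):
    """segments: candidate speech chunks as 16 kHz mono float32 numpy arrays."""
    pieces = []
    for audio in segments:
        # Gate 1: keep only segments where the CTC model decodes some text.
        if not detector(audio)["text"].strip():
            continue
        # Gate 2: keep only segments whose top AST label looks like speech.
        top = ast_filter(audio, top_k=1)[0]
        if "speech" not in top["label"].lower():
            continue
        # Final pass: transcribe the cleaned segment with Whisper.
        pieces.append(whisper(audio)["text"].strip())
    return " ".join(pieces)
```

The design point is that Whisper only ever sees segments that two cheaper models have already vouched for, which is what suppresses its tendency to hallucinate text on silence, music, or noise.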
Fine-tuning matters:
Rather than relying purely on off-the-shelf models, the authors fine-tune their own components, especially the speech detection and filtering stages, using curriculum learning and diverse real-world data. This targeted fine-tuning is key to reducing hallucinations and improving robustness on noisy, long recordings (a toy curriculum schedule is sketched below).
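For readers unfamiliar with the technique, here is a toy sketch of curriculum scheduling, assuming each training example carries a difficulty proxy such as duration or a weak baseline's WER; the names are illustrative and the paper's exact recipe is not reproduced.

```python
def curriculum_batches(examples, difficulty, num_stages=3, batch_size=8):
    """Yield batches from a gradually growing pool: stage k trains on the
    easiest k/num_stages fraction, so the model sees clean, easy speech
    first and noisy, hard speech only in later stages."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        pool = ordered[: max(1, len(ordered) * stage // num_stages)]
        for start in range(0, len(pool), batch_size):
            yield pool[start:start + batch_size]

# Example: treat shorter utterances as easier.
# batches = curriculum_batches(dataset, difficulty=lambda ex: ex["duration"])
```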
Why it’s interesting:
- Shows that careful fine-tuning of intermediate models can outperform larger end-to-end setups
- Emphasizes pipeline design + training strategy over simply scaling model size
- Evaluated with both WER and semantic metrics, highlighting transcription quality beyond surface accuracy (a minimal example of both metric types follows this list)
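As a concrete illustration of the two metric families, here is a minimal sketch assuming the jiwer and sentence-transformers packages; the embedding model is an illustrative choice, not necessarily the one used in the paper.

```python
from jiwer import wer
from sentence_transformers import SentenceTransformer, util

reference = "the lecture covers deep learning for speech recognition"
hypothesis = "the lecture covers deep learning for speech recognitions"

# Surface accuracy: word error rate over aligned word sequences.
print("WER:", wer(reference, hypothesis))

# Semantic quality: cosine similarity of sentence embeddings, which tolerates
# small surface errors that leave the meaning intact.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
ref_emb, hyp_emb = encoder.encode([reference, hypothesis], convert_to_tensor=True)
print("semantic similarity:", util.cos_sim(ref_emb, hyp_emb).item())
```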
Takeaway:
Pisets argues that reliable ASR for real-world, long-form audio still benefits from modular systems and task-specific fine-tuning, rather than relying solely on large general-purpose models.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Stuttering-Aware Automatic Speech Recognition for Indonesian Language (2026)
- Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition (2025)
- SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition (2026)
- VIBEVOICE-ASR Technical Report (2026)
- Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition (2025)
- Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers (2026)
- SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models (2025)
