End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions
Abstract
A unified end-to-end framework extends the Whisper architecture to jointly model automatic speech recognition and speaker diarization for child-adult interactions, improving transcription accuracy and scalability.
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and demonstrating competitive diarization accuracy across both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available
Community
Accurate transcription and speaker diarization of child–adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and automatic speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder–decoder architecture to jointly model ASR
and child–adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression
for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and demonstrating competitive diarization accuracy across both Whispersmall and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child–adult interactions at scale. The code and model weights are publicly available: https://github.com/usc-sail/joint-asr-diarization-child-adult
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding (2026)
- MOSS Transcribe Diarize Technical Report (2026)
- Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs (2026)
- JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis (2025)
- VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance (2025)
- TellWhisper: Tell Whisper Who Speaks When (2026)
- SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/end-to-end-joint-asr-and-speaker-role-diarization-with-child-adult-interactions-9719-da3bd40e
- Executive Summary
- Detailed Breakdown
- Practical Applications
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper