arxiv:2601.14046

PRiSM: Benchmarking Phone Realization in Speech Models

Published on Jan 20

Submitted by Shikhar Bharadwaj on Jan 21

ChangeLing Lab

Authors: Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim, Kwanghee Choi, Eunjung Yeo, Kalvin Chang

Abstract

The PRiSM benchmark evaluates phonetic perception in speech models through standardized transcription-based metrics and downstream applications in clinical, educational, and multilingual domains.

AI-generated summary

Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: https://github.com/changelinglab/prism.
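To make "transcription-based evaluation" concrete, here is a minimal sketch of phone error rate (PER), a common metric for phone recognition: the edit distance between reference and predicted phone sequences, divided by the reference length. This page does not say which metric PRiSM uses, so treating PER as representative is an assumption. The function names and the example phone sequences are hypothetical and are not taken from the PRiSM codebase.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length (illustrative only)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution (æ -> a) and one insertion (s).
ref = ["k", "æ", "t"]
hyp = ["k", "a", "t", "s"]
per = phone_error_rate(ref, hyp)
print(f"PER = {per:.3f}")
```

A score of this kind captures only surface-level transcription accuracy, which is the limitation the abstract says the extrinsic evaluations are meant to address.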

Community

shikhar7ssu

Paper author Paper submitter about 16 hours ago

Main take-aways

PRiSM is the first fully open benchmark that evaluates phone recognition (PR) systems on both intrinsic (phone transcription) and extrinsic (downstream) tasks, across 12 datasets covering clinical, L2-learning, and multilingual settings. We find that Large Audio Language Models still lag behind specialized PR models on these tasks.
Because intrinsic phone recognition capability is not fully indicative of performance in extrinsic settings, we design transcript-based and representation-based probes that allow exhaustive analysis, interpretability, and fair comparison.
Language exposure > data size: multilingual training on broad, diverse data matters more for cross-lingual generalization than sheer data size.

Code, prompts and data are released under permissive licences.
