Improve model card: add pipeline tag, paper link, and sample usage

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +39 -443
README.md CHANGED
@@ -1,12 +1,13 @@
  ---
- license: apache-2.0
  language:
- - en
- - zh
  tags:
  - audio
  - automatic-speech-recognition
  - asr
  ---

  <div align="center">
@@ -19,279 +20,33 @@ A SOTA Industrial-Grade All-in-One ASR System
  </div>

  [[Code]](https://github.com/FireRedTeam/FireRedASR2S)
- [[Paper]](https://arxiv.org/pdf/2501.14350)
  [[Model]](https://huggingface.co/FireRedTeam)
  [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
  [[Demo]](https://huggingface.co/spaces/FireRedTeam/FireRedASR)

- FireRedASR2S is a state-of-the-art (SOTA), industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance:
- - **FireRedASR2**: Automatic Speech Recognition (ASR) supporting Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and singing lyrics recognition. 2.89% average CER on Mandarin (4 test sets), 11.55% on Chinese dialects (19 test sets), outperforming Doubao-ASR, Qwen3-ASR-1.7B, Fun-ASR, and Fun-ASR-Nano-2512. FireRedASR2-AED also supports word-level timestamps and confidence scores.
- - **FireRedVAD**: Voice Activity Detection (VAD) supporting speech/singing/music in 100+ languages. 97.57% F1, outperforming Silero-VAD, TEN-VAD, and FunASR-VAD. Supports non-streaming/streaming VAD and Audio Event Detection.
- - **FireRedLID**: Spoken Language Identification (LID) supporting 100+ languages and 20+ Chinese dialects/accents. 97.18% accuracy, outperforming Whisper and SpeechBrain-LID.
- - **FireRedPunc**: Punctuation Prediction (Punc) for Chinese and English. 78.90% average F1, outperforming FunASR-Punc (62.77%).
-
- *`2S`: `2`nd-generation FireRedASR, now expanded to an all-in-one ASR `S`ystem*

  ## 🔥 News
  - [2026.02.25] 🔥 We release **FireRedASR2-LLM model weights**. [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)
- - [2026.02.13] 🚀 Support TensorRT-LLM inference acceleration for FireRedASR2-AED (contributed by NVIDIA). Benchmark on AISHELL-1 test set shows **12.7x speedup** over PyTorch baseline (single H20).
- - [2026.02.12] 🔥 We release FireRedASR2S (FireRedASR2-AED, FireRedVAD, FireRedLID, and FireRedPunc) with **model weights and inference code**. Download links below. Technical report and finetuning code coming soon.
-
- ## Available Models and Languages
-
- |Model|Supported Languages & Dialects|Download|
- |:-------------:|:---------------------------------:|:----------:|
- |FireRedASR2-LLM| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)|
- |FireRedASR2-AED| Chinese (Mandarin and 20+ dialects/accents<sup>*</sup>), English, Code-Switching | [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-AED) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-AED/)|
- |FireRedVAD | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedVAD) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedVAD/)|
- |FireRedLID | 100+ languages, 20+ Chinese dialects/accents<sup>*</sup> | [🤗](https://huggingface.co/FireRedTeam/FireRedLID) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedLID/)|
- |FireRedPunc| Chinese, English | [🤗](https://huggingface.co/FireRedTeam/FireRedPunc) \| [🤖](https://www.modelscope.cn/models/xukaituo/FireRedPunc/)|
-
- <sup>*</sup>Supported Chinese dialects/accents: Cantonese (Hong Kong & Guangdong), Sichuan, Shanghai, Wu, Minnan, Anhui, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Liaoning, Ningxia, Shaanxi, Shanxi, Shandong, Tianjin, Yunnan, etc.
-
- ## Method
- ### FireRedASR2
- FireRedASR2 builds upon [FireRedASR](https://github.com/FireRedTeam/FireRedASR) with improved accuracy, designed to meet diverse application requirements for both superior performance and optimal efficiency. It comprises two variants:
- - **FireRedASR2-LLM**: Designed to achieve state-of-the-art performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities.
- - **FireRedASR2-AED**: Designed to balance high performance and computational efficiency, and to serve as an effective speech representation module in LLM-based speech models. It utilizes an Attention-based Encoder-Decoder (AED) architecture.
-
- ![Model](./assets/FireRedASR2_model.png)
-
- ### Other Modules
- - **FireRedVAD**: DFSMN-based non-streaming/streaming Voice Activity Detection and Audio Event Detection.
- - **FireRedLID**: FireRedASR2-based Spoken Language Identification. See [FireRedLID README](./fireredasr2s/fireredlid/README.md) for language details.
- - **FireRedPunc**: BERT-based Punctuation Prediction.
-
- ## Evaluation
- ### FireRedASR2
- Metrics: Character Error Rate (CER%) for Chinese and Word Error Rate (WER%) for English. Lower is better.
-
- We evaluate FireRedASR2 on 24 public test sets covering Mandarin, 20+ Chinese dialects/accents, and singing.
-
- - **Mandarin (4 test sets)**: 2.89% (LLM) / 3.05% (AED) average CER, outperforming Doubao-ASR (3.69%), Qwen3-ASR-1.7B (3.76%), Fun-ASR (4.16%) and Fun-ASR-Nano-2512 (4.55%).
- - **Dialects (19 test sets)**: 11.55% (LLM) / 11.67% (AED) average CER, outperforming Doubao-ASR (15.39%), Qwen3-ASR-1.7B (11.85%), Fun-ASR (12.76%) and Fun-ASR-Nano-2512 (15.07%).
-
- *Note: ws=WenetSpeech, md=MagicData, conv=Conversational, daily=Daily-use.*
-
- |ID|Testset\Model|FireRedASR2-LLM|FireRedASR2-AED|Doubao-ASR|Qwen3-ASR|Fun-ASR|Fun-ASR-Nano|
- |:--:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
- | |**Average CER<br>(All, 1-24)** |**9.67** |**9.80** |12.98 |10.12 |10.92 |12.81 |
- | |**Average CER<br>(Mandarin, 1-4)** |**2.89** |**3.05** |3.69 |3.76 |4.16 |4.55 |
- | |**Average CER<br>(Dialects, 5-23)** |**11.55**|**11.67**|15.39|11.85|12.76|15.07|
- |1 |aishell1 |0.64 |0.57 |1.52 |1.48 |1.64 |1.96 |
- |2 |aishell2 |2.15 |2.51 |2.77 |2.71 |2.38 |3.02 |
- |3 |ws-net |4.44 |4.57 |5.73 |4.97 |6.85 |6.93 |
- |4 |ws-meeting |4.32 |4.53 |4.74 |5.88 |5.78 |6.29 |
- |5 |kespeech |3.08 |3.60 |5.38 |5.10 |5.36 |7.66 |
- |6 |ws-yue-short |5.14 |5.15 |10.51|5.82 |7.34 |8.82 |
- |7 |ws-yue-long |8.71 |8.54 |11.39|8.85 |10.14|11.36|
- |8 |ws-chuan-easy |10.90|10.60|11.33|11.99|12.46|14.05|
- |9 |ws-chuan-hard |20.71|21.35|20.77|21.63|22.49|25.32|
- |10|md-heavy |7.42 |7.43 |7.69 |8.02 |9.13 |9.97 |
- |11|md-yue-conv |12.23|11.66|26.25|9.76 |33.71|15.68|
- |12|md-yue-daily |3.61 |3.35 |12.82|3.66 |2.69 |5.67 |
- |13|md-yue-vehicle |4.50 |4.83 |8.66 |4.28 |6.00 |7.04 |
- |14|md-chuan-conv |13.18|13.07|11.77|14.35|14.01|17.11|
- |15|md-chuan-daily |4.90 |5.17 |3.90 |4.93 |3.98 |5.95 |
- |16|md-shanghai-conv |28.70|27.02|45.15|29.77|25.49|37.08|
- |17|md-shanghai-daily |24.94|24.18|44.06|23.93|12.55|28.77|
- |18|md-wu |7.15 |7.14 |7.70 |7.57 |10.63|10.56|
- |19|md-zhengzhou-conv |10.20|10.65|9.83 |9.55 |10.85|13.09|
- |20|md-zhengzhou-daily|5.80 |6.26 |5.77 |5.88 |6.29 |8.18 |
- |21|md-wuhan |9.60 |10.81|9.94 |10.22|4.34 |8.70 |
- |22|md-tianjin |15.45|15.30|15.79|16.16|19.27|22.03|
- |23|md-changsha |23.18|25.64|23.76|23.70|25.66|29.23|
- |24|opencpop |1.12 |1.17 |4.36 |2.57 |3.05 |2.95 |
-
- Doubao-ASR (volc.seedasr.auc) was tested in early February 2026, and Fun-ASR in late November 2025. Our ASR training data does not include any Chinese dialect or accented speech data from MagicData.
- - Doubao-ASR (API): https://www.volcengine.com/docs/6561/1354868
- - Qwen3-ASR (1.7B): https://github.com/QwenLM/Qwen3-ASR
- - Fun-ASR (API): https://help.aliyun.com/zh/model-studio/recording-file-recognition
- - Fun-ASR-Nano-2512: https://huggingface.co/FunAudioLLM/Fun-ASR-Nano-2512
-
- ### FireRedVAD
- We evaluate FireRedVAD on FLEURS-VAD-102, a multilingual VAD benchmark covering 102 languages.
-
- FireRedVAD achieves SOTA performance, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.
-
- |Metric\Model|FireRedVAD|[Silero-VAD](https://github.com/snakers4/silero-vad)|[TEN-VAD](https://github.com/TEN-framework/ten-vad)|[FunASR-VAD](https://modelscope.cn/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch)|[WebRTC-VAD](https://github.com/wiseman/py-webrtcvad)|
- |:-------:|:-----:|:------:|:------:|:------:|:------:|
- |AUC-ROC↑ |**99.60**|97.99|97.81|- |- |
- |F1 score↑ |**97.57**|95.95|95.19|90.91|52.30|
- |False Alarm Rate↓ |**2.69** |9.41 |15.47|44.03|2.83 |
- |Miss Rate↓|3.62 |3.95 |2.95 |0.42 |64.15|
-
- <sup>*</sup>FLEURS-VAD-102: We randomly selected ~100 audio files per language from the [FLEURS test set](https://huggingface.co/datasets/google/fleurs), resulting in 9,443 audio files with manually annotated binary VAD labels (speech=1, silence=0). This VAD test set will be open-sourced soon.
-
- Note: FunASR-VAD achieves a low Miss Rate, but at the cost of a high False Alarm Rate (44.03%), indicating over-prediction of speech segments.
-
- ### FireRedLID
- Metric: Utterance-level LID Accuracy (%). Higher is better.
-
- We evaluate FireRedLID on multilingual and Chinese dialect benchmarks.
-
- FireRedLID achieves SOTA performance, outperforming Whisper, SpeechBrain-LID, and Dolphin.
-
- |Testset\Model|Languages|FireRedLID|[Whisper](https://github.com/openai/whisper)|[SpeechBrain](https://huggingface.co/speechbrain/lang-id-voxlingua107-ecapa)|[Dolphin](https://github.com/DataoceanAI/Dolphin)|
- |:-----------------:|:---------:|:---------:|:-----:|:---------:|:-----:|
- |FLEURS test |82 languages |**97.18** |79.41 |92.91 |-|
- |CommonVoice test |74 languages |**92.07** |80.81 |78.75 |-|
- |KeSpeech + MagicData|20+ Chinese dialects/accents |**88.47** |-|-|69.01|
-
- ### FireRedPunc
- Metric: Precision/Recall/F1 Score (%). Higher is better.
-
- We evaluate FireRedPunc on multi-domain Chinese and English benchmarks.
-
- FireRedPunc achieves SOTA performance, outperforming FunASR-Punc (CT-Transformer).
-
- |Testset\Model|#Sentences|FireRedPunc|[FunASR-Punc](https://www.modelscope.cn/models/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch)|
- |:------------------:|:---------:|:--------------:|:-----------------:|
- |Multi-domain Chinese| 88,644 |**82.84 / 83.08 / 82.96** | 77.27 / 74.03 / 75.62 |
- |Multi-domain English| 28,641 |**78.40 / 71.57 / 74.83** | 55.79 / 45.15 / 49.91 |
- |Average F1 Score | - |**78.90** | 62.77 |
-
- ## Quick Start
- ### Setup
- 1. Create a clean Python environment:
- ```bash
- $ conda create --name fireredasr2s python=3.10
- $ conda activate fireredasr2s
- $ git clone https://github.com/FireRedTeam/FireRedASR2S.git
- $ cd FireRedASR2S # or fireredasr2s
- ```
-
- 2. Install dependencies and set up PATH and PYTHONPATH:
- ```bash
- $ pip install -r requirements.txt
- $ export PATH=$PWD/fireredasr2s/:$PATH
- $ export PYTHONPATH=$PWD/:$PYTHONPATH
- ```
-
- 3. Download models:
- ```bash
- # Download via ModelScope (recommended for users in China)
- pip install -U modelscope
- modelscope download --model xukaituo/FireRedASR2-AED --local_dir ./pretrained_models/FireRedASR2-AED
- modelscope download --model xukaituo/FireRedVAD --local_dir ./pretrained_models/FireRedVAD
- modelscope download --model xukaituo/FireRedLID --local_dir ./pretrained_models/FireRedLID
- modelscope download --model xukaituo/FireRedPunc --local_dir ./pretrained_models/FireRedPunc
- modelscope download --model xukaituo/FireRedASR2-LLM --local_dir ./pretrained_models/FireRedASR2-LLM
-
- # Download via Hugging Face
- pip install -U "huggingface_hub[cli]"
- huggingface-cli download FireRedTeam/FireRedASR2-AED --local-dir ./pretrained_models/FireRedASR2-AED
- huggingface-cli download FireRedTeam/FireRedVAD --local-dir ./pretrained_models/FireRedVAD
- huggingface-cli download FireRedTeam/FireRedLID --local-dir ./pretrained_models/FireRedLID
- huggingface-cli download FireRedTeam/FireRedPunc --local-dir ./pretrained_models/FireRedPunc
- huggingface-cli download FireRedTeam/FireRedASR2-LLM --local-dir ./pretrained_models/FireRedASR2-LLM
- ```
-
- 4. Convert your audio to **16kHz 16-bit mono PCM** format if needed:
- ```bash
- $ ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>
- ```
-
- ### Script Usage
- ```bash
- $ cd examples_infer/asr_system
- $ bash inference_asr_system.sh
- ```
-
- ### Command-line Usage
- ```bash
- $ fireredasr2s-cli --help
- $ fireredasr2s-cli --wav_paths "assets/hello_zh.wav" "assets/hello_en.wav" --outdir output
- $ cat output/result.jsonl
- # {"uttid": "hello_zh", "text": "你好世界。", "sentences": [{"start_ms": 310, "end_ms": 1840, "text": "你好世界。", "asr_confidence": 0.875, "lang": "zh mandarin", "lang_confidence": 0.999}], "vad_segments_ms": [[310, 1840]], "dur_s": 2.32, "words": [{"start_ms": 490, "end_ms": 690, "text": "你"}, {"start_ms": 690, "end_ms": 1090, "text": "好"}, {"start_ms": 1090, "end_ms": 1330, "text": "世"}, {"start_ms": 1330, "end_ms": 1795, "text": "界"}], "wav_path": "assets/hello_zh.wav"}
- # {"uttid": "hello_en", "text": "Hello speech.", "sentences": [{"start_ms": 120, "end_ms": 1840, "text": "Hello speech.", "asr_confidence": 0.833, "lang": "en", "lang_confidence": 0.998}], "vad_segments_ms": [[120, 1840]], "dur_s": 2.24, "words": [{"start_ms": 340, "end_ms": 1020, "text": "hello"}, {"start_ms": 1020, "end_ms": 1666, "text": "speech"}], "wav_path": "assets/hello_en.wav"}
- ```
-
- ### Python API Usage
- ```python
- from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig
-
- asr_system_config = FireRedAsr2SystemConfig()  # Use default config
- asr_system = FireRedAsr2System(asr_system_config)
-
- result = asr_system.process("assets/hello_zh.wav")
- print(result)
- # {'uttid': 'tmpid', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [], 'wav_path': 'assets/hello_zh.wav'}
-
- result = asr_system.process("assets/hello_en.wav")
- print(result)
- # {'uttid': 'tmpid', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [], 'wav_path': 'assets/hello_en.wav'}
- ```
-
- ## Usage of Each Module
- The four components under `fireredasr2s`, i.e. `fireredasr2`, `fireredvad`, `fireredlid`, and `fireredpunc`, are self-contained and designed to work as standalone modules. You can use any of them independently without depending on the others. `FireRedVAD` and `FireRedLID` will also be open-sourced as standalone libraries in separate repositories.
-
- ### Script Usage
- ```bash
- # ASR
- $ cd examples_infer/asr
- $ bash inference_asr_aed.sh
- $ bash inference_asr_llm.sh
-
- # VAD & AED (Audio Event Detection)
- $ cd examples_infer/vad
- $ bash inference_vad.sh
- $ bash inference_streamvad.sh
- $ bash inference_aed.sh
-
- # LID
- $ cd examples_infer/lid
- $ bash inference_lid.sh
-
- # Punc
- $ cd examples_infer/punc
- $ bash inference_punc.sh
- ```
-
- ### Python API Usage
- Set up `PYTHONPATH` first: `export PYTHONPATH=$PWD/:$PYTHONPATH`
-
- #### ASR
  ```python
  from fireredasr2s.fireredasr2 import FireRedAsr2, FireRedAsr2Config

  batch_uttid = ["hello_zh", "hello_en"]
  batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

- # FireRedASR2-AED
- asr_config = FireRedAsr2Config(
- use_gpu=True,
- use_half=False,
- beam_size=3,
- nbest=1,
- decode_max_len=0,
- softmax_smoothing=1.25,
- aed_length_penalty=0.6,
- eos_penalty=1.0,
- return_timestamp=True
- )
- model = FireRedAsr2.from_pretrained("aed", "pretrained_models/FireRedASR2-AED", asr_config)
- results = model.transcribe(batch_uttid, batch_wav_path)
- print(results)
- # [{'uttid': 'hello_zh', 'text': '你好世界', 'confidence': 0.971, 'dur_s': 2.32, 'rtf': '0.0870', 'wav': 'assets/hello_zh.wav', 'timestamp': [('你', 0.42, 0.66), ('好', 0.66, 1.1), ('世', 1.1, 1.34), ('界', 1.34, 2.039)]}, {'uttid': 'hello_en', 'text': 'hello speech', 'confidence': 0.943, 'dur_s': 2.24, 'rtf': '0.0870', 'wav': 'assets/hello_en.wav', 'timestamp': [('hello', 0.34, 0.98), ('speech', 0.98, 1.766)]}]
-
- # FireRedASR2-LLM
  asr_config = FireRedAsr2Config(
  use_gpu=True,
  decode_min_len=0,
@@ -299,198 +54,39 @@ asr_config = FireRedAsr2Config(
  llm_length_penalty=0.0,
  temperature=1.0
  )
- model = FireRedAsr2.from_pretrained("llm", "pretrained_models/FireRedASR2-LLM", asr_config)
- results = model.transcribe(batch_uttid, batch_wav_path)
- print(results)
- # [{'uttid': 'hello_zh', 'text': '你好世界', 'rtf': '0.0681', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'text': 'hello speech', 'rtf': '0.0681', 'wav': 'assets/hello_en.wav'}]
- ```
-
- #### VAD
- ```python
- from fireredasr2s.fireredvad import FireRedVad, FireRedVadConfig
-
- vad_config = FireRedVadConfig(
- use_gpu=False,
- smooth_window_size=5,
- speech_threshold=0.4,
- min_speech_frame=20,
- max_speech_frame=2000,
- min_silence_frame=20,
- merge_silence_frame=0,
- extend_speech_frame=0,
- chunk_max_frame=30000)
- vad = FireRedVad.from_pretrained("pretrained_models/FireRedVAD/VAD", vad_config)
-
- result, probs = vad.detect("assets/hello_zh.wav")
-
- print(result)
- # {'dur': 2.32, 'timestamps': [(0.44, 1.82)], 'wav_path': 'assets/hello_zh.wav'}
- ```
-
- #### Stream VAD
- <details>
- <summary>Click to expand</summary>
-
- ```python
- from fireredasr2s.fireredvad import FireRedStreamVad, FireRedStreamVadConfig
-
- vad_config = FireRedStreamVadConfig(
- use_gpu=False,
- smooth_window_size=5,
- speech_threshold=0.4,
- pad_start_frame=5,
- min_speech_frame=8,
- max_speech_frame=2000,
- min_silence_frame=20,
- chunk_max_frame=30000)
- stream_vad = FireRedStreamVad.from_pretrained("pretrained_models/FireRedVAD/Stream-VAD", vad_config)
-
- frame_results, result = stream_vad.detect_full("assets/hello_zh.wav")
-
- print(result)
- # {'dur': 2.32, 'timestamps': [(0.46, 1.84)], 'wav_path': 'assets/hello_zh.wav'}
- ```
- </details>
-
- #### Audio Event Detection (AED)
- <details>
- <summary>Click to expand</summary>
-
- ```python
- from fireredasr2s.fireredvad import FireRedAed, FireRedAedConfig
-
- aed_config = FireRedAedConfig(
- use_gpu=False,
- smooth_window_size=5,
- speech_threshold=0.4,
- singing_threshold=0.5,
- music_threshold=0.5,
- min_event_frame=20,
- max_event_frame=2000,
- min_silence_frame=20,
- merge_silence_frame=0,
- extend_speech_frame=0,
- chunk_max_frame=30000)
- aed = FireRedAed.from_pretrained("pretrained_models/FireRedVAD/AED", aed_config)
-
- result, probs = aed.detect("assets/event.wav")
-
- print(result)
- # {'dur': 22.016, 'event2timestamps': {'speech': [(0.4, 3.56), (3.66, 9.08), (9.27, 9.77), (10.78, 21.76)], 'singing': [(1.79, 19.96), (19.97, 22.016)], 'music': [(0.09, 12.32), (12.33, 22.016)]}, 'event2ratio': {'speech': 0.848, 'singing': 0.905, 'music': 0.991}, 'wav_path': 'assets/event.wav'}
- ```
- </details>
-
- #### LID
- <details>
- <summary>Click to expand</summary>
-
- ```python
- from fireredasr2s.fireredlid import FireRedLid, FireRedLidConfig
-
- batch_uttid = ["hello_zh", "hello_en"]
- batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]
-
- config = FireRedLidConfig(use_gpu=True, use_half=False)
- model = FireRedLid.from_pretrained("pretrained_models/FireRedLID", config)
-
- results = model.process(batch_uttid, batch_wav_path)
- print(results)
- # [{'uttid': 'hello_zh', 'lang': 'zh mandarin', 'confidence': 0.996, 'dur_s': 2.32, 'rtf': '0.0741', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'lang': 'en', 'confidence': 0.996, 'dur_s': 2.24, 'rtf': '0.0741', 'wav': 'assets/hello_en.wav'}]
- ```
- </details>

- #### Punc
- <details>
- <summary>Click to expand</summary>
-
- ```python
- from fireredasr2s.fireredpunc.punc import FireRedPunc, FireRedPuncConfig
-
- config = FireRedPuncConfig(use_gpu=True)
- model = FireRedPunc.from_pretrained("pretrained_models/FireRedPunc", config)
-
- batch_text = ["你好世界", "Hello world"]
- results = model.process(batch_text)
  print(results)
- # [{'punc_text': '你好世界。', 'origin_text': '你好世界'}, {'punc_text': 'Hello world!', 'origin_text': 'Hello world'}]
  ```
- </details>
-
- #### ASR System
- ```python
- from fireredasr2s.fireredasr2 import FireRedAsr2Config
- from fireredasr2s.fireredlid import FireRedLidConfig
- from fireredasr2s.fireredpunc import FireRedPuncConfig
- from fireredasr2s.fireredvad import FireRedVadConfig
- from fireredasr2s import FireRedAsr2System, FireRedAsr2SystemConfig

- vad_config = FireRedVadConfig(
- use_gpu=False,
- smooth_window_size=5,
- speech_threshold=0.4,
- min_speech_frame=20,
- max_speech_frame=2000,
- min_silence_frame=20,
- merge_silence_frame=0,
- extend_speech_frame=0,
- chunk_max_frame=30000
- )
- lid_config = FireRedLidConfig(use_gpu=True, use_half=False)
- asr_config = FireRedAsr2Config(
- use_gpu=True,
- use_half=False,
- beam_size=3,
- nbest=1,
- decode_max_len=0,
- softmax_smoothing=1.25,
- aed_length_penalty=0.6,
- eos_penalty=1.0,
- return_timestamp=True
- )
- punc_config = FireRedPuncConfig(use_gpu=True)
-
- asr_system_config = FireRedAsr2SystemConfig(
- "pretrained_models/FireRedVAD/VAD",
- "pretrained_models/FireRedLID",
- "aed", "pretrained_models/FireRedASR2-AED",
- "pretrained_models/FireRedPunc",
- vad_config, lid_config, asr_config, punc_config,
- enable_vad=1, enable_lid=1, enable_punc=1
- )
- asr_system = FireRedAsr2System(asr_system_config)
-
- batch_uttid = ["hello_zh", "hello_en"]
- batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]
- for wav_path, uttid in zip(batch_wav_path, batch_uttid):
-     result = asr_system.process(wav_path, uttid)
-     print(result)
- # {'uttid': 'hello_zh', 'text': '你好世界。', 'sentences': [{'start_ms': 440, 'end_ms': 1820, 'text': '你好世界。', 'asr_confidence': 0.868, 'lang': 'zh mandarin', 'lang_confidence': 0.999}], 'vad_segments_ms': [(440, 1820)], 'dur_s': 2.32, 'words': [{'start_ms': 540, 'end_ms': 700, 'text': '你'}, {'start_ms': 700, 'end_ms': 1100, 'text': '好'}, {'start_ms': 1100, 'end_ms': 1300, 'text': '世'}, {'start_ms': 1300, 'end_ms': 1765, 'text': '界'}], 'wav_path': 'assets/hello_zh.wav'}
- # {'uttid': 'hello_en', 'text': 'Hello speech.', 'sentences': [{'start_ms': 260, 'end_ms': 1820, 'text': 'Hello speech.', 'asr_confidence': 0.933, 'lang': 'en', 'lang_confidence': 0.993}], 'vad_segments_ms': [(260, 1820)], 'dur_s': 2.24, 'words': [{'start_ms': 400, 'end_ms': 960, 'text': 'hello'}, {'start_ms': 960, 'end_ms': 1666, 'text': 'speech'}], 'wav_path': 'assets/hello_en.wav'}
- ```

- **Note:** `FireRedASR2S` code has only been tested on Linux Ubuntu 22.04; behavior on other Linux distributions or on Windows has not been verified.

  ## FAQ
  **Q: What audio format is supported?**
-
- 16kHz 16-bit mono PCM wav. Use ffmpeg to convert other formats: `ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>`
-
- **Q: What are the input length limitations of ASR models?**
-
- - FireRedASR2-AED supports audio input up to 60s. Input longer than 60s may cause hallucination issues, and input exceeding 200s will trigger positional encoding errors.
- - FireRedASR2-LLM supports audio input up to 40s; the behavior for longer input is untested. When performing batch beam search with FireRedASR2-LLM, even though attention masks are applied, it is recommended to keep input lengths within a batch similar: if utterance lengths differ significantly, shorter utterances may exhibit repetition issues. Sort your dataset by length or set `batch_size` to 1 to avoid this.
-
- ## Acknowledgements
- Thanks to the following open-source works:
- - [Qwen](https://huggingface.co/Qwen)
- - [WenetSpeech-Yue](https://github.com/ASLP-lab/WenetSpeech-Yue)
- - [WenetSpeech-Chuan](https://github.com/ASLP-lab/WenetSpeech-Chuan)
  ---
  language:
+ - en
+ - zh
+ license: apache-2.0
  tags:
  - audio
  - automatic-speech-recognition
  - asr
+ pipeline_tag: automatic-speech-recognition
  ---

  <div align="center">
 
  </div>

  [[Code]](https://github.com/FireRedTeam/FireRedASR2S)
+ [[Paper]](https://huggingface.co/papers/2603.10420)
  [[Model]](https://huggingface.co/FireRedTeam)
  [[Blog]](https://fireredteam.github.io/demos/firered_asr/)
  [[Demo]](https://huggingface.co/spaces/FireRedTeam/FireRedASR)

+ FireRedASR2-LLM is the 8B+ parameter variant of the FireRedASR2 system, designed to achieve state-of-the-art performance and enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model capabilities.

+ The model was introduced in the paper [FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System](https://huggingface.co/papers/2603.10420).

+ **Authors**: Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu.

  ## 🔥 News
+ - [2026.03.12] 🔥 We release the FireRedASR2S technical report. See [arXiv](https://arxiv.org/abs/2603.10420).
+ - [2026.03.05] 🚀 [vLLM](https://github.com/vllm-project/vllm/pull/35727) supports FireRedASR2-LLM.
  - [2026.02.25] 🔥 We release **FireRedASR2-LLM model weights**. [🤗](https://huggingface.co/FireRedTeam/FireRedASR2-LLM) [🤖](https://www.modelscope.cn/models/xukaituo/FireRedASR2-LLM/)

+ ## Sample Usage

+ To use this model, please refer to the installation and setup instructions in the [official GitHub repository](https://github.com/FireRedTeam/FireRedASR2S).
  ```python
  from fireredasr2s.fireredasr2 import FireRedAsr2, FireRedAsr2Config

  batch_uttid = ["hello_zh", "hello_en"]
  batch_wav_path = ["assets/hello_zh.wav", "assets/hello_en.wav"]

+ # FireRedASR2-LLM Configuration
  asr_config = FireRedAsr2Config(
  use_gpu=True,
  decode_min_len=0,
  llm_length_penalty=0.0,
  temperature=1.0
  )

+ # Load the model
+ model = FireRedAsr2.from_pretrained("llm", "FireRedTeam/FireRedASR2-LLM", asr_config)

+ # Transcribe
+ results = model.transcribe(batch_uttid, batch_wav_path)
  print(results)
+ # [{'uttid': 'hello_zh', 'text': '你好世界', 'rtf': '0.0681', 'wav': 'assets/hello_zh.wav'}, {'uttid': 'hello_en', 'text': 'hello speech', 'rtf': '0.0681', 'wav': 'assets/hello_en.wav'}]
  ```
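The sample above assumes the input is already 16 kHz 16-bit mono PCM WAV (the format the FAQ requires). As a minimal sketch of the preprocessing step, the ffmpeg invocation from the FAQ can be wrapped in a small helper; the `ffmpeg_to_16k_mono` name is illustrative and ffmpeg is assumed to be installed:

```python
import shlex

def ffmpeg_to_16k_mono(input_path: str, output_path: str) -> list:
    """Build the ffmpeg command that converts audio to 16 kHz 16-bit mono PCM WAV."""
    return [
        "ffmpeg", "-i", input_path,
        "-ar", "16000",          # resample to 16 kHz
        "-ac", "1",              # downmix to mono
        "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
        "-f", "wav",
        output_path,
    ]

cmd = ffmpeg_to_16k_mono("input.mp3", "input_16k.wav")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run`, which avoids shell quoting issues with arbitrary file names.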

+ ## Evaluation

+ FireRedASR2-LLM achieves state-of-the-art accuracy across Mandarin and various Chinese dialects.

+ | Metric | FireRedASR2-LLM | Doubao-ASR | Qwen3-ASR | Fun-ASR |
+ |:---:|:---:|:---:|:---:|:---:|
+ | **Avg CER (Mandarin, 4 sets)** | **2.89** | 3.69 | 3.76 | 4.16 |
+ | **Avg CER (Dialects, 19 sets)** | **11.55** | 15.39 | 11.85 | 12.76 |
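For reviewers cross-checking the condensed table: the Mandarin figure is the macro-average of the four per-testset CERs in the full evaluation table of the original card (aishell1, aishell2, ws-net, ws-meeting), which can be verified directly:

```python
# FireRedASR2-LLM per-testset CERs (%) from the full evaluation table
mandarin_cer = {
    "aishell1": 0.64,
    "aishell2": 2.15,
    "ws-net": 4.44,
    "ws-meeting": 4.32,
}

# Unweighted (macro) average over the four test sets
avg_cer = sum(mandarin_cer.values()) / len(mandarin_cer)
print(round(avg_cer, 2))  # 2.89, matching the "Avg CER (Mandarin, 4 sets)" entry
```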
  ## FAQ
  **Q: What audio format is supported?**
+ 16kHz 16-bit mono PCM wav. You can convert files using ffmpeg:
+ `ffmpeg -i <input_audio_path> -ar 16000 -ac 1 -acodec pcm_s16le -f wav <output_wav_path>`
+
+ **Q: What are the input length limitations?**
+ FireRedASR2-LLM supports audio input up to 40s.
+
+ ## Citation
+ ```bibtex
+ @article{xu2026fireredasr2s,
+   title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
+   author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
+   journal={arXiv preprint arXiv:2603.10420},
+   year={2026}
+ }
+ ```
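Since the FAQ caps FireRedASR2-LLM input at 40 s, a header-only duration pre-check for PCM WAV files can be sketched with Python's standard `wave` module; `wav_duration_s` and `within_llm_limit` are illustrative helpers, not part of the released API:

```python
import wave

def wav_duration_s(path: str) -> float:
    """Duration of a PCM WAV file in seconds, computed from its header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def within_llm_limit(path: str, limit_s: float = 40.0) -> bool:
    """True if the file fits within the 40 s FireRedASR2-LLM input limit."""
    return wav_duration_s(path) <= limit_s
```

Files that exceed the limit can be segmented first, e.g. with the VAD module's speech timestamps.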