Wenhui Wang committed
Commit · abae949
Parent(s): fc60956

update README.md
README.md CHANGED

@@ -14,7 +14,7 @@ library_name: transformers

VibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**. It can be used to build realtime TTS services, narrate live data streams, and let different LLMs start speaking from their very first tokens (plug in your preferred model) long before a full answer is generated. It produces initial audible speech in **~300 ms** (hardware dependent).

- [▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc)
+ [▶️ Watch demo video](https://github.com/user-attachments/assets/0901d274-f6ae-46ef-a0fd-3c4fba4f76dc) (Launch your own realtime demo via the websocket example in [Usage](https://github.com/microsoft/VibeVoice/blob/main/docs/vibevoice-realtime-0.5b.md#usage-1-launch-real-time-websocket-demo))

The model uses an interleaved, windowed design: it incrementally encodes incoming text chunks while, in parallel, continuing diffusion-based acoustic latent generation from prior context. Unlike the full multi-speaker long-form variants, this streaming model removes the semantic tokenizer and relies solely on an efficient acoustic tokenizer operating at an ultra-low frame rate (7.5 Hz).

@@ -123,4 +123,4 @@ Users are responsible for sourcing their datasets legally. This may include secu

## Contact

This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.

- If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
+ If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
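For context on the websocket demo referenced in the changed line, a streaming client could look like the minimal sketch below. This is an illustration only: the `ws://localhost:8000/tts` URL, the plain-text message format, and the empty-string end-of-text signal are assumptions made here for the example; the real protocol is whatever the websocket example linked in the README implements.

```python
# Hypothetical streaming client sketch (not the VibeVoice demo's actual protocol).
import asyncio
import websockets  # pip install websockets

async def stream_tts(text_chunks, url="ws://localhost:8000/tts"):  # URL is an assumption
    audio = bytearray()
    async with websockets.connect(url) as ws:
        for chunk in text_chunks:
            await ws.send(chunk)   # feed text incrementally, e.g. tokens from an LLM
        await ws.send("")          # assumed end-of-text marker for this sketch
        try:
            while True:
                frame = await asyncio.wait_for(ws.recv(), timeout=5.0)
                if isinstance(frame, bytes):
                    audio.extend(frame)  # collect audio bytes as they stream back
        except (asyncio.TimeoutError, websockets.ConnectionClosed):
            pass  # server finished or went quiet
    return bytes(audio)

if __name__ == "__main__":
    chunks = ["Hello, ", "this text arrives ", "one piece at a time."]
    pcm = asyncio.run(stream_tts(chunks))
    print(f"received {len(pcm)} bytes of audio")
```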
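The interleaved, windowed design described in the README's second context paragraph can be pictured as a loop that alternates incremental text encoding with continued acoustic latent generation, decoding new audio frames as soon as they exist. The sketch below is illustrative only; `encode_text_chunk`, `continue_diffusion`, and `decode_latents` are placeholder names standing in for the text encoder, the diffusion head, and the acoustic tokenizer, not the actual VibeVoice API.

```python
# Illustration-only loop for interleaved streaming generation (placeholder functions).
from typing import Iterable, List

FRAME_RATE_HZ = 7.5  # acoustic tokenizer frame rate quoted in the model card

def encode_text_chunk(chunk: str) -> List[float]:
    """Placeholder for the incremental text encoder."""
    return [float(len(chunk))]

def continue_diffusion(text_ctx: List[float], latents: List[float]) -> List[float]:
    """Placeholder for the diffusion step: extends the acoustic latent sequence
    conditioned on all text encoded so far plus prior acoustic context."""
    return latents + [sum(text_ctx)]

def decode_latents(new_latents: List[float]) -> bytes:
    """Placeholder for the acoustic tokenizer's decoder (latents -> audio)."""
    return bytes(len(new_latents))

def interleaved_generate(text_stream: Iterable[str]) -> bytes:
    """Interleave text encoding with latent generation so audio can start
    long before the full text has arrived."""
    text_ctx: List[float] = []
    latents: List[float] = []
    audio = bytearray()
    for chunk in text_stream:
        text_ctx += encode_text_chunk(chunk)              # 1) encode newly arrived text
        prev = len(latents)
        latents = continue_diffusion(text_ctx, latents)   # 2) extend acoustic latents
        audio += decode_latents(latents[prev:])           # 3) decode only the new frames
    return bytes(audio)

print(len(interleaved_generate(["Hello, ", "streaming ", "world."])))
```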