NariLabs committed on
Commit a43a78d · verified · 1 Parent(s): 88d8a0e

Update README.md

Files changed (1)
  1. README.md +80 -38
README.md CHANGED
@@ -6,57 +6,99 @@ pipeline_tag: text-to-speech
 ---
 # Dia2-1B
 
-Dia2-1B is the smaller, faster variant of the Dia2 streaming TTS model family. This
-repository ships the inference artifacts consumed by the open-source `dia2`
-runtime.
-
-## Contents
-- `config.json` — consumable by `dia2.config.load_config` with
-  `runtime.max_context_steps = 1500`.
-- `model.safetensors` — FP32 weights for decoder/depformer/heads.
-- Tokenizer bundle (`tokenizer.json`, `tokenizer_config.json`,
-  `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
-- `dia2_assets.json` manifest pointing Dia2 at this tokenizer and Mimi
-  (`kyutai/mimi`).
-
-## Quickstart
-```bash
-git clone https://github.com/nari-labs/dia2.git
-cd dia2
-uv sync
-uv run -m dia2.cli \
-  --hf nari-labs/Dia2-1B \
-  --input input.txt \
-  --dtype bfloat16 \
-  --cfg 6.0 --temperature 0.8 \
-  --cuda-graph --verbose \
-  output.wav
-```
-Add `--prefix-speaker-1/2` for voice prompts or `--include-prefix` to keep the
-warmup audio in the decoded waveform.
-
-## Python API
-```python
-from dia2 import Dia2, GenerationConfig, SamplingConfig
-
-dia = Dia2.from_repo("nari-labs/Dia2-1B", device="cuda", dtype="bfloat16")
-config = GenerationConfig(
-    cfg_scale=6.0,
-    audio=SamplingConfig(temperature=0.8, top_k=50),
-    use_cuda_graph=True,
-)
-result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
-```
-Generation stops at EOS or after the config-driven `max_context_steps` (1500).
-
-## Training Notes
-Dia2-1B uses the same RQ-transformer design as the 2B model. It was trained for 550k steps (batch size 256, 120s crops, 20% unconditional CFG) on ~800k
-hours of English dialogue/monologue using TPU v4-64 provided by the [TPU Research Cloud](https://sites.research.google/trc/about/).
-
-## Safety
-Dia2 models can sound realistic. Do **not** impersonate private individuals or
-produce deceptive or malicious content. Always obtain consent for voice cloning
-and comply with applicable laws and platform policies. We are not responsible
-for any misuse and firmly oppose any unethical usage of this technology.
-
-**Authors**: Toby Kim, Jay Sung, and the Nari Labs team.
+<div align="center">
+<a href="https://huggingface.co/nari-labs/Dia2-2B"><img src="https://img.shields.io/badge/HF%20Repo-Dia2--2B-orange?style=for-the-badge"></a>
+<a href="https://discord.gg/bJq6vjRRKv"><img src="https://img.shields.io/badge/Discord-Join%20Chat-7289DA?logo=discord&style=for-the-badge"></a>
+<a href="https://github.com/nari-labs/dia2/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=for-the-badge"></a>
+</div>
+
+**Dia2** is a **streaming dialogue TTS model** created by Nari Labs.
+
+The model does not need the entire text to produce audio: it can start generating as soon as the first few words are given as input. You can also condition the output on audio, enabling natural conversations in real time.
+
+We provide model checkpoints (1B, 2B) and inference code to accelerate research. The model supports English only and generates up to 2 minutes of audio.
+
+## Upcoming
+
+- Dia2 TTS Server: Real streaming support
+- Sori: Dia2-powered speech-to-speech engine written in Rust
+
+## Quickstart
+
+> **Requirement** — install [uv](https://docs.astral.sh/uv/) and use CUDA 12.8+
+> drivers. All commands below are run through `uv run …`.
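
If you want to confirm that a GPU is visible before the first run, a quick check from Python is shown below. This is a minimal sketch that assumes PyTorch is available in the synced environment (implied by the bfloat16/CUDA-graph options, though not stated in this README); `torch.cuda.is_available()` is plain PyTorch, not a dia2 API.

```python
import torch

# Mirrors the CLI's device auto-selection: CUDA when available, otherwise CPU.
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; generation will fall back to CPU.")
```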
+
+1. **Install dependencies (one-time):**
+   ```bash
+   uv sync
+   ```
+2. **Prepare a script:** edit `input.txt` using `[S1]` / `[S2]` speaker tags
+   (a sample script is shown just after this list).
+3. **Generate audio:**
+   ```bash
+   uv run -m dia2.cli \
+     --hf nari-labs/Dia2-1B \
+     --input input.txt \
+     --cfg 2.0 --temperature 0.8 \
+     --cuda-graph --verbose \
+     output.wav
+   ```
+   The first run downloads the weights, tokenizer, and Mimi codec. The CLI auto-selects CUDA when available (otherwise CPU) and defaults to bfloat16 precision; override with `--device` / `--dtype` if needed.
+4. **Conditional generation (optional):**
+   ```bash
+   uv run -m dia2.cli \
+     --hf nari-labs/Dia2-1B \
+     --input input.txt \
+     --prefix-speaker-1 prefix_speaker1.wav \
+     --prefix-speaker-2 prefix_speaker2.wav \
+     --cuda-graph --verbose \
+     output_conditioned.wav
+   ```
+   Conditioning on previous conversational context yields natural output for a speech-to-speech system: for example, use your assistant's voice as prefix speaker 1 and the user's audio input as prefix speaker 2, then generate the response to the user's input.
+5. **Gradio UI (for easy usage):**
+   ```bash
+   uv run gradio_app.py
+   ```
+
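
For step 2 above, a minimal `input.txt` could look like the following. The `[S1]` / `[S2]` tagging convention is from this README; the dialogue itself is made up purely for illustration.

```text
[S1] Have you tried the new streaming build yet?
[S2] I have! It starts speaking before the whole script is finished.
[S1] Exactly, and that is what makes realtime conversation possible.
```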
+### Programmatic Usage
+```python
+from dia2 import Dia2, GenerationConfig, SamplingConfig
+
+dia = Dia2.from_repo("nari-labs/Dia2-1B", device="cuda", dtype="bfloat16")
+config = GenerationConfig(
+    cfg_scale=2.0,
+    audio=SamplingConfig(temperature=0.8, top_k=50),
+    use_cuda_graph=True,
+)
+result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
+```
+Generation runs until the runtime config's `max_context_steps` (1500 steps, which at Mimi's ~12.5 Hz frame rate is 1500 / 12.5 = 120 s, i.e. the 2-minute limit) or until EOS is detected. `GenerationResult` includes the audio tokens, the waveform tensor, and word timestamps relative to Mimi's frame rate.
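
Because `dia.generate` takes the same tagged script format that the CLI reads from `input.txt`, a two-speaker call is just a longer string. The sketch below reuses only the API already shown above and assumes multi-speaker tags can be passed inline; the dialogue text is illustrative.

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

dia = Dia2.from_repo("nari-labs/Dia2-1B", device="cuda", dtype="bfloat16")
config = GenerationConfig(
    cfg_scale=2.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)

# Same [S1]/[S2] convention as input.txt in the CLI quickstart (assumed to
# work inline, matching the single-speaker example above).
script = "[S1] Welcome back to the show. [S2] Thanks, happy to be here!"
result = dia.generate(script, config=config, output_wav="dialogue.wav", verbose=True)
```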
+
+## Hugging Face
+
+| Variant | Repo |
+| --- | --- |
+| Dia2-1B | [`nari-labs/Dia2-1B`](https://huggingface.co/nari-labs/Dia2-1B) |
+| Dia2-2B | [`nari-labs/Dia2-2B`](https://huggingface.co/nari-labs/Dia2-2B) |
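
Both checkpoints load through the same API: only the repo id changes. A one-line sketch, assuming nothing beyond the `from_repo` call already shown:

```python
from dia2 import Dia2

# Identical to the 1B example above; just swap the Hugging Face repo id.
dia_2b = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")
```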
 
+## License & Attribution
+
+Licensed under [Apache 2.0](LICENSE). All third-party assets (the Kyutai Mimi codec, etc.) retain their original licenses.
+
+## Disclaimer
+
+This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are **strictly forbidden**:
+
+- **Identity Misuse**: Do not produce audio resembling real individuals without permission.
+- **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news).
+- **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.
+
+By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We **are not responsible** for any misuse and firmly oppose any unethical usage of this technology.
+
+## Acknowledgements
+
+- We thank the [TPU Research Cloud](https://sites.research.google/trc/about/) program for providing compute for training.
+- Our work was heavily inspired by [KyutaiTTS](https://kyutai.org/next/tts) and [Sesame](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).
+
+---
+Questions? Join our [Discord](https://discord.gg/bJq6vjRRKv) or open an issue.