Introduction

About MeloTTS

MeloTTS is a high-quality, open-source text-to-speech system developed by MyShell AI. It is built on top of the VITS/VITS2 architecture and uses BERT-based linguistic features to produce natural-sounding speech. MeloTTS supports multiple languages and is designed to be fast enough for real-time CPU inference.

Strengths of the original MeloTTS:

High naturalness and expressiveness in synthesized speech
Fast inference: runs in real time even on CPU
Lightweight and easy to deploy
Supports multiple languages (English, Chinese, Japanese, Korean, Spanish, French)
Permissive MIT license, suitable for both commercial and non-commercial use

Limitations of the original MeloTTS:

Not natively optimized for Vietnamese phonology (tones, phonemes)
The default English/multilingual phonemizer does not handle Vietnamese tones and diacritics correctly
No built-in support for Vietnamese-specific linguistic preprocessing

MeloTTS Vietnamese

MeloTTS Vietnamese is a version of MeloTTS optimized specifically for Vietnamese. It keeps the speech quality and fast inference of the original model and adds targeted changes for Vietnamese phonology, including its six tones, complex vowel system, and syllable structure.

This model is designed to produce natural, accurate Vietnamese speech and can be easily fine-tuned on custom Vietnamese datasets.

Technical Features

Uses underthesea for Vietnamese text segmentation
Integrates PhoBERT (vinai/phobert-base-v2) to extract Vietnamese linguistic features
Full support for Vietnamese language characteristics:
- 45 phoneme symbols
- 8 tone labels (7 marked tones and 1 unmarked tone), as counted in the model's symbol set
- Both are defined in melo/text/symbols.py
Text-to-phoneme conversion:
- Based on the Text2PhonemeSequence library
- An improved higher-performance version is available at Text2PhonemeFast
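
To make concrete what tone information a Vietnamese front end has to recover from text, here is a minimal sketch using only the Python standard library. It is not the repository's phonemizer and does not use Text2PhonemeSequence; the function names are hypothetical, and the tone names are the conventional orthographic ones, not the tone IDs defined in melo/text/symbols.py.

```python
import unicodedata

# Combining marks that encode the five marked tones in Vietnamese orthography.
# Syllables without any of these marks carry the unmarked (level) tone.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0309": "hoi",    # hook above
    "\u0303": "nga",    # tilde
    "\u0323": "nang",   # dot below
}


def syllable_tone(syllable: str) -> str:
    """Return the conventional tone name of one written syllable."""
    # NFD splits precomposed letters into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # unmarked tone


def strip_tone(syllable: str) -> str:
    """Remove the tone mark but keep vowel-quality marks such as the circumflex."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


for syl in "Nhập văn bản tại đây".split():
    print(syl, syllable_tone(syl), strip_tone(syl))
```

The sketch shows why a generic phonemizer that drops or ignores diacritics loses information: the tone mark and the vowel-quality mark are separate combining characters stacked on the same letter, and both matter for pronunciation.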

Fine-tuning from Base Model

This model was fine-tuned from the base MeloTTS model by:

Replacing phonemes that occur in neither English nor Vietnamese with Vietnamese-specific phonemes
In particular, replacing Korean phonemes with their corresponding Vietnamese equivalents
Adjusting model parameters to match Vietnamese phonetic characteristics
GitHub: MeloTTS Vietnamese (https://github.com/manhcuong02/MeloTTS_Vietnamese)

Training Data

The model was trained on the Infore dataset, which consists of approximately 25 hours of speech.
Note on data quality: this dataset has several limitations, including suboptimal voice quality, missing punctuation, and imprecise phonetic transcriptions. The author reports significantly better results when the model is trained on private, higher-quality data.

Downloading the Model

The pre-trained model can be downloaded from Hugging Face:

MeloTTS Vietnamese on Hugging Face (https://huggingface.co/nmcuong/MeloTTS-Vietnamese)

Usage Guide

Part 1: Inference

1. Clone the Repository and Install Dependencies

git clone https://github.com/manhcuong02/MeloTTS_Vietnamese.git
cd MeloTTS_Vietnamese
pip install -r requirements.txt

2. Download the Pre-trained Model

Download the model checkpoint and config file from Hugging Face and place them in a directory of your choice. Their paths are passed to the TTS constructor in the next step.

3. Run Inference

Refer to the notebook test_infer.ipynb for a full example. Basic usage:

from melo.api import TTS

# Speed is adjustable
speed = 1.0

# You can set device to 'cpu', 'cuda', 'cuda:0', or 'mps'
device = "cuda:0"  # Requires a CUDA-capable GPU; use "cpu" if none is available

# Load the Vietnamese TTS model
model = TTS(
    language="VI",
    device=device,
    config_path="/path/to/config.json",
    ckpt_path="/path/to/G_model.pth",
)
speaker_ids = model.hps.data.spk2id

# Convert text to speech
text = "Nhập văn bản tại đây"  # "Enter text here"
output_path = "output.wav"
# Replace "speaker_name" with one of the keys in speaker_ids
model.tts_to_file(text, speaker_ids["speaker_name"], output_path, speed=speed, quiet=True)
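
Requesting a CUDA device on a machine without a GPU will generally fail, so it can help to choose the device defensively before constructing the model. The following is a minimal sketch of one way to do that; pick_device is a hypothetical helper, not part of the MeloTTS API, and it only relies on the standard PyTorch availability checks.

```python
def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` if its backend is usable, otherwise fall back to "cpu"."""
    try:
        import torch  # PyTorch is assumed to be installed alongside MeloTTS
    except ImportError:
        return "cpu"

    if preferred.startswith("cuda") and torch.cuda.is_available():
        return preferred

    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if preferred == "mps" and mps is not None and mps.is_available():
        return preferred

    return "cpu"


device = pick_device("cuda:0")
print(device)
```

The returned string can be passed as the device argument when loading the model, so the same script runs on both GPU and CPU-only machines.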

Part 2: Training & Fine-tuning

1. Data Preparation

The full data preparation process is detailed in docs/training.md. At minimum, you need:

Audio files (recommended sample rate: 44100 Hz)

A metadata file in the following format:

path/to/audio_001.wav |<speaker_name>|<language_code>|<text_001>
path/to/audio_002.wav |<speaker_name>|<language_code>|<text_002>
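
Malformed metadata lines are easier to catch before preprocessing than during training. The sketch below checks the four-field, pipe-separated layout shown above. The function names are hypothetical, the example paths and speaker name are made up, and the language code "VI" is an assumption taken from the language argument used in the inference example.

```python
EXPECTED_FIELDS = 4  # audio path, speaker name, language code, text


def parse_metadata_line(line: str) -> dict:
    """Split one metadata line into its four fields, or raise ValueError."""
    # maxsplit keeps any extra '|' characters inside the text field.
    parts = line.rstrip("\n").split("|", EXPECTED_FIELDS - 1)
    if len(parts) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} '|'-separated fields, got {len(parts)}"
        )
    path, speaker, language, text = (p.strip() for p in parts)
    if not (path and speaker and language and text):
        raise ValueError("one or more fields are empty")
    return {"path": path, "speaker": speaker, "language": language, "text": text}


def find_bad_lines(lines):
    """Return (line_number, message) pairs for lines that do not parse."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            parse_metadata_line(line)
        except ValueError as exc:
            problems.append((number, str(exc)))
    return problems


sample = [
    "data/audio_001.wav |my_speaker|VI|Xin chào",
    "data/audio_002.wav|my_speaker|VI",  # text field is missing
]
for number, message in find_bad_lines(sample):
    print(f"line {number}: {message}")
```

Fields are stripped of surrounding whitespace, so a space before the first separator, as in the format above, is tolerated.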

2. Data Preprocessing

Run the preprocessing script to prepare training data:

python melo/preprocess_text.py \
  --metadata /path/to/text_training.list \
  --config_path /path/to/config.json \
  --device cuda:0 \
  --val-per-spk 10 \
  --max-val-total 500

Alternatively, use the shell script melo/preprocess_text.sh with appropriate parameters.
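
Going by their names, the --val-per-spk and --max-val-total options control how many utterances are held out for validation per speaker and in total. The sketch below shows one plausible reading of that split; it is an illustration under that assumption, not the script's actual code, and split_train_val is a hypothetical function. Check melo/preprocess_text.py for the real behavior.

```python
import random
from collections import defaultdict


def split_train_val(entries, val_per_spk=10, max_val_total=500, seed=0):
    """Split (speaker, utterance) pairs into train and validation lists.

    Up to `val_per_spk` utterances per speaker go to validation; if the
    validation set then exceeds `max_val_total`, the surplus is moved
    back into the training set.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_speaker = defaultdict(list)
    for speaker, utterance in entries:
        by_speaker[speaker].append(utterance)

    train, val = [], []
    for utterances in by_speaker.values():
        rng.shuffle(utterances)
        val.extend(utterances[:val_per_spk])
        train.extend(utterances[val_per_spk:])

    if len(val) > max_val_total:
        train.extend(val[max_val_total:])
        val = val[:max_val_total]
    return train, val


entries = [("spk_a", f"a_{i}.wav") for i in range(30)]
entries += [("spk_b", f"b_{i}.wav") for i in range(5)]
train, val = split_train_val(entries, val_per_spk=10, max_val_total=12)
print(len(train), len(val))
```

Under this reading, a speaker with fewer utterances than val-per-spk contributes all of them to validation, which is worth keeping in mind for small datasets.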

3. Start Training

Follow the training instructions in docs/training.md.

Code & Fine-tuning

The Vietnamese adaptation, code implementation, and fine-tuning of this model were developed by Nguyễn Mạnh Cường.

GitHub: manhcuong02
Repository: MeloTTS Vietnamese

Audio Examples

Listen to sample outputs from the model:

Sample 1

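"Buổi sáng ở thành phố bắt đầu bằng tiếng xe cộ nhộn nhịp và ánh nắng nhẹ xuyên qua những tòa nhà cao tầng."
(English: "Morning in the city begins with the bustling sound of traffic and gentle sunlight shining through the tall buildings.")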

Sample 2

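"Người đi làm vội vã, học sinh ríu rít trò chuyện, còn quán cà phê góc phố thì thoang thoảng mùi thơm dễ chịu."
(English: "Commuters hurry along, students chatter away, and the corner café gives off a faint, pleasant aroma.")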

Sample 3

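"Cuối cùng, hãy thử thì thầm một câu thật nhẹ nhàng, rồi bất ngờ chuyển sang giọng nói to, rõ và đầy năng lượng."
(English: "Finally, try whispering a sentence very softly, then suddenly switch to a loud, clear, energetic voice.")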

License

This project is licensed under the MIT License, consistent with the original MeloTTS project. It may be used for both commercial and non-commercial purposes.

Acknowledgements

This implementation is based on TTS, VITS, VITS2, and Bert-VITS2. We appreciate their outstanding work.


Model tree for nmcuong/MeloTTS-Vietnamese

Base model: myshell-ai/MeloTTS-English